Abstract
Research in human genetics and genetic epidemiology has grown significantly over the previous decade, particularly in the field of pharmacogenomics. Pharmacogenomics presents an opportunity for rapid translation of associated genetic polymorphisms into diagnostic measures or tests to guide therapy as part of a move towards personalized medicine. Expansion in genotyping technology has cleared the way for widespread use of whole-genome genotyping in the effort to identify novel biology and new genetic markers associated with pharmacokinetic and pharmacodynamic endpoints. With new technology and methodology regularly becoming available for use in genetic studies, a discussion on the application of such tools becomes necessary. In particular, quality control criteria have evolved with the use of GWAS as we have come to understand potential systematic errors which can be introduced into the data during genotyping. There have been several replicated pharmacogenomic associations, some of which have moved to the clinic to enact change in treatment decisions. These examples of translation illustrate the strength of evidence necessary to successfully and effectively translate a genetic discovery. In this review, the design of pharmacogenomic association studies is examined with the goal of optimizing the impact and utility of this research. Issues of ascertainment, genotyping, quality control, analysis and interpretation are considered.
Keywords: Epistasis, genotyping, personalized medicine, pharmacogenomics, quality control, statistics, study design
1. INTRODUCTION
The term pharmacogenomics describes the study of how genetic variation modulates the response of the human body to therapy. The goal in conducting pharmacogenomic research is to find variants which account for the high level of interindividual variability in response to drug treatment, whether that response is an adverse drug reaction (ADR), blood drug level, or general efficacy of the treatment. Prevention of serious adverse reactions is paramount to improving treatment outcomes, as it is estimated that such reactions are responsible for 4–5% of hospital fatalities [1]. Pharmacogenomic research holds the possibility of shaping personalized medicine by allowing treatment decisions to be based upon the private suite of variation present within the patient’s genome. Elucidating the genetic determinants of variation in drug response would allow safe and effective treatment decisions to be made without the necessity of trial and error. The study of pharmacogenomics has been made possible by the large expansion of the field of human genetics over the past several decades. This expansion has been driven by the discovery of novel genetic variation coupled with innovations in genotyping technology. From the discovery of the restriction fragment length polymorphism (RFLP) in 1980 [2] to that of the microsatellite [3] and, more recently, the widespread utilization of single nucleotide polymorphisms (SNPs) [4], genotypic markers have served to bolster our understanding of the segregation of disease loci and the impact of human genetic variation.
A SNP is a single base-pair change in the genetic code. SNPs are distributed in the genome at a density of one every 1000 base-pairs on average [5]. The most recent addition to the genetic epidemiology toolbox, copy number variants (CNVs) have become a focus of intense interest during the last few years [6]. CNVs are genomic regions of greater than 1000 base-pairs which are represented at a different number of copies than are found in the reference genome [7]. The terms marker and variant are used interchangeably and can refer to any indicator of genetic variation. Although the discovery of new markers of genetic variation has been paramount to advancing the field of human genetics, the true catalysts have come in the form of large-scale research projects such as the Human Genome Project [8, 9] and the International HapMap project [10]. The HapMap project, by documenting the patterns of variation across multiple populations, introduced two important concepts essential to establishing the currently-used large-scale genetic association studies, namely that there are long stretches of linkage disequilibrium (LD) across the genome and that these blocks of LD vary between populations [11].
Linkage disequilibrium is a phenomenon whereby there is an association between alleles at multiple genetic loci [12]. Due to the patterns of LD within the genome, it was found that even though millions of variants exist, nearly all common variation could be assayed in a European population by genotyping approximately a half million specifically selected SNPs known as tag SNPs [11]. These tag SNPs became the key to performing genome wide association studies (GWAS), which rely on the common disease common variant (CDCV) hypothesis.
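As a minimal illustration of the definition above, the following sketch computes two standard pairwise measures of LD, D and r², from haplotype and allele frequencies at two biallelic loci. The function name and the example frequencies are hypothetical and are not taken from any study cited here.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, r_squared) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and B at locus 2
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    """
    if not (0 < p_a < 1 and 0 < p_b < 1):
        raise ValueError("allele frequencies must lie strictly between 0 and 1")
    # D is the excess of the observed haplotype frequency over the
    # frequency expected if the two alleles were independent.
    d = p_ab - p_a * p_b
    # r^2 rescales D^2 by the allele-frequency variances, giving a value in [0, 1].
    r_squared = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r_squared


# Hypothetical frequencies, for illustration only.
d_indep, r2_indep = ld_measures(p_ab=0.12, p_a=0.3, p_b=0.4)  # alleles independent
d_tag, r2_tag = ld_measures(p_ab=0.3, p_a=0.3, p_b=0.3)       # alleles always co-occur
```

In the second call the two alleles always travel together, so r² is 1 and genotyping either SNP captures the other, which is the property that a tag SNP exploits.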
The CDCV hypothesis states that most of the heritability of common disease is the result of risk alleles commonly found in the population [13]. Heritability describes the proportion of the variance in a trait or disease that is explained by genetics. In many cases, our current estimates of trait heritability derive from twin studies in which the concordance in phenotype between monozygotic (identical) twins is compared with that between dizygotic (fraternal) twins [14]. If the trait or disease has a significant genetic etiology, we should expect to see higher concordance between monozygotic twins. GWAS rely on LD between SNPs which are typed on genotyping platforms and the true susceptibility loci which modulate the trait of interest. Although it can be argued that GWAS has experienced great success in finding novel biology related to the underpinnings of many diseases [15, 16, 17], it has not been the panacea that was promised in some reviews written prior to widespread use of the GWAS design [18, 19]. Most genetic variants which have been associated with disease outcomes have remarkably small effects in terms of increasing the risk of disease or explaining the heritability of the trait [20]. As a result of these small effect sizes, the majority of associated variants discovered by GWAS have diminishing value for use in disease prediction. There are, however, a few examples of common diseases for which large genetic effects have been found, both from linkage studies and genetic association studies. Linkage studies use family pedigree information to look for co-segregation of a genetic locus with disease status [14]. The link between Alzheimer’s disease and the APOE gene is a good example of a strong effect found from linkage analysis [21, 22] while the effect of polymorphism in the CFH gene on risk for Age-related macular degeneration (AMD) was found simultaneously through genetic linkage and genetic association [23, 24, 25].
This paper examines the design of pharmacogenomic association studies with the goal of optimizing the impact and utility of this research. In particular, the issues of ascertainment, genotyping, quality control, analysis and interpretation are considered.
2. GENETIC ASSOCIATION STUDIES
Discoveries such as those in the case of APOE, for which the ε4 allele is associated with a 4-fold increase in risk for late-onset Alzheimer’s disease [26], and CFH, where individuals homozygous for the Y402H polymorphism have a 7.4-fold increase in risk for AMD [25], seem to be the exception rather than the rule where variants conferring risk for common disease are concerned. There are several hypotheses as to why variation discovered up to this point has explained such a low proportion of the heritability for common diseases and ideas concerning where the remaining heritability could lie. One such explanation lies in the common disease, rare variant (CDRV) hypothesis [27], which states that the genetic burden of common diseases is most likely the result of multiple rare variations in common genes or pathways. This is a hypothesis which cannot currently be rigorously tested due to the dearth of knowledge about rare variation. Although GWAS data will not answer questions regarding the CDRV hypothesis due to their focus on common variation, whole-exome and whole-genome sequencing technologies are currently experiencing expanded use [28] and should allow the veracity of the CDRV hypothesis to be assessed. Another explanation for the lack of heritability accounted for so far with genetic association study results is that such association studies have failed to account for the complex genetic architecture which is likely to underlie common diseases [20, 29]. Most genetic association studies - particularly GWAS - have focused their analysis on the discovery of monogenic risk. By restricting analysis to single genetic variants at a time, these association studies have discarded the possibility of interactions between risk loci – also known as epistasis – which could be the key to at least part of the risk for disease.
The term epistasis refers to the phenomenon of gene-gene interactions and has three general definitions which have been proposed by Phillips in his review of epistasis [30]. The first definition, referred to as functional epistasis, describes the situation in which two proteins or molecular components interact within the cell. This type of epistasis is detected by molecular techniques such as yeast two-hybrid assays and chromatin immunoprecipitation and should be influential in complex drug metabolism networks. However, because functional epistasis implies considerable knowledge about the relationship of molecular components, it is often not possible to draw conclusions about its presence without downstream functional analysis in a model organism system such as mouse, yeast or cell culture.
The first description of epistasis in the literature is in the form of an allele at one locus masking the effect of an allele at a second locus. Bateson observed this phenomenon as a departure from Mendelian inheritance ratios in mouse coat color in 1909 [31] and it has now become known as compositional epistasis. While compositional epistasis can be clearly defined in model organisms with a common genetic background, in human populations with diverse genetic profiles it is difficult to discern an event of compositional epistasis. The definition of epistasis most relevant to the study of genetic epidemiology and pharmacogenomics is that of statistical epistasis, in which the risk predisposed by having two genetic variations simultaneously is not additive on a log scale from the risk of having the two risk variants separately. Due to the extensive complementation of genes and the complexity of gene networks involved in drug metabolism and clearance, it is entirely reasonable to assume that in many cases more than one genetic variation will be required to observe a phenotype such as the breakdown of a drug’s efficacy or an adverse drug reaction (ADR). This hypothesis is supported by the lack of phenotype seen in many knock-out mice [32, 33], loss-of-function yeast strains [34, 35, 36] and null mutants of Arabidopsis [37]. The importance of gene-gene interactions in nature is implicated by the results of extensive two-hybrid experiments conducted in yeast [38, 39].
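The definition of statistical epistasis given above can be sketched numerically. The example below assumes two binary risk variants and uses the odds of the outcome as the risk measure, so the quantity computed corresponds to the interaction coefficient of a logistic regression with a product term. The function name and the odds values are hypothetical, chosen only for illustration.

```python
import math


def interaction_log_odds(odds_00, odds_10, odds_01, odds_11):
    """Departure from additivity on the log-odds scale for two binary variants.

    odds_xy is the odds of the outcome for carriers (1) or non-carriers (0)
    of variant 1 (x) and variant 2 (y). A result of 0 means the joint effect
    equals the sum of the two separate log-odds effects, i.e. no statistical
    epistasis under this model.
    """
    effect_1 = math.log(odds_10 / odds_00)  # variant 1 alone
    effect_2 = math.log(odds_01 / odds_00)  # variant 2 alone
    joint = math.log(odds_11 / odds_00)     # both variants together
    return joint - (effect_1 + effect_2)


# Hypothetical odds, for illustration only.
# Odds ratios of 2 and 3 separately and 6 jointly: additive on the log scale.
no_epistasis = interaction_log_odds(0.1, 0.2, 0.3, 0.6)
# No effect of either variant alone, but a 6-fold increase when both are present.
epistasis = interaction_log_odds(0.1, 0.1, 0.1, 0.6)
```

The second scenario is the kind of signal that a single-variant analysis would miss entirely, since neither variant shows a marginal effect in non-carriers of the other.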
Despite the seemingly pervasive nature of epistasis, it has been difficult to demonstrate the presence of influential gene-gene interactions in humans. This could be due, in part, to the lack of emphasis placed on searching for epistatic effects during the analysis phase of genetic association studies. This lack of interest in epistasis is prevalent despite theoretical work which shows that interactions could explain large proportions of trait heritability [40].
Other postulates for the unclaimed heritability in common disease include such phenomena as genetic heterogeneity, gene-environment interactions, and inflated heritability estimates [29]. Genetic heterogeneity refers to the presence of multiple separate genetic factors underlying the etiology of a trait. For example, two patients might experience adverse effects from the same drug for different reasons: in one, a variant in the metabolizing enzyme renders it unable to properly process the drug, leading to upstream buildup of a toxic intermediate; in the other, a mutation in a membrane-bound transporter prevents that same toxic intermediate from being transported to the metabolizing enzyme. Alternatively, there could be multiple mutations within the same gene which cause a similar phenotype. Even cystic fibrosis, which has a Mendelian inheritance pattern, has been found to result from over 1700 different mutations in the CFTR gene in different patients [41]. The complexity of human genetic architecture and our distinct genetic backgrounds make it highly probable that genetic heterogeneity will be a significant basis of the etiology of heritable human diseases and traits. It might also be reasonable to blame phenotypic heterogeneity and the misspecification of disease status for some of the inability to discover the underlying genetic causes of common disease [29].
3. GENETIC ASSOCIATIONS AND PHARMACOGENOMICS
While many genetic association studies of common disease have failed to uncover variation with large effects, pharmacogenomic association studies have comparatively realized a great deal of success. Many influential variants have been discovered through association studies, particularly those in cytochrome P450 (CYP) genes. The CYP genes encode a family of drug metabolism enzymes which are located largely in the liver and account for approximately 90% of drug oxidation in human metabolism pathways [42]. Multiple strong associations have been found between variation in CYP genes and pharmacogenomic outcomes. Of particular note are links found between SNPs in CYP3A5, CYP2B6 and CYP2C19 and response to the drugs tacrolimus [43], efavirenz [44], and clopidogrel [45, 46], respectively. Possibly the most influential associations exist between SNPs in the CYP2C9 and VKORC1 genes and warfarin pharmacokinetics. Warfarin is a commonly prescribed anticoagulant which has high interindividual variability in effective dose [47] and a narrow therapeutic window. Individuals whose international normalized ratio (INR), a standardized measure of clotting time, rises above the target range have an increased risk of major bleeding events. Conversely, warfarin treatment which leaves the INR below the target range will not be effective in treating the thromboembolism and systemic embolism conditions for which warfarin is prescribed [48]. The CYP2C9*2 and CYP2C9*3 alleles, which result in amino acid changes in the enzyme, are associated with a reduced warfarin metabolism rate [49], and this association explains as much as 10% [50] of the variance in warfarin dose response. The VKORC1 gene encodes vitamin K epoxide reductase complex 1, the function of which is to activate vitamin K [51] and in turn modulate proteins involved in blood clotting. Variation in VKORC1 is responsible for up to 25% of the variation in the dose response of warfarin [50].
There is potential for comparatively fast translation from bench to bedside with pharmacogenomic research, which is also illustrated by HLA-B*5701 testing for hypersensitivity reaction (HSR) resulting from abacavir treatment in HIV-infected patients [52]. HLA-B*5701 testing is an ideal diagnostic due to its 100% negative predictive value (NPV) [52]. Although such an ideal pharmacogenomic test is not likely to be seen again in the near future, we can learn from the translation of the HLA-B*5701 discovery so as to accelerate utilization of future discoveries. In particular, points about HLA-B*5701 testing which drove its acceptance include the easily observed improvement in treatment outcomes which accompanied the prevention of the sometimes fatal HSR and the easily interpretable monogenic (i.e. the result of a single genetic variation) nature of the test. It is important to note here that this does not restrict future genetic tests to be monogenic; the key point is interpretability of results which can be translated to improve patient outcomes. Even a complex genetic test could achieve these goals if the correct architecture were put in place to relay the results to a clinician in such a way as to allow for an easy adjustment of their decision for medical treatment. Therefore, it is likely that bioinformatics infrastructure will become essential to maximize the utility and accuracy with which pharmacogenomic results are utilized. The question now should be posed: how do we optimize a pharmacogenomic study in order to most effectively answer the scientific question and maximize the potential downstream effect? The answer will, of course, depend on the question asked by the study; however, some considerations are common to all studies. The first issue to come under consideration must be the design of the study.
While a carefully designed pharmacogenomic study has the potential to yield novel knowledge about the treatment or phenotype in question, a poorly designed study can waste time and money and generate spurious results which will incorrectly direct future research. This is not to say that a well-designed study cannot fail, but bearing in mind aspects of study design and analysis minimizes the chances of such a failure.
4. STUDY DESIGN
4.1. Phenotype Definition
The first aspect of study design to solidify is that of phenotype definition. In general, there are two possible types of traits to study in pharmacogenomics: pharmacokinetics and pharmacodynamics. Pharmacokinetics is the term used to describe drug processing including absorption, distribution, metabolism and excretion (ADME) [53]. An example of a pharmacokinetic trait studied for genetic associations is the concentration-dose ratio [54, 55, 56], the plasma concentration of a drug normalized to the dosage given and often corrected by the weight of the study participant. Other pharmacokinetic parameters [57] include drug clearance and excretion rates. Pharmacokinetic outcomes provide the capability to assay the function and performance of the ADME process. While pharmacokinetics attempts to describe the actions the body performs on a drug, pharmacodynamics refers to the manner in which the drug acts on the body [58]. In its narrowest sense, pharmacodynamics looks at the ligand-receptor activity of the drug. Because ligand-receptor interaction dynamics are difficult to study on a large-scale basis, it is significantly more reasonable to study side effects and efficacy.
Beyond classification as pharmacodynamic or pharmacokinetic, the outcome is defined and studied according to the form of its measurement. The four types of outcome measurement typically used in association studies are binary, continuously distributed, ordinal Poisson-distributed, and time-to-event.
Binary case-control dichotomization is the most commonly used phenotype in association studies. Common binary phenotypes include ADR and events such as myocardial infarction (MI) which the drug under study was prescribed to prevent. While binary traits are convenient for association studies, they often have reduced statistical power when compared to other outcomes [59], where power is defined as the probability of rejecting the null hypothesis of no association between the variant and the outcome when this hypothesis truly is false (i.e. when there exists a true association). A binary definition of ADR, for example, ignores differences which might exist in the severity of the reaction between individuals which could itself be influenced by genetic variation. As a result, a continuously distributed trait is preferable to a dichotomization when available [60, 61]. Change in blood lipid levels in the study of statin use, for example, would be preferable to simply identifying individuals with hyperlipidemia. Utilizing a continuous trait will also tend to reduce phenotypic heterogeneity and misclassification [62]. The least severe cases, as defined by adverse drug reactions, may not differ meaningfully from borderline controls who do not reach the necessary threshold of severity, and some controls could later become cases. These potential binary misclassification issues serve to muddy the waters and reduce statistical power to find genetic variants associated with the outcome. A continuously distributed trait which is often available from clinical trials and other studies in patient populations is dose-normalized drug concentration. This is particularly true for drugs with a narrow therapeutic index such as warfarin and the immunosuppressant tacrolimus, for which reaching the target drug level is imperative to the success of drug treatment.
As opposed to binary- or continuously-defined outcomes, event counts and rates, which follow a Poisson distribution, are not often used outside of traditional epidemiological studies. An example of the use of such an outcome is the 2010 study by Crum-Cianflone et al. of factors modulating hospitalization rate in HIV-1 infected patients [63]. In this study, the researchers examined the rate of hospitalization in a cohort of HIV-1 patients over a fixed period of time and how rates varied as a function of variables such as CD4 T cell count and presence of concurrent infections. Outcomes defined by a rate attempt to examine an ordinal measure such as a count normalized over a period of time. When available, a rate also has an advantage over binary classification as the phenotypic measure in pharmacogenomic association studies. The rate of ADR occurrences or even hospitalizations due to ADR in a participant would be more descriptive than an indicator of the presence or absence of a reaction. While the use of rate-based outcomes in pharmacogenomic association studies is still rare, survival/time-to-event measures are commonly assessed. This is particularly true when examining the potential for long-term benefit from guiding drug treatment by genetic testing [64, 65]. Example study endpoints include survival time of individuals with colorectal carcinoma [66] and time to drug resistance in HIV-infected individuals on ART [67]. The dimension of time lends valuable information which would be lost if the binary clinical phenotype were taken at the end of follow-up and also alleviates issues of loss to follow-up through the use of censoring [68].
4.2. Treatment-Dependent and Treatment-Differential Study Designs
Pharmacogenomic association studies can be conducted in two ways with respect to treatment regimen: treatment-dependent and treatment-differential. The treatment-dependent study looks only at individuals on the drug of interest and examines variation among these individuals with respect to the outcome under investigation. In contrast, a treatment-differential study assesses two groups of individuals differing in their treatment and is useful for exploring gene-treatment interactions. Some outcomes, such as ADR, require the use of the drug or treatment being researched and therefore must be pursued with a treatment-dependent study. Other endpoints can be pursued in either a treatment-dependent or treatment-differential manner. Associations between HIV antiretroviral drug efficacy and genetics, for example, could be tested either by looking across drug treatment regimens, as is done in many AIDS Clinical Trials Group (ACTG) trials through assessment of mechanisms by which genetics and regimen interact to modulate a decrease in HIV viral load, or by scrutinizing patients on a single regimen and correlating patterns of genetic variation with efficacy. Treatment-dependent and treatment-differential study strategies to answer an example research question are provided in Figure 1.
Figure 1.
Example of treatment-dependent and treatment-differential study designs answering a pharmacogenomic question related to antiepileptic drug efficacy.
4.3. Study Designs
Three study designs form the basis on which most pharmacogenomic association studies are conducted: the case-control study, the cohort study and the randomized clinical trial (RCT) (Table 1).
Table 1.
Major study designs utilized in genetic association studies
| Study Design | Case-Control | Cohort | Randomized Clinical Trial |
|---|---|---|---|
| Advantages | Good for rare outcomes | Reduced recall and selection bias | Reduced recall and selection bias |
|  | More cost effective | Availability of longitudinal measures | Availability of longitudinal measures |
|  | Can be performed within a cohort or RCT using nested case-control or nested case-cohort | Possible availability of repeated measures | Possible availability of repeated measures |
|  |  | Flexibility to pursue multiple outcome types | Flexibility to pursue multiple outcome types |
|  |  | Direct estimate of risk from exposure | Direct estimate of risk from exposure |
|  |  |  | Randomization reduces confounding |
| Disadvantages | Recall bias: treatment and adherence data may not be accurate; accuracy can differ by case status | Difficult to study rare outcome | Difficult to study rare outcome |
|  | Selection bias: survival time bias; cases and controls not from same population | Usually costly due to prospective nature | Costly due to prospective nature |
|  | Controls could develop outcome in the future | Suffers from loss to follow-up | Suffers from loss to follow-up |
|  | Estimating odds of exposure given outcome |  |  |
| Possible Study Outcomes | Binary | Binary | Binary |
|  |  | Continuous | Continuous |
|  |  | Time-to-Event | Time-to-Event |
|  |  | Rate/Count | Rate/Count |
The most prevalent study design used in genetic association studies is the case-control design. A case-control study involves the ascertainment of participants by their outcome status followed, typically, by a retrospective look at exposures of interest [69]. The term exposure, in this case, can refer to environmental factors such as smoking and diet or to genetic factors. In a pharmacogenomic case-control study, the frequency of a genotype or allele is compared between cases and controls in the interest of elucidating genetic markers predisposing a change in the odds of experiencing the outcome of interest [70]. The risk estimate derived from a case-control study is the odds ratio (OR). The OR describes the odds of an exposure in cases with respect to that in controls. In most situations, cases are defined by the presence of an ADR or other drug-dependent event and the exposure refers to genotype. For treatment-differential studies the exposure is often a treatment-genotype interaction. This would be true for a study of MI-prevention through the use of statins in comparison with aspirin, where the presence of a MI event would define cases and the exposure to test for association would be the interaction between statin use and genetic variation. Although the OR is not equivalent to the risk ratio (RR), an estimate of disease risk in individuals with the exposure as compared to those without, it is a reasonable approximation if the outcome is rare.
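The closing point about the OR approximating the RR for rare outcomes can be checked with a short calculation. The sketch below assumes the outcome risks in exposed and unexposed individuals are known, and it computes the OR from outcome odds, which is numerically identical to the exposure OR estimated in a case-control study. The function name and the risk values are hypothetical.

```python
def odds_ratio_and_risk_ratio(risk_exposed, risk_unexposed):
    """Return (OR, RR) given the outcome risk in exposed and unexposed groups."""
    odds_exposed = risk_exposed / (1 - risk_exposed)
    odds_unexposed = risk_unexposed / (1 - risk_unexposed)
    odds_ratio = odds_exposed / odds_unexposed
    risk_ratio = risk_exposed / risk_unexposed
    return odds_ratio, risk_ratio


# Hypothetical risks, for illustration only. Both scenarios have RR = 2.
rare_or, rare_rr = odds_ratio_and_risk_ratio(0.02, 0.01)      # rare outcome
common_or, common_rr = odds_ratio_and_risk_ratio(0.40, 0.20)  # common outcome
```

With these values the OR is about 2.02 for the rare outcome but about 2.67 for the common one, so the OR overstates the RR more and more as the outcome becomes common.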
Benefits of the case-control study design include the potential to study rare diseases for which other designs would not be able to collect sufficient sample sizes in addition to cost-efficiency, as a case-control design alleviates the expense of follow-up [69]. There are also several potential weaknesses to the case-control study design [69]. The most concerning issue is the potential for selection bias attributed to the ascertainment process of case-control studies. Selection bias can result from differential survival if the cases are not all newly incident, meaning that the cases ascertained represent a subset of those who have survived to the time at which enrollment became possible or a subset still capable of enrolling. Selection bias could also arise from a failure to select cases and controls representative of the same population. In addition to selection bias, case-control studies can suffer from information bias as a consequence of differential ascertainment not of the participants themselves, but of study variables. Information on exposures could be collected asymmetrically by researchers, possibly due to knowledge about case-control status. Sometimes information bias is completely unintentional, as with recall bias. Recall bias results from cases being more or less likely than controls to correctly recall details of drug treatment or environmental exposures.
While the case-control study ascertains participants based on outcome, a cohort study enrolls participants predicated upon exposure status and follows participants to evaluate study endpoint(s) [71]. Participants would be recruited based on their drug treatment in a pharmacogenomic cohort study either as a single treatment cohort or as part of multiple treatment cohorts in the interest of exploring gene-treatment interactions. Traditionally, cohort studies are conducted prospectively, with information on exposures collected prior to appearance or measurement of study endpoints although it is also possible to perform the study retrospectively [72]. The key defining feature of the cohort study is that ascertainment proceeds on the basis of the exposure - the drug treatment - instead of the endpoint. Cohort studies have several advantages over case-control studies. First of all, because individuals are recruited directly for their exposure, the relative risk associated with an exposure can be directly estimated [72]. Although the enrollment of study participants does not typically rely upon genetic information, cohort studies are still capable of deriving an accurate estimate of the relative risk of genetic variants. In addition, the longitudinal nature of cohort studies provides the ability to research survival time and rates. The issues of recall bias in prospective cohort studies are diminished as are those of selection bias, although information bias can exist with respect to treatment status [73]. For example, researchers might collect more complete information on individuals of a certain treatment group because of implicit assumptions regarding differential values of the study endpoint within that group. 
While many biases are reduced with the cohort design, the challenge of loss to follow-up is introduced, placing emphasis on the importance of DNA collection early in the study to avoid bias in the form of a non-representative participant subset with DNA available for a pharmacogenomic association study [73]. Two particular areas in which cohort studies have a disadvantage in comparison with case-control studies are cost and the capability to study rare outcomes [71]. Following individuals over the long periods of time - sometimes years - necessary to properly assess the study question is costly; however, cohort studies are the best equipped to answer such questions regarding longitudinal outcomes provided that the outcome is sufficiently common. If the outcome were not common and only occurred in 1% of the population, 1000 study participants would be required on average to observe 10 who experience that outcome, making a well-powered study infeasible. While prospective cohort studies are preferred for studying events with a dimension of time, continuously distributed traits benefit from the use of a retrospective cohort study. For a continuous trait, a cohort of individuals is enrolled based on their prior drug exposure status and then information on the trait is collected. Through the use of medical records, it should also be possible to study the change in a trait over time from the initiation of drug treatment through the time of ascertainment.
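The rare-outcome arithmetic in the paragraph above generalizes directly. The sketch below assumes each participant experiences the outcome independently with the same probability, which is a simplification, and the helper names are hypothetical.

```python
import math


def expected_cases(n_participants, outcome_probability):
    """Expected number of participants who experience the outcome."""
    return n_participants * outcome_probability


def participants_needed(target_cases, outcome_probability):
    """Smallest cohort whose expected number of cases reaches target_cases."""
    return math.ceil(target_cases / outcome_probability)


# The example from the text: a 1% outcome in a cohort of 1000.
cases_in_1000 = expected_cases(1000, 0.01)
# A hypothetical target of 200 cases at the same 1% outcome frequency.
cohort_for_200_cases = participants_needed(200, 0.01)
```

Under these assumptions, reaching even a few hundred cases of a 1% outcome requires enrolling tens of thousands of participants, which is why a case-control design is usually preferred for rare outcomes.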
The gold standard of study designs in treatment-related research is usually considered to be the randomized clinical trial (RCT). An RCT randomizes the individual enrolled to receive one of multiple treatment arms [74]. This randomization process is useful in reducing selection bias and confounding [75]. Confounding is a situation whereby a factor is associated with both the outcome and the exposure but is not in the causal pathway [76]. Failure to account for confounding can lead to both false positive and false negative associations. The issue of confounding is exemplified by an epidemiology study looking at an association between alcohol consumption and lung cancer [77]. Without controlling for cigarette smoking, it appears that an association exists between alcohol consumption and lung cancer. After adjusting for cigarette use, however, it becomes clear that this association is due to confounding.
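The alcohol and lung cancer example can be reproduced with a stratified analysis. The sketch below uses invented counts, not data from the cited study [77], and applies the Mantel-Haenszel pooled odds ratio as one standard way of adjusting for a categorical confounder; the cited study may have used a different method.

```python
def odds_ratio(a, b, c, d):
    """OR for a 2x2 table: a, b = exposed cases, exposed non-cases;
    c, d = unexposed cases, unexposed non-cases."""
    return (a * d) / (b * c)


def mantel_haenszel_or(strata):
    """Mantel-Haenszel pooled OR across strata of (a, b, c, d) tables."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator


# Hypothetical counts, for illustration only. "Exposed" means alcohol drinker.
# Most smokers drink and smokers have higher lung cancer risk, but within
# each smoking stratum drinkers and non-drinkers have identical risk.
smokers = (80, 320, 20, 80)
non_smokers = (2, 98, 8, 392)

# Ignoring smoking: collapse the two strata into one table.
crude_table = tuple(x + y for x, y in zip(smokers, non_smokers))
crude_or = odds_ratio(*crude_table)

# Adjusting for smoking: pool the stratum-specific tables.
adjusted_or = mantel_haenszel_or([smokers, non_smokers])
```

With these counts the crude OR is above 3 while the smoking-adjusted OR is 1, so the apparent association is produced entirely by the confounder.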
The most prevalent confounder in genetic association studies is genetic ancestry; this type of confounding is referred to as population stratification [78]. Population stratification is likely to be a significant issue in pharmacogenomic association studies, for which drug response can have strong ethnic disparities, and will be considered in depth during the discussion of quality control (QC). Randomizing individuals to treatment status reduces the likelihood that there will be an excess of individuals from a certain genetic ancestry within one treatment group and so the potential for confounding from this factor is reduced. Under the same rationale, the RCT design alleviates confounding from other factors as well.
Another advantage of the RCT is that it is typically conducted double-blinded, meaning that neither the researcher collecting data nor the study participant is aware of the treatment the participant has been randomized to receive [75]. The benefit of blinding is that it prevents information bias from data collection disparities across treatment groups. The assumption behind the use of blinding is that a participant’s knowledge of the study drug they receive might change their response or adherence and complicate comparability between treatment groups. Double-blinding is used to prevent the same effect on the part of the clinician, assuming that un-blinded knowledge might result in more careful monitoring of participants on a particular treatment regimen. The double-blinding procedure reduces the heterogeneity and bias that these factors could contribute to a study. There are, however, inherent issues with the RCT design. In an RCT, the treatment designation for an individual typically refers to the intent-to-treat assignment - the treatment that the participant was randomized to - which might not be the treatment that the participant received for the majority of the study, due to treatment-related complications [73]. The RCT design also shares with the cohort design the issue of loss to follow up and is ineffective for the study of rare outcomes. It is also more difficult to study long-term outcomes with an RCT than with a cohort study due to the large costs of maintaining an RCT [79, 80].
An additional concern when performing a genetic association study using RCT data is selection bias related to collection of DNA samples. Providing DNA within an RCT is encouraged but not required and only a subset of participants is likely to give consent for DNA. As a result, it is possible for bias to enter the study at this point due to the systematic exclusion of participant groups and it might not be reasonable to assume that randomization still holds. Comparing the demographics of the subset providing DNA samples to those of all study participants can determine the severity of the bias. Another challenge concerns the sample size of RCTs. Single-center trials often fall below the sample size requirement for a well-powered genetic association study and thus the inclusion of either multiple centers or multiple studies is necessary to achieve sufficient sample size. Combining multiple centers and/or studies has the potential to introduce heterogeneity with respect to both phenotype and genetic ancestry [73]. A benefit of RCTs with respect to DNA is that centralized repository sample storage is sometimes conducted for future use. The ACTG, for example, has gathered DNA in a sample repository for many years in the interest of implementing pharmacogenomic association studies [81].
Within data sets gathered from case-control, cohort and RCT study designs, specialized analysis schema are available to answer targeted scientific questions. For example, a binary outcome can be analyzed in RCT or cohort study data through the use of either a nested case-control or nested case-cohort design. The nested case-control study [73] defines cases as individuals within the larger cohort or RCT study who developed the outcome of interest, and controls as a random sample of the participants who did not develop the outcome. During the analysis, age is used to adjust for time-dependent differences between the cases and controls.
A variation of the nested case-control study, the nested case-cohort study matches one or more controls to each case on the basis of age or other time-related variables [73]. Another option, the case-only study, can be performed within case-control or nested case-control data sets and is useful for exploring gene-treatment interactions. In the case-only study, a measure of association between genotype and treatment group is estimated among cases alone [82, 73, 83]. The resulting risk estimate is a measure of the interaction between the genotype and the drug treatment. An assumption of the case-only study is that the genotype and treatment group are independent. Because this assumption of independence is stringent, it is ideal to perform a case-only study within cases taken from an RCT, where treatment is assigned at random.
5. GENOTYPING
5.1. Genome-Wide Association Studies
Recent innovations in genotyping technology have continually expanded the toolbox of the pharmacogenomic association study. Four genotyping schemes are typically used: candidate gene genotyping, specialized chip platform genotyping (candidate chip), genome-wide genotyping, and whole exome/whole genome sequencing. Each method has advantages, disadvantages and appropriate applications (Table 2).
Table 2.
Advantages and disadvantages of major genotyping approaches
| | GWAS | Candidate Chip | Candidate Gene | Sequencing |
|---|---|---|---|---|
| Number of markers | Half million to millions | Thousands to hundred thousand+ | Tens to hundreds | Millions |
| Use | Genome-wide exploration of common genetic variation; hypothesis generation | Focused analysis of common or rare variation in many biologically plausible genes; hypothesis generation or testing | Focused analysis of common or rare variation in one or a handful of genes; hypothesis testing | Genome-wide exploration of common and rare variation; hypothesis generation |
| Advantages | Cost-effective genotyping; unbiased coverage of common variation across genome; many markers allow for PCA-based substructure correction and more extensive QC | Somewhat alleviated multiple testing issues; biological plausibility of significant results; when more than 100,000 markers typed, PCA-based substructure correction possible | Alleviated multiple testing issues; focused biological hypothesis; potential to target functional and rare variants; low overall cost | Detection of all types of variation within sequenced regions, including ability to assess rare and functional variation; unbiased coverage of common variation across genome; many markers allow for PCA-based substructure correction and more extensive QC |
| Disadvantages | Not equipped to assess rare variation; millions of markers cause severe multiple-testing issues | Less cost-effective than GWAS; when fewer than 100,000 markers typed, difficult to control for population substructure; relies on accurate biological knowledge | Higher per-genotype cost; difficult to control for population substructure; limited QC possible | Currently cost-prohibitive for many samples; must sequence to high coverage to sequence all areas and offset error rate; millions of markers cause severe multiple-testing issues |
Genome-wide genotyping arrays employed in genome-wide association studies (GWAS) have become a standard for exploring the genetic etiology of complex human disease [84]. These arrays are designed to genotype 500,000 or 1 million tag SNPs, through which it is possible, owing to LD, to capture nearly all of the common variation in the genomes of European populations [11]. It should be noted that LD patterns vary between ethnicities [12] and therefore a tag SNP derived from a European population should not be expected to tag the same variation in an African population. The LD in African-derived populations tends to extend across much smaller genomic regions [85], and therefore more SNPs must be typed to achieve genome-wide coverage comparable to that for a European population. By some measures, GWAS has experienced great success in discovering genetic variation associated with risk for disease; approximately 1290 polymorphisms for nearly 250 outcomes have been found to be genome-wide significant (p < 5 × 10⁻⁸) [86].
As stated previously, however, nearly all of these associated variants explain only modest amounts of heritability. Small effect sizes could reflect, in part, phenotypic and genetic heterogeneity. For many diseases which have been probed repeatedly with GWAS, significant variation exists in the phenotype definition, and this variation is often ignored when categorizing individuals by disease status [87]. For multiple sclerosis (MS), for example, there are 4 different subtypes of the disease across which the neurodegenerative progression varies [88]. Due to sample size issues these 4 subtypes are nearly always combined [89], despite the possibility of a different genetic etiology underlying each subtype. Genetic heterogeneity has been implicated in many other complex phenotypes which have been studied with GWAS [90]. Both genetic and phenotypic heterogeneity in a study population will tend to reduce statistical power and push risk estimates towards the null hypothesis of no association. Pharmacogenomic association studies have an advantage where phenotypic heterogeneity is concerned: the imposition of clinical standards for diagnosis of drug reactions and the measurement of pharmacokinetics will diminish heterogeneity in the outcome. Many pharmacogenomic studies have focused only on genomic regions with a priori ties to drug metabolism, such as the cytochrome P450 (CYP) family of genes [91]. Given the historically targeted nature of these studies, GWAS has the potential to uncover novel variation associated with the etiology of variable drug response and to add to the predictive ability that currently established associations already lend. Such novel biological discoveries could make it possible to design safer, more effective drugs [92] or to guide the prescription of drugs with known safety or efficacy concerns.
Although GWAS is useful for assaying common genetic variation, it is not yet equipped to handle rare variation. That may change soon: genotyping arrays such as those of the Illumina Omni series promise to enable genotyping of up to 5 million markers, and high-throughput sequencing platforms continue to evolve (discussed further in Section 5.4).
There are considerations to be made prior to performing a GWAS. In 2007, the NCI-NHGRI Working Group on Replication in Association Studies laid out requirements for replication prior to reporting GWAS results [93]. Replication will be discussed in detail in the post-analysis section of this review. One of the most significant challenges associated with GWAS is that of multiple testing. Generally in a GWAS, an uncorrected p-value below 5 × 10⁻⁸ is required to consider a result to be genome-wide significant. This stringent p-value cutoff requires a large sample size - often in the thousands - to detect a modest genetic effect. The issue of multiple testing and p-value corrections is covered in more depth later in the review.
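The genome-wide threshold is commonly motivated as a Bonferroni correction of a 0.05 family-wise error rate over roughly one million independent tests. The sketch below illustrates that arithmetic; the one-million figure and the function name are assumptions made for illustration and are not taken from the text above.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value threshold that holds the family-wise error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# Assumption: about one million effectively independent tests genome-wide.
genome_wide = bonferroni_threshold(0.05, 1_000_000)

# For contrast, a hypothetical candidate gene study testing 50 markers.
candidate_gene = bonferroni_threshold(0.05, 50)

print(f"GWAS threshold:           {genome_wide:.1e}")
print(f"Candidate gene threshold: {candidate_gene:.1e}")
```

The contrast between the two thresholds is one way to see why approaches that type fewer markers are described as alleviating the multiple testing burden.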
5.2. Candidate Chip Genotyping
As an alternative to a genome-wide genotyping platform, specialized genotyping assays and arrays containing polymorphisms specific to a given trait are becoming popular. Examples of these specialized genotyping chips include the Affymetrix DMET Plus Premier Pack [94] and the Illumina HumanCVD BeadChip [95]. The Affymetrix DMET microarray chip is designed to genotype approximately 2000 markers in over 200 absorption, distribution, metabolism and elimination (ADME) genes which are not well represented on genome-wide genotyping arrays. The population frequency of the polymorphisms on the chip ranges widely, and there is a significant focus on mutations which are rare in the population. In our experience with the DMET chip, of the 1936 markers typed on about 500 individuals, approximately 33% showed no variation in our study (unpublished results). It seems that studies using the DMET chip would be better served by a larger study population, which increases the probability of observing variation in low-frequency polymorphisms.
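The relationship between sample size and the chance of observing a rare variant can be sketched with a simple calculation. It assumes that the 2N chromosomes in a sample are drawn independently from a population with a given minor allele frequency; the function names and the example frequency of 0.1% are illustrative assumptions, not values from the DMET study described above.

```python
import math


def prob_monomorphic(maf: float, n_individuals: int) -> float:
    """Probability that no copy of the minor allele appears among 2N chromosomes."""
    return (1.0 - maf) ** (2 * n_individuals)


def min_sample_size(maf: float, prob_observed: float) -> int:
    """Smallest N for which the minor allele is seen at least once with the given probability."""
    return math.ceil(math.log(1.0 - prob_observed) / (2 * math.log(1.0 - maf)))


# Hypothetical variant with a minor allele frequency of 0.1%.
print(round(prob_monomorphic(0.001, 500), 3))   # chance of seeing no variation in 500 people
print(min_sample_size(0.001, 0.95))             # N needed for a 95% chance of seeing it
```

Under these assumptions, a variant at 0.1% frequency has a sizeable chance of appearing monomorphic in 500 individuals, which is consistent with the suggestion that larger samples suit chips enriched for rare variants.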
Another option is the HumanCVD chip, which is designed by Illumina to genotype 50,000 markers in 2100 genes implicated in functions and disorders tied to cardiovascular disease, such as blood pressure, insulin resistance, metabolic disorder, lipid disorder, and inflammation. The markers on the HumanCVD chip are categorized into three groups of genes, ranging from those with a high probability of playing a functional role in disease risk to those with only marginal ties to disease pathology.
In addition to the widely available designer chips already created, many genotyping companies allow for the specification of custom-made chips. The Illumina ImmunoChip is an example of a genotyping array which was tailor-made for looking at candidate variation in particular, potentially disease-linked genes.¹ The ImmunoChip was created as a custom chip in response to interest in variation in genes linked to auto-immune disorders such as lupus, multiple sclerosis and rheumatoid arthritis; the chip provides the capability to assess variation at approximately 200,000 markers finely distributed across the genes and genomic regions concerned.¹ Illumina provides a service, iSelect, which allows for the customized creation of chips in one of two size ranges: 3,000 to 68,000 variants or 68,001 to 200,000 variants. The Illumina VeraCode technology, in contrast, is designed to examine fewer SNPs and is utilized in the ADME Core Panel. The ADME Core Panel assays 184 ADME variants in 34 important drug metabolism genes.
There are multiple advantages to the application of these custom assays and arrays. First, employing prior biological knowledge to focus genotyping efforts yields a higher probability of discovering associated polymorphisms [96], especially when compared with genome-wide genotyping platforms which might not adequately assess the variation in regions of particular interest [97]. The biological knowledge which is used to decide the variants added to the platforms is a primary concern, though. In terms of the currently available arrays, extensive research was done to select genes to include and to ensure that these genes are sufficiently covered by variants on the chip [98, 95]. Another advantage which is specific to custom arrays is the ability for the user to add variants of their choosing, including those variants which have been discovered subsequent to the release of standardized genotyping platforms. There are several methods which are used in the selection of genes and polymorphisms to add to a genotyping array [99, 100, 101, 102]. The major disadvantages of the customized genotyping chips are increased ‘per-genotype’ costs and the initial work which is necessary to select markers for a new array. In addition, genotyping efficiency is likely to decline in custom-made chips designed by researchers when compared with arrays and assays designed by genotyping companies which contain SNPs validated for genotyping accuracy and efficiency.
5.3. Candidate Gene Genotyping
The benefit of studying pharmacogenomics is the abundance of prior knowledge regarding pathways of drug metabolism, transport and elimination. Most currently-known pharmacogenomic associations are the result of candidate genotyping founded upon knowledge of these pathways and employed through the use of a candidate gene study. The candidate gene study uses focused genotyping to assess variation at only a handful of genes [103]. In place of a genotyping array, a low-throughput technology such as Sequenom [104], TaqMan [105], or Illumina GoldenGate [106] is used. Although the era of GWAS has seen a decrease in the use of candidate gene association studies, the approach has had great success discovering variants with substantial effect in pharmacogenomics and complex disease [44, 23, 107, 108, 109]. The same general principles of the candidate chip apply here to the candidate gene study. The correct biological knowledge is essential to appropriately focus genotyping. The user can directly assay markers with proven functional impact at the transcript, protein or trait level or those which cause non-synonymous coding changes instead of relying on tag SNPs to detect indirect associations. One of the most significant advantages of a candidate gene study is the alleviation of the issue of multiple testing.
5.4. Sequencing
Genotyping technology has progressed in waves over the last several years as the cost per genotype has steadily fallen. Now, we stand on the cusp of affordable whole-genome sequencing technology. Many factors are driving the progress in genome sequencing, the first being the 1000 Genomes project [28], an initiative to uncover 95% of genetic variants in the “accessible genome” with a population frequency of 1% or greater [5]. The accessible genome is the portion of the genome for which sequencing reads can be unambiguously mapped back to the NCBI reference genome and constitutes approximately 85% of the entire genomic sequence in the current build [5]. The NCBI reference genome is originally the work of the Human Genome Project [110] and more recently of the Genome Reference Consortium, which has been filling gaps in the sequence and correcting inaccurate sequence. In order to realize the goal of the 1000 Genomes project, three sequencing pilot phases have already been conducted and the main phase is currently under way.
The three pilot phases consisted of low-coverage (2–6×) whole-genome sequencing of 179 individuals, deep whole-genome sequencing (42× average coverage) of 2 mother-father-adult child trios, and sequencing of 906 genes at approximately 50× coverage in 697 individuals. The full 1000 Genomes project data will consist of deep sequencing of coding regions, low-coverage whole-genome sequencing, and genome-wide genotyping of 2500 samples [5]. When sequencing the genome, 1× coverage corresponds to the generation of approximately 3 billion base-pairs of sequence and means that, statistically, each base in the haploid human genome was sequenced one time on average. Significantly higher coverage is necessary to sequence the diploid genome of a single individual without appreciable gaps and to identify genetic variants reliably [111]. The premise of the 1000 Genomes project, however, is that by sequencing many individuals from the same population at relatively low coverage, novel variation can be detected by examining multiple samples together [5].
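The definition of coverage given above, and the reason that 1× coverage leaves gaps, can be sketched numerically. The gap estimate assumes an idealized model in which reads land uniformly at random, so that per-base depth is approximately Poisson distributed; real sequencing data deviate from this, and the constant and function names below are illustrative.

```python
import math

HAPLOID_GENOME_BP = 3_000_000_000  # approximate haploid genome size used in the text


def mean_coverage(total_bases_sequenced: float,
                  genome_size: int = HAPLOID_GENOME_BP) -> float:
    """Average number of times each base is sequenced."""
    return total_bases_sequenced / genome_size


def expected_fraction_uncovered(coverage: float) -> float:
    """Expected fraction of bases with zero reads under the Poisson assumption."""
    return math.exp(-coverage)


for depth in (1, 4, 42):
    missing = expected_fraction_uncovered(depth)
    print(f"{depth:>2}x coverage: about {missing:.2e} of bases expected to be unread")
```

Under this simplified model more than a third of bases would be missed at 1× in a single sample, which illustrates why low-coverage designs depend on pooling information across many individuals.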
In addition to the proposed goal of uncovering novel variation, the 1000 Genomes project has placed an emphasis on describing variation in diverse populations. In all, 27 populations representing 5 major geographical regions are represented in the 2500 samples collected for the 4 phases of the 1000 Genomes project [5]. As a result, the 1000 Genomes project should significantly advance our knowledge about population-specific human genetic variation.
Beyond the 1000 Genomes project, an incentive to improve genome sequencing technology exists in the form of a $10 million prize being offered by Archon X Genomics for the development of technology which can sequence 100 individuals in 10 days at a cost of no more than $10,000 per individual [112]. The eventual goal is to bring the cost of whole-genome sequencing down to $1000 per individual [113]. Although this goal may not be too far off, current prices for whole-genome sequencing are around $50,000 per genome [114]. For most researchers, this cost precludes sequencing a large sample of individuals. As such, large-scale whole-genome sequencing in genetic association studies has not yet become a reality.
A cheaper and more focused alternative to whole-genome sequencing is whole-exome sequencing [115, 116, 117]. The exome is the set of all exons in the human genome, that is, the regions of the genome which are expressed. Exon-capture technology can be used to selectively sequence all protein-coding regions in the genome for a cost of about $5000 [114]. While this cost is much greater than that of genome-wide genotyping, it is significantly less than that of whole-genome sequencing and still enables rare variant discovery. One of the most challenging aspects of sequencing is data management. The space required to store the data from a single sequencing run can be 1 terabyte (1024 gigabytes) or more [118]. In addition, after the sequencing data are gathered, bioinformatics infrastructure is necessary to extract genetic variants from the raw sequence. Because these variants are commingled with sequencing errors, this is a non-trivial task [119]. Allele frequency thresholds are required to distinguish a genetic variant from sequencing errors. Assessing structural variation such as CNVs adds a further level of complexity, as the “break points” of a CNV - the genomic base-pair positions where the repeat begins and ends - can vary between individuals [120].
If the challenges and expense associated with sequencing technologies can be overcome, these technologies offer the possibility of capturing rare and novel genetic variants which cannot be discerned with genome-wide genotyping arrays and other platforms designed around current knowledge of mostly common variation. As the price of sequencing falls and data management procedures improve, sequencing will become an increasingly viable option to capture nearly all coding information in the human genome. The challenge, then, will be applying analytical methodology to determine how this genetic information maps onto complex pharmacodynamic and pharmacokinetic traits.
6. QUALITY CONTROL
Prior to the analysis stage of a pharmacogenomic association study, it is important to ensure that the data collected for analysis meet a set of quality control (QC) standards. The genetic data, in particular, require several checks of quality. Weale [121] describes, in depth, the potential QC measures for large-scale genetic data. Further information about conducting genotype data QC is given by Laurie et al. [122] on behalf of the GENEVA consortium and by Turner et al. [123] on behalf of the eMERGE network. Almost all of the QC discussed below can be conducted within PLINK [124], developed at the Broad Institute, or PLATO [125], developed at Vanderbilt University (http://chgr.mc.vanderbilt.edu/ritchielab/PLATO).
The rationale for conducting checks of quality surrounds the attempt to remove systematic error from the data which could otherwise bias the results of the analysis. Genotype QC standards and procedures can be broken down into those performed on samples and those performed on genetic markers (Table 3).
Table 3.
Recommended QC steps prior to association analysis
| | QC Process | Software |
|---|---|---|
| Marker QC | Genotyping Completion Rate Filter | PLINK, PLATO, genABEL |
| | Hardy-Weinberg Equilibrium Check | |
| | Minor Allele Frequency Filter | |
| Sample QC | Genotyping Completion Rate Filter | |
| | Heterozygosity Check/Filter | |
| | Gender Error Check | |
| | Relatedness Correction | |
| | Plate Effect Check | |
| | Population Stratification Adjustment | Genomic Control, EIGENSTRAT |
| | Population Outlier Removal | EIGENSTRAT, STRUCTURE |
If a stratified analysis will be conducted, QC should be conducted separately within each stratum if the stratifying variable is likely to be related to genetic variation, as is the case with race/ethnicity. Three general approaches exist for ordering sample and marker QC. Performing marker QC first retains the maximum number of participants in the study, while performing sample QC first retains the maximum number of variants. As a third option, a recently developed program, genABEL [126], allows QC on markers and samples to be done iteratively to balance the losses from each. The choice of genotyping approach should be a consideration when deciding which route of QC to pursue. If a candidate gene study has been conducted, it would be prudent to perform QC on samples prior to that on markers so as to minimize the discard of functionally relevant genetic variants. Performing QC on markers first is reasonable for GWAS, though, as it is hypothesized that the vast majority of variants genotyped do not participate in the genetic etiology of the outcome. It is also possible with GWAS that multiple markers will be in LD with the functional variant they are tagging, and thus the discard of one will not preclude the identification of a genomic locus associated with the outcome. One of the primary considerations for GWAS QC concerns the large sample size required for a GWAS to be sufficiently well-powered.
There are many QC checks and exclusion standards which could be applied to samples and markers [121]; the ones ultimately used must depend on the data available. The first step is to check for gender errors and examine relatedness, regardless of whether the focus is on maintaining samples or markers. If sex chromosome data are available, gender errors can be identified by checking the heterozygosity of the X chromosome against the gender classification for each individual. For potential errors, it is desirable to determine whether the cause is misclassification or an aneuploidy - an abnormality in chromosomal number - such as Klinefelter syndrome, through examination of raw genotype calls and X and Y chromosome marker intensities. If a gender error cannot be resolved, for example by consulting medical records, as a misclassification due to an error in sample handling or data entry, it is advisable to exclude the individual from analysis.
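The X chromosome heterozygosity check described above can be sketched as follows. Individuals with a single X chromosome should show essentially no heterozygous calls at X-linked markers outside the pseudoautosomal regions, whereas individuals with two X chromosomes should show many. The genotype coding, the function names and both cutoffs are hypothetical choices for illustration; they are not the defaults of PLINK or any other package.

```python
from typing import Optional, Sequence


def x_heterozygosity(x_genotypes: Sequence[Optional[int]]) -> float:
    """Fraction of called X-linked genotypes that are heterozygous.

    Genotypes are coded as 0, 1 or 2 copies of an allele; None marks a missing call.
    """
    called = [g for g in x_genotypes if g is not None]
    if not called:
        raise ValueError("no called X chromosome genotypes")
    return sum(g == 1 for g in called) / len(called)


def flag_sex_mismatch(recorded_sex: str,
                      x_genotypes: Sequence[Optional[int]],
                      male_max: float = 0.02,       # hypothetical cutoff
                      female_min: float = 0.10) -> bool:  # hypothetical cutoff
    """Return True when genotype-inferred sex disagrees with the recorded sex."""
    het = x_heterozygosity(x_genotypes)
    if het <= male_max:
        inferred = "M"
    elif het >= female_min:
        inferred = "F"
    else:
        inferred = "ambiguous"  # always flagged, since it matches neither record
    return inferred != recorded_sex


# A sample recorded as male but heterozygous at half of its X markers is flagged.
print(flag_sex_mismatch("M", [0, 1, 1, 2, 0, 1, 0, 1, 2, 1]))
```

Flagged samples would then be examined further, as described above, to distinguish misclassification from aneuploidy before any exclusion is made.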
Relatedness between individuals in the study is addressed next because interrelatedness between study participants can result in spurious associations when using statistical tests which assume independent observations. If participant relationships are known, as in pedigree data, software such as genABEL [126] or MQLS [127] can be used to adjust by utilizing a kinship matrix describing relatedness between all individuals. It is also possible to derive a kinship matrix using genABEL if relationships are unknown [128], as could be the case in association studies of unrelated individuals where cryptic relatedness - relatedness between individuals which is not known to the researchers [129] - could be an issue.
An alternative to correcting for relatedness is to remove one individual at random from each pair for which there is a high estimated identity by descent (IBD), which can be done with PLINK [124]. Identity by descent is a term which is used to describe the presence of the same allele at a genomic region as a result of inheritance from a common ancestor. Note that estimating the inbreeding coefficient and IBD or kinship coefficient can only be performed on samples for which there is sufficient genotype data and that these techniques are designed to be used with whole genome data.
To exclude low-quality genotype data, the next step is the removal of samples and markers with a low genotyping completion rate or, put another way, a high rate of missing genotype data. Each sample and marker is passed through a filter to remove those which fall below a certain genotyping completion threshold; a threshold which is sometimes used is 95%, meaning that samples and markers for which more than 5% of the genotype data are missing are removed. For GWAS, this threshold is often more stringently placed at 98–99%. Typically, markers are filtered first, followed by samples, which minimizes the exclusion of precious study samples. The following step is the exclusion of samples with levels of heterozygosity well outside those expected due to chance variation (determined by an inbreeding coefficient above or below certain thresholds). Low heterozygosity can be the result of inbreeding driving down the level of individual genetic variation, and high heterozygosity might indicate contamination of the sample by foreign DNA, which would lead to excessively high levels of individual genetic variation [121]. For genetic marker data, it is recommended to remove variants with a low minor allele frequency (MAF), as there is significantly reduced statistical power to detect associations for these markers. The MAF threshold for removing variants will depend upon the sample size for the study; Weale suggests a cutoff of 10/N, where N is the sample size. Filtering SNPs by Hardy-Weinberg equilibrium (HWE) is sometimes performed as well. SNPs for which the test of HWE yields a p-value below a certain value are removed from analysis.
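The marker-level filters described above can be sketched for a single SNP. The example uses a completion-rate threshold of 95%, the 10/N minor allele frequency cutoff attributed to Weale, and a one-degree-of-freedom chi-square test of HWE. The HWE p-value cutoff of 1e-6, the genotype coding and the function names are assumptions for illustration; in practice these steps would normally be run in PLINK, PLATO or genABEL.

```python
import math
from typing import List, Optional, Sequence

Genotype = Optional[int]  # 0, 1 or 2 copies of the counted allele; None = missing


def _called(genos: Sequence[Genotype]) -> List[int]:
    return [g for g in genos if g is not None]


def call_rate(genos: Sequence[Genotype]) -> float:
    """Genotyping completion rate: fraction of samples with a non-missing call."""
    return len(_called(genos)) / len(genos)


def minor_allele_frequency(genos: Sequence[Genotype]) -> float:
    called = _called(genos)
    freq = sum(called) / (2 * len(called))
    return min(freq, 1.0 - freq)


def hwe_p_value(genos: Sequence[Genotype]) -> float:
    """Chi-square (1 df) test of Hardy-Weinberg proportions."""
    called = _called(genos)
    n = len(called)
    p = sum(called) / (2 * n)
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # a monomorphic marker carries no information about HWE
    observed = [called.count(0), called.count(1), called.count(2)]
    expected = [n * q * q, 2 * n * p * q, n * p * p]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square distribution with 1 df.
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_marker_qc(genos: Sequence[Genotype],
                     min_call_rate: float = 0.95,
                     hwe_alpha: float = 1e-6) -> bool:  # hwe_alpha is an assumed cutoff
    if call_rate(genos) < min_call_rate:
        return False
    maf_cutoff = 10 / len(genos)  # the 10/N rule, with N the sample size
    if minor_allele_frequency(genos) < maf_cutoff:
        return False
    return hwe_p_value(genos) >= hwe_alpha


# 100 samples in exact Hardy-Weinberg proportions for an allele frequency of 0.3.
marker = [0] * 49 + [1] * 42 + [2] * 9
print(passes_marker_qc(marker))
```

Here the HWE test excludes the marker outright; a study preferring to flag rather than remove such markers could return the p-value alongside the pass/fail decision.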
Notably, there is debate as to whether a check of HWE should result in marker exclusion or whether it would be more appropriate to flag markers with a strong deviation from HWE so that they can be followed up if they are found to be associated with the outcome [130]. The argument for removing SNPs in disequilibrium is that the deviation is likely to be the result of genotyping error and might cause spurious associations. On the other hand, one of the assumptions of HWE is that there is no selection acting on the population [131]. If a SNP is under positive or negative selection it will violate HWE but should not be removed, as that selection could indicate a role of the variant in disease etiology. The final recommended QC check is the assessment of “plate effects” across genotyping chips [132], which can be done by comparing the MAF [133] or genotyping completion rate for each plate to that of all others, using the genetic data which have been filtered by MAF and sample genotyping completion rate. Plate effects are a particular type of batch effect; batch effects are discussed in depth by Leek et al. [134]. Identifying plate effects is important, as they can indicate incorrect handling of DNA samples or errors during the genotyping process. The issue of batch effects recently garnered general attention in the scientific community when associations discovered in a study of human longevity [135] were demonstrated to likely be the result of batch effects and QC problems [136, 137].
After performing general QC exclusions, it is imperative to examine the data for population substructure. Population substructure refers to the phenomenon whereby multiple distinct subpopulations are present within the overall sample population [138]. A frequent population substructure issue in genetic association studies is population stratification, in which the distribution of subpopulations differs systematically by outcome [139]. This should be distinguished from the concept of admixture [140], which refers to the ancestral mixing of two populations, as is present in populations such as Mexican Americans and African Americans. Population substructure is important to consider prior to analysis because it can confound the analysis and cause spurious associations under specific conditions. Substructure is confounding when the prevalence of the outcome - or its distribution, in the case of a quantitative trait - differs between the subpopulations in the dataset and these subpopulations also differ in allele frequency at markers unlinked to the trait of interest [61]. Due to frequent differences in drug response between ethnicities [141, 142, 143], discerning the presence of population stratification is particularly relevant for pharmacogenomic association studies.
An extreme example of population stratification would be looking at risk for an ADR in a mixed sample of European Americans and African Americans for which 90% of the cases with the ADR are African American and 90% of the controls are European American. In this case, even though the true prevalence of the ADR might be equal between the ethnic groups, an artificial difference is introduced as a result of the poor study design. Any marker which differed significantly in allele frequency between European and African Americans would be associated with the presence of an ADR, even though it might be that none of these markers are truly associated with risk of the outcome. Fortunately, many tools have been developed to mitigate the effect of population stratification [144, 145, 78, 146, 147, 148].
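The 90%/90% example above can be worked through numerically. The sketch assumes a marker with no effect on the ADR in either group; the allele frequencies of 0.60 and 0.20 are hypothetical values chosen only to make the arithmetic visible.

```python
def pooled_allele_frequency(prop_group_a: float,
                            freq_group_a: float,
                            freq_group_b: float) -> float:
    """Allele frequency in a sample mixing two ancestry groups."""
    return prop_group_a * freq_group_a + (1.0 - prop_group_a) * freq_group_b


def allelic_odds_ratio(freq_cases: float, freq_controls: float) -> float:
    odds_cases = freq_cases / (1.0 - freq_cases)
    odds_controls = freq_controls / (1.0 - freq_controls)
    return odds_cases / odds_controls


# Hypothetical allele frequencies for a marker with no effect on the ADR.
FREQ_AFRICAN_AMERICAN = 0.60
FREQ_EUROPEAN_AMERICAN = 0.20

# 90% of cases and 10% of controls are African American, as in the example above.
freq_cases = pooled_allele_frequency(0.90, FREQ_AFRICAN_AMERICAN, FREQ_EUROPEAN_AMERICAN)
freq_controls = pooled_allele_frequency(0.10, FREQ_AFRICAN_AMERICAN, FREQ_EUROPEAN_AMERICAN)

print(f"allele frequency in cases:    {freq_cases:.2f}")
print(f"allele frequency in controls: {freq_controls:.2f}")
print(f"apparent allelic odds ratio:  {allelic_odds_ratio(freq_cases, freq_controls):.2f}")
```

Within either ancestry group the odds ratio is 1 by construction, so the apparent association in the pooled sample arises entirely from the imbalance in ancestry between cases and controls.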
Historically, epidemiology studies have relied on stratification to eliminate confounding by race and other factors. Although this alleviates stratification to a degree, there are multiple disadvantages to race-stratification, the most significant of which is the drop in statistical power which results from sub-dividing the data set prior to analysis. In addition, self-described race might not be as accurate or precise as desired [149], particularly with individuals of an admixed population where the degree of mixed genetic ancestry can vary between individuals. Even in a strictly European population, where ancestry is often assumed to be mostly homogeneous, there can be genetic differences between individuals of southern and northern or eastern and western Europe [150]. Although race-stratification has serious disadvantages, it is often the only choice for the correction of population stratification in small-scale genetic association studies if ancestry-informative markers (AIMs) are not also typed. Alternative to stratifying by race, a study can adjust for the presence of the type of systematic differences between individuals in the study that are present with population stratification. The Genomic Control method [145] was proposed as a manner of performing this adjustment. Genomic Control attempts to identify an inflation factor, λ, using markers which are unlinked to the trait of interest. This inflation factor, which describes the inflation of genetic association test statistics as a result of systematic bias, is then used to correct the test statistics of all genetic association tests.
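The Genomic Control adjustment described above can be illustrated with a minimal sketch. It assumes 1-degree-of-freedom chi-square test statistics and estimates λ as the median statistic among unlinked markers divided by the median of the null chi-square distribution (approximately 0.455). The function names, the example statistics, and the choice to floor λ at 1 are illustrative assumptions rather than details taken from [145].

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def genomic_inflation(chi2_stats):
    """Estimate lambda as the observed median statistic over the null median."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN

def genomic_control(chi2_stats, null_marker_stats):
    """Divide each test statistic by lambda estimated from unlinked markers."""
    # Assumption: lambda is floored at 1 so statistics are never inflated.
    lam = max(1.0, genomic_inflation(null_marker_stats))
    return [s / lam for s in chi2_stats], lam

# Hypothetical statistics from markers assumed to be unlinked to the trait.
null_stats = [0.2, 0.5, 0.91, 1.3, 2.7]
adjusted, lam = genomic_control([10.0, 3.84], null_stats)
print(f"lambda = {lam:.2f}, adjusted statistics = {adjusted}")
```

In this toy example λ is close to 2, so every association statistic is roughly halved before its p-value is computed.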
A more recently introduced correction for population stratification involves principal components analysis, which can be conducted through use of the EIGENSTRAT software [146]. EIGENSTRAT uses principal components vectors describing the overall population level genetic variation between individuals to provide an ancestry-corrected association statistic. In order to appropriately use EIGENSTRAT without correcting out the effect of truly associated genetic markers, it is recommended that EIGENSTRAT be run with AIMs or with a set of at least 100,000 genetic variants. Whether correcting for substructure by stratifying or through the use of software such as EIGENSTRAT, it is recommended that population outliers - individuals who do not cleanly group with any of the identified populations in the principal components analysis - are removed [121]. Population outliers can be visualized either by plotting principal components vectors [121] or by using the STRUCTURE program [151].
Principal components analysis is used to describe systematic variation within data that is attributable to a particular factor. When applied to genetic data in an attempt to account for race, one or more principal component vectors are generated where each vector places each study participant along a line for which the ends of the line describe the study participants differing most systematically by genetics related to race. If a principal component vector successfully describes systematic genetic differences attributed to race, plotting that vector should separate individuals out into two distinct clusters corresponding to a genetic split between ethnicities. Typically the two principal components explaining the most systematic variation are plotted (Figure 2a). When individuals cannot be cleanly clustered through the use of vectors, these outlier individuals should be removed (Figure 2b). Visualization of population outliers can be enhanced by the inclusion of data from relevant HapMap Phase 3 populations [121], which can clarify the axes of variation in a plot of principal components vectors by labeling individuals within the plot according to race. Including HapMap samples can also be used to map study participants more robustly to an ancestral population.
Figure 2.
The plots of the top two principal component vectors resulting from principal components analysis before and after removal of outliers and with true race coded by color. A) Prior to exclusion of outliers, there are several individuals who do not cluster cleanly. B) After outlier removal, the presence of two general clusters can be visualized. Open Diamond = African American; Open Circle = European American; Boxed Plus Sign = Asian; Boxed X = Hispanic; Plus Sign = Other; Filled Triangle = Unknown.
Performing appropriate QC on genotype data prior to association analysis is essential to providing accurate and robust results in a genetic association study. Failure to remove systematic error can result in a dramatic increase in false positive rate. Noise in the data can be an issue, in particular, due to the small effect sizes seen in many genetic association studies to date. It only takes a modest amount of noise to overpower the true associations in the data if they are of small effect size.
The use of quantile-quantile (Q-Q) plots can help determine whether systematic error is present in the data [121]. Q-Q plots are generated by performing a test of association for each marker in the data and then graphing the resulting distribution of p-values - often using a negative log-transformation - against a set of the same number of p-values generated from a uniform distribution under the null hypothesis. Example Q-Q plots before and after quality control are shown in Figure 3a and 3b, respectively. If there is a systematic error in the data, it will often appear as a strong deviation from the diagonal line which would be expected under the null hypothesis of no association. After QC, it can be seen that no true association exists and systematic error was driving inflation in test statistics and p-values smaller than expected. Such plots act as a useful diagnostic and illustrate the need for thorough QC.
Figure 3.
Q-Q plots showing the negative log-transformed p-values for all SNPs in an association study. A) Before quality control, many association results have low p-values. B) Subsequent to quality control, it can be seen that the presence of a departure from expected p-values was likely the result of systematic bias from low quality data. The line demonstrates what would be expected if the null hypothesis of no association holds.
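The construction of a Q-Q plot can be sketched as follows. This is a minimal illustration which, instead of drawing random uniform values, uses the expected uniform quantiles (i - 0.5)/n as the null reference; that substitution, the function name and the example p-values are assumptions made for illustration.

```python
import math

def qq_points(p_values):
    """Return (expected, observed) -log10 p-value pairs for a Q-Q plot."""
    n = len(p_values)
    observed = sorted(-math.log10(p) for p in p_values)
    # Expected quantiles of a uniform distribution under the null hypothesis.
    expected = sorted(-math.log10((i - 0.5) / n) for i in range(1, n + 1))
    return list(zip(expected, observed))

# Hypothetical p-values from five single-marker tests.
points = qq_points([0.5, 0.01, 0.2, 0.9, 0.04])
for expected, observed in points:
    print(f"expected {expected:.2f}  observed {observed:.2f}")
```

Points lying well above the line observed = expected across most of the range suggest systematic inflation rather than a small number of true associations.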
7. ANALYSIS
Analysis of pharmacogenomic data has the potential to proceed to many ends due to the complex nature and extensive amount of data available in most pharmacogenomic association studies. It is essential to carefully define an analysis plan designed to answer the question of interest in order to obviate statistical invalidity and minimize false positives. If enough analytical tests are performed on the data, statistically significant results will be found. Because performing a fishing expedition in a data set casts doubt on the results, the object of the analysis plan should be to focus statistical testing towards answering a defined scientific question.
Several questions should be considered when defining the analysis plan which will be used to answer the study question. Are only monogenic associations expected to be significant predictors of the outcome or could there exist significant gene-gene, gene-environment, or gene-treatment interactions? Should the initial set of genetic variants be filtered to a more interesting subset prior to performing statistical tests of association? Would prior biological knowledge be useful to include? Does a replication dataset exist which could be used to validate significant associations detected? What factors could be considered as confounders? If race might be a confounder, how should the analysis be adjusted?
The answers to these questions should guide the analysis. As the study design varies according to the phenotype and the QC process varies according to the form of available genotype data, the analysis plan should be tailored around both of these factors. In this section, statistical tests for exploring monogenic associations, advanced methodology for identifying complex genetic effects, and methods for filtering the marker set will be discussed. Figure 4 demonstrates a simplified schematic for determining the statistical methodology which might be useful to employ during analysis.
Figure 4.
A flow chart of potential analysis designs to pursue based on the form of the study question.
7.1. Single SNP Analysis Methods
Analysis in pharmacogenomic association studies can utilize any of a large array of statistical methods [152] such as Chi-square test [153, 154], Armitage trend test [155, 156], Kaplan-Meier survival curves [157, 158], Bayesian statistics [159], or data mining methods [160] but is commonly performed in the framework of regression [161, 162, 163, 164]. This paper will focus primarily on the use of regression techniques for single-SNP analysis. The outcome under study dictates the type of regression to perform. Most forms of regression work off of the backbone of generalized linear models [165] but use different link functions. A link function merely describes the relationship between the dependent variable (i.e. the outcome) and the independent variable(s) (i.e. genetic markers or covariates). The link functions of the four commonly used types of regression - linear, logistic, Poisson and proportional hazards - are shown in Figure 5. For simplicity, examples given in this section of our paper refer to a regression equation testing the effect of a single genetic variant on the outcome with exception of the discussion of regression parameters and assumptions provided for linear regression.
Figure 5.
Regression equations used in genetic epidemiology. A) The identity link function for linear regression predicts the mean trait value, Y. B) The logit link for logistic regression predicts the odds, [p/(1-p)], of the outcome. C) Poisson regression uses the log function to predict the rate, λ, of events. D) Cox proportional hazards regression also uses the log function but predicts the hazard, h(t), at time t. Each regression function assumes an intercept B0 - the value of the mean, odds, rate or hazard when all independent variables are coded to 0 - and fits a coefficient Bn to describe the effect of each independent variable xn.
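Written out from the description in the caption, the four models take approximately the following form. This is a reconstruction from the text rather than a reproduction of the figure, and it assumes that the intercept of the proportional hazards model, written here as B_0(t), stands for the log of the baseline hazard.

```latex
\[
\begin{aligned}
\text{A) Linear:}\quad & E[Y] = B_0 + B_1 x_1 + \dots + B_n x_n \\
\text{B) Logistic:}\quad & \ln\!\left(\frac{p}{1-p}\right) = B_0 + B_1 x_1 + \dots + B_n x_n \\
\text{C) Poisson:}\quad & \ln(\lambda) = B_0 + B_1 x_1 + \dots + B_n x_n \\
\text{D) Proportional hazards:}\quad & \ln\bigl(h(t)\bigr) = B_0(t) + B_1 x_1 + \dots + B_n x_n
\end{aligned}
\]
```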
When the outcome is continuously distributed, linear regression should be employed. Linear regression utilizes what is known as the identity link function - a linear relationship with no transformation of the outcome - to assess the effect of the variant on the mean outcome/trait value. The parameter provided by linear regression is an estimate of the mean difference in the continuous trait across groups differing by one genotypic level/unit [166]. A level with respect to the genotype will depend on the coding used in the regression equation. Different genotype coding schemes as applied to the framework of a regression equation are shown in Figure 6.
Figure 6.
Coding conventions for genetic markers with two alleles when performing regression.
Typically, an additive encoding is used to statistically test each SNP for association. Coding the three genotypes of a genetic variant with two alleles in an additive manner involves defining one genotype as the reference genotype and then establishing genotypes of intermediate and strong effect. The reference genotype is usually set to be the major allele homozygote and then the intermediate and strong effect genotypes are the heterozygote and minor allele homozygote, respectively. An additive coding assumes added risk of each additional variant allele and is utilized primarily because it yields the highest statistical power when the underlying genetic model is unknown. Although the additive encoding is most common, dominant, recessive or custom genotype encodings might be advantageous based on available knowledge of the genetic etiology. Each independent variable in the regression equation has a parameter which describes the size of its effect on the study outcome. For the estimates of the parameters associated with each independent variable to be valid in linear regression, there are distributional assumptions which must be met. These assumptions include homogeneity of variance of the residuals across groups defined by genotype or other variables, normality of the distribution of residuals, and independence of observations [166].
The term residual refers to the difference for each individual between the individual’s true outcome value and the value which is predicted by the linear regression equation. Homogeneity of variance and normality of the distribution of residuals requires that these residual errors follow a normal distribution overall and that the variation or spread of the residuals is roughly equivalent across each independent variable group (e.g., across each of the three genotype groups). Independence of observations describes the requirement that the outcome value for each study participant is uncorrelated with that of every other participant. The assumptions of any statistical method must be fulfilled for that method to provide accurate estimates of the effect size and associated p-value for the parameter of each independent variable. To protect from deviations in distributional and variance assumptions, some recommend the use of robust standard error estimates [167].
For a binary pharmacogenomic outcome, logistic regression should be used. The link function for the logistic regression equation is the logit function. The logit function transforms the outcome to be the natural log of the odds of the outcome, where the odds are found by dividing the probability of experiencing the outcome by one minus that probability. The parameter derived from a logistic model describes the effect of the variant as the change in log odds between groups of participants differing by one in the coding of the genotype [168]. When the parameter estimate is exponentiated, it becomes the ratio of the odds of the outcome between genotype groups (i.e. odds ratio, OR). For genetic markers in logistic regression, an additive encoding implies that the OR comparing minor allele homozygotes with heterozygotes is equivalent to the OR comparing heterozygotes with major allele homozygotes. Although logistic regression generally makes fewer distributional assumptions than other statistical methods, it does presume independent observations [166]. This assumption usually holds true unless there is cryptic or familial relatedness in the study; assuming independence of related participants during analysis often results in an increase in false positives.
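The relationship between the logit, the regression parameter and the odds ratio can be made concrete with a short sketch. The outcome probabilities below are hypothetical, and the helper names are chosen for illustration only.

```python
import math

def logit(p):
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))

def odds_ratio(beta, allele_difference=1):
    """Odds ratio implied by a log-odds coefficient under additive coding."""
    return math.exp(beta * allele_difference)

# Hypothetical outcome probabilities for 0 and 1 copies of the minor allele.
p_ref, p_het = 0.10, 0.20
beta = logit(p_het) - logit(p_ref)     # change in log odds per allele
or_per_allele = odds_ratio(beta)       # heterozygote vs. major homozygote
or_two_alleles = odds_ratio(beta, 2)   # minor homozygote vs. major homozygote
print(f"per-allele OR = {or_per_allele:.2f}, two-allele OR = {or_two_alleles:.2f}")
```

Under the additive assumption the odds ratio for carrying two minor alleles is the square of the per-allele odds ratio, which is the constraint the additive encoding imposes.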
With prospective cohort studies and randomized clinical trials, it is possible to obtain rates of events and time-to-event or survival data. Associations with a rate measure can be examined through Poisson regression. Poisson regression uses a log link which constitutes a natural log transformation of the rate. The parameter reported by Poisson regression, when exponentiated, is the rate ratio comparing the rate of an event between groups [168]. To analyze time-to-event outcomes, proportional hazards regression (PHR) is the analytical method often used. While the PHR equation can be displayed in a form analogous to the other forms of regression discussed with respect to link function, PHR is not directly derived from generalized linear models. Instead, PHR employs what is known as a hazard function to perform the transformation of the outcome. The parameter estimate associated with PHR is the hazard ratio - the ratio of “instantaneous” risk (hazard) of an event over time between genotype groups. The major assumption of PHR is that the hazard ratio is constant over time. This assumption is known as proportionality of hazard [166]. A benefit of hazards regression is that it provides information on significant differences in long-term outcome within treatment or genotype groups.
Regression is usually the favored approach for analyzing data in genetic association studies due to multiple advantages afforded by the framework of the regression equation. First, effects of secondary variables such as confounders and precision variables - independent variables associated with the outcome but not with the predictor of interest - can be adjusted out through their inclusion as covariates in a multivariate regression equation. Including covariates probes the effect of differences in the genotype upon the outcome among individuals who have equivalent or similar values for the covariates [168]. A nested case-control study design, for example, would be analyzed with a logistic regression equation including participant age as a covariate. A further benefit of regression is the flexibility of modeling provided. Although the use of different coding schema for the modeling of genotypic effects has been briefly discussed, the issue bears repeating as different coding conventions change the scientific interpretation of the risk parameter estimates. Additive encoding, for example, assumes a linear increase in effect with the addition of each risk allele. Alternatively, two of the three genotypes could be coded with dummy variables (i.e., 0/1 indicator variables) [168], using the third genotype as the referent genotype to simulate an analysis of variance (ANOVA) model in which each genotype is tested against the other two with no assumption of a linear trend in effect across genotype groups [169]. This encoding is often termed the genotypic encoding [125] and has the benefit of being able to detect non-traditional genotype risk models such as the interference model [170], in which the heterozygote genotype is associated with either the largest or smallest effect. 
The problem with the genotypic encoding is that it has less statistical power than the additive encoding if there truly is a linear increase in the size of the effect with each additional variant allele.
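The coding schemes discussed above can be summarized in a small sketch. It assumes genotypes are stored as minor-allele counts (0, 1 or 2) and that the major allele homozygote is the referent; the exact values shown in Figure 6 may differ, so the function and its scheme names should be read as illustrative.

```python
# Minor-allele counts: 0 = major homozygote, 1 = heterozygote, 2 = minor homozygote.
def encode(minor_allele_count, scheme):
    """Return the regression covariate(s) for one genotype under a coding scheme."""
    g = minor_allele_count
    if scheme == "additive":
        return (g,)
    if scheme == "dominant":
        return (1 if g >= 1 else 0,)
    if scheme == "recessive":
        return (1 if g == 2 else 0,)
    if scheme == "genotypic":
        # Two 0/1 indicator variables; the major homozygote is the referent.
        return (1 if g == 1 else 0, 1 if g == 2 else 0)
    raise ValueError(f"unknown scheme: {scheme}")

table = {s: [encode(g, s) for g in (0, 1, 2)]
         for s in ("additive", "dominant", "recessive", "genotypic")}
for scheme, codes in table.items():
    print(scheme, codes)
```

The genotypic scheme spends two parameters on the variant rather than one, which is the source of its lower power when the true effect is linear in the number of variant alleles.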
It is important to be cognizant of the effects which coding has on regression analysis. Errors can result if a statistical package does not recognize the coding used for missing values of a variable, as this will cause the missing coding to be interpreted as an additional level of the independent variable and the resulting parameter estimates will be inaccurate. In addition to the potential to model a single genetic effect in multiple ways, non-additive interactions between multiple variables can be explored through the addition of one or more interaction terms. A significant gene-gene interaction would indicate that the effect of one variant differs according to the genotype of the other variant and thus the risk from one marker cannot be assessed without knowing the genotype at the other [171].
While regression offers the capability of performing most statistical analyses which would be required for a pharmacogenomic association study, it carries the disadvantage of the distributional assumptions inherent to parametric analysis methodology. Small deviations from assumptions can be overcome with large sample sizes and the use of robust statistics, but nonparametric statistical methods are sometimes necessary in the instance of severe deviations. One such nonparametric test is the Kruskal-Wallis test [172], which can be used to circumvent the distributional assumptions of linear regression in the analysis of a continuously distributed phenotype. The Kruskal-Wallis test is the nonparametric form of the ANOVA, which tests for a difference in mean trait values across groups of a nominal independent variable such as a SNP. Neither the ANOVA nor the Kruskal-Wallis test assumes a linear trend across the genotypes of the variant. Nonparametric regression methods are also available [173, 174, 175]. Other nonparametric methods are covered at length by Hastie, Tibshirani and Friedman [176].
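As an illustration of a rank-based test, the Kruskal-Wallis H statistic can be computed as in the sketch below. It uses the standard formula with average ranks for tied values but omits the tie correction factor, and the trait values for the three genotype groups are hypothetical.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for a list of samples."""
    pooled = sorted((value, g) for g, sample in enumerate(groups) for value in sample)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        mean_rank = (i + 1 + j) / 2.0  # average of ranks i+1 .. j for tied values
        for _, g in pooled[i:j]:
            rank_sums[g] += mean_rank
        i = j
    h = 12.0 / (n * (n + 1)) * sum(r * r / len(s) for r, s in zip(rank_sums, groups))
    return h - 3.0 * (n + 1)

# Hypothetical trait values for the three genotype groups of one SNP.
h = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(f"H = {h:.2f}")
```

The statistic is compared with a chi-square distribution whose degrees of freedom equal the number of groups minus one, here two for the three genotype groups.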
7.2. Epistasis Analysis Methods
While there have been a few examples of tremendous success in which monogenic associations predict the outcome with accuracy, as is the case with the HLA-B*5701 association to abacavir HSR [52], it is likely that in many cases the consideration of more complex genetic models could improve accuracy [177]. The idea that gene-gene interactions might play a large role in pharmacogenomics is supported by the knowledge of extensive and interconnected drug metabolism networks [178]. In many cases, multiple enzymes can metabolize the same drug [179, 180, 181]; complementation among components in the pathways of drug action indicates that genetic variation at multiple steps could act in synchrony to modulate drug response. The presence of non-additive gene-gene interactions (i.e. statistical epistasis) can be investigated through the use of many statistical methods [182, 183, 184] (Figure 7).
Figure 7.
Analysis methods used to explore gene-gene interactions.
Parametric methods such as regression can be useful in particular cases, but they tend to suffer from the “curse of dimensionality” [185], a phenomenon whereby there are few individuals possessing the rarest multi-locus genotypes. This in turn can cause inaccurate estimates for the size of the interaction effect. One method which has been developed to overcome the curse of dimensionality is Multifactor Dimensionality Reduction (MDR) [186, 187]. The MDR algorithm takes a set of markers and looks at the intersection of genotypes to determine if, for a particular multi-locus genotype, there are more cases than controls with that genotype. The multi-locus genotypes with more cases than would be expected are labeled as “high-risk” and the others as “low-risk”. The interaction between these variants is then collapsed to a single dimension by pooling “high-risk” and “low-risk” cells separately. A measure of classification error is used to judge the accuracy of the method. To prevent overfitting - a phenomenon whereby increasing the number of variables used for prediction decreases classification error only within the data set from which the model is built - the data are divided into multiple partitions and the algorithm is repeated once for each partition of the data using all but one of the partitions to build the model and the final partition for evaluation of the model by prediction accuracy. This process is known as cross-validation and it is performed until each partition has served as the basis for model evaluation. The best model is that which is most accurate over all iterations of the cross-validation process. Finally, permutation testing is used to generate a distribution of test statistics under the null hypothesis of no association which the statistic of the best interaction model found in the data can be compared back to in order to determine a p-value.
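The core labeling step of MDR can be sketched for a single two-locus model as follows. This is a simplified illustration which omits cross-validation and permutation testing, and which assumes that a cell is labeled high-risk when its case:control ratio exceeds the overall ratio in the sample; the function name and the toy data are hypothetical.

```python
from collections import Counter

def mdr_classification_error(genotypes, is_case):
    """Label each multi-locus genotype cell high or low risk and return the error."""
    cases, controls = Counter(), Counter()
    for cell, case in zip(genotypes, is_case):
        (cases if case else controls)[cell] += 1
    # Assumed threshold: the overall case:control ratio in the sample.
    threshold = sum(cases.values()) / sum(controls.values())
    high_risk = {cell for cell in set(cases) | set(controls)
                 if controls[cell] == 0 or cases[cell] / controls[cell] > threshold}
    # Predict "case" for high-risk cells and "control" for all other cells.
    errors = sum((cell in high_risk) != bool(case)
                 for cell, case in zip(genotypes, is_case))
    return high_risk, errors / len(genotypes)

# Hypothetical two-locus genotypes (carrier status at each locus) and outcomes.
genotypes = [(0, 0), (0, 0), (0, 1), (0, 1), (1, 1), (1, 1), (1, 0), (1, 0), (0, 1)]
is_case = [1, 1, 0, 0, 1, 1, 0, 0, 1]
high_risk, error = mdr_classification_error(genotypes, is_case)
print(sorted(high_risk), round(error, 3))
```

In a full analysis this error would be estimated on the held-out partition of each cross-validation iteration rather than on the data used to define the high-risk cells.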
MDR has been used to search for interactions between estrogen metabolizing genes in breast cancer patients [187] and drug metabolism enzymes in response to efavirenz treatment in HIV-infected individuals [177] as well as other traits [188, 189]. The MDR interaction analysis paradigm does have disadvantages. First, MDR can only utilize binary outcomes and cannot adjust for potential confounders. It is possible to look for interactions between categorical covariates and genetic factors, such as gene-treatment interactions. MDR also imposes a significant computational burden when employed in genome-wide data due to the exhaustive nature of its exploration of interaction effects. These shortcomings are dealt with to some degree by Model-Based MDR (MB-MDR), a version of MDR which is designed to work with both binary and continuously distributed outcomes and can be used in conjunction with mixed-model regression to correct for covariates [190].
In addition to dimensionality reduction methods, there are also tree-based [191, 192], evolutionary [193, 194, 195, 196], and modified regression methods [197, 198, 199] designed to handle complex trait etiology. Tree-based methods such as Classification and Regression Trees (CART) [191] and Random Forests [200] use an iterative algorithm which splits the data based upon the best predictor in order to build a tree which can be used for prediction of the trait value in future individuals. Evolutionary methods seek to find the optimal predictive model using an algorithm which co-opts the biological ideas of genetic mutation, recombination and reproductive fitness. Examples of evolutionary algorithms designed for application to genetic data are Grammatical Evolution Neural Networks (GENN) [196] and Genetic Programming Neural Networks (GPNN) [201]. Due to the popularity of regression, many modifications have been made to allow modeling of complex processes. Examples of complex regression methods designed with genetic association studies in mind are Lasso regression [202] and Logic regression [199]. The use of a wide variety of methods designed for gene-gene interaction analysis in pharmacogenomics studies is reviewed by Motsinger [184].
7.3. Filtering
When conducting large-scale genotyping, such as in a genome-wide pharmacogenomic association study, it is sometimes useful to reduce the data set to a more manageable subset of variants likely to be relevant to the outcome under research. This consideration is particularly relevant if there is intent to explore complex genetic etiology such as gene-gene and/or gene-environment interactions. One of the challenges opposing widespread gene-gene interaction analysis in pharmacogenomic association studies is that of complexity. If a genome-wide genotyping array assaying 1 million markers is used, exhaustively exploring two-way interactions would require approximately 5×10¹¹ statistical tests. Not only does this create multiple comparison issues, but it also presents a computational task bordering on infeasibility. Due to both computational and multiple testing issues, it is advantageous to focus the search for interactions. It is important to note that filtering GWAS data presents a viable alternative to performing analysis on all variants due to the inherent assumption that the majority of markers genotyped are not associated with the outcome.
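The size of the exhaustive two-way search follows directly from the number of unordered marker pairs, n(n-1)/2, as the short calculation below shows. The per-test threshold is included only to illustrate the multiple comparison burden and assumes a simple division of a 0.05 significance level across all pairs.

```python
from math import comb

n_markers = 1_000_000
pairwise_tests = comb(n_markers, 2)          # n(n-1)/2 unordered marker pairs
per_test_alpha = 0.05 / pairwise_tests       # illustrative per-test threshold
print(f"{pairwise_tests:.3e} pairwise tests, per-test alpha {per_test_alpha:.1e}")
```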
Multiple methods are available by which to filter the initial set of genotyped markers [203, 204, 205, 206]. Spatially Uniform ReliefF and Tuned ReliefF [206] (SURF&TuRF) is an approach which looks at measures of genetic distance to uncover markers associated with the phenotype. The concept is to focus on groups of genetically similar individuals and determine whether deviations of genotype between these individuals are correlated with deviations of phenotype. For each single marker or two-locus pair, a set of individuals who are within a certain threshold of genetic similarity are selected by SURF and then compared for differences in phenotype. Individuals who differ in genotype at the marker or pair of markers being examined and also differ in outcome cause the marker to be up-weighted while those with the same outcome and different genotypes cause the marker or pair of markers to be down-weighted. The process is performed iteratively, where the lowest scoring markers are removed subsequent to each iteration.
Another approach to filtering is to use a statistical test to condition upon single-locus associations and then perform the test for interactions within the subset of markers which display a marginal (i.e. single-locus) effect [207]. The argument against this technique is that it ignores the possibility of interactions between markers with subtle or non-existent marginal effects. With deference to this position, for pharmacogenomic association studies in which the focus is on large genetic effects which could be predictive of drug response, the lack of significant single-locus associations attributable to the variants composing a multi-locus interaction should not be an issue. Theoretical work has shown that while analysis of one study population might reveal a gene-gene interaction with no significant single-locus effects – also known as a purely epistatic effect – even a small shift in allele frequency in future studies would cause a detectable marginal effect [208].
Another approach to reducing the search space - the total set of variables used for analysis - involved in gene-gene interaction analysis is the Biofilter approach [203]. Biofilter focuses multi-locus analysis using biological knowledge in the form of protein- and gene-interaction databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) and Reactome. Genetic markers are selected from genes which are chosen based on manual curation or use of disease-dependent databases containing prior knowledge of the outcome’s etiology. Markers in interacting genes or in genes sharing a common metabolic or biological pathway are then paired together to form biologically plausible multi-locus interaction models. Biofilter is capable of reducing the search space of gene-gene interactions in a GWAS from hundreds of billions of models to several million or fewer, a reduction of about four orders of magnitude. Filtering genetic data prior to conducting gene-gene interaction analysis is pertinent to any study generating large quantities of genotypes, although it is particularly relevant for GWAS.
8. POST-ANALYSIS ISSUES
8.1. Multiple Comparison Correction
Possibly the greatest challenge to consider when evaluating the results of a genetic association study is multiple testing. It is expected for a GWAS that approximately 500,000 to 1 million statistical tests will be performed when excluding the possibility of complex genetic effects. The result, when using the standard significance threshold of 0.05 for each statistical test, is that on average 5% of the tests - 50,000 genetic variants - will surpass this threshold purely by chance [209]. This exemplifies the problem of multiple testing. Although a significance threshold is valid when conducting a single statistical test, the probability of surpassing this threshold upon repeated testing increases so that once only 5 tests are conducted, the chance of making at least one association purely by chance rises from 5% with a single test to approximately 23%. The most common procedure used to adjust for multiple testing is the Bonferroni correction [210]. The Bonferroni multiple test correction divides the p-value threshold required to consider a single test significant by the total number of statistical tests to determine the new threshold. For a GWAS with 1 million genetic markers, a p-value of 5×10⁻⁸ is required to be considered genome-wide significant. The reality is that the Bonferroni correction is much too conservative for the analysis of most genetic marker data [211], as it assumes that all statistical tests performed are independent. LD patterns tell us that there is significant correlation structure in the genome and thus testing all markers for association will not yield completely independent test statistics. The issue of non-independent tests is particularly relevant when conducting gene-gene interaction analyses, as the tests containing overlapping markers will undoubtedly be correlated. MDR solves this problem utilizing permutation testing [186], although this can be computationally challenging for large data sets.
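The figures quoted above can be reproduced with a few lines of arithmetic. The sketch assumes independent tests, which is the same assumption that makes the Bonferroni correction conservative for correlated markers; the function names are illustrative.

```python
def familywise_error(alpha, n_tests):
    """Probability of at least one false positive among independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests

fwer_5 = familywise_error(0.05, 5)                       # about 0.23
gwas_threshold = bonferroni_threshold(0.05, 1_000_000)   # 5e-8
expected_false_positives = 0.05 * 1_000_000              # about 50,000
print(f"{fwer_5:.3f} {gwas_threshold:.1e} {expected_false_positives:.0f}")
```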
Another option for multiple testing correction is controlling the false discovery rate (FDR), suggested by Benjamini and Hochberg [212]. The FDR adjustment finds a p-value cutoff to be considered significant based on the distribution of p-values and the user’s definition of an acceptable proportion of false discoveries. All results with p-values below this threshold are taken as significant with the caveat that a proportion of the significant results are assumed to be false positive associations. Correcting for multiple testing requires walking the fine line between accepting false positives and discarding false negatives. Prior problems with false positive associations which failed to replicate in follow-up research [213, 214] have led to an established requirement for replication studies to prove the validity of the results in genetic association studies. Whenever multiple tests are performed, the false positive rate will increase and thus the number of tests must be accounted for in order to consider the results statistically valid.
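The Benjamini-Hochberg step-up procedure can be sketched as follows: the p-values are ranked, and the largest rank k for which the k-th smallest p-value is at most (k/m)·q defines the cutoff. The function name, the FDR level q = 0.05 and the example p-values are assumptions made for illustration.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return indices of tests declared significant at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])

# Hypothetical p-values from eight association tests.
significant = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205])
print(significant)
```

With these values only the two smallest p-values are retained, whereas five of the eight tests would pass an uncorrected 0.05 threshold.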
8.2. Interpretation of Results
The interpretation and reporting of results in genetic association studies can be notoriously difficult due to multiple comparisons and other challenges. For GWAS, results are reported as statistically significant if they surpass the 5×10⁻⁸ p-value threshold for genome-wide significance. The fact that a variant displays genome-wide significance says nothing about whether the genetic variant implicated by the association is causative. Particularly in GWAS, the associations detected are nearly always indirect associations with tag SNPs [11], meaning that the actual causative mutation will lie within a genomic region of LD which is tagged by the marker on the GWAS platform. When the associated tag SNP is in a gene with clear biological relevance, interpretation is likely to be more straightforward. Many times in GWAS, however, associated markers are found in intergenic regions [86] such as gene deserts, in which there are no genes present for thousands to hundreds of thousands of base pairs. In these situations, it is useful to examine nearby genes for a plausible explanation of the association. It could be that the associated marker is in a long-range regulatory element for a gene of importance [215]. By changing the action of regulatory elements, such a variant could modulate gene expression and protein levels of components such as the enzymes tied to metabolism of the drug under study. Recent work has shown that some intergenic associations could also affect non-coding RNA elements [216, 217].
A particular class of non-coding RNA which has received considerable attention in recent years is the microRNA [218]. A microRNA is an RNA element which is transcribed from DNA but not translated into protein. Its function is to target the transcripts of particular genes for degradation and thus act as an additional level of regulation in gene expression, estimated to affect up to 30% of genes [219]. The importance of microRNAs in relation to pharmacogenomics is shown by Mishra et al. in a 2007 paper which identifies a microRNA binding site mutation located in the gene coding for dihydrofolate reductase which results in resistance to the chemotherapy drug methotrexate [220]. It is hypothesized that targeting microRNAs could serve as a therapeutic intervention for diseases like cancer for which changes in gene expression are proposed to play a significant role [219].
While genetic associations in intergenic regions can be difficult to account for, sometimes the associated marker is in a gene for which the function is not currently understood [86] or for which there is no functional relevance to the outcome under study [90]. It is the latter case, in particular, which warrants skepticism, especially for drugs with well-studied metabolism pathways. If no potential role of the mutation can be elucidated, the result could represent a false positive. At this point, interpretation depends on the power of the study as well as the effect size and significance level of the variant. Replication can also be a decisive factor in revealing false positives in the original study, provided the replication study is well powered to detect the associated variant. If the effect of the variant in an unrelated gene is large and predicts the outcome well, particularly in individuals not present in the original study, or if the effect size is small but the study was well powered to discover a variant of that effect size, the association could represent novel biology. For these reasons, GWAS is useful for scanning the genome for novel common variation contributing to a trait but does suffer from interpretation issues. Awareness of these issues, however, can mitigate the challenge. Results of small candidate gene studies are often much easier to interpret in this respect. Candidate gene studies have the benefit of searching for direct rather than indirect association, so a strong statistical signal can be more easily linked to causation. It should be noted, however, that validation will be required to prove the predictive value of an associated variant from any study design.
8.3. Replication
In 2007, the NHGRI Working Group on Replication in Association Studies laid out requirements for replication in genetic association studies [93]. The working group defines replication as a significant effect of the same genetic variant, or one which is highly correlated with the original, in the same direction (e.g., increased risk of an adverse event in both studies) and in a comparable population. These requirements came in response to the inability to replicate many early GWAS findings [213, 214]. The working group recommended that replication of a finding within a population comparable to that in the original association study be required for the publication of significant GWAS findings, citing the need to differentiate true positive results from false positives. While there is merit to this perspective, replication should take different forms under diverse circumstances. For example, the issues surrounding replication for models of gene-gene and/or gene-environment interactions have not been resolved. The underlying rationale for replication is to strengthen the evidence that a genetic variant is truly associated with the trait under study. Ideally, the replication study should find precisely the same SNP to have a significant effect in the same direction (e.g., increased risk of disease), although the indirect nature of GWAS almost necessitates searching within the entire genomic region defined by LD structure around the original variant [221]. Narrowing the association to identify the true causal variant is the rationale for performing fine-mapping studies [221]. Fine-mapping involves genotyping many variants to increase coverage around the marker associated in the primary study and is a valuable approach with increased power to replicate the finding, provided it is not a false positive. Because the strength of evidence needed to justify fine-mapping is usually not present with the results of a single GWAS, most replication studies come in the form of a secondary GWAS [221].
It is important to note that the strength of evidence from GWAS in traditional genetic epidemiology studies has been weak because the contribution to risk from any single genetic variant found in a genome-wide study is often vanishingly small. Many early pharmacogenomic association studies give us hope that the effect sizes to be discovered in drug response phenotypes are substantially larger. If this turns out to be true, fine-mapping could potentially follow directly from GWAS.
8.4. Validation and Translation
A consideration for all pharmacogenomic association studies is how to proceed after finding significant associations. Clearly, the ideal endpoint of a pharmacogenomic study is the translation of the result into a genetic test for use in predicting drug response and guiding clinical treatment. In order to translate a finding into a test, several intermediate steps must be completed. First, the effect of the association must be validated [222] and estimates of the size of the effect refined. The latter is particularly important as it will allow cost-effectiveness estimates to be performed. It has been found that initial association studies have a tendency to overstate estimates of risk in a phenomenon known as the “winner’s curse” [221]. A retrospective cohort design in a large study population would be a cost-effective technique for validating and refining the estimate of a variant’s predictive ability in a population setting if the trait is either continuously distributed or fairly common, or if the goal is to show that patients with a particular genotype experienced better outcomes over the long term [223]. A retrospective cohort study would not be suitable, however, for a trait which might suffer from survival bias [73]. Individuals at one end of the spectrum of trait values could be under-represented due to inability to participate in the study, and risk estimates from the study would then be biased. This situation can arise with fatal or severe ADRs and with drugs prescribed post-transplant, where survival time would strongly skew the portion of the treated population who could participate. Where survival bias is a concern, conducting a prospective cohort study or making use of RCT data would be more appropriate.
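The “winner’s curse” noted above can be illustrated with a small simulation. The sketch below is hypothetical throughout: the true effect, standard error, selection threshold and one-sided selection rule are arbitrary assumptions chosen for illustration, not values taken from any study.

```python
import random
import statistics


def simulate_winners_curse(true_effect: float = 0.10,
                           std_error: float = 0.05,
                           z_threshold: float = 2.5,
                           n_studies: int = 10_000,
                           seed: int = 1) -> dict:
    """Simulate many discovery studies of one variant and compare effect estimates.

    Each hypothetical study reports an estimate drawn from a normal distribution
    centred on the true effect. Only estimates whose z-score exceeds the
    threshold count as 'discovered' (a one-sided rule, assumed for simplicity).
    """
    rng = random.Random(seed)
    estimates = [rng.gauss(true_effect, std_error) for _ in range(n_studies)]
    discovered = [b for b in estimates if b / std_error > z_threshold]
    return {
        # Averaged over all studies, the estimate is close to the true effect.
        "mean_all": statistics.fmean(estimates),
        # Averaged over 'significant' studies only, the estimate is inflated.
        "mean_discovered": statistics.fmean(discovered) if discovered else float("nan"),
        "fraction_discovered": len(discovered) / n_studies,
    }


if __name__ == "__main__":
    print(simulate_winners_curse())
```

Under these assumptions, every estimate that passes the threshold must exceed 0.125, so the average reported effect among discoveries necessarily overstates the true effect of 0.10. This is why effect sizes should be re-estimated in an independent validation sample.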
Probably the most accurate genetic risk estimates would be gained by enrolling participants based on genotype as the exposure. While this would be costly and time-consuming, it has been used to look at long-term outcomes in individuals whose clinical treatment has been determined as a result of genotyping [144]. For rare outcomes, it would be necessary to rely on a case-control study for a risk estimate. Either design is acceptable as long as the sample size is sufficiently large and selection bias is minimized during ascertainment. A stable risk estimate will enable cost-effectiveness analysis. If a genetic test would not be cost-effective, either by a purely monetary definition or by a definition integrating quality-adjusted life-years (QALYs) or disability-adjusted life-years (DALYs) [224], it is extremely unlikely to be accepted by the clinic or investors. Cost-effectiveness has been a factor restricting the implementation of genetic testing for warfarin dosing despite the strong genetic associations [48]. Factors determining cost-effectiveness include severity of the outcome and number needed to test (NNT) to prevent one event. Treatment setting should also be a factor in evaluating cost-effectiveness, as a different scale will be necessary when evaluating treatments in a resource-limited setting.
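One simple way to relate the NNT to cost is sketched below. The formula is an assumption made for illustration: it treats the NNT as the reciprocal of the carrier frequency multiplied by the absolute risk reduction achieved in carriers when testing changes their treatment, and it ignores downstream treatment costs, test error and QALY or DALY weighting. All input values are hypothetical.

```python
def number_needed_to_test(carrier_frequency: float,
                          risk_if_untested: float,
                          risk_if_tested: float) -> float:
    """Patients who must be genotyped to prevent one adverse event.

    carrier_frequency: proportion of patients carrying the risk genotype.
    risk_if_untested:  event risk in carriers given standard treatment.
    risk_if_tested:    event risk in carriers whose treatment is altered.
    """
    events_prevented_per_test = carrier_frequency * (risk_if_untested - risk_if_tested)
    if events_prevented_per_test <= 0:
        raise ValueError("testing must reduce event risk in carriers")
    return 1.0 / events_prevented_per_test


def cost_per_event_averted(nnt: float, cost_per_test: float) -> float:
    """Testing cost incurred for each adverse event prevented."""
    return nnt * cost_per_test


if __name__ == "__main__":
    # Hypothetical inputs: 5% carriers, event risk in carriers falls from 50% to 0%.
    nnt = number_needed_to_test(0.05, 0.50, 0.0)
    print(nnt)                                 # about 40 patients tested per event prevented
    print(cost_per_event_averted(nnt, 100.0))  # about 4,000 cost units per event prevented
```

The sketch makes the dependence explicit: a rare risk genotype or a small risk reduction in carriers drives the NNT up, and the acceptable cost per event averted will differ between well-resourced and resource-limited settings.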
CONCLUSIONS AND EXPERT OUTLOOK
In the current era of relatively inexpensive genotyping and availability of electronic medical records, we are edging closer to realizing personalized medicine. Using pharmacogenomic association studies to elucidate the interplay between genetics and drug response is an integral component of this shift to personalization. In order to be effective, however, it is paramount that these association studies are designed carefully. Ascertainment must be conducted in a manner which reduces the magnitude of information and selection bias. When possible, it can be statistically advantageous to use non-binary phenotypes. Statistical testing should be planned and focused to answer the study question in the interest of minimizing false positive results. Due to the costs of an association study and inherent publication bias, it can be tempting to pursue alternative phenotypes in attempts to discover a significant result. While it is important not to over-mine the available data, it is also important to properly account for the complexity of the genetic architecture underlying the outcome. Few association studies explore the possibility of complex genetic effects such as gene-gene or gene-environment interactions. Due to the nuanced nature of drug metabolism and transport networks, it is probable that these higher-order effects play a significant and heretofore underappreciated role in determining phenotypic differences between individuals.
Association studies are the first step in a long road to enacting genetic tests in practice. The abacavir story has demonstrated that extensive validation efforts are necessary to prove the effect of a polymorphism and illustrate its predictive ability in practice. Although many genetic variants found are likely to have tremendous potential to effect improvements in treatment on an individual level, the burden of proving cost-effectiveness falls on the researcher. As association studies continue to probe the genetic basis of drug response and our biological knowledge grows, it might be possible to overcome the cost-effectiveness hurdle by identifying focused populations at which to target the genetic test. The reality is that the practice of genetic testing will not be feasible on a large scale until significant changes are made. Current turn-around times for some genetic tests make it infeasible for testing data to guide initial treatment and dosing. One option which is being explored to solve this problem is the use of portable genetic tests which can be run with faster turn-around. Genetic testing to inform the prescription of warfarin is moving towards this end [225].
There are several “portable” assays which can be used to measure variation in VKORC1 and CYP2C9 with considerable decrease in testing time. The cost of hardware required for testing is prohibitive to making these tests truly widespread at this point, particularly in resource-limited settings. Once we can develop truly portable assays without an associated drop in accuracy, such testing could be a panacea for pharmacogenomics. This is especially true in developing countries where the density of high-quality genotyping labs is much lower and the importance of choosing an appropriate initial treatment regimen is maximized due to fewer opportunities for follow-up visits.
Finally, another interesting option is the preemptive testing of a suite of important variants to reduce the per-genotype cost, keeping all results in the patient’s medical records [226]. Having genetic information on file at the time of treatment initiation would be the ideal manifestation of pharmacogenomics. As the cost of sequencing continues to fall, it is possible that genome-wide sequence data will eventually be available to refer to at the time of treatment initiation. Even this scenario, however, relies on the use of well-designed pharmacogenomic association studies to determine upon which genetic data the treatment should be determined.
Acknowledgements
This work was supported by NIH grants LM010040, AI077505, HG004608, and HL065962.
LIST OF ABBREVIATIONS
- ACTG
AIDS Clinical Trials Group
- ADME
Absorption, Distribution, Metabolism and Elimination
- ADR
Adverse Drug Reaction
- AIDS
Acquired Immunodeficiency Syndrome
- AIM
Ancestry Informative Marker
- AMD
Age-related Macular Degeneration
- ANOVA
Analysis of Variance
- CDCV
Common Disease, Common Variant
- CDRV
Common Disease, Rare Variant
- CI
Confidence Interval
- CNV
Copy Number Variant
- CYP
Cytochrome P450
- DALY
Disability-Adjusted Life-Year
- DMET
Drug Metabolizing Enzymes and Transporters
- eMERGE
Electronic Medical Records and Genomics
- FDR
False Discovery Rate
- GENEVA
Gene, Environment Association Studies
- GWAS
Genome Wide Association Study
- HIV
Human Immunodeficiency Virus
- HSR
Hypersensitivity Reaction
- HWE
Hardy Weinberg Equilibrium
- IBD
Identity By Descent
- INR
International Normalized Ratio
- KEGG
Kyoto Encyclopedia of Genes and Genomes
- LD
Linkage Disequilibrium
- MAF
Minor Allele Frequency
- MB-MDR
Model-Based Multifactor Dimensionality Reduction
- MDR
Multifactor Dimensionality Reduction
- MI
Myocardial Infarction
- NCI
National Cancer Institute
- NHGRI
National Human Genome Research Institute
- NNT
Number Needed to Test
- NPV
Negative Predictive Value
- OR
Odds Ratio
- QALY
Quality-Adjusted Life-Year
- QC
Quality Control
- Q-Q
Quantile-Quantile
- RCT
Randomized Clinical Trial
- RFLP
Restriction Fragment Length Polymorphism
- SNP
Single Nucleotide Polymorphism
- SURF&TuRF
Spatially Uniform ReliefF and Tuned ReliefF
Footnotes
PSC Scientific and Medical Advisory Committee, 2010
CONFLICT OF INTERESTS
None declared/applicable
References
- 1. Lazarou J, Pomeranz BH, Corey PN. Incidence of adverse drug reactions in hospitalized patients: a meta-analysis of prospective studies. JAMA. 1998;279(15):1200–1205. doi: 10.1001/jama.279.15.1200.
- 2. Botstein D, White RL, Skolnick M, et al. Construction of a genetic linkage map in man using restriction fragment length polymorphisms. Am J Hum Genet. 1980;32(3):314–331.
- 3. Weber JL, May PE. Abundant class of human DNA polymorphisms which can be typed using the polymerase chain reaction. Am J Hum Genet. 1989;44:388–396.
- 4. Collins FS, Guyer MS, Chakravarti A. Variations on a theme: cataloging human DNA sequence variation. Science. 1997;278(5343):1580–1581. doi: 10.1126/science.278.5343.1580.
- 5. Durbin RM, Abecasis GR, Altshuler DL, et al. A map of human genome variation from population-scale sequencing. Nature. 2010;467(7319):1061–1073. doi: 10.1038/nature09534.
- 6. Sebat J, Lakshmi B, Troge J, et al. Large-scale copy number polymorphism in the human genome. Science. 2004;305(5683):525–528. doi: 10.1126/science.1098918.
- 7. Joober R, Boksa P. A new wave in the genetics of psychiatric disorders: the copy number variant tsunami. J Psychiatry Neurosci. 2009;34(1):55–59.
- 8. Collins FS, Morgan M, Patrinos A. The Human Genome Project: lessons from large-scale biology. Science. 2003;300(5617):286–290. doi: 10.1126/science.1084564.
- 9. Lander ES, Linton LM, Birren B, et al. Initial sequencing and analysis of the human genome. Nature. 2001;409(6822):860–921. doi: 10.1038/35057062.
- 10. International HapMap Consortium. The International HapMap Project. Nature. 2003;426(6968):789–796. doi: 10.1038/nature02168.
- 11. International HapMap Consortium. A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007;449(7164):851–861. doi: 10.1038/nature06258.
- 12. Slatkin M. Linkage disequilibrium--understanding the evolutionary past and mapping the medical future. Nat Rev Genet. 2008;9(6):477–485. doi: 10.1038/nrg2361.
- 13. Lander ES. The new genomics: global views of biology. Science. 1996;274(5287):536–539. doi: 10.1126/science.274.5287.536.
- 14. Haines JL, Pericak-Vance MA. Genetic Analysis of Complex Disease. Hoboken, NJ: John Wiley & Sons; 2006.
- 15. Hindorff LA, Sethupathy P, Junkins HA, et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci U S A. 2009;106(23):9362–9367. doi: 10.1073/pnas.0903103106.
- 16. Manolio TA, Brooks LD, Collins FS. A HapMap harvest of insights into the genetics of common disease. J Clin Invest. 2008;118(5):1590–1605. doi: 10.1172/JCI34772.
- 17. Manolio TA. Genomewide association studies and assessment of the risk of disease. N Engl J Med. 2010;363(2):166–176. doi: 10.1056/NEJMra0905980.
- 18. Botstein D, Risch N. Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease. Nat Genet. 2003;33(Suppl):228–237. doi: 10.1038/ng1090.
- 19. Carlson CS, Eberle MA, Kruglyak L, et al. Mapping complex disease loci in whole-genome association studies. Nature. 2004;429(6990):446–452. doi: 10.1038/nature02623.
- 20. Maher B. Personal genomes: The case of the missing heritability. Nature. 2008;456(7218):18–21. doi: 10.1038/456018a.
- 21. Corder EH, Saunders AM, Strittmatter WJ, et al. Gene dose of apolipoprotein E type 4 allele and the risk of Alzheimer's disease in late onset families. Science. 1993;261(5123):921–923. doi: 10.1126/science.8346443.
- 22. Pericak-Vance MA, Bebout JL, Gaskell PC, et al. Linkage studies in familial Alzheimer's disease: evidence for chromosome 19 linkage. Am J Hum Genet. 1991;48:1034–1050.
- 23. Haines JL, Hauser MA, Schmidt S, et al. Complement factor H variant increases the risk of age-related macular degeneration. Science. 2005;308(5720):419–421. doi: 10.1126/science.1110359.
- 24. Edwards AO, Ritter R, III, Abel KJ, et al. Complement factor H polymorphism and age-related macular degeneration. Science. 2005;308(5720):421–424. doi: 10.1126/science.1110189.
- 25. Klein RJ, Zeiss C, Chew EY, et al. Complement factor H polymorphism in age-related macular degeneration. Science. 2005;308(5720):385–389. doi: 10.1126/science.1109557.
- 26. Bertram L, Lill CM, Tanzi RE. The genetics of Alzheimer disease: back to the future. Neuron. 2010;68(2):270–281. doi: 10.1016/j.neuron.2010.10.013.
- 27. Schork NJ, Murray SS, Frazer KA, et al. Common vs. rare allele hypotheses for complex diseases. Curr Opin Genet Dev. 2009;19(3):212–219. doi: 10.1016/j.gde.2009.04.010.
- 28. Via M, Gignoux C, Burchard EG. The 1000 Genomes Project: new opportunities for research and social challenges. Genome Med. 2010;2(1):3. doi: 10.1186/gm124.
- 29. Manolio TA, Collins FS, Cox NJ, et al. Finding the missing heritability of complex diseases. Nature. 2009;461(7265):747–753. doi: 10.1038/nature08494.
- 30. Phillips PC. Epistasis--the essential role of gene interactions in the structure and evolution of genetic systems. Nat Rev Genet. 2008;9(11):855–867. doi: 10.1038/nrg2452.
- 31. Bateson W. Mendel's Principles of Heredity. Cambridge: Cambridge University Press; 1909.
- 32. Barbaric I, Miller G, Dear TN. Appearances can be deceiving: phenotypes of knockout mice. Brief Funct Genomic Proteomic. 2007;6(2):91–103. doi: 10.1093/bfgp/elm008.
- 33. Franz T, Winckler L, Boehm T, et al. Capn5 is expressed in a subset of T cells and is dispensable for development. Mol Cell Biol. 2004;24(4):1649–1654. doi: 10.1128/MCB.24.4.1649-1654.2004.
- 34. Giaever G, Chu AM, Ni L, et al. Functional profiling of the Saccharomyces cerevisiae genome. Nature. 2002;418(6896):387–391. doi: 10.1038/nature00935.
- 35. Smith V, Chou KN, Lashkari D, et al. Functional analysis of the genes of yeast chromosome V by genetic footprinting. Science. 1996;274(5295):2069–2074. doi: 10.1126/science.274.5295.2069.
- 36. Winzeler EA, Shoemaker DD, Astromoff A, et al. Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science. 1999;285(5429):901–906. doi: 10.1126/science.285.5429.901.
- 37. Bouche N, Bouchez D. Arabidopsis gene knockout: phenotypes wanted. Curr Opin Plant Biol. 2001;4(2):111–117. doi: 10.1016/s1369-5266(00)00145-x.
- 38. Bondos SE, Catanese DJ, Jr, Tan XX, et al. Hox transcription factor ultrabithorax Ib physically and genetically interacts with disconnected interacting protein 1, a double-stranded RNA-binding protein. J Biol Chem. 2004;279(25):26433–26444. doi: 10.1074/jbc.M312842200.
- 39. Rain JC, Selig L, De RH, et al. The protein-protein interaction map of Helicobacter pylori. Nature. 2001;409(6817):211–215. doi: 10.1038/35051615.
- 40. Culverhouse R, Suarez BK, Lin J, et al. A perspective on epistasis: limits of models displaying no main effect. Am J Hum Genet. 2002;70(2):461–471. doi: 10.1086/338759.
- 41. Lay-Son G, Puga A, Astudillo P, et al. Cystic fibrosis in Chilean patients: Analysis of 36 common CFTR gene mutations. J Cyst Fibros. 2010 doi: 10.1016/j.jcf.2010.10.002.
- 42. Bibi Z. Role of cytochrome P450 in drug interactions. Nutr Metab (Lond) 2008;5:27. doi: 10.1186/1743-7075-5-27. [Retracted]
- 43. Ekbal NJ, Holt DW, Macphee IA. Pharmacogenetics of immunosuppressive drugs: prospect of individual therapy for transplant patients. Pharmacogenomics. 2008;9(5):585–596. doi: 10.2217/14622416.9.5.585.
- 44. Haas DW, Ribaudo HJ, Kim RB, et al. Pharmacogenetics of efavirenz and central nervous system side effects: an Adult AIDS Clinical Trials Group study. AIDS. 2004;18:2391–2400.
- 45. Taubert D, Bouman HJ, van Werkum JW. Cytochrome P-450 polymorphisms and response to clopidogrel. N Engl J Med. 2009;360(21):2249–2250. doi: 10.1056/NEJMc090391.
- 46. Mega JL, Close SL, Wiviott SD, et al. Cytochrome p-450 polymorphisms and response to clopidogrel. N Engl J Med. 2009;360(4):354–362. doi: 10.1056/NEJMoa0809171.
- 47. James AH, Britt RP, Raskino CL, et al. Factors affecting the maintenance dose of warfarin. J Clin Pathol. 1992;45(8):704–706. doi: 10.1136/jcp.45.8.704.
- 48. Tan GM, Wu E, Lam YY, et al. Role of warfarin pharmacogenetic testing in clinical practice. Pharmacogenomics. 2010;11(3):439–448. doi: 10.2217/pgs.10.8.
- 49. Aithal GP, Day CP, Kesteven PJ, et al. Association of polymorphisms in the cytochrome P450 CYP2C9 with warfarin dose requirement and risk of bleeding complications. Lancet. 1999;353(9154):717–719. doi: 10.1016/S0140-6736(98)04474-2.
- 50. Limdi NA, Veenstra DL. Expectations, validity, and reality in pharmacogenetics. J Clin Epidemiol. 2010;63(9):960–969. doi: 10.1016/j.jclinepi.2009.09.006.
- 51. Oldenburg J, Bevans CG, Muller CR, et al. Vitamin K epoxide reductase complex subunit 1 (VKORC1): the key protein of the vitamin K cycle. Antioxid Redox Signal. 2006;8(3–4):347–353. doi: 10.1089/ars.2006.8.347.
- 52. Mallal S, Phillips E, Carosi G, et al. HLA-B*5701 screening for hypersensitivity to abacavir. N Engl J Med. 2008;358(6):568–579. doi: 10.1056/NEJMoa0706135.
- 53. The Merck Manuals Online Medical Library. [Accessed November 28, 2010]; Pharmacokinetics. 2010. Available from: http://www.merck.com/mmpe/sec20/ch303/ch303a.html.
- 54. Dahlin MG, Beck OM, Amark PE. Plasma levels of antiepileptic drugs in children on the ketogenic diet. Pediatr Neurol. 2006;35(1):6–10. doi: 10.1016/j.pediatrneurol.2005.11.001.
- 55. Ohara K, Tanabu S, Ishibashi K, et al. CYP2D6*10 alleles do not determine plasma fluvoxamine concentration/dose ratio in Japanese subjects. Eur J Clin Pharmacol. 2003;58(10):659–661. doi: 10.1007/s00228-002-0529-3.
- 56. Singh R, Srivastava A, Kapoor R, et al. Impact of CYP3A5 and CYP3A4 gene polymorphisms on dose requirement of calcineurin inhibitors, cyclosporine and tacrolimus, in renal allograft recipients of North India. Naunyn Schmiedebergs Arch Pharmacol. 2009;380(2):169–177. doi: 10.1007/s00210-009-0415-y.
- 57. The pharmacokinetics working group of the AGAH. Collection of terms, symbols, equations, and explanations of common pharmacokinetic and pharmacodynamic parameters and some statistical functions. [Accessed November 28, 2010]; 2004. Available from: http://www.agah.info/uploads/media/PK-glossary_PK_working_group_2004_01.pdf.
- 58. The Merck Manuals Online Medical Library. [Accessed November 28, 2010]; Pharmacodynamics. 2010. Available from: http://www.merck.com/mmpe/sec20/ch304/ch304a.html.
- 59. Majumder PP, Ghosh S. Mapping quantitative trait loci in humans: achievements and limitations. J Clin Invest. 2005;115(6):1419–1424. doi: 10.1172/JCI24757.
- 60. Ghosh S. Genome-wide association analyses of quantitative traits: the GAW16 experience. Genet Epidemiol. 2009;33(Suppl 1):S13–S18. doi: 10.1002/gepi.20466.
- 61. Turner SD, Crawford DC, Ritchie MD. Methods for optimizing statistical analyses in pharmacogenomics research. Expert Rev Clin Pharmacol. 2009;2(5):559–570. doi: 10.1586/ecp.09.32.
- 62. Fredrickson DS, Lees RS. A System For Phenotyping Hyperlipoproteinemia. Circulation. 1965;31:321–327. doi: 10.1161/01.cir.31.3.321.
- 63. Crum-Cianflone NF, Grandits G, Echols S, et al. Trends and causes of hospitalizations among HIV-infected persons during the late HAART era: what is the impact of CD4 counts and HAART use? J Acquir Immune Defic Syndr. 2010;54(3):248–257. doi: 10.1097/qai.0b013e3181c8ef22.
- 64. Ginsburg GS, Voora D. The long and winding road to warfarin pharmacogenetic testing. J Am Coll Cardiol. 2010;55(25):2813–2815. doi: 10.1016/j.jacc.2010.04.006.
- 65. Kiyotani K, Mushiroda T, Hosono N, et al. Lessons for pharmacogenomics studies: association study between CYP2D6 genotype and tamoxifen response. Pharmacogenet Genomics. 2010;20(9):565–568. doi: 10.1097/FPC.0b013e32833af231.
- 66. Chen MH, Tzeng CH, Chen PM, et al. VEGF −460T --> C polymorphism and its association with VEGF expression and outcome to FOLFOX-4 treatment in patients with colorectal carcinoma. Pharmacogenomics J. 2010 doi: 10.1038/tpj.2010.48.
- 67. Kuritzkes DR. Preventing and managing antiretroviral drug resistance. AIDS Patient Care STDS. 2004;18(5):259–273. doi: 10.1089/108729104323076007.
- 68. Jiang H, Fine JP. Survival analysis. Methods Mol Biol. 2007;404:303–318. doi: 10.1007/978-1-59745-530-5_15.
- 69. Schlesselman JJ. Case-Control Studies: Design, Conduct, Analysis. New York/Oxford: Oxford University Press; 1982.
- 70. Balding DJ. A tutorial on statistical methods for population association studies. Nat Rev Genet. 2006;7(10):781–791. doi: 10.1038/nrg1916.
- 71. Manolio TA, Bailey-Wilson JE, Collins FS. Genes, environment and the value of prospective cohort studies. Nat Rev Genet. 2006;7(10):812–820. doi: 10.1038/nrg1919.
- 72. Thomas DC. Statistical Methods in Genetic Epidemiology. New York, NY: Oxford University Press, Inc.; 2004.
- 73. Little J, Sharp L, Khoury MJ, et al. The epidemiologic approach to pharmacogenomics. Am J Pharmacogenomics. 2005;5(1):1–20. doi: 10.2165/00129785-200505010-00001.
- 74. Stolberg HO, Norman G, Trop I. Randomized controlled trials. AJR Am J Roentgenol. 2004;183(6):1539–1544. doi: 10.2214/ajr.183.6.01831539.
- 75. Jadad AR. Randomised controlled trials: a user's guide. London, England: BMJ Books; 1998.
- 76. Smith PG, Day NE. The design of case-control studies: the influence of confounding and interaction effects. International Journal of Epidemiology. 1984;13:356–365. doi: 10.1093/ije/13.3.356.
- 77. Zang EA, Wynder EL. Reevaluation of the confounding effect of cigarette smoking on the relationship between alcohol use and lung cancer risk, with larynx cancer used as a positive control. Prev Med. 2001;32(4):359–370. doi: 10.1006/pmed.2000.0818.
- 78. Hoggart CJ, Parra EJ, Shriver MD, et al. Control of confounding of genetic associations in stratified populations. Am J Hum Genet. 2003;72(6):1492–1504. doi: 10.1086/375613.
- 79. Black N. Why we need observational studies to evaluate the effectiveness of health care. BMJ. 1996;312(7040):1215–1218. doi: 10.1136/bmj.312.7040.1215.
- 80. Sanson-Fisher RW, Bonevski B, Green LW, et al. Limitations of the randomized controlled trial in evaluating population-based health interventions. Am J Prev Med. 2007;33(2):155–161. doi: 10.1016/j.amepre.2007.04.007.
- 81. Haas DW. Human genetic variability and HIV treatment response. Curr HIV/AIDS Rep. 2006;3(2):53–58. doi: 10.1007/s11904-006-0018-x.
- 82. Khoury MJ, Flanders WD. Nontraditional epidemiologic approaches in the analysis of gene-environment interaction: case-control studies with no controls! Am J Epidemiol. 1996;144(3):207–213. doi: 10.1093/oxfordjournals.aje.a008915.
- 83. VanderWeele TJ, Hernandez-Diaz S, Hernan MA. Case-only gene-environment interaction studies: when does association imply mechanistic interaction? Genet Epidemiol. 2010;34(4):327–334. doi: 10.1002/gepi.20484.
- 84. McCarthy MI, Abecasis GR, Cardon LR, et al. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet. 2008;9(5):356–369. doi: 10.1038/nrg2344.
- 85. Tishkoff SA, Verrelli BC. Patterns of human genetic diversity: implications for human evolutionary history and disease. Annu Rev Genomics Hum Genet. 2003;4:293–340. doi: 10.1146/annurev.genom.4.070802.110226.
- 86. Hindorff LA, Junkins HA, Hall PN, et al. A Catalog of Published Genome-Wide Association Studies. [Accessed November 28, 2010]; 2010. Available from: www.genome.gov/gwastudies.
- 87. Niculescu AB, Le-Niculescu H. The P-value illusion: how to improve (psychiatric) genetic studies. Am J Med Genet B Neuropsychiatr Genet. 2010;153B(4):847–849. doi: 10.1002/ajmg.b.31076.
- 88. Lublin FD, Reingold SC. Defining the clinical course of multiple sclerosis: results of an international survey. National Multiple Sclerosis Society (USA) Advisory Committee on Clinical Trials of New Agents in Multiple Sclerosis. Neurology. 1996;46(4):907–911. doi: 10.1212/wnl.46.4.907.
- 89.Hoffjan S, Akkad DA. The genetics of multiple sclerosis: An update 2010. Mol Cell Probes. 2010;24(5):237–243. doi: 10.1016/j.mcp.2010.04.006. [DOI] [PubMed] [Google Scholar]
- 90.McClellan J, King MC. Genetic heterogeneity in human disease. Cell. 2010;141(2):210–217. doi: 10.1016/j.cell.2010.03.032. [DOI] [PubMed] [Google Scholar]
- 91.Pilgrim JL, Gerostamoulos D, Drummer OH. Review: Pharmacogenetic aspects of the effect of cytochrome P450 polymorphisms on serotonergic drug metabolism, response, interactions, and adverse effects. Forensic Sci Med Pathol. 2010 doi: 10.1007/s12024-010-9188-3. [DOI] [PubMed] [Google Scholar]
- 92.Yap TA, Sandhu SK, Workman P, et al. Envisioning the future of early anticancer drug development. Nat Rev Cancer. 2010;10(7):514–523. doi: 10.1038/nrc2870. [DOI] [PubMed] [Google Scholar]
- 93.Chanock SJ, Manolio T, Boehnke M, et al. Replicating genotype-phenotype associations. Nature. 2007;447(7145):655–660. doi: 10.1038/447655a. [DOI] [PubMed] [Google Scholar]
- 94.Burmester JK, Sedova M, Shapero MH, et al. DMET microarray technology for pharmacogenomics-based personalized medicine. Methods Mol Biol. 2010;632:99–124. doi: 10.1007/978-1-60761-663-4_7. [DOI] [PubMed] [Google Scholar]
- 95.Talmud PJ, Drenos F, Shah S, et al. Gene-centric association signals for lipids and apolipoproteins identified via the HumanCVD BeadChip. Am J Hum Genet. 2009;85(5):628–642. doi: 10.1016/j.ajhg.2009.10.014. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 96.Roeder K, Bacanu SA, Wasserman L, et al. Using linkage genome scans to improve power of association in genome scans. Am J Hum Genet. 2006;78(2):243–252. doi: 10.1086/500026. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 97.Saccone SF, Bierut LJ, Chesler EJ, et al. Supplementing high-density SNP microarrays for additional coverage of disease-related genes: addiction as a paradigm. PLoS One. 2009;4(4):e5225. doi: 10.1371/journal.pone.0005225. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 98.Sissung TM, English BC, Venzon D, et al. Clinical pharmacology and pharmacogenetics in a genomics era: the DMET platform. Pharmacogenomics. 2010;11(1):89–103. doi: 10.2217/pgs.09.154. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 99.Chuang LY, Yang CS, Ho CH, et al. Tag SNP selection using particle swarm optimization. Biotechnol Prog. 2010;26(2):580–588. doi: 10.1002/btpr.350. [DOI] [PubMed] [Google Scholar]
- 100.Han B, Kang HM, Seo MS, et al. Efficient association study design via power-optimized tag SNP selection. Ann Hum Genet. 2008;72(Pt 6):834–847. doi: 10.1111/j.1469-1809.2008.00469.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 101.Li J. Prioritize and select SNPs for association studies with multi-stage designs. J Comput Biol. 2008;15(3):241–257. doi: 10.1089/cmb.2007.0090. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 102.Pico AR, Smirnov IV, Chang JS, et al. SNPLogic: an interactive single nucleotide polymorphism selection, annotation, and prioritization system. Nucleic Acids Res. 2009;37(Database issue):D803–D809. doi: 10.1093/nar/gkn756. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 103.Tabor HK, Risch NJ, Myers RM. Candidate-gene approaches for studying complex genetic traits: practical considerations. Nat Rev Genet. 2002;3(5):391–397. doi: 10.1038/nrg796. [DOI] [PubMed] [Google Scholar]
- 104.Allegue C, Gil R, Sanchez-Diz P, et al. A new approach to long QT syndrome mutation detection by Sequenom MassARRAY system. Electrophoresis. 2010;31(10):1648–1655. doi: 10.1002/elps.201000022. [DOI] [PubMed] [Google Scholar]
- 105.Oberst RD, Hays MP, Bohra LK, et al. PCR-based DNA amplification and presumptive detection of Escherichia coli O157:H7 with an internal fluorogenic probe and the 5' nuclease (TaqMan) assay. Appl Environ Microbiol. 1998;64(9):3389–3396. doi: 10.1128/aem.64.9.3389-3396.1998. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 106.Steemers FJ, Gunderson KL. Illumina, Inc. Pharmacogenomics. 2005;6(7):777–782. doi: 10.2217/14622416.6.7.777. [DOI] [PubMed] [Google Scholar]
- 107.Macphee IA, Fredericks S, Tai T, et al. The influence of pharmacogenetics on the time to achieve target tacrolimus concentrations after kidney transplantation. Am J Transplant. 2004;4(6):914–919. doi: 10.1111/j.1600-6143.2004.00435.x. [DOI] [PubMed] [Google Scholar]
- 108.Rebbeck TR, Jaffe JM, Walker AH, et al. Modification of clinical presentation of prostate tumors by a novel genetic variant in CYP3A4. J Natl Cancer Inst. 1998;90(16):1225–1229. doi: 10.1093/jnci/90.16.1225. [DOI] [PubMed] [Google Scholar]
- 109.St George-Hyslop P. Molecular genetics of Alzheimer's disease. Biol Psychiatry. 2000;47(3):183–199. doi: 10.1016/s0006-3223(99)00301-7. [DOI] [PubMed] [Google Scholar]
- 110.Finishing the euchromatic sequence of the human genome. Nature. 2004;431(7011):931–945. doi: 10.1038/nature03001. [DOI] [PubMed] [Google Scholar]
- 111.Pushkarev D, Neff NF, Quake SR. Single-molecule sequencing of an individual human genome. Nat Biotechnol. 2009;27(9):847–852. doi: 10.1038/nbt.1561. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 112.Kedes L, Liu ET. The Archon Genomics X PRIZE for whole human genome sequencing. Nat Genet. 2010;42(11):917–918. doi: 10.1038/ng1110-917. [DOI] [PubMed] [Google Scholar]
- 113.Mardis ER. Anticipating the 1,000 dollar genome. Genome Biol. 2006;7(7):112. doi: 10.1186/gb-2006-7-7-112. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 114.Bonetta L. Whole-genome sequencing breaks the cost barrier. Cell. 2010;141(6):917–919. doi: 10.1016/j.cell.2010.05.034. [DOI] [PubMed] [Google Scholar]
- 115.Choi M, Scholl UI, Ji W, et al. Genetic diagnosis by whole exome capture and massively parallel DNA sequencing. Proc Natl Acad Sci U S A. 2009;106(45):19096–19101. doi: 10.1073/pnas.0910672106. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 116.Ng SB, Turner EH, Robertson PD, et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature. 2009;461(7261):272–276. doi: 10.1038/nature08250. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 117.Shendure J, Ji H. Next-generation DNA sequencing. Nat Biotechnol. 2008;26(10):1135–1145. doi: 10.1038/nbt1486. [DOI] [PubMed] [Google Scholar]
- 118.Hosseini P, Tremblay A, Matthews BF, et al. An efficient annotation and gene-expression derivation tool for Illumina Solexa datasets. BMC Res Notes. 2010;3:183. doi: 10.1186/1756-0500-3-183. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 119.Koboldt DC. Challenges of sequencing human genomes. Brief Bioinform. 2010;11(5):484–498. doi: 10.1093/bib/bbq016. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 120.Korbel JO, Urban AE, Grubert F, et al. Systematic prediction and validation of breakpoints associated with copy-number variants in the human genome. Proc Natl Acad Sci U S A. 2007;104(24):10110–10115. doi: 10.1073/pnas.0703834104. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 121.Weale ME. Quality control for genome-wide association studies. Methods Mol Biol. 2010;628:341–372. doi: 10.1007/978-1-60327-367-1_19. [DOI] [PubMed] [Google Scholar]
- 122.Laurie CC, Doheny KF, Mirel DB, et al. Quality control and quality assurance in genotypic data for genome-wide association studies. Genet Epidemiol. 2010;34(6):591–602. doi: 10.1002/gepi.20516. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 123.Turner SD, Armstrong LL, Bradford Y, et al. Quality Control Procedures for Genome Wide Association Studies. Current Protocols. 2010 doi: 10.1002/cpz1.603. In press. [DOI] [PubMed] [Google Scholar]
- 124.Purcell S, Neale B, Todd-Brown K, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81(3):559–575. doi: 10.1086/519795. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 125.Grady BJ, Torstenson E, Dudek SM, et al. Finding unique filter sets in plato: a precursor to efficient interaction analysis in gwas data. Pac Symp Biocomput. 2010:315–326. [PMC free article] [PubMed] [Google Scholar]
- 126.Aulchenko YS, Ripke S, Isaacs A, et al. GenABEL: an R library for genome-wide association analysis. Bioinformatics. 2007;23(10):1294–1296. doi: 10.1093/bioinformatics/btm108. [DOI] [PubMed] [Google Scholar]
- 127.Thornton T, McPeek MS. Case-control association testing with related individuals: a more powerful quasi-likelihood score test. Am J Hum Genet. 2007;81(2):321–337. doi: 10.1086/519497. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 128.Aulchenko YS, Struchalin MV, van Duijn CM. ProbABEL package for genome-wide association analysis of imputed data. BMC Bioinformatics. 2010;11:134. doi: 10.1186/1471-2105-11-134. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 129.Voight BF, Pritchard JK. Confounding from cryptic relatedness in case-control association studies. PLoS Genet. 2005;1(3):e32. doi: 10.1371/journal.pgen.0010032. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 130.Wittke-Thompson JK, Pluzhnikov A, Cox NJ. Rational inferences about departures from Hardy-Weinberg equilibrium. Am J Hum Genet. 2005;76(6):967–986. doi: 10.1086/430507. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 131.Hardy GH. Mendelian Proportions in a Mixed Population. Science. 1908;28(706):49–50. doi: 10.1126/science.28.706.49. [DOI] [PubMed] [Google Scholar]
- 132.Edenberg HJ, Liu Y. Laboratory methods for high-throughput genotyping. Cold Spring Harb Protoc. 2009;2009(11) doi: 10.1101/pdb.top62. db. [DOI] [PubMed] [Google Scholar]
- 133.Pluzhnikov A, Below JE, Konkashbaev A, et al. Spoiling the whole bunch: quality control aimed at preserving the integrity of high-throughput genotyping. Am J Hum Genet. 2010;87(1):123–128. doi: 10.1016/j.ajhg.2010.06.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 134.Leek JT, Scharpf RB, Bravo HC, et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet. 2010 doi: 10.1038/nrg2825. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 135.Sebastiani P, Solovieff N, Puca A, et al. Genetic Signatures of Exceptional Longevity in Humans. Science. 2010 doi: 10.1126/science.1190532. [DOI] [PubMed] [Google Scholar]
- 136.Alberts B. Editorial expression of concern. Science. 2010;330(6006):912. doi: 10.1126/science.330.6006.912-b. [DOI] [PubMed] [Google Scholar]
- 137.23andMe. SNPwatch: Uncertainty Surrounds Longevity GWAS. [Accessed December 1, 2010];2010 Available from: http://spittoon.23andme.com/2010/07/07/snpwatch-uncertainty-surrounds-longevity-gwas/ [Google Scholar]
- 138.Tian C, Gregersen PK, Seldin MF. Accounting for ancestry: population substructure and genome-wide association studies. Hum Mol Genet. 2008;17(R2):R143–R150. doi: 10.1093/hmg/ddn268. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 139.Pritchard JK, Rosenberg NA. Use of unlinked genetic markers to detect population stratification in association studies. Am J Hum Genet. 1999;65(1):220–228. doi: 10.1086/302449. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 140.Montana G, Pritchard JK. Statistical tests for admixture mapping with case-control and cases-only data. Am J Hum Genet. 2004;75(5):771–789. doi: 10.1086/425281. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 141.Ribaudo HJ, Liu H, Schwab M, et al. Effect of CYP2B6, ABCB1, and CYP3A5 Polymorphisms on Efavirenz Pharmacokinetics and Treatment Response: An AIDS Clinical Trials Group Study. J Infect Dis. 2010;202(5):717–722. doi: 10.1086/655470. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 142.Mancinelli LM, Frassetto L, Floren LC, et al. The pharmacokinetics and metabolic disposition of tacrolimus: a comparison across ethnic groups. Clin Pharmacol Ther. 2001;69(1):24–31. doi: 10.1067/mcp.2001.113183. [DOI] [PubMed] [Google Scholar]
- 143.Cavallari LH, Langaee TY, Momary KM, et al. Genetic and clinical predictors of warfarin dose requirements in African Americans. Clin Pharmacol Ther. 2010;87(4):459–464. doi: 10.1038/clpt.2009.223. [DOI] [PubMed] [Google Scholar]
- 144.Epstein MP, Allen AS, Satten GA. A simple and improved correction for population stratification in case-control studies. Am J Hum Genet. 2007;80(5):921–930. doi: 10.1086/516842. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 145.Devlin B, Roeder K. Genomic control for association studies. Biometrics. 1999;55(4):997–1004. doi: 10.1111/j.0006-341x.1999.00997.x. [DOI] [PubMed] [Google Scholar]
- 146.Price AL, Patterson NJ, Plenge RM, et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006;38(8):904–909. doi: 10.1038/ng1847. [DOI] [PubMed] [Google Scholar]
- 147.Pritchard JK, Stephens M, Rosenberg NA, et al. Association mapping in structured populations. Am J Hum Genet. 2000;67(1):170–181. doi: 10.1086/302959. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 148.Satten GA, Flanders WD, Yang Q. Accounting for unmeasured population substructure in case-control studies of genetic association using a novel latent-class model. Am J Hum Genet. 2001;68(2):466–477. doi: 10.1086/318195. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 149.Barnholtz-Sloan JS, Chakraborty R, Sellers TA, et al. Examining population stratification via individual ancestry estimates versus self-reported race. Cancer Epidemiol Biomarkers Prev. 2005;14(6):1545–1551. doi: 10.1158/1055-9965.EPI-04-0832. [DOI] [PubMed] [Google Scholar]
- 150.Tian C, Kosoy R, Nassir R, et al. European population genetic substructure: further definition of ancestry informative markers for distinguishing among diverse European ethnic groups. Mol Med. 2009;15(11–12):371–383. doi: 10.2119/molmed.2009.00094. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 151.Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000;155(2):945–959. doi: 10.1093/genetics/155.2.945. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 152.Cantor RM, Lange K, Sinsheimer JS. Prioritizing GWAS results: A review of statistical methods and recommendations for their application. Am J Hum Genet. 2010;86(1):6–22. doi: 10.1016/j.ajhg.2009.11.017. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 153.Greenwood PE, Nikulin MS. A guide to chi-squared testing. New York: Wiley; 1996. [Google Scholar]
- 154.Zheng HX, Webber SA, Zeevi A, et al. The impact of pharmacogenomic factors on steroid dependency in pediatric heart transplant patients using logistic regression analysis. Pediatr Transplant. 2004;8(6):551–557. doi: 10.1111/j.1399-3046.2004.00223.x. [DOI] [PubMed] [Google Scholar]
- 155.Armitage P. Tests for linear trends in proportions and frequencies. Biometrics. 1955;11:375–386. [Google Scholar]
- 156.Cree BA, Rioux JD, McCauley JL, et al. A major histocompatibility Class I locus contributes to multiple sclerosis susceptibility independently from HLA-DRB1*15:01. PLoS One. 2010;5(6):e11296. doi: 10.1371/journal.pone.0011296. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 157.Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Amer Statist Assn. 1958;(53):457–481. [Google Scholar]
- 158.Huang SW, Chen HS, Wang XQ, et al. Validation of VKORC1 and CYP2C9 genotypes on interindividual warfarin maintenance dose: a prospective study in Chinese patients. Pharmacogenet Genomics. 2009;19(3):226–234. doi: 10.1097/FPC.0b013e328326e0c7. [DOI] [PubMed] [Google Scholar]
- 159.Stephens M, Balding DJ. Bayesian statistical methods for genetic association studies. Nat Rev Genet. 2009;10(10):681–690. doi: 10.1038/nrg2615. [DOI] [PubMed] [Google Scholar]
- 160.Coassin S, Brandstatter A, Kronenberg F. Lost in the space of bioinformatic tools: a constantly updated survival guide for genetic epidemiology. The GenEpi Toolbox. Atherosclerosis. 2010;209(2):321–335. doi: 10.1016/j.atherosclerosis.2009.10.026. [DOI] [PubMed] [Google Scholar]
- 161.Woodahl EL, Hingorani SR, Wang J, et al. Pharmacogenomic associations in ABCB1 and CYP3A5 with acute kidney injury and chronic kidney disease after myeloablative hematopoietic cell transplantation. Pharmacogenomics J. 2008;8(4):248–255. doi: 10.1038/sj.tpj.6500472. [DOI] [PubMed] [Google Scholar]
- 162.Fu SJ, Wang YB, Yu LX, et al. Factors responsible for inter-individual variations in dosage/concentration of tacrolimus in renal transplant recipients. Nan Fang Yi Ke Da Xue Xue Bao. 2008;28(12):2161–2164. [PubMed] [Google Scholar]
- 163.Epstein RS, Moyer TP, Aubert RE, et al. Warfarin genotyping reduces hospitalization rates results from the MM-WES (Medco-Mayo Warfarin Effectiveness study) J Am Coll Cardiol. 2010;55(25):2804–2812. doi: 10.1016/j.jacc.2010.03.009. [DOI] [PubMed] [Google Scholar]
- 164.Aidoo M, McElroy PD, Kolczak MS, et al. Tumor necrosis factor-alpha promoter variant 2 (TNF2) is associated with pre-term delivery, infant mortality, and malaria morbidity in western Kenya: Asembo Bay Cohort Project IX. Genet Epidemiol. 2001;21(3):201–211. doi: 10.1002/gepi.1029. [DOI] [PubMed] [Google Scholar]
- 165.Nelder J, Wedderburn R. Generalized Linear Models. Journal of the Royal Statistical Society. 2010;135(3):370–384. [Google Scholar]
- 166.Kirkwood B, Sterne J. Essentials of Medical Statistics. Malden, MA: Wiley-Blackwell; 2001. [Google Scholar]
- 167.Huber PJ. Robust Statistics. Wiley; 1981. [Google Scholar]
- 168.Agresti A. Categorical Data Analysis. New York: John Wiley & Sons; 1990. [Google Scholar]
- 169.Slinker BK, Glantz SA. Multiple linear regression is a useful alternative to traditional analyses of variance. Am J Physiol. 1988;255(3 Pt 2):R353–R367. doi: 10.1152/ajpregu.1988.255.3.R353. [DOI] [PubMed] [Google Scholar]
- 170.Li W, Reich J. A complete enumeration and classification of two-locus disease models. 2000 doi: 10.1159/000022939. [DOI] [PubMed] [Google Scholar]
- 171.Hosmer DW, Lemeshow S. Applied Logistic Regression. New York: John Wiley & Sons Inc.; 2000. [Google Scholar]
- 172.Maxwell SE, Delaney HD. Designing Experiments and Analyzing Data. New Jersey: 1990. [Google Scholar]
- 173.Cohen J, Cohen P, West SG, et al. Applied Multiple Regression and Correlation Analysis for the Behavioral Sciences. New Jersey: Lawrence Earlbaum Associates; 2003. [Google Scholar]
- 174.Ohno-Machado L. Modeling medical prognosis: survival analysis techniques. J Biomed Inform. 2001;34(6):428–439. doi: 10.1006/jbin.2002.1038. [DOI] [PubMed] [Google Scholar]
- 175.Smith M, Kohn R. Nonparametric regression using Bayesian variable selection. Journal of Econometrics. 1996;75(2):317–343. [Google Scholar]
- 176.Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer-Verlag; 2001. [Google Scholar]
- 177.Motsinger AA, Ritchie MD, Shafer RW, et al. Multilocus genetic interactions and response to efavirenz-containing regimens: an adult AIDS clinical trials group study. Pharmacogenet Genomics. 2006;16(11):837–845. doi: 10.1097/01.fpc.0000230413.97596.fa. [DOI] [PubMed] [Google Scholar]
- 178.Zanella F, Lorens JB, Link W. High content screening: seeing is believing. Trends Biotechnol. 2010;28(5):237–245. doi: 10.1016/j.tibtech.2010.02.005. [DOI] [PubMed] [Google Scholar]
- 179.Anderson GD. Pharmacogenetics and enzyme induction/inhibition properties of antiepileptic drugs. Neurology. 2004;63(10 Suppl 4):S3–S8. doi: 10.1212/wnl.63.10_suppl_4.s3. [DOI] [PubMed] [Google Scholar]
- 180.Iyer L, King CD, Whitington PF, et al. Genetic predisposition to the metabolism of irinotecan (CPT-11). Role of uridine diphosphate glucuronosyltransferase isoform 1A1 in the glucuronidation of its active metabolite (SN-38) in human liver microsomes. J Clin Invest. 1998;101(4):847–854. doi: 10.1172/JCI915. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 181.Levy G, Weber W. Interindividual Variability in Human Drug Metabolism. New York: Taylor and Francis; 2001. [Google Scholar]
- 182.Cordell HJ. Detecting gene-gene interactions that underlie human diseases. Nat Rev Genet. 2009;10(6):392–404. doi: 10.1038/nrg2579. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 183.Motsinger-Reif AA, Reif DM, Fanelli TJ, et al. A comparison of analytical methods for genetic association studies. Genet Epidemiol. 2008;32(8):767–778. doi: 10.1002/gepi.20345. [DOI] [PubMed] [Google Scholar]
- 184.Motsinger AA, Ritchie MD, Reif DM. Novel methods for detecting epistasis in pharmacogenomics studies. Pharmacogenomics. 2007;8(9):1229–1241. doi: 10.2217/14622416.8.9.1229. [DOI] [PubMed] [Google Scholar]
- 185.Motsinger AA, Ritchie MD. Multifactor dimensionality reduction: an analysis strategy for modelling and detecting gene-gene interactions in human genetics and pharmacogenomics studies. Hum Genomics. 2006;2(5):318–328. doi: 10.1186/1479-7364-2-5-318. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 186.Hahn LW, Ritchie MD, Moore JH. Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions. Bioinformatics. 2003;19(3):376–382. doi: 10.1093/bioinformatics/btf869. [DOI] [PubMed] [Google Scholar]
- 187.Ritchie MD, Hahn LW, Roodi N, et al. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet. 2001;69(1):138–147. doi: 10.1086/321276. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 188.Cho YM, Ritchie MD, Moore JH, et al. Multifactor-dimensionality reduction shows a two-locus interaction associated with Type 2 diabetes mellitus. Diabetologia. 2004;47(3):549–554. doi: 10.1007/s00125-003-1321-3. [DOI] [PubMed] [Google Scholar]
- 189.Coffey CS, Hebert PR, Ritchie MD, et al. An application of conditional logistic regression and multifactor dimensionality reduction for detecting gene-gene interactions on risk of myocardial infarction: the importance of model validation. BMC Bioinformatics. 2004;5:49. doi: 10.1186/1471-2105-5-49. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 190.Cattaert T, Urrea V, Naj AC, et al. FAM-MDR: a flexible family-based multifactor dimensionality reduction technique to detect epistasis using related individuals. PLoS One. 2010;5(4):e10304. doi: 10.1371/journal.pone.0010304. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 191.Breiman L, Friedman JH, Olshen RA, et al. Classification and Regression Trees. New York: Chapman & Hall; 1984. [Google Scholar]
- 192.Schwarz DF, Konig IR, Ziegler A. On safari to Random Jungle: a fast implementation of Random Forests for high-dimensional data. Bioinformatics. 2010;26(14):1752–1758. doi: 10.1093/bioinformatics/btq257. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 193.Motsinger AA, Lee SL, Mellick G, et al. GPNN: power studies and applications of a neural network method for detecting gene-gene interactions in studies of human disease. BMC Bioinformatics. 2006;7:39. doi: 10.1186/1471-2105-7-39. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 194.Motsinger AA, Reif DM, Dudek SM, et al. Understanding the Evolutionary Process of Grammatical Evolution Neural Networks for Feature Selection in Genetic Epidemiology. Proc IEEE Symp Comput Intell Bioinforma Comput Biol. 2006;2006:1–8. doi: 10.1109/CIBCB.2006.330945. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 195.Ritchie MD, White BC, Parker JS, et al. Optimization of neural network architecture using genetic programming improves detection and modeling of gene-gene interactions in studies of human diseases. BMC Bioinformatics. 2003;4(1):28. doi: 10.1186/1471-2105-4-28. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 196.Turner SD, Dudek SM, Ritchie MD. Grammatical Evolution of Neural Networks for Discovering Epistasis among Quantitative Trait Loci. Lect Notes Comput Sci. 2010;6023:86–97. doi: 10.1186/1756-0381-3-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 197.Friedman JH. Multivariate Adaptive Regression Splines. The Annals of Statistics. 1991;19(1):1–67. [Google Scholar]
- 198.Kooperberg C, Ruczinski I, LeBlanc ML, et al. Sequence analysis using logic regression. Genet Epidemiol. 2001;21(Suppl 1):S626–S631. doi: 10.1002/gepi.2001.21.s1.s626. [DOI] [PubMed] [Google Scholar]
- 199.Kooperberg C, Ruczinski I. Identifying interacting SNPs using Monte Carlo logic regression. Genet Epidemiol. 2005;28(2):157–170. doi: 10.1002/gepi.20042. [DOI] [PubMed] [Google Scholar]
- 200.Breiman L. Random Forests. Machine Learning. 2001;45(1):5–32. [Google Scholar]
- 201.Ritchie MD, Coffey CSMJH. Genetic programming neural networks: A bioinformatics tool for human genetics. Lecture Notes in Computer Science. 2004;3102:438–448. [Google Scholar]
- 202.Ayers KL, Cordell HJ. SNP Selection in genome-wide and candidate gene studies via penalized logistic regression. Genet Epidemiol. 2010;34(8):879–891. doi: 10.1002/gepi.20543. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 203.Bush WS, Dudek SM, Ritchie MD. Biofilter: a knowledge-integration system for the multi-locus analysis of genome-wide association studies. Pac Symp Biocomput. 2009:368–379. [PMC free article] [PubMed] [Google Scholar]
- 204.Chan EK, Hawken R, Reverter A. The combined effect of SNP-marker and phenotype attributes in genome-wide association studies. Anim Genet. 2009;40(2):149–156. doi: 10.1111/j.1365-2052.2008.01816.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 205.Fong C, Ko DC, Wasnick M, et al. GWAS analyzer: integrating genotype, phenotype and public annotation data for genome-wide association study analysis. Bioinformatics. 2010;26(4):560–564. doi: 10.1093/bioinformatics/btp714. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 206.Greene CS, Penrod NM, Kiralis J, et al. Spatially Uniform ReliefF (SURF) for computationally-efficient filtering of gene-gene interactions. BioData Min. 2009;2(1):5. doi: 10.1186/1756-0381-2-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 207.Marchini J, Donnelly P, Cardon LR. Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat Genet. 2005;37(4):413–417. doi: 10.1038/ng1537. [DOI] [PubMed] [Google Scholar]
- 208.Greene CS, Penrod NM, Williams SM, et al. Failure to replicate a genetic association may provide important clues about genetic architecture. PLoS One. 2009;4(6):e5639. doi: 10.1371/journal.pone.0005639. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 209.Fisher RA. Statistical Methods and Scientific Inference. New York: Hafner; 1956. [Google Scholar]
- 210.Abdi H. Bonferroni and Sidak corrections for multiple comparisons. Thousand Oaks, CA: Sage; 2007. [Google Scholar]
- 211.Rice TK, Schork NJ, Rao DC. Methods for handling multiple testing. Adv Genet. 2008;60:293–308. doi: 10.1016/S0065-2660(07)00412-9. [DOI] [PubMed] [Google Scholar]
- 212.Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. 1995;57(1):289–300. [Google Scholar]
- 213.Ioannidis JP, Ntzani EE, Trikalinos TA, et al. Replication validity of genetic association studies. Nat Genet. 2001;29(3):306–309. doi: 10.1038/ng749. [DOI] [PubMed] [Google Scholar]
- 214.Ioannidis JP. Non-replication and inconsistency in the genome-wide association setting. Hum Hered. 2007;64(4):203–213. doi: 10.1159/000103512. [DOI] [PubMed] [Google Scholar]
- 215.Wasserman NF, Aneas I, Nobrega MA. An 8q24 gene desert variant associated with prostate cancer risk confers differential in vivo activity to a MYC enhancer. Genome Res. 2010;20(9):1191–1197. doi: 10.1101/gr.105361.110. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 216.Glinskii AB, Ma J, Ma S, et al. Identification of intergenic trans-regulatory RNAs containing a disease-linked SNP sequence and targeting cell cycle progression/differentiation pathways in multiple common human disorders. Cell Cycle. 2009;8(23):3925–3942. doi: 10.4161/cc.8.23.10113. [DOI] [PubMed] [Google Scholar]
- 217.Glinsky GV. SNP-guided microRNA maps (MirMaps) of 16 common human disorders identify a clinically accessible therapy reversing transcriptional aberrations of nuclear import and inflammasome pathways. Cell Cycle. 2008;7(22):3564–3576. doi: 10.4161/cc.7.22.7073. [DOI] [PubMed] [Google Scholar]
- 218.Garzon R, Marcucci G, Croce CM. Targeting microRNAs in cancer: rationale, strategies and challenges. Nat Rev Drug Discov. 2010;9(10):775–789. doi: 10.1038/nrd3179. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 219.Bader AG, Brown D, Winkler M. The promise of microRNA replacement therapy. Cancer Res. 2010;70(18):7027–7030. doi: 10.1158/0008-5472.CAN-10-2010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 220.Mishra PJ, Humeniuk R, Mishra PJ, et al. A miR-24 microRNA binding-site polymorphism in dihydrofolate reductase gene leads to methotrexate resistance. Proc Natl Acad Sci U S A. 2007;104(33):13513–13518. doi: 10.1073/pnas.0706217104. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 221.Ioannidis JP, Thomas G, Daly MJ. Validating, augmenting and refining genome-wide association signals. Nat Rev Genet. 2009;10(5):318–329. doi: 10.1038/nrg2544. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 222.Coffey CS, Hebert PR, Krumholz HM, et al. Reporting of model validation procedures in human studies of genetic interactions. Nutrition. 2004;20(1):69–73. doi: 10.1016/j.nut.2003.09.012. [DOI] [PubMed] [Google Scholar]
- 223.Taniguchi A, Urano W, Tanaka E, et al. Validation of the associations between single nucleotide polymorphisms or haplotypes and responses to disease-modifying antirheumatic drugs in patients with rheumatoid arthritis: a proposal for prospective pharmacogenomic study in clinical practice. Pharmacogenet Genomics. 2007;17(6):383–390. doi: 10.1097/01.fpc.0000236326.80809.b1. [DOI] [PubMed] [Google Scholar]
- 224.Grosse SD, Wordsworth S, Payne K. Economic methods for valuing the outcomes of genetic testing: beyond cost-effectiveness analysis. Genet Med. 2008;10(9):648–654. doi: 10.1097/gim.0b013e3181837217. [DOI] [PubMed] [Google Scholar]
- 225.Langley MR, Booker JK, Evans JP, et al. Validation of clinical testing for warfarin sensitivity: comparison of CYP2C9-VKORC1 genotyping assays and warfarin-dosing algorithms. J Mol Diagn. 2009;11(3):216–225. doi: 10.2353/jmoldx.2009.080123. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 226.Relling MV, Altman RB, Goetz MP, et al. Clinical implementation of pharmacogenomics: overcoming genetic exceptionalism. Lancet Oncol. 2010;11(6):507–509. doi: 10.1016/S1470-2045(10)70097-8. [DOI] [PMC free article] [PubMed] [Google Scholar]