The Genetic Interpretation of Area under the ROC Curve in Genomic Profiling

Naomi R Wray; Jian Yang; Michael E Goddard; Peter M Visscher

doi:10.1371/journal.pgen.1000864

. 2010 Feb 26;6(2):e1000864. doi: 10.1371/journal.pgen.1000864

The Genetic Interpretation of Area under the ROC Curve in Genomic Profiling

Naomi R Wray ^1,^*, Jian Yang ¹, Michael E Goddard ^2,³, Peter M Visscher ¹

Editor: Nicholas J Schork⁴

PMCID: PMC2829056 PMID: 20195508

Abstract

Genome-wide association studies in human populations have facilitated the creation of genomic profiles which combine the effects of many associated genetic variants to predict risk of disease. The area under the receiver operator characteristic (ROC) curve is a well established measure for determining the efficacy of tests in correctly classifying diseased and non-diseased individuals. We use quantitative genetics theory to provide insight into the genetic interpretation of the area under the ROC curve (AUC) when the test classifier is a predictor of genetic risk. Even when the proportion of genetic variance explained by the test is 100%, there is a maximum value for AUC that depends on the genetic epidemiology of the disease, i.e. either the sibling recurrence risk or heritability and disease prevalence. We derive an equation relating maximum AUC to heritability and disease prevalence. The expression can be reversed to calculate the proportion of genetic variance explained given AUC, disease prevalence, and heritability. We use published estimates of disease prevalence and sibling recurrence risk for 17 complex genetic diseases to calculate the proportion of genetic variance that a test must explain to achieve AUC = 0.75; this varied from 0.10 to 0.74. We provide a genetic interpretation of AUC for use with predictors of genetic risk based on genomic profiles. We provide a strategy to estimate proportion of genetic variance explained on the liability scale from estimates of AUC, disease prevalence, and heritability (or sibling recurrence risk) available as an online calculator.

Author Summary

Genome-wide association studies in human populations have facilitated the creation of genomic profiles that combine the effects of many associated genetic variants to predict risk of disease. However, genomic profiles are inherently constrained in their ability to classify diseased from non-diseased individuals dictated by the genetic epidemiology of the disease. In this paper, we use a genetic interpretation to provide insight into the constraints on genomic profiles for risk prediction. We provide a strategy to estimate proportion of genetic variance explained on the liability scale from estimates of AUC, disease prevalence, and heritability available as an online calculator.

Introduction

Genome-wide association studies in human populations have facilitated the creation of genomic profiles which combine the effects of many associated genetic variants to predict risk of disease. Genetic testing has long been available for Mendelian genetic diseases for which variants within one gene are directly responsible for the disease. In contrast, the etiology of complex genetic diseases, such those listed in Table 1, comprises both genetic and environmental risk factors. Results from genome-wide association studies have provided empirical evidence that very few associated genetic variants with effect size greater than odds ratio of 1.5 exist [1],[2]. Reconciliation of these effect sizes with the, often sizeable, estimates of heritability for many complex diseases (Table 1) means that we must expect there to be many (perhaps thousands) of genetic variants underlying complex disease if the effect size of any one variant is very small. It follows that each individual will carry a different, probably unique, portfolio of risk alleles. Whereas common risk variants have size too small to be used individually as risk predictors, profiles based on many associated genetic variants could provide useful predictions of genetic risk [3],[4]. We define genetic risk as the risk of disease given an individual's unique multi-locus genotype; genetic risk remains unchanged throughout an individual's lifetime and so could be predicted at birth prior to exposure to many environmental risk factors. Indeed, such risk predictions could be age specific, for example, risk of type 2 diabetes at 10 years, 20 years or 50 years if genomic profile sets based on empirical data were available for these scenarios which have age-specific genetic epidemiologies. As more variants are identified in the coming years, there will be increasing interest in the prospects of genomic profiling. It has been argued that genomic profiles should be assessed in terms of their clinical validity as diagnostic classifiers [5],[6]. The receiver operator characteristic (ROC) curve [7] is a well established tool for determining the efficacy of clinical diagnostic and prognostic tests in correctly classifying diseased and non-diseased individuals and has been used in the context of genomic profiling e.g., [6],[8],[9]. While the area under the ROC curve (AUC) is an important measure for clinical validity it does not tell the whole story as it does not differentiate between the accuracy with which the genomic profile predicts the true genetic risk of individuals and the accuracy with which true genetic risk predicts disease status, which is not under our control. We believe that the ability to differentiate between these components (i.e. the distinction between prediction of genotype and phenotype) is important for interpretation of the value of the genomic profile, particularly as the use of genomic profiles is very much in its infancy at present. Our knowledge of the genetic epidemiology of a disease means that we can know a priori that genomic profiles might not, on their own, be accurate diagnostic classifiers. For this reason, genomic profiles should judged in the first instance on the basis of their analytic validity [10] as predictors of genetic rather than absolute risk. Of course, in the long term genomic profiles can be combined with environmental risk factors to predict absolute risk in the context of clinical utility. Genomic profiles should improve upon family history which has long been used as a crude estimate of genetic risk (see Text S1).

Table 1. AUC related statistics for complex genetic diseases.

							AUC =	0.75^d
Disease with reference for K and λ_S	K×100	λ_S		AUC_max ^a	AUC_half ^b	AUC_quar ^c		(λ_S _[x] – 1)/(λ_S – 1),
Age related macular degeneration [21],[22]	11.8	2.2	0.68	0.92	0.81	0.72	0.31	0.27
Unipolar disorder [33],[34]	10	1.7	0.39	0.84	0.74	0.67	0.52	0.49
Coronary Artery Disease [35]	5.6	3.2	0.72	0.95	0.84	0.75	0.25	0.18
Breast cancer [36]	3.6	2.5	0.44	0.89	0.79	0.71	0.36	0.29
Type-II diabetes [37]	3	3.5	0.60	0.94	0.84	0.75	0.25	0.18
Prostate cancer [36]	2.4	2.8	0.44	0.90	0.80	0.72	0.33	0.25
Asthma [38]	2	2.6	0.37	0.88	0.79	0.71	0.37	0.29
Lung cancer [36]	1.7	6.1	0.76	0.98	0.89	0.80	0.17	0.09
Colon cancer [36]	1.5	5.1	0.64	0.96	0.87	0.77	0.20	0.12
Bladder Cancer [36]	1	1.7	0.16	0.79	0.71	0.65	0.74	0.70
Stomach cancer [36]	1	6	0.63	0.97	0.88	0.78	0.19	0.10
Bipolar disorder [39]	1	6.8	0.69	0.97	0.89	0.80	0.17	0.08
Bipolar disorder [40]	0.45	7.9	0.60	0.97	0.90	0.80	0.17	0.07
Schizophrenia [25],[41]	0.85	8.6	0.76	0.98	0.90	0.81	0.15	0.07
Schizophrenia [40]	0.4	9	0.63	0.98	0.90	0.80	0.15	0.06
Rheumatoid Arthritis [42]	0.75	8	0.70	0.98	0.90	0.80	0.16	0.07
Type-I diabetes [43]	0.54	13.7	0.86	1.00	0.93	0.84	0.12	0.04
Crohn's disease [44]	0.1	26	0.76	1.00	0.95	0.86	0.10	0.02
Systemic lupus erythematosus [45]	0.03	30	0.64	1.00	0.95	0.86	0.10	0.02

Open in a new tab

a AUC_max is the maximum AUC possible based on the genetic epidemiology parameters of disease prevalence (K) and sibling recurrence risk i.e. (λ_S) when all the known genetic variance is explained by the genomic profile, Inline graphic = 1.

b AUC_half is the AUC possible if the variants included in the genomic profile explain half of the known genetic variance i.e., Inline graphic = 0.5.

c AUC_quar is the AUC possible if the variants included in the genomic profile explain 1/4 of the known genetic variance i.e., Inline graphic = 0.25.

d Inline graphic and (λ_S _[x] – 1)/(λ_S – 1), proportion of sibling risk explained, when the measured AUC for a genomic profile is 0.75.

In this paper, we provide insight into the genetic interpretation of AUC. We begin by considering quantitative traits for which the concepts of accuracy of risk prediction are well developed. For disease traits we differentiate between measures on the observed scale of disease versus the underlying scale of disease risk as we believe recognition of scale of measurement is often overlooked. We define AUC_max as the maximum AUC that could be achieved for a disease when the test classifier is a perfect predictor of genetic risk. We quantify the relationship between AUC_max and heritability of liability and disease prevalence (lifetime morbidity risk). We show how to interpret AUC (which is a measure on the observed disease scale) of a genomic profile as the proportion of variance explained (or accuracy of prediction squared) on the underlying liability scale. Finally, we benchmark the value of genomic profiles by comparing them to the AUC expected when family history resulting from shared genetic risk factors is used as a predictor of genetic risk.

Methods

Background: quantitative traits

For quantitative traits, in which phenotypic scores are (or can be transformed to be) normally distributed, the efficacy of a genomic profile is naturally expressed as the proportion of the genetic variance explained by the profile. The variance in phenotypes, V_P, can be partitioned into variance of genetic values, V_G, so that the proportion of the variance that is genetic is the heritability V_G/V_P. Genomic profiling provides a direct estimate, Inline graphic , of true genetic values, G, for individuals in a population and the efficacy of a genomic profile can be expressed as the proportion of the genetic variance explained by the profile /V_G. We define = /V_G, since in selection theory [11], used in livestock and plant breeding, the correlation between predicted and true genetic risk ( Inline graphic ) is used as the measure of accuracy of prediction, , and if the predictor is unbiased (the regression of G on is 1), . The ratio /V_P is estimated as the R ² from the regression of P on and is interpreted recognising its upper limit to be V_G/V_P or heritability. These measures show that for quantitative traits, the accuracy with which the genomic profile predicts genetic risk is clearly separable from accuracy with which the true genetic risk predicts the phenotype. In contrast, AUC is a measure of the efficacy with which Inline graphic predicts phenotype which, as shown below, has an upper limit constrained by the heritability, and also prevalence, of the disease.

Background: disease traits

For disease traits, the phenotype has two possible values, either affected or not affected. On this observed scale, the directly measurable genetic parameters are those of recurrence risks to relatives, λ_R for relatives of type R, which is the ratio of the prevalence of disease in the relatives of affected individuals (K_R) compared to the prevalence in the population (K),

where cov(X, R) the covariance in disease status between diseased individuals X and their relatives on the observed disease risk scale [12]. For example, when the relatives are monozygous twins (R = MZ), Cov(X,MZ) = Inline graphic the genetic variance, with the subscript “01” denoting the all-or-none disease risk scale. On this scale, the majority of the genetic variance is non-additive, especially when disease prevalence is low [13],[14]. The broad sense heritability on this scale is = (λ_MZ -1)K/(1-K) where λ_MZ is the monozygotic twin recurrence risk, assuming there is no common environmental component to the recurrence risk. Inline graphic is not a normally reported statistic because of its dependence on disease prevalence [15]. If the relatives are siblings (R = S) then λ_S is the sibling risk ratio and Cov(X,S) = [11], where the variance subscripts A and D denote additive and dominance terms, and in combination denote epistatic variance terms. Thus, although λ_S is an estimable quantity, it is not simply related to the genetic variances on the observed binary scale.

The genetic properties of disease are much more easily understood by using the threshold liability model [11], in which risk of disease is transformed to a normally distributed liability scale P ∼N(0, 1) and P = A + E, where A∼N(0, Inline graphic ) are the genetic effects on the liability scale. On this scale the genetic effects combine in an additive way; is the narrow sense heritability on the liability scale (or heritability of liability) and on this scale broad sense and narrow sense heritability are equal. E are independent environmental effects, E∼N(0,1- Inline graphic ). The biological plausibility of an underlying normally distributed liability to disease is based on the assumption that complex traits are influenced by many variables; the central limit theorem states that the distribution of the sum of independent random variables approaches normality as the number of variables increases. Under the threshold liability model individuals are affected when P >T, where T is the threshold on the normal distribution which truncates the proportion of affected individuals or disease prevalence (i.e., K), T = Φ⁻¹(1-K), Φ(T) = 1-K, where Φ(T) is the cumulative density function of the normal distribution up to values of T, e.g. if K = 0.05, T = 1.645. The threshold liability of risk scale has much nicer properties than the observed disease scale and provides a framework for comparison of scenarios independent of disease prevalence. The relationship between heritability of liability Inline graphic and the directly estimable parameters of K and λ_S is

(1)

[16] with Inline graphic and z the height of the standard normal curve and T ₁ = Φ⁻¹(1- λ_S K), i.e. the threshold T ₁<T when λ_S>1, reflecting that the prevalence amongst sibs of affected individuals, K_S is greater than the prevalence in the population as a whole (e.g. if K = 0.05 and λ_S = 2, z = 0.103, T ₁ = 1.282, Inline graphic = 0.371).

Area under the ROC curve

The AUC is a statistic calculated on the observed disease scale and is a measure of the efficacy of prediction of phenotype using a test classifier. The ROC plots the true positive rate (TPR or sensitivity) against the false-positive rate (FPR or 1-specificity). TPR = probability (positive test result|diseased) and FPR = probability (positive test result|not diseased). Since these probabilities are conditional, they are not dependent on the number of cases or controls tested, except through the sampling variance associated with them. In genomic profiling the ROC is obtained by ranking a set of individuals with known disease status by their genomic profile from lowest estimated risk (i.e., profile score) to highest estimated risk and then assessing sensitivity and specificity assuming a cut-off after each rank (starting with the highest ranked individual). If n_d and n_d' are the numbers of diseased and not diseased individuals, and if the individual with the highest predicted genetic risk has rank r₁ = n_d + n_d' = n, AUC can be calculated directly from the mean rank of the diseased individuals ( Inline graphic ),

(2)

(see example in Figure S1). Equally, AUC can be calculated as AUC = 0.5(1 + D) where D is the Somers' rank correlation [17] between risk profile and disease status (1 = diseased, 0 = not diseased). Another equivalent definition of AUC is the probability that a randomly selected pair of diseased (d) and non-diseased (d') individuals are accurately classified [18]. The probability is the same as the probability that difference between the genetic liability of the d and d' individuals is greater than zero. This difference is approximately normally distributed with mean μ_d - μ_d' and variance Inline graphic . Using the liability threshold model and results of standard genetic selection theory [11] the means (μ) and variances (σ ²) of the genetic liability of d and d' individuals are

where v = -iK/(1 – K). The genetic liabilities of the d and d' groups are each approximately normally distributed, the approximation being less accurate for high heritabilities.

Therefore,

graphic file with name pgen.1000864.e037.jpg

(3)

Using AUC measured by a genomic profile to estimate the proportion of genetic variance explained

A useful property of AUC (as discussed above) is that for a given disease the estimated AUC is independent of the relative proportions of cases and controls in the sample being classified [7], i.e. the mean rank is approximately the same if the proportion of cases: controls is K: (1-K) or 1∶1. Or equally, the probability of a randomly selected case and control being correctly ranked is independent (except for sampling) of the number of cases and controls measured. We can use equation 3 to estimate the variance on the liability scale explained by a genomic profile, x, by making Inline graphic the subject of the equation, but renaming it as , recognising that it represents the proportion of variance explained by the profile. Then, from two measurable parameters, K and AUC, we can calculate ,

(4)

Where Q = Φ⁻¹(AUC). From this, we can calculate the proportion of the known genetic variance explained by the genomic profile

(5)

using the estimates of K and λ_S to calculate Inline graphic (equation 1). We can also calculate the proportion of the sibling risk explained by the profile, (λ_S _[x] – 1)/(λ_S – 1), where λ_S _[x] = (1-Φ(T _1[x]))/K and

graphic file with name pgen.1000864.e044.jpg

(6)

[19]. Inline graphic and (λ_S _[x] – 1)/(λ_S – 1) measure the same concept but in different ways and on different scales; both are useful criteria for assessing the extent to which the genomic profile accounts for the known genetic component of disease. We consider family history as a predictor of genetic risk in the Text S1.

Simulation

We used simulation under the liability threshold model [11],[14] to check our derivations. We simulated 100,000 nuclear families sampling risk on the liability scale, P = A + E, A ∼ N(0, Inline graphic ) for parents, and A = ½A_dad+½A_mum+A_mend for children, where the Mendelian segregation terms were random numbers sampled as A_mend ∼ N(0, ½); E ∼ N(0,1 - ). Individuals were considered affected, P ₀₁ = 1, if P >Φ⁻¹(1-K) = T, otherwise individuals were not affected and P ₀₁ = 0. Genetic values on the observed scale, G₀₁, were calculated as the normal probability, Inline graphic . From this we could calculate , , (using the G₀₁ and P₀₁ of the first child from each family) and sibling recurrence risk. AUC_max was calculated from the mean rank of diseased individuals using equation 2 when ranked on A.

Results

The maximum value of AUC when the test classifier is a genetic predictor depends on heritability and disease prevalence

In Figure 1A we consider two diseases both with heritability of liability, Inline graphic = 0.2, plotting probability of disease (i.e. G₀₁) vs genetic liability (i.e. A). To allow an extreme comparison, one of the diseases has prevalence K = 0.5 and the other, K = 0.01. Figure 1B also considers two diseases with prevalences K = 0.5 and 0.01, but in this case both have Inline graphic = 0.8. In Figure 1A and 1B, the position of the rise in probability of disease along the x-axis reflects the disease prevalence and the steepness of the rise reflects the heritability of the disease. In Figure 1A the distribution of genetic liabilities on the underlying scale is exactly the same for these two diseases, but when K = 0.01 higher genetic liabilities are needed before probability of disease rises above virtual zero (virtual because it is not exactly zero, but very close to zero); similarly for the diseases in Figure 1B. Figure 1C and 1D plot the ROC curves for the diseases considered in Figure 1A and 1B, respectively. These graphs demonstrate firstly (not unexpectedly), that for diseases with the same prevalence, genetic liability is a better predictor of disease status for diseases with higher heritability and secondly, that for diseases with the same heritability, genetic liability is a better predictor of disease status for rarer diseases, because a higher proportion of those with high genetic liability are actually diseased. For example, if we used genetic liability of ≥1 as our predictor of disease, then the TPR is 0.26 and the FPR = 0.00, when K = 0.5, compared to TPR = 0.99 and the FPR = 0.12, when K = 0.01. These graphs demonstrate that maximum value of AUC (i.e. AUC_max) when the test classifier is a genetic predictor is dependent on both Inline graphic and K.

(A,B) Probability of disease versus genetic liability. (C,D) ROC curve [46].

Prediction of AUC_max from (or λ_S) and K

Figure 2 plots AUC_max vs Inline graphic , for K = 0.001, 0.01, 0.1, 0.3 from simulation (dashed line) and from equation 3 (solid line) and shows that AUC_max is particularly constrained for more common or low heritability diseases. Jannsens et al [3], in their Fig. 4, have shown the relationship between AUC and the proportion of variance on the disease scale explained by the genomic profile; since their genomic profile assumed all genetic variants were known without error their graph represents the relationship between AUC_max and Inline graphic . Our simulation results provided the same relationship when plotted on this scale (Figure 3, solid line). In Figure 3 we show the relationship of AUC_max with and (for each simulation combination of K and , the AUC_max and are calculated).

For different disease prevalences (A–D) from simulation (dashed line) and from equation 3 (solid line).

Complex genetic diseases

Table 1 lists AUC_max for a range of complex genetic diseases calculated using equation 3, with Inline graphic calculated using equation 1 from published estimates of K and λ_S. Despite being observable, the parameters K and λ_S are subject to considerable sampling variance; we have tried, where possible, to take estimates from reviews or large studies, but large study samples simply do not exist for some low prevalence disorders. The values of AUC_max show that it should be possible for a genomic profile for complex diseases to exceed 0.75, the threshold regarded [20] as making a diagnostic classifier clinically useful when applied to a sample considered to be at increased risk. However, based on the results in Table 1 only the diseases with high heritability and low prevalence, such as Type I diabetes, Crohn's Disease and Lupus, can achieve an AUC, by genomic profiling alone, above the 0.99 threshold regarded [20] as being required for a diagnostic classifier to be applied in the general population. In Table 1, we also consider the AUC expected under scenarios where a genomic profile accounts for only a half (AUC_half) or a quarter (AUC_quar) of the known genetic variance. These results show that for rare diseases genomic profiles can be useful classifiers of disease (AUC>0.8 when K<0.01), when the profile explains only a quarter of the genetic variance.

Using equations (4) and (5) we calculate Inline graphic for the diseases listed in Table 1 when AUC = 0.75. The results (Table 1) show that the same AUC can represent quite different successes of the genomic profile in representing the known genetic variance, ranging from 0.10 to 0.74. If we are able to explain half of the known genetic variance with identified risk variants then genomic profiles for most complex genetic disease (AUC_half, Table 1) will achieve some clinical validity as AUC is >0.75 for all but bladder cancer, for the examples provided.

Example: age related macular degeneration

Consider the first listed example in Table 1, age related macular degeneration (AMD).

Based on the review of Scholl et al [21] and the large twin study of Seddon et al [22] we have used a prevalence after 80 years age of advanced AMD K = 11.8% and a sibling recurrence risk representing the genetic contribution of λ_S = 2.2, which correspond to heritability on the liability scale of Inline graphic = 0.68 (equation 1). If the genetic test explains all the genetic variance ( = 1), the maximum AUC that could be achieved by a genomic profile is AUC_max = 0.92. If only half or a quarter of the genetic variance can be detected by genomic markers then the maximum AUC that can achieved are AUC_half = 0.81 and AUC_quar = 0.72, respectively, values that exceed the prediction of genetic risk based of the most optimistic scenario from a prediction based on family history (Text S1). If complete disease status is known for all siblings, parents, grandparents, aunts, uncles and cousins then the maximum AUC that could be achieved is 0.71, translating to a genomic profile that explains 0.21 of the genetic variance (Table S1). In practice, the AUC for a risk predictor based on rs1061170 a single nucleotide polymorphism in the complement factor H (CFH) gene was 0.69 [23] (and was approximately equal for advanced AMD cases vs controls and all AMD cases vs controls). From equations 4–6, Inline graphic = 0.12, λ_S _[x] = 1.17, = 0.17 and (λ_S _[x] – 1)/(λ_S – 1) = 0.15.

Discussion

Relationship of AUC_max to heritability and disease prevalence when the disease classifier is a genetic risk predictor

The AUC is a widely used statistic that summarises the clinical validity of a diagnostic or prognostic test. However, the AUC statistic of a genomic profile alone has an upper limit (i.e. AUC_max) which depends on the genetic epidemiology of the disease, namely the disease prevalence and heritability. It is important that in the first instance, particularly when genomic profiling is in its infancy, that genomic profiles are judged on their ability to predict genetic risk (their analytic validity) rather than on the basis of clinical validity [10]. Since AUC is estimated as a function of a rank correlation its genetic interpretation is not immediately obvious. Here we provide a genetic interpretation of the AUC expressed in terms of it genetic epidemiology parameters (equation 3). A relationship between AUC_max and heritability was first demonstrated graphically by Janssens et al [3] (see solid line Figure 3). However, their representation was of broad sense heritability on the observed scale (i.e. Inline graphic ) which is a little used measure of heritability because of its dependence on disease prevalence [13]. Here we show (Figure 2 and equation 3) the relationship between AUC_max and the more commonly used measure of heritability, the heritability of liability (i.e., ) We show that AUC_max is dependent on both Inline graphic and disease prevalence (i.e. K).

Initially, it may seem counter-intuitive that AUC depends on disease prevalence since for an individual disease TPR and FPR are independent of the proportion of cases and controls measured and therefore of the sample prevalence. However, as we have clearly shown (Figure 1A and 1B) the dependence on disease prevalence results from our ability to generalise across diseases in the context of a test classifier being a genomic profile.

In contrast to our results and those of Janssens et al [3], Clayton [24] provided an expression for ROC under a polygenic model which is independent of population disease prevalence. His derivation assumes that the effect of each locus is additive on the log risk scale [25]. Slatkin [26] and we [27] have found that this model allows probabilities of disease that exceed one, which although they occur with low frequency can have substantial impact on the estimates of recurrence risk and genetic variance. Under this model there is a relationship between recurrence risk to monozygotic twins and to siblings of λ_MZ/ Inline graphic = 1; this ratio is not achieved when probabilities of disease are constrained to their natural parameter space of a maximum of 1. Furthermore, empirical estimates of the ratio of λ_MZ/ from the studies listed in Table 1 that provide estimates of λ_MZ and λ_S are mostly less than 1.0 [27], particularly for low prevalence diseases. Recognising that these estimates are subject to sampling variance, the estimates of λ_MZ/ Inline graphic are 1.1 (AMD), 0.4 (coronary artery disease), 0.8 (breast cancer), 0.7 (schizophrenia [25]), 0.9 (rheumatoid arthritis) and 0.4 (Type I diabetes). Therefore, we believe the model used by Clayton to derive the relationship between AUC and heritability (or sibling recurrence risk) independent of disease prevalence is not valid.

AUC and accuracy of genetic profiles

AUC is a useful measure because of its independence of the numbers of diseased and diseased individuals tested, but we advocate the reporting of an estimate of the proportion of the known genetic variance on the liability scale ( Inline graphic ) or the proportion of sibling risk accounted for by the profile and we provide a method to do this using the estimated AUC, disease prevalence and heritability on the liability scale or sibling recurrence risk (equation 5). An AUC of 0.75 can imply anything from 0.10 to 0.74 of the genetic variance explained by the genomic profile for the complex diseases listed in Table 1. The correlation Inline graphic has long been the benchmark in non-human genetics of accuracy of genetic risk predictors. can be calculated from three measurable statistics, disease prevalence, sibling recurrence risk and AUC of the profile (using equations (1) and (4)). In this way, estimates of AUC can provide direct estimates of the proportion of ‘missing heritability’ [28] which takes into account the interdependence of identified associated variants.

Currently, the derivation of genomic profiles is very much in its infancy. As the sample size of genome-wide association studies increase, we can expect genomic profiles to include more and more validated associated variants. However, Inline graphic is constrained by the variance that could be detected by the markers that are genotyped recognising that the current generation of genome-wide chips explain at most ∼80% of the known variance in single nucleotide polymorphisms across the Caucasian genome [29]. This, in turn, may only be a fraction of the total genomic variance once structural variants such as copy number variants are included [30]. The actual variance explained by the profile depends on the sample size (i.e., power) of the studies from which associated genetic variants have been detected. It is likely that there are many variants which have such a small effect size that they will be impossible to detect even with very large samples. Although each such variant makes only a very small contribution to the genetic variance, there may be so many that a sizeable proportion of the variance will go undetected. Even if only quarter of the genetic variance is detectable by our future genotyping technology, the AUC is still greater for the genomic profile than for family history (ignoring shared environmental risks of family members, Text S1).

Limitations

In our derivations we have assumed the liability threshold model [11],[14]. Slatkin [26] demonstrated that the threshold model was one of several genetic models that provided the necessary steep increase in probability of disease with increasing load of genetic risk alleles [26]. The main assumption of the liability threshold model is that the distribution of liability scores is unimodal which should be achieved as long as there is no single unidentified genetic or environmental of very large effect [11]. The model accommodates any distribution of risk allele effect sizes and risk allele frequencies as long as there are sufficient (“more than one or a few” [11]) risk alleles in the population to create an approximately normal distribution of genetic liability scores. Since our simulation results of AUC_max vs Inline graphic (Figure 3) based on the liability threshold model agree with those of Janssens et al [3] who used a logit model to combine genetic risks from individual genetic variants, it is clear that the dependence of AUC_max on heritability and disease prevalence is not a function of the threshold model.

We have also assumed that a genetic profile is applied in the same “average” environment as the genetic risks were estimated and we have assumed that all familiality is of genetic origin. The AUC_max will be lower than those derived here if any part of the sibling recurrence risk reflects co-variation of non-genetic origin. Using recurrence risks from different types of relatives, the importance of common environmental factors can be assessed and a λ_S which reflects the genetic contribution of sibling recurrence can be used in our calculations. We have also assumed that the genomic profile consists of genetic markers associated with disease that are passed on according to the rules of Mendelian inheritance. In the future, a genomic profile might include non-heritable genetic variants, for example recurrent de novo copy number variants or perhaps methylation status variants (for which the inheritance pattern, if any, is currently unclear [31]). Such variants, although genetic, do not contribute to the similarity between relatives, and so would be included in the environmental component when partitioning variance. Under these circumstances it is possible that a genomic profile could exceed the AUC_max based on sibling recurrence ratio. Our calculations assume that we know the population parameters K and λ_S (and therefore Inline graphic ). Estimates of these parameters are sometimes based on small sample size and are subject to sampling bias or different definitions of the disease. In particular, prevalence rates can depend on the age distribution of the population in which they are measured. In addition, recurrence risk ratios of relatives have a maximum possible value which is dependent on the disease prevalence, so that higher risk ratios are achievable when disease prevalence is lower; and estimates of sibling risk ratio and disease prevalence calculated in different studies sometimes reflect this dependence. In Table 1, we included two different estimates for both schizophrenia and bipolar disorder, but for these examples the estimates of AUC_max are robust to the magnitude of differences reported in genetic epidemiology parameters for individual diseases. At present, genomic profiles based on validated associated variants do not come anywhere close to the maximum implied by their AUC_max; Jakobsdottir et al [6] have reported AUC of 0.80 for risk of cardiovascular events, 0.64 for type 2 diabetes, 0.56 for prostate cancer, 0.66 for Crohn's Disease and 0.79 for age related macular degeneration. This is not surprising given the effect size of individual associated variants discovered in genome-wide association studies, which imply that much larger sample sizes will be needed to discover the majority of the variants that explain the genetic variance [4]. However, already these genomic profiles outperform family history (resulting from shared genetic risk only) for four out of five of these diseases. Although the AUC is a useful summary statistic for clinical validity, in practice clinical utility depends on many other factors such as the benefits versus risks of the intervention strategies that follow from the risk prediction [5],[32]; these important factors are not considered here.

Conclusion

We have provided a genetic interpretation of and insight into the AUC statistic calculated under a genomic profile. Time will tell if genetic variants amenable to genotyping are able to reconstruct the known genetic variance in its totality. Even if it is possible to explain only a quarter of the known genetic variance, the genomic profile will be a more useful predictor of genetic risk than self-reported family history (in the absence of shared environmental risk factors) which is a commonly used measure for targeted screening programmes for complex genetic diseases. In practice, predictions of risk to disease will incorporate both genetic and environmental risk factors to produce the best predictions of absolute risk to disease. Here we provide a benchmark for the expected contribution from the genetic component of the prediction illustrating that the same AUC estimated for different diseases can imply quite different proportions of genetic variance explained by the genomic profile, which is often overlooked (e.g. [5]). Ultimately, genomic profiles may be used without contributions from environmental risk factors, since the contribution from the genomic profile can be estimated perinatally, prior to exposure by many environmental risk factors and when limited family history of disease is available. Indeed, one purpose of a genetic risk predictor is to allow individuals to choose to modify their exposure to environmental risks. We provide a simple online calculator (http://gump.qimr.edu.au/genroc) to calculate i) the maximum AUC for a genomic profile of a disease given estimates of disease prevalence and sibling recurrence risk or heritability of liability, ii) the proportion of variance explained on the liability scale given an estimate of AUC from a risk predictor and disease prevalence and iii) proportion of genetic variance or of sibling risk explained given an estimate AUC, disease prevalence and sibling recurrence risk [2].

Supporting Information

Figure S1

Example calculation of ROC curve for a genomic profile. An example of n_d = 9 diseased (case) and n_d' = 10 non-diseased (control) individuals listed in rank order on a genomic profile. The area under the curve is calculated from equation 2, which is derived as the sum of the horizontal rectangles (bounded by dashed lines) of the ROC plot (solid line) generated by progressing through the ranked list of individuals: each time the next ranked individual is not diseased, the ROC line moves along the x-axis by 1/n_d' and each time the next ranked individual is diseased the ROC line moves up y-axis by 1/n_d. The mean rank value (r_i) of the cases is Inline graphic = 13.2 and AUC = 0.82.

(0.07 MB TIF)

Click here for additional data file.^{(69.1KB, tif)}

Table S1

AUC related statistics for complex genetic diseases: Table 1 with added columns considering family history.

(0.10 MB PDF)

Click here for additional data file.^{(101.8KB, pdf)}

Text S1

AUC based on family history as a prediction of genetic risk.

(0.07 MB PDF)

Click here for additional data file.^{(70.3KB, pdf)}

Acknowledgments

We would like to thank three reviewers for the constructive criticism on earlier versions of this manuscript.

Footnotes

The authors have declared that no competing interests exist.

This work was supported by the Australian National Health and Medical Research Council (grants 389892, 442915, and 496688), and by the Australian Research Council (grant DP0770096). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1.McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, et al. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nature Reviews Genetics. 2008;9:356–369. doi: 10.1038/nrg2344. [DOI] [PubMed] [Google Scholar]
2.Iles MM. What can genome-wide association studies tell us about the genetics of common disease? PLoS Genet. 2008;4:e33. doi: 10.1371/journal.pgen.0040033. doi: 10.1371/journal.pgen.0040033. [DOI] [PMC free article] [PubMed] [Google Scholar]
3.Janssens AC, Aulchenko YS, Elefante S, Borsboom GJ, Steyerberg EW, et al. Predictive testing for complex diseases using multiple genes: fact or fiction? Genet Med. 2006;8:395–400. doi: 10.1097/01.gim.0000229689.18263.f4. [DOI] [PubMed] [Google Scholar]
4.Wray NR, Goddard ME, Visscher PM. Prediction of individual genetic risk to disease from genome-wide association studies. Genome Res. 2007;17:1520–1528. doi: 10.1101/gr.6665407. [DOI] [PMC free article] [PubMed] [Google Scholar]
5.Kraft P, Wacholder S, Cornelis MC, Hu FB, Hayes RB, et al. OPINION Beyond odds - ratios communicating disease risk based on genetic profiles. Nature Reviews Genetics. 2009;10:264–269. doi: 10.1038/nrg2516. [DOI] [PubMed] [Google Scholar]
6.Jakobsdottir J, Gorin MB, Conley YP, Ferrell RE, Weeks DE. Interpretation of genetic association studies: markers with replicated highly significant odds ratios may be poor classifiers. PLoS Genet. 2009;5:e1000337. doi: 10.1371/journal.pgen.1000337. doi: 10.1371/journal.pgen.1000337. [DOI] [PMC free article] [PubMed] [Google Scholar]
7.Metz CE. Basic principles of ROC analysis. Seminars in Nuclear Medicine. 1978;8:283–298. doi: 10.1016/s0001-2998(78)80014-2. [DOI] [PubMed] [Google Scholar]
8.Lu Q, Elston RC. Using the optimal receiver operating characteristic curve to design a predictive genetic test, exemplified with type 2 diabetes. American Journal of Human Genetics. 2008;82:641–651. doi: 10.1016/j.ajhg.2007.12.025. [DOI] [PMC free article] [PubMed] [Google Scholar]
9.van der Net JB, Janssens A, Defesche JC, Kastelein JJP, Sijbrands EJG, et al. Usefulness of Genetic Polymorphisms and Conventional Risk Factors to Predict Coronary Heart Disease in Patients With Familial Hypercholesterolemia. American Journal of Cardiology. 2009;103:375–380. doi: 10.1016/j.amjcard.2008.09.093. [DOI] [PubMed] [Google Scholar]
10.Grosse SD, Khoury MJ. What is the clinical utility of genetic testing? Genet Med. 2006;8:448–450. doi: 10.1097/01.gim.0000227935.26763.c6. [DOI] [PubMed] [Google Scholar]
11.Falconer D, Mackay T. England: Longman; 1996. Introduction to Quantitative Genetics.464 [Google Scholar]
12.James JW. Frequency in relatives for an all-or-none trait. Ann Hum Genet. 1971;35:47–49. doi: 10.1111/j.1469-1809.1956.tb01377.x. [DOI] [PubMed] [Google Scholar]
13.Dempster ER, Lerner IM. Heritability of Threshold Characters. Genetics. 1950;35:212–236. doi: 10.1093/genetics/35.2.212. [DOI] [PMC free article] [PubMed] [Google Scholar]
14.Lynch M, Walsh B. Sunderland, Massachusetts: Sinauer Associates, Inc; 1998. Genetics and Analysis of Quantitative Traits. [Google Scholar]
15.Robertson A, Lerner IM. The heritability of all-or-none traits - viability of poultry. Genetics. 1949;34:395–411. doi: 10.1093/genetics/34.4.395. [DOI] [PMC free article] [PubMed] [Google Scholar]
16.Reich T, James JW, Morris CA. The use of multiple thresholds in determining the mode of transmission of semi-continuous traits. Ann Hum Genet. 1972;36:163–184. doi: 10.1111/j.1469-1809.1972.tb00767.x. [DOI] [PubMed] [Google Scholar]
17.Somers RH. A new asymmetric measure of association for ordinal variables. American Sociological Review. 1962;27:799–811. [Google Scholar]
18.Hanley J, McNeil B. The Meaning and Use of the Area under a Receiver Operating Characteristic (ROC) Curve. Radiology. 1982;143 doi: 10.1148/radiology.143.1.7063747. [DOI] [PubMed] [Google Scholar]
19.Yang J, Visscher PM, Wray NR. Sporadic cases are the norm for common disease. European Journal of Human Genetics 2009 Oct 14. [Epub ahead of print] 2009 doi: 10.1038/ejhg.2009.177. [DOI] [PMC free article] [PubMed] [Google Scholar]
20.Janssens AC, Moonesinghe R, Yang Q, Steyerberg EW, van Duijn CM, et al. The impact of genotype frequencies on the clinical validity of genomic profiling for predicting common chronic diseases. Genet Med. 2007;9:528–535. doi: 10.1097/gim.0b013e31812eece0. [DOI] [PubMed] [Google Scholar]
21.Scholl HPN, Fleckenstein M, Issa PC, Keilhauer C, Holz FG, et al. An update on the genetics of age-related macular degeneration. Molecular Vision. 2007;13:196–205. [PMC free article] [PubMed] [Google Scholar]
22.Seddon JM, Cote J, Page WF, Aggen SH, Neale MC. The US twin study of age-related macular degeneration - Relative roles of genetic and einivironmental influences. Archives of Ophthalmology. 2005;123:321–327. doi: 10.1001/archopht.123.3.321. [DOI] [PubMed] [Google Scholar]
23.Gu J, Pauer GJ, Yue X, Narendra U, Sturgill GM, et al. Assessing susceptibility to age-related macular degeneration with proteomic and genomic biomarkers. Mol Cell Proteomics. 2009;8:1338–1349. doi: 10.1074/mcp.M800453-MCP200. [DOI] [PMC free article] [PubMed] [Google Scholar]
24.Clayton DG. Prediction and interaction in complex disease genetics: experience in type 1 diabetes. PLoS Genet. 2009;5:e1000540. doi: 10.1371/journal.pgen.1000540. doi: 10.1371/journal.pgen.1000540. [DOI] [PMC free article] [PubMed] [Google Scholar]
25.Risch N. Linkage strategies for genetically complex traits. I. Multilocus models. Am J Hum Genet. 1990;46:222–228. [PMC free article] [PubMed] [Google Scholar]
26.Slatkin M. Exchangeable models of complex inherited diseases. Genetics. 2008;179:2253–2261. doi: 10.1534/genetics.107.077719. [DOI] [PMC free article] [PubMed] [Google Scholar]
27.Wray NR, Goddard ME. Multi-locus models of genetic risk of disease. Genome Medicine In press. 2010 doi: 10.1186/gm131. [DOI] [PMC free article] [PubMed] [Google Scholar]
28.Maher B. Personal genomes: The case of the missing heritability. Nature. 2008;456:18–21. doi: 10.1038/456018a. [DOI] [PubMed] [Google Scholar]
29.Bhangale TR, Rieder MJ, Nickerson DA. Estimating coverage and power for genetic association studies using near-complete variation data. Nature Genetics. 2008;40:841–843. doi: 10.1038/ng.180. [DOI] [PubMed] [Google Scholar]
30.Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, et al. Global variation in copy number in the human genome. Nature. 2006;444:444–454. doi: 10.1038/nature05329. [DOI] [PMC free article] [PubMed] [Google Scholar]
31.Youngson NA, Whitelaw E. Transgenerational epigenetic effects. Annual Review of Genomics and Human Genetics. 2008;9:233–257. doi: 10.1146/annurev.genom.9.081307.164445. [DOI] [PubMed] [Google Scholar]
32.Baker SG, Cook NR, Vickers A, Kramer BS. Using relative utility curves to evaluate risk prediction. Journal of the Royal Statistical Society. 2009;172:729–748. doi: 10.1111/j.1467-985X.2009.00592.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
33.Levinson DF. The genetics of depression: A review. Biological Psychiatry. 2006;60:84–92. doi: 10.1016/j.biopsych.2005.08.024. [DOI] [PubMed] [Google Scholar]
34.Sullivan PF, Neale MC, Kendler KS. Genetic epidemiology of major depression: Review and meta-analysis. American Journal of Psychiatry. 2000;157:1552–1562. doi: 10.1176/appi.ajp.157.10.1552. [DOI] [PubMed] [Google Scholar]
35.Marenberg ME, Risch N, Berkman LF, Floderus B, Defaire U. Genetic susceptibility to death from coronary heart disease in a study of twins. New England Journal of Medicine. 1994;330:1041–1046. doi: 10.1056/NEJM199404143301503. [DOI] [PubMed] [Google Scholar]
36.Risch N. The genetic epidemiology of cancer: interpreting family and twin studies and their implications for molecular genetic approaches. Cancer Epidemiol Biomarkers Prev. 2001;10:733–741. [PubMed] [Google Scholar]
37.Das SK, Elbein SC. The Genetic Basis of Type 2 Diabetes. Cellscience. 2006;2:100–131. doi: 10.1901/jaba.2006.2-100. [DOI] [PMC free article] [PubMed] [Google Scholar]
38.Hemminki K, Li X, Sundquist K, Sundquist J. Familial risks for asthma among twins and other siblings based on hospitalizations in Sweden. Clinical and Experimental Allergy. 2007;37:1320–1325. doi: 10.1111/j.1365-2222.2007.02737.x. [DOI] [PubMed] [Google Scholar]
39.Craddock N, Khodel V, Van Eerdewegh P, Reich T. Mathematical limits of multilocus models: the genetic transmission of bipolar disorder. Am J Hum Genet. 1995;57:690–702. [PMC free article] [PubMed] [Google Scholar]
40.Lichtenstein P, Yip BH, Bjork C, Pawitan Y, Cannon TD, et al. Common genetic determinants of schizophrenia and bipolar disorder in Swedish families: a population-based study. Lancet. 2009;373:234–239. doi: 10.1016/S0140-6736(09)60072-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
41.McGue M, Gottesman II, Rao DC. The transmission of schizophrenia under a multifactorial threshold model. American Journal of Human Genetics. 1983;35:1161–1178. [PMC free article] [PubMed] [Google Scholar]
42.Harney S, Wordsworth BP. Genetic epidemiology of rheumatoid arthritis. Tissue Antigens. 2002;60:465–473. doi: 10.1034/j.1399-0039.2002.600601.x. [DOI] [PubMed] [Google Scholar]
43.Hyttinen V, Kaprio J, Kinnunen L, Koskenvuo M, Tuomilehto J. Genetic liability of type 1 diabetes and the onset age among 22,650 young Finnish twin pairs - A nationwide follow-up study. Diabetes. 2003;52:1052–1055. doi: 10.2337/diabetes.52.4.1052. [DOI] [PubMed] [Google Scholar]
44.WTCCC. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678. doi: 10.1038/nature05911. [DOI] [PMC free article] [PubMed] [Google Scholar]
45.Harley JB, Alarcon-Riquelme ME, Criswell LA, Jacob CO, Kimberly RP, et al. Genome-wide association scan in women with systemic lupus erythematosus identifies susceptibility variants in ITGAM, PXK, KIAA1542 and other loci. Nat Genet. 2008;40:204–210. doi: 10.1038/ng.81. [DOI] [PMC free article] [PubMed] [Google Scholar]
46.Sing T, Sander O, Beerenwinkel N, Lengauer T. ROCR: visualizing classifier performance in R. Bioinformatics. 2005;21:3940–3941. doi: 10.1093/bioinformatics/bti623. [DOI] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Figure S1

(0.07 MB TIF)

Click here for additional data file.^{(69.1KB, tif)}

Table S1

AUC related statistics for complex genetic diseases: Table 1 with added columns considering family history.

(0.10 MB PDF)

Click here for additional data file.^{(101.8KB, pdf)}

Text S1

AUC based on family history as a prediction of genetic risk.

(0.07 MB PDF)

Click here for additional data file.^{(70.3KB, pdf)}

[pgen.1000864-McCarthy1] 1.McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, et al. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nature Reviews Genetics. 2008;9:356–369. doi: 10.1038/nrg2344. [DOI] [PubMed] [Google Scholar]

[pgen.1000864-Iles1] 2.Iles MM. What can genome-wide association studies tell us about the genetics of common disease? PLoS Genet. 2008;4:e33. doi: 10.1371/journal.pgen.0040033. doi: 10.1371/journal.pgen.0040033. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1000864-Janssens1] 3.Janssens AC, Aulchenko YS, Elefante S, Borsboom GJ, Steyerberg EW, et al. Predictive testing for complex diseases using multiple genes: fact or fiction? Genet Med. 2006;8:395–400. doi: 10.1097/01.gim.0000229689.18263.f4. [DOI] [PubMed] [Google Scholar]

[pgen.1000864-Wray1] 4.Wray NR, Goddard ME, Visscher PM. Prediction of individual genetic risk to disease from genome-wide association studies. Genome Res. 2007;17:1520–1528. doi: 10.1101/gr.6665407. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1000864-Kraft1] 5.Kraft P, Wacholder S, Cornelis MC, Hu FB, Hayes RB, et al. OPINION Beyond odds - ratios communicating disease risk based on genetic profiles. Nature Reviews Genetics. 2009;10:264–269. doi: 10.1038/nrg2516. [DOI] [PubMed] [Google Scholar]

[pgen.1000864-Jakobsdottir1] 6.Jakobsdottir J, Gorin MB, Conley YP, Ferrell RE, Weeks DE. Interpretation of genetic association studies: markers with replicated highly significant odds ratios may be poor classifiers. PLoS Genet. 2009;5:e1000337. doi: 10.1371/journal.pgen.1000337. doi: 10.1371/journal.pgen.1000337. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1000864-Metz1] 7.Metz CE. Basic principles of ROC analysis. Seminars in Nuclear Medicine. 1978;8:283–298. doi: 10.1016/s0001-2998(78)80014-2. [DOI] [PubMed] [Google Scholar]

[pgen.1000864-Lu1] 8.Lu Q, Elston RC. Using the optimal receiver operating characteristic curve to design a predictive genetic test, exemplified with type 2 diabetes. American Journal of Human Genetics. 2008;82:641–651. doi: 10.1016/j.ajhg.2007.12.025. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1000864-vanderNet1] 9.van der Net JB, Janssens A, Defesche JC, Kastelein JJP, Sijbrands EJG, et al. Usefulness of Genetic Polymorphisms and Conventional Risk Factors to Predict Coronary Heart Disease in Patients With Familial Hypercholesterolemia. American Journal of Cardiology. 2009;103:375–380. doi: 10.1016/j.amjcard.2008.09.093. [DOI] [PubMed] [Google Scholar]

[pgen.1000864-Grosse1] 10.Grosse SD, Khoury MJ. What is the clinical utility of genetic testing? Genet Med. 2006;8:448–450. doi: 10.1097/01.gim.0000227935.26763.c6. [DOI] [PubMed] [Google Scholar]

[pgen.1000864-Falconer1] 11.Falconer D, Mackay T. England: Longman; 1996. Introduction to Quantitative Genetics.464 [Google Scholar]

[pgen.1000864-James1] 12.James JW. Frequency in relatives for an all-or-none trait. Ann Hum Genet. 1971;35:47–49. doi: 10.1111/j.1469-1809.1956.tb01377.x. [DOI] [PubMed] [Google Scholar]

[pgen.1000864-Dempster1] 13.Dempster ER, Lerner IM. Heritability of Threshold Characters. Genetics. 1950;35:212–236. doi: 10.1093/genetics/35.2.212. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1000864-Lynch1] 14.Lynch M, Walsh B. Sunderland, Massachusetts: Sinauer Associates, Inc; 1998. Genetics and Analysis of Quantitative Traits. [Google Scholar]

[pgen.1000864-Robertson1] 15.Robertson A, Lerner IM. The heritability of all-or-none traits - viability of poultry. Genetics. 1949;34:395–411. doi: 10.1093/genetics/34.4.395. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1000864-Reich1] 16.Reich T, James JW, Morris CA. The use of multiple thresholds in determining the mode of transmission of semi-continuous traits. Ann Hum Genet. 1972;36:163–184. doi: 10.1111/j.1469-1809.1972.tb00767.x. [DOI] [PubMed] [Google Scholar]

[pgen.1000864-Somers1] 17.Somers RH. A new asymmetric measure of association for ordinal variables. American Sociological Review. 1962;27:799–811. [Google Scholar]

[pgen.1000864-Hanley1] 18.Hanley J, McNeil B. The Meaning and Use of the Area under a Receiver Operating Characteristic (ROC) Curve. Radiology. 1982;143 doi: 10.1148/radiology.143.1.7063747. [DOI] [PubMed] [Google Scholar]

[pgen.1000864-Yang1] 19.Yang J, Visscher PM, Wray NR. Sporadic cases are the norm for common disease. European Journal of Human Genetics 2009 Oct 14. [Epub ahead of print] 2009 doi: 10.1038/ejhg.2009.177. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1000864-Janssens2] 20.Janssens AC, Moonesinghe R, Yang Q, Steyerberg EW, van Duijn CM, et al. The impact of genotype frequencies on the clinical validity of genomic profiling for predicting common chronic diseases. Genet Med. 2007;9:528–535. doi: 10.1097/gim.0b013e31812eece0. [DOI] [PubMed] [Google Scholar]

[pgen.1000864-Scholl1] 21.Scholl HPN, Fleckenstein M, Issa PC, Keilhauer C, Holz FG, et al. An update on the genetics of age-related macular degeneration. Molecular Vision. 2007;13:196–205. [PMC free article] [PubMed] [Google Scholar]

[pgen.1000864-Seddon1] 22.Seddon JM, Cote J, Page WF, Aggen SH, Neale MC. The US twin study of age-related macular degeneration - Relative roles of genetic and einivironmental influences. Archives of Ophthalmology. 2005;123:321–327. doi: 10.1001/archopht.123.3.321. [DOI] [PubMed] [Google Scholar]

[pgen.1000864-Gu1] 23.Gu J, Pauer GJ, Yue X, Narendra U, Sturgill GM, et al. Assessing susceptibility to age-related macular degeneration with proteomic and genomic biomarkers. Mol Cell Proteomics. 2009;8:1338–1349. doi: 10.1074/mcp.M800453-MCP200. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1000864-Clayton1] 24.Clayton DG. Prediction and interaction in complex disease genetics: experience in type 1 diabetes. PLoS Genet. 2009;5:e1000540. doi: 10.1371/journal.pgen.1000540. doi: 10.1371/journal.pgen.1000540. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1000864-Risch1] 25.Risch N. Linkage strategies for genetically complex traits. I. Multilocus models. Am J Hum Genet. 1990;46:222–228. [PMC free article] [PubMed] [Google Scholar]

[pgen.1000864-Slatkin1] 26.Slatkin M. Exchangeable models of complex inherited diseases. Genetics. 2008;179:2253–2261. doi: 10.1534/genetics.107.077719. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1000864-Wray2] 27.Wray NR, Goddard ME. Multi-locus models of genetic risk of disease. Genome Medicine In press. 2010 doi: 10.1186/gm131. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1000864-Maher1] 28.Maher B. Personal genomes: The case of the missing heritability. Nature. 2008;456:18–21. doi: 10.1038/456018a. [DOI] [PubMed] [Google Scholar]

[pgen.1000864-Bhangale1] 29.Bhangale TR, Rieder MJ, Nickerson DA. Estimating coverage and power for genetic association studies using near-complete variation data. Nature Genetics. 2008;40:841–843. doi: 10.1038/ng.180. [DOI] [PubMed] [Google Scholar]

[pgen.1000864-Redon1] 30.Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, et al. Global variation in copy number in the human genome. Nature. 2006;444:444–454. doi: 10.1038/nature05329. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1000864-Youngson1] 31.Youngson NA, Whitelaw E. Transgenerational epigenetic effects. Annual Review of Genomics and Human Genetics. 2008;9:233–257. doi: 10.1146/annurev.genom.9.081307.164445. [DOI] [PubMed] [Google Scholar]

[pgen.1000864-Baker1] 32.Baker SG, Cook NR, Vickers A, Kramer BS. Using relative utility curves to evaluate risk prediction. Journal of the Royal Statistical Society. 2009;172:729–748. doi: 10.1111/j.1467-985X.2009.00592.x. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1000864-Levinson1] 33.Levinson DF. The genetics of depression: A review. Biological Psychiatry. 2006;60:84–92. doi: 10.1016/j.biopsych.2005.08.024. [DOI] [PubMed] [Google Scholar]

[pgen.1000864-Sullivan1] 34.Sullivan PF, Neale MC, Kendler KS. Genetic epidemiology of major depression: Review and meta-analysis. American Journal of Psychiatry. 2000;157:1552–1562. doi: 10.1176/appi.ajp.157.10.1552. [DOI] [PubMed] [Google Scholar]

[pgen.1000864-Marenberg1] 35.Marenberg ME, Risch N, Berkman LF, Floderus B, Defaire U. Genetic susceptibility to death from coronary heart disease in a study of twins. New England Journal of Medicine. 1994;330:1041–1046. doi: 10.1056/NEJM199404143301503. [DOI] [PubMed] [Google Scholar]

[pgen.1000864-Risch2] 36.Risch N. The genetic epidemiology of cancer: interpreting family and twin studies and their implications for molecular genetic approaches. Cancer Epidemiol Biomarkers Prev. 2001;10:733–741. [PubMed] [Google Scholar]

[pgen.1000864-Das1] 37.Das SK, Elbein SC. The Genetic Basis of Type 2 Diabetes. Cellscience. 2006;2:100–131. doi: 10.1901/jaba.2006.2-100. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1000864-Hemminki1] 38.Hemminki K, Li X, Sundquist K, Sundquist J. Familial risks for asthma among twins and other siblings based on hospitalizations in Sweden. Clinical and Experimental Allergy. 2007;37:1320–1325. doi: 10.1111/j.1365-2222.2007.02737.x. [DOI] [PubMed] [Google Scholar]

[pgen.1000864-Craddock1] 39.Craddock N, Khodel V, Van Eerdewegh P, Reich T. Mathematical limits of multilocus models: the genetic transmission of bipolar disorder. Am J Hum Genet. 1995;57:690–702. [PMC free article] [PubMed] [Google Scholar]

[pgen.1000864-Lichtenstein1] 40.Lichtenstein P, Yip BH, Bjork C, Pawitan Y, Cannon TD, et al. Common genetic determinants of schizophrenia and bipolar disorder in Swedish families: a population-based study. Lancet. 2009;373:234–239. doi: 10.1016/S0140-6736(09)60072-6. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1000864-McGue1] 41.McGue M, Gottesman II, Rao DC. The transmission of schizophrenia under a multifactorial threshold model. American Journal of Human Genetics. 1983;35:1161–1178. [PMC free article] [PubMed] [Google Scholar]

[pgen.1000864-Harney1] 42.Harney S, Wordsworth BP. Genetic epidemiology of rheumatoid arthritis. Tissue Antigens. 2002;60:465–473. doi: 10.1034/j.1399-0039.2002.600601.x. [DOI] [PubMed] [Google Scholar]

[pgen.1000864-Hyttinen1] 43.Hyttinen V, Kaprio J, Kinnunen L, Koskenvuo M, Tuomilehto J. Genetic liability of type 1 diabetes and the onset age among 22,650 young Finnish twin pairs - A nationwide follow-up study. Diabetes. 2003;52:1052–1055. doi: 10.2337/diabetes.52.4.1052. [DOI] [PubMed] [Google Scholar]

[pgen.1000864-WTCCC1] 44.WTCCC. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678. doi: 10.1038/nature05911. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1000864-Harley1] 45.Harley JB, Alarcon-Riquelme ME, Criswell LA, Jacob CO, Kimberly RP, et al. Genome-wide association scan in women with systemic lupus erythematosus identifies susceptibility variants in ITGAM, PXK, KIAA1542 and other loci. Nat Genet. 2008;40:204–210. doi: 10.1038/ng.81. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pgen.1000864-Sing1] 46.Sing T, Sander O, Beerenwinkel N, Lengauer T. ROCR: visualizing classifier performance in R. Bioinformatics. 2005;21:3940–3941. doi: 10.1093/bioinformatics/bti623. [DOI] [PubMed] [Google Scholar]

PERMALINK

The Genetic Interpretation of Area under the ROC Curve in Genomic Profiling

Naomi R Wray

Jian Yang

Michael E Goddard

Peter M Visscher

Roles

Abstract

Author Summary

Introduction

Table 1. AUC related statistics for complex genetic diseases.

Methods

Background: quantitative traits

Background: disease traits

Area under the ROC curve

Using AUC measured by a genomic profile to estimate the proportion of genetic variance explained

Simulation

Results

The maximum value of AUC when the test classifier is a genetic predictor depends on heritability and disease prevalence

Figure 1. The dependence of maximum AUC (AUCmax) from a genomic profile on heritability and disese prevalence.

Prediction of AUCmax from (or λS) and K

Figure 2. Relationship between maximum AUC (AUCmax) from a genomic profile and heritability on the liability scale .

Figure 3. The relationship between maximum AUC (AUCmax) from a genomic profile and heritability on the liability scale (dashed line) or heritability on the observed scale (solid line), for disease prevalences in order from top left, K = 0.001, 0.01, 0.1, 0.3.

Complex genetic diseases

Example: age related macular degeneration

Discussion

Relationship of AUCmax to heritability and disease prevalence when the disease classifier is a genetic risk predictor

AUC and accuracy of genetic profiles

Limitations

Conclusion

Supporting Information

Acknowledgments

Footnotes

References

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases

Figure 1. The dependence of maximum AUC (AUC_max) from a genomic profile on heritability and disese prevalence.

Prediction of AUC_max from (or λ_S) and K

Figure 2. Relationship between maximum AUC (AUC_max) from a genomic profile and heritability on the liability scale .

Figure 3. The relationship between maximum AUC (AUC_max) from a genomic profile and heritability on the liability scale (dashed line) or heritability on the observed scale (solid line), for disease prevalences in order from top left, K = 0.001, 0.01, 0.1, 0.3.

Relationship of AUC_max to heritability and disease prevalence when the disease classifier is a genetic risk predictor