Naïve Bayesian Classifier and Genetic Risk Score for Genetic Risk Prediction of a Categorical Trait: Not so Different after all!

Paola Sebastiani; Nadia Solovieff; Jenny X Sun

doi:10.3389/fgene.2012.00026

2012 Feb 29;3:26. doi: 10.3389/fgene.2012.00026


PMCID: PMC3289795 PMID: 22393331

Abstract

One of the most popular modeling approaches to genetic risk prediction is to use a summary of risk alleles in the form of an unweighted or a weighted genetic risk score, with weights that relate to the odds for the phenotype in carriers of the individual alleles. Recent contributions have proposed the use of Bayesian classification rules using Naïve Bayes classifiers. We examine the relation between the two approaches for genetic risk prediction and show that the methods are mathematically related. In addition, we study the properties of the two approaches and describe how they can be generalized to include various models of inheritance.

Keywords: genetic risk prediction, genetic score, Naïve Bayes classifier, classification score, classification rule

Introduction

Several statistical methods have been proposed to capture the complex genetic bases of common diseases. These approaches include standard regression models in which the contribution of several genetic variants is summarized by a genetic risk score (GRS; Meigs et al., 2008; Purcell et al., 2009; Paynter et al., 2010), multivariate regression models and “machine learning type” approaches such as support vector machines (Wei et al., 2009; Wu et al., 2011), Naïve Bayes classifiers (NBC; Okser et al., 2010), classification and regression trees, random forests (Bureau et al., 2005; McKinney et al., 2006), rule induction (Sebastiani and Perls, 2010; Stengard et al., 2010), multifactor dimensionality reduction (Moore et al., 2006), and Bayesian networks (Rodin and Boerwinkle, 2005; Sebastiani et al., 2005; Jiang et al., 2011; Kang et al., 2011a). NBCs use a simple but surprisingly effective Bayesian rule that classifies a subject as at risk of a trait if the posterior probability of the trait, given the individual’s genetic profile, is maximal (Hand, 2009). The classification rule can be built using a large number of genetic variants, such as single nucleotide polymorphisms (SNPs), by assuming that the SNPs are conditionally independent given the trait (Sebastiani et al., 2012). This assumption is often mistaken for “marginal independence,” but marginal and conditional independence are distinct properties and neither implies the other (Whittaker, 1990).

In this manuscript we show that there is a mathematical link between NBCs and logistic regression models that use a GRS to summarize the contribution of many SNPs to the susceptibility to a genetic disease. The link between these two approaches also highlights their limitations. We discuss how the directed graphical model underlying a NBC can be extended to include interactions between genes and/or environmental risk factors while keeping the computations scalable to genome-wide genotype data and even whole genome sequence data.

Methods and Results

We describe two approaches – logistic regression and Bayesian classifier – to define a classification score and a rule to be used for genetic risk prediction of a dichotomous trait denoted as T or “not T.” The classification score for genetic risk prediction is a function that maps a set of SNPs Σ = {S₁, …, S_k} into real numbers. The classification rule links the output of the score function to the events T or “not T.” Formally, with S denoting the space of SNPs and R the real numbers:

\begin{array}{ll} \text{Classification score:} & Sc (Σ) : S \to R \\ \text{Classification rule:} & Sc (Σ) > τ \Rightarrow \text{classify as } T \end{array}

Logistic regression with a genetic risk score

A logistic regression model that includes the general effects of k biallelic SNPs Σ = {S₁, …, S_k} to model the odds for a dichotomous trait T is defined by the logit equation:

\log \left(\frac{p (T | Σ)}{1 - p (T | Σ)}\right) = α_{0} + \sum_{j = 1}^{k} (α_{1 j} X_{j AB} + α_{2 j} X_{j BB})

where X_{j AB} = \begin{cases} 1 & \text{if } S_{j} \text{ genotype} = AB \\ 0 & \text{otherwise} \end{cases} and X_{j BB} = \begin{cases} 1 & \text{if } S_{j} \text{ genotype} = BB \\ 0 & \text{otherwise} \end{cases}

We assume that the alleles of the SNPs are ordered in lexicographical order (A < C < G < T), and A represents the first allele and B the second allele regardless of their frequency. The logit equation is the classification score that can be used to define a classification rule based on a threshold τ:

\begin{array}{ll} \text{Classification score:} & Sc (Σ) = \log \left(\frac{p (T | Σ)}{1 - p (T | Σ)}\right) = α_{0} + \sum_{j} (α_{1 j} X_{j AB} + α_{2 j} X_{j BB}) \\ \text{Classification rule:} & Sc (Σ) > τ \Rightarrow \text{classify as } T \end{array}

and τ can be determined to optimize sensitivity and specificity by receiver operating characteristic (ROC) curve analysis.
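As an illustration, the score and rule above can be sketched in a few lines of Python. The coefficient values, the genotype profile, and the function names `logistic_score` and `classify` are hypothetical and chosen only for this example; in practice the coefficients would be estimated from data and τ chosen by ROC analysis.

```python
import math

def logistic_score(genotypes, alpha0, alpha1, alpha2):
    """Logit score: alpha0 + sum_j (alpha1[j] * X_jAB + alpha2[j] * X_jBB)."""
    score = alpha0
    for g, a1, a2 in zip(genotypes, alpha1, alpha2):
        if g == "AB":
            score += a1
        elif g == "BB":
            score += a2
        # genotype AA is the referent and contributes nothing
    return score

def classify(score, tau):
    """Classification rule: classify as T when the score exceeds tau."""
    return "T" if score > tau else "not T"

# Hypothetical coefficients for k = 3 SNPs (illustration only)
alpha0 = -1.0
alpha1 = [0.2, -0.1, 0.4]   # log-odds ratios, AB versus AA
alpha2 = [0.5, -0.3, 0.8]   # log-odds ratios, BB versus AA

profile = ["AB", "AA", "BB"]
sc = logistic_score(profile, alpha0, alpha1, alpha2)
prob = 1.0 / (1.0 + math.exp(-sc))   # p(T | profile) implied by the score
label = classify(sc, tau=-0.5)
```

Because the logistic function is increasing, thresholding the score at τ is the same as thresholding the predicted probability at 1 / (1 + exp(−τ)).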

The coefficients of the logistic score are typically estimated by maximum likelihood (McCullagh and Nelder, 1989), or Bayesian methods using large sample approximations or Gibbs sampling (Balding, 2006). By definition, the intercept α₀ represents the log-odds for the trait T for the referent group with all SNP genotypes equal to AA, while each parameter α_1j represents the log-odds ratio for the trait T between the AB genotype and the AA genotype of the jth SNP, and each parameter α_2j represents the log-odds ratio for T between the BB and AA genotype of the jth SNP, assuming the other SNP genotypes are fixed. When α_2j = 2α_1j for all j = 1, …, k, then the logistic regression encodes the additive effects of the SNPs, and each parameter α_1j represents the log-odds ratio for T for each additional copy of the B allele relative to the referent genotype AA.

It is well known that when the data are from a case–control study design, the intercept does not provide the correct estimate of the odds for T in the population, and several corrections have been proposed to limit this problem (Jewell, 2003). Bias of the intercept term is not a problem when the logistic regression model is meant to be used for classification, because different intercepts simply shift the logistic function, and classification scores that differ only by the intercept term lead to equivalent classification rules. We state this property formally because it will be used later.

Property 1: Irrelevance of the intercept term of a logistic regression model for classification

Let Sc₁(Σ) and Sc₂(Σ) be two classification scores defined as:

\begin{array}{l} Sc_{1} (Σ) = \log \left(\frac{p (T | Σ)}{1 - p (T | Σ)}\right) = α_{0} + \sum_{j} (α_{1 j} X_{j AB} + α_{2 j} X_{j BB}) \\ Sc_{2} (Σ) = \log \left(\frac{p (T | Σ)}{1 - p (T | Σ)}\right) = β_{0} + \sum_{j} (α_{1 j} X_{j AB} + α_{2 j} X_{j BB}) \end{array}

The two classification scores can be used to define equivalent classification rules by using the relation:

\begin{array}{l} \text{“if } Sc_{1} (Σ) > τ \Rightarrow \text{classify as } T\text{” if and only if} \\ \text{“if } Sc_{2} (Σ) > τ + β_{0} - α_{0} \Rightarrow \text{classify as } T\text{”} \quad □ \end{array}

We note however that the correct estimate of the intercept term is necessary to be able to interpret the prediction from the logistic model in terms of prevalence of the trait in the population.
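Property 1 can be checked numerically. The sketch below uses hypothetical slopes, intercepts, and threshold for two SNPs, enumerates all nine genotype profiles, and compares the two rules; the helper name `score` is an assumption of this example.

```python
from itertools import product

def score(profile, intercept, a1, a2):
    """Logit score with a given intercept and shared genotype effects."""
    total = intercept
    for j, g in enumerate(profile):
        if g == "AB":
            total += a1[j]
        elif g == "BB":
            total += a2[j]
    return total

# Hypothetical values (illustration only)
a1, a2 = [0.3, -0.2], [0.7, -0.5]   # shared slopes for AB and BB genotypes
alpha0, beta0 = -2.0, 0.4           # the two scores differ only in the intercept
tau = -1.6                          # threshold used with the first score

profiles = list(product(["AA", "AB", "BB"], repeat=2))
rule1 = [score(p, alpha0, a1, a2) > tau for p in profiles]
rule2 = [score(p, beta0, a1, a2) > tau + beta0 - alpha0 for p in profiles]
same_classification = rule1 == rule2
```

Shifting the threshold by β₀ − α₀ compensates exactly for the change of intercept, so every profile receives the same label under both rules.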

One of the limitations of multivariate logistic regression is that the number of covariates is bounded above by the sample size. It is expected that many common genetic complex traits may be determined by hundreds of genetic variants (Kraft and Hunter, 2009), so that the sample size needed to build reliable logistic regression models for risk prediction can be prohibitively large.

A naïve but very popular alternative is to collapse the contribution of the k SNPs into a GRS to be used in a univariate logistic model. A GRS is typically defined as the weighted sum of the genotypes:

GRS = GRS (Σ) = \sum_{i = 1}^{k} (w_{i} X_{i AA} + v_{i} X_{i AB} + z_{i} X_{i BB})

with weights that can be appropriately chosen. The variables X_iAB and X_iBB are defined as above, and X_iAA = 1 if the ith SNP genotype is AA and 0 otherwise. See Table 1 for a summary of three possible weighting schemes. The GRS is then used as a risk factor to define a classification score using a univariate logistic regression:

Sc (Σ) = \log \left(\frac{p (T | GRS)}{1 - p (T | GRS)}\right) = γ_{0} + γ_{1} GRS

Table 1.

Example of choice of weights for the weighted genetic risk score.

Case	w_i	v_i	z_i	Comments
1	2δ(A = R)	1	2δ(B = R)	R denotes the risk allele and δ(X = Y) = 1 if X = Y is true and 0 otherwise.
2	0	$v_{i} = \log \frac{p (T \mid X_{i} = 1) / (1 - p (T \mid X_{i} = 1))}{p (T \mid X_{i} = 0) / (1 - p (T \mid X_{i} = 0))}$	2v_i	X_i = 0 when the ith SNP genotype is AA, and X_i = 1 when the genotype is AB. This is the standard coding for an additive model.
3	0	$\log \frac{p (T \mid S_{i} = AB) / (1 - p (T \mid S_{i} = AB))}{p (T \mid S_{i} = AA) / (1 - p (T \mid S_{i} = AA))}$	$\log \frac{p (T \mid S_{i} = BB) / (1 - p (T \mid S_{i} = BB))}{p (T \mid S_{i} = AA) / (1 - p (T \mid S_{i} = AA))}$	The two weights represent the log-odds ratios relative to the referent genotype AA. This is the coding for a genotypic model.

Case 1 is known as the “unweighted score” and case 2 is typically referred to as the “weighted genetic risk score.” Case 3 is the most general and flexible but it does not seem to be used.
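To make the three weighting schemes of Table 1 concrete, the following Python sketch computes the GRS for one genotype profile under each scheme. The odds ratios, risk alleles, and function names are hypothetical and serve only as an illustration; real weights would come from single-SNP association analyses.

```python
import math

def grs(profile, w, v, z):
    """GRS = sum_i (w_i * X_iAA + v_i * X_iAB + z_i * X_iBB)."""
    total = 0.0
    for g, wi, vi, zi in zip(profile, w, v, z):
        total += {"AA": wi, "AB": vi, "BB": zi}[g]
    return total

def case1_weights(risk_alleles):
    """Unweighted score: count copies of the risk allele ("A" or "B")."""
    w = [2.0 if r == "A" else 0.0 for r in risk_alleles]
    v = [1.0] * len(risk_alleles)
    z = [2.0 if r == "B" else 0.0 for r in risk_alleles]
    return w, v, z

def case2_weights(log_or_per_b_allele):
    """Additive model: v_i per copy of the B allele, so z_i = 2 * v_i."""
    v = list(log_or_per_b_allele)
    return [0.0] * len(v), v, [2.0 * x for x in v]

def case3_weights(log_or_ab, log_or_bb):
    """Genotypic model: separate log-odds ratios for AB and BB versus AA."""
    return [0.0] * len(log_or_ab), list(log_or_ab), list(log_or_bb)

# Hypothetical inputs for k = 3 SNPs (illustration only)
profile = ["AB", "AA", "BB"]
unweighted = grs(profile, *case1_weights(["B", "A", "B"]))
weighted = grs(profile, *case2_weights([math.log(1.2), math.log(0.8), math.log(1.5)]))
genotypic = grs(profile, *case3_weights(
    [math.log(1.1), math.log(0.9), math.log(1.3)],
    [math.log(1.6), math.log(0.7), math.log(1.4)],
))
```

In this example the unweighted score counts five risk alleles (one from the AB genotype, two from the AA genotype whose risk allele is A, and two from the BB genotype), while the two weighted scores depend on the size of each SNP's effect.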

Case 1

Although this is often referred to as the “unweighted genetic score,” the heterozygote genotype is always assigned weight 1, while the homozygous genotype for the risk allele is assigned weight 2 and the other homozygous genotype is assigned weight 0. By adopting this weighting scheme, we are simply counting the number of risk alleles each subject carries. The risk allele of each SNP is determined by a “one-SNP-at-a-time” association analysis, typically under an additive genetic model. Using the same notation and lexicographical order of the alleles that we used earlier, the risk allele of each SNP will be the A allele if the regression coefficient α_i of the logistic regression model

log (\frac{p (T | S_{i})}{1 - p (T | S_{i})}) = α_{0 i} + α_{i} (X_{i AB} + 2 X_{i BB})

is negative, and the B allele if α_i is positive. In the first case (α_i < 0), each copy of the B allele decreases the odds for T, while in the second case (α_i ≥ 0) each copy of the B allele increases the odds for T. With this definition, the GRS is only a function of the different number of risk alleles regardless of their individual genetic effects, and two identical GRS values can represent genetic profiles that are substantially different. See Figure 1 for an example.

**Example of GRS (case 1 and case 2 in Table 1) based on three SNPs associated with exceptional longevity**. The table on top reports the A/B alleles for the three SNPs, the frequencies of A allele in cases and controls, and the p-value for the additive model (Column PVAL.AA) and the odds ratio (OR) for exceptional longevity in carriers of the B allele. The two bottom panels show the calculations of the GRS with weights as in case 1 (left), and case 2 (right). Note that the GRS on the left is only a function of the different number of risk alleles regardless of their individual genetic effects, so the genetic profiles Σ₂, Σ₃, and Σ₄ have the same score while the case 2 GRS assigns different weights to non-referent genotypes and the scores are different. The profile Σ_R1 denotes the referent group in case 1, while Σ_R2 denotes the referent group in case 2. The data for this example are taken from Sebastiani et al. (2012).

The slope γ₁ in the classification score:

Sc (Σ) = \log \left(\frac{p (T | GRS)}{1 - p (T | GRS)}\right) = γ_{0} + γ_{1} GRS \qquad (1)

measures the association of the GRS with the trait T in terms of log-odds ratio for T between two GRS that differ by 1, and it is often estimated to test whether the GRS is significantly associated with T. However, the value of γ₁ is irrelevant for classification because two classification scores defined as in Eq. 1 that differ by the slope will produce equivalent classification rules. This is stated in the next property.

Property 2: Irrelevance of the slope of a univariate logistic regression model for classification

Let Sc₁(Σ) and Sc₂(Σ) be two classification scores defined as:

\begin{array}{l} Sc_{1} (Σ) = \log \left(\frac{p (T | GRS)}{1 - p (T | GRS)}\right) = γ_{0} + γ_{1} GRS \\ Sc_{2} (Σ) = \log \left(\frac{p (T | GRS)}{1 - p (T | GRS)}\right) = β_{0} + β_{1} GRS \end{array}

The two classification scores can be used to define equivalent classification rules by using the relation:

\begin{array}{l} \text{“} Sc_{1} (Σ) > τ \Rightarrow \text{classify as } T\text{”, if and only if} \\ \text{“} Sc_{2} (Σ) > β_{0} + β_{1} \frac{τ - γ_{0}}{γ_{1}} \Rightarrow \text{classify as } T\text{”} \quad □ \end{array}
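A small numerical check of Property 2 is sketched below. The intercepts, slopes, threshold, and GRS values are hypothetical, and the sketch assumes that both slopes are positive; if the slopes had opposite signs, the direction of the inequality would reverse.

```python
# Hypothetical parameters (illustration only); both slopes are positive
gamma0, gamma1 = -1.0, 0.5
beta0, beta1 = 2.0, 3.0
tau = 0.2   # threshold used with the first score

grs_values = [0.0, 1.0, 2.0, 2.5, 3.0, 4.0, 5.0]

# Threshold for the second score, as given in Property 2
tau2 = beta0 + beta1 * (tau - gamma0) / gamma1

rule1 = [gamma0 + gamma1 * g > tau for g in grs_values]
rule2 = [beta0 + beta1 * g > tau2 for g in grs_values]
same_classification = rule1 == rule2
```

Both rules reduce to the same condition on the GRS itself, namely GRS > (τ − γ₀) / γ₁, which is why the intercept and slope do not matter for classification.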

The GRSs labeled 2 and 3 in Table 1 weight SNP alleles in different ways to reflect their individual associations with the trait T.

Case 2

The GRS can be written as:

GRS = \sum_{i = 1}^{k} v_{i} (X_{i AB} + 2 X_{i BB})

where each weight v_i is the maximum likelihood estimate of the regression coefficient in the univariate logistic regression:

\log \left(\frac{p (T | X_{i})}{1 - p (T | X_{i})}\right) = α_{i 0} + v_{i} X_{i}; \quad X_{i} = \begin{cases} 1 & \text{if } S_{i} = AB \\ 2 & \text{if } S_{i} = BB \\ 0 & \text{otherwise} \end{cases}

that measures the association between SNP S_i and the trait T with an additive genetic model. Therefore, each weight

v_{i} = \log \frac{\frac{p (T | X_{i} = 1)}{1 - p (T | X_{i} = 1)}}{\frac{p (T | X_{i} = 0)}{1 - p (T | X_{i} = 0)}}

estimates the log-odds ratio for T for each copy of the B allele in an additive genetic model. Note that this formulation of the GRS does not require the specification of the risk allele of the SNPs, and the weighted genetic score will increase by v_i for each copy of the B allele of SNP S_i, if this is a risk allele, and decrease by v_i for each copy of the B allele if this is the protective allele. See the example in Figure 1.

The classification score based on this GRS is computed using the logistic regression in Eq. 1, with parameters γ₀, γ₁ that can be estimated by maximum likelihood or Bayesian methods. The slope represents the log-odds ratio for T for a unit change of the GRS. In general, the log-odds ratio for T between two genetic profiles Σ₁ = {S₁₁, …, S_k1} and Σ₂ = {S₁₂, …, S_k2} associated with GRS₁ and GRS₂ is

\begin{array}{l} \log \left(\frac{p (T | {GRS}_{1}) / (1 - p (T | {GRS}_{1}))}{p (T | {GRS}_{2}) / (1 - p (T | {GRS}_{2}))}\right) \\ = γ_{1} \sum_{i = 1}^{k} \log \left(\frac{p (T | S_{i 1}) / (1 - p (T | S_{i 1}))}{p (T | S_{i 2}) / (1 - p (T | S_{i 2}))}\right) \end{array}

and this equation shows that the log-odds ratio for T between two weighted GRSs is an average of log-odds ratios of the individual genetic effects rescaled by the coefficient γ₁.

The classification rule

if S c_{1} (Σ) = log (\frac{p (T | GRS)}{1 - p (T | GRS)}) > τ \Rightarrow classify as T,

based on the score

S c_{1} (Σ) = log (\frac{p (T | GRS)}{1 - p (T | GRS)}) = γ_{0} + γ_{1} GRS

is equivalent to:

\begin{array}{l} \text{if } \sum_{i = 1}^{k} \log \frac{p (T | S_{i}) / (1 - p (T | S_{i}))}{p (T | S_{i} = AA) / (1 - p (T | S_{i} = AA))} > \frac{τ - γ_{0}}{γ_{1}} \\ \Rightarrow \text{classify as } T \end{array}

So the classification rule that uses the weighted GRS in case 2 is essentially based on an average of the individual log-odds ratio for T of each SNP genotype relative to the referent genotypes.
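The step from the logistic score to this rule can be made explicit. The following is a sketch of the algebra, assuming γ₁ > 0 (an assumption not stated in the text; for γ₁ < 0 the inequality reverses) and using the single-SNP additive model, in which X_i = 0 for the referent genotype AA:

```latex
\begin{aligned}
v_i X_i &= \log\frac{p(T \mid S_i)}{1 - p(T \mid S_i)}
         - \log\frac{p(T \mid S_i = AA)}{1 - p(T \mid S_i = AA)}, \\
GRS = \sum_{i=1}^{k} v_i X_i
    &= \sum_{i=1}^{k} \log
       \frac{p(T \mid S_i) / (1 - p(T \mid S_i))}
            {p(T \mid S_i = AA) / (1 - p(T \mid S_i = AA))}, \\
\gamma_0 + \gamma_1\, GRS > \tau
    &\iff GRS > \frac{\tau - \gamma_0}{\gamma_1}.
\end{aligned}
```

The first line follows from evaluating the single-SNP logit at genotype S_i and at AA and taking the difference, which cancels the intercept α_i0.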

Case 3

The GRS is:

GRS = \sum_{i = 1}^{k} (v_{i} X_{i AB} + z_{i} X_{i BB})

where v_i and z_i are the maximum likelihood estimates of the regression coefficients of the univariate logistic regression

\begin{array}{l} \log \left(\frac{p (T | S_{i})}{1 - p (T | S_{i})}\right) = α_{i 0} + v_{i} X_{i AB} + z_{i} X_{i BB}; \\ X_{i AB} = \begin{cases} 1 & \text{if } S_{i} = AB \\ 0 & \text{otherwise} \end{cases}; \quad X_{i BB} = \begin{cases} 1 & \text{if } S_{i} = BB \\ 0 & \text{otherwise} \end{cases} \end{array}

that measures the genotypic association between SNP S_i and the trait T. Therefore

v_{i} = log \frac{\frac{p (T | S_{i} = AB)}{1 - p (T | S_{i} = AB)}}{\frac{p (T | S_{i} = AA)}{1 - p (T | S_{i} = AA)}}; z_{i} = log \frac{\frac{p (T | S_{i} = BB)}{1 - p (T | S_{i} = BB)}}{\frac{p (T | S_{i} = AA)}{1 - p (T | S_{i} = AA)}}

are the log-odds ratios for T between the AB and AA genotypes, and between the BB and AA genotypes. See Figure 2 for an example. The classification score and classification rule are derived as in case 2 and can be interpreted as an average of the log-odds ratios of individual SNP genotypes. Compared to case 2, the weights based on genotypic associations allow for a more general model of association that is not restricted to a linear increase of the log-odds for T. Note also that when the SNPs included in a GRS (case 2 and 3) are independent, the two scores should be approximately equivalent to multivariate logistic regression with additive (case 2) or genotypic association (case 3). In addition, if the SNPs included in the GRS have similar effects, then the GRS in case 1 and 2 should be approximately equivalent.
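Because the single-SNP genotypic model has one parameter per genotype, its maximum likelihood estimates coincide with the empirical log-odds ratios computed from the table of genotype counts in cases and controls. The sketch below illustrates this with hypothetical counts; the function name is an assumption of this example, and all counts are assumed to be positive.

```python
import math

def genotypic_log_odds_ratios(case_counts, control_counts):
    """Return (v, z): empirical log-odds ratios for AB and BB versus AA.

    case_counts and control_counts are dicts with keys "AA", "AB", "BB".
    """
    def log_odds(genotype):
        # empirical log-odds of being a case among carriers of this genotype
        return math.log(case_counts[genotype] / control_counts[genotype])

    v = log_odds("AB") - log_odds("AA")
    z = log_odds("BB") - log_odds("AA")
    return v, z

# Hypothetical genotype counts for one SNP (illustration only)
cases = {"AA": 100, "AB": 150, "BB": 50}
controls = {"AA": 150, "AB": 120, "BB": 30}
v, z = genotypic_log_odds_ratios(cases, controls)
```

With these counts the odds ratios are 1.875 for AB versus AA and 2.5 for BB versus AA, a pattern that an additive model with a single per-allele effect would not reproduce exactly.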

**Example of GRS (case 3 in Table 1)**. The table on top reports the A/B alleles for the three SNPs, the frequencies of the A allele in cases and controls, the odds ratio for exceptional longevity in carriers of the AB genotype relative to carriers of the AA genotype (OR.AB.AA), and the odds ratio for exceptional longevity in carriers of the BB genotype relative to carriers of the AA genotype (OR.BB.AA). The bottom panel shows the calculations of the GRS with weights as in case 3. The profile Σ_R denotes the referent group.

Naïve Bayes classifiers

The classification score based on a NBC is the posterior probability of the trait T that is calculated using the formula:

Sc (Σ) = p (T | Σ) = \frac{p (T) \prod_{i = 1}^{k} p (S_{i} | T)}{p (T) \prod_{i = 1}^{k} p (S_{i} | T) + (1 - p (T)) \prod_{i = 1}^{k} p (S_{i} | \text{not } T)}

where p(T) and 1 − p(T) are the prior probabilities of having the trait T or not. The conditional probabilities p(S_i | T) and p(S_i | not T) represent the distribution of the ith SNP genotype in subjects with and without the trait T. They are typically estimated assuming genotypic association (Sebastiani et al., 2012), but they could also be estimated using an additive genetic model. The formula is derived using Bayes’ theorem and assuming that the SNPs are independent, conditionally on T (Hand, 2009). The usual Bayesian classification rule is to classify a subject with the most probable outcome

if Sc (Σ) > 0.5 \Rightarrow classify as T .

This rule is based on a 0–1 loss that assigns the same weight to misclassification errors. A general loss function that weights differently sensitivity and specificity would lead to the classification rule:

if Sc (Σ) > \frac{λ}{1 + λ} \Rightarrow classify as T for λ > 0

that can also be written as:

Sc (Σ) > \frac{λ}{1 + λ} \Leftrightarrow log (\frac{p (T | Σ)}{1 - p (T | Σ)}) > log (λ)

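and simple algebra shows that this is equivalent to a rule on the sum of the single-SNP log-odds. Because p(T | S_i) / (1 − p(T | S_i)) = p(T) p(S_i | T) / [(1 − p(T)) p(S_i | not T)], we have

\log \left(\frac{p (T | Σ)}{1 - p (T | Σ)}\right) = \log \left(\frac{p (T) \prod_{i = 1}^{k} p (S_{i} | T)}{(1 - p (T)) \prod_{i = 1}^{k} p (S_{i} | \text{not } T)}\right) = \sum_{i = 1}^{k} \log \left(\frac{p (T | S_{i})}{1 - p (T | S_{i})}\right) - (k - 1) \log \left(\frac{p (T)}{1 - p (T)}\right)

The last term does not depend on the genotypes and vanishes when p(T) = 0.5, so it only shifts the threshold. Absorbing it into log(λ), the rule becomes

\sum_{i = 1}^{k} \log \left(\frac{p (T | S_{i})}{1 - p (T | S_{i})}\right) > \log (λ)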

As long as the log-odds ratios are calculated using the same genetic model, this classification rule is equivalent to the classification rule based on the GRS (either case 2 or 3)

\begin{array}{rcl} if \sum_{i = 1}^{k} log \frac{p (T | S_{i}) ∕ (1 - p (T | S_{i}))}{p (T | S_{i} = AA) ∕ (1 - p (T | S_{i} = AA))} > \frac{τ - γ_{0}}{γ_{1}} \\ \Rightarrow classify as T \end{array}

by setting the threshold

τ = γ_{0} - γ_{1} \sum_{i = 1}^{k} log (\frac{p (T | S_{i} = AA)}{1 - p (T | S_{i} = AA)}) + γ_{1} log (λ)

We state this relation formally.

Property 3: Equivalence of classification rules based on the GRS and the NBC

The classification rules based on a logistic model of a GRS (case 2 or 3) and a NBC are equivalent when the same genetic models are used to link individual SNPs to the trait.

The details of the algebraic manipulations are in Section “Appendix.”□
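The equivalence can also be checked numerically. The sketch below is an illustration under stated assumptions: the genotype distributions in subjects with and without T and the prior are hypothetical, the case 3 weights are computed from these known distributions rather than estimated from data, and all function names are introduced here for the example. It verifies that the NBC log-odds and the case 3 GRS differ by the same constant for every genotype profile, so that a threshold on one is a shifted threshold on the other.

```python
import math
from itertools import product

def logit(p):
    return math.log(p / (1.0 - p))

def nbc_posterior(profile, prior, p_case, p_ctrl):
    """p(T | profile), assuming SNPs are conditionally independent given T."""
    num, den = prior, 1.0 - prior
    for i, g in enumerate(profile):
        num *= p_case[i][g]   # p(S_i = g | T)
        den *= p_ctrl[i][g]   # p(S_i = g | not T)
    return num / (num + den)

def single_snp_logit(g, prior, pc, pn):
    """Log-odds of T given a single SNP genotype g, by Bayes' theorem."""
    post = prior * pc[g] / (prior * pc[g] + (1.0 - prior) * pn[g])
    return logit(post)

def grs_case3(profile, prior, p_case, p_ctrl):
    """Sum of genotypic log-odds ratios relative to AA (case 3 weights)."""
    return sum(
        single_snp_logit(g, prior, p_case[i], p_ctrl[i])
        - single_snp_logit("AA", prior, p_case[i], p_ctrl[i])
        for i, g in enumerate(profile)
    )

# Hypothetical genotype distributions for k = 2 SNPs (illustration only)
p_case = [{"AA": 0.5, "AB": 0.4, "BB": 0.1}, {"AA": 0.3, "AB": 0.5, "BB": 0.2}]
p_ctrl = [{"AA": 0.6, "AB": 0.3, "BB": 0.1}, {"AA": 0.4, "AB": 0.4, "BB": 0.2}]
prior = 0.3

offsets = [
    logit(nbc_posterior(p, prior, p_case, p_ctrl))
    - grs_case3(p, prior, p_case, p_ctrl)
    for p in product(["AA", "AB", "BB"], repeat=2)
]
spread = max(offsets) - min(offsets)   # zero up to rounding error
```

Changing `prior` changes the common offset but not the spread, so the prior only moves the classification threshold.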

Note that the equivalence between the classification rules based on a NBC and a logistic regression model with a GRS as in case 2 or 3 is a simple consequence of the fact that both models base the prediction on a weighted average of ORs of the individual SNPs. This equivalence is independent of the choice of the prior for T because different prior distributions will lead to equivalent classification rules but with different classification thresholds. Also, the equivalence of classification rules based on GRS and NBC implies that when alternative classifiers are compared by the area under the ROC curve (AUC) they must reach the same value. This is shown in the next example.

Example

To demonstrate the connection between the NBC and the GRS in case 3, we performed a simple simulation. We simulated a dataset with 3000 cases and 3000 controls, and genotype data from 75 causal SNPs and 500,000 null SNPs. For the null SNPs, we randomly selected frequencies of the minor allele (p) from a uniform (0.05, 0.5) distribution and genotype frequencies were generated assuming Hardy–Weinberg equilibrium [p², 2p(1 − p), (1 − p)²]. The causal SNPs were simulated with ORs of 1.2, 1.3, 1.4, 1.5, and 1.6 and minor allele frequencies (MAFs) of 0.1, 0.2, 0.3, 0.4, and 0.5. A causal SNP was simulated for each combination of the above ORs and MAFs (25 combinations) under an additive, recessive, and dominant mode of inheritance (25 combinations × 3 modes of inheritance = 75 SNPs). The genotype frequencies in controls were generated to follow Hardy–Weinberg equilibrium [p², 2p(1 − p), (1 − p)²]. The genotype frequencies in cases for the additive, recessive, and dominant models were [p², 2ORp(1 − p), OR²(1 − p)²], [p², 2p(1 − p), OR(1 − p)²], and [p², 2ORp(1 − p), OR(1 − p)²], respectively. For the cases, the genotype frequencies were divided by the sum of the frequencies so that the frequencies add up to 1. Using the genotype frequencies for each SNP, we simulated a discovery set of 3000 cases and 3000 controls and a replication set with the same sample sizes.
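The genotype frequencies described above can be sketched as follows. This is a minimal illustration for a single causal SNP rather than the code used for the simulation: the function names, the random seed, and the coding of genotypes as 0, 1, 2 (indexing the three entries of the frequency vectors in the order given in the text) are assumptions of this example.

```python
import random

def control_freqs(p):
    """Hardy-Weinberg genotype frequencies [p^2, 2p(1-p), (1-p)^2]."""
    return [p * p, 2 * p * (1 - p), (1 - p) ** 2]

def case_freqs(p, odds_ratio, mode):
    """Genotype frequencies in cases, renormalized to sum to 1."""
    if mode == "additive":
        mult = [1.0, odds_ratio, odds_ratio ** 2]
    elif mode == "recessive":
        mult = [1.0, 1.0, odds_ratio]
    elif mode == "dominant":
        mult = [1.0, odds_ratio, odds_ratio]
    else:
        raise ValueError(f"unknown mode of inheritance: {mode}")
    raw = [f * m for f, m in zip(control_freqs(p), mult)]
    total = sum(raw)
    return [r / total for r in raw]

def simulate_genotypes(freqs, n, rng):
    """Draw n genotypes coded 0, 1, 2 with the given frequencies."""
    return rng.choices([0, 1, 2], weights=freqs, k=n)

rng = random.Random(2012)   # arbitrary seed, for reproducibility only
controls = simulate_genotypes(control_freqs(0.3), 3000, rng)
cases = simulate_genotypes(case_freqs(0.3, 1.5, "additive"), 3000, rng)
```

Setting the odds ratio to 1 gives the same frequencies in cases and controls, which corresponds to a null SNP.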

The data in the discovery set were analyzed to generate genetic risk models based on GRS and NBCs in the following way. A Bayesian genome-wide association study was performed on the discovery set and SNPs were ordered according to the posterior probability for the genotypic association to build nested NBCs with increasing number of SNPs as in Sebastiani et al. (2012). To obtain the weights for the three GRSs, we ran two logistic regression models for each SNP, using an additive mode of inheritance and a genotypic mode of inheritance. The results of these analyses were used to detect the risk alleles of SNPs for nested GRS as in case 1; and to estimate the weights of GRS as in cases 2 and 3. Using SNPs ordered by the posterior probability for the genotypic association, we then built three sets of classification models based on logistic regression and the three different GRS, with increasing number of SNPs. The prediction models were tested on the replication set to avoid issues of over-fitting. The simulation described above was repeated five times and the mean AUC across the replicates was used to assess accuracy.

Figure 3 (left panel) shows the mean AUC across five replicates for the NBCs and logistic regression models for different GRSs, with increasing number of SNPs. As expected based on our mathematical calculations, the AUCs of the genetic risk models based on the NBCs and the GRSs with genotypic weights are identical (Figure 3, left panel), and the predicted probabilities are almost identical (Figure 3, right panel). The weighted and unweighted GRS using an additive mode of inheritance have lower AUCs, demonstrating the loss of accuracy from assuming additivity when some of the SNPs do not follow an additive mode of inheritance. Of course, if all SNPs do in fact follow an additive mode of inheritance, the genotypic and additive prediction models would perform similarly. The trend of the AUC shows that accuracy keeps increasing as true positive SNPs are included in the model, and then declines when each classification model starts including false positive SNPs. The decline is more evident for the case 1 GRS, while both weighted GRS based on additive or genotypic associations appear to be more robust.
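The identity of the AUCs follows from the fact that the AUC depends only on how subjects are ranked by the score. The sketch below, with hypothetical scores and a hypothetical helper `auc`, computes the AUC as the proportion of case–control pairs in which the case has the higher score (ties counted as one half), and shows that an increasing linear transformation of the score leaves it unchanged.

```python
def auc(scores_cases, scores_controls):
    """Proportion of (case, control) pairs ranked correctly; ties count 1/2."""
    wins = 0.0
    for sc in scores_cases:
        for sn in scores_controls:
            if sc > sn:
                wins += 1.0
            elif sc == sn:
                wins += 0.5
    return wins / (len(scores_cases) * len(scores_controls))

# Hypothetical scores (illustration only)
case_scores = [2.0, 3.5, 1.0, 4.0]
control_scores = [1.5, 0.5, 3.0, 1.0]

auc_original = auc(case_scores, control_scores)

# An increasing linear transformation, e.g. a different intercept and slope
auc_transformed = auc(
    [2.0 * s - 3.0 for s in case_scores],
    [2.0 * s - 3.0 for s in control_scores],
)
```

Here 12.5 of the 16 case–control pairs are ranked correctly, both before and after the transformation.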

**Results of simulation for replication set**. The left hand plot graphs the mean area under the ROC (AUC) versus the number of SNPs in the prediction model. The colored lines refer to the AUC of the NBC (black), the unweighted GRS from an additive model (case 1, blue), the weighted GRS from an additive model (case 2, green), and the weighted GRS from a genotypic model (case 3, red). The maximum AUC occurs at 45 SNPs. The right hand plot graphs the probability of the trait T given the weighted GRS (genotypic model) on the y-axis versus the probability of disease given the SNP profile estimated by NBC on the x-axis for a model containing 45 SNPs for one of the replicates. The right hand plot is similar across the five replicates.

Discussion

One of the selling points of genome-wide association studies was to discover genetic variants that are associated with increased susceptibility for disease and could be used for personalized diagnosis and prognosis. Initial results, published for example in Meigs et al. (2008) and Paynter et al. (2010), however showed that genetic data added limited predictive value to well-established risk factors of Type II diabetes and cardiovascular disease. These initial studies limited the attention to those SNPs that reached genome-wide significance, and their effect was summarized into a GRS. Since then, a growing body of literature has shown the increased value of deeper mining of genome-wide association studies, but the inclusion of large numbers of SNPs in genetic risk models has continued to rely on GRSs (Cui, 2009; Goddard et al., 2009; Kooperberg et al., 2009; Purcell et al., 2009; Yang et al., 2010; Chen et al., 2011; Chibnik et al., 2011), while machine learning type methods continue to be rare despite some successful applications (Wei et al., 2009; Okser et al., 2010; Kang et al., 2011b; Sebastiani et al., 2012).

Our study shows that risk prediction based on a GRS is mathematically equivalent to risk prediction based on a NBC, when the same SNPs with the same mode of inheritance are used in the models. The equivalence is based on the fact that both models essentially base the prediction on a weighted average of ORs of the individual SNPs. While this equivalence establishes the validity of methods based on the NBC for genetic risk prediction, and we hope it will contribute to making this approach more popular in this field, it also shows that, contrary to what is stated in Okser et al. (2010), a NBC does not include interactions of SNPs but only additive genetic effects. However, the directed graphical model underlying a NBC can be extended to more general structures to include interactions between genes and/or environmental risk factors while keeping the computations scalable to genome-wide genotype data and even whole genome sequence data (Sebastiani and Perls, 2008).

Figure 4 shows some ways to extend NBCs for risk prediction to include population ancestry, as well as genetic and non-genetic effects that may be missed by tests for marginal associations. Figure 4A describes a directed acyclic graph (DAG) with one parent node (T) and two children nodes (X₁ and X₂) that may represent SNPs. The DAG describes the conditional independence of X₁ and X₂ given T. This type of DAG with one root node and multiple conditionally independent children represents a NBC (Sebastiani and Abad-Grau, 2007). The DAG in Figure 4B extends the NBC in Figure 4A with an additional node X₃ that is marginally independent of T, but conditionally dependent on T given X₂. In the context of genetic risk modeling, the node X₃ could represent a non-genetic risk factor that is associated with a trait T only in specific genetic backgrounds (the node X₂). The DAG in Figure 4C includes an additional node X₄ that is conditionally independent of all other nodes given X₁. This additional node may represent a gene × gene interaction that is induced by linkage disequilibrium. Note that both DAGs in Figures 4B,C would give the same classification score for T, because of the independence of T from X₄ given X₁. So, the DAG in Figure 4C would be useful for a better explanation of the biology rather than for improving genetic risk prediction. Finally, the DAG in Figure 4D extends the DAG in Figure 4B by adding a link from T to X₃. The inclusion of this link makes the node X₃ marginally dependent on T, and the interaction between X₂ and X₃ changes the classification score compared to the DAG in Figure 4B.
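The claim that a node such as X₄ does not change the classification score can be illustrated by direct enumeration. The sketch below is a simplification under stated assumptions: all variables are binary, the node X₃ is omitted, and the probability tables are hypothetical. It compares the posterior probability of T computed with and without observing X₄ in a DAG where T is the parent of X₁ and X₂, and X₁ is the parent of X₄.

```python
from itertools import product

# Hypothetical probability tables (illustration only); all variables binary
p_t = 0.3
p_x1_given_t = {1: [0.2, 0.8], 0: [0.6, 0.4]}    # p(X1 = x | T = t), indexed by x
p_x2_given_t = {1: [0.5, 0.5], 0: [0.7, 0.3]}    # p(X2 = x | T = t), indexed by x
p_x4_given_x1 = {0: [0.9, 0.1], 1: [0.3, 0.7]}   # p(X4 = x | X1 = x1), indexed by x

def joint(t, x1, x2, x4):
    """Joint probability factorized along the DAG."""
    pt = p_t if t == 1 else 1.0 - p_t
    return pt * p_x1_given_t[t][x1] * p_x2_given_t[t][x2] * p_x4_given_x1[x1][x4]

def posterior_with_x4(x1, x2, x4):
    num = joint(1, x1, x2, x4)
    return num / (num + joint(0, x1, x2, x4))

def posterior_without_x4(x1, x2):
    # marginalize X4 out of the joint distribution
    num = sum(joint(1, x1, x2, x4) for x4 in (0, 1))
    den = num + sum(joint(0, x1, x2, x4) for x4 in (0, 1))
    return num / den

max_difference = max(
    abs(posterior_with_x4(x1, x2, x4) - posterior_without_x4(x1, x2))
    for x1, x2, x4 in product((0, 1), repeat=3)
)
```

The factor p(X₄ | X₁) appears in both the numerator and the denominator of the posterior and cancels, so observing X₄ adds nothing once X₁ is known.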

**Examples of directed acyclic graph (DAG)**. All nodes are random variables and the DAG represents Markov properties of marginal and conditional independence (Lauritzen and Sheehan, 2004). In particular, the global Markov property states that a node is independent of all other nodes in the DAG given its parent nodes, its children nodes and additional parents of its children (Lauritzen and Sheehan, 2004). In addition, two nodes are marginally independent when they have no directed joining paths after their children are dropped. Therefore, the nodes X₁ and X₂ in the DAG in **(A)** are conditionally independent given T. The DAG in **(B)** adds the node X₃ to the NBC in **(A)**. This additional node is marginally independent of T but conditionally dependent on T given X₂. The DAG in **(C)** includes an additional node X₄ that is conditionally independent of all other nodes given X₁. Finally, the DAG in **(D)** extends the DAG in **(B)** by adding a link from T to X₃ so that X₃ and T are marginally dependent.

In addition, and most importantly, the fact that all variables in a DAG are random provides a sound framework for marginal and conditional inference. For example, a genetic risk model based on a DAG can be used for predicting the outcome of a subject by marginalizing out unobserved variables (Solovieff et al., 2011).

Our analysis is limited to binary outcomes, but we expect that similar results hold when the outcome to be predicted is a quantitative trait that follows a normal distribution. Furthermore, our analysis shows that linear transformations of a GRS do not impact predictive accuracy, and similarly, that the predictive accuracy of a NBC cannot be changed by a choice of prior for T. Improving the accuracy can be accomplished by selection of the most predictive SNPs and by choosing alternative weights to calculate the GRS. There is no obvious similar choice for a NBC. However, a closely related approach that we used in Sebastiani et al. (2012) to improve the predictive accuracy is to use an ensemble of nested NBCs. Finally, the machine learning community has developed many feature selection algorithms for building classifiers (Hastie et al., 2009) that, by the equivalence proved in this paper, may prove to be useful to generate better genetic risk models.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

Research supported with funds from NIH/NHLBI R01 HL089655-03 and R21HL114237 (Paola Sebastiani). We thank Dr. Maria Abad Grau for stimulating discussion during her visit at Boston University in August 2011 that triggered the writing of this manuscript.

Appendix

Derivation of property 3

\begin{array}{l} \sum_{i = 1}^{k} \log \frac{p (T | S_{i}) / (1 - p (T | S_{i}))}{p (T | S_{i} = AA) / (1 - p (T | S_{i} = AA))} > \frac{τ - γ_{0}}{γ_{1}} \Rightarrow \text{classify as } T \\ \text{if and only if} \\ \sum_{i = 1}^{k} \log \left(\frac{p (T | S_{i})}{1 - p (T | S_{i})}\right) - \sum_{i = 1}^{k} \log \left(\frac{p (T | S_{i} = AA)}{1 - p (T | S_{i} = AA)}\right) > \frac{τ - γ_{0}}{γ_{1}} \Rightarrow \text{classify as } T \\ \text{if and only if} \\ \sum_{i = 1}^{k} \log \left(\frac{p (T | S_{i})}{1 - p (T | S_{i})}\right) > \frac{τ - γ_{0}}{γ_{1}} + \sum_{i = 1}^{k} \log \left(\frac{p (T | S_{i} = AA)}{1 - p (T | S_{i} = AA)}\right) \Rightarrow \text{classify as } T \\ \text{if and only if} \\ \sum_{i = 1}^{k} \log \left(\frac{p (T | S_{i})}{1 - p (T | S_{i})}\right) > \log (λ) \\ \text{where } \log (λ) = \frac{τ - γ_{0}}{γ_{1}} + \sum_{i = 1}^{k} \log \left(\frac{p (T | S_{i} = AA)}{1 - p (T | S_{i} = AA)}\right) \\ \text{and } τ = γ_{0} - γ_{1} \sum_{i = 1}^{k} \log \left(\frac{p (T | S_{i} = AA)}{1 - p (T | S_{i} = AA)}\right) + γ_{1} \log (λ) \end{array}

References

Balding D. J. (2006). A tutorial on statistical methods for population association studies. Nat. Rev. Genet. 7, 781–791. doi: 10.1038/nrg1916
Bureau A., Dupuis J., Falls K., Lunetta K. L., Hayward B., Keith T. P., Van Eerdewegh P. (2005). Identifying SNPs predictive of phenotype using random forests. Genet. Epidemiol. 28, 171–182. doi: 10.1002/gepi.20041
Chen H., Poon A., Yeung C., Helms C., Pons J., Bowcock A. M., Kwok P. Y., Liao W. (2011). A genetic risk score combining ten psoriasis risk loci improves disease prediction. PLoS ONE 6, e19454. doi: 10.1371/journal.pone.0019454
Chibnik L. B., Keenan B. T., Cui J., Liao K. P., Costenbader K. H., Plenge R. M., Karlson E. W. (2011). Genetic risk score predicting risk of rheumatoid arthritis phenotypes and age of symptom onset. PLoS ONE 6, e24380. doi: 10.1371/journal.pone.0024380
Cui J. (2009). Overview of risk prediction models in cardiovascular disease research. Ann. Epidemiol. 19, 711–717. doi: 10.1016/j.annepidem.2009.05.005
Goddard M. E., Wray N. R., Verbyla K., Visscher P. M. (2009). Estimating effects and making predictions from genome-wide marker data. Stat. Sci. 24, 517–529. doi: 10.1214/09-STS306
Hand D. J. (2009). “Naive Bayes,” in The Top Ten Algorithms in Data Mining, eds Wu X., Kumar V. (London: Chapman and Hall), 163–178.
Hastie T., Tibshirani R., Friedman J. H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer.
Jewell R. (2003). Statistics for Epidemiology. Boca Raton: CRC/Chapman and Hall.
Jiang X., Barmada M. M., Cooper G. F., Becich M. J. (2011). A Bayesian method for evaluating and discovering disease loci associations. PLoS ONE 6, e22075. doi: 10.1371/journal.pone.0022075
Kang J., Zheng W., Li L., Lee J., Yan X., Zhao H. (2011a). Use of Bayesian networks to dissect the complexity of genetic disease: application to the Genetic Analysis Workshop 17 simulated data. BMC Proc. 5(Suppl. 9), S37. doi: 10.1186/1753-6561-5-S9-S37
Kang J., Kugathasan S., Georges M., Zhao H., Cho J. H. (2011b). Improved risk prediction for Crohn’s disease with a multi-locus approach. Hum. Mol. Genet. 20, 2435–2442. doi: 10.1093/hmg/ddr116
Kooperberg C., LeBlanc M., Obenchain V. (2009). Risk prediction using genome-wide association studies. Genet. Epidemiol. 34, 643–652. doi: 10.1002/gepi.20509
Kraft P., Hunter D. J. (2009). Genetic risk prediction – are we there yet? N. Engl. J. Med. 360, 1701–1703. doi: 10.1056/NEJMp0810107
Lauritzen S. L., Sheehan N. A. (2004). Graphical models for genetic analysis. Stat. Sci. 18, 489–514.
McCullagh P., Nelder J. (1989). Generalized Linear Models. London: Chapman and Hall.
McKinney B. A., Reif D. M., Ritchie M. D., Moore J. H. (2006). Machine learning for detecting gene-gene interactions: a review. Appl. Bioinformatics 5, 77–88. doi: 10.2165/00822942-200605020-00002
Meigs J. B., Shrader P., Sullivan L. M., McAteer J. B., Fox C. S., Dupuis J., Manning A. K., Florez J. C., Wilson P. W., D’Agostino R. B., Sr., Cupples L. A. (2008). Genotype score in addition to common risk factors for prediction of type 2 diabetes. N. Engl. J. Med. 359, 2208–2219. doi: 10.1056/NEJMoa0804742
Moore J. H., Gilbert J. C., Tsai C. T., Chiang F. T., Holden T., Barney N., White B. C. (2006). A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. J. Theor. Biol. 241, 252–261. doi: 10.1016/j.jtbi.2005.11.036
Okser S., Lehtimaki T., Elo L. L., Mononen N., Peltonen N., Kahonen M., Juonala M., Fan Y. M., Hernesniemi J. A., Laitinen T., Lyytikainen L. P., Rontu R., Eklund C., Hutri-Kahonen N., Taittonen L., Hurme M., Viikari J. S., Raitakari O. T., Aittokallio T. (2010). Genetic variants and their interactions in the prediction of increased pre-clinical carotid atherosclerosis: the cardiovascular risk in young Finns study. PLoS Genet. 6, e1001146. doi: 10.1371/journal.pgen.1001146
Paynter N. P., Chasman D. I., Pare G., Buring J. E., Cook N. R., Miletich J. P., Ridker P. M. (2010). Association between a literature-based genetic risk score and cardiovascular events in women. JAMA 303, 631–637. doi: 10.1001/jama.2010.119
Purcell S. M., Wray N. R., Stone J. L., Visscher P. M., O’Donovan M. C., Sullivan P. F., Sklar P. (2009). Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 460, 748–752.
Rodin A. S., Boerwinkle E. (2005). Mining genetic epidemiology data with Bayesian networks I: Bayesian networks and example application (plasma apoE levels). Bioinformatics 21, 3273–3278.
Sebastiani P., Abad-Grau M. (2007). “Bayesian networks for genetic analysis,” in Bioinformatics: An Engineering Case-Based Approach, eds Alterovitz G., Ramoni M. F. (Cambridge, MA: Artech House), 205–228.
Sebastiani P., Perls T. T. (2008). “Complex genetic models,” in Bayesian Networks, eds Pourret O., Naïm P., Marcot B. (Chichester: John Wiley & Sons), 53–72.
Sebastiani P., Perls T. T. (2010). Prediction models that include genetic data. Circ. Cardiovasc. Genet. 3, 1–2. doi: 10.1161/CIRCGENETICS.109.933614
Sebastiani P., Ramoni M. F., Nolan V., Baldwin C. T., Steinberg M. H. (2005). Genetic dissection and prognostic modeling of overt stroke in sickle cell anemia. Nat. Genet. 37, 435–440. doi: 10.1038/ng1533
Sebastiani P., Solovieff N., DeWan A., Walsh K., Puca A., Hartley S. W., Melista E., Andersen S., Dworkis D. A., Wilk J. B., Myers R. H., Steinberg M. H., Montano M., Baldwin C. T., Hoh J., Perls T. T. (2012). Genetic signatures of exceptional longevity in humans. PLoS ONE 7, e29848. doi: 10.1371/journal.pone.0029848
Solovieff N., Baldwin C. T., Steinberg M. H., Perls T. T., Sebastiani P. (2011). “Incorporating genetic ancestry into risk prediction models,” in The 12th International Congress of Human Genetics and the American Society of Human Genetics 61st Annual Meeting, Montreal.
Stengard J. H., Dyson G., Frikke-Schmidt R., Tybjaerg-Hansen A., Nordestgaard B. G., Sing C. F. (2010). Context-dependent associations between variation in risk of ischemic heart disease and variation in the 5′ promoter region of the apolipoprotein E gene in Danish women. Circ. Cardiovasc. Genet. 3, 22–30. doi: 10.1161/CIRCGENETICS.109.862748
Wei Z., Wang K., Qu H. Q., Zhang H., Bradfield J., Kim C., Frackleton E., Hou C., Glessner J. T., Chiavacci R., Stanley C., Monos D., Grant S. F., Polychronakos C., Hakonarson H. (2009). From disease association to risk assessment: an optimistic view from genome-wide association studies on type 1 diabetes. PLoS Genet. 5, e1000678. doi: 10.1371/journal.pgen.1000678
Whittaker J. (1990). Graphical Models in Applied Multivariate Statistics. New York: John Wiley & Sons.
Wu C., Walsh K., DeWan A., Hoh J., Wang Z. (2011). Disease risk prediction with rare and common variants. BMC Proc. 5(Suppl. 9), S61. doi: 10.1186/1753-6561-5-S9-S61
Yang J., Manolio T. A., Pasquale L. R., Boerwinkle E., Caporaso N., Cunningham J. M., de Andrade M., Feenstra B., Feingold E., Hayes M. G., Hill W. G., Landi M. T., Alonso A., Lettre G., Lin P., Ling H., Lowe W., Mathias R. A., Melbye M., Pugh E., Cornelis M. C., Weir B. S., Goddard M. E., Visscher P. M. (2010). Genome partitioning of genetic variation for complex traits using common SNPs. Nat. Genet. 43, 519–525. doi: 10.1038/ng.823

