Statistical Applications in Genetics and Molecular Biology
. 2011 Jul 12;10(1):32. doi: 10.2202/1544-6115.1691

Random Forests for Genetic Association Studies

Benjamin A Goldstein 1, Eric C Polley 2, Farren B S Briggs 3
PMCID: PMC3154091  PMID: 22889876

Abstract

The Random Forests (RF) algorithm has become a commonly used machine learning algorithm for genetic association studies. It is well suited for genetic applications since it is both computationally efficient and models genetic causal mechanisms well. With its growing ubiquity, however, its use in the literature has often been inconsistent and less than optimal. The purpose of this review is to break down the theoretical and statistical basis of RF so that practitioners are able to apply it in their work. An emphasis is placed on showing how the various components contribute to bias and variance, as well as on discussing variable importance measures. Applications specific to genetic studies are highlighted. To provide context, RF is compared to other commonly used machine learning algorithms.

Keywords: machine learning, SNP, genome wide association studies

1. Introduction

The Random Forests (RF) [Breiman 2001] algorithm is an increasingly popular machine learning algorithm within statistical genetics. While many different algorithms have been successfully applied to genetic data, RF contains a combination of characteristics that make it well suited for genetic applications. First, it is well adapted for both prediction (the traditional domain of machine learning) and variable importance (VI) (the typical interest of statistical geneticists). Second, RF is fairly robust to the settings of its tuning parameters, making it, in Breiman's words, an "off-the-shelf" algorithm readily accessible to novice users. It is also one of the few algorithms capable of handling thousands of observations and hundreds of thousands of predictors. Furthermore, the final derived model, based on a combination of trees, represents a non-parametric model relating the predictors to the outcome. Trees are capable of capturing interactions and complex relationships in the data and make an ideal "base" learner for complex genetic problems. Finally, unlike many algorithms, RF is relatively straightforward to understand due to its non-parametric features.

Given these characteristics, it is not surprising that the number of statistical genetics articles published since the introduction of the RF algorithm in 2001 has steadily increased (see Figure 1). Such papers usually take one of four forms. The first are direct applications of the RF algorithm to uncover genes associated with disease. Such analyses either use RF independently to identify associations (e.g. Goldstein et al. [2010]) or use RF in conjunction with other modeling approaches (e.g. Briggs et al. [2010a]). In these, RF is most often used as a classification algorithm (the outcome is diseased or not diseased), though there is growing work using time-to-event outcomes (e.g. Ishwaran et al. [2008]). The second class of papers uses RF as a tool to predict disease state (e.g. Sun et al. [2007]). The third class of papers is methodological, illustrating improvements of the RF algorithm tailored for genetic applications, primarily focusing on the calculation of VI (e.g. Meng et al. [2009]) or providing an approach for variable selection (e.g. Díaz-Uriarte and Alvarez de Andrés [2006]). The final class of papers comprises those that compare RF to other algorithms (e.g. Statnikov et al. [2008]).

Figure 1: Articles listed in PubMed using the search terms ("Random Forest" OR "Random Forests") AND (Gene OR SNP). Over 125 articles since 2001 (articles from the Genetics Analysis Workshop omitted).

Based on the breadth of work both studying and applying RF, it is easy to recognize that it has desirable properties that allow for successful application in genetic epidemiology studies. Unfortunately, there is often an inconsistent and less than optimal use of RF in the literature. In many ways it is a quintessential "black box" algorithm with many moving parts that spits out a series of "answers." However, underlying it is a build-up of simple theory that helps in understanding how best to optimize it. The purpose of this paper is not to justify the use of RF or provide a prescription of how best to use it. Sun [2010] provides a good introduction to the rationale behind using RF for genetic studies. Moreover, it is not our opinion that a "best" way to use RF (or any algorithm) exists, since the best approach depends on the data problem at hand. Instead, the goal of this paper is to clearly lay out the theory behind RF to allow users to best determine how to optimize the algorithm for their specific data problem. Since the substantive theoretical work on RF occurred within the field of statistical learning, and much of the applied work since has occurred within statistical genetics, both literatures are represented.

This paper is written specifically to understand the use of RF for the type of classification problems encountered in large genetic association studies. While most of the discussion is generally applicable, some of it (particularly that of Section 3.1) will differ when the outcome is continuous or a survival time. Section 2 gives background on genetic association studies. Section 3 provides a discussion of classification and how its error breaks down into bias and variance; it also lays out the three main components of RF: trees, bagging, and randomization. Section 4 discusses various calculations of VI. Section 5 discusses how to use the RF algorithm, focusing on tuning parameters and modifications that can be made to the input data relevant to genetic studies, with an emphasis on showing how these changes relate to changes in bias and variance; other uses of RF and various implementations are also briefly discussed there. Section 6 provides a comparison to some other algorithms to give more context for RF. Some concluding thoughts are provided in Section 7.

2. Genetic Data

Typical applications of RF for genetic studies involve the use of case-control association studies. In these studies, individuals with and without a disease of interest are recruited. The outcome, Y, can be expressed as a binary variable, making the task a classification problem. Since there is no easy means to control for confounding factors within the algorithm, it is important that such considerations are made in the study design to minimize confounding (e.g. matching cases and controls by age and gender). The primary sources of confounding in genetic studies are gender and ethnicity. Generally, researchers have focused analyses on autosomal chromosomes, thus avoiding potential gender bias that may exist when investigating the sex chromosomes. Confounding due to ethnicity (population stratification) is an important concern, and it is essential to ensure that ethnic backgrounds between cases and controls are similar and that population outliers are removed beforehand. This can be accomplished using any of several freely available programs (e.g. EIGENSTRAT [Price et al. 2006]).

The predictor variables of interest are usually either units of DNA referred to as single nucleotide polymorphisms (SNPs) or gene expression measurements. In gene expression studies, each individual receives a continuous measurement indicating the degree of expression for up to tens of thousands of genes, where the measure of interest is the relative difference in expression levels between cases and controls.

SNP studies are generally much larger. Where expression studies may have only 100 observations, SNP studies can have over 10,000 observations. Moreover, SNP studies may contain up to 2.5 million SNPs (or more), in which case they are referred to as genome wide association (GWA) studies. SNPs are genomic variants consisting of a pair of bi-allelic nucleotides, where each allele can be represented as a 0–1 variable, with 1 representing the minor (less frequent) allele. The identification of SNPs for inclusion in GWA studies has largely been informed by the International HapMap Project and the 1000 Genomes Project, which extensively describe and catalogue the common patterns of human DNA sequence variation [The International HapMap Consortium 2003; The 1000 Genomes Project Consortium 2010]. The standard dogma is that variation at these points, or at points correlated with them (referred to as linkage disequilibrium [LD]), accounts for biochemical changes that lead to variable phenotypes and disease.

GWA studies attempt to identify relatively common SNPs (minor allele frequency [MAF] > 1–5%) in LD with causal susceptibility variants in a complex disease. GWA studies assume the common-disease-common-variant hypothesis: that common disease susceptibility is a result of the joint action of several common variants with relatively small to moderate effects, and that a significant proportion of disease alleles is shared among unrelated affected individuals [Reich 2001]. These studies are an attractive approach to investigating the genetic basis of complex diseases, as they are hypothesis free and unconstrained by a priori assumptions regarding the disease's etiology. Successful application of GWA studies depends on: 1) sufficiently large study samples from clearly defined study populations capable of contributing relevant genetic information regarding the research question, 2) polymorphic SNPs that can be inexpensively and efficiently genotyped and that capture extensive genetic variation across the whole genome, and 3) analytical approaches that are statistically robust despite the dimensionality conflict (an excessively large number of variables relative to the number of observations) and that can be employed to identify the genetic associations in an unbiased fashion (conventionally, a single-point, one degree of freedom test of association, where significance is defined as p < 5 × 10−8) [McCarthy et al. 2008; Cantor et al. 2010].

A SNP that has a causal impact may have one of four potential effects outlined in Table 1. For data analysis, we can consider coding SNPs as an ordinal 0/1/2 variable where the value represents the number of minor alleles. Tree based algorithms are well suited for both expression and SNP analyses. As opposed to parametric methods which require specifying a causal model, trees allow for a general search of an optimal cut-point (see section 3.2), allowing the range of causal mechanisms to be easily uncovered by the tree structure. For example, a partition for a given SNP between 0 & 1/2 would represent a dominant effect.

Table 1:

Different genetic effects

Type | Mechanism | Partition
Additive | Each additional minor allele increases variation | 0, 1, 2
Dominant | Presence of at least 1 minor allele increases variation | 0, 1/2
Recessive | Two minor alleles needed for variation | 0/1, 2
Heterosis | Heterozygote leads to variation | 0/2, 1
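To make the coding concrete, below is a minimal sketch in R using simulated data (sample size, allele frequencies, and effect sizes are illustrative, not from the paper). Under the 0/1/2 coding, a single binary split between 0 and 1 recovers the dominant partition of Table 1.

    ## Minimal sketch: SNPs coded as minor-allele counts (simulated data)
    set.seed(1)
    n <- 500
    snp <- sample(0:2, n, replace = TRUE, prob = c(0.49, 0.42, 0.09))

    ## A dominant effect: risk rises with at least one minor allele,
    ## i.e. the partition {0} vs {1,2} from Table 1
    p_disease <- ifelse(snp >= 1, 0.6, 0.4)
    disease <- rbinom(n, 1, p_disease)
    table(snp, disease)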

3. The Components of the Random Forests Algorithm

3.1. Bias - Variance Decomposition

One of the first steps in understanding a predictor is to see how its predictions contribute to bias and variance. We start with the setup: given an outcome y, input vector x, and relationship y = f(x) + ε, where E[ε] = 0 and Var[ε] = σ_ε². For a given training set T, the prediction is f̂(x|T). The well known decomposition for prediction error (PE) under squared-error loss with a continuous outcome is:

$$E_T\big[y - \hat f(x|T)\big]^2 = \underbrace{\sigma_\varepsilon^2}_{\text{Noise}} + \underbrace{\big[f(x) - E_T \hat f(x|T)\big]^2}_{\text{Bias}} + \underbrace{E_T\big[\hat f(x|T) - E_T \hat f(x|T)\big]^2}_{\text{Variance}} \tag{1}$$

where the expectation is over random training sets. The first term is the variance of the outcome y and is referred to as the noise. This represents the irreducible error. The next two terms represent the reducible error. The first of these is the bias. We can think of the bias as the systematic difference between the prediction and the target. The final term is the variance. It is the measure of randomness of the prediction. It is important to note that the variance is independent of the true outcome y and the true function f(x).
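As a concrete illustration, the following R sketch approximates each term of Equation (1) by Monte Carlo for a deliberately misspecified (linear) estimator of a non-linear f; all settings are invented for illustration.

    ## Monte Carlo sketch of Equation (1): noise + bias^2 + variance
    set.seed(2)
    f <- function(x) sin(2 * pi * x)        # true regression function
    x0 <- 0.3; sigma <- 0.5; B <- 2000
    preds <- replicate(B, {
      x <- runif(50); y <- f(x) + rnorm(50, sd = sigma)
      fit <- lm(y ~ x)                      # a (biased) linear approximation
      predict(fit, data.frame(x = x0))
    })
    c(noise    = sigma^2,
      bias2    = (f(x0) - mean(preds))^2,
      variance = var(preds))                # the irreducible and reducible terms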

In classification with a 0–1 outcome we are trying to minimize P(f̂(x) ≠ y), y ∈ {0, 1}. This is usually done under misclassification loss:

$$l(y, \hat f(x)) = \begin{cases} 1 & \text{if } y \neq \hat f(x), \\ 0 & \text{if } y = \hat f(x). \end{cases} \tag{2}$$

In the mid 1990s multiple authors attempted to define a decomposition for 0–1 loss [Dietterich and Kong 1995; Kohavi and Wolpert 1996; Breiman 1996b; Tibshirani 1996]. Most of the effort centered around trying to find a decomposition that was additive in the components of noise, bias, and variance, as in Equation (1). Each author proposed a slightly different decomposition, depending on which properties they hoped to satisfy.

Unfortunately, a thought exercise shows that bias and variance are not additive when the goal is classification. First, note that if a classifier predicts the correct class more often than not over training sets, P(f̂(x|T) = y) ≥ 0.5, it is unbiased at x. If we have an unbiased classifier, we would want the classifier to also have low variance. However, if the classifier is poor, P(f̂(x|T) = y) < 0.5, we say it is biased. In this scenario we would actually want the classifier to have high variance, because we want to increase the chance that the classification "flips." In this sense, to minimize PE, we see that for an unbiased (good) classifier we want low variance, but for a biased (poor) classifier we want high variance.

Friedman [1997] recognized this interaction between bias and variance. After averaging over all training sets, he decomposes the relationship as,

$$P(\hat f(X) \neq y) = |2f(X) - 1|\,P(\hat f(X) \neq f^*(X)) + P(f^*(X) \neq y) \tag{3}$$

where f*(x) is the Bayes classifier, and X ranges over all inputs x. Friedman referred to P(f̂(X) ≠ f*(X)) as the decision boundary error. Making the simplifying assumption that f̂(X) is normally distributed, he showed this boundary error could be represented by:

$$P(\hat f(X) \neq f^*(X)) = \Phi\!\left[\frac{\operatorname{sign}(1/2 - f(X))\,\big(E\hat f(X) - 1/2\big)}{\sqrt{\operatorname{var}(\hat f(X))}}\right] \tag{4}$$

The "boundary bias" is then represented by sign(1/2 − f(X))(E f̂(X) − 1/2). It is clear that when the classifier predicts the correct class, the "boundary bias" is negative, and PE decreases as [E f̂(X) − 1/2] increases. Furthermore, it is evident that decreasing the variance of the predictor is only beneficial when the classifier is on the correct side of the boundary. In this way we see the strong multiplicative interaction between bias and variance.

Gareth [2003] followed this up by suggesting a unified bias-variance decomposition applicable to all symmetric loss functions. Specifically, he recognized that there is both bias and the effect due to bias; similarly, there is variance and the effect due to variance. He showed that under squared-error loss these are equal; under other losses they are not.

In genetic studies, we are often more interested in variable importance than prediction (though prediction is growing in interest). It is then natural to ask why we concern ourselves with these issues. The concern arises when it comes to tuning the algorithm. In the classification setting, the interest may not be in predicting the class outcome, but instead the underlying probability. Friedman [1997] showed that in the probability estimation setting the bias-variance decomposition again becomes additive (assuming squared-error loss). In an application to K-Nearest Neighbors (K-NN), he demonstrated that the implication of this is that different tuning parameters will be favored depending on the task at hand.

It is also possible to show that the prediction error is directly related to the quality of the variable importance measures. Figure 2 illustrates this point. The RF algorithm was trained on a data set containing 117 predictors, 17 of which were associated with the binary outcome through a simple additive logistic model. The algorithm was run, and the top 17 variables were examined based on RF VI (see Section 4). The experiment was then repeated with noise injected by randomly "flipping" a certain percentage of the outcomes (a similar simulation was performed by Breiman in the original RF paper [Breiman 2001]), leading to an increase in PE as measured by the Out-of-Bag (OOB) error rate (see Section 5.1.1). As shown, as PE increases, the number of "true" associations among the top results decreases. This shows that appropriate minimization of PE is just as important for VI as it is for prediction.

Figure 2: The relationship between quality of variable importance and prediction error. As PE increases, the quality of the VI rankings decreases.
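A sketch of this style of experiment is given below; the simulation settings (sample size, effect sizes, noise levels) are assumptions for illustration, not the ones behind Figure 2, and it uses the randomForest R package (see Table 2).

    ## Sketch: label noise degrades both OOB-ER and VI rankings
    library(randomForest)
    set.seed(3)
    n <- 200; p <- 117; p_true <- 17
    x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("v", 1:p)))
    beta <- c(rep(0.5, p_true), rep(0, p - p_true))
    y <- rbinom(n, 1, plogis(x %*% beta))

    for (flip in c(0, 0.1, 0.3)) {          # randomly flip a fraction of outcomes
      y_noisy <- ifelse(runif(n) < flip, 1 - y, y)
      rf <- randomForest(x, factor(y_noisy), ntree = 500, importance = TRUE)
      top <- names(sort(importance(rf, type = 1)[, 1], decreasing = TRUE))[1:p_true]
      cat("flip =", flip,
          "OOB-ER =", round(tail(rf$err.rate[, "OOB"], 1), 2),
          "true hits in top 17:", sum(top %in% paste0("v", 1:p_true)), "\n")
    }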

3.2. CART

Underlying Random Forests is the Classification and Regression Tree (CART) algorithm [Breiman et al. 1984]. The CART algorithm recursively searches for a binary split that partitions the data in such a way that minimizes a splitting criterion. This is referred to as a “greedy” search. After a stopping criterion is met, the final splits partition the predictor space into hyper-rectangles. These regions are referred to as leaves or terminal nodes of the tree.

The variable at the top of the tree represents the "strongest" splitting variable (see Figure 3). Subsequent variables are conditional on the variables above them. Trees are an appealing base learner because they present a straightforward means to represent complex relationships, particularly those present in genetic data. The tree structure represents a conditional model, making it suited for finding interactions and higher order effects. Furthermore, since trees do not assume linearity of effects, instead performing binary splits, they are ideally suited for discovering recessive and dominant genetic effects. The main type of effect ill suited to trees is the additive effect, since capturing it requires consecutive splits on the same variable.

Figure 3: A CART tree representing the hierarchy of effects. An additive model was simulated where the effect of SNP1 > SNP2 > SNP3. This hierarchy is reflected in the ordering of the tree.

In the case of classification, the splitting criterion, Q_m(T), is typically the gini index (though other convex losses can be used). For a node m, in region R_m, with N_m observations, we define the class proportion

$$\hat p_{mk} = \frac{1}{N_m} \sum_{i:\, x_i \in R_m} I(y_i = k)$$

where k is an outcome class. The gini index is then:

$$GI = \sum_{k=1}^{K} \hat p_{mk}(1 - \hat p_{mk}) = 2p(1-p) \quad \text{when } K = 2 \tag{5}$$

This process continues until all of the leaves contain only members of one class. One nice feature of the gini criterion is that it prefers such pure nodes. Other losses, notably misclassification loss, are not necessarily minimized by a pure node.
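As a concrete illustration, here is a small R helper for Equation (5) that scores a candidate split by the size-weighted gini index of its daughter nodes (illustrative code, not the internals of any RF implementation).

    ## Gini index of a 0/1 vector (Equation 5 with K = 2)
    gini <- function(y) { p <- mean(y == 1); 2 * p * (1 - p) }

    ## Size-weighted gini of the daughters for a candidate binary split
    split_gini <- function(y, left) {
      nl <- sum(left); nr <- sum(!left)
      (nl * gini(y[left]) + nr * gini(y[!left])) / (nl + nr)
    }

    y   <- c(0, 0, 0, 1, 1, 1, 1)
    snp <- c(0, 0, 1, 1, 2, 2, 2)
    split_gini(y, snp == 0)   # dominant partition {0} vs {1,2}
    split_gini(y, snp <= 1)   # recessive partition {0,1} vs {2}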

Since a fully grown tree, T_0, will have high variance (changes in the training data will lead to different tree structures), trees are typically pruned by finding the sub-tree T_α ⊆ T_0 which minimizes the criterion:

$$c_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha|T| \tag{6}$$

where |T| is the number of terminal nodes in the tree and α is a tuning parameter, typically chosen by cross-validation. In Random Forests, however, the trees are not pruned but kept at their maximal depth. This results in each tree having low bias but high variance. This variance is alleviated by bagging (see next section).
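For readers who want to experiment, the rpart R package implements CART; the sketch below grows a deep tree and prunes it with the cost-complexity parameter cp, which plays the role of α in Equation (6). The data are simulated purely for illustration.

    ## CART with cost-complexity pruning via rpart (a sketch)
    library(rpart)
    set.seed(5)
    d <- data.frame(y = factor(rbinom(200, 1, 0.5)),
                    x1 = rnorm(200), x2 = rnorm(200))
    full <- rpart(y ~ ., data = d, method = "class",
                  control = rpart.control(cp = 0, minsplit = 2))  # unpruned
    printcp(full)                    # cross-validated error across cp values
    pruned <- prune(full, cp = 0.05) # sub-tree minimizing the criterion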

To illustrate the value of trees for genetic studies, a simulation was performed emulating a GWA study.1 The numbers of true effects found among the top results by RF and by marginal p-values (an allelic chi-square test) are compared in Figure 4. RF is better adapted to finding the non-linear dominant and recessive effects, while marginal testing is better at finding additive effects.

Figure 4: Comparison of the types of effects found by RF and by marginal allelic p-values. The black line in each graph indicates the percentage of the "true" effects found among the top results by looking at just the marginal p-values. RF VI is better at finding dominant and recessive (i.e. non-linear) effects, while marginal p-values are more suited to finding additive effects.

3.3. Bagging

Breiman proposed the ensemble process Bagging (Bootstrap Aggregating) as a solution to the instability observed in classifiers such as CART trees [Breiman 1996a]. In ensemble methods such as bagging, the algorithm used (i.e. CART) is referred to as the "base learner." Bagging is a straightforward procedure where successive bootstrap samples of the data are selected, (X_b, Y_b), and a prediction, f̂_b(x), is derived from each of these samples. The final prediction, f̂_bag(x|T), is determined either by averaging the predictions, (1/B) Σ_{b=1}^{B} f̂_b(x) (for a continuous outcome), or by taking a majority vote, argmax_k f̂_bag(x) (for classification). To estimate the class probability P(f(x) = k), the intuitive approach is to average the probability estimates of each of the base learners (in CART these would come from the terminal nodes). However, the better approach is to divide the number of bagged samples that vote for class k by the total number of bagged samples (see Hastie et al. [2009], pg. 286 for discussion).
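The procedure is simple enough to code directly; below is a minimal sketch of bagging with rpart trees as the base learner on simulated data (illustrative, not the randomForest internals), including the recommended vote-fraction probability estimate.

    ## Bagging CART trees by hand (a sketch)
    library(rpart)
    set.seed(6)
    n <- 300
    d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
    d$y <- factor(rbinom(n, 1, plogis(d$x1 + d$x2)))

    B <- 100
    votes <- sapply(1:B, function(b) {
      idx <- sample(n, replace = TRUE)                   # bootstrap sample
      fit <- rpart(y ~ ., data = d[idx, ], method = "class")
      as.integer(predict(fit, d, type = "class") == "1") # each tree's vote
    })
    p_hat <- rowMeans(votes)            # fraction of trees voting class 1
    y_bag <- ifelse(p_hat > 0.5, 1, 0)  # majority vote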

The motivation behind bagging is to simulate having multiple training sets. If all training sets, T, were used, then there would be no variance in the final prediction. Bagging thus works by reducing the variance of the final predictor. Bühlmann and Yu [2002] and Friedman and Hall [2007] each showed that bagging works via smoothing out first order and higher order variance terms. With respect to bias, since the distribution of (X_b, Y_b) ∼ (X, Y), the bias of f̂_bag(x) equals the bias of f̂(x), so there is no (asymptotic) increase in bias induced by bagging, though there can be in finite samples.

Another perspective on bagging is that manipulation of the input space is able to increase the search space for an optimal solution [Dietterich 2000a]. This process works only with unstable predictors [Breiman 1996c], defined as predictors for which a small change in the data can lead to large changes in the prediction. Conversely, bagging stable predictors will lead to worse outcomes. Procedures like CART are unstable, while procedures such as regression are stable.

In the case of classification, Breiman [1996a] argued that bagging is effective in the case of order-correct predictors, which he defined as classifiers that place the greatest probability on the true class for a given input x, i.e. are unbiased at x. Similar to the previous discussion of the bias-variance relationship for classification, Breiman noted that for such order-correct classifiers bagging can be very useful, but for ones that are not, bagging can actually be harmful. This is because for good classifiers we want to decrease the variance (to reduce overall PE), but for bad classifiers we want to increase the variance (again to reduce overall PE).

3.3.1. Bagging Type

Bühlmann and Yu [2002] and Friedman and Hall [2007] also both showed that m < n sampling without replacement, with m = n/2, is just as effective as bagging with replacement, and computationally more efficient. Bühlmann and Yu referred to this as subagging (subsample aggregating).

Dietterich [2000a] notes that large datasets do not see the same benefits from bagging as smaller ones, because the bootstrap samples are more similar to one another. Subagging offers three benefits. The first is computational, since fewer observations are used in growing the trees. The second is a reduction in tree correlation: the trees will be more different from one another. The third is a reduction in tree size, which decreases the degrees of freedom of the final model.
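In the randomForest R package, subagging can be approximated with the replace and sampsize arguments; a sketch on simulated data:

    ## Subagging: m = n/2 sampled without replacement for each tree
    library(randomForest)
    set.seed(7)
    x <- matrix(rnorm(400 * 20), 400, 20)
    y <- factor(rbinom(400, 1, plogis(x[, 1])))
    rf_sub <- randomForest(x, y, ntree = 500,
                           replace = FALSE,               # no replacement
                           sampsize = floor(nrow(x) / 2)) # m = n/2
    tail(rf_sub$err.rate[, "OOB"], 1)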

3.3.2. Out-Of-Bag Error-rate

One of the appeals of bagging is that it presents a computationally efficient means to estimate the generalized error (GE), the PE of f̂ on an independent test set T′. The best way to estimate GE is with an independent validation set. In lieu of one, different analytic (e.g. AIC, BIC) and computational (e.g. Cross-Validation [CV]) approaches have been developed.

For bagged learners, analytic approaches are not feasible and CV is computationally very expensive. However, in each iteration of bagging, approximately 37% of the sample is not part of the bootstrap sample:

$$P(\text{observation } i \in \text{bootstrap sample } b) = 1 - \left(1 - \frac{1}{N}\right)^{N} \approx 1 - e^{-1} = 0.632$$

Breiman [1996d] showed that this Out-Of-Bag (OOB) sample can be used as a test set to get a measure of error, referred to as the out-of-bag error rate (OOB-ER). Over the entire bagging run, this error can be aggregated for each input vector x. This can provide a more stable estimate of GE than typical V-fold CV [Wolpert and Macready 1999]. If we define C_i as the set of bootstrap samples b that do not contain observation i, and |C_i| as the number of such samples, the OOB estimate becomes:

$$\widehat{GE}_{OOB} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|C_i|} \sum_{b \in C_i} L\big(Y_i, \hat f_b(x_i)\big) \tag{7}$$

Since each bootstrap sample will contain about 0.632N unique observations, this estimate will behave similarly to 2-fold CV.
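With the randomForest R package, the running OOB-ER corresponding to Equation (7) is stored in the fitted object; a sketch on simulated data:

    ## Reading the OOB error rate from a fitted forest
    library(randomForest)
    set.seed(8)
    x <- matrix(rnorm(200 * 10), 200, 10)
    y <- factor(rbinom(200, 1, plogis(x[, 1])))
    rf <- randomForest(x, y, ntree = 500)
    tail(rf$err.rate[, "OOB"], 1)  # cumulative OOB-ER after 500 trees
    plot(rf)                       # OOB-ER as the forest grows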

In a two-class problem the expected OOB-ER under the null (i.e. no improved prediction) is equal to the probability of the minor class. For example, if the classes are equal the expected OOB-ER is 50%; however, if 90% of the sample is from one class, the expected OOB-ER would be 10%. This can be shifted back to 50% via weighting within the tree growing process, as is often done in RF (see Section 5.1.5).

Of particular interest is that the OOB-ER provides a convenient means of choosing tuning parameters in RF. As will be discussed, RF involves the choice of multiple tuning parameters, and these can be chosen by determining the settings that minimize the OOB-ER.

3.4. Randomization

A final method for improving ensemble learners is by injecting randomization into the base learner. Many different procedures have been explored for this [Dietterich 2000a]. For example, RF injects randomness into the tree growing process by only searching over a subset of variables when searching for the optimal split. Other randomized tree algorithms have been proposed using different forms of randomization (see Cutler [1999]).

Dietterich showed that, like bagging, this randomization is able to expand the search space and alleviate what he termed the "statistical" burden. In another study, it was shown that injecting randomization can be more effective than bagging for large datasets [Dietterich 2000b].

As noted in Hastie et al. [2009], the variance reduction induced by bagging is limited by the correlation between the trees, since the trees are not independent, only identically distributed. If we denote the variance of each tree's prediction by σ² and the correlation between the predictions by ρ, the variance of the average is:

$$\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2 \tag{8}$$

As the number of bootstrap iterations, B, increases, the second term goes to 0, and we are left with the correlation between trees, ultimately limiting the benefits of the bagging process. As dataset size increases, the correlation between bagged samples increases, decreasing the effect of bagging. Injecting randomization into the tree growing process serves to further de-correlate the trees, further reducing the variance.

3.5. The Random Forests Algorithm

At this point we can consider the RF algorithm, proposed by Leo Breiman in 2001 [Breiman 2001]. At its core, it is bagged CART trees with injected randomization (see Figure 5).

Figure 5: The RF algorithm begins by selecting a bootstrap sample of the data. A random subset of the variables is selected and searched over to find the optimal split. This is repeated at each node until an unpruned CART tree is formed. The data not part of the bootstrap sample are run down the tree to derive the error rate and measures of VI. This is repeated until a full forest is grown.

The first alteration is that instead of pruning the CART trees, they are grown to maximal depth. These fully grown trees will be fairly unbiased but will be highly variable (recall CART trees are pruned to reduce the variance). This variance is reduced via bagging and randomization.

In RF, the randomization comes in the tree growing process. Before each split, a subset, m ≤ p, of the predictor variables is selected to search over. The choice of m, denoted mtry, is the primary tuning parameter. The smaller mtry is, the less correlated the trees are and the greater the potential variance reduction via bagging. However, smaller mtry will also lead to more biased trees, resulting again in the classic bias-variance trade-off.

RF reduces PE only through a reduction of variance, as the bias stays the same (or gets a bit worse). Breiman showed that, unlike other methods (notably Boosting), RF does not over-fit as the number of trees increases, i.e. the observed PE approaches the expected PE. However, it is possible, particularly with noisy data, for the model itself to be too rich (i.e. over-fit) and result in a poorer predictor.

4. Variable Importance

When applying RF for classification there are two primary forms of Variable Importance: permutation importance & gini importance.

4.1. Permutation Importance

The permutation importance (pVI) is the increase in misclassification for OOB person i after variable j has been permuted in tree k. If we consider the quantities:

  • sijk = 1 if tree k splits on variable j and misclassifies observation i (0 otherwise)

  • rijk = 1 if tree k does not split on variable j and misclassifies observation i

  • psijk = 1 if tree k splits on variable j and misclassifies observation i when variable j is permuted

  • prijk = 1 if tree k does not split on variable j and misclassifies observation i when variable j is permuted

We can represent it as:

$$pVI_{ijk} = (ps_{ijk} + pr_{ijk}) - (s_{ijk} + r_{ijk}) = ps_{ijk} - s_{ijk} \quad \text{since } pr_{ijk} = r_{ijk}$$

and we can calculate:

$$pVI_{ij} = \frac{1}{ntree} \sum_{k=1}^{ntree} (ps_{ijk} - s_{ijk}) \qquad pVI_{jk} = \frac{1}{np} \sum_{i=1}^{np} (ps_{ijk} - s_{ijk}) \qquad pVI_{j} = \frac{1}{np \times ntree} \sum_{i=1}^{np} \sum_{k=1}^{ntree} (ps_{ijk} - s_{ijk}) \tag{9}$$

where np and ntree are the number of people and the number of trees respectively.

The three quantities in (9) represent respectively the importance of variable j for person i, the importance of variable j for tree k and the overall importance of variable j. Each representation will have different utility depending on the question of interest.
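In the randomForest R package, the overall pVI_j and the person-level pVI_ij are both exposed (the per-tree pVI_jk is not); a sketch on simulated data:

    ## Overall and casewise permutation importance
    library(randomForest)
    set.seed(9)
    x <- data.frame(matrix(rnorm(200 * 20), 200, 20))
    y <- factor(rbinom(200, 1, plogis(2 * x[, 1])))
    rf <- randomForest(x, y, ntree = 1000, importance = TRUE, localImp = TRUE)
    imp_j  <- importance(rf, type = 1, scale = FALSE) # pVI_j, one per variable
    imp_ij <- rf$localImportance                      # pVI_ij, variables x persons
    dim(imp_ij)                                       # 20 x 200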

pVI has some nice properties. Since it is calculated on the OOB sample, it can be viewed as a measure of the predictive quality of that variable. A variable with no importance would be expected to have E(pVI) = 0, since permutation should neither increase nor decrease misclassification. There is also a notion of a population-level effect, since the probability of being permuted to a different value is determined by the observed population. pVI is also applicable for any outcome or predictor type.

4.1.1. Correcting Permutation Importance

An important consideration when applying RF to genetic data is the large degree of LD (correlation) among SNPs. There are a couple of ways to formulate the problem that correlation creates for calculating VI. Being a "greedy" algorithm, RF searches over all variables. In calculations of VI, this creates a smoothing and shrinkage of all VI measures, in a way analogous to Ridge regression (see Section 6). This creates problems for correlated variables, as their relative importance is diminished. Another formulation is that, since VI is calculated from the trees in which a variable appears, two SNPs in perfect LD will each appear in about half as many trees as either would appear by itself, effectively lowering the VI of each SNP. While this does not present a problem for prediction, it can skew the VI rankings.

Genuer et al. [2008] examined the impact of correlated variables and found that as the number of variables correlated with a true causal one increased, the variability of the true causal one's importance increased and its average importance decreased. Similar effects were noted by other authors, notably Strobl et al. [2007].

Meng et al. [2009] proposed a correction for this. Since pVI_j is calculated by dividing the summed pVI_ij by the total number of trees in the forest, the authors suggested instead dividing by the number of trees of which variable j is a member. This has the appeal that two perfectly correlated variables will no longer "take away" from each other. In practice, with large p this leads to highly unstable pVI measures; the correction works best with less sparse solutions or smaller p, where all variables have a chance to be brought into the model.

4.2. Gini Importance

The gini importance (gVI) is the second primary form of RF VI. Unlike pVI, gVI is only applicable in the case of classification. The gini index (GI) is the criterion used when growing the trees in RF for classification. Recalling Equation (5), for binary classification,

GI=2p(1p)

where p is the proportion in the second class. The split which minimizes GI is the preferred split. If we index the nodes of a given tree by n, we can then define:

$$gVI_{jkn} = GI_{parent} - \big(GI_{daughter\ left} + GI_{daughter\ right}\big) \qquad gVI_{jk} = \sum_{n \in Tree_k} gVI_{jkn} \ \text{(summing over the nodes splitting on variable } j \text{ in tree } k\text{)} \qquad gVI_{j} = \frac{1}{ntree} \sum_{k=1}^{ntree} gVI_{jk} \tag{10}$$

gVI_jk directly measures the importance of variable j to tree k. The higher the value, the better the variable was at splitting the data. In this sense it is very different from pVI: there is no notion of out-of-sample testing. Instead, gVI_jkn can be thought of as a χ² test, conditional on what has already occurred in the tree (for the root node it is conditional on nothing).

Another property is that gVI_j ≥ 0, with equality only if variable j does not appear in any tree. Like pVI, it will have trouble with correlated variables, but it can also be corrected by weighting. Since gVI is calculated on the in-sample data, it does not have a population-level interpretation as pVI does. Instead, gVI only considers the relationship between the variable and the model.

pVI is the more commonly used form of VI. However, some intuition shows that gVI can be a preferential VI measure when the predictive quality of the trees is low (i.e. OOB-ER ≈ 50%). Since pVI is calculated based on the increase of misclassification after permuting variable j, if the baseline misclassification rate is already relatively high, there is little chance for permutation to make prediction worse. This will lead to uniformly low pVI. Conversely, since gVI is calculated relative to the grown tree, it does not suffer this problem. It is easy to show this via simulation. However, since some variables have to be in each tree, there will always be variables with high gVI, and it is questionable how "important" a variable is that does not improve prediction. Moreover, it is much more challenging to derive distributional properties for gVI.
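Both measures can be pulled from the same fit with the randomForest R package; a sketch on simulated data:

    ## Comparing permutation and gini importance
    library(randomForest)
    set.seed(10)
    x <- data.frame(matrix(rnorm(200 * 20), 200, 20))
    y <- factor(rbinom(200, 1, plogis(2 * x[, 1])))
    rf <- randomForest(x, y, ntree = 1000, importance = TRUE)
    head(importance(rf))  # MeanDecreaseAccuracy (pVI) and MeanDecreaseGini (gVI)
    varImpPlot(rf)        # the two rankings side by side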

4.3. Other Variable Importance Measures

One of the appeals of RF is that advanced users can tailor the algorithm to calculate different VI measures depending on the question of interest. Lee et al. [2008] developed a VI measure relevant for looking at linkage data. Bureau et al. [2005] created a modification of pVI to look at the joint effects for pairs of SNPs. Jiang et al. [2009] proposed a sliding window VI measure for examining interactions.

Many of the modifications to VI involve correcting for the "bias" incurred by the presence of correlated predictors. Strobl et al. [2008] suggested using a conditional permutation scheme to calculate VI. The variables correlated with the variable of interest are empirically determined, and the partitions in the individual tree are then utilized to permute the variable of interest within blocks. While effective when p is small, this has the drawback of creating a VI measure that is more computationally expensive and not uniform across trees. Wang et al. [2010] developed a different VI measure that also takes into account conditional effects. The authors proposed a permutation-based test to associate a p-value with their VI measure. While useful, many of these measures are ad hoc, and as with pVI and gVI it is important to determine whether statistical properties exist, since relying on permutation tests can be inefficient.

4.4. Determining Important Variables

The RF VI measures are fairly successful at rank-ordering associated predictors. However, in genetic studies the goal is often to determine which variables are worthy of future follow-up. Ideally, statistical properties for the VI measures would exist to determine when the observed value differs from an expected value. Some work has been performed in this area, but to this point no formal approach has been adopted.

Another challenge is that RF will incorporate most predictors into the final collection of trees. Other algorithms, notably LASSO (see Section 6), will incorporate fewer variables, referred to as a sparse model, making it easier to drop seemingly unimportant or redundant variables. In genetics applications, where many SNPs or genes are not associated with the disease and are simply noise, this lack of sparsity makes variable selection more challenging.

Due to the lack of sparsity and of defined statistical properties, it is not possible to divide the SNPs or genes into disjoint sets of "associated" and "not associated." This has led to a range of ad hoc procedures. Díaz-Uriarte and Alvarez de Andrés [2006] suggested removing the bottom 10% and re-running until prediction decreased. Rodin et al. [2009] devised a method for selecting variables based on specification of an optimal model size. Goldstein et al. [2010] examined the scree plots of the VI measures and used the "elbow" as the cut-off. However, these are unsatisfactory solutions, and ideally some objective cut-off could be determined.

5. Applying Random Forests

5.1. Tuning Parameters

Running RF involves the choice of three primary tuning parameters: mtry, the number of trees (ntree), and tree size (nsplit, maxnodes). In the case of classification, class weights can also be varied. While RF is relatively robust to the settings of its tuning parameters (fine changes will generally give similar results), each tuning parameter contributes differently to the bias and variance of predictions and impacts the quality of the final solution.

5.1.1. Using the OOB-ER

The optimal values of most of these tuning parameters are dataset dependent. By understanding how these tuning parameters contribute to bias and variance, it is possible to speculate a priori as to the optimal values; however, they ultimately need to be determined empirically. The OOB-ER provides an unbiased estimate of the generalized error. Minimizing this error allows one to select the optimal tuning parameters to generate the best predictive model. However, when one begins augmenting the dataset (e.g. removing unimportant variables), the OOB-ER is no longer an unbiased estimate of the generalized error, though its minimization can still be used for tuning parameter selection [Svetnik et al. 2004].

Theoretical work has shown that the prediction error can be tied directly to the strength of association of the set of predictors [Goldstein et al. 2011], and our own internal testing has shown that as the OOB-ER improves, the quality of the VI rankings improves (see Figure 2). With this in mind, VI should be interpreted in conjunction with the OOB-ER. If the OOB-ER is close to the null value, it is likely that none of the predictors are associated with the outcome, regardless of the VI scores.

5.1.2. mtry

In each step of the tree growing process a different subset of variables is selected to search over in order to find the optimal split. When the size of this subset, mtry, is small, the trees will have lower correlation leading to a greater potential for variance reduction (see Equation (8)). As mtry increases, the variance reducing effect of the randomization decreases. When mtry is equal to the number of predictors, p, RF reduces to bagging.

mtry is the primary tuning parameter and has its greatest impact on the complexity of the final model. Larger values of mtry lead to fewer variables being brought into the trees, resulting in a sparser solution (see Figure 6). Since CART is a greedy algorithm, the higher the mtry, the faster it will converge on the optimal splits, creating smaller and more efficient trees. When data are noisy this is desirable, since unrelated predictors will be ignored. However, if most variables are related to the outcome, this can potentially ignore important but weak predictors. In this way mtry can be thought of as controlling the degrees of freedom (df) of the model, with higher mtry using fewer df.

Figure 6: A RF with 100 trees and varying mtry values was grown. The data contain 1,000 predictors and 100 observations. Ten of the variables are associated on a logistic additive scale. As mtry increases, the model complexity decreases, as smaller trees are grown and fewer variables are brought into the final model.

In genetic studies, we expect that the vast majority of the input variables are simply noise. Therefore, if mtry is too small, the chance of selecting an important variable to search over at a given node will be small. This will add additional noise into the trees, counteracting the variance-reducing potential of decorrelated trees.

The default recommended mtry is often √p. Díaz-Uriarte and Alvarez de Andrés [2006] examined mtry values ranging from √p to 13√p and generally found that the larger the mtry, the more reliable the VI measure. Genuer et al. [2008] noted that mtry is more important for VI calculation than for prediction and that with sparse data mtry = √p leads to greatest stability. Goldstein et al. [2010], working with a large SNP dataset (p > 330,000), found that mtry values much larger than √p were needed. In simulation work, the OOB error rate and VI measures were fairly similar with mtry values of 0.1p and √p, indicating that the setting is fairly robust to a sensible choice.

Ultimately, a default mtry value cannot be prescribed, as the optimal choice will depend on the data problem. Generally it is advisable to perform a coarse search of different values to see how results are impacted.
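A minimal sketch of such a coarse search, using the OOB-ER with the randomForest R package on simulated data (the candidate values are arbitrary):

    ## Coarse OOB-based search over mtry
    library(randomForest)
    set.seed(11)
    p <- 1000
    x <- matrix(rnorm(100 * p), 100, p)
    y <- factor(rbinom(100, 1, plogis(rowSums(x[, 1:10]) / 3)))
    for (m in c(floor(sqrt(p)), floor(0.1 * p), floor(0.3 * p))) {
      rf <- randomForest(x, y, ntree = 500, mtry = m)
      cat("mtry =", m, "OOB-ER =", round(tail(rf$err.rate[, "OOB"], 1), 3), "\n")
    }
    ## tuneRF() in the same package offers an automated version of this search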

5.1.3. Number of Trees

Another important consideration is how many trees to grow, ntree. Unlike with mtry and the other tuning parameters, there is a clear "best" ntree: the larger the better. The main limitation to increasing the number of trees is the extra computation required and the diminishing returns with larger values.

Stronger predictors will lead to quicker convergence. Glaser et al. [2007], using a much smaller data set (20 SNPs), grew forests with up to 5000 trees and found that after 400 trees the OOB-ER was stable. Díaz-Uriarte and Alvarez de Andrés examined forest sizes ranging from 1000 to 40,000 and found no differences in error rates. While for prediction purposes fewer trees are necessary and the OOB error rate will generally converge rapidly, Genuer et al. noted that for variable importance larger forests will generally lead to refinement and stability of the measures.

If the only concern is prediction, the OOB error rate can be used as the only guide to determine ntree. For VI this is not the case, as larger forests will often be necessary. Path plots (Figure 7) provide one means of examining whether VI measures have converged, by showing the cumulative VI score as the forest grows. Using the same data as in Figure 6, forests with up to 5,000 trees were grown with mtry = 100 (10% of p). The VI measures stabilized after about 2,000 trees.

Figure 7: Path plots for two variables (one associated with the outcome and one not). The associated variable is simulated on the log additive scale. For the associated variable, gVI stabilizes after about 1000 trees, while pVI stabilizes at about 2000 trees. The unassociated variable stabilizes sooner.
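A rough way to produce a comparable diagnostic with the randomForest R package is to refit forests of increasing size and track the VI of selected variables; a sketch on simulated data (unlike a true path plot, this refits independent forests rather than following a single growing forest):

    ## VI stability as ntree grows (a sketch)
    library(randomForest)
    set.seed(12)
    p <- 1000; n <- 100
    x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("v", 1:p)))
    y <- factor(rbinom(n, 1, plogis(2 * x[, "v1"])))
    sizes <- seq(250, 2500, by = 250)
    vi_path <- sapply(sizes, function(B) {
      rf <- randomForest(x, y, ntree = B, mtry = 100, importance = TRUE)
      importance(rf, type = 1)[c("v1", "v2"), 1]   # one associated, one noise
    })
    matplot(sizes, t(vi_path), type = "l", xlab = "ntree", ylab = "pVI")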

5.1.4. Controlling Tree Size

Controlling tree size is not an often discussed tuning parameter, but it can be useful for genetic data. Tree size can be controlled explicitly by limiting the number of splits (maxnodes), resulting in what are referred to as stumps (as in boosting, see Section 6). The size can also be controlled adaptively by telling the tree growing algorithm when to stop splitting nodes (nsplit). In RF, trees are grown to maximal depth (i.e. until nodes are pure in class); however, one can choose to stop the splitting based on the number of people in a node.

Theoretically, limiting tree size is unnecessary. From a bias-variance perspective, the purpose of pruning is to increase the stability (i.e. lower the variance) at the cost of increased bias. The increased variance of unpruned trees is due to their tendency to over-fit the data. However, the bagging process specifically aims to reduce variance (with no cost to bias) and avoid over-fitting making pruning superfluous.

Practically, though, limiting tree size can prove beneficial. For one, it will speed up computation. Moreover, particularly with datasets containing many noisy predictors, it will limit the number of superfluous splits. This will result in smaller trees that (hopefully) contain only the important predictors.

In practice, these tuning parameters have not received enough attention. The primary evaluation was by Segal [2004], who looked at nsplit. As with mtry, the conclusion was that there was often an optimal nsplit, though growing trees to maximal depth did not lead to over-fitting.
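In the randomForest R package, tree size is controlled through the nodesize and maxnodes arguments; a sketch with arbitrary values:

    ## Limiting tree size
    library(randomForest)
    set.seed(13)
    x <- matrix(rnorm(300 * 50), 300, 50)
    y <- factor(rbinom(300, 1, plogis(x[, 1])))
    rf_small <- randomForest(x, y, ntree = 500,
                             nodesize = 20,  # stop splitting small nodes
                             maxnodes = 64)  # cap on terminal nodes per tree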

5.1.5. Class Weights

In the case of classification, the classes can be differentially weighted. By default, most implementations of RF weight the classes so they are balanced. This ensures, in the case of grossly uneven classes (i.e. 90% vs 10%), that a degenerate solution is not found (i.e. one that predicts all observations to the larger class with an OOB-ER of 10%). However, if the interest is in determining which genes are important for cases, the class weights could be shifted towards the diseased class. To our knowledge this issue has not been examined and is worthy of further exploration.
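In the randomForest R package, weighting is exposed through classwt, and balance can also be imposed through stratified sampling with strata and sampsize; a sketch on simulated unbalanced data (note that implementations differ in how faithfully class weights are honored):

    ## Class weights and stratified sampling for unbalanced classes
    library(randomForest)
    set.seed(14)
    x <- matrix(rnorm(500 * 10), 500, 10)
    y <- factor(rbinom(500, 1, 0.2))                 # roughly 80/20 classes
    rf_w <- randomForest(x, y, ntree = 500,
                         classwt = c("0" = 1, "1" = 4))
    rf_s <- randomForest(x, y, ntree = 500,          # equal draws per class
                         strata = y, sampsize = c(50, 50))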

5.2. Modifying the Data

The final solution can also be affected by modifying the input data. It is important to note that once the data are modified, the OOB-ER no longer represents an unbiased estimate of GE [Svetnik et al. 2004].

5.2.1. Correlation

There are two approaches for dealing with correlated data: computational corrections and pre-processing of the data. Computational approaches are discussed in Section 4.1.1. The second approach, applicable with SNP data, is to pre-process the data based on LD structure. This was the approach taken in Goldstein et al. [2010], where different levels of LD pruning were used. It was found that pruning to an LD level of 90% (R²) did not lead to a degradation in PE and allowed new results to be detected, while improving computational speed.

5.2.2. Removing Unimportant Variables and Important Ones

As mentioned, the sparsity of the final model is a function of both mtry and ntree. Such a sparse model will result in some VI values equal to 0. It can be beneficial to remove these variables, as they likely represent noise and will make finding the optimal solution more challenging. Díaz-Uriarte and Alvarez de Andrés outlined a strategy of sequentially removing genes, dropping the bottom 20% or 50% in successive runs until there was a noticeable increase in PE.

Not only can we consider removing unimportant variables, we can also consider removing overly strong ones. Since the final VI value depends on where a variable lies in the tree, a strong marker, or set of strong markers, could overshadow weaker yet important effects by pushing them down the tree. By removing highly influential variables, other variables have the opportunity to rise to the top of the tree and receive a stronger VI score. Goldstein et al. [2010] showed that after removing Chr 6p, a region known to be highly associated with the disease being studied, new SNPs were found that would not have been detected otherwise.

5.3. Interaction Detection

As GWA studies have detected most "easy to find" effects, a greater emphasis has been placed on the detection of interaction, or epistatic, effects. Many techniques and computational tools have been developed, each with different strengths and weaknesses [Cordell 2009]. Due to the conditional nature of trees, they are well suited for modeling non-linear effects such as interactions. If a variable serves as a split on one side of a tree but not on the other, that is an indication of an interaction between that variable and one above it (see Figure 8).

Figure 8: Illustration of a tree with an interaction between SNP1 and SNP2. Both SNP3 and SNP1 have independent main effects.

Multiple studies have compared RF to other parametric and computational methods, showing the variable importance scores to be more powerful and stable than other current approaches [García-Magariños et al. 2009; Lunetta et al. 2004; McKinney et al. 2006; Nicodemus et al. 2007]. In addition to gene x gene interactions, RF has also been used to successfully detect gene x environment interactions [Maenner et al. 2009].

One of the primary limitations is that while RF can identify SNPs that are potentially epistatic, the typical output does not give the user any indication of which SNPs may be interacting, or with which partners. While there is potentially room for the development of VI measures that explicitly examine interactions, little work has been performed to this point. In one of the few known methods, Jiang et al. [2009] proposed a sliding window method to scan for up to three-way interactions.

5.4. Two-Stage Analysis

While RF can successfully be used by itself, one of its greatest utilities comes in combination with other modeling approaches. Many authors have performed multistage analyses using RF as a first stage screening step and then followed up with a finer analytical method (usually logistic regression) [Briggs et al. 2010a,b; De Lobel et al. 2010; Schwarz et al. 2007; Meng et al. 2007].

While this can be a very powerful tool for SNP discovery, such two-stage approaches make inference more challenging. As discussed by Svetnik et al. [2004], such techniques can be subject to selection bias, and inference will subsequently be biased. Ideally there would be one dataset on which to run RF and select SNPs, and a second dataset on which to draw inference. If a second dataset does not exist, a multiple testing correction based on the original number of predictors needs to be performed [Tuglus and van der Laan 2009].

5.5. Risk Prediction

RF was originally developed as a prediction tool. Traditionally, genetic studies have not been concerned with risk prediction; however, as more disease susceptibility loci have been identified for many diseases, interest has shifted to predicting genetic risk [Janssens et al. 2011]. To this point much of that work has centered on summing marginal effects to derive an overall risk score (e.g. De Jager et al. [2009]; The International Schizophrenia Consortium [2009]). While initially valuable, such methods do not adequately take into account the joint effects of multiple SNPs. As more disease studies shift from a paradigm of SNP detection to risk prediction, RF provides a valuable tool for such analyses.

Using RF for prediction does not change the previous discussion of choosing optimal tuning parameters. In many ways the task is easier, as one no longer needs to be concerned with correlated variables and the intricacies of VI measures. As with all bagging algorithms, the final model generates a predicted probability for a class outcome based on the number of trees that vote for a particular class.
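With the randomForest R package, these predicted probabilities are the vote fractions; a sketch on simulated SNP-like data (for in-sample risk estimates, the OOB votes are the honest choice):

    ## Risk prediction as vote fractions
    library(randomForest)
    set.seed(15)
    x <- matrix(sample(0:2, 200 * 50, replace = TRUE), 200, 50)
    y <- factor(rbinom(200, 1, plogis(0.8 * x[, 1] - 1)))
    rf <- randomForest(x, y, ntree = 500)
    head(rf$votes)                                # OOB vote fractions per class
    risk <- predict(rf, x, type = "prob")[, "1"]  # optimistic on training data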

5.6. Other Uses of Random Forests

The derived collection of trees provides a significant amount of information about the complex relationships between predictors and observations. This information can be exploited for many additional uses including clustering, imputing missing data, detecting which observations are related (proximities), detecting outliers and graphing. Most of these are detailed on the main website for Random Forests [Breiman and Cutler 2010]. Cutler and Stevens [2006] provide an overview of some of these uses for genomic applications.

While many of these methods are implemented in most versions of RF, few have seen application. Some of these methodologies (e.g. imputation [Schwarz et al. 2009] and outlier detection) are probably better accomplished via other methods that take better advantage of genetic structure. Other analyses (e.g. clustering and proximities) do have the potential to provide insight into the structure of genetic data, but the primary aspect limiting their use is the relatively weak signal in genetic data. Many of these analyses exploit very subtle relationships in the tree structure, and since genetic data are often weakly predictive [Clayton 2009], there is often not enough information content to accurately define these relationships. Nevertheless, these methods are worthy of further exploration.

5.7. Computational Considerations & Implementations of Random Forests

Computationally, RF is ideal for large genetic datasets. Trees are fast to grow, with the primary computational burden derived from pruning (which is not performed in RF). The algorithm scales very well with large data and is more than capable of handling GWA data. The number of observations often has a greater impact on computation, as more observations will lead to larger trees. This can be mitigated via subagging or by limiting tree size. More predictors increase computation only to the extent that mtry is increased, requiring more predictors to be searched over at each node. One of the key appeals of RF is that since each tree is grown independently, the algorithm is very easy to parallelize, increasing its computational efficiency.

To date there are a number of different implementations of RF (see Table 2 for a brief list). Since there is always potential for implemented algorithms to get lost in translation, we make no statement about the accuracy of each, as we have not used all of them. However, each has slightly different features and may be suitable for different data problems. The original code was written in Fortran by Leo Breiman and Adele Cutler and is available on their website [Breiman and Cutler 2010]. Their code was adapted for use within the R environment by Andy Liaw and Matthew Wiener in the package randomForest. Other R packages have been created either as add-ons (e.g. varSelRF) or as amendments of the RF algorithm (e.g. cforest). The original Fortran code was licensed to Salford Systems, implemented with a GUI, and is available for a licensing fee; it was one of the first versions capable of handling GWA data. Numerous open source versions are available, most geared towards handling large data problems. Possibly the most developed is Random Jungle [Schwarz et al. 2010], implemented in C++ and able to handle hundreds of thousands of predictors.

Table 2:

Some common implementations of the RF algorithm

Implementation | Resource | Notes
Random Forests | www.stat.berkeley.edu/~breiman/RandomForests/ | Fortran; no active development
randomForest | cran.r-project.org/web/packages/randomForest/ | R package
randomSurvivalForest | cran.r-project.org/web/packages/randomSurvivalForest/index.html | R package; survival data
Random Jungle | www.randomjungle.org/rjungle/ | Handles GWA data

6. Other Classifiers

Insight can be gained into an algorithm’s function and utility by comparing it to other algorithms. We briefly mention three here: K-Nearest Neighbors (K-NN), Penalized Regression and Boosting.

6.1. K-Nearest Neighbors

K-Nearest Neighbors (K-NN) goes back to the 1950s and is one of the simplest classification algorithms to implement. For simplicity, assume the input vectors are real valued; for a new point x0, calculate the distances:

$$d_i = \| x_i - x_0 \| \tag{11}$$

K is a tuning parameter that determines how many neighbors to consider, ordered by d_i. The classification for x0 is the majority vote of those K neighbors. A primary limitation of K-NN is that it provides little insight into VI.
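A sketch using the knn function from the class R package on simulated data:

    ## K-NN classification with K = 5
    library(class)
    set.seed(16)
    train <- matrix(rnorm(100 * 2), 100, 2)
    cl    <- factor(rbinom(100, 1, plogis(train[, 1])))
    test  <- matrix(rnorm(10 * 2), 10, 2)
    knn(train, test, cl, k = 5)  # majority vote among the 5 nearest neighbors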

Lin and Jeon [2006] compared RF to an adaptive form of K-NN. Particularly for classification, the relationship to K-NN stems from the fact that trees are grown to maximal depth, where often there will be only one class in the terminal node of a tree. Therefore over the forest of trees, the classification for a new observation will be a weighted version of a certain number of neighbors.

6.2. Penalized Regression

Penalized regression [Hastie et al. 2009] is an important class of algorithms that has increased in popularity with new computational “tricks.” For classification this is done in the context of logistic regression. The general equation to optimize is:

$$\max_{\beta_0, \beta} \left\{ \sum_{i=1}^{N} \left[ y_i(\beta_0 + \beta^T x_i) - \log\!\left(1 + e^{\beta_0 + \beta^T x_i}\right) \right] - \lambda \sum_{j=1}^{p} |\beta_j|^{\alpha} \right\} \tag{12}$$

α is a user-set value that controls the type of penalty, while λ is a tuning parameter to optimize the function. When α = 1 the penalty is an ℓ1 penalty and the algorithm is referred to as the LASSO. When α = 2 the penalty is an ℓ2 penalty and the algorithm is traditional Ridge regression. The Elastic Net algorithm allows the user to vary α between 1 and 2.

λ serves as a tuning parameter that controls the complexity of the model. The appeal of the LASSO over Ridge regression is that the LASSO penalty results in a sparser solution: while the optimal Ridge solution, like RF, shrinks all coefficients towards 0 with few exactly equal to 0, the optimal LASSO fit sets many coefficients exactly to 0. This makes the LASSO well suited for variable selection; a brief sketch follows.
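For concreteness, the sketch below fits a LASSO-penalized logistic regression with the glmnet R package, using cross-validation to choose λ; x and y are placeholder data objects. Note that glmnet parameterizes the Elastic Net by a mixing parameter that is also called alpha but differs from the exponent α of Equation (12): in glmnet, alpha = 1 gives the LASSO and alpha = 0 gives Ridge regression.

    # Minimal LASSO sketch with glmnet (x: numeric predictor matrix,
    # y: binary outcome; both placeholders).
    library(glmnet)

    # alpha = 1 selects the LASSO penalty (alpha = 0 would give Ridge);
    # cv.glmnet chooses lambda by cross-validation.
    cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

    # Coefficients at the cross-validated lambda; most are exactly zero,
    # which is what makes the LASSO useful for variable selection.
    coef(cvfit, s = "lambda.min")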

The LASSO has been successfully implemented with GWA data [Wu et al. 2009]. The one concern with the LASSO, as opposed to RF, is that it requires specifying a parametric linear model. The application by Wu et al. allows for a search over interaction terms, but it is unclear how successful it is at detecting complex genetic effects. In this sense tree structures are preferable.

The relationship between RF and Ridge regression stems from variable importance. Unlike algorithms such as the LASSO and Boosting, which tend to produce a sparse solution placing weight on only a few variables, both RF and Ridge result in shrunken VI measures, allowing many variables to “speak.” This is desirable when most variables are associated with the outcome, resulting in a more stable solution with emphasis spread across the variables. However, when association is due only to correlation (LD) with a true causal variant, this results in what have been called biased importance scores [Strobl et al. 2007].

6.3. Boosting

Boosting has seen wide application in machine learning fields but has had no known applications to genetic data (a PubMed search of the terms “boosting” and “SNP” yielded no results). While there is extensive literature on boosting, we briefly mention its appeal as a learner and offer some thoughts as to why it may not be appropriate for genetic data.

Boosting is an ensemble algorithm that, like RF, has trees as the base learner. However, unlike RF, these trees are not fully grown, often containing only a few nodes and referred to as stumps (the number of nodes is a tuning parameter). While the trees in RF (and bagging) are identically distributed, the trees in boosting are not. Instead, each observation in the training set receives a weight that is updated based on some classification error (generally an exponential loss), with greater error resulting in greater weight. Therefore each iteration of Boosting attempts to fit the classifier to those observations that are hardest to classify. By doing this, Boosting is able to decrease both variance (because of the aggregation of classifiers) and bias (by doing a better job on the observations that are misclassified).

Multiple studies have shown Boosting to be as good as, and often better than, bagging and RF [Breiman 1996b; Dietterich 2000b; Hastie et al. 2009]. Moreover, Boosting has similarities with the LASSO, in that it tends to result in a much sparser solution than does RF.

However, there are two problems with Boosting, one computational and one more systemic. Due to its bias reduction mechanism, Boosting is prone to over-fitting, so cross-validation is needed to determine when to stop growing stumps; computationally this is very expensive (see the sketch below). The more systemic problem is that the performance of Boosting degrades quickly with noisy data, particularly compared to bagging and randomization procedures [Dietterich 2000b]. Therefore, while a very attractive algorithm, Boosting is not entirely appropriate for genetic data, which contain large samples and many irrelevant variables.
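As a sketch of what a boosted fit looks like in practice, the gbm R package implements gradient boosting with shallow trees and built-in cross-validation for choosing the stopping iteration. The data frame dat with a 0/1 outcome y is a placeholder, and all tuning values below are arbitrary illustrations; gbm optimizes a logistic (or, optionally, exponential) loss rather than the reweighting description given above, but the over-fitting and stopping issues are the same.

    # Minimal boosting sketch with gbm (dat is a placeholder data frame
    # containing a 0/1 outcome y and the predictors).
    library(gbm)

    fit <- gbm(y ~ ., data = dat,
               distribution = "bernoulli",  # logistic loss for 0/1 outcomes
               n.trees = 2000,              # upper bound on boosting iterations
               interaction.depth = 1,       # stumps as the base learner
               shrinkage = 0.01,            # learning rate
               cv.folds = 5)                # CV to guard against over-fitting

    # Cross-validated choice of when to stop adding trees
    best_iter <- gbm.perf(fit, method = "cv")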

7. Conclusion

This paper walked through the theoretical background of RF, while highlighting relevant research. While in many ways RF is a black box algorithm, it can also easily be broken down into its components: classification, trees, bagging & randomization. Understanding how the algorithm works, particularly the components that control the bias and variance, allows the user to better control the output via the different tuning parameters.

Having worked with RF, we agree with Breiman that it is a great “off-the-shelf” algorithm. Despite the particular challenges presented by genetic data, reasonable settings of the tuning parameters can often be found easily. The underlying algorithm is relatively fast and has the capacity to handle large GWA studies. The non-parametric tree structure allows for the existence of conditional and higher-order effects, ideal for capturing the complex genetic relationships that are likely to exist. Users can manipulate tuning parameters to determine the settings most appropriate for their own data needs, and VI measures relevant to the specific question of interest can be created. As genetic analyses transition from a focus on VI to risk prediction, RF should continue to prove valuable. In all, RF is well suited as a stand-alone analytic tool or as a first step in conjunction with other methods.

This exposition barely touches on the many subtle questions that can be asked after fitting RF. Within the collection of trees there is a great deal of information about the relationships among the variables and observations, allowing for the exploration of clustering, proximities, and visualization. While appealing, experience has shown that these subtle relationships can only be gleaned when the overall predictor is fairly strong. Furthermore, we concur with critics of RF that it is not a perfect algorithm. The VI measures it produces are inherently ad hoc and lack formal statistical properties (some work has attempted to explore their statistical properties, but so far none have been established). The lack of sparsity makes it ill suited for variable selection.

As the “No Free Lunch” theorem states, there is no perfect algorithm [Wolpert 1996]. As with all modeling situations, it is important to find the right tool for the job. For large genetic data, RF can be the right tool, though it does need to be used appropriately and with insight; nonetheless, it is an important addition to the arsenal of tools available to genetic epidemiologists.

Footnotes

1. For this simulation a 71-parameter logistic model was used to simulate an outcome. The 71 “causal SNPs” consisted of additive, dominant, recessive, and interaction effects with varying minor allele frequencies and correlated variables. Real GWA data, pruned to an LD R2 of 90% (p = 163,231), served as the noise variables. An in-house written version of RF was used; 6,000 trees were grown with different mtry values. Top results were compared to those found in an allelic chi-square test.
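Purely to make this design concrete, the following is a greatly simplified sketch of this style of simulation in R, with far fewer SNPs and effects than the actual study and the randomForest package standing in for the authors' in-house code; all effect sizes and dimensions are arbitrary.

    # Simplified sketch of the footnote's simulation design: simulate
    # genotypes, generate a logistic outcome from a few "causal" SNPs,
    # then rank all SNPs by RF variable importance.
    library(randomForest)
    set.seed(1)

    n <- 500; p <- 1000                            # the study used p = 163,231
    maf <- runif(p, 0.05, 0.5)                     # minor allele frequencies
    X <- sapply(maf, function(f) rbinom(n, 2, f))  # additive 0/1/2 genotypes
    colnames(X) <- paste0("snp", 1:p)

    # Logistic model mixing additive, dominant, and interaction effects
    eta <- 0.5 * X[, 1] +             # additive effect
           0.8 * (X[, 2] > 0) +       # dominant effect
           0.6 * X[, 3] * X[, 4]      # interaction effect
    y <- factor(rbinom(n, 1, plogis(eta - mean(eta))))

    rf <- randomForest(X, y, ntree = 1000, importance = TRUE)
    imp <- importance(rf)
    head(imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ], 10)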

Contributor Information

Benjamin A. Goldstein, Stanford University

Eric C. Polley, National Institutes of Health

Farren B. S. Briggs, University of California, Berkeley

References

  1. Breiman L. “Bagging predictors,”. Machine Learning. 1996a;24:123–140. doi: 10.1023/A:1018054314350. [DOI] [Google Scholar]
  2. Breiman L. “Bias, variance, and arcing classifiers,”. Technical report, UC Berkeley; 1996b. [Google Scholar]
  3. Breiman L. “Heuristics of instability and stabilization in model selection,”. Annals of Statistics. 1996c;24:2350–2383. doi: 10.1214/aos/1032181158. [DOI] [Google Scholar]
  4. Breiman L. “Out-of-bag estimation,”. Technical report, UC Berkeley; 1996d. [Google Scholar]
  5. Breiman L. “Random forests,”. Machine Learning. 2001;45:5–32. doi: 10.1023/A:1010933404324. [DOI] [Google Scholar]
  6. Breiman L, Cutler A. “Random forests,”. 2010. http://www.stat.berkeley.edu/~breiman/RandomForests/.
  7. Breiman L, Friedman J, Olshen R, Stone C. Classification and Regression Trees. New York: Chapman & Hall; 1984. [Google Scholar]
  8. Briggs FB, Bartlett SE, Goldstein BA, Wang J, McCauley JL, Zuvich RL, De Jager PL, Rioux JD, Ivinson AJ, Compston A, Hafler DA, Hauser SL, Oksenberg JR, Sawcer SJ, Pericak-Vance MA, Haines JL, Barcellos LF. “Evidence for crhr1 in multiple sclerosis using supervised machine learning and meta-analysis in 12,566 individuals,”. Human molecular genetics. 2010a;19:4286–95. doi: 10.1093/hmg/ddq328. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Briggs FB, Goldstein BA, McCauley JL, Zuvich RL, De Jager PL, Rioux JD, Ivinson AJ, Compston A, Hafler DA, Hauser SL, Oksenberg JR, Sawcer SJ, Pericak-Vance MA, Haines JL, Barcellos LF. “Variation within dna repair pathway genes and risk of multiple sclerosis,”. American journal of epidemiology. 2010b;172:217–24. doi: 10.1093/aje/kwq086. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Bühlmann P, Yu B. “Analyzing bagging,”. Annals of Statistics. 2002;30:927–961. [Google Scholar]
  11. Bureau A, Dupuis J, Falls K, Lunetta KL, Hayward B, Keith TP, Van Eerdewegh P. “Identifying SNPs predictive of phenotype using random forests,”. Genetic Epidemiology. 2005;28:171–182. doi: 10.1002/gepi.20041. [DOI] [PubMed] [Google Scholar]
  12. Cantor RM, Lange K, Sinsheimer JS. “Prioritizing GWAS results: A review of statistical methods and recommendations for their application,”. Am. J. Hum. Genet. 2010;86:6–22. doi: 10.1016/j.ajhg.2009.11.017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Clayton DG. “Prediction and interaction in complex diseases,”. Plos Genetics. 2009:5. doi: 10.1371/journal.pgen.1000540. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Cordell HJ. “Detecting gene-gene interactions that underlie human diseases,”. Nature Reviews Genetics. 2009;10:392–404. doi: 10.1038/nrg2579. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Cutler A. “Fast classification using perfect random trees,”. Technical report, Utah State University; 1999. [Google Scholar]
  16. Cutler A, Stevens JR. “Random forests for microarrays,”. Methods in enzymology. 2006;411:422–32. doi: 10.1016/S0076-6879(06)11023-X. [DOI] [PubMed] [Google Scholar]
  17. De Jager PL, Chibnik LB, Cui J, Reischl J, Lehr S, Simon KC, Aubin C, Bauer D, Heubach JF, Sandbrink R, Tyblova M, Lelkova P, Havrdova E, Pohl C, Horakova D, Ascherio A, Hafler DA, Karlson EW, Freedman MS, Edan G, Hartung HP, Polman CH, Kappos L, Montalban X, Miller D, O’Connor P, Hartung HP, Comi G, Filippi M, Kappos L, Arnason BG, Cook S, Goodin DS, Jeffery D, Traboulsee A, Ebers GC, Langdon D, Goodin DS, Reder AT, Zipp F, Schimrigk S, Hartung HP, Filippi M, Hillert J. “Integration of genetic risk factors into a clinical algorithm for multiple sclerosis susceptibility: a weighted genetic risk score,”. Lancet Neurol. 2009;8:1111–1119. doi: 10.1016/S1474-4422(09)70275-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. De Lobel L, Geurts P, Baele G, Castro-Giner F, Kogevinas M, Van Steen K. “A screening methodology based on random forests to improve the detection of gene-gene interactions,”. European journal of human genetics. 2010;18:1127–32. doi: 10.1038/ejhg.2010.48. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Díaz-Uriarte R, Alvarez de Andrés S. “Gene selection and classification of microarray data using random forest,”. BMC Bioinformatics. 2006;7:3. doi: 10.1186/1471-2105-7-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Dietterich T. “Ensemble methods in machine learning,”. Lecture Notes in Computer Science. 2000a;1857:1–15. doi: 10.1007/3-540-45014-9_1. [DOI] [Google Scholar]
  21. Dietterich T. “An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization,”. Machine Learning. 2000b;40:139–157. doi: 10.1023/A:1007607513941. [DOI] [Google Scholar]
  22. Dietterich T, Kong EB. “Machine learning bias, statistical bias, and statistical variance of decision tree algorithms,”. Technical report, Oregon State University; 1995. [Google Scholar]
  23. Friedman JH. “On bias, variance, 0/1—loss, and the curse-of-dimensionality,”. Data Mining and Knowledge Discovery. 1997;1:55–77. doi: 10.1023/A:1009778005914. [DOI] [Google Scholar]
  24. Friedman JH, Hall P. “On bagging and nonlinear estimation,”. Journal of Statistical Planning and inference. 2007;137:669–683. doi: 10.1016/j.jspi.2006.06.002. [DOI] [Google Scholar]
  25. García-Magariños M, López-de-Ullibarri I, Cao R, Salas A. “Evaluating the ability of tree-based methods and logistic regression for the detection of SNP-SNP interaction,”. Ann Hum Genet. 2009;73:360–369. doi: 10.1111/j.1469-1809.2009.00511.x. [DOI] [PubMed] [Google Scholar]
  26. Gareth JM. “Variance and bias for general loss functions,”. Machine Learning. 2003;51:115–135. doi: 10.1023/A:1022899518027. [DOI] [Google Scholar]
  27. Genuer R, Poggi JM, Tuleau C. “Random Forests: some methodological insights,”. Technical report, INRIA; 2008. URL http://hal.inria.fr/inria-00340725/en/. [Google Scholar]
  28. Glaser B, Nikolov I, Chubb D, Hamshere ML, Segurado R, Moskvina V, Holmans P. “Analyses of single marker and pairwise effects of candidate loci for rheumatoid arthritis using logistic regression and random forests,”. BMC Proceedings. 2007;1(Suppl 1):S54. doi: 10.1186/1753-6561-1-s1-s54. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Goldstein BA, Hubbard AE, Barcellos LF. “A generalized approach for testing the association of a set of predictors with an outcome: A gene based test,”. Technical Report 274, UC Berkeley; 2011. [Google Scholar]
  30. Goldstein BA, Hubbard AE, Cutler A, Barcellos LF. “An application of random forests to a genome-wide association dataset: Methodological considerations & new findings,”. BMC Genetics. 2010;11:49. doi: 10.1186/1471-2156-11-49. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Hastie T, Tibshirani R, Friedman J. Elements of Statistical Learning. 2 edition. New York: Springer; 2009. [DOI] [Google Scholar]
  32. Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. “Random Survival Forests,”. The Annals of Applied Statistics. 2008;2:841–860. doi: 10.1214/08-AOAS169. [DOI] [Google Scholar]
  33. Janssens AC, Ioannidis JP, van Duijn CM, Little J, Khoury MJ, Grips Group “Strengthening the reporting of Genetic RIsk Prediction Studies: the GRIPS Statement,”. PLoS Med. 2011;8:e1000420. doi: 10.1371/journal.pmed.1000420. [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Jiang R, Tang W, Wu X, Fu W. “A random forest approach to the detection of epistatic interactions in case-control studies,”. BMC bioinformatics. 2009;10:S65. doi: 10.1186/1471-2105-10-S1-S65. [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Kohavi R, Wolpert DH. “Bias plus variance decomposition for zero-one loss functions,”. Machine Learning: Proceedings of the Thirteenth International Conference.1996. [Google Scholar]
  36. Lee SS, Sun L, Kustra R, Bull SB. “Em-random forest and new measures of variable importance for multi-locus quantitative trait linkage analysis,”. Bioinformatics. 2008;24:1603–10. doi: 10.1093/bioinformatics/btn239. [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Lin Y, Jeon Y. “Random forests and adaptive nearest neighbors,”. Journal of the American Statistical Association. 2006:101. [Google Scholar]
  38. Lunetta KL, Hayward LB, Segal J, Van Eerdewegh P. “Screening large-scale association study data: exploiting interactions using random forests,”. BMC Genetics. 2004;5:32. doi: 10.1186/1471-2156-5-32. [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. Maenner MJ, Denlinger LC, Langton A, Meyers KJ, Engelman CD, Skinner HG. “Detecting gene-by-smoking interactions in a genome-wide association study of early-onset coronary heart disease using random forests,”. BMC proceedings. 2009;3:S88. doi: 10.1186/1753-6561-3-s7-s88. [DOI] [PMC free article] [PubMed] [Google Scholar]
  40. McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JP, Hirschhorn JN. “Genome-wide association studies for complex traits: consensus, uncertainty and challenges,”. Nat. Rev. Genet. 2008;9:356–369. doi: 10.1038/nrg2344. [DOI] [PubMed] [Google Scholar]
  41. McKinney BA, Reif DM, Ritchie MD, Moore JH. “Machine learning for detecting gene-gene interactions: a review,”. Applied bioinformatics. 2006;5:77–88. doi: 10.2165/00822942-200605020-00002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. Meng Y, Yang Q, Cuenco KT, Cupples LA, Destefano AL, Lunetta KL. “Two-stage approach for identifying single-nucleotide polymorphisms associated with rheumatoid arthritis using random forests and bayesian networks,”. BMC proceedings. 2007;1:S56. doi: 10.1186/1753-6561-1-s1-s56. [DOI] [PMC free article] [PubMed] [Google Scholar]
  43. Meng YA, Yu Y, Cupples LA, Farrer LA, Lunetta KL. “Performance of random forest when SNPs are in linkage disequilibrium,”. BMC Bioinformatics. 2009;10:78. doi: 10.1186/1471-2105-10-78. [DOI] [PMC free article] [PubMed] [Google Scholar]
  44. Nicodemus KK, Wang W, Shugart YY. “Stability of variable importance scores and rankings using statistical learning tools on single-nucleotide polymorphisms and risk factors involved in gene x gene and gene x environment interactions,”. BMC proceedings. 2007;1:S58. doi: 10.1186/1753-6561-1-s1-s58. [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. “Principal components analysis corrects for stratification in genome-wide association studies,”. Nature Genetics. 2006;38:904–909. doi: 10.1038/ng1847. [DOI] [PubMed] [Google Scholar]
  46. Rodin AS, Litvinenko A, Klos K, Morrison AC, Woodage T, Coresh J, Boerwinkle E. “Use of wrapper algorithms coupled with a random forests classifier for variable selection in large-scale genomic association studies,”. Journal of computational biology. 2009;16:1705–18. doi: 10.1089/cmb.2008.0037. [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Schwarz DF, König IR, Ziegler A. “On safari to random jungle: a fast implementation of random forests for high-dimensional data,”. Bioinformatics. 2010;26:1752–8. doi: 10.1093/bioinformatics/btq257. [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. Schwarz DF, Szymczak S, Ziegler A, König IR. “Evaluation of single-nucleotide polymorphism imputation using random forests,”. BMC Proc. 2009;3(Suppl 7):S65. doi: 10.1186/1753-6561-3-s7-s65. [DOI] [PMC free article] [PubMed] [Google Scholar]
  49. Schwarz DF, Szymczak S, Ziegler A, König IR. “Picking single-nucleotide polymorphisms in forests,”. BMC Proceedings. 2007;1:S59. doi: 10.1186/1753-6561-1-s1-s59. [DOI] [PMC free article] [PubMed] [Google Scholar]
  50. Segal MR. “Machine learning benchmarks and random forests regression,”. Technical report, CBMB Working Paper; 2004.
  51. Statnikov A, Wang L, Aliferis CF. “A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification,”. BMC bioinformatics. 2008;9:319. doi: 10.1186/1471-2105-9-319. [DOI] [PMC free article] [PubMed] [Google Scholar]
  52. Strobl C, Boulesteix AL, Kneib T, Augustin T, Zeileis A. “Conditional variable importance for random forests,”. BMC bioinformatics. 2008;9:307. doi: 10.1186/1471-2105-9-307. [DOI] [PMC free article] [PubMed] [Google Scholar]
  53. Strobl C, Boulesteix AL, Zeileis A, Hothorn T. “Bias in random forest variable importance measures: illustrations, sources and a solution,”. BMC Bioinformatics. 2007;8:25. doi: 10.1186/1471-2105-8-25. [DOI] [PMC free article] [PubMed] [Google Scholar]
  54. Sun YV. “Multigenic modeling of complex disease by random forests”. Advances in Genetics. 2010;72:73–99. doi: 10.1016/B978-0-12-380862-2.00004-7. [DOI] [PubMed] [Google Scholar]
  55. Sun YV, Cai Z, Desai K, Lawrance R, Leff R, Jawaid A, Kardia SL, Yang H. “Classification of rheumatoid arthritis status with candidate gene and genome-wide single-nucleotide polymorphisms using random forests,”. BMC proceedings. 2007;1(Suppl 1):S62. doi: 10.1186/1753-6561-1-s1-s62. [DOI] [PMC free article] [PubMed] [Google Scholar]
  56. Svetnik V, Liaw A, Tong C. “Variable selection in random forest with application to quantitative structure-activity relationship,”. In: Intrator N, Masulli F, editors. Proceedings of the 7th Course on Ensemble Methods for Learning Machines; Springer-Verlag; 2004. [Google Scholar]
  57. The 1000 Genomes Project Consortium “A map of human genome variation from population-scale sequencing,”. Nature. 2010;467:1061–1073. doi: 10.1038/nature09534. [DOI] [PMC free article] [PubMed] [Google Scholar]
  58. The International HapMap Consortium “The International HapMap Project,”. Nature. 2003;426:789–796. doi: 10.1038/nature02168. [DOI] [PubMed] [Google Scholar]
  59. The International Schizophrenia Consortium “Common polygenic variation contributes to risk of schizophrenia and bipolar disorder,”. Nature. 2009;460:748–752. doi: 10.1038/nature08185. [DOI] [PMC free article] [PubMed] [Google Scholar]
  60. Tibshirani R. “Bias, variance and prediction error for classification rules,”. Technical report, University of Toronto; 1996. [Google Scholar]
  61. Tuglus C, van der Laan MJ. “Modified fdr controlling procedure for multi-stage analyses,”. Statistical applications in genetics and molecular biology. 2009;8 doi: 10.2202/1544-6115.1397. Article 12. [DOI] [PMC free article] [PubMed] [Google Scholar]
  62. Wang M, Chen X, Zhang H. “Maximal conditional chi-square importance in random forests,”. Bioinformatics (Oxford England) 2010;26:831–7. doi: 10.1093/bioinformatics/btq038. [DOI] [PMC free article] [PubMed] [Google Scholar]
  63. Wolpert D. “The lack of a priori distinctions between learning algorithms,”. Neural Computation. 1996;8:1341–1390. doi: 10.1162/neco.1996.8.7.1341. [DOI] [Google Scholar]
  64. Wolpert DH, Macready WG. “An efficient method to estimate bagging’s generalization error,”. Machine Learning. 1999;35:41–55. doi: 10.1023/A:1007519102914. [DOI] [Google Scholar]
  65. Wu TT, Chen YF, Hastie T, Sobel E, Lange K. “Genome-wide association analysis by lasso penalized logistic regression”. Bioinformatics. 2009;25:714–21. doi: 10.1093/bioinformatics/btp041. [DOI] [PMC free article] [PubMed] [Google Scholar]
