PLOS One. 2012 Nov 8;7(11):e47281. doi: 10.1371/journal.pone.0047281

LBoost: A Boosting Algorithm with Application for Epistasis Discovery

Bethany J Wolf 1,*, Elizabeth G Hill 1, Elizabeth H Slate 2, Carola A Neumann 3, Emily Kistner-Griffin 1
Editor: Jérémie Bourdon 4
PMCID: PMC3493573  PMID: 23144812

Abstract

Many human diseases are attributable to complex interactions among genetic and environmental factors. Statistical tools capable of modeling such complex interactions are necessary to improve identification of genetic factors that increase a patient's risk of disease. Logic Forest (LF), a bagging ensemble algorithm based on logic regression (LR), is able to discover interactions among binary variables predictive of response such as the biologic interactions that predispose individuals to disease. However, LF's ability to recover interactions degrades for more infrequently occurring interactions. A rare genetic interaction may occur if, for example, the interaction increases disease risk in a patient subpopulation that represents only a small proportion of the overall patient population. We present an alternative ensemble adaptation of LR based on boosting rather than bagging called LBoost. We compare the ability of LBoost and LF to identify variable interactions in simulation studies. Results indicate that LBoost is superior to LF for identifying genetic interactions associated with disease that are infrequent in the population. We apply LBoost to a subset of single nucleotide polymorphisms on the PRDX genes from the Cancer Genetic Markers of Susceptibility Breast Cancer Scan to investigate genetic risk for breast cancer. LBoost is publicly available on CRAN as part of the LogicForest package, http://cran.r-project.org/.

Introduction

Many common diseases are heterogeneous, developing as a result of complex gene-gene and gene-environment interactions [1]–[3]. The heterogeneity of cancer, for example, is well documented and many authors note that distinct genetic patterns in cancer result in significant differences in disease outcome [4]–[6]. While a particular disease pathway may account for a majority of cases, there may be alternative pathways that account for only a small proportion of cases. Statistical methods capable of identifying key components in multiple disease pathways can aid in understanding an individual's risk of developing disease, in disease prognosis, and in prediction of response to therapy [7], [8].

Logic regression (LR) is a single tree-based method capable of modeling high-order interactions [9]. LR generates classification rules by constructing Boolean (and = ∧, or = ∨, and not = !) combinations of binary (0/1) predictors for classification of a binary response. For example, LR might produce the tree, Inline graphic, which predicts a response value of 1 if either Inline graphic or Inline graphic is true. Otherwise, the predicted response is 0. All LR trees can be expressed as a disjunction of conjunctions as in the second expression for tree Inline graphic. The conjunctive interactions described by the tree are referred to as prime implicants (PIs). Tree Inline graphic is composed of the two PIs, Inline graphic and Inline graphic, both of size 2 as each includes two variables. LR can identify PIs ranging in size from 1 to 8 predictors, and thus PI is a general term describing main effects and interactions. LR has been used in the development of screening and diagnostic tools for prostate and colorectal cancer, and to identify single nucleotide polymorphisms (SNPs) that confer risk in cardiovascular disease [10]–[13].

Tree-based classifiers are unbiased base classifiers but they are highly variable. The predictive accuracy of a tree-based classifier can be improved by using an ensemble of learners when predicting an observation's class [14]–[16]. The ensemble allows averaging across base learners resulting in an unbiased aggregated learner with reduced variability. One powerful approach to constructing ensemble-based learners is bagging, that is, the construction of classifiers from multiple bootstrap samples drawn from training data. Logic Forest (LF) is a bagged version of LR that generates an ensemble of logic regression-grown trees of varying sizes [17]. LF shows improved predictive performance over LR and is better able to discover PIs significantly associated with response, even in data with predictors measured with error and in data in which not all variables significantly associated with the response are observed. However, the ability of both LR and LF to recover PIs associated with response degrades for infrequently occurring PIs [17]. A rare PI would occur if, for example, the PI is highly predictive of disease for a patient subpopulation that represents only a small proportion of the overall patient population.

Boosting is a powerful alternative algorithm for constructing ensemble learners that reweights the training data at successive iterations to improve prediction of observations poorly classified at previous iterations [18]. In this paper we present a boosted version of LR we refer to as LBoost, and introduce a measure of predictor importance. We compare the performance of LBoost relative to LF considering varying frequency of occurrence for PIs associated with response and varying model complexity. We also apply LBoost to a subset of SNP data from the Cancer Genetic Markers of Susceptibility (CGEMS) Breast Cancer Scan [19]–[21] to investigate genetic risk variants.

Methods

Define data Inline graphic where Inline graphic is a vector of Inline graphic binary responses and Inline graphic is an Inline graphic matrix of Inline graphic binary predictors with Inline graphic. The algorithm for constructing an LBoost model is shown below.

LBoost Algorithm

For data set Inline graphic

  1. Initialize a collection of observation-specific weights Inline graphic where Inline graphic and Inline graphic indexes the number of observations, Inline graphic.

  2. For Inline graphic where Inline graphic is the number of boosted LR trees constructed from data Inline graphic

    1. Randomly select a positive integer Inline graphic where Inline graphic is the maximum number of predictors in an LR tree. (Random selection of tree size has been shown to modestly improve recovery of small PIs [17].)

    2. Fit an LR tree, Inline graphic, to data Inline graphic using weights Inline graphic and with no more than Inline graphic predictors.

    3. Compute the weighted error for Inline graphic according to:
      graphic file with name pone.0047281.e032.jpg
      where Inline graphic is the predicted value for the Inline graphicth observation from tree Inline graphic
    4. Using the weighted error compute a tree-specific weight for tree Inline graphic according to:
      graphic file with name pone.0047281.e037.jpg
    5. Update observation-specific weights according to:
      graphic file with name pone.0047281.e038.jpg
  3. The forest of Inline graphic boosted trees is Inline graphic.
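The reweighting loop in steps 1 to 3 can be sketched in code. The sketch below is illustrative only and rests on stated assumptions: the equation images for steps 2c to 2e are not reproduced here, so the standard discrete AdaBoost forms are assumed (weighted error as the weighted share of misclassified observations, tree weight as the log odds of a correct classification, and exponential up-weighting of misclassified observations). The function `fit_stub_tree` is a hypothetical stand-in for an LR tree fit by simulated annealing; it simply searches single predictors and two-way conjunctions, and the random choice of tree size in step 2a is omitted.

```python
import math
from itertools import combinations

def fit_stub_tree(X, y, w):
    """Hypothetical stand-in for an LR tree: the single predictor or
    two-way AND with the smallest weighted misclassification error."""
    p = len(X[0])
    candidates = [(j,) for j in range(p)] + list(combinations(range(p), 2))

    def weighted_error(cols):
        return sum(wi for xi, yi, wi in zip(X, y, w)
                   if int(all(xi[j] for j in cols)) != yi)

    return min(candidates, key=weighted_error)

def predict_stub_tree(cols, x):
    """Predict 1 when every predictor in the conjunction equals 1."""
    return int(all(x[j] for j in cols))

def boost(X, y, n_trees):
    """Return a list of (tree, tree_weight) pairs."""
    n = len(y)
    w = [1.0 / n] * n                                # step 1: equal weights (assumed)
    forest = []
    for _ in range(n_trees):
        tree = fit_stub_tree(X, y, w)                # step 2b
        miss = [int(predict_stub_tree(tree, xi) != yi) for xi, yi in zip(X, y)]
        eps = sum(wi * mi for wi, mi in zip(w, miss)) / sum(w)   # step 2c (assumed form)
        eps = min(max(eps, 1e-10), 1 - 1e-10)        # guard against log(0)
        alpha = math.log((1 - eps) / eps)            # step 2d (assumed form)
        w = [wi * math.exp(alpha * mi) for wi, mi in zip(w, miss)]  # step 2e (assumed form)
        total = sum(w)
        w = [wi / total for wi in w]                 # renormalize for numerical stability
        forest.append((tree, alpha))
    return forest
```

Because misclassified observations gain weight at each iteration, later trees are pushed toward observations that earlier trees handled poorly, which is the mechanism expected to help with rare PIs.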

In step 2b, LBoost fits the LR tree using simulated annealing with misclassification error to choose between LR trees. Simulated annealing is the default search algorithm in LR. Use of misclassification for identifying the “best” LR model limits the number of trees fit at a given iteration of LBoost/LF to one tree with a maximum of 8 predictors.

We also use cross validation (CV) when constructing the forest for development of measures of model fit (Equation 2) and PI importance (Prime Implicant Importance Measures Section). For Inline graphic-fold CV, let Inline graphic be one of Inline graphic approximately equally sized, non-overlapping subdivisions of the data. Given Inline graphic, let Inline graphic be the collection of all data subdivisions other than Inline graphic such that Inline graphic, and let Inline graphic be the number of observations in Inline graphic. We construct the Inline graphicth LBoost model using Inline graphic according to the LBoost algorithm and use Inline graphic as the Inline graphicth test data set for the measures of model fit and PI importance. The final LBoost model therefore includes Inline graphic boosted trees and is denoted Inline graphic.
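A minimal sketch of the data subdivision described above, assuming a simple random assignment of observations to folds (the source does not state how the subdivisions are drawn). The helper names are hypothetical. The tree counts mirror the simulation settings reported later in the paper: 5 folds with 20 boosted trees each give a forest of 100 trees.

```python
import random

def k_fold_indices(n_obs, k, seed=0):
    """Split observation indices into k non-overlapping, near-equal test folds."""
    idx = list(range(n_obs))
    random.Random(seed).shuffle(idx)
    return [sorted(idx[i::k]) for i in range(k)]

def train_indices(folds, k_index):
    """Indices in every fold except fold k_index (training data for that fold)."""
    return sorted(i for j, fold in enumerate(folds) if j != k_index for i in fold)

folds = k_fold_indices(n_obs=23, k=5)
trees_per_fold = 20
total_trees = len(folds) * trees_per_fold   # 5 folds x 20 trees = 100 trees
```

Each fold serves once as test data, so every observation receives a prediction only from trees that were not trained on it.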

Now consider an observation Inline graphic from the Inline graphicth test data set Inline graphic. All trees within the boosted forest Inline graphic predict class membership for this observation. If predictor values in Inline graphic produce a value of 1 for one or more of the PIs in tree Inline graphic within Inline graphic, that tree predicts class membership Inline graphic of 1; otherwise the tree predicts the class to be 0.

If we consider test data Inline graphic as new data, we can make a CV prediction for the observations in Inline graphic by taking a weighted average of the predictions for those trees in Inline graphic which were constructed from the corresponding training data Inline graphic. We can use the test data set predictions to calculate an unbiased estimate of model error rate. For observation Inline graphic in the test set corresponding to data Inline graphic (that is, for Inline graphic), the boosted CV prediction from Inline graphic is

graphic file with name pone.0047281.e072.jpg (1)

Since predictions from a logic regression tree take values of either 0 or 1, the expression Inline graphic in equation 1 takes on values of Inline graphic or Inline graphic, thereby allowing inclusion of all tree-specific weights Inline graphic in the final prediction. The CV misclassification rate for Inline graphic is

graphic file with name pone.0047281.e078.jpg (2)
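The weighted prediction and the CV misclassification rate can be illustrated as follows. This is a sketch under assumptions, since the equation images are not reproduced here: 0/1 tree predictions are recoded as 2ŷ − 1 so that they take values −1 or 1, the recoded predictions are summed with the tree-specific weights, and class 1 is predicted when the weighted sum is positive. The function names are hypothetical.

```python
def boosted_vote(tree_preds, alphas):
    """Combine 0/1 tree predictions using tree weights; return a 0/1 class."""
    score = sum(a * (2 * p - 1) for p, a in zip(tree_preds, alphas))
    return int(score > 0)

def misclassification_rate(y_true, y_pred):
    """Proportion of observations whose predicted class differs from the truth."""
    return sum(int(t != p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

With the recoding, a tree that predicts 0 contributes its weight against class 1 rather than contributing nothing, which is why every tree-specific weight enters the final prediction.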

Prime Implicant Importance Measure

In contrast to bagging, which applies the base learner to a bootstrap sample of the data, boosting is generally applied to the whole data set making it difficult to define an importance measure. To address this difficulty, we use CV to develop a measure of PI importance that can be estimated from an LBoost model, Inline graphic. For tree Inline graphic, the CV misclassification rate for test data Inline graphic is

graphic file with name pone.0047281.e082.jpg (3)

Let Inline graphic be a PI occurring in tree Inline graphic, such that Inline graphic is an Inline graphic-dimensional column vector of 0s and 1s corresponding to the PI's value for the Inline graphic observations. We extract PIs from Inline graphic using the prime.implicant function available in the logicFS package [22]. Let Inline graphic denote the matrix of all PIs in Inline graphic with Inline graphic randomly permuted. Let MC.TInline graphic denote the tree-specific misclassification rate for Inline graphic applied to Inline graphic. The permutation based variable importance measure for Inline graphic is defined by

graphic file with name pone.0047281.e096.jpg (4)
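The permutation idea behind this measure can be sketched for a single tree. The sketch assumes only what the text states: a tree is a disjunction of PIs, and one PI's column of values is randomly permuted before the tree-specific misclassification rate is recomputed. How these per-tree differences are weighted and combined across the forest is given by equation 4 and is not reproduced here; the helper names are hypothetical.

```python
import random

def tree_predict(pi_columns):
    """An LR tree as a disjunction: predict 1 if any PI equals 1 for the observation."""
    return [int(any(vals)) for vals in zip(*pi_columns)]

def misclass(y, yhat):
    """Proportion of observations misclassified."""
    return sum(int(a != b) for a, b in zip(y, yhat)) / len(y)

def permutation_delta(pi_columns, j, y, rng):
    """Change in one tree's misclassification rate after shuffling PI j."""
    baseline = misclass(y, tree_predict(pi_columns))
    shuffled = list(pi_columns[j])
    rng.shuffle(shuffled)                       # break the PI-response link
    permuted = pi_columns[:j] + [shuffled] + pi_columns[j + 1:]
    return misclass(y, tree_predict(permuted)) - baseline
```

A PI that carries real signal tends to raise the misclassification rate when permuted, whereas permuting an uninformative PI leaves it roughly unchanged.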

Simulation Studies

We conduct several simulations to examine the ability of LF and LBoost to recover PIs representing epistatic interactions between SNPs that are associated with disease. Two types of epistatic interactions are considered for the simulations comparing LBoost and LF (Table 1). An interaction of type 1 confers increased risk of disease when at least one copy of the minor allele is present from both loci; this type 1 interaction is referred to as the jointly dominant-dominant model (DD) [23]–[26]. An interaction of type 2 confers increased risk of disease if two copies of the minor allele are present from both loci; this type 2 interaction is referred to as the jointly recessive-recessive model (RR).

Table 1. Two-locus interaction models.

Type 1   AA  Aa  aa      Type 2   AA  Aa  aa
BB        0   0   0      BB        0   0   0
Bb        0   1   1      Bb        0   0   0
bb        0   1   1      bb        0   0   1

Type 1 represents a DD interaction between SNPs a and b while Type 2 represents RR interaction between a and b. A value of 1 indicates SNP combinations conferring increased risk of disease.
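The two penetrance patterns in Table 1 can be written as simple indicator functions of the minor allele counts at each locus. The function names are illustrative and the genotype coding (0, 1, or 2 copies of the minor allele) is an assumption made for the sketch.

```python
def dd_risk(a_copies, b_copies):
    """Type 1 (dominant-dominant): at least one minor allele at both loci."""
    return int(a_copies >= 1 and b_copies >= 1)

def rr_risk(a_copies, b_copies):
    """Type 2 (recessive-recessive): two minor alleles at both loci."""
    return int(a_copies == 2 and b_copies == 2)

# Rows: BB, Bb, bb (0, 1, 2 copies of b); columns: AA, Aa, aa (0, 1, 2 copies of a).
type1 = [[dd_risk(a, b) for a in range(3)] for b in range(3)]
type2 = [[rr_risk(a, b) for a in range(3)] for b in range(3)]
```

The resulting grids reproduce the two halves of Table 1: four risk cells for the DD model and a single risk cell for the RR model.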

We consider three simulation scenarios: (1) the response is associated with a single DD interaction; (2) the response is associated with two DD interactions; and (3) the response is associated with a single RR interaction. We use the liability threshold model [27], [28] to define all interaction models. Specifically, all simulated data are defined by the minor allele frequencies (MAFs) of the risk alleles, the disease prevalence, and the heritability of the epistatic interaction(s). For simplicity, risk alleles in an epistatic interaction have the same MAF. Also, for all simulations, the disease prevalence is set at 0.1 and the heritability for all epistatic interactions is set at 0.02. The disease prevalence was chosen to simulate a common disease such as breast cancer. The population level parameters for specific MAFs, threshold, and heritability are given in Table 2.

Table 2. Population values for simulation parameters.

Model                 Minor Allele Frequency   Prob(PI+)   Prob(D+ | PI+)   Prob(D+ | PI−)   OR
Dominant-dominant     0.1                      0.0361      0.2890           0.0930           3.961
Dominant-dominant     0.3                      0.2601      0.1460           0.0839           1.866
Dominant-dominant     0.5                      0.5625      0.1213           0.0726           1.763
Recessive-recessive   0.1                      0.0001      1.0000           0.0975           Inf
Recessive-recessive   0.3                      0.0081      0.6127           0.0955           14.98
Recessive-recessive   0.5                      0.0625      0.2293           0.0915           2.952

The disease prevalence is set at 0.1 and heritability is set at 0.02 for all simulations.

MAFs are the same for risk alleles in an epistatic interaction. Prob(PI+) is the probability that a subject has the PI. Prob(D+ | PI+) and Prob(D+ | PI−) are the probabilities a subject has disease given that they have the PI and do not have the PI, respectively. OR is the population odds ratio given the model, MAF, and heritability.
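As a consistency check on Table 2, the Prob(PI+) column can be reproduced from the MAF alone, assuming Hardy-Weinberg proportions at each locus and independence between the two loci (assumptions not stated in the text, but the tabulated values agree with them). The odds ratio follows from the two conditional disease probabilities. The function names are illustrative.

```python
def prob_pi(maf, model):
    """P(PI+) for two independent loci in Hardy-Weinberg equilibrium (assumed)."""
    if model == "DD":
        carrier = 1 - (1 - maf) ** 2      # at least one minor allele at a locus
    elif model == "RR":
        carrier = maf ** 2                # two minor alleles at a locus
    else:
        raise ValueError("model must be 'DD' or 'RR'")
    return carrier ** 2                   # both loci must qualify

def odds_ratio(p_d_given_pi, p_d_given_no_pi):
    """Odds of disease with the PI divided by odds of disease without it."""
    odds = lambda p: p / (1 - p)
    return odds(p_d_given_pi) / odds(p_d_given_no_pi)
```

For example, `prob_pi(0.3, "DD")` gives 0.2601 and `odds_ratio(0.6127, 0.0955)` gives roughly 14.98, matching the table up to rounding of the tabulated probabilities.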

In addition to the SNPs in the epistatic interaction(s), additional non-causal SNPs are generated such that there are 100 SNPs in the final dataset. Minor allele frequencies for the non-causal SNPs are randomly selected from between 0.05 and 0.5. For simulation scenarios 1 and 2, all SNPs are coded as an indicator for at least one copy of the minor allele. For simulation scenario 3, SNPs are coded as the indicator for two copies of the minor allele. In scenarios 1 and 3 the response is associated with the DD or RR interaction between Inline graphic and Inline graphic, thus the PI of interest is Inline graphic. In scenario 2 the response is associated with two independent DD interactions, Inline graphic and Inline graphic.

We consider sample sizes ranging from 400 to 2400, generating 500 datasets for each simulation study. We examine the ability of LF and LBoost to recover the PIs known to be associated with the response using the variable importance measure for LF, V.LF [17], and Inline graphic for LBoost. Define Inline graphic as the set of all PIs identified in either Inline graphic or Inline graphic. Let Inline graphic (Inline graphic) be the set of 20 PIs in Inline graphic or Inline graphic with maximum absolute V.LF and V.LB (4) values, respectively. We say that the PI Inline graphic, known to be associated with disease, has been recovered when Inline graphic. We select the top 20 identified PIs because in the context of studying gene-gene interactions, 20 interactions represent Inline graphic% of all possible two-locus combinations given 100 genotyped SNPs.
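The share of two-locus combinations covered by a top-20 list can be worked out directly; the short calculation below is provided only as an arithmetic aid.

```python
from math import comb

n_pairs = comb(100, 2)          # unordered pairs among 100 SNPs
share = 20 / n_pairs            # fraction of pairs covered by 20 interactions
print(f"{n_pairs} pairs; top 20 = {share:.2%}")
```

With 100 SNPs there are 4950 possible pairs, so 20 interactions correspond to about 0.4% of them.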

We use the Logic Forest package in R v.2.14.1 [29] with simulated annealing optimization to fit all LF models [9], [17]. For LBoost we use 5-fold CV and construct 20 trees for each dataset Inline graphic resulting in an LBoost model with 100 LR trees. For comparisons, all LBoost and LF models include the same number of LR trees. The same starting and ending annealing temperatures are selected for LF and LBoost. The starting temperature of 2 is selected such that approximately 90% of “new” models are accepted. The final temperature of Inline graphic is set to achieve a score where fewer than 5% of new models are accepted. The cooling schedule is set so that 50,000 iterations are required to get from start to end temperature. Increasing the number of iterations to 250,000 does not affect our findings. With these settings, the LBoost algorithm constructs a model in less than a minute on a Windows 2.26 GHz machine.

Results

Scenario 1: One Dominant-Dominant Interaction

In scenario 1, we investigate the ability of LBoost and LF to recover a single DD interaction that is associated with the response from among 100 binary variables. The minor allele frequencies of 0.1, 0.3, and 0.5 are considered. In data in which the MAFs for Inline graphic and Inline graphic were 0.1, LBoost identified the combination Inline graphic more frequently than LF, although this difference was only significant for Inline graphic (Figure 1A). LBoost recovers Inline graphic in a maximum of 88.4% of simulations, while LF recovers this PI in a maximum of 81.0% of simulation runs. When the minor allele frequencies for Inline graphic and Inline graphic are increased to 0.3, the ability of both LF and LBoost to recover Inline graphic improves. Under these conditions, LF recovers Inline graphic significantly more frequently than LBoost for Inline graphic (Figure 1B). Both LF and LBoost recover the PI in Inline graphic% of simulation runs for Inline graphic and in more than 90% of simulation runs for Inline graphic. In data in which the MAFs for Inline graphic and Inline graphic are 0.5, LF and LBoost identify Inline graphic equally well, recovering this PI in Inline graphic% of simulation runs for Inline graphic.

Figure 1. Recovery of the dominant-dominant interaction Inline graphic for MAFs of 0.1, 0.3, and 0.5.


Each panel shows the proportion of times in 500 simulation runs the DD PI Inline graphic is recovered among the top 20 PIs by each method for different MAFs for Inline graphic and Inline graphic. A) MAFs for Inline graphic and Inline graphic are 0.1, panel B) MAFs for Inline graphic and Inline graphic are 0.3, and panel C) MAFs for Inline graphic and Inline graphic are 0.5. Error bars represent 95% confidence intervals.
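For readers who wish to gauge the width of such error bars, a 95% interval for a recovery proportion over 500 runs can be approximated as follows. The paper does not state which interval method was used; the normal-approximation (Wald) interval here is an assumption chosen for simplicity, and the function name is illustrative.

```python
from math import sqrt

def wald_ci(p_hat, n_runs, z=1.96):
    """Normal-approximation confidence interval for a proportion (assumed method)."""
    half = z * sqrt(p_hat * (1 - p_hat) / n_runs)
    return max(0.0, p_hat - half), min(1.0, p_hat + half)
```

For a recovery proportion of 0.884 over 500 runs this gives an interval of roughly 0.856 to 0.912, so with 500 runs the half-width is at most about 0.044 (attained at a proportion of 0.5).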

Scenario 2: Two Independent Dominant-Dominant Interactions

In the second scenario, we investigate the ability of LBoost and LF to recover 2 DD interactions that occur with different frequency. The MAFs for the two SNPs in the PI Inline graphic are held constant at 0.1 while the MAFs for Inline graphic are set at 0.1, 0.3 or 0.5. In the first case, the MAFs for Inline graphic are set at 0.1, thus the expected frequencies of occurrence of the two PIs Inline graphic and Inline graphic are equivalent. For Inline graphic, LF and LBoost recover the PIs equally well. However, LF recovers both PIs significantly more frequently than LBoost for Inline graphic (see Figures 2A and 2B). LBoost recovers both PIs in Inline graphic% of simulation runs for Inline graphic; however, LF recovers both PIs in Inline graphic% of simulation runs for the largest sample size.

Figure 2. Recovery of the dominant-dominant interactions Inline graphic and Inline graphic for MAFs of 0.1, 0.3, and 0.5.


Each panel shows the proportion of times in 500 simulation runs the DD PIs Inline graphic and Inline graphic are recovered among the top 20 PIs by each method for different MAFs. Specifically, Panels A) and B) show the proportion of times each method recovers Inline graphic and Inline graphic respectively when MAFs for Inline graphic and Inline graphic are 0.1 and MAFs for Inline graphic and Inline graphic are 0.1. Panels C) and D) show the proportion of times each method recovers Inline graphic and Inline graphic respectively when MAFs for Inline graphic and Inline graphic are 0.3 and MAFs for Inline graphic and Inline graphic are 0.1. Panels E) and F) show the proportion of times each method recovers Inline graphic and Inline graphic respectively when MAFs for Inline graphic and Inline graphic are 0.5 and MAFs for Inline graphic and Inline graphic are 0.1. Error bars represent 95% confidence intervals.

In the second case, the MAFs for Inline graphic and Inline graphic are increased to 0.3, but the frequencies of Inline graphic and Inline graphic are held at 0.1. In this case the PI Inline graphic occurs more frequently than Inline graphic. Both LF and LBoost recover Inline graphic more frequently than in the previous case. However, LF recovers this PI significantly more frequently than LBoost for Inline graphic (Figure 2C). Both methods identify this PI in Inline graphic% of simulation runs for Inline graphic. LBoost identifies the less frequently occurring PI, Inline graphic, significantly more often than LF for Inline graphic (Figure 2D).

In the third case, the MAFs for Inline graphic are increased to 0.5 holding the frequencies for Inline graphic and Inline graphic at 0.1. There is no significant difference in the proportion of times LF and LBoost recover Inline graphic. Both methods recover this PI in Inline graphic% of simulation runs for Inline graphic (Figure 2E). However, LBoost recovers Inline graphic significantly more frequently than LF for Inline graphic (Figure 2F).

We also compare LBoost models with varying K-fold CV (K = 5, 10, and 20) with forest size KA = 100 for cases 1 and 3 for two independent DD interactions. In case 1 (MAF Inline graphic), LBoost identifies both PIs significantly more frequently using 5-fold CV relative to 20-fold CV for Inline graphic (Figure S1, panels A and B). However, there is not a significant difference between 5 and 10-fold CV. In case 3 (MAF Inline graphic and Inline graphic and MAF Inline graphic and Inline graphic), there is not a significant difference in the proportion of times LBoost recovers Inline graphic or Inline graphic for 5, 10, and 20-fold CV at any sample size (Figure S1, panels C and D).

Additionally we examine the performance of LBoost in models with 100 (with 5-fold CV) and 200 (with 10-fold CV) trees holding the ratio of total number of trees, Inline graphic, to number of CV data sets, Inline graphic, constant at 20∶1. Increasing the number of trees from 100 to 200 improves the proportion of times LBoost recovers the PIs in case 1 for Inline graphic though the difference is not significant (Figure S2, panels A and B). In case 3, there is no significant difference in the proportion of times LBoost recovers Inline graphic in models with 100 versus 200 trees (Figure S2, Panel C). However, LBoost identifies Inline graphic more often in models with 200 trees for Inline graphic, though the difference is significant only for Inline graphic.

Scenario 3: One Recessive-Recessive Interaction

In simulation scenario 3, we consider the ability of LF and LBoost to recover a single RR interaction. The probability of occurrence of the PI given disease status is less than in previous scenarios. As in the first scenario we consider MAFs of 0.1, 0.3, and 0.5 for Inline graphic and Inline graphic. When both Inline graphic and Inline graphic have MAFs of 0.1, the probability of observing Inline graphic given a subject is disease positive is 0.1%. This PI occurs so infrequently that neither LBoost nor LF identified Inline graphic at any sample size under consideration (results not shown).

When the MAFs of Inline graphic and Inline graphic are increased to 0.3, the probability of the PI given a subject has disease increases to approximately 5%. In this case, LBoost identifies Inline graphic significantly more frequently than LF for Inline graphic (Figure 3, Panel A). LBoost identified this PI in a maximum of 67% of simulation runs, whereas LF identified it in a maximum of only 17% of simulation runs. When the minor allele frequencies for Inline graphic and Inline graphic are increased to 0.5, the probability of Inline graphic given the subject has disease increases to 0.1433. The ability of both LF and LBoost to identify this PI is improved and both recover this PI in Inline graphic% of simulation runs for Inline graphic (Figure 3, Panel B). There is not a significant difference in the proportion of times each method recovers this PI at any sample size.
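The probabilities of the PI among diseased subjects quoted for the RR model follow from Bayes' rule applied to the Table 2 values and the disease prevalence of 0.1. The sketch below is a worked check; the function name is illustrative.

```python
def prob_pi_given_disease(p_d_given_pi, p_pi, prevalence=0.1):
    """P(PI+ | D+) = P(D+ | PI+) * P(PI+) / P(D+)."""
    return p_d_given_pi * p_pi / prevalence

rr_01 = prob_pi_given_disease(1.0000, 0.0001)   # MAF 0.1
rr_03 = prob_pi_given_disease(0.6127, 0.0081)   # MAF 0.3
rr_05 = prob_pi_given_disease(0.2293, 0.0625)   # MAF 0.5
```

These evaluate to about 0.001, 0.050, and 0.1433, in line with the 0.1%, approximately 5%, and 0.1433 reported for MAFs of 0.1, 0.3, and 0.5.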

Figure 3. Recovery of the recessive-recessive interaction Inline graphic for MAFs 0.3 and 0.5.


Each panel shows the proportion of times in 500 simulation runs the RR PI Inline graphic is recovered among the top 20 PIs by each method for different MAFs for Inline graphic and Inline graphic. Panel A) MAFs for Inline graphic and Inline graphic are 0.3 and panel B) MAFs for Inline graphic and Inline graphic are 0.5. Error bars represent 95% confidence intervals.

We also examine the performance of LF and LBoost in models with 100 (5-fold CV) and 200 (10-fold CV) trees holding the ratio of total number of trees, Inline graphic, to number of CV data sets, Inline graphic, constant at 20∶1. Increasing the number of trees from 100 to 200 significantly improves the proportion of times LBoost recovers Inline graphic for Inline graphic (Figure S3). In models with 200 trees, LBoost recovers Inline graphic in Inline graphic% of simulations for Inline graphic but recovers the PI in a maximum of Inline graphic% of simulations when LBoost models include 100 trees. Increasing the number of trees in an LF model does not significantly impact the ability of LF to recover Inline graphic (Figure S3).

Summary of Simulation Results

LF and LBoost exhibit similar ability to recover frequently occurring PIs. However, LBoost performs better than LF when PIs occur rarely (5 to 10% of the time among individuals with disease) and is better at recovering less frequent PIs in the presence of a frequently occurring PI. There is also a trend towards improved recovery of PIs with increasing the number of trees in an LBoost model regardless of frequency.

CGEMS Analysis

Peroxiredoxins (Prdxs) are a newly identified group of peroxidases upregulated in breast cancer [30]–[33]. No genetic analysis has been done so far to investigate the genomic integrity of the PRDX genes in breast or any other cancer. We investigate single nucleotide polymorphisms (SNPs) in the Cancer Genetic Markers of Susceptibility study (CGEMS) [19], [20] data, available in dbGaP (dbGaP accession number: phs000147.v1.p1). In total, 94 SNPs that are within 50 kb of the six PRDX genes are included in the LF and LBoost analyses.

The CGEMS study is an NCI-sponsored project begun in 2005 as a pilot study to identify genetic variants associated with increased risk of breast and prostate cancers. The CGEMS breast cancer data was derived from incident post-menopausal breast cancer cases in the Nurses' Health Study (NHS) arising between 1990 and 2004 [21]. Women in the CGEMS study provided a blood sample in 1989 or 1990 as part of the NHS and were cancer free at the time of sampling. In total 1145 incident cases were matched to 1142 controls from the NHS on age, blood collection time, ethnicity (all are self-reported Caucasian), and menopausal status at blood draw (all are menopausal at blood draw). Participants were genotyped using the Illumina HumanHap550 chip. For each subject approximately 528,000 SNPs were genotyped providing coverage of 90% of the common SNPs.

Our analysis data comprised 94 SNPs on the six PRDX genes, coded for analysis by an indicator variable that takes value 1 if the subject has at least one copy of the minor allele in order to test the dominant effect of the minor allele. The LF and LBoost models constructed for these data both contain 100 trees. The LBoost model uses 5-fold cross-validation in model construction. The LF and LBoost models each identified over 300 unique PIs involving the 94 SNPs. PI importance was ranked from least to greatest according to VIMP.LF for the LF model and according to Inline graphic for LBoost. Empirical p-values were obtained for all PIs using a permutation approach.
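One common way to obtain an empirical p-value by permutation is sketched below. The paper does not describe its permutation scheme, so every detail here is an assumption: case-control labels are shuffled, a user-supplied statistic is recomputed on each shuffle, and the p-value is the proportion of permuted statistics at least as extreme as the observed one, with the usual add-one correction. The function and argument names are hypothetical.

```python
import random

def empirical_p_value(statistic, x, y, n_perm=999, seed=0):
    """Permutation p-value for statistic(x, y) under shuffled response labels."""
    rng = random.Random(seed)
    observed = abs(statistic(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)                         # break any x-y association
        if abs(statistic(x, y_perm)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)                # add-one correction avoids p = 0
```

In an LBoost setting the statistic would be the importance of a given PI, which requires refitting or re-scoring the model on each permuted response; the number of permutations bounds the smallest attainable p-value at 1 / (n_perm + 1).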

Both LF and LBoost identified the PIs rs11198819 (pInline graphic) and rs11198819 Inline graphic rs2297696 (pInline graphic) among the top 5 most important PIs. The SNP rs2297696 is upstream of PRDX3 on the sideroflexin 4 gene. The SNP rs11198819 is downstream from PRDX3 in a non-coding region; however, it is in strong linkage disequilibrium (Inline graphic) with rs3749562, which is on the PRDX3 gene. The remaining PIs in the LF model included rs11198819 in conjunction with at least one additional SNP. LBoost identified two additional PIs not identified by LF, rs1205171 (pInline graphic) and rs1205171 Inline graphic rs1461024 (pInline graphic). The SNP rs1205171 is found on the PRDX2 gene and rs1461024 is found on the PRDX6 gene.

The moderate significance of these SNPs and SNP interactions is likely due to the fact that the PRDX family of genes does not play a dominant role in breast cancer. However, these results suggest possible associations of genetic variants within the PRDX family of genes with breast cancer. Additionally, LBoost identified SNPs not identified by LF. Further laboratory studies are necessary to explore the SNP interactions identified by LF and LBoost.

Discussion

Logic Forest, an ensemble adaptation of logic regression, has the ability to model complex interactions among binary predictors to describe disease state. However, LF is less adept in recovering rare PIs associated with disease, particularly in the presence of more frequent, predictive PIs. We introduced a boosting adaptation of LR referred to as LBoost in order to address this weakness of LF. Additionally we presented a predictor/PI importance measure based on permutation of a predictor or PI in the data, Inline graphic.

The results of the simulation study indicate that the ability of LF and LBoost to recover PIs associated with disease depends on the frequency with which a PI occurs in subjects that have disease and whether or not an additional predictive PI is present. In the scenario where the data only included a single DD interaction, LF and LBoost performed similarly, although LBoost showed modest improvement over LF in recovering Inline graphic when the minor allele frequency was low (0.1). In this case, the PI occurred in approximately 10% of subjects with disease. However, when the minor allele frequency increased to 0.3 (PI occurring in approximately 38% of subjects with disease) LF was better able to recover the PI at smaller sample sizes. The greatest difference in ability to recover a single PI occurred in data where the interaction of interest was a recessive-recessive interaction in which the MAFs for Inline graphic and Inline graphic were 0.3. In this case only 5% of subjects with disease were expected to have the PI Inline graphic and LBoost identified this PI significantly more often than LF for Inline graphic.

In data with two interactions, LBoost recovered the less frequently occurring PI, Inline graphic, significantly more frequently than LF and performed similarly to LF in recovering the more frequently occurring PI. This difference in the ability to recover the rarer PI is more pronounced as the difference in frequency of occurrence between the two PIs increases.

For a fixed number of trees, increasing the number of CV sets, Inline graphic, in an LBoost model moderately improves the ability of LBoost to identify frequent PIs. However, increasing the number of CV sets also decreases the ability of LBoost to identify rare PIs. This effect is most pronounced in data with two or more PIs where both PIs are infrequent. However, the impact of varying the number of CV sets is small and choice of Inline graphic and Inline graphic should not greatly impact the ability of LBoost to identify PIs. From experience we have found that selecting the total number of trees, Inline graphic, and the number of CV data sets, Inline graphic, such that the ratio of total trees to number of CV data sets Inline graphic provides good balance for identifying frequent and rare PIs.

Increasing the total number of trees improves LBoost's ability to identify rare PIs. This effect is most noticeable when the PI is rare (i.e. the PI occurs in 5% of the cases), and is not evident for PIs that occur with greater than 10% frequency among cases. Because little additional computational time is needed to increase the forest size from 100 to 200 trees, using the larger forest is advisable.

Both LBoost and LF are best suited for targeted investigation of SNP interactions associated with disease (e.g. pathway analysis). LBoost performs similarly to LF for frequently occurring PIs, although LF performs better for mid-range sample sizes (Inline graphic to Inline graphic). However, LBoost is better able than LF to identify rare interactions that occur in approximately 5–10% of subjects with disease. LBoost is also better adapted to identify multiple PIs in situations where PI frequency varies among the PIs predictive of disease, a scenario more closely resembling a complex disease such as cancer. Since we cannot know the data structure a priori, it is helpful to explore the predictor space using both methods.

Although we described the LBoost algorithm using LR with misclassification as the measure of goodness of fit, there are additional fit measures available in LR (e.g. deviance and least squares). There are also search algorithms other than simulated annealing that could be used to search for logical combinations of binary predictors. In subsequent work we will explore use of other LR measures of fit and additional search algorithms for identifying combinations of binary predictors in constructing LBoost models.

Supporting Information

Figure S1

Recovery of DD interactions Inline graphic and Inline graphic in LBoost models with 100 trees and 5, 10, or 20-fold CV. Each panel shows the proportion of times in 500 simulation runs the DD PIs Inline graphic and Inline graphic are recovered among the top 20 PIs by LBoost when the number of CV sets, Inline graphic, is set to either 5, 10 or 20. The total number of LR trees in all models is held constant at Inline graphic. In all panels, black is LBoost with 5-fold CV, red is LBoost with 10-fold CV, and green is LBoost with 20-fold CV. Specifically, Panels A) and B) show the proportion of times LBoost recovers Inline graphic and Inline graphic respectively for different values of Inline graphic when MAFs for Inline graphic and Inline graphic are 0.1 and MAFs for Inline graphic and Inline graphic are 0.1. Panels C) and D) show the proportion of times LBoost recovers Inline graphic and Inline graphic respectively for different values of Inline graphic when MAFs for Inline graphic and Inline graphic are 0.5 and MAFs for Inline graphic and Inline graphic are 0.1. Error bars represent 95% confidence intervals.

(BMP)

Figure S2

Recovery of DD interactions Inline graphic and Inline graphic in LBoost models with 100 or 200 trees. Each panel shows the proportion of times in 500 simulation runs the DD PIs Inline graphic and Inline graphic are recovered among the top 20 PIs by LBoost when the number of LR trees in the LBoost model is either 100 or 200. We use 5-fold CV in LBoost models with 100 LR trees and 10-fold CV in models with 200 trees. Thus the ratio of total trees to Inline graphic-fold CV is held constant at Inline graphic. In all panels, black is LBoost with 100 trees and red is LBoost models with 200 trees. Specifically, Panels A) and B) show the proportion of times LBoost recovers Inline graphic and Inline graphic respectively for models with 100 and 200 trees when MAFs for Inline graphic and Inline graphic are 0.1 and MAFs for Inline graphic and Inline graphic are 0.1. Panels C) and D) show the proportion of times LBoost recovers Inline graphic and Inline graphic respectively for models with 100 and 200 trees when MAFs for Inline graphic and Inline graphic are 0.5 and MAFs for Inline graphic and Inline graphic are 0.1. Error bars represent 95% confidence intervals.

(BMP)

Figure S3

Recovery of the RR interaction Inline graphic for MAF of 0.1 in LBoost and LF models with 100 or 200 trees. The graph shows the proportion of times in 500 simulation runs the RR PI Inline graphic is recovered among the top 20 PIs by both LBoost and LF when the number of LR trees in the models is either 100 or 200. We use 5-fold CV in LBoost models with 100 LR trees and 10-fold CV in models with 200 trees. Thus the ratio of total trees to Inline graphic-fold CV in all LBoost models is held constant at Inline graphic. In the graph, black is LF models with 100 trees, red is LF models with 200 trees, green is LBoost models with 100 trees, and blue is LBoost models with 200 trees. Error bars represent 95% confidence intervals.

(BMP)
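
The figure captions above report recovery proportions over 500 simulation runs with 95% confidence intervals. The captions do not state which interval was used, so the sketch below assumes a normal-approximation (Wald) interval purely for illustration; the function name and the example count of 350 recoveries are hypothetical and are not results from the study.

```python
import math


def wald_ci(successes, runs, z=1.96):
    """Proportion and normal-approximation 95% CI (assumed interval type)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    if not 0 <= successes <= runs:
        raise ValueError("successes must lie between 0 and runs")
    p_hat = successes / runs
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / runs)
    # Clip to [0, 1] because a proportion cannot fall outside that range.
    lower = max(0.0, p_hat - half_width)
    upper = min(1.0, p_hat + half_width)
    return p_hat, lower, upper


if __name__ == "__main__":
    # Hypothetical example: a PI ranked in the top 20 in 350 of 500 runs.
    p_hat, lower, upper = wald_ci(350, 500)
    print(f"recovery proportion {p_hat:.3f}, 95% CI ({lower:.3f}, {upper:.3f})")
```

With 500 runs the half-width of such an interval is at most about 0.044 (reached when the proportion is 0.5), which gives a sense of how large a gap between two recovery curves must be before the error bars separate.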

Acknowledgments

We appreciate the insightful comments of the reviewers, which have significantly improved this manuscript.

Funding Statement

This research was supported in part by pilot funding from an American Cancer Society Institutional Research Grant awarded to the Hollings Cancer Center, Medical University of South Carolina, by National Institutes of Health/National Institute of Dental and Craniofacial Research Grant K25DE016863, and by the South Carolina Clinical and Translational Research Institute, Medical University of South Carolina's CTSA, National Institutes of Health/National Center for Research Resources grant UL1RR029882. The contents are solely the responsibility of the authors and do not necessarily represent the official views of the National Institutes of Health. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Kumar S, Mohan A, Guleria R (2006) Biomarkers in cancer screening, research and detection: present and future: a review. Biomarkers 11: 385–405.
  • 2. Alvarez-Castro JM, Carlborg O (2007) A unified model for functional and statistical epistasis and its application in quantitative trait loci analysis. Genetics 176: 1151–1167.
  • 3. Kotti S, Bickeboller H, Clerget-Darpoux F (2007) Strategy for detecting susceptibility genes with weak or no marginal effects. Hum Hered 63: 85–92.
  • 4. Pepe MS, Etzioni R, Feng Z, Potter JD, Thompson ML, et al. (2001) Phases of biomarker development for early detection of cancer. J Natl Cancer Inst 93: 1054–1061.
  • 5. Sørlie T, Perou CM, Tibshirani R, Aas T, Geisler S, et al. (2001) Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci U S A 98: 10869–10874.
  • 6. Ertel A (2010) Bimodal gene expression and biomarker discovery. Cancer Inform 9: 11–14.
  • 7. Kaklamani VG, Gradishar WJ (2006) Gene expression in breast cancer. Curr Treat Options Oncol 7: 123–128.
  • 8. Baird AE (2010) Genetics and genomics of stroke: novel approaches. J Am Coll Cardiol 56: 245–253.
  • 9. Ruczinski I, Kooperberg C, LeBlanc M (2003) Logic regression. J Comput Graph Stat 12: 475–511.
  • 10. Etzioni R, Kooperberg C, Pepe M, Smith R, Gann PH (2003) Combining biomarkers to detect disease with application to prostate cancer. Biostatistics 4: 523–38.
  • 11. Etzioni R, Falcon S, Gann PH, Kooperberg CL, Penson DF, et al. (2004) Prostate-specific antigen and free prostate-specific antigen in the early detection of prostate cancer: do combination tests improve detection? Cancer Epidem Biomark 13: 1640–5.
  • 12. Janes H, Pepe M, Kooperberg C, Newcomb P (2005) Identifying target populations for screening or not screening using logic regression. Stats Med 24: 1321–38.
  • 13. Kooperberg C, Bis JC, Marciante KD, Heckbert SR, Lumley T, et al. (2007) Logic regression for analysis of the association between genetic variation in the renin-angiotensin system and myocardial infarction or stroke. Amer J Epidemiol 165: 334–43.
  • 14. Breiman L (1994) Bagging predictors. Technical Report 421, Department of Statistics, University of California at Berkeley 1–19.
  • 15. Dietterich TG (2000) An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Mach Learn 40: 139–57.
  • 16. Friedman J (2001) Greedy function approximation: a gradient boosting machine. Annals Stat 29: 1189–1202.
  • 17. Wolf BJ, Hill EG, Slate EH (2010) Logic forest: an ensemble classifier for discovering logical combinations of binary markers. Bioinformatics 26: 2183–2189.
  • 18. Freund Y, Schapire RE (1997) A decision-theoretic generalization of online learning and an application to boosting. J Comput Sys Sci 55: 119–139.
  • 19. National Cancer Institute (2005) Cancer genetic markers of susceptibility (CGEMS). Available: http://cgems.cancer.gov/data/. Accessed 2010 October.
  • 20. Hunter DJ, Kraft P, Jacobs KB, Cox DG, Yeager M, et al. (2007) A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nat Genet 39: 870–874.
  • 21. Tworoger SS, Eliassen AH, Sluss P, Hankinson SE (2007) A prospective study of plasma prolactin concentrations and risk of premenopausal and postmenopausal breast cancer. J Clin Oncol 25: 1482–1488.
  • 22. Schwender H (2007) logicFS: Identifying interesting SNP interactions with logicFS. Bioconductor package.
  • 23. Li W, Reich J (2000) A complete enumeration and classification of two-locus disease models. Hum Hered 50: 334–349.
  • 24. Hallgrímsdóttir IB, Yuster DS (2008) A complete classification of epistatic two-locus models. BMC Genet 9: 17.
  • 25. Cordell HJ (2002) Epistasis: what it means, what it doesn't mean, and statistical methods to detect it in humans. Hum Mol Genet 11: 2463–2468.
  • 26. VanderWeele TJ, Laird NM (2011) Tests for compositional epistasis under single interaction-parameter models. Ann Hum Genet 75: 146–156.
  • 27. Dempster ER, Lerner IM (1950) Heritability of threshold characters. Genetics 35: 212–236.
  • 28. Wray NR, Goddard ME (2010) Multi-locus models of genetic risk of disease. Genome Med 2: 10.
  • 29. R Development Core Team (2009) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Available: http://www.R-project.org. Accessed 2012 March.
  • 30. Bae JY, Ahn SJ, Han W, Noh DY (2007) Peroxiredoxin I and II inhibit H2O2-induced cell death in MCF-7 cell lines. J Cell Biochem 101: 1038–1045.
  • 31. Cao J, Schulte J, Knight A, Leslie NR, Zagozdzon A, et al. (2009) Prdx1 inhibits tumorigenesis via regulating PTEN/AKT activity. EMBO J 28: 1505–1517.
  • 32. Noh DY, Ahn SJ, Lee RA, Kim SW, Park IA, et al. (2001) Overexpression of peroxiredoxin in human breast cancer. Anticancer Res 21: 2085–2090.
  • 33. Wang T, Tamae D, LeBon T, Shively JE, Yen Y, et al. (2005) The role of peroxiredoxin II in radiation-resistant MCF-7 breast cancer cells. Cancer Res 65: 10338–10346.

Articles from PLoS ONE are provided here courtesy of PLOS
