Abstract
One common strategy for detecting disease-associated genetic markers is to compare the genotype distributions between cases and controls, where cases have been diagnosed as having the disease condition. In a study of a complex disease with a heterogeneous etiology, the sampled case group most likely consists of people having different disease subtypes. If we conduct an association test by treating all cases as a single group, we maximize our chance of finding genetic risk factors with a homogeneous effect, regardless of the underlying disease etiology. However, this strategy might diminish the power for detecting risk factors whose effect size varies by disease subtype. We propose a robust statistical procedure to identify genetic risk factors that have either a uniform effect for all disease subtypes or heterogeneous effects across different subtypes, in situations where the subtypes are not predefined but can be characterized roughly by a set of clinical and/or pathologic markers. We demonstrate the advantage of the new procedure through numeric simulation studies and an application to a breast cancer study.
Keywords: Breast cancer, Etiology heterogeneity, Genetic association study, Multiple-comparison adjustment, Tree-based model
1. Introduction
Recent genome-wide association studies (GWAS) have identified many common genetic variants associated with a variety of disease outcomes, including cancers. One common strategy for detecting the disease-associated genetic markers (e.g. single-nucleotide polymorphisms, called SNPs) is to compare the genotype distributions between cases and controls. The case group usually consists of subjects who are diagnosed as having the disease. In some complex diseases, it is well known that there is heterogeneity in the disease etiology, representing potentially different biological mechanisms or pathways to the disease. For example, in breast cancer it has been shown that estrogen receptor (ER)-positive and -negative breast tumors arise from somewhat different etiologic pathways. Among the 67 established loci associated with overall breast cancer risk, 7 are specific to ER-negative tumors (Garcia-Closas and others, 2013).
In a genetic association study of a complex disease with a heterogeneous etiology, the sampled case group most likely consists of people having different disease etiologies. If we conducted a genetic association test by treating all of the cases as a single group, we would maximize our likelihood of finding genetic risk factors that have a uniform effect on the disease risk, regardless of the underlying etiology. But this strategy could diminish the power for detecting genetic risk factors with a heterogeneous effect, whose effect size (such as the odds ratio) varies by clinical and/or pathological characteristics of the disease. To detect a risk factor with a heterogeneous effect of this kind, it would be ideal to classify cases into different groups such that the risk factor under consideration has a constant effect on each individual case group. In real applications, the disease subtype is not known a priori. Instead, we usually have a set of biomarkers (different from the set of genetic markers under testing) that measure various pathologic and molecular characteristics of the disease. We assume that those features can be used to predict disease subtypes. We aim to develop a robust statistical procedure to detect genetic risk factors that have either a uniform effect on all disease subtypes or heterogeneous effects on different subtypes, in situations where the disease subtype is not defined a priori, but can be approximately characterized by a set of biomarkers.
A common approach in the literature is to use a polytomous logistic regression (PLR) model to evaluate the effect of the genetic risk factor by allowing its effect to vary among a set of predefined disease subtypes (Begg and Zabor, 2012; Schroeder and Weinberg, 2001). If the focus is on the comparison of the risk effect (as measured by the odds ratio) on two predefined disease subtypes, a simpler case-series approach using only cases is sufficient (Begg and Zhang, 1994). A two-stage regression model has been proposed to model the heterogeneous effect of a risk factor among different levels of measured disease characteristics (Chatterjee, 2004). The focus of the two-stage approach is to evaluate whether the effect of the risk factor varies by multiple disease characteristics, but not to test the global effect of the risk factor.
Here we propose to use the binary partitioning algorithm (Breiman and others, 1984; Zhang and Singer, 2010) to search for an appropriate disease subtype classification that can be interpreted through a binary decision tree, with the splitting rules defined on the basis of biomarkers. The global test for the genetic risk factor is derived from the PLR model, with the multi-level outcome being the disease subtype defined by the final tree model.
2. Methods
2.1. Notations and assumptions
We consider a case–control study consisting of $n_1$ cases and $n_0$ controls. For the $i$th subject, we have the observation $(y_i, g_i, \mathbf{x}_i)$, with $y_i$ being the case–control status (1 for case, 0 for control), $g_i$ being the (germline) genotype of the genetic marker under study, coded as 0, 1, or 2 representing the number of minor alleles, and $\mathbf{x}_i$ being the vector of covariates. We assume all covariates are discrete due to the constraint imposed by the resampling-based procedure described later. For the $i$th case, $i = 1, \ldots, n_1$, we have measurements on a set of $M$ biomarkers, $\mathbf{b}_i = (b_{i1}, \ldots, b_{iM})$, characterizing various properties of the disease condition. Throughout this paper, we reserve the term “biomarker” for those measurements on cases (e.g. tumor characteristics) that are used for disease subtype classification.
Since in our procedure we use binary splitting rules derived from biomarkers to build a tree model that defines potential disease subtypes, we assume all considered biomarkers have been converted to binary splitting variables (used as splitting rules) according to certain user-defined schemes. For example, we can derive a total of $2^{k-1} - 1$ binary splitting variables from a categorical biomarker with $k$ classes (that is, the number of ways to bipartition the $k$ classes). We can derive $k - 1$ splitting variables for a discrete ordinal biomarker with $k$ levels if we want the splitting rules to be consistent with the ordering restriction. A continuous biomarker can first be converted to an ordinal variable with a limited number of levels, and then be treated the same way as an ordinal variable. We can consider more complicated splitting variables based on multiple biomarkers. For instance, in the simulation studies and real data application described below, we consider splitting variables defined by every two biomarkers simultaneously. That is, for each pair of considered biomarkers, we can derive a set of binary splitting variables, each representing a way to bipartition the space of the joint observation on the two biomarkers. For example, for two binary biomarkers, their joint observation has four different configurations. We can derive seven ($= 2^{4-1} - 1$) binary splitting variables based on these two biomarkers.
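The counts quoted above can be checked by enumeration. The following is a minimal sketch, not the authors' implementation; the function names `categorical_splits`, `ordinal_splits`, and `pair_splits` are hypothetical and introduced only for illustration.

```python
from itertools import combinations


def categorical_splits(levels):
    """All ways to bipartition unordered levels into two non-empty groups."""
    levels = list(levels)
    if len(levels) < 2:
        return []
    first, rest = levels[0], levels[1:]
    splits = []
    # Keep the first level on the left side so each bipartition is listed once.
    # r < len(rest) guarantees that the right side is never empty.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left = {first, *extra}
            splits.append((left, set(levels) - left))
    return splits


def ordinal_splits(levels):
    """Order-preserving bipartitions: cut between two consecutive levels."""
    levels = list(levels)
    return [(set(levels[:i]), set(levels[i:])) for i in range(1, len(levels))]


def pair_splits(levels_a, levels_b):
    """Bipartitions of the joint configurations of two biomarkers."""
    joint = [(a, b) for a in levels_a for b in levels_b]
    return categorical_splits(joint)


print(len(categorical_splits("abcd")))    # 2**(4-1) - 1 = 7
print(len(ordinal_splits([1, 2, 3, 4])))  # 4 - 1 = 3
print(len(pair_splits([0, 1], [0, 1])))   # four joint configurations -> 7
```

Each returned pair of sets corresponds to one binary splitting variable: a case is sent to one offspring node if its biomarker value (or joint configuration) falls in the first set, and to the other node otherwise.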
As articulated in Section 1, we expect that there is etiologic heterogeneity in the disease group, that is, each case has a latent variable $d$, classifying its disease outcome into one of $J$ subtypes. We let $d = 0$ for a control. Here, we caution that the subtype is defined specifically for the genetic risk factor under study since its definition is tailored to the risk factor’s effect size. The effect of a genetic marker (as measured by the odds ratio) is expected to vary over the (unobserved) disease subtypes. We assume that there is no interaction between $g$ and any of the covariates x. The null hypothesis we want to test is that the genotype $g$ is independent of case–control status $y$, as well as of any of the biomarkers $b_1, \ldots, b_M$.
2.2. The top-down procedure
Although we do not know the latent disease subtype $d$, we propose to use the binary tree-based partition algorithm (Breiman and others, 1984; Zhang and Singer, 2010) to identify an optimal classification rule that classifies the case group into non-overlapping “homogeneous” subgroups (as surrogates for the latent disease subtype) such that the SNP has a systematically higher or lower effect on cases in the same subgroup. The final association test is designed to assess the association between the testing SNPs and any of the disease subtypes, with its significance level being evaluated by a bootstrap procedure that accounts for multiple comparisons in the search for the optimal classification rule. Here is a brief summary of the procedure, for which details will be provided below.
Step 1: Starting from the top node, which includes all cases, apply a top-down “goodness-of-split” tree building algorithm to form additional branches by further splitting nodes selectively in turn until a total of $K$ terminal nodes (twigs of the tree) have been defined. Each split results in a tree model $T_k$, $k = 2, \ldots, K$, with $k$ terminal nodes defining $k$ different disease subtypes.
Step 2: Based on the criteria to be described below, identify the “most significant” disease subtype classification $T_{k^*}$ among $T_k$, $k = 1, \ldots, K$, and evaluate the association between the candidate SNP and any of the disease subtypes defined by $T_{k^*}$.
2.2.1. Details on Step 1
We adopt the following searching strategy (Yu and others, 2010) to identify a sequence of nested tree-structure models as candidates for the disease subtype classification. We start from the tree model with just the root node that includes all cases, called $T_1$, then split the root node into two offspring nodes to obtain the tree model $T_2$. To find $T_3$, we want to expand $T_2$ by splitting one of its two terminal nodes. We choose the one with the larger “goodness-of-split” (to be defined below), and expand $T_2$ by including the two offspring nodes of the node that is chosen for splitting. Suppose we have obtained a tree model with $k$ terminal nodes, $T_k$. We then choose the terminal node of $T_k$ with the largest “goodness-of-split” and expand $T_k$ by including the two offspring nodes of the chosen node to obtain a tree model $T_{k+1}$ with $k + 1$ terminal nodes. The process continues until a tree with $K$ terminal nodes is reached.
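The greedy expansion described above can be sketched as follows. This is an illustrative outline under simplifying assumptions, not the authors' code: `grow_nested_trees` and `toy_best_split` are hypothetical names, a tree is represented only by its list of terminal nodes, and the toy scoring function merely stands in for the "goodness-of-split" defined later in this section.

```python
def grow_nested_trees(root, best_split, max_leaves):
    """Return the nested sequence of trees, each given as its list of terminal nodes.

    best_split(node) must return (score, left, right) for the best permissible
    split of the node, or None when the node cannot be split.
    """
    leaves = [root]
    trees = [list(leaves)]  # first tree: root node only
    while len(leaves) < max_leaves:
        # Find the best permissible split of every current terminal node.
        candidates = []
        for idx, node in enumerate(leaves):
            result = best_split(node)
            if result is not None:
                score, left, right = result
                candidates.append((score, idx, left, right))
        if not candidates:
            break  # no terminal node can be split any further
        # Expand the terminal node with the largest goodness-of-split.
        score, idx, left, right = max(candidates, key=lambda c: c[0])
        leaves = leaves[:idx] + [left, right] + leaves[idx + 1:]
        trees.append(list(leaves))
    return trees


def toy_best_split(node, min_size=2):
    """Toy stand-in: halve a sorted tuple of numbers and score by its spread."""
    if len(node) < 2 * min_size:
        return None
    mid = len(node) // 2
    return (max(node) - min(node), node[:mid], node[mid:])


trees = grow_nested_trees((1, 2, 3, 4, 10, 20, 30, 40), toy_best_split, max_leaves=6)
for tree in trees:
    print(len(tree), tree)
```

In this toy run the search stops at four terminal nodes because no remaining node is large enough to split, which mirrors the minimum-node-size restriction discussed below. The sketch re-evaluates every terminal node in each round for clarity; caching the scores of untouched nodes would avoid the repeated work.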
In the above procedure, a node (i.e. cases in this node) is split according to a chosen splitting rule $s$, which is defined by one of the splitting variables derived from considered biomarkers. For example, the top node in Figure 1 is divided by measurement at biomarker CK5, with cases having CK5 $= 0$ or 1 being sent to the left or right offspring nodes, respectively. The procedure evaluates all possible splitting rules defined by individual splitting variables, identifies an optimal one meeting certain criteria, and uses this rule to split a given node. To ensure the robustness of the procedure, a splitting rule is ignored if either of its two resultant offspring nodes contains fewer subjects than a given threshold (e.g. 40). For the same reason, a node is not considered for splitting if it contains fewer cases than the given threshold (e.g. 40).
Figure 1.

The identified optimal tree model for the disease subtype definition for the SNP rs10941679 based on the top-down-1 procedure that uses one biomarker to define each splitting rule. All considered biomarkers are defined in Table S1 of supplementary material available at Biostatistics online.
We consider the following metric to evaluate each splitting rule for a given node $t$. Given a candidate splitting rule $s$, which partitions cases in node $t$ into the left and right offspring nodes $t_L$ and $t_R$, we define a “subtype” for a case as $d = 1$ if it is assigned to node $t_L$, or $d = 2$ if it is assigned to node $t_R$. We consider the following PLR model, using cases in node $t$ ($d = 1$ or 2) and all available controls (with their subtype defined as $d = 0$),

$$\log \frac{P(d = j \mid g, \mathbf{x})}{P(d = 0 \mid g, \mathbf{x})} = \alpha_j + \boldsymbol{\beta}_j^{T} \mathbf{x} + \gamma_j g, \quad j = 1, 2,$$

with $(\alpha_j, \boldsymbol{\beta}_j, \gamma_j)$, $j = 1, 2$, being the subtype-specific regression coefficients for the intercept term, covariates x, and genotype $g$ at the testing SNP. A good splitting rule should generate two subtypes with very different $\gamma_j$, $j = 1, 2$. Based on the above model, we have

$$\log \frac{P(d = 2 \mid g, \mathbf{x})}{P(d = 1 \mid g, \mathbf{x})} = (\alpha_2 - \alpha_1) + (\boldsymbol{\beta}_2 - \boldsymbol{\beta}_1)^{T} \mathbf{x} + (\gamma_2 - \gamma_1) g.$$

This suggests a simpler way to evaluate the splitting rule. In fact, we can fit the following logistic regression model, using only cases in node $t$, with the binary outcome defined by $d$ ($d = 1$ or 2),

$$\log \frac{P(d = 2 \mid g, \mathbf{x})}{P(d = 1 \mid g, \mathbf{x})} = \theta_0 + \boldsymbol{\theta}_x^{T} \mathbf{x} + \theta_g g, \tag{2.1}$$

with $(\theta_0, \boldsymbol{\theta}_x, \theta_g)$ being the unknown regression coefficients. The same argument has been used in the case-series approach (Begg and Zhang, 1994). The corresponding Wald test statistic for the effect of $g$ in model (2.1) can be used to evaluate the difference between $\gamma_1$ and $\gamma_2$. We can define the Wald test statistic as

$$W(s, t) = \frac{\hat{\theta}_g^{2}}{\widehat{\mathrm{Var}}(\hat{\theta}_g)}, \tag{2.2}$$

where $\hat{\theta}_g$ is the maximum likelihood estimate (MLE) for the coefficient $\theta_g$, and $\widehat{\mathrm{Var}}(\hat{\theta}_g)$ is the estimated variance for $\hat{\theta}_g$. $W(s, t)$ can be treated as a measure for the “goodness-of-split” (LeBlanc and Crowley, 1993) of the candidate splitting rule $s$ at node $t$. A large $W(s, t)$ would suggest that the testing SNP probably has different effects on the two subtypes defined by $s$. For a given node $t$, among all its possible splitting rules, we choose the one that yields the maximum Wald statistic, as given by (2.2), and use it to split the node. We define the “goodness-of-split” for node $t$ as

$$G(t) = \max_{s} W(s, t), \tag{2.3}$$

where the maximum is taken over all permissible splitting rules at node $t$.
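As an illustration of the case-only evaluation of a splitting rule, the sketch below fits a logistic regression of offspring-node membership on genotype by Newton-Raphson and returns the squared Wald statistic for the genotype coefficient. It is a simplified sketch rather than the authors' implementation: it assumes no covariates are adjusted for, it assumes the maximum likelihood estimate exists (no perfect separation, and at least two distinct genotype values in the node), and the names `wald_split_statistic` and `goodness_of_split` are hypothetical.

```python
import math


def wald_split_statistic(genotypes, in_right):
    """Wald chi-square for the genotype effect in a case-only logistic
    regression of node membership (1 = right offspring, 0 = left) on genotype."""
    a, c = 0.0, 0.0  # intercept and genotype coefficient
    for _ in range(50):  # Newton-Raphson iterations
        # Score vector (u0, u1) and information matrix [[i00, i01], [i01, i11]].
        u0 = u1 = i00 = i01 = i11 = 0.0
        for g, y in zip(genotypes, in_right):
            p = 1.0 / (1.0 + math.exp(-(a + c * g)))
            w = p * (1.0 - p)
            u0 += y - p
            u1 += (y - p) * g
            i00 += w
            i01 += w * g
            i11 += w * g * g
        det = i00 * i11 - i01 * i01
        step_a = (i11 * u0 - i01 * u1) / det
        step_c = (i00 * u1 - i01 * u0) / det
        a += step_a
        c += step_c
        if abs(step_a) + abs(step_c) < 1e-10:
            break
    # Variance of the genotype coefficient from the inverse information matrix,
    # evaluated at the last iterate before the final (negligible) update.
    var_c = i00 / det
    return c * c / var_c


def goodness_of_split(genotypes, splitting_vars, min_size=40):
    """Maximum Wald statistic over permissible binary splitting variables.

    splitting_vars maps a rule name to a 0/1 list with one entry per case in
    the node. Returns (statistic, rule name), or None if no rule is permissible.
    """
    best = None
    for name, z in splitting_vars.items():
        n_right = sum(z)
        if min(n_right, len(z) - n_right) < min_size:
            continue  # one offspring node would be too small
        w = wald_split_statistic(genotypes, z)
        if best is None or w > best[0]:
            best = (w, name)
    return best
```

A function of this shape could be plugged into a tree-growing loop as the node-level scoring step. Adjusting for covariates would require a general logistic regression routine in place of the two-parameter Newton-Raphson update used here.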
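Each candidate tree model $T_k$ provides a classification rule that categorizes cases into $k$ subtypes, and equivalently, defines a new disease outcome $d^{(k)}$ with $k$ subtypes, with $d^{(k)} = 1, \ldots$, or $k$, corresponding to cases falling into each of the terminal nodes of $T_k$, and $d^{(k)} = 0$ for all controls. We fit the following PLR model to assess the overall effect of the testing SNP on the derived disease outcome $d^{(k)}$,

$$\log \frac{P(d^{(k)} = j \mid g, \mathbf{x})}{P(d^{(k)} = 0 \mid g, \mathbf{x})} = \alpha_j + \boldsymbol{\beta}_j^{T} \mathbf{x} + \gamma_j g, \quad j = 1, \ldots, k, \tag{2.4}$$

with $(\alpha_j, \boldsymbol{\beta}_j, \gamma_j)$, $j = 1, \ldots, k$, being the subtype-specific regression coefficients. We can use the Wald statistic for testing whether or not $\gamma_1 = \cdots = \gamma_k = 0$ to assess the overall effect of the testing SNP. The Wald statistic can be written as

$$W_k = \hat{\boldsymbol{\gamma}}^{T} \hat{\boldsymbol{\Sigma}}^{-1} \hat{\boldsymbol{\gamma}}, \tag{2.5}$$

where $\hat{\boldsymbol{\gamma}}$ is the MLE for $\boldsymbol{\gamma} = (\gamma_1, \ldots, \gamma_k)^{T}$, and $\hat{\boldsymbol{\Sigma}}$ is the estimated covariance matrix for $\hat{\boldsymbol{\gamma}}$.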
2.2.2. Details on Step 2
Because of the selection process involved in identifying each candidate tree $T_k$, under the null hypothesis $W_k$ no longer follows the $\chi^2_k$ distribution. The $p$-value estimated from the $\chi^2_k$ distribution would be too small. We propose the following resampling-based procedure to estimate the $p$-value $p_k$ for each candidate tree $T_k$, $k = 1, \ldots, K$. We assume that all covariates x are categorical variables. To regenerate the case–control studies under the null hypothesis, we randomly permute the genotype among subjects with the same observed covariates x, while keeping other observations on each subject unchanged. By using this type of permutation procedure, we can regenerate datasets that maintain the same relationship among the outcome, the covariates, and the set of biomarkers characterizing the disease conditions as that in the observed dataset.
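The stratified permutation described above can be sketched in a few lines. This is an illustrative sketch only; the function name `permute_genotypes_within_strata` is hypothetical, and covariates are assumed to be supplied as one tuple of discrete values per subject (cases and controls pooled).

```python
import random
from collections import defaultdict


def permute_genotypes_within_strata(genotypes, covariates, rng):
    """Shuffle genotypes among subjects that share the same covariate pattern.

    Case-control status and biomarkers are left untouched by the caller, so
    their joint relationship with the covariates is preserved.
    """
    strata = defaultdict(list)
    for i, x in enumerate(covariates):
        strata[tuple(x)].append(i)
    permuted = list(genotypes)
    for members in strata.values():
        shuffled = [genotypes[i] for i in members]
        rng.shuffle(shuffled)
        for i, g in zip(members, shuffled):
            permuted[i] = g
    return permuted


genotypes = [0, 1, 2, 2, 1, 0]
covariates = [(0,), (0,), (1,), (1,), (1,), (0,)]
null_genotypes = permute_genotypes_within_strata(genotypes, covariates, random.Random(2015))
print(null_genotypes)
```

Within each covariate stratum the multiset of genotypes is unchanged, so any genotype-covariate association in the observed data carries over to every null dataset, while the link between genotype and the outcome or biomarkers is broken.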
Below is a summary of the resampling-based procedure for estimating the $p$-value $p_k$ for candidate trees $T_k$, $k = 1, \ldots, K$.
(i) Based on the observed data, obtain the candidate trees $T_k$ and their associated Wald statistics $W_k$, $k = 1, \ldots, K$.
(ii) Use the resampling-based procedure described above to generate $B$ sets of null datasets. For the $b$th null dataset, $b = 1, \ldots, B$, obtain the candidate trees $T_k^{(b)}$ and their associated Wald statistics $W_k^{(b)}$, $k = 1, \ldots, K$.
(iii) Estimate the $p$-value $p_k$ for the candidate tree $T_k$ as $p_k = \frac{1}{B} \sum_{b=1}^{B} I\left(W_k^{(b)} \ge W_k\right)$, with $I(\cdot)$ being the indicator function.
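The last step of this summary amounts to a simple counting operation. The sketch below assumes the estimate is the proportion of null Wald statistics that are at least as large as the observed one (some authors instead add one to both numerator and denominator); the function name `empirical_p_values` and the toy numbers are hypothetical.

```python
def empirical_p_values(observed, null_stats):
    """Per-tree empirical p-values.

    observed   : list of observed Wald statistics, one per candidate tree
    null_stats : list of B rows, each holding the Wald statistics obtained
                 from one null dataset, in the same tree order
    """
    n_null = len(null_stats)
    return [
        sum(1 for row in null_stats if row[k] >= w_obs) / n_null
        for k, w_obs in enumerate(observed)
    ]


observed = [5.0, 9.0]
null_stats = [[1.0, 10.0], [6.0, 2.0], [3.0, 3.0], [7.0, 8.0]]
print(empirical_p_values(observed, null_stats))  # [0.5, 0.25]
```

Note that each null row must come from rebuilding the candidate trees on the permuted dataset, so that the selection of splitting rules is repeated under the null.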
After we have identified the $p$-values for individual candidate trees, we pick the candidate tree $T_{k^*}$ with the smallest $p$-value as the definition for the disease subtype, and use the $p_{k^*}$ as the final test statistic to test whether the testing SNP is associated with the disease outcome. We want to caution that the identified $T_{k^*}$ is just one of several possible ways for the disease subtype classification. In fact, we find in many numerical experiments that there are usually multiple candidate trees with their $p$-values close to the minimum one. We do not claim that the one with the smallest $p$-value is the best one to estimate the disease subtype. Our focus is on the detection of association, but not on the identification of disease subtypes.
To evaluate the significance level of the test statistic, which is $p_{k^*} = \min_{1 \le k \le K} p_k$, one generally needs a two-layer permutation procedure, with the inner layer used to evaluate $p_k$, $k = 1, \ldots, K$, for the observed and resampled datasets. Obviously, this process could be very computationally intensive. Instead, we use a computationally efficient minP algorithm (Ge and others, 2003) to evaluate the $p_k$, as well as the $p$-value for $p_{k^*}$, through a single-layer resampling procedure.
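The idea behind the single-layer evaluation can be sketched as follows: the same set of null statistics is used both to convert statistics into per-tree $p$-values and to build the null distribution of the minimum $p$-value. This is a simplified illustration, not the published minP implementation (which relies on sorting to avoid the quadratic cost of the inner counting used here); the name `min_p_test` and the toy numbers are hypothetical.

```python
def min_p_test(observed, null_stats):
    """Single-layer min-p evaluation.

    Returns (per-tree p-values, 1-based index of the selected tree,
    p-value of the minimum p-value statistic).
    """
    n_null = len(null_stats)
    n_trees = len(observed)
    columns = [[row[k] for row in null_stats] for k in range(n_trees)]

    def p_values(stats):
        # Proportion of null statistics at least as large as each statistic.
        return [
            sum(1 for w in columns[k] if w >= stats[k]) / n_null
            for k in range(n_trees)
        ]

    obs_p = p_values(observed)
    obs_min = min(obs_p)
    # Treat each null dataset in turn as if it were the observed one.
    null_min = [min(p_values(row)) for row in null_stats]
    overall_p = sum(1 for m in null_min if m <= obs_min) / n_null
    best_tree = obs_p.index(obs_min) + 1
    return obs_p, best_tree, overall_p


observed = [5.0, 9.0]
null_stats = [[1.0, 10.0], [6.0, 2.0], [3.0, 3.0], [7.0, 8.0]]
print(min_p_test(observed, null_stats))  # ([0.5, 0.25], 2, 0.5)
```

In this toy example the smallest per-tree $p$-value is 0.25, but the adjusted $p$-value is 0.5, which illustrates how the procedure pays for having searched over several candidate trees.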
We point out that $p_1$ is the $p$-value corresponding to the standard association test that treats all cases as a single group. So the proposed test indeed allows for the possibility that the testing SNP has a homogeneous effect on all cases, regardless of their disease conditions characterized by the set of biomarkers.
2.3. The bottom-up procedure
The top-down procedure searches for the optimal partition among the ones with a binary tree structure. One advantage of focusing on a tree-structured partition is that the derived disease subtype classification is relatively easy to interpret. But the top-down procedure may not lead to an optimal sub-classification of cases. In addition to the top-down procedure, we propose a bottom-up procedure to enhance the search for the optimal disease subtype definition. The proposed bottom-up procedure is not restricted to partitions with a tree structure. This would permit a more thorough search for the appropriate disease subtype, with some sacrifice of interpretability. More details on the bottom-up procedure are given in supplementary material available at Biostatistics online.
2.4. Adjusting covariates with constant effects
So far we have considered the PLR model given by (2.4) as the basic model to evaluate the effect of the testing SNP on a set of derived disease subtypes. In this model, all covariates (x) are allowed to have different effects on different subtypes. This modeling strategy is robust, but could yield unstable estimates when the number of covariates or the number of considered subtypes is relatively large. If we know some covariates are not likely to have strongly heterogeneous effects, we can consider a more parsimonious model, and modify the procedure slightly. More details are given in supplementary material available at Biostatistics online.
3. Simulation Studies
In the simulation studies, we considered a genetic association study consisting of 1500 cases and 1500 controls. For each subject ($y = 0$ or 1 for a control or case, respectively), we assumed that we had the measures on a binary covariate $x$ and the genotype $g$ on the testing SNP. For each case, we had measurements on $M$ binary biomarkers $b_1, \ldots, b_M$, characterizing the disease property. Details on how those measurements were generated are given in supplementary material available at Biostatistics online.
Under each of the seven considered disease risk models given in supplementary material available at Biostatistics online, we simulated 1000 datasets, each with 1500 cases and 1500 controls. We analyzed each dataset using each of the following four tests: the trend test based on the standard logistic regression model that treats all cases as one group, and models the genotype (coded as 0, 1, or 2) as a continuous variable (Trend); the test based on the proposed top-down procedure using splitting variables derived from individual biomarkers (top-down); the test based on the bottom-up procedure using splitting variables derived from individual biomarkers (bottom-up); and the one based on a simple-minded procedure (called one-split) that assumes that cases can be classified by one of the considered biomarkers into two subtypes. More specifically, in the one-split procedure, for each $b_m$, $m = 1, \ldots, M$, we divide the cases into two subgroups according to their measures on $b_m$, and obtain the 2-d.f. Wald test statistic $W_m$, which is based on the PLR model for the derived three-level outcome (0 for controls, 1 and 2 for the two case subgroups). The final test statistic used in the one-split procedure is defined as $\max_{1 \le m \le M} W_m$. The significance level of the test can be obtained using a resampling-based procedure similar to the one used in the top-down procedure.
We compared the performance of the various tests at the nominal level of 0.05. We used 5000 resampling steps to evaluate the significance level of the three tests, one-split, top-down, and bottom-up. For the top-down and bottom-up procedures, we set the minimum number of subjects in a node at 40, and the maximum number of terminal nodes $K$ at 6. The same setup was also applied to the real data application. Simulation results are summarized in Table 1. Under Models 1 and 2 (both are null models), all considered tests maintain their type I error rates properly. Under the homogeneous model (Model 3), the standard trend test has the best performance as it targets the right risk model. Among the other tests designed for more complicated risk models, the top-down and bottom-up tests outperform the one-split test, since the homogeneous risk model is one of their considered risk models. The one-split test has the best performance under Model 4, where the disease subtype is defined by one biomarker. This is to be expected, as the one-split test is designed to target this kind of risk model, while the top-down and bottom-up tests consider a wider range of risk models. Under the other three more complicated risk models (Models 5–7), the top-down and bottom-up procedures clearly outperform the trend and one-split tests. It is interesting to note that the bottom-up test can have a noticeable advantage over the top-down test in some cases (e.g. Model 5).
Table 1.
Power and Type I error evaluations for all considered tests
| Risk model | Trend | One-split | Top-down | Bottom-up |
|---|---|---|---|---|
| Model 1 | 0.043 | 0.046 | 0.048 | 0.045 |
| Model 2 | 0.045 | 0.044 | 0.044 | 0.043 |
| Model 3 | 0.764 | 0.606 | 0.667 | 0.669 |
| Model 4 | 0.204 | 0.730 | 0.668 | 0.579 |
| Model 5 | 0.485 | 0.674 | 0.752 | 0.817 |
| Model 6 | 0.428 | 0.638 | 0.786 | 0.799 |
| Model 7 | 0.197 | 0.393 | 0.474 | 0.482 |
Results are based on 1000 generated datasets under each considered risk model.
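As a rough check on the type I error entries of Table 1, one can compare each estimate with the Monte Carlo standard error implied by 1000 simulated datasets. The sketch below is an added illustration, not part of the original analysis; it uses a simple binomial standard error at the nominal level 0.05, and the function names are hypothetical.

```python
import math


def monte_carlo_se(rate, n_datasets):
    """Binomial standard error of an estimated rejection rate."""
    return math.sqrt(rate * (1.0 - rate) / n_datasets)


def within_two_se(estimate, nominal, n_datasets):
    """True if the estimate lies within two standard errors of the nominal level."""
    return abs(estimate - nominal) <= 2.0 * monte_carlo_se(nominal, n_datasets)


# Type I error estimates for the two null models, copied from Table 1.
null_rates = {
    "Model 1": [0.043, 0.046, 0.048, 0.045],
    "Model 2": [0.045, 0.044, 0.044, 0.043],
}

se = monte_carlo_se(0.05, 1000)
print(round(se, 4))  # about 0.0069
for model, rates in null_rates.items():
    print(model, all(within_two_se(r, 0.05, 1000) for r in rates))
```

All eight estimates fall within two standard errors (about 0.014) of 0.05, which is consistent with the statement that the tests maintain their type I error rates.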
We also conducted additional simulation studies to see whether we can improve the power of top-down and bottom-up procedures by considering more complicated splitting rules. We considered two more procedures, the top-down-2 and bottom-up-2, which were based on the proposed top-down and bottom-up procedures using splitting variables derived from pairs of biomarkers. Results are summarized in supplementary material available at Biostatistics online.
In summary, the top-down and bottom-up tests appear to have the most robust performance under various kinds of risk models.
4. Real Data Application
We used the genetic association study from the NCI Polish Breast Cancer Study (Garcia-Closas and others, 2006) to demonstrate the application of our methods. The dataset consisted of 1200 breast cancer cases and 2400 controls, with genotypes at 19 breast-cancer-related SNPs that had been established by previous GWAS (Broeks and others, 2011; Figueroa and others, 2011). We focused on breast cancer cases that had complete information on the following tumor characteristics: ER, PR, HER2, EGFR, CK5, tumor grade (GD), tumor histology (HT), tumor size (SZ), and number of positive nodes (NP). A detailed description of these variables is given in Table S2 of supplementary material available at Biostatistics online. Since there was no evidence that any considered SNP was associated with the only potential confounder, the age at enrollment, we did not adjust for any covariate in the analysis.
Four of the 19 SNPs (rs13387042, rs2981582, rs614367, and rs10995190) had $p$-values $<$ 0.01 according to the standard trend test (Trend). We excluded those four SNPs from further analysis, as our intent was to find SNPs that were missed by the Trend test due to the existence of disease subtypes. We analyzed each of the remaining 15 SNPs using three additional tests: top-down, bottom-up, and one-split.
The testing results on the 15 SNPs are given in Table 2. The result for SNP rs10941679 is particularly interesting. With all cases considered as one group, the standard association test (Trend) based on the logistic regression model had a $p$-value of 0.099. The one-split test detected some suggestive association evidence ($p$-value $= 0.050$) by allowing the SNP to have different effects on two disease subtypes defined by one of the considered biomarkers. The more sophisticated tests (top-down and bottom-up), which conducted a more thorough search for disease subtypes, identified a much stronger SNP-disease association signal ($p$-values were 0.0061 and 0.0043 for top-down and bottom-up, respectively). The optimal tree model identified by the top-down procedure predicted five disease subtypes, represented by the terminal nodes of the tree (Figure 1). We further fitted a PLR model for this derived outcome (0 for control, 1–5 for each of the defined groups). We present the odds ratios (ORs) for the SNP rs10941679 in Table S3 of supplementary material available at Biostatistics online. It should be noted that the estimated ORs are biased because of the model selection. The bottom-up procedure identified two disease subtypes, which were formed by merging several subtypes identified by the top-down procedure. Estimated ORs based on that fitted PLR model are summarized in Table S4 of supplementary material available at Biostatistics online.
Table 2.
Testing results (p-values) for SNP-disease association by four considered tests in the NCI Polish Breast Cancer Study
| SNP ID | Position | Trend | One-split | Top-down | Bottom-up |
|---|---|---|---|---|---|
| rs11249433 | 1:121280613 | 3.14E-2 | 3.59E-2 | 8.24E-2 | 7.26E-2 |
| rs1045485 | 2:202149589 | 5.18E-2 | 4.30E-1 | 1.17E-1 | 1.10E-1 |
| rs4973768 | 3:27416013 | 8.32E-1 | 4.46E-2 | 1.06E-1 | 2.24E-1 |
| rs4415084 | 5:44662515 | 1.92E-1 | 1.50E-1 | 3.19E-1 | 3.51E-1 |
| rs10941679 | 5:44706498 | 9.88E-2 | 5.03E-2 | 6.12E-3 | 4.26E-3 |
| rs889312 | 5:56031884 | 2.69E-1 | 5.99E-1 | 5.16E-1 | 4.87E-1 |
| rs2046210 | 6:151948366 | 2.24E-1 | 1.00E-1 | 2.29E-1 | 4.01E-1 |
| rs13281615 | 8:128355618 | 3.07E-2 | 2.26E-1 | 7.90E-2 | 7.07E-2 |
| rs1011970 | 9:22062134 | 8.02E-2 | 1.47E-1 | 2.03E-1 | 1.79E-1 |
| rs2380205 | 10:5886734 | 7.76E-1 | 8.19E-1 | 9.64E-1 | 9.54E-1 |
| rs704010 | 10:80841148 | 4.82E-1 | 4.54E-1 | 7.00E-1 | 6.44E-1 |
| rs3817198 | 11:1909006 | 4.04E-1 | 3.50E-1 | 2.09E-1 | 2.03E-1 |
| rs10483813 | 14:69031284 | 3.83E-2 | 1.25E-1 | 1.05E-1 | 9.18E-2 |
| rs6504950 | 17:53056471 | 2.03E-1 | 1.64E-1 | 3.46E-1 | 3.80E-1 |
| rs8170 | 19:17389704 | 5.72E-1 | 4.86E-1 | 6.56E-1 | 7.59E-1 |
The first number represents the chromosome ID, the second number indicates the SNP position on the chromosome.
In the above application of the top-down and bottom-up procedures, we considered splitting rules defined by one biomarker. We also conducted additional analyses using the top-down and bottom-up procedures by considering splitting rules defined by two biomarkers simultaneously (top-down-2 and bottom-up-2) in the tree-building algorithm. We found 2 SNPs (rs10941679 and rs11249433) among the 15 SNPs that had at least one $p$-value $<$ 0.05. Results on those two SNPs are summarized in Table 3. In particular, on SNP rs11249433, the top-down-2 ($p$-value $= 0.0097$) and bottom-up-2 ($p$-value $= 0.0075$) procedures detected more significant results than the other tests. The final optimal tree model from the top-down-2 procedure defined six disease subtypes (Figure 2), with estimated ORs from the fitted PLR model given in Table S5 of supplementary material available at Biostatistics online. The bottom-up-2 procedure identified four disease subtypes for rs11249433 (see the subtype definition in Table S6 of supplementary material available at Biostatistics online). On the other hand, the top-down and bottom-up procedures using splitting rules defined by individual biomarkers found no obvious disease subtypes for rs11249433 (i.e. the best tree model has just one root node).
Table 3.
Testing results on rs10941679 and rs11249433
| Test | rs10941679 | rs11249433 |
|---|---|---|
| Trend | 9.88E-2 | 3.14E-2 |
| Top-down | 6.12E-3 | 8.24E-2 |
| Bottom-up | 4.26E-3 | 7.26E-2 |
| Top-down-2 | 2.69E-2 | 9.66E-3 |
| Bottom-up-2 | 8.53E-2 | 7.53E-3 |
Figure 2.
The optimal tree model for the disease subtype definition for the SNP rs11249433 based on the top-down-2 procedure that uses two biomarkers simultaneously to define each splitting rule. All considered biomarkers are defined in Table S1 of supplementary material available at Biostatistics online.
In summary, using the proposed methods we can detect stronger signals on SNPs rs10941679 and rs11249433 than with the conventional approach that assumes a homogeneous SNP effect on all breast cancer subtypes. In fact, there is evidence that both rs10941679 and rs11249433 have non-homogeneous effects (Figueroa and others, 2011; Milne and others, 2011).
5. Discussion
For some diseases with a complex etiology, such as breast cancer, it has been shown that for a considerable number of SNPs, their effects varied by disease subtype. Here, we have developed a robust association-testing procedure that aims at detecting those SNPs with a heterogeneous effect, while maintaining a reasonably good power to assess SNPs with a homogeneous or near-homogeneous effect.
When there is no etiology heterogeneity, an SNP can still appear to have a varied effect among cases with different expression levels on a biomarker if this SNP happens to influence the biomarker’s expression. In this situation, the standard approach that considers all cases as a homogeneous group is most appropriate for detecting the SNP-disease association, as it is derived from the correct SNP-disease model.
We considered splitting rules derived from individual biomarkers, as well as the ones derived from pairs of biomarkers. Although the use of splitting rules defined by multiple biomarkers has the potential to find an optimal partition that would have been missed by looking at one biomarker at a time, it also increases the penalty for multiple comparisons. If the number of considered biomarkers is relatively small (e.g. 10), it might be worthwhile to consider more complicated splitting rules. But as shown in the simulation studies (supplementary material available at Biostatistics online), the use of more complicated splitting rules is not always better. We have only explored the use of splitting rules defined by two biomarkers. More studies are needed if we want to consider splitting rules defined by more than two biomarkers.
The proposed procedure relies on measurements on a set of biomarkers, which characterize various pathologic and molecular characteristics of the disease, to classify cases into different subtypes. In real applications, it is possible that not all biomarkers have been measured for some cases. If there are a substantial number of cases with missing values, removing those cases would result in a loss of power. Instead, we can use the idea of surrogate variable in the classification and regression tree model (Breiman and others, 1984; Therneau and Atkinson, 1997) to decide to which of the offspring nodes a case should be sent if he or she has a missing value on the biomarker chosen to split the current node. We can use the algorithm implemented in RPART (Therneau and Atkinson, 1997) to identify surrogate variables. Further investigations are needed to achieve a better understanding of the performance of the surrogate variables and other techniques, such as the EM algorithm, for dealing with missing data.
We have implemented the proposed procedure in an R package called Tree-Het, which relies on functions provided by R packages RPART (Therneau and Atkinson, 1997) and VGAM (Yee, 2010). The R package is freely available from http://dceg.cancer.gov/tools/analysis/het-tree.
Supplementary material
Supplementary Material is available at http://biostatistics.oxfordjournals.org.
Funding
The work of Kai Yu, Han Zhang, Hisani N. Horne, and Jinbo Chen was supported in part by the Intramural Program of the NIH and the National Cancer Institute. The work of Jinbo Chen was supported by R01-ES016626. The authors of the study thank Pei Chao and Michael Stagner from Information Management Services (Silver Spring, MD, USA) for data management support; the participants, physicians, pathologists, nurses, and interviewers from participating centers in Poland for their efforts during field work; and Drs. Montse Garcia-Closas, Louise Brinton, Mark Sherman, Beata Peplonska, and Jolanta Lissowska for their contributions to the study design of the NCI Polish Breast Cancer Study.
Acknowledgements
This research utilized the high-performance computational capabilities of the Biowulf PC/Linux cluster at the National Institutes of Health, Bethesda, MD, USA (http://biowulf.nih.gov). Conflict of Interest: None declared.
Footnotes
In the original version, the surname of author Hisani N. Horne was misspelt as Home. This has now been corrected.
References
- Begg C. B., Zabor E. C. Detecting and exploiting etiologic heterogeneity in epidemiologic studies. American Journal of Epidemiology. 2012;176:512–518. doi: 10.1093/aje/kws128.
- Begg C. B., Zhang Z. F. Statistical analysis of molecular epidemiology studies employing case-series. Cancer Epidemiology, Biomarkers & Prevention. 1994;3:173–175.
- Breiman L., Friedman J. H., Olshen R. A., Stone C. J. Classification and Regression Trees. Belmont, CA: Wadsworth Publishing Co; 1984.
- Broeks A., Schmidt M. K., Sherman M. E. and others. Low penetrance breast cancer susceptibility loci are associated with specific breast tumor subtypes: findings from the Breast Cancer Association Consortium. Human Molecular Genetics. 2011;20:3289–3303. doi: 10.1093/hmg/ddr228.
- Chatterjee N. A two-stage regression model for epidemiological studies with multivariate disease classification data. Journal of the American Statistical Association. 2004;99:127–138.
- Figueroa J. D., Garcia-Closas M., Humphreys M. and others. Associations of common variants at 1p11.2 and 14q24.1 (RAD51L1) with breast cancer risk and heterogeneity by tumor subtype: findings from the Breast Cancer Association Consortium. Human Molecular Genetics. 2011;20:4693–4706. doi: 10.1093/hmg/ddr368.
- Garcia-Closas M., Couch F. J., Lindstrom S. and others. Genome-wide association studies identify four ER negative-specific breast cancer risk loci. Nature Genetics. 2013;45:398, e391–392. doi: 10.1038/ng.2561.
- Garcia-Closas M., Egan K. M., Newcomb P. A. and others. Polymorphisms in DNA double-strand break repair genes and risk of breast cancer: two population-based studies in USA and Poland, and meta-analyses. Human Genetics. 2006;119:376–388. doi: 10.1007/s00439-006-0135-z.
- Ge Y., Dudoit S., Speed T. P. Resampling-based multiple testing for microarray data analysis. Test. 2003;18:1–44.
- LeBlanc M., Crowley J. Survival trees by goodness of split. Journal of the American Statistical Association. 1993;88:457–467.
- Milne R. L., Goode E. L., Garcia-Closas M. and others. Confirmation of 5p12 as a susceptibility locus for progesterone-receptor-positive, lower grade breast cancer. Cancer Epidemiology, Biomarkers & Prevention. 2011;20:2222–2231. doi: 10.1158/1055-9965.EPI-11-0569.
- Schroeder J. C., Weinberg C. R. Use of missing-data methods to correct bias and improve precision in case-control studies in which cases are subtyped but subtype information is incomplete. American Journal of Epidemiology. 2001;154:954–962. doi: 10.1093/aje/154.10.954.
- Therneau T. M., Atkinson E. J. An introduction to recursive partitioning using the RPART routines. Mayo Foundation Technical Report. 1997.
- Yee T. W. The VGAM package for categorical data analysis. Journal of Statistical Software. 2010;32:1–34.
- Yu K., Wheeler W., Li Q. and others. A partially linear tree-based regression model for multivariate outcomes. Biometrics. 2010;66:89–96. doi: 10.1111/j.1541-0420.2009.01235.x.
- Zhang H. P., Singer R. Recursive Partitioning and Applications. 2nd edition. New York: Springer; 2010.