I would like to congratulate Kang, Janes, and Huang for their interesting article on a novel boosting algorithm for combining multiple biomarkers to optimize patient treatment recommendations. Their results show the potential of modern machine learning methods, especially when one cannot correctly specify a regression model for the complex underlying relationship of interest. Since the publication of the AdaBoosting procedure by Freund and Schapire (1997), there has been great interest in seeking an explanation for its excellent performance and in generalizing the method to various settings. For example, AdaBoosting can be viewed as a numerical algorithm performing forward stagewise regression with a set of weak classifiers to minimize a regularized exponential loss function. Along this line, several versions of boosting procedures have been proposed by changing the loss function to be minimized (Friedman, Hastie, and Tibshirani, 2000). Boosting is arguably one of the best off-the-shelf methods, so it is no surprise that, when appropriately adapted, it can perform competitively well for differentiating patients with positive treatment benefit from those without. On the other hand, after carefully examining the proposed boosting algorithm in the article, I see some crucial departures from the classical version of boosting. Therefore, I will first discuss the differences between the proposed method and the conventional boosting algorithm and their potential implications. Second, I will propose simple alternatives to resolve some of the difficulties. To simplify the discussion, I will assume P(T = 1) = P(T = 0) = 0.5 throughout.
1. The Difference between the Proposal and Conventional Boosting
Suppose for the time being that the directions of the treatment effect, Si = I{Δ(Yi) ≤ 0}, i = 1, …, n, are observed. The Real AdaBoosting algorithm (a generalized version of AdaBoosting that returns a class probability) then proceeds as follows:
- Start with weights w_i^(0) = 1/n for subjects i = 1, …, n, and fit a working model to calculate p̂_0(y), the estimated probability P(S = 1 | Y = y) based on the working model.
- For m = 1, …, M,
  - update the weights according to w_i^(m) ∝ w_i^(m−1) exp{−(2Si − 1)f_{m−1}(Yi)}, where f_{m−1}(y) = (1/2) log[p̂_{m−1}(y)/{1 − p̂_{m−1}(y)}], for i = 1, …, n;
  - refit the working model with the updated weights to obtain p̂_m(y), the updated estimated probability P(S = 1 | Y = y).
- After the last iteration, the estimated treatment rule is I{Σ_{m=0}^{M} f_m(y) ≥ 0}, where f_m(y) = (1/2) log[p̂_m(y)/{1 − p̂_m(y)}].
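For concreteness, the iteration above can be sketched in code. The following is a minimal Python sketch of my own (not the authors' implementation), using scikit-learn decision stumps as the working model and assuming the labels Si are observed:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def real_adaboost(X, s, n_rounds=10, eps=1e-6):
    """Real AdaBoosting sketch: each round refits a weighted class-probability
    estimate and contributes half its log-odds to an additive score."""
    n = len(s)
    w = np.full(n, 1.0 / n)                    # initial weights w_i = 1/n
    stumps = []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, s, sample_weight=w)       # weighted working model
        p = np.clip(stump.predict_proba(X)[:, 1], eps, 1 - eps)
        f = 0.5 * np.log(p / (1 - p))          # half log-odds contribution
        w *= np.exp(-(2 * s - 1) * f)          # upweight misclassified points
        w /= w.sum()
        stumps.append(stump)
    return stumps

def boosted_rule(stumps, X, eps=1e-6):
    """Estimated rule: recommend class 1 when the additive score is >= 0."""
    score = np.zeros(len(X))
    for stump in stumps:
        p = np.clip(stump.predict_proba(X)[:, 1], eps, 1 - eps)
        score += 0.5 * np.log(p / (1 - p))
    return (score >= 0).astype(int)
```

The point relevant to the discussion below is the weight-update line, which involves the observed label s, in contrast to the proposal's boundary-proximity weights.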
Although the assumption that Si are all observed is artificial, it reveals an important message in comparing the proposed boosting algorithm with the ideal counterpart above. Firstly, while the updated weight in the ideal Real AdaBoosting depends on the fitted probability, the old weight, and the true outcomes Si, the weight in the proposed algorithm depends only on the fitted probability. Intuitively, the Real AdaBoosting always tries to upweight observations misclassified in the previous step to ensure that the new classification rule likely adds value to the existing ones, especially on the subgroup of observations for which the performance of the old classifiers is unsatisfactory. Since Si is unobserved, the new algorithm instead upweights observations close to the estimated decision boundary, whose classification (optimal treatment recommendation) is relatively ambiguous. There is thus a genuine difference in the reweighting scheme: following the same logic as the new algorithm, the Real AdaBoosting would upweight observations with an estimated class probability close to 0.5, which may or may not be the same observations misclassified in the previous step. One certainly can imagine that if the initial working model is a poor approximation to the truth, then some misclassified observations could mistakenly have an estimated class probability close to 0 or 1 and receive less weight. Furthermore, even when the working model yields reasonable class probability estimates, misclassification occurring near the decision boundary affects the value function θ the least. Therefore, one may wonder whether the new proposal sometimes fails to upweight the "right" observations. Secondly, while the Real AdaBoosting can be viewed as an algorithm to minimize a regularized loss function, the proposed algorithm cannot easily be fit into the same framework.
One consequence is that it is not clear whether the algorithm provides a consistent estimate of the optimal treatment rule I{Δ(Y) ≤ 0}. To gain some insight into the consistency issue, let us consider the convergence limit of the proposed boosting procedure. The final estimator should be close to the difference between the two limits after a sufficient number of iterations. When the simple logistic working model is used, the limit is determined by βt, which solves the equation
as n → ∞, which implies that
where this limit can be viewed as a reasonable approximation to Δ(Y) in part of its range. However, it seems that, in general, it is different from Δ(Y) unless the simple logistic working model is correctly specified. The extra weight function does not solve this problem. This is in contrast to the appealing "consistency" property of the original AdaBoosting procedure, which does not require the "weak" learners to be "strong" or "correct."
2. Alternative Boosting Procedure
From the discussion in the previous section, it seems that the new proposal may sacrifice some important components of the AdaBoosting procedure, owing to the simple fact that Si, i = 1, …, n, are not observable in practice. This may raise concerns about its performance in settings beyond those investigated in the comprehensive simulation study. On the other hand, noticing the fact that
one may treat this binary variable as a surrogate for the unobserved Si = I{−Δ(Yi) ≥ 0} (Signorovitch, 2007) and simply replace Si by the surrogate in the Real AdaBoosting procedure described above. To justify this approach, one notes that AdaBoosting can be viewed as a numerical algorithm to minimize the empirical version of the loss function
in terms of d(·), a function of y. In this case, the minimizer of J(d) is
It suggests that I{d1(Y) ≥ 0}, the treatment assignment rule based on d1(·), is equivalent to our target I{Δ(Y) ≤ 0}. Therefore, as a numerical algorithm to estimate I{d1(Y) ≥ 0}, the output of AdaBoosting with the surrogate variables, 1 ≤ i ≤ n, as responses can be used to assign patients according to their individual treatment effects.
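Mechanically, the substitution is straightforward. A brief sketch (assuming the surrogate labels have already been computed as above, and using scikit-learn's AdaBoostClassifier as a stand-in for the Real AdaBoosting; the function names are mine):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def fit_surrogate_rule(X, s_tilde, n_rounds=100):
    """Fit AdaBoosting with the surrogate labels s_tilde standing in for
    the unobserved Si = I{Delta(Yi) <= 0}; X holds the biomarkers."""
    clf = AdaBoostClassifier(n_estimators=n_rounds)
    return clf.fit(X, s_tilde)

def recommend_treatment(clf, X):
    """Recommend treatment exactly when the boosted classifier predicts 1."""
    return clf.predict(X).astype(int)
```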
The second alternative is to estimate the optimal treatment rule by maximizing the mean outcome. As the authors advocated, the mean outcome is a clinically relevant and interpretable quantity to measure the value of a given treatment assignment rule. In the current setting, maximizing the mean outcome is equivalent to minimizing
the misclassification error for predicting Ti among the subgroup of patients having outcome Di = 1, with respect to the binary classification rule g : Y → {0, 1}. This again provides a very natural platform for applying the AdaBoosting algorithm. Specifically, one may use the AdaBoosting procedure to construct a classification rule that classifies a patient to his/her actual randomized treatment assignment based on Y, within the subgroup of patients with D = 1. To justify the validity of this boosting procedure, one only needs to note that the minimizer of the corresponding loss function J(d) = E[exp{−(2T − 1)d(Y)} | D = 1] is d2(y) = (1/2) log[P(T = 1 | Y = y, D = 1)/P(T = 0 | Y = y, D = 1)].
Thus, the d2(·)-based treatment assignment rule satisfies I{d2(Y) ≥ 0} = I{Δ(Y) ≤ 0}, and the corresponding boosting procedure provides a flexible algorithm to approximate the decision boundary of the optimal treatment rule.
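In code, the value-function alternative amounts to fitting AdaBoosting to predict the randomized assignment T within the D = 1 subgroup. A sketch under the same assumptions as before (scikit-learn's AdaBoostClassifier as a stand-in, function names of my own choosing):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def fit_value_function_rule(X, t, d, n_rounds=100):
    """Train only on patients with outcome D = 1, classifying the actual
    randomized treatment T from the biomarkers in X."""
    mask = d == 1
    clf = AdaBoostClassifier(n_estimators=n_rounds)
    return clf.fit(X[mask], t[mask])

def assign_treatment(clf, X):
    """Assign each patient the arm that responders (D = 1) with similar
    biomarker profiles actually received."""
    return clf.predict(X).astype(int)
```

Note that prediction is applied to all patients, even though training uses only the D = 1 subgroup.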
I have selected the two most challenging scenarios (4 and 6) from the simulation study performed by Kang, Janes, and Huang to examine the relative performance of the above two proposals. For both cases, the "ada" function in R with its default parameters is used to perform the boosting algorithm. The sample size of the training set is 500, and the final results are based on 1000 Monte Carlo replications (Table 1). The empirical performance of these two simple alternatives appears comparable to that of the method proposed by Kang, Janes, and Huang with a classification tree as the base learner.
Table 1.
The results of the simulation study with a sample size of n = 500 for scenarios 4 and 6 from Table 2 of the article by Kang, Janes, and Huang. The results of the two new methods, the surrogate-based AdaBoosting ("surrogate") and the value-function approach ("value-function"), are summarized in the last two columns.
| Scenario | True θ | Measure | MLE | Boosting (logistic) | IPWE | AIPWE | Single tree | Boosting (tree) | Separate AdaBoosting | Surrogate | Value-function |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 0.0657 | Mean θ̂ | 0.0574 | 0.0607 | 0.0561 | 0.0567 | 0.0221 | 0.0378 | 0.0352 | 0.0396 | 0.0351 |
| 4 | 0.0657 | Mean MCR | 0.1511 | 0.1206 | 0.1719 | 0.1653 | 0.5397 | 0.3251 | 0.5304 | 0.3270 | 0.3374 |
| 6 | 0.1393 | Mean θ̂ | 0.0236 | 0.0438 | 0.0498 | 0.0544 | 0.0978 | 0.1186 | 0.1010 | 0.1006 | 0.1011 |
| 6 | 0.1393 | Mean MCR | 0.3865 | 0.3542 | 0.3452 | 0.3330 | 0.2697 | 0.1762 | 0.2433 | 0.2290 | 0.2227 |
3. Remarks
We statisticians can learn a great deal from the rapid development of machine learning techniques, which oftentimes offer robust performance for a broad range of problems. The authors convincingly demonstrated the power of a version of the generalized boosting method for estimating the optimal strategy for assigning treatment. On the other hand, the proposed boosting method is different from its conventional counterpart in several key aspects and new explanations are needed to account for its good performance. Furthermore, there are alternative boosting procedures, which in my opinion can circumvent the difficulties caused by unobservable class labels Si = I{Δ(Yi) ≤ 0}, i = 1, …, n, in more natural ways. Further research in these directions is certainly warranted.
References
- Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55, 119–139.
- Friedman, J., Hastie, T., and Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting (with discussion and a rejoinder by the authors). Annals of Statistics, 28, 337–407.
- Signorovitch, J. E. (2007). Identifying informative biological markers in high-dimensional genomic data and clinical trials. Ph.D. thesis, Harvard University.
