
Table 1.

Pseudo-algorithm: subject-weighted ensemble trees

  • Input: training data $\mathcal{L}_n = \{X_i, w_i, Y_i\}_{i=1}^n$ for regression and $\{X_i, w_i, A_i\}_{i=1}^n$ for classification

  • Output: an ensemble tree model

    1. Draw a bootstrap sample $\mathcal{L}_b$ from the training data, with replacement and equal probability for each subject. Denote the out-of-bag data by $\mathcal{L}_o$ (a resampling sketch follows the table).

    2. At an internal node $T$, stop if the sample size is sufficiently small. Otherwise, randomly generate candidate splitting variables and cutting points. For each candidate split, denote $T_L$ and $T_R$ as the two daughter nodes resulting from the split. Calculate the score:
      $$\mathrm{score}_{\mathrm{reg}}(T, T_L, T_R) = V_T^w - \frac{w_{T_L}}{w_T} V_{T_L}^w - \frac{w_{T_R}}{w_T} V_{T_R}^w \quad \text{for regression},$$
      $$\mathrm{score}_{\mathrm{cla}}(T, T_L, T_R) = G_T^w - \frac{w_{T_L}}{w_T} G_{T_L}^w - \frac{w_{T_R}}{w_T} G_{T_R}^w \quad \text{for classification}.$$

      In the above definitions, $w_T$, $w_{T_L}$, and $w_{T_R}$ are the sums of subject weights within the corresponding nodes. $V_T^w = \frac{1}{w_T}\sum_{X_i \in T} w_i (Y_i - \bar{Y}_T^w)^2$ is the weighted variance, with $\bar{Y}_T^w = \frac{1}{w_T}\sum_{X_i \in T} w_i Y_i$. $G_T^w = \sum_k \phi_{Tk}^w (1 - \phi_{Tk}^w)$ is the weighted Gini impurity, with $\phi_{Tk}^w = \frac{1}{w_T}\sum_{X_i \in T} w_i \mathbf{1}\{A_i = k\}$ for class $k = 1, \dots, K$. The other quantities are defined accordingly (a score-computation sketch follows the table).

    3. Select the candidate split with the highest score, and apply steps 2 and 3 to each of the resulting daughter nodes.

    4. Repeat steps 1–3 until the desired number of trees is fitted (the overall control flow is sketched below).
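
Step 1 is ordinary nonparametric bootstrap resampling of subjects; the weights $w_i$ enter only through the split score, not the sampling probabilities. A minimal Python sketch, assuming numpy (the function name `bootstrap_split` and the index-array representation are our illustration, not notation from the paper):

```python
import numpy as np

def bootstrap_split(n, rng):
    """Draw a bootstrap sample of n subjects with replacement and equal
    probability; return in-bag (L_b) and out-of-bag (L_o) index arrays."""
    in_bag = rng.integers(0, n, size=n)        # subjects may repeat
    oob = np.setdiff1d(np.arange(n), in_bag)   # subjects never drawn
    return in_bag, oob

rng = np.random.default_rng(0)
in_bag, oob = bootstrap_split(100, rng)
```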
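The split scores in step 2 compare a node's weighted impurity with the weight-averaged impurities of its daughters. Below is a sketch of $\mathrm{score}_{\mathrm{reg}}$ and $\mathrm{score}_{\mathrm{cla}}$ for one candidate split, again assuming numpy; the function names and the boolean-mask encoding of $T_L$ are hypothetical, and degenerate splits with an empty daughter are assumed to be screened out beforehand.

```python
import numpy as np

def weighted_variance(y, w):
    """V_T^w: weighted variance of the responses in a node."""
    w_t = w.sum()
    y_bar = np.sum(w * y) / w_t                 # weighted mean Y-bar_T^w
    return np.sum(w * (y - y_bar) ** 2) / w_t

def weighted_gini(a, w, classes):
    """G_T^w: weighted Gini impurity of the class labels in a node."""
    w_t = w.sum()
    phi = np.array([np.sum(w * (a == k)) for k in classes]) / w_t
    return np.sum(phi * (1.0 - phi))

def score_reg(y, w, left):
    """score_reg(T, T_L, T_R); `left` is a boolean mask defining T_L."""
    w_t = w.sum()
    return (weighted_variance(y, w)
            - w[left].sum() / w_t * weighted_variance(y[left], w[left])
            - w[~left].sum() / w_t * weighted_variance(y[~left], w[~left]))

def score_cla(a, w, left, classes):
    """score_cla(T, T_L, T_R); `left` is a boolean mask defining T_L."""
    w_t = w.sum()
    return (weighted_gini(a, w, classes)
            - w[left].sum() / w_t * weighted_gini(a[left], w[left], classes)
            - w[~left].sum() / w_t * weighted_gini(a[~left], w[~left], classes))

# Example: score the candidate split "X_j <= 0.5" at a node.
X = np.array([[0.2], [0.8], [0.5], [0.9]])
y = np.array([1.0, 3.0, 1.5, 3.5])
w = np.array([1.0, 0.5, 1.0, 2.0])
print(score_reg(y, w, left=X[:, 0] <= 0.5))
```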
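Steps 3 and 4 are the usual greedy recursion within a tree and the outer ensemble loop. A regression-only sketch of that control flow, reusing `bootstrap_split` and `score_reg` from the sketches above; `min_node`, `n_try`, and the scheme of drawing random variable/cutpoint pairs are our assumptions, since the table does not fix these details:

```python
import numpy as np

def grow_tree(X, y, w, rng, min_node=5, n_try=10):
    """Steps 2-3: recursively grow one subject-weighted regression tree.
    Leaves store the weighted mean response."""
    n, p = X.shape
    if n < min_node:                        # stop: node is small enough
        return {"pred": np.sum(w * y) / w.sum()}
    best = None
    for _ in range(n_try):                  # random candidate splits
        j = rng.integers(p)                 # splitting variable
        c = rng.choice(X[:, j])             # cutting point
        left = X[:, j] <= c
        if left.all() or not left.any():    # skip degenerate splits
            continue
        s = score_reg(y, w, left)
        if best is None or s > best[0]:
            best = (s, left, j, c)
    if best is None:                        # no valid split found
        return {"pred": np.sum(w * y) / w.sum()}
    _, left, j, c = best
    return {"var": j, "cut": c,
            "left": grow_tree(X[left], y[left], w[left], rng, min_node, n_try),
            "right": grow_tree(X[~left], y[~left], w[~left], rng, min_node, n_try)}

def grow_forest(X, y, w, n_trees=100, seed=0):
    """Step 4: repeat steps 1-3 until n_trees trees are fitted."""
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(n_trees):
        in_bag, _ = bootstrap_split(len(y), rng)   # L_o unused in this sketch
        forest.append(grow_tree(X[in_bag], y[in_bag], w[in_bag], rng))
    return forest
```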