
Table 1.

Pseudo-algorithm: subject-weighted ensemble trees

  • Input: training data $\mathcal{L}_n = \{X_i, w_i, Y_i\}_{i=1}^n$ for regression and $\{X_i, w_i, A_i\}_{i=1}^n$ for classification

  • Output: an ensemble tree model

    1. Draw a bootstrap sample $\mathcal{L}_b$ from the training data, with replacement and equal probability for each subject. Denote the out-of-bag data by $\mathcal{L}_o$ (a resampling sketch follows the table).

    2. At an internal node $T$, stop if the sample size is sufficiently small. Otherwise, randomly generate candidate splitting variables and cutting points. For each candidate split, denote $T_L$ and $T_R$ as the two daughter nodes resulting from the split. Calculate the score:
      $$\mathrm{score}_{\mathrm{reg}}(T, T_L, T_R) = V_T^w - \frac{w_{T_L}}{w_T} V_{T_L}^w - \frac{w_{T_R}}{w_T} V_{T_R}^w \quad \text{for regression},$$
      $$\mathrm{score}_{\mathrm{cla}}(T, T_L, T_R) = G_T^w - \frac{w_{T_L}}{w_T} G_{T_L}^w - \frac{w_{T_R}}{w_T} G_{T_R}^w \quad \text{for classification}.$$

      In the above definitions, $w_T$, $w_{T_L}$, and $w_{T_R}$ are the sums of subject weights within the corresponding nodes. $V_T^w = \frac{1}{w_T}\sum_{X_i \in T} w_i (Y_i - \bar{Y}_T^w)^2$ is the weighted variance, with $\bar{Y}_T^w = \frac{1}{w_T}\sum_{X_i \in T} w_i Y_i$. $G_T^w = \sum_k \phi_{Tk}^w (1 - \phi_{Tk}^w)$ is the weighted Gini impurity, with $\phi_{Tk}^w = \frac{1}{w_T}\sum_{X_i \in T} w_i \mathbf{1}\{A_i = k\}$ for class $k = 1, \dots, K$. The other quantities are defined accordingly (a score-computation sketch follows the table).

    3. Select the candidate split with the highest score, and apply steps 2 and 3 to each of the resulting daughter nodes.

    4. Repeat steps 1–3 until the desired number of trees is fitted (the overall control flow is sketched below).
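
Step 1 is ordinary nonparametric bootstrap resampling of subjects; the weights $w_i$ enter only through the split score, not the sampling probabilities. A minimal Python sketch, assuming numpy (the function name `bootstrap_split` and the index-array representation are our illustration, not notation from the paper):

```python
import numpy as np

def bootstrap_split(n, rng):
    """Draw a bootstrap sample of n subjects with replacement and equal
    probability; return in-bag (L_b) and out-of-bag (L_o) index arrays."""
    in_bag = rng.integers(0, n, size=n)        # subjects may repeat
    oob = np.setdiff1d(np.arange(n), in_bag)   # subjects never drawn
    return in_bag, oob

rng = np.random.default_rng(0)
in_bag, oob = bootstrap_split(100, rng)
```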
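The split scores in step 2 compare a node's weighted impurity with the weight-averaged impurities of its daughters. Below is a sketch of $\mathrm{score}_{\mathrm{reg}}$ and $\mathrm{score}_{\mathrm{cla}}$ for one candidate split, again assuming numpy; the function names and the boolean-mask encoding of $T_L$ are hypothetical, and degenerate splits with an empty daughter are assumed to be screened out beforehand.

```python
import numpy as np

def weighted_variance(y, w):
    """V_T^w: weighted variance of the responses in a node."""
    w_t = w.sum()
    y_bar = np.sum(w * y) / w_t                 # weighted mean Y-bar_T^w
    return np.sum(w * (y - y_bar) ** 2) / w_t

def weighted_gini(a, w, classes):
    """G_T^w: weighted Gini impurity of the class labels in a node."""
    w_t = w.sum()
    phi = np.array([np.sum(w * (a == k)) for k in classes]) / w_t
    return np.sum(phi * (1.0 - phi))

def score_reg(y, w, left):
    """score_reg(T, T_L, T_R); `left` is a boolean mask defining T_L."""
    w_t = w.sum()
    return (weighted_variance(y, w)
            - w[left].sum() / w_t * weighted_variance(y[left], w[left])
            - w[~left].sum() / w_t * weighted_variance(y[~left], w[~left]))

def score_cla(a, w, left, classes):
    """score_cla(T, T_L, T_R); `left` is a boolean mask defining T_L."""
    w_t = w.sum()
    return (weighted_gini(a, w, classes)
            - w[left].sum() / w_t * weighted_gini(a[left], w[left], classes)
            - w[~left].sum() / w_t * weighted_gini(a[~left], w[~left], classes))

# Example: score the candidate split "X_j <= 0.5" at a node.
X = np.array([[0.2], [0.8], [0.5], [0.9]])
y = np.array([1.0, 3.0, 1.5, 3.5])
w = np.array([1.0, 0.5, 1.0, 2.0])
print(score_reg(y, w, left=X[:, 0] <= 0.5))
```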
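Steps 3 and 4 are the usual greedy recursion within a tree and the outer ensemble loop. A regression-only sketch of that control flow, reusing `bootstrap_split` and `score_reg` from the sketches above; `min_node`, `n_try`, and the scheme of drawing random variable/cutpoint pairs are our assumptions, since the table does not fix these details:

```python
import numpy as np

def grow_tree(X, y, w, rng, min_node=5, n_try=10):
    """Steps 2-3: recursively grow one subject-weighted regression tree.
    Leaves store the weighted mean response."""
    n, p = X.shape
    if n < min_node:                        # stop: node is small enough
        return {"pred": np.sum(w * y) / w.sum()}
    best = None
    for _ in range(n_try):                  # random candidate splits
        j = rng.integers(p)                 # splitting variable
        c = rng.choice(X[:, j])             # cutting point
        left = X[:, j] <= c
        if left.all() or not left.any():    # skip degenerate splits
            continue
        s = score_reg(y, w, left)
        if best is None or s > best[0]:
            best = (s, left, j, c)
    if best is None:                        # no valid split found
        return {"pred": np.sum(w * y) / w.sum()}
    _, left, j, c = best
    return {"var": j, "cut": c,
            "left": grow_tree(X[left], y[left], w[left], rng, min_node, n_try),
            "right": grow_tree(X[~left], y[~left], w[~left], rng, min_node, n_try)}

def grow_forest(X, y, w, n_trees=100, seed=0):
    """Step 4: repeat steps 1-3 until n_trees trees are fitted."""
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(n_trees):
        in_bag, _ = bootstrap_split(len(y), rng)   # L_o unused in this sketch
        forest.append(grow_tree(X[in_bag], y[in_bag], w[in_bag], rng))
    return forest
```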