Author manuscript; available in PMC: 2017 Apr 4.
Published in final edited form as: Stat (Int Stat Inst). 2016 Apr 4;5(1):119–131. doi: 10.1002/sta4.110

Multinomial probit Bayesian additive regression trees

Bereket P Kindo a,*, Hao Wang b, Edsel A Peña a
PMCID: PMC4909838  NIHMSID: NIHMS779475  PMID: 27330743

Abstract

This article proposes multinomial probit Bayesian additive regression trees (MPBART), a multinomial probit extension of Bayesian additive regression trees (BART). MPBART is flexible enough to allow the inclusion of predictors that describe the observed units as well as the available choice alternatives. Through two simulation studies and four real data examples, we show that MPBART exhibits very good predictive performance in comparison to other discrete choice and multiclass classification methods. An implementation of MPBART is freely available in the R package mpbart on CRAN.

Keywords: Bayesian methods, Classification, Statistical computing, Machine learning

1. Introduction

The multinomial probit (MNP) model for discrete choice is widely used in economics, market research, political science, and transportation. It models the choices made by agents given their demographic characteristics and/or the features of the available choice alternatives. In this article, we focus on cases with at least three choices. Examples include the study of consumer purchasing behavior (McCulloch et al., 2000; Imai & van Dyk, 2005); voting behavior in multi-party elections (Quinn et al., 1999); and choice among different modes of transportation (Bolduc, 1999). The MNP model in which choices depend on predictors in a linear fashion is studied in McFadden et al. (1973); McFadden (1989); Keane (1992); McCulloch & Rossi (1994); Nobile (1998); McCulloch et al. (2000); Imai & van Dyk (2005); Train (2009); Burgette & Nordheim (2012), among others.

Among the widely used multinomial choice modeling procedures are the multinomial logit model (McFadden et al., 1973; Train, 2009) and the multinomial probit model (McFadden, 1989; McCulloch & Rossi, 1994; Imai & van Dyk, 2005). The former relies on the assumption that a choice outcome is unaffected by the removal (or introduction) of an irrelevant choice alternative, while the latter, including MPBART, does not make this restrictive assumption. In the multinomial probit regression framework, each decision maker faced with K ≥ 3 alternatives is assumed to use a (K – 1)-vector of latent variables to arrive at their choice. Alternative k, for k = 1, . . . , (K – 1), is chosen if the kth entry of the latent vector is positive and is the largest entry. If none of the entries of the latent vector are positive, the “reference” alternative K is chosen.

MPBART can also be used as a multiclass classification procedure to classify units into one of K ≥ 3 classes based on their observed characteristics. Multiclass classification is common in many disciplines. In biology, tumors are classified into tumor sub-types based on their gene expression profiles (Khan et al., 2001). In environmental sciences, clouds are classified as clear, liquid clouds, or ice clouds based on their radiance profiles (Lee et al., 2004). Other areas of multiclass classification application include text recognition, spectral imaging, chemistry, and forensic science (Li et al., 2004; Fauvel et al., 2006; Evett & Spiehler, 1987; Vergara et al., 2012).

The effect of predictors on the response may be linear or nonlinear, of much or little significance, and at times magnified by interactions. When such complicated relationships exist, models that use ensembles of trees often provide an appealing framework, since variable selection and the inclusion of interactions are intrinsic to the construction of trees. Popular “tree-based” classification methods include CART (Breiman et al., 1984; Quinlan, 1986), Bayesian CART (Chipman et al., 1998), random forests (Breiman, 2001), and gradient boosting (Friedman, 2001). There is a gap in the literature for “tree-based” statistical procedures that directly address the MNP model, in which choice specific predictors can readily be incorporated. This article, thus, seeks to fill that void using Bayesian tree ensembles for multinomial probit regression.

A newcomer to the “tree-based” family is the Bayesian additive regression trees (BART) (Chipman et al., 2010). The innovative idea of BART is to approximate an unknown function f(x) for predicting a continuous variable z given values of input x using a sum-of-trees model

f(x) \approx \sum_{j=1}^{n_T} g(x, T_j, M_j),

where g(x, Tj, Mj) is the jth tree, consisting of a set of partition rules Tj and parameters Mj associated with its terminal nodes. Conceptually, the sum-of-trees structure makes BART adaptive to complicated nonlinear and interaction effects, and the use of a Bayesian regularization prior on the regression trees minimizes the risk of over-fitting. Empirically, a variety of experiments and applications of BART have confirmed that it has robust and accurate out-of-sample prediction performance (Liu & Zhou, 2007; Chipman et al., 2010; Abu-Nimeh et al., 2008; Bonato et al., 2011). Standard BART further extends to binary classification problems and shows competitive classification performance (Zhang & Härdle, 2010; Chipman et al., 2010).

The success of BART in predicting continuous and binary variables naturally raises the question of whether the sum-of-trees structure also helps in predicting multinomial choices and classes; we are therefore interested in the utility of the sum-of-trees for discrete choice modeling. We combine the Bayesian probit model formulation of Albert & Chib (1993); McCulloch & Rossi (1994); McCulloch et al. (2000); Imai & van Dyk (2005) with the idea of sum-of-trees regression to propose multinomial probit Bayesian additive regression trees (MPBART). Through a comprehensive simulation study with various data generating schemes, we find that its predictive performance is a serious contender to existing multinomial choice models and multiclass classification methods, and that it usually ranks among the top performers when a nonlinear relationship exists between the predictors and choice alternatives.

A work related to this article is Agarwal et al. (2013), which utilizes BART for satellite image classification. Their multiclass classification procedure combines binary BART with the one-versus-all technique of transforming a multiclass problem into a series of binary classification problems. Our work differs from theirs in that we consider the problem within the traditional multinomial probit regression framework rather than the one-versus-all framework.

The article proceeds as follows. Section 2 formally outlines the multinomial probit model in general and MPBART in particular, along with the associated data structure. Section 3 delves into the prior specifications and posterior computation for MPBART. Sections 4 and 5 use simulated data sets and real data examples, respectively, to illustrate the predictive performance of MPBART. Section 6 closes the article with concluding remarks.

2. MPBART: Multinomial probit Bayesian additive regression trees

Suppose we have a data set (yi, Xi) for i = 1, . . . , n, where yi ∈ {1, . . . , K} denotes the choice made among the K available alternatives and Xi the predictors for the ith observation. We are interested in estimating the conditional choice probability p(yi = k | Xi) for k = 1, . . . , K. The observed choice yi can be viewed as arising from a vector of latent variables $z_i \in \mathbb{R}^{K-1}$ as in Albert & Chib (1993); Geweke et al. (1994); Imai & van Dyk (2005) via

y_i(z_i) = \begin{cases} k & \text{if } \max(z_i) = z_{ik} > 0, \\ K & \text{if } \max(z_i) < 0, \end{cases} \qquad (1)

for k = 1, . . . , (K – 1), where max(zi) denotes the largest element of $z_i = (z_{i1}, \ldots, z_{i,K-1})'$. The latent vector zi depends on Xi as follows:

z_i = G(X_i; T, M) + \epsilon_i \quad \text{for } i = 1, \ldots, n, \qquad (2)

where $G(X_i; T, M) = (G_1(X_i; T, M), \ldots, G_{K-1}(X_i; T, M))'$ is a vector of K – 1 regression functions and $\epsilon_i = (\epsilon_{i1}, \ldots, \epsilon_{i,K-1})' \sim N(0, \Sigma)$.
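As a concrete illustration of the choice rule in (1), the mapping from a latent utility vector to an observed choice can be written as a short R function. This is only a sketch of the rule stated above, not code from the mpbart package.

```r
## Sketch of the choice rule in (1): given a (K-1)-vector of latent
## utilities z_i, return the chosen alternative in {1, ..., K}.
choice_from_latent <- function(z) {
  K <- length(z) + 1L
  if (max(z) > 0) which.max(z) else K  # reference alternative K if no entry is positive
}

## Example with K = 4 alternatives:
choice_from_latent(c(0.3, -1.2, 0.8))    # 3 (largest entry is positive)
choice_from_latent(c(-0.5, -1.2, -0.8))  # 4 (all entries negative, reference chosen)
```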

The predictors for the ith observation are comprised of two components vi and Wi (i.e., Xi = (vi, Wi)). The first component is a vector $v_i \in \mathbb{R}^q$ of q demographic variables that describe the subject. The second component Wi = (wi1, . . . , wi(K–1)), where $w_{ik} \in \mathbb{R}^r$, is a matrix of r predictors that vary across the choice alternatives relative to the reference choice. For example, in a market research scenario, the price of the choices faced by individuals in a study is a choice specific predictor that varies across alternatives, and the difference between the prices of the kth choice and the reference choice K will be part of wik, for k = 1, . . . , (K – 1).
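To make the layout of Xi = (vi, Wi) concrete, the following R sketch builds the predictors for a single observation using hypothetical prices as the lone choice specific covariate; the variable names are illustrative only.

```r
## Sketch of the predictor layout X_i = (v_i, W_i) for one observation,
## with hypothetical prices as the choice-specific covariate (r = 1 here).
K     <- 4
price <- c(10, 12, 9, 11)              # price of each alternative; alternative K is the reference
v_i   <- c(income = 3.5)               # q = 1 demographic predictor describing the subject
w_i   <- price[1:(K - 1)] - price[K]   # w_ik: price of alternative k minus price of reference K

## Predictors entering the kth sum of trees: x_ik = (v_i, w_ik), e.g. for k = 1:
x_i1 <- c(v_i, price_diff = w_i[1])
```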

The tree splitting rules of the kth sum of trees

G_k(X_i; T, M) = \sum_{j=1}^{n_T} g(X_i, T_{kj}, M_{kj}), \quad \text{for } k = 1, \ldots, (K-1), \qquad (3)

depend on Xi through xik = (vi, wik). The jth tree of the kth sum of trees, g(·, Tkj, Mkj), consists of Tkj, a set of partition rules based on the predictor space, and Mkj = {μkjl, l = 1, . . . , bkj}, a set of parameters associated with the terminal nodes. The partition rules Tkj are recursive binary splits of the form {x < s} versus {x ≥ s}, where x is one of the predictors that make up xik, and s is a value in the range of x. The complete set of parameters of the MPBART model (1)–(3) is thus

\{(T_{kj}, M_{kj}),\ k = 1, \ldots, (K-1),\ j = 1, \ldots, n_T,\ \Sigma\},

where Mkj denotes the collection of terminal node parameters of the jth tree in the kth sum-of-trees.

3. Prior specifications and posterior computation

3.1. Prior specifications

3.1.1. The 𝚺 prior

The MNP model specification in (2) exhibits a well documented identifiability issue; for example, multiplying both sides of (2) by a positive constant does not alter the implied choice outcome (Keane, 1992; McCulloch & Rossi, 1994; McCulloch et al., 2000; Nobile, 1998). To circumvent this issue, McCulloch & Rossi (1994); McCulloch et al. (2000); Imai & van Dyk (2005), among others, restrict the first diagonal element of 𝚺 to equal one, while Burgette & Nordheim (2012) restrict the trace of 𝚺 to equal K. We implement the latter.

Consider an augmented latent model

\tilde{z}_i = G(X_i; T, \tilde{M}) + \tilde{\epsilon}_i, \qquad (4)

where $\tilde{z}_i = \alpha z_i$, $\tilde{\epsilon}_i = \alpha \epsilon_i$, $\tilde{\epsilon}_i \sim N(0, \tilde{\Sigma})$, $\tilde{\Sigma} = \alpha^2 \Sigma$ and $\tilde{M}_{kj} = \{\alpha \mu_{kjl},\ l = 1, \ldots, b_{kj}\}$. Following Imai & van Dyk (2005); Burgette & Nordheim (2012), we place the prior
$$p(\Sigma) = \int p(\Sigma, \alpha^2)\, d\alpha^2 \propto |\Sigma|^{-(\nu + K)/2} \left( \operatorname{tr}[S \Sigma^{-1}] \right)^{-\nu(K-1)/2},$$
with the restriction tr(𝚺) = K; this is the constrained inverse Wishart distribution induced by $\tilde{\Sigma} \sim$ Inv-Wish$(\nu, \alpha_0^2 S)$ and $\alpha^2 \mid \Sigma \sim \alpha_0^2 \operatorname{tr}[S\Sigma^{-1}] / \chi^2_{\nu(K-1)}$.
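A minimal R sketch of how a draw from this trace-constrained prior can be generated via the unidentified scale: draw Σ̃ from the inverse Wishart and rescale so that the trace equals K. The hyperparameter values below (ν, α₀², S) are placeholders, and the inverse Wishart draw uses the standard relationship with the Wishart of the inverted scale; this is illustrative only.

```r
## Sketch: one draw from the trace-constrained prior on Sigma.
K        <- 3
nu       <- K + 1          # example degrees of freedom (assumption)
alpha0sq <- 1              # example working-prior scale (assumption)
S        <- diag(K - 1)    # example prior scale matrix (assumption)

## Inverse-Wishart(nu, alpha0sq * S) draw: invert a Wishart(nu, (alpha0sq*S)^{-1}) draw
W           <- rWishart(1, df = nu, Sigma = solve(alpha0sq * S))[, , 1]
Sigma_tilde <- solve(W)

## Decompose into the identified (trace-K) covariance and the working parameter
alpha_sq <- sum(diag(Sigma_tilde)) / K
Sigma    <- Sigma_tilde / alpha_sq   # tr(Sigma) = K by construction
```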

3.1.2. The Tkj prior

As in Chipman et al. (1998) and Chipman et al. (2010), the prior on a single tree Tkj is specified through a “tree-generating stochastic process” that is a priori independent of 𝚺. The tree prior consists of (i) the probability of splitting a terminal node, (ii) the distribution of the splitting variable if the node splits, and (iii) the distribution of the splitting rule given the splitting variable. For step (i), the probability that a terminal node η splits is given by

\gamma (1 + d_\eta)^{-\beta}, \quad \gamma \in (0, 1), \quad \beta \in [0, \infty),

where dη is the depth of the node. A small γ and a large β result in trees with a small number of terminal nodes. In other words, the influence of individual trees in the sum can be controlled by carefully choosing γ and β. For step (ii), the splitting variable is selected uniformly from all possible predictors, representing a prior belief of an equal level of importance placed on each predictor. For step (iii), given a splitting predictor, the splitting value s is taken to be a random draw from the discrete uniform distribution over the set of observed values of the selected predictor, provided that the chosen value does not result in an empty partition.
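The sketch below illustrates one pass of this tree-generating process for a single node in R, using the default values γ = 0.95 and β = 2 of Chipman et al. (2010); the predictor matrix is simulated for illustration, and the empty-partition check is omitted.

```r
## Sketch of the tree prior components (i)-(iii) for a single node.
split_prob <- function(depth, gamma = 0.95, beta = 2) gamma * (1 + depth)^(-beta)

set.seed(1)
x <- matrix(runif(50 * 3), ncol = 3)     # n x p matrix of observed predictor values

d      <- 1                               # depth of the node under consideration
splits <- runif(1) < split_prob(d)        # (i) does the node split?

if (splits) {
  j <- sample(ncol(x), 1)                 # (ii) splitting variable, uniform over predictors
  s <- sample(x[, j], 1)                  # (iii) cutpoint, uniform over observed values
  ## (the prior also rules out cutpoints that create an empty partition; not checked here)
  rule <- c(variable = j, cutpoint = s)
}
```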

3.1.3. The μkjl|Tkj prior

Given a tree Tkj with bkj terminal nodes, the prior distribution on the terminal node parameters is taken to be

\mu_{kjl} \mid T_{kj} \overset{iid}{\sim} N(\mu_k, \tau_k^2) \quad \text{for } k = 1, \ldots, (K-1).

For binary classification problems (i.e., K = 2), Chipman et al. (2010) propose choosing μ1 = 0 and $\tau_1 = 3/(r\sqrt{n_T})$ so that the sum-of-trees effect $\sum_{j=1}^{n_T} g(x, T_{1j}, M_{1j})$ assigns high probability to the interval (−3, 3). We extend their method to the multinomial probit setting by assuming μk = 0 and $\tau_k = 3/(r\sqrt{n_T})$ for all k. The hyper-parameters r and nT adjust the level of shrinkage on the contribution of each individual tree. The default values r = 2 and nT = 200 recommended by Chipman et al. (2010) are also reasonable in the multinomial probit setup.

3.2. Posterior computation

Our posterior sampling scheme relies on the partial marginal data augmentation strategy of van Dyk (2010). Marginal data augmentation (MDA) and partial marginal data augmentation (Meng & van Dyk, 1999; Imai & van Dyk, 2005; van Dyk, 2010; Burgette & Nordheim, 2012) introduce a “working parameter” that is identifiable given the augmented data but not identifiable given the observed data. By strategically augmenting the data, MDA and partial MDA result in a computationally tractable posterior distribution and an MCMC chain with improved convergence.

Our posterior computation is accomplished by cycling through the following three steps (for convenience, the intermediate draws are flagged with an asterisk).

  • (i) Sample from (z, α2) | T, M, 𝚺, y by obtaining random draws $z_i^* \sim p\{z_i \mid T, M, \Sigma, y\}$ for i = 1, . . . , n and $(\alpha^*)^2 \sim p\{\alpha^2 \mid \Sigma, M, T, (z_i^*)_{i=1,\ldots,n}\} = p\{\alpha^2 \mid \Sigma\}$, followed by the transformation $\tilde{z}_i = \alpha^* z_i^*$ for all i.

  • (ii) Sample $(T, \tilde{M}) \sim p\{T, \tilde{M} \mid (\tilde{z}_i)_{i=1,\ldots,n}, \Sigma, (\alpha^*)^2, y\}$, followed by recording $M = \tilde{M}/\alpha^*$.

  • (iii) Sample $(\Sigma, \alpha^2) \sim p\{\Sigma, \alpha^2 \mid T, \tilde{M}, (\tilde{z}_i)_{i=1,\ldots,n}, y\}$ by a random draw from $p\{\tilde{\Sigma} \mid T, \tilde{M}, (\tilde{z}_i)_{i=1,\ldots,n}, y\}$, followed by transforming $\tilde{\Sigma}$ to (𝚺, α2).

Our algorithm utilizes a “partial marginalization” strategy since the working parameter α2 is updated in steps (i) and (iii), but not in (ii) (cf. the marginalization strategy in Imai & van Dyk (2005) where the working parameter is updated in every step).

The first part of obtaining a sample in step (i) consists of iterative random draws of truncated normals from the conditional distribution zik | zi(–k), T, M, 𝚺 ~ N(mik, ψik), with max{0, max(zi(–k))} as a lower truncation point if yi = k and as an upper truncation point if yi ≠ k. The conditional mean mik and variance ψik are given by

m_{ik} = G_k(X_i; T, M) + \sigma_{k(-k)}' \Sigma_{(-k)(-k)}^{-1} \left[ z_{i(-k)} - G_{(-k)}(X_i; T, M) \right], \quad \text{and} \quad \psi_{ik} = \sigma_{kk} - \sigma_{k(-k)}' \Sigma_{(-k)(-k)}^{-1} \sigma_{k(-k)}, \qquad (5)

where σk(–k) is the kth column of 𝚺 with σkk excluded, and 𝚺(–k)(–k) is the matrix 𝚺 with the kth column and row excluded.
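A minimal R sketch of one such truncated normal draw, computing mik and ψik as in (5) and sampling via the inverse-CDF method; the function and argument names are illustrative, not part of the mpbart implementation.

```r
## Sketch: draw z_ik from N(m_ik, psi_ik) truncated below at b = max{0, max(z_i(-k))}
## when y_i = k, and truncated above at b otherwise. Gk and Gmk are the kth and the
## remaining components of G(X_i; T, M); z_mk holds the current values of z_i(-k).
draw_zik <- function(k, y_i, Gk, Gmk, z_mk, Sigma) {
  s_kmk  <- Sigma[k, -k, drop = FALSE]
  S_mk   <- Sigma[-k, -k, drop = FALSE]
  m_ik   <- Gk + s_kmk %*% solve(S_mk, z_mk - Gmk)          # conditional mean in (5)
  psi_ik <- Sigma[k, k] - s_kmk %*% solve(S_mk, t(s_kmk))   # conditional variance in (5)
  b  <- max(0, z_mk)
  lo <- if (y_i == k) pnorm(b, m_ik, sqrt(psi_ik)) else 0
  hi <- if (y_i == k) 1 else pnorm(b, m_ik, sqrt(psi_ik))
  drop(qnorm(runif(1, lo, hi), m_ik, sqrt(psi_ik)))          # inverse-CDF truncated draw
}
```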

For step (ii), we sample $(T_{kj}, \tilde{M}_{kj})$ for k = 1, . . . , (K – 1), j = 1, . . . , nT as follows. Given all the trees and their terminal node parameters except the jth tree in the kth sum of trees, together with $\tilde{\Sigma}$, $(\tilde{z}_i)_{i=1,\ldots,n}$ and $(\alpha^*)^2$, we observe that

\tilde{z}_{ik}^* = g(X_i, T_{kj}, \tilde{M}_{kj}) + \tilde{\epsilon}_{ik}, \quad \tilde{\epsilon}_{ik} \sim N(0, \tilde{\psi}_{ik}), \qquad (6)

where $\tilde{z}_{ik}^* = \tilde{z}_{ik} - \sum_{l \neq j}^{n_T} g(X_i, T_{kl}, \tilde{M}_{kl}) - \tilde{\sigma}_{k(-k)}' \tilde{\Sigma}_{(-k)(-k)}^{-1} [\tilde{z}_{i(-k)} - G_{(-k)}(X_i; T, \tilde{M})]$ and $\tilde{\psi}_{ik} = (\alpha^*)^2 \psi_{ik}$. We use the back-fitting algorithm, also used in Chipman et al. (2010), to obtain posterior samples of $(T_{kj}, \tilde{M}_{kj})$ by treating (6) as the single tree model of Chipman et al. (1998). Finally, the posterior sample in step (iii) is obtained by drawing $\tilde{\Sigma} \sim$ Inv-Wish$\big(\nu + n,\ \tilde{S} + \sum_{i=1}^n [\tilde{z}_i - G(X_i; T, \tilde{M})][\tilde{z}_i - G(X_i; T, \tilde{M})]'\big)$, then taking $\alpha^2 = \operatorname{tr}(\tilde{\Sigma})/K$ and transforming to obtain $\Sigma = \tilde{\Sigma}/\alpha^2$.
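The final step (iii) can be sketched in a few lines of R: draw Σ̃ from its inverse Wishart posterior and map back to the identified pair (Σ, α²). Here `resid` and `S_tilde` are placeholders for the residual matrix and the prior scale described above.

```r
## Sketch of step (iii). `resid` is the n x (K-1) matrix of residuals
## ztilde_i - G(X_i; T, Mtilde); `S_tilde` is the prior scale matrix.
draw_step_iii <- function(resid, S_tilde, nu, K) {
  scale_post  <- S_tilde + crossprod(resid)
  W           <- rWishart(1, df = nu + nrow(resid), Sigma = solve(scale_post))[, , 1]
  Sigma_tilde <- solve(W)                        # inverse-Wishart posterior draw
  alpha_sq    <- sum(diag(Sigma_tilde)) / K      # working parameter
  list(Sigma = Sigma_tilde / alpha_sq, alpha_sq = alpha_sq)
}
```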

3.3. Posterior-based prediction

In our Bayesian setting, predictions of future observations y* at new values X* are based upon the posterior predictive distribution $p(y^* \mid y) = \int p(y^* \mid X^*, \Theta, y)\, p(\Theta \mid y)\, d\Theta$, where Θ consists of the full set of unknown parameters of MPBART. For a given loss function, predictions of y* are made using the optimal choice a ∈ {1, . . . , K} that minimizes the expected posterior predictive loss

E_{y^* \mid y} L(y^*, a) = \int L(y^*, a)\, p(y^* \mid y)\, dy^*,

where L(y*, a) is the loss incurred by using class a to predict the unknown choice outcome y*. We assume that the loss function L(y*, a) assigns a pre-specified non-negative loss to every combination of action a ∈ {1, . . . , K} and true choice y* ∈ {1, . . . , K}. These pre-specified loss combinations are described in Table 1 and can equivalently be expressed as

L(y^*, a) = \sum_{l=1}^{K} \sum_{m=1}^{K} C_{lm}\, I(y^* = l, a = m), \qquad (7)

where I (·) is the usual indicator function.

Table 1.

Pre-specified costs for the loss function L(y*, a).

                      Prediction a
True choice y*        1      2      ...    K
  1                   C11    C12    ...    C1K
  2                   C21    C22    ...    C2K
  ⋮                   ⋮      ⋮             ⋮
  K                   CK1    CK2    ...    CKK

Under the loss function (7), the expected posterior predictive loss is given by

E_{y^* \mid y} L(y^*, a) = \sum_{l=1}^{K} C_{la}\, p(y^* = l \mid y). \qquad (8)

We assume that the costs associated with a wrong prediction are all equal to a constant C and that a correct prediction costs 0 (i.e., Clm = C > 0 for l ≠ m, and Cll = 0). The expected posterior predictive loss (8) then simplifies to $E_{y^* \mid y} L(y^*, a) = C\{1 - p(y^* = a \mid y)\}$, which is minimized at

a^* = \arg\max_k \{p(y^* = k \mid y),\ k = 1, \ldots, K\}. \qquad (9)

The posterior predictive distribution p(y* = l | y) does not have a closed-form representation and is thus approximated using Monte Carlo samples drawn from the posterior distribution p(Θ | y). Once computed, these approximations enable the prediction in (9) through a search over the space a ∈ {1, . . . , K}.
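A short R sketch of this prediction step under a general cost matrix: average the MCMC draws of the choice probabilities and pick the action with the smallest expected loss (8), which under 0-1 loss reduces to the argmax rule in (9). The input names are hypothetical.

```r
## Sketch: prediction from MCMC output for one new X*. `prob_draws` is an
## (n_mcmc x K) matrix of posterior draws of the choice probabilities;
## `C` is the K x K cost matrix of Table 1 (C[l, m] = cost of predicting m
## when the truth is l).
predict_choice <- function(prob_draws, C) {
  p_hat <- colMeans(prob_draws)               # Monte Carlo estimate of p(y* = l | y)
  expected_loss <- as.vector(t(C) %*% p_hat)  # entry a: sum_l C[l, a] * p_hat[l], as in (8)
  which.min(expected_loss)
}

## Example with K = 3 and 0-1 loss: the rule agrees with which.max of the probabilities.
K <- 3
C <- 1 - diag(K)
prob_draws <- matrix(rep(c(0.2, 0.5, 0.3), 100), ncol = K, byrow = TRUE)
predict_choice(prob_draws, C)   # 2
```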

4. Synthetic data examples

4.1. A simulation study in the multinomial choice framework

In this three-choice simulation study, we use a function similar to the one used in Friedman (1991) to induce a nonlinear relationship between five choice specific predictors $w_k \in \mathbb{R}^5$, k = 1, 2, 3, and the choice alternatives. The choice specific predictors are drawn i.i.d. from Unif[0, 1]. In addition, we include a predictor $v \overset{iid}{\sim}$ Unif[0, 2] that describes the observed unit. Suppose that f(u) = 20 sin(πu1u2) – 20(u3 – 0.5)2 + 10u4 + 5u5, g(v) = 8v, and

\begin{bmatrix} z_1 \\ z_2 \end{bmatrix} = \begin{bmatrix} f(w_1 - w_3) + g(v) \\ f(w_2 - w_3) + g(v) \end{bmatrix} + \epsilon, \quad \epsilon \sim N\!\left(0, \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}\right). \qquad (10)

The response variable is then recorded using

y(z) = \begin{cases} k & \text{if } \max(z) = z_k > 0, \\ 3 & \text{if } \max(z) < 0, \end{cases} \quad \text{for } k = 1, 2.
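Putting (10) and the choice rule together, the data-generating scheme can be sketched in R as follows. The reading f(wk − w3) in (10) is a reconstruction of the original display, so treat this sketch as illustrative rather than an exact reproduction of the study.

```r
## Sketch of the data-generating scheme in (10) for n observations.
set.seed(2016)
n <- 500
f <- function(u) 20 * sin(pi * u[, 1] * u[, 2]) - 20 * (u[, 3] - 0.5)^2 + 10 * u[, 4] + 5 * u[, 5]
g <- function(v) 8 * v

w1 <- matrix(runif(n * 5), n, 5)      # choice-specific predictors for alternatives 1-3
w2 <- matrix(runif(n * 5), n, 5)
w3 <- matrix(runif(n * 5), n, 5)
v  <- runif(n, 0, 2)                  # unit-specific predictor

Sig <- matrix(c(1, 0.5, 0.5, 1), 2, 2)
eps <- matrix(rnorm(2 * n), n, 2) %*% chol(Sig)   # rows ~ N(0, Sig)

z <- cbind(f(w1 - w3) + g(v), f(w2 - w3) + g(v)) + eps
y <- ifelse(apply(z, 1, max) > 0, apply(z, 1, which.max), 3)   # choice rule above
```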

This true model contains linear, nonlinear, and interaction effects, making it an interesting benchmark data set. We are mainly interested in how well MPBART predicts the choices in a test data set. Hence, we simulate training and test data sets of 500 observations each and compare the predictive performance on the test data of MPBART, the Bayesian multinomial probit model (Bayes-MNP) in Imai & van Dyk (2005), the multinomial logit (MNL) model in Train (2009); McFadden et al. (1973), and the following multiclass classification procedures: support vector machines with linear (SVM-L) and radial (SVM-R) kernels in Cortes & Vapnik (1995); Vapnik (2013), random forests (RF) in Breiman (2001), linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) in Duda et al. (2012); Friedman et al. (2001), multinomial logistic regression (MNL) in McFadden et al. (1973), classification and regression trees (CART) in Breiman et al. (1984); Quinlan (1986), neural networks (NNET) in Ripley (2007), K-nearest neighbors (KNN) in Cover & Hart (1967), and one-vs-all BART (OvA-BART) in Agarwal et al. (2013). We note that for the multiclass classification procedures, each choice specific predictor enters as three separate predictors, one describing each of the choices, putting the total number of predictors for this simulation study at sixteen. For each competing procedure and for MPBART, we selected the tuning parameters via 10-fold cross-validation on the training data. Table 2 lists the names of these competing procedures, the corresponding R packages utilized, and their tuning parameters.

Table 2.

List of competing classifiers, the R packages utilized, and tuning parameters chosen by cross-validation. The abbreviations in the first column stand for the procedures listed in Section 4.1.

Procedure R Package Tuning parameter(s)
RF randomForest mtry
CART rpart no tuning parameters
SVM-L kernlab C
SVM-R kernlab C and σ
QDA MASS no tuning parameters
LDA MASS no tuning parameters
NNET nnet size and decay
MNL mlogit no tuning parameters
KNN caret k
OvA-BART dbarts k, power, base

The comparison metric we use in this example and all that follow is test error rate

\frac{1}{m} \sum_{i=1}^{m} I(\hat{y}_i \neq y_i), \qquad (11)

where yi and ŷi are the actual and predicted classes for the ith observation in a given test data set of size m. This metric corresponds to the loss function in (7) with a misclassification cost of Clm = 1 for l ≠ m and a cost of Cll = 0 for a correct prediction. As can be seen from Table 3, MPBART exhibits very good out-of-sample predictive accuracy. This is not surprising given the data generating scheme with nonlinear effects.

Table 3.

Comparison of MPBART and the procedures listed in Table 2 on the first simulation study generated via (10) and on the waveform recognition example (12). Training and test data sets of 500 observations each are used for the first simulation study; training and test data sets of 300 and 500 observations, respectively, are used for the waveform recognition example. Average test error rates (with standard errors in parentheses) over 20 replications are reported.

Procedure Simulation Study - I Waveform Recognition

Test Error Rate Rank Test Error Rate Rank

MPBART 0.2725 (0.0060) 1 0.1589 (0.0047) 2
Bayes-MNP 0.3976 (0.0065) 7 0.2167 (0.0197) 11
MNL 0.3921 (0.0064) 6 0.1721 (0.0052) 5
RF 0.4023 (0.0059) 8 0.1676 (0.0043) 3
CART 0.4791 (0.0080) 12 0.3113 (0.0068) 12
SVM-L 0.4072 (0.0058) 9 0.1844 (0.0043) 6
SVM-R 0.3254 (0.0057) 3 0.1708 (0.0053) 4
LDA 0.4095 (0.0064) 10 0.1997 (0.0048) 8
QDA 0.3381 (0.0045) 4 0.2125 (0.0043) 10
NNET 0.2917 (0.0065) 2 0.2012 (0.0071) 9
KNN 0.4195 (0.0070) 11 0.1847 (0.0048) 7
OvA-BART 0.3908 (0.0059) 5 0.1550 (0.0035) 1

4.2. A simulation study for multiclass classification

This simulation study employs the waveform recognition problem of Breiman et al. (1984), often used as a benchmark artificial data set in multiclass classification studies (Gama et al., 2003; Hastie & Tibshirani, 1996; Keerthi et al., 2005). The model has 21 predictors and a response with 3 classes. For each observation, the ith predictor xi is generated from

x_i = \begin{cases} u\, h_1(i) + (1-u)\, h_2(i) + \epsilon_i, & \text{if } y = 1, \\ u\, h_1(i) + (1-u)\, h_3(i) + \epsilon_i, & \text{if } y = 2, \\ u\, h_2(i) + (1-u)\, h_3(i) + \epsilon_i, & \text{if } y = 3, \end{cases} \qquad (12)

where i = 1, . . . , 21, u ~ Unif[0, 1], $\epsilon_i \sim N(0, 1)$, and h1, h2, h3 are three waveform functions: h1(i) = max(6 – |i – 11|, 0), h2(i) = h1(i – 4), and h3(i) = h1(i + 4).
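A short R sketch of a generator for (12); the equal class probabilities are an assumption for illustration, and the function name is hypothetical.

```r
## Sketch of the waveform generator in (12): 21 predictors, 3 classes.
h1 <- function(i) pmax(6 - abs(i - 11), 0)
h2 <- function(i) h1(i - 4)
h3 <- function(i) h1(i + 4)

gen_waveform <- function(n) {
  idx <- 1:21
  y   <- sample(1:3, n, replace = TRUE)   # classes taken as equally likely (assumption)
  x   <- t(sapply(seq_len(n), function(r) {
    u <- runif(1)
    base <- switch(y[r],
                   u * h1(idx) + (1 - u) * h2(idx),   # y = 1
                   u * h1(idx) + (1 - u) * h3(idx),   # y = 2
                   u * h2(idx) + (1 - u) * h3(idx))   # y = 3
    base + rnorm(21)                                   # add N(0, 1) noise
  }))
  data.frame(y = factor(y), x)
}

train <- gen_waveform(300)
test  <- gen_waveform(500)
```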

We generate 20 replications of training and test data sets with 300 and 500 observations, respectively, from (12) and compare MPBART with the classifiers listed in Table 2. Our choice of sample sizes matches Hastie & Tibshirani (1996) so that the results can be compared with theirs. Table 3 summarizes the average error rates (with standard errors in parentheses) over the 20 replications. For LDA, QDA, and CART, the error rates are consistent with those reported in Table 1 of Hastie & Tibshirani (1996). MPBART is among the best for this data generating scheme, exhibiting a low average test error rate. Note that Hastie & Tibshirani (1996) report a test error rate of 0.157 achieved by penalized mixture discriminant analysis.

5. Real data examples

5.1. Multinomial choice data examples

Two discrete choice data sets, dealing with fishing and travel mode choices, are used to illustrate MPBART. The fishing mode choice data set is a survey of 1,182 individuals who reported their most recent saltwater fishing mode as either “beach”, “pier”, “boat”, or “charter”. The choice specific variables in this data set are the expected catch rate per hour and the price for each mode of fishing, while the individual specific predictor is monthly income. Details of these data are in Kling & Thomson (1996); Herriges & Kling (1999), and we use the version of the data available in the R package mlogit. The second data set records the choice of travel mode between Sydney and Melbourne, Australia, as either “air”, “train”, “bus”, or “car” (Greene, 2003; Kleiber & Zeileis, 2008). It includes 210 individuals’ travel choices and the following choice specific predictors: the general cost associated with the travel mode, waiting time at a terminal (recorded as zero for the “car” choice), cost of the travel mode, and travel time. In addition, the individual specific predictors household income (in logarithms) and traveling party size are used. We use the version of the data set in the R package AER (Kleiber & Zeileis, 2008).
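For readers who wish to retrieve these data, the following R sketch loads them, assuming they ship with the packages under the names Fishing (mlogit) and TravelMode (AER); column names beyond the choice variable are not assumed here.

```r
## Sketch: loading the two choice data sets used below.
# install.packages(c("mlogit", "AER"))   # if not already installed
data("Fishing", package = "mlogit")
data("TravelMode", package = "AER")

table(Fishing$mode)      # counts of beach / pier / boat / charter choices
table(TravelMode$mode)   # air / train / bus / car alternatives (long format: one row per
                         # individual-alternative pair)
```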

After splitting the fishing mode data into ten and the travel mode data into five nearly equal random folds, we implement MPBART, the Bayesian multinomial probit model (Bayes-MNP), the multinomial logit (MNL) model, and the multiclass classification procedures listed in Table 2, with one fold set aside as test data and the remaining folds used to train the models. Table 4 reports the average test error rates along with their standard errors. MPBART is again among the procedures with the lowest error rates.
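The fold-based evaluation described here (and reused in Section 5.2) follows a generic pattern, sketched below in R. The fitting and prediction functions are hypothetical placeholders standing in for whichever procedure is being assessed, and the class label is assumed to sit in a column named `y`.

```r
## Generic sketch of the fold-based error estimation described above.
## `fit_fn` and `predict_fn` are hypothetical placeholders for a classifier's
## training and prediction routines; `dat` must have the class label in column `y`.
cv_error <- function(dat, n_folds, fit_fn, predict_fn) {
  folds <- sample(rep(seq_len(n_folds), length.out = nrow(dat)))  # random fold assignment
  errs  <- sapply(seq_len(n_folds), function(f) {
    train <- dat[folds != f, ]
    test  <- dat[folds == f, ]
    fit   <- fit_fn(train)
    yhat  <- predict_fn(fit, test)
    mean(yhat != test$y)                  # test error rate of (11) on the held-out fold
  })
  c(mean = mean(errs), se = sd(errs) / sqrt(n_folds))
}
```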

Table 4.

Comparison results on the fishing mode and travel mode choice data sets. Average classification error rates (with standard errors in parentheses) are reported.

Procedure Fishing Mode Travel Mode

Test Error Rate Rank Test Error Rate Rank

MPBART 0.3960 (0.0160) 1 0.0571 (0.0086) 2
Bayes-MNP 0.5546 (0.0171) 10 0.3286 (0.0394) 10
MNL 0.5600 (0.0160) 11 0.3143 (0.0332) 9
RF 0.4746 (0.0148) 3 0.0429 (0.0089) 1
CART 0.5372 (0.0147) 8 0.1048 (0.0161) 3
SVM-L 0.5034 (0.0139) 6 0.2143 (0.0345) 7
SVM-R 0.4882 (0.0194) 4 0.1381 (0.0254) 5
LDA 0.4975 (0.0193) 5 NA NA
NNET 0.5211 (0.0064) 7 0.3048 (0.0739) 8
KNN 0.5406 (0.0189) 9 0.1810 (0.0358) 6
OvA-BART 0.4434 (0.0144) 2 0.1143 (0.0158) 4

5.2. Multiclass classification data examples

The forensic glass and vertebral column classification data sets, both publicly available at the University of California at Irvine (UCI) machine learning data repository (Bache & Lichman, 2013), are used to illustrate MPBART as a multiclass classification procedure in comparison to the multiclass classification procedures listed in Table 2. The forensic glass classification data set consists of 9 features collected on 214 glass samples classified as one of 6 glass types: building windows float processed, building windows non-float processed, vehicle windows float processed, containers, tableware, or headlamps. The vertebral column data set contains 310 patients diagnosed as normal, having disk hernia, or having spondylolisthesis. This data set records the pathology of the human vertebral column, whose main function is the protection of the spinal cord, and its dependence on the characteristics of the pelvis and lumbar spine. Further details on the data set are available in da Rocha Neto et al. (2011); Calle-Alonso et al. (2013).

In our analysis, we split the forensic glass and vertebral column data sets into five and ten nearly equal random folds, respectively. One fold is set aside as test data, and the classification methods in Table 2 and MPBART are trained on the remaining folds. Table 5 shows the average classification error rates with standard errors in parentheses. QDA could not be implemented on the forensic glass data set since very few observations are classified as tableware; for the same reason, we only considered a five-fold partitioning of the forensic glass data. MPBART, RF, and OvA-BART are the top performing procedures in terms of having the lowest classification error.

Table 5.

Classification error rates and standard errors (in parentheses) for vertebral column and forensic glass data sets.

Procedure Vertebral Column Forensic Glass

Test Error Rate Rank Test Error Rate Rank

MPBART 0.1466 (0.0324) 1 0.2946 (0.0182) 2
RF 0.1645 (0.0265) 4 0.2056 (0.0089) 1
CART 0.1839 (0.0160) 8 0.3272 (0.0356) 5
SVM-L 0.1484 (0.0285) 2 0.3741 (0.0294) 8
SVM-R 0.1742 (0.0216) 6 0.3086 (0.0222) 4
LDA 0.1968 (0.0335) 9 0.3833 (0.0145) 9
QDA 0.1548 (0.0254) 3 NA NA
NNET 0.2161 (0.0259) 10 0.3740 (0.0172) 7
MNL 0.6129 (0.0304) 11 0.3834 (0.0269) 10
KNN 0.1806 (0.0334) 7 0.3506 (0.0316) 6
OvA-BART 0.1645 (0.0282) 5 0.3083 (0.0196) 3

6. Conclusion

We have proposed Bayesian ensembles of trees for multinomial probit regression and multiclass classification and tested their utility through simulation studies and real data examples. Regression trees and their ensembles are widely used for classification. However, their use in multinomial probit regression, which allows the introduction of choice specific predictors, is less explored. MPBART fills that gap in the literature. It exhibits very good predictive performance in a range of examples and is among the best when the relationship between the predictors and the choice response is nonlinear. A software implementation of MPBART is freely available as the R package mpbart. For the simulation studies and real data examples, the MPBART tuning parameters selected via cross-validation are available at https://github.com/bpkindo/mpbart_cv_selection/.

Acknowledgements

This research is partially supported by NSF Grant DMS1106435 and NIH Grants R01CA154731, RR17698, and 1P30GM103336-01A1. The authors thank Professor Edward I. George for his 2012 Palmetto Lecture at the University of South Carolina, which partly motivated this research. The authors also thank Professor James Lynch and Professor Edsel Peña's research group (A.K.M Rahman, Lillian Wanda, Piaomu Liu) for their comments and discussions.

References

  1. Abu-Nimeh S, Nappa D, Wang X, Nair S. Bayesian additive regression trees-based spam detection for enhanced email privacy. In: Availability, Reliability and Security, 2008 (ARES 08), Third International Conference on. 2008:1044–1051.
  2. Agarwal R, Ranjan P, Chipman H. A new Bayesian ensemble of trees classifier for identifying multi-class labels in satellite images. arXiv preprint arXiv:1304.4077. 2013.
  3. Albert JH, Chib S. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association. 1993;88(422):669–679.
  4. Bache K, Lichman M. UCI Machine Learning Repository. 2013.
  5. Bolduc D. A practical technique to estimate multinomial probit models in transportation. Transportation Research Part B: Methodological. 1999;33(1):63–79.
  6. Bonato V, Baladandayuthapani V, Broom BM, Sulman EP, Aldape KD, Do KA. Bayesian ensemble methods for survival prediction in gene expression data. Bioinformatics. 2011;27(3):359–367. doi: 10.1093/bioinformatics/btq660.
  7. Breiman L. Random forests. Machine Learning. 2001;45(1):5–32.
  8. Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and Regression Trees. Chapman & Hall/CRC; 1984.
  9. Burgette LF, Nordheim EV. The trace restriction: an alternative identification strategy for the Bayesian multinomial probit model. Journal of Business & Economic Statistics. 2012;30(3):404–410.
  10. Calle-Alonso F, Pérez C, Arias-Nicolás J, Martín J. Computer-aided diagnosis system: a Bayesian hybrid classification method. Computer Methods and Programs in Biomedicine. 2013;112(1):104–113. doi: 10.1016/j.cmpb.2013.05.029.
  11. Chipman H, George E, McCulloch R. BART: Bayesian additive regression trees. The Annals of Applied Statistics. 2010;4(1):266–298.
  12. Chipman HA, George EI, McCulloch RE. Bayesian CART model search. Journal of the American Statistical Association. 1998;93(443):935–948.
  13. Cortes C, Vapnik V. Support-vector networks. Machine Learning. 1995;20(3):273–297.
  14. Cover T, Hart P. Nearest neighbor pattern classification. IEEE Transactions on Information Theory. 1967;13(1):21–27.
  15. da Rocha Neto AR, Sousa R, Barreto GdA, Cardoso JS. Diagnostic of pathology on the vertebral column with embedded reject option. In: Pattern Recognition and Image Analysis. Springer; 2011:588–595.
  16. Duda RO, Hart PE, Stork DG. Pattern Classification. John Wiley & Sons; 2012.
  17. Evett IW, Spiehler E. Rule induction in forensic science. KBS in Government, Online Publications. 1987:107–118.
  18. Fauvel M, Chanussot J, Benediktsson JA. Evaluation of kernels for multiclass classification of hyperspectral remote sensing data. In: Proceedings of the 2006 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2006). Vol. 2. IEEE; 2006.
  19. Friedman J, Hastie T, Tibshirani R. The Elements of Statistical Learning. Vol. 1. Springer Series in Statistics. Berlin: Springer; 2001.
  20. Friedman JH. Multivariate adaptive regression splines. The Annals of Statistics. 1991:1–67.
  21. Friedman JH. Greedy function approximation: a gradient boosting machine. The Annals of Statistics. 2001:1189–1232.
  22. Gama J, Rocha R, Medas P. Accurate decision trees for mining high-speed data streams. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM; 2003:523–528.
  23. Geweke J, Keane M, Runkle D. Alternative computational approaches to inference in the multinomial probit model. The Review of Economics and Statistics. 1994:609–632.
  24. Greene WH. Econometric Analysis. 5th ed. Upper Saddle River, NJ; 2003.
  25. Hastie T, Tibshirani R. Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society, Series B (Methodological). 1996:155–176.
  26. Herriges JA, Kling CL. Nonlinear income effects in random utility models. Review of Economics and Statistics. 1999;81(1):62–72.
  27. Imai K, van Dyk DA. A Bayesian analysis of the multinomial probit model using marginal data augmentation. Journal of Econometrics. 2005;124(2):311–334.
  28. Keane MP. A note on identification in the multinomial probit model. Journal of Business & Economic Statistics. 1992;10(2):193–200.
  29. Keerthi SS, Duan K, Shevade S, Poo AN. A fast dual algorithm for kernel logistic regression. Machine Learning. 2005;61(1-3):151–165.
  30. Khan J, Wei J, Ringner M, Saal L, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu C, Peterson C, et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine. 2001;7(6):673–679. doi: 10.1038/89044.
  31. Kleiber C, Zeileis A. Applied Econometrics with R. Springer Science & Business Media; 2008.
  32. Kling CL, Thomson CJ. The implications of model specification for welfare estimation in nested logit models. American Journal of Agricultural Economics. 1996;78(1):103–114.
  33. Lee Y, Wahba G, Ackerman SA. Cloud classification of satellite radiance data by multicategory support vector machines. Journal of Atmospheric and Oceanic Technology. 2004;21(2):159–169.
  34. Li T, Zhang C, Ogihara M. A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics. 2004;20(15):2429–2437. doi: 10.1093/bioinformatics/bth267.
  35. Liu JS, Zhou Q. Predictive modeling approaches for studying protein-DNA binding. Proceedings of ICCM 2007. 2007.
  36. McCulloch R, Rossi PE. An exact likelihood analysis of the multinomial probit model. Journal of Econometrics. 1994;64(1):207–240.
  37. McCulloch RE, Polson NG, Rossi PE. A Bayesian analysis of the multinomial probit model with fully identified parameters. Journal of Econometrics. 2000;99(1):173–193.
  38. McFadden D. A method of simulated moments for estimation of discrete response models without numerical integration. Econometrica: Journal of the Econometric Society. 1989:995–1026.
  39. McFadden D, et al. Conditional logit analysis of qualitative choice behavior. 1973.
  40. Meng XL, van Dyk DA. Seeking efficient data augmentation schemes via conditional and marginal augmentation. Biometrika. 1999;86(2):301–320.
  41. Nobile A. A hybrid Markov chain for the Bayesian analysis of the multinomial probit model. Statistics and Computing. 1998;8(3):229–242.
  42. Quinlan JR. Induction of decision trees. Machine Learning. 1986;1(1):81–106.
  43. Quinn KM, Martin AD, Whitford AB. Voter choice in multi-party democracies: a test of competing theories and models. American Journal of Political Science. 1999:1231–1247.
  44. Ripley BD. Pattern Recognition and Neural Networks. Cambridge University Press; 2007.
  45. Train KE. Discrete Choice Methods with Simulation. Cambridge University Press; 2009.
  46. van Dyk DA. Marginal Markov chain Monte Carlo methods. Statistica Sinica. 2010;20(4):1423.
  47. Vapnik V. The Nature of Statistical Learning Theory. Springer Science & Business Media; 2013.
  48. Vergara A, Vembu S, Ayhan T, Ryan MA, Homer ML, Huerta R. Chemical gas sensor drift compensation using classifier ensembles. Sensors and Actuators B: Chemical. 2012.
  49. Zhang JL, Härdle WK. The Bayesian additive classification tree applied to credit risk modelling. Computational Statistics & Data Analysis. 2010;54(5):1197–1205.
