Entropy. 2020 Apr 30;22(5):513. doi: 10.3390/e22050513

Objective Bayesian Inference in Probit Models with Intrinsic Priors Using Variational Approximations

Ang Li 1,†, Luis Pericchi 1,*,†, Kun Wang 1,†
PMCID: PMC7517004  PMID: 33286285

Abstract

There is little literature on objective Bayesian analysis for binary classification problems, especially for methods based on intrinsic priors. On the other hand, variational inference methods have been employed to solve classification problems using probit regression and logistic regression with normal priors. In this article, we propose to apply variational approximations to probit regression models with intrinsic priors. We review the mean-field variational method and the procedure for developing an intrinsic prior for the probit regression model. We then present our work on implementing the variational Bayesian probit regression model with an intrinsic prior. Publicly available data from the world’s largest peer-to-peer lending platform, LendingClub, are used to illustrate how model output uncertainties are addressed through the proposed framework. With the LendingClub data, the target variable is the final status of a loan, either charged-off or fully paid. Investors may very well be interested in how predictive features like FICO, amount financed, income, etc. affect the final loan status.

Keywords: objective Bayesian inference, intrinsic prior, variational inference, binary probit regression, mean-field approximation

1. Introduction

There is little literature on objective Bayesian analysis for binary classification problems, especially for methods based on intrinsic priors. So far, only two articles have explored intrinsic prior related methods for classification problems. Reference [1] implements integral priors in generalized linear models with various link functions, and reference [2] considers intrinsic priors for probit models. On the other hand, variational inference methods have been employed to solve classification problems using logistic regression [3] and probit regression [4,5] with normal priors. Variational approximation methods have been reviewed in [6,7] and, more recently, in [8].

In this article, we propose to apply variational approximations to probit regression models with intrinsic priors. In Section 3, the procedure for developing intrinsic priors for probit models is introduced following [2]. In Section 4, we review the mean-field variational method used in this article. Our work is presented in Section 5. Our motivations for combining intrinsic prior methodology and variational inference are as follows:

  • Avoiding manually specified, ad hoc plug-in priors by automatically generating a family of non-informative priors that are less sensitive to subjective choices.

  • References [1,2] do not consider inference on the posterior distributions of parameters; their focus is on model comparison. Although the development of intrinsic priors itself comes from a model selection background, we thought it would be interesting to apply intrinsic priors to inference problems. In fact, some recently developed priors proposed to solve inference or estimation problems turned out to also be intrinsic priors, for example, the Scaled Beta2 prior [9] and the Matrix-F prior [10].

  • Intrinsic priors concentrate probability near the null hypothesis, a condition that is widely accepted and should be required of a prior for testing a hypothesis.

  • Also, intrinsic priors have flat tails that prevent finite-sample inconsistency [11].

  • For inference problems with large data sets, variational approximation methods are much faster than MCMC-based methods.

As for model comparison, since the output of variational inference methods cannot be employed directly to compare models, we propose in Section 5.3 to use the variational approximation of the posterior distribution as an importance function and to obtain a Monte Carlo estimate of the marginal likelihood by importance sampling.

2. Background and Development of Intrinsic Prior Methodology

2.1. Bayes Factor

The Bayesian framework of model selection coherently involves the use of probability to express all uncertainty in the choice of model, including uncertainty about the unknown parameters of a model. Suppose that models M1,M2,...,Mq are under consideration. We shall assume that the observed data x=(x1,x2,...,xn) is generated from one of these models but we do not know which one it is. We express our uncertainty through prior probability P(Mj),j=1,2,...,q. Under model Mi, x has density fi(x|θi,Mi), where θi are unknown model parameters, and the prior distribution for θi is πi(θi|Mi). Given observed data and prior probabilities, we can then evaluate the posterior probability of Mi using Bayes’ rule

P(M_i \mid x) = \frac{p_i(x \mid M_i)\,P(M_i)}{\sum_{j=1}^{q} p_j(x \mid M_j)\,P(M_j)}, (1)

where

p_i(x \mid M_i) = \int f_i(x \mid \theta_i, M_i)\,\pi_i(\theta_i \mid M_i)\, d\theta_i (2)

is the marginal likelihood of x under M_i, also called the evidence for M_i [12]. A common choice of prior model probabilities is P(M_j) = 1/q, so that each model has the same initial probability. However, there are other ways of assigning probabilities that correct for multiple comparisons (see [13]). From (1), the posterior odds are therefore the prior odds multiplied by the Bayes factor

\frac{P(M_j \mid x)}{P(M_i \mid x)} = \frac{P(M_j)\,p_j(x)}{P(M_i)\,p_i(x)} = \frac{P(M_j)}{P(M_i)} \times B_{ji}, (3)

where the Bayes factor of Mj to Mi is defined by

B_{ji} = \frac{p_j(x)}{p_i(x)} = \frac{\int f_j(x \mid \theta_j)\,\pi_j(\theta_j)\, d\theta_j}{\int f_i(x \mid \theta_i)\,\pi_i(\theta_i)\, d\theta_i}. (4)

Here we omit the dependence on the models M_j, M_i to keep the notation simple. The marginal likelihood p_i(x) expresses the preference shown by the observed data for different models. When B_{ji} > 1, the data favor M_j over M_i, and when B_{ji} < 1 the data favor M_i over M_j. A scale for the interpretation of B_{ji} is given in [14].

2.2. Motivation and Development of Intrinsic Prior

Computing B_{ji} requires specification of \pi_i(\theta_i) and \pi_j(\theta_j). Often in Bayesian analysis, when prior information is weak, one can use non-informative (or default) priors \pi_i^N(\theta_i). Common choices for non-informative priors are the uniform prior, \pi_i^U(\theta_i) \propto 1, and the Jeffreys prior, \pi_i^J(\theta_i) \propto \det(I_i(\theta_i))^{1/2}, where I_i(\theta_i) is the expected Fisher information matrix corresponding to M_i.

Using any of the πiN in (4) would yield

B_{ji}^N = \frac{p_j^N(x)}{p_i^N(x)} = \frac{\int f_j(x \mid \theta_j)\,\pi_j^N(\theta_j)\, d\theta_j}{\int f_i(x \mid \theta_i)\,\pi_i^N(\theta_i)\, d\theta_i}. (5)

The difficulty with (5) is that the \pi_i^N are typically improper and hence are defined only up to an unspecified constant c_i. So B_{ji}^N is defined only up to the ratio c_j/c_i of two unspecified constants.

An attempt to circumvent the ill definition of Bayes factors under improper non-informative priors is the intrinsic Bayes factor introduced by [15], which is a modification of the partial Bayes factor [16]. To define the intrinsic Bayes factor, we consider the subsamples x(l) of the data x of minimal size l such that 0 < p_i^N(x(l)) < \infty. These subsamples are called training samples (not to be confused with the training sample of machine learning). There are L such subsamples in total.

The main idea here is that training sample x(l) will be used to convert the improper πiN(θi) to proper posterior

\pi_i^N(\theta_i \mid x(l)) = \frac{f_i(x(l) \mid \theta_i)\,\pi_i^N(\theta_i)}{p_i^N(x(l))}, (6)

where p_i^N(x(l)) = \int f_i(x(l) \mid \theta_i)\,\pi_i^N(\theta_i)\, d\theta_i. Then, the Bayes factor for the remainder of the data x(n-l), where x(l) \cup x(n-l) = x, using \pi_i^N(\theta_i \mid x(l)) as the prior, is called a “partial” Bayes factor,

B_{ji}^N(x(n-l) \mid x(l)) = \frac{\int f_j(x(n-l) \mid \theta_j)\,\pi_j^N(\theta_j \mid x(l))\, d\theta_j}{\int f_i(x(n-l) \mid \theta_i)\,\pi_i^N(\theta_i \mid x(l))\, d\theta_i}. (7)

This partial Bayes factor is a well-defined Bayes factor, and can be written as B_{ji}^N(x(n-l) \mid x(l)) = B_{ji}^N(x)\,B_{ij}^N(x(l)), where B_{ji}^N(x) = p_j^N(x)/p_i^N(x) and B_{ij}^N(x(l)) = p_i^N(x(l))/p_j^N(x(l)). Clearly, B_{ji}^N(x(n-l) \mid x(l)) depends on the choice of the training sample x(l). To eliminate this arbitrariness and increase stability, reference [15] suggests averaging over all training samples, which yields the arithmetic intrinsic Bayes factor (AIBF)

B_{ji}^{AIBF}(x) = B_{ji}^N(x) \cdot \frac{1}{L}\sum_{l=1}^{L} B_{ij}^N(x(l)). (8)

The strongest justification of the arithmetic IBF is its asymptotic equivalence with a proper Bayes factor arising from Intrinsic priors. These intrinsic priors were identified through an asymptotic analysis (see [15]). For the case where Mi is nested in Mj, it can be shown that the intrinsic priors are given by

\pi_i^I(\theta_i) = \pi_i^N(\theta_i) \quad \text{and} \quad \pi_j^I(\theta_j) = \pi_j^N(\theta_j)\; E^{M_j}\!\left[\frac{m_i^N(x(l))}{m_j^N(x(l))} \,\Big|\, \theta_j\right]. (9)

3. Objective Bayesian Probit Regression Models

3.1. Bayesian Probit Model and the Use of Auxiliary Variables

Consider a sample y = (y_1, ..., y_n), where Y_i, i = 1, ..., n, is a 0–1 random variable such that, under model M_j, it follows a probit regression model with a (j+1)-dimensional vector of covariates x_i, where j \le p. Here, p is the total number of covariate variables under consideration. The probit model M_j has the form

Y_i \mid \beta_0, ..., \beta_j, M_j \sim \text{Bernoulli}\big(\Phi(\beta_0 x_{0i} + \beta_1 x_{1i} + ... + \beta_j x_{ji})\big), \quad 1 \le i \le n, (10)

where \Phi denotes the standard normal cumulative distribution function and \beta_j = (\beta_0, ..., \beta_j)' is a vector of dimension j+1. The first component of the vector x_i is set equal to 1, so that the intercept is included in every model of the form (10). The maximum length of the vector of covariates is p+1. Let \pi(\beta), proper or improper, summarize our prior information about \beta. Then the posterior density of \beta is given by

\pi(\beta \mid y) = \frac{\pi(\beta)\prod_{i=1}^{n}\Phi(x_i'\beta)^{y_i}\,(1-\Phi(x_i'\beta))^{1-y_i}}{\int \pi(\beta)\prod_{i=1}^{n}\Phi(x_i'\beta)^{y_i}\,(1-\Phi(x_i'\beta))^{1-y_i}\, d\beta},

which is largely intractable.

As shown by [17], the Bayesian probit regression model becomes tractable when a particular set of auxiliary variables is introduced. Following the data augmentation approach of [18], we introduce n latent variables Z_1, ..., Z_n, where

Z_i \mid \beta \sim N(x_i'\beta, 1).

The probit model (10) can be thought of as a regression model with incomplete sampling information by considering that only the sign of zi is observed. More specifically, define Yi=1 if Zi>0 and Yi=0 otherwise. This allows us to write the probability density of yi given zi

p(y_i \mid z_i) = I(z_i > 0)\,I(y_i = 1) + I(z_i \le 0)\,I(y_i = 0).

Expansion of the parameter set from {β} to {β,Z} is the key to achieving a tractable solution for variational approximation.
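The latent-variable construction above is easy to check by simulation. The following minimal sketch is our own illustration (the design matrix and coefficients are arbitrary); it generates data from the probit model through the auxiliary variables and verifies that P(Y_i = 1) = \Phi(x_i'\beta).

```python
import numpy as np
from scipy.stats import norm

# Minimal simulation of the latent-variable representation in Section 3.1:
# Z_i | beta ~ N(x_i' beta, 1) and Y_i = 1(Z_i > 0), so P(Y_i = 1) = Phi(x_i' beta).
rng = np.random.default_rng(0)
n = 100_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one covariate
beta_true = np.array([-0.5, 1.0])                       # illustrative coefficients

z = X @ beta_true + rng.normal(size=n)                  # auxiliary variables Z_i
y = (z > 0).astype(int)                                 # observed binary responses

# Empirical check: mean of y should be close to the average of Phi(x_i' beta).
print(y.mean(), norm.cdf(X @ beta_true).mean())
```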

3.2. Development of Intrinsic Prior for Probit Models

For the sample z=(z1,...,zn), the null normal model is

M1:{Nn(z|α1n,In),π(α)}.

For a generic model Mj with j+1 regressors, the alternative model is

Mj:{Nn(z|Xjβj,In),π(βj)},

where the design matrix Xj has dimensions n×(j+1). Intrinsic prior methodology for the linear model was first developed by [19], and was further developed in [20] by using the methods of [21]. This intrinsic methodology gives us an automatic specification of the priors π(α) and π(β), starting with the non-informative priors πN(α) and πN(β) for α and β, which are both improper and proportional to 1.

The marginal distributions for the sample z under the null model, and under the alternative model with intrinsic prior, are formally written as

p_1(z) = \int N_n(z \mid \alpha 1_n, I_n)\,\pi^N(\alpha)\, d\alpha, \qquad p_j(z) = \int\!\!\int N_n(z \mid X_j\beta_j, I_n)\,\pi^I(\beta \mid \alpha)\,\pi^N(\alpha)\, d\alpha\, d\beta. (11)

These, however, are marginals of the sample z, whereas our selection procedure requires the Bayes factor of model M_j versus the reference model M_1 for the sample y = (y_1, ..., y_n). To solve this problem, reference [2] proposed transforming the marginal p_j(z) into the marginal p_j(y) by using the probit transformation y_i = 1(z_i > 0), i = 1, ..., n. These latter marginals are given by

p_j(y) = \int_{A_1 \times ... \times A_n} p_j(z)\, dz, (12)

where

A_i = \begin{cases} (0, \infty) & \text{if } y_i = 1, \\ (-\infty, 0) & \text{if } y_i = 0. \end{cases} (13)

4. Variational Inference

4.1. Overview of Variational Methods

Variational methods have their origins in the 18th century with the work of Euler, Lagrange, and others on the calculus of variations (The derivation in this section is standard in the literature on variational approximation and will at times follow the arguments in [22,23]). Variational inference is a body of deterministic techniques for making approximate inference for parameters in complex statistical models. Variational approximations are a much faster alternative to Markov Chain Monte Carlo (MCMC), especially for large models, and are a richer class of methods than the Laplace approximation [6].

Suppose we have a Bayesian model and a prior distribution for the parameters. The model may also have latent variables; here we denote the set of all latent variables and parameters by \theta, and the set of all observed variables by X. Given a set of n independent, identically distributed data points, for which X = \{x_1, ..., x_n\} and \theta = \{\theta_1, ..., \theta_n\}, our probabilistic model (e.g., a probit regression model) specifies the joint distribution p(X, \theta), and our goal is to find an approximation for the posterior distribution p(\theta \mid X) as well as for the marginal likelihood p(X). For any probability distribution q(\theta), we have the following decomposition of the log marginal likelihood

\ln p(X) = L(q) + KL(q \,\|\, p),

where we have defined

L(q) = \int q(\theta) \ln\frac{p(X, \theta)}{q(\theta)}\, d\theta (14)
KL(q \,\|\, p) = -\int q(\theta) \ln\frac{p(\theta \mid X)}{q(\theta)}\, d\theta (15)

We refer to (14) as the lower bound on the log marginal likelihood with respect to the density q, and (15) is by definition the Kullback–Leibler divergence between q and the posterior p(\theta \mid X). Based on this decomposition, we can maximize the lower bound L(q) by optimizing with respect to the distribution q(\theta), which is equivalent to minimizing the KL divergence. The bound is attained when the KL divergence is zero, which happens when q(\theta) equals the posterior distribution p(\theta \mid X); such a density is, however, hard to find, since the true posterior distribution is intractable.
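For completeness, the decomposition itself follows in one line from p(X, \theta) = p(\theta \mid X)\, p(X) and \int q(\theta)\, d\theta = 1; the short derivation below is standard in the variational literature (e.g., [22]) and is added here only as a reminder.

```latex
% Adding (14) and (15) and using p(X, \theta) = p(\theta \mid X) p(X):
\begin{align*}
L(q) + KL(q \,\|\, p)
  &= \int q(\theta)\,\ln\frac{p(X,\theta)}{q(\theta)}\,d\theta
     - \int q(\theta)\,\ln\frac{p(\theta \mid X)}{q(\theta)}\,d\theta \\
  &= \int q(\theta)\,\ln\frac{p(X,\theta)}{p(\theta \mid X)}\,d\theta
   = \int q(\theta)\,\ln p(X)\,d\theta
   = \ln p(X).
\end{align*}
```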

4.2. Factorized Distributions

The essence of the variational inference approach is to approximate the posterior distribution p(\theta \mid X) by a density q(\theta) for which the q-dependent lower bound L(q) is more tractable than the original model evidence. Tractability is achieved by restricting q to a more manageable class of distributions and then maximizing L(q) over that class.

Suppose we partition elements of θ into disjoint groups {θi} where i=1,...,M. We then assume that the q density factorizes with respect to this partition, i.e.,

q(\theta) = \prod_{i=1}^{M} q_i(\theta_i). (16)

The product form is the only assumption we make about the distribution. Restriction (16) is also known as the mean-field approximation and has its roots in physics [24].

Among all distributions q(\theta) of the form (16), we need to find the one for which the lower bound L(q) is largest. Restricting q to a subclass of product densities like (16) gives rise to explicit solutions for each product component in terms of the others, which in turn leads to an iterative scheme for obtaining the solutions. To achieve this, we first substitute (16) into (14) and then separate out the dependence on one of the factors q_j(\theta_j). Denoting q_j(\theta_j) by q_j to keep the notation clear, we obtain

L(q) = \int \prod_{i=1}^{M} q_i \Big\{\ln p(X, \theta) - \sum_{i=1}^{M} \ln q_i\Big\}\, d\theta = \int q_j \Big\{\int \ln p(X, \theta) \prod_{i \ne j} q_i\, d\theta_i\Big\}\, d\theta_j - \int q_j \ln q_j\, d\theta_j + \text{const} = \int q_j \ln \tilde{p}(X, \theta_j)\, d\theta_j - \int q_j \ln q_j\, d\theta_j + \text{const}, (17)

where p˜(X,θj) is given by

\ln \tilde{p}(X, \theta_j) = E_{i \ne j}[\ln p(X, \theta)] + \text{const}. (18)

The notation E_{i \ne j}[\cdot] denotes an expectation with respect to the q distributions over all variables \theta_i for i \ne j, so that

E_{i \ne j}[\ln p(X, \theta)] = \int \ln p(X, \theta) \prod_{i \ne j} q_i\, d\theta_i.

Now suppose we keep the \{q_i, i \ne j\} fixed and maximize L(q) in (17) with respect to all possible forms for the density q_j(\theta_j). Recognizing that (17) is the negative KL divergence between q_j(\theta_j) and \tilde{p}(X, \theta_j), we see that maximizing (17) is equivalent to minimizing this KL divergence, and the minimum occurs when q_j(\theta_j) = \tilde{p}(X, \theta_j). The optimal q_j^*(\theta_j) is then

\ln q_j^*(\theta_j) = E_{i \ne j}[\ln p(X, \theta)] + \text{const}. (19)

The above solution says that the log of the optimal q_j is obtained simply by taking the log of the joint distribution over all parameters, latent variables, and observed variables, and then taking its expectation with respect to all the other factors q_i, i \ne j. Normalizing the exponential of (19), we have

q_j^*(\theta_j) = \frac{\exp(E_{i \ne j}[\ln p(X, \theta)])}{\int \exp(E_{i \ne j}[\ln p(X, \theta)])\, d\theta_j}.

The set of equations (19) for j = 1, ..., M does not constitute an explicit solution, because the right-hand side of (19) for the optimal q_j^* depends on expectations taken with respect to the other factors q_i, i \ne j. We therefore first initialize all of the factors q_i(\theta_i) and then cycle through the factors one by one, replacing each in turn with the updated estimate given by the right-hand side of (19), evaluated using the current estimates of all the other factors. Convexity properties can be used to show that convergence to at least a local optimum is guaranteed [25]. The iterative procedure is described in Algorithm 1.

Algorithm 1 Iterative procedure for obtaining the optimal densities under the factorized density restriction (16). The updates are based on the solutions given by (19).
  • 1: Initialize q_2(\theta_2), ..., q_M(\theta_M).
  • 2: Cycle through
    q_1(\theta_1) \leftarrow \frac{\exp(E_{i \ne 1}[\ln p(X, \theta)])}{\int \exp(E_{i \ne 1}[\ln p(X, \theta)])\, d\theta_1}, \quad ..., \quad q_M(\theta_M) \leftarrow \frac{\exp(E_{i \ne M}[\ln p(X, \theta)])}{\int \exp(E_{i \ne M}[\ln p(X, \theta)])\, d\theta_M},
    until the increase in L(q) is negligible.

5. Incorporate Intrinsic Prior with Variational Approximation to Bayesian Probit Models

5.1. Derivation of Intrinsic Prior to Be Used in Variational Inference

Let X_l be the design matrix of a minimal training sample (mTS) of the normal regression model M_j for the variable Z \sim N(X_j\beta_j, I_{j+1}). We have, for the (j+1)-dimensional parameter \beta_j,

\int N_{j+1}(z_l \mid X_l\beta_j, I_{j+1})\, d\beta_j = \begin{cases} |X_l'X_l|^{-1/2} & \text{if rank}(X_l) = j+1, \\ \infty & \text{otherwise}. \end{cases}

Therefore, it follows that the mTS size is j+1 [2]. Given that the priors for \alpha and \beta are proportional to 1, the intrinsic prior for \beta conditional on \alpha can be derived. Let \beta_0 denote the vector whose first component equals \alpha and whose other components equal zero. Based on Formula (9), we have

\pi^I(\beta \mid \alpha) = \pi_j^N(\beta)\, E_{z_l \mid \beta}^{M_j}\!\left[\frac{p_1(z_l \mid \alpha)}{\int p_j(z_l \mid \beta)\,\pi_j^N(\beta)\, d\beta}\right] = E_{z_l \mid \beta}^{M_j}\!\left[\frac{\exp\{-\frac{1}{2}(z_l - X_l\beta_0)'(z_l - X_l\beta_0)\}}{\int \exp\{-\frac{1}{2}(z_l - X_l\beta)'(z_l - X_l\beta)\}\, d\beta}\right] = (2\pi)^{-\frac{j+1}{2}}\,|(X_l'X_l)^{-1}|^{-\frac{1}{2}} \times E_{z_l \mid \beta}^{M_j}\!\left[\exp\{-\tfrac{1}{2}(z_l - X_l\beta_0)'(z_l - X_l\beta_0)\}\right] = (2\pi)^{-\frac{j+1}{2}}\,|2(X_l'X_l)^{-1}|^{-\frac{1}{2}}\,\exp\Big\{-\frac{1}{2}\Big[(\beta - \beta_0)'\,\frac{X_l'X_l}{2}\,(\beta - \beta_0)\Big]\Big\}.

Therefore,

\pi^I(\beta \mid \alpha) = N_{j+1}\big(\beta \mid \beta_0,\, 2(X_l'X_l)^{-1}\big), \quad \text{where } \beta_0 = (\alpha, 0, ..., 0)' \text{ is } (j+1)\times 1.

Notice that X_l'X_l is unknown because it is a theoretical design matrix corresponding to the training sample z_l. It can be estimated by averaging over all submatrices containing j+1 rows of the n \times (j+1) design matrix X_j. This average is \frac{j+1}{n} X_j'X_j (see [26] and Appendix A in [2]), and therefore

\pi^I(\beta \mid \alpha) = N_{j+1}\Big(\beta \mid \beta_0,\, \frac{2n}{j+1}(X_j'X_j)^{-1}\Big).

Next, based on πI(β|α), the intrinsic prior for β can be obtained by

\pi^I(\beta) = \int \pi^I(\beta \mid \alpha)\,\pi^I(\alpha)\, d\alpha. (20)

Since we assume that \pi^I(\alpha) = \pi^N(\alpha) is proportional to one, set \pi^N(\alpha) = c, where c is an arbitrary positive constant. Denoting \frac{2n}{j+1}(X_j'X_j)^{-1} by \Sigma_{\beta|\alpha}, we obtain

\pi^I(\beta) = \int \pi^I(\beta \mid \alpha)\,\pi^I(\alpha)\, d\alpha = \int c\,(2\pi)^{-\frac{j+1}{2}}\,|\Sigma_{\beta|\alpha}|^{-\frac{1}{2}}\,\exp\{-\tfrac{1}{2}(\beta - \beta_0)'\Sigma_{\beta|\alpha}^{-1}(\beta - \beta_0)\}\, d\alpha \propto \exp\{-\tfrac{1}{2}\beta'\Sigma_{\beta|\alpha}^{-1}\beta\} \times \int \exp\{-\tfrac{1}{2}[\beta_0'\Sigma_{\beta|\alpha}^{-1}\beta_0 - 2\beta'\Sigma_{\beta|\alpha}^{-1}\beta_0]\}\, d\alpha = \exp\{-\tfrac{1}{2}\beta'\Sigma_{\beta|\alpha}^{-1}\beta\} \times \int \exp\{-\tfrac{1}{2}(\Sigma_{\beta|\alpha(1,1)}^{-1}\alpha^2 - 2\beta'\Sigma_{\beta|\alpha(\cdot 1)}^{-1}\alpha)\}\, d\alpha, (21)

where \Sigma_{\beta|\alpha(1,1)}^{-1} is the entry of \Sigma_{\beta|\alpha}^{-1} in row 1, column 1, and \Sigma_{\beta|\alpha(\cdot 1)}^{-1} is the first column of \Sigma_{\beta|\alpha}^{-1}. Denoting \Sigma_{\beta|\alpha(1,1)}^{-1} by \sigma_{11} and \Sigma_{\beta|\alpha(\cdot 1)}^{-1} by \gamma_1, we then obtain

\pi^I(\beta) \propto \exp\{-\tfrac{1}{2}\beta'\Sigma_{\beta|\alpha}^{-1}\beta\} \times \int \exp\Big\{-\tfrac{1}{2}\sigma_{11}\Big(\alpha - \frac{\beta'\gamma_1}{\sigma_{11}}\Big)^2 + \tfrac{1}{2}\frac{(\beta'\gamma_1)^2}{\sigma_{11}}\Big\}\, d\alpha = \Big(\frac{2\pi}{\sigma_{11}}\Big)^{1/2}\exp\Big\{-\tfrac{1}{2}\Big(\beta'\Sigma_{\beta|\alpha}^{-1}\beta - \frac{\beta'\gamma_1\gamma_1'\beta}{\sigma_{11}}\Big)\Big\} \propto \exp\Big\{-\tfrac{1}{2}\beta'\Big(\Sigma_{\beta|\alpha}^{-1} - \frac{\gamma_1\gamma_1'}{\sigma_{11}}\Big)\beta\Big\}. (22)

Therefore, we have derived that

\pi^I(\beta) \propto N_{j+1}\Big(0,\, \Big(\Sigma_{\beta|\alpha}^{-1} - \frac{\gamma_1\gamma_1'}{\sigma_{11}}\Big)^{-1}\Big). (23)

For model comparison, the specific form of the intrinsic prior may be needed, including the constant factor. Therefore, by following (21) and (22) we have

\pi^I(\beta) = c\,(2\pi)^{-\frac{j+1}{2}}\,|\Sigma_{\beta|\alpha}|^{-\frac{1}{2}}\,(2\pi)^{\frac{j+1}{2}}\,\Big|\Big(\Sigma_{\beta|\alpha}^{-1} - \frac{\gamma_1\gamma_1'}{\sigma_{11}}\Big)^{-1}\Big|^{\frac{1}{2}}\,\Big(\frac{2\pi}{\sigma_{11}}\Big)^{1/2} \times N_{j+1}\Big(0, \Big(\Sigma_{\beta|\alpha}^{-1} - \frac{\gamma_1\gamma_1'}{\sigma_{11}}\Big)^{-1}\Big) = c\,\Big|\Sigma_{\beta|\alpha}\Big(\Sigma_{\beta|\alpha}^{-1} - \frac{\gamma_1\gamma_1'}{\sigma_{11}}\Big)\Big|^{-\frac{1}{2}}\,\Big(\frac{2\pi}{\sigma_{11}}\Big)^{1/2} \times N_{j+1}\Big(0, \Big(\Sigma_{\beta|\alpha}^{-1} - \frac{\gamma_1\gamma_1'}{\sigma_{11}}\Big)^{-1}\Big) = c\,\Big(\frac{2\pi}{\sigma_{11}}\Big)^{1/2}\,\Big|I - \frac{\gamma_1\gamma_1'}{\sigma_{11}}\Sigma_{\beta|\alpha}\Big|^{-\frac{1}{2}} \times N_{j+1}\Big(0, \Big(\Sigma_{\beta|\alpha}^{-1} - \frac{\gamma_1\gamma_1'}{\sigma_{11}}\Big)^{-1}\Big). (24)
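In the algorithms of Section 5.2, the intrinsic prior enters only through its precision matrix. The sketch below is our own illustration (the function name is hypothetical, not from the paper) of how this precision can be assembled from the design matrix; note that it is rank-deficient in the intercept direction, reflecting the flat prior inherited from \alpha, so downstream computations use the precision directly rather than its inverse.

```python
import numpy as np

def intrinsic_prior_precision(X):
    """Precision matrix Sigma_beta^{-1} of the intrinsic prior in Eq. (23),
    with Sigma_{beta|alpha} = (2n/(j+1)) (X'X)^{-1}. X is the n x (j+1)
    design matrix with a leading column of ones."""
    n, d = X.shape                               # d = j + 1
    prec_ba = (d / (2.0 * n)) * (X.T @ X)        # Sigma_{beta|alpha}^{-1}
    gamma1 = prec_ba[:, 0]                       # first column of the precision
    sigma11 = prec_ba[0, 0]                      # its (1,1) entry
    # Eq. (23): Sigma_beta^{-1} = Sigma_{beta|alpha}^{-1} - gamma1 gamma1' / sigma11
    return prec_ba - np.outer(gamma1, gamma1) / sigma11
```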

5.2. Variational Inference for Probit Model with Intrinsic Prior

5.2.1. Iterative Updates for Factorized Distributions

We have that

Z_i \mid \beta \sim N(x_i'\beta, 1) \quad \text{and} \quad p(y_i \mid z_i) = I(z_i > 0)\,I(y_i = 1) + I(z_i \le 0)\,I(y_i = 0)

in Section 3.1. We have shown in Section 5.1 that

\pi^I(\beta) \propto N_{j+1}(\mu_\beta, \Sigma_\beta),

where \mu_\beta = 0 and \Sigma_\beta = (\Sigma_{\beta|\alpha}^{-1} - \gamma_1\gamma_1'/\sigma_{11})^{-1}. Since y is independent of \beta given z, we have

p(y,z,β)=p(y|z,β)p(z|β)p(β)=p(y|z)p(z|β)p(β). (25)

To apply the variational approximation to the probit regression model, the unobservable variables are considered in two separate groups, the coefficient parameter \beta and the auxiliary variable Z. To approximate the posterior distribution of \beta, consider the product form

q(Z,β)=qZ(Z)qβ(β).

We proceed by first describing the distribution for each factor of the approximation, qZ(Z) and qβ(β). Then variational approximation is accomplished by iteratively updating the parameters of each factor distribution.

Starting with q_Z(Z): when y_i = 1, we have

\log p(y, z, \beta) = \log\Big[\prod_i \frac{1}{\sqrt{2\pi}}\exp\Big\{-\frac{(z_i - x_i'\beta)^2}{2}\Big\} \times \pi^I(\beta)\Big], \quad \text{where } z_i > 0.

Now, according to (19) and Algorithm 1, the optimal qZ is proportional to

\exp\big(E_\beta[\log p(y, z, \beta)]\big) = \exp\big(E_\beta[-\tfrac{1}{2}(z - X\beta)'(z - X\beta)] + \text{const}\big) = \exp\big(-\tfrac{1}{2}z'z + E_\beta[\beta]'X'z + \text{const}\big).

So, we have the optimal qZ,

q_Z^*(Z) \propto \exp\{-\tfrac{1}{2}z'z + E_\beta[\beta]'X'z + \text{const}\} \propto \exp\{-\tfrac{1}{2}(z - XE_\beta[\beta])'(z - XE_\beta[\beta])\}.

A similar procedure can be used for the cases when y_i = 0. Therefore, the optimal approximation for q_Z is a truncated normal distribution, where

q_Z^*(Z_i) = \begin{cases} N_{[0, +\infty)}\big((XE_\beta[\beta])_i, 1\big) & \text{if } y_i = 1, \\ N_{(-\infty, 0]}\big((XE_\beta[\beta])_i, 1\big) & \text{if } y_i = 0. \end{cases} (26)

Denote XE_\beta[\beta] by \mu_z, the location of the distribution q_Z^*(Z). The expectation E_\beta is taken with respect to the density q_\beta(\beta), which we shall now derive.

For qβ(β), given the joint form in (25), we have

\log p(y, z, \beta) = -\tfrac{1}{2}(z - X\beta)'(z - X\beta) - \tfrac{1}{2}(\beta - \mu_\beta)'\Sigma_\beta^{-1}(\beta - \mu_\beta) + \text{const}.

Taking expectation with respect to qZ(z), we have

E_Z[\log p(y, z, \beta)] = -\tfrac{1}{2}\beta'(X'X + \Sigma_\beta^{-1})\beta + (E_Z[Z]'X + \mu_\beta'\Sigma_\beta^{-1})\beta + \text{const}.

Again, based on (19) and Algorithm 1, the log of the optimal q_\beta(\beta) is, up to a constant, E_Z[\log p(y, z, \beta)],

\log q_\beta^*(\beta) = -\tfrac{1}{2}\beta'(X'X + \Sigma_\beta^{-1})\beta + (E_Z[Z]'X + \mu_\beta'\Sigma_\beta^{-1})\beta + \text{const}.

First, notice that any constant terms, including the constant factor in the intrinsic prior, cancel out due to the normalization in (19). Then, recognizing the quadratic form in the above expression, we have

qβ*(β)=N(μqβ,Σqβ), (27)

where

\Sigma_{q\beta} = (X'X + \Sigma_\beta^{-1})^{-1}, \qquad \mu_{q\beta} = (X'X + \Sigma_\beta^{-1})^{-1}(X'E_Z[Z] + \Sigma_\beta^{-1}\mu_\beta).

Notice that μqβ, i.e., Eβ[β], depends on EZ[Z]. In addition, from our previous derivation, we found that the update for EZ[Z] depends on Eβ[β]. Given that the density form of qZ is truncated normal, we have

E_Z[Z_i] = \begin{cases} \mu_{z_i} + \dfrac{\phi(\mu_{z_i})}{1 - \Phi(-\mu_{z_i})} & \text{if } y_i = 1, \\[2mm] \mu_{z_i} - \dfrac{\phi(\mu_{z_i})}{\Phi(-\mu_{z_i})} & \text{if } y_i = 0, \end{cases}

where \phi is the standard normal density and \Phi is the standard normal cumulative distribution function. Denote E_Z[Z] by \mu_{qZ}. See properties of the truncated normal distribution in Appendix A and Appendix B. The updating procedures for the parameters \mu_{q\beta} and \mu_{qZ} of each factor distribution are summarized in Algorithm 2.

Algorithm 2 Iterative procedure for updating the parameters of the optimal factor densities q_\beta and q_Z in the Bayesian probit regression model. The updates are based on the solutions given by (26) and (27).
  • 1: Initialize \mu_{qZ}.
  • 2: Cycle through
    \mu_{q\beta} \leftarrow (X'X + \Sigma_\beta^{-1})^{-1}(X'\mu_{qZ} + \Sigma_\beta^{-1}\mu_\beta), \qquad \mu_{qZ} \leftarrow X\mu_{q\beta} + \frac{\phi(X\mu_{q\beta})}{\Phi(X\mu_{q\beta})^{y}\,[\Phi(X\mu_{q\beta}) - 1]^{1-y}},
    until the increase in L(q) is negligible.
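The sketch below is a minimal NumPy/SciPy implementation of Algorithm 2 (our own illustration, not the authors' code). It takes the prior precision \Sigma_\beta^{-1} as an input, for example the one sketched in Section 5.1, and for simplicity it monitors convergence through the change in \mu_{q\beta} rather than through L(q).

```python
import numpy as np
from scipy.stats import norm

def vb_probit(y, X, prec_beta, mu_beta=None, max_iter=500, tol=1e-8):
    """Mean-field updates of Algorithm 2 for the probit model with a
    normal-kernel prior of precision `prec_beta` (e.g., the intrinsic
    prior of Section 5.1). y is a 0/1 vector, X is n x (j+1)."""
    n, d = X.shape
    if mu_beta is None:
        mu_beta = np.zeros(d)                          # intrinsic prior is centred at 0
    Sigma_qbeta = np.linalg.inv(X.T @ X + prec_beta)   # Eq. (27), fixed across iterations
    mu_qz = np.where(y == 1, 0.5, -0.5)                # crude initialization of mu_qZ
    mu_qbeta = np.zeros(d)
    for _ in range(max_iter):
        mu_new = Sigma_qbeta @ (X.T @ mu_qz + prec_beta @ mu_beta)
        mu_z = X @ mu_new
        # Truncated-normal means, Eq. (26): using Phi(mu) = 1 - Phi(-mu),
        # shift up by phi/Phi when y = 1 and down by phi/(1 - Phi) when y = 0.
        pdf = norm.pdf(mu_z)
        cdf = np.clip(norm.cdf(mu_z), 1e-12, 1 - 1e-12)
        mu_qz = np.where(y == 1, mu_z + pdf / cdf, mu_z - pdf / (1.0 - cdf))
        if np.max(np.abs(mu_new - mu_qbeta)) < tol:    # simple convergence check
            mu_qbeta = mu_new
            break
        mu_qbeta = mu_new
    return mu_qbeta, Sigma_qbeta
```

Combined with the earlier sketches, a call such as vb_probit(y, X, intrinsic_prior_precision(X)) returns the variational posterior mean and covariance of \beta.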

5.2.2. Evaluation of the Lower Bound L(q)

During the optimization of the variational approximation densities, the lower bound on the log marginal likelihood needs to be evaluated and monitored to determine when the iterative updating process has converged. Based on the derivations of the previous section, we now have the exact form of the variational density,

q(β,Z)=qβ(β)qZ(Z).

According to (14), we can write down the lower bound L(q) with respect to q(β,Z).

L(q) = \int\!\!\int q(\beta, Z) \log\frac{p(Y, \beta, Z)}{q(\beta, Z)}\, d\beta\, dZ = \int\!\!\int q_\beta(\beta) q_Z(Z) \log\frac{p(Y, \beta, Z)}{q_\beta(\beta) q_Z(Z)}\, d\beta\, dZ = \int\!\!\int q_\beta(\beta) q_Z(Z) \log\{p(Y, \beta, Z)\}\, d\beta\, dZ - \int\!\!\int q_\beta(\beta) q_Z(Z) \log\{q_\beta(\beta) q_Z(Z)\}\, d\beta\, dZ = E_{\beta,Z}[\log\{p(Y, Z \mid \beta)\}] + E_{\beta,Z}[\log \pi^I(\beta)] - E_{\beta,Z}[\log\{q_\beta(\beta)\}] - E_{\beta,Z}[\log\{q_Z(Z)\}]. (28)

As we can see in (28), L(q) has been divided into four different parts with expectation taken over the variational approximation density q(β,Z)=qβ(β)qZ(Z). We now find the expression of these expectations one by one.

Part 1: E_{\beta,Z}[\log\{p(Y, Z \mid \beta)\}]
= -\frac{n}{2}\log(2\pi) + \int\!\!\int q_\beta(\beta) q_Z(Z)\{-\tfrac{1}{2}(z - X\beta)'(z - X\beta)\}\, d\beta\, dz = -\frac{n}{2}\log(2\pi) + \int q_Z(Z)\Big[\int q_\beta(\beta)\{-\tfrac{1}{2}(\beta'X'X\beta - 2z'X\beta + z'z)\}\, d\beta\Big]\, dz (29)

Dealing with the inner integral first, we have

\int q_\beta(\beta)\{-\tfrac{1}{2}(\beta'X'X\beta - 2z'X\beta + z'z)\}\, d\beta = -\tfrac{1}{2}\int q_\beta(\beta)[\beta'X'X\beta]\, d\beta + z'XE_\beta[\beta] - \tfrac{1}{2}z'z = -\tfrac{1}{2}\int q_\beta(\beta)[\beta'X'X\beta]\, d\beta + z'X\mu_{q\beta} - \tfrac{1}{2}z'z, (30)

where

-\tfrac{1}{2}\int q_\beta(\beta)[\beta'X'X\beta]\, d\beta = -\tfrac{1}{2}\int q_\beta(\beta)[(\beta - \mu_{q\beta} + \mu_{q\beta})'X'X(\beta - \mu_{q\beta} + \mu_{q\beta})]\, d\beta = -\tfrac{1}{2}\mathrm{trace}\big(X'X\, E_\beta[(\beta - \mu_{q\beta})(\beta - \mu_{q\beta})']\big) - \tfrac{1}{2}\mu_{q\beta}'X'X\mu_{q\beta} = -\tfrac{1}{2}\mathrm{trace}\big(X'X[\mu_{q\beta}\mu_{q\beta}' + \Sigma_{q\beta}]\big). (31)

Substituting (31) into (30), we get

\int q_\beta(\beta)\{-\tfrac{1}{2}(\beta'X'X\beta - 2z'X\beta + z'z)\}\, d\beta = -\tfrac{1}{2}\mathrm{trace}\big(X'X[\mu_{q\beta}\mu_{q\beta}' + \Sigma_{q\beta}]\big) + z'X\mu_{q\beta} - \tfrac{1}{2}z'z. (32)

Substituting (32) back into (29) gives

E_{\beta,Z}[\log\{p(Y, Z \mid \beta)\}] = -\frac{n}{2}\log(2\pi) + \int q_Z(z)\Big\{-\tfrac{1}{2}\mathrm{trace}\big(X'X[\mu_{q\beta}\mu_{q\beta}' + \Sigma_{q\beta}]\big) + z'X\mu_{q\beta} - \tfrac{1}{2}z'z\Big\}\, dz = -\frac{n}{2}\log(2\pi) - \tfrac{1}{2}\mathrm{trace}\big(X'X[\mu_{q\beta}\mu_{q\beta}' + \Sigma_{q\beta}]\big) - \tfrac{1}{2}E_Z[z'z] + \mu_{qz}'\mu_z = -\frac{n}{2}\log(2\pi) - \tfrac{1}{2}\mathrm{trace}\big(X'X[\mu_{q\beta}\mu_{q\beta}' + \Sigma_{q\beta}]\big) + \mu_{qz}'\mu_z - \tfrac{1}{2}\sum_{i=1}^{n}\Big[1 + \mu_{z_i}^2 - \frac{\mu_{z_i}\phi(\mu_{z_i})}{\Phi(-\mu_{z_i})}\Big]^{I(y_i=0)}\Big[1 + \mu_{z_i}^2 + \frac{\mu_{z_i}\phi(\mu_{z_i})}{1 - \Phi(-\mu_{z_i})}\Big]^{I(y_i=1)} = -\frac{n}{2}\log(2\pi) - \tfrac{1}{2}\mathrm{trace}\big(X'X[\mu_{q\beta}\mu_{q\beta}' + \Sigma_{q\beta}]\big) + \mu_{qz}'\mu_z - \tfrac{1}{2}\sum_{i=1}^{n}\big[1 + \mu_{qz_i}\mu_{z_i}\big] = -\frac{n}{2}\log(2\pi) - \tfrac{1}{2}\mathrm{trace}\big(X'X[\mu_{q\beta}\mu_{q\beta}' + \Sigma_{q\beta}]\big) + \tfrac{1}{2}\mu_{qz}'\mu_z - \frac{n}{2}. (33)

We applied properties of the truncated normal distribution (Appendix B) to find the expression for the second moment E_Z[z'z].

Part 2: E_{\beta,Z}[\log q_Z(z)]
= \int\!\!\int q_\beta(\beta) q_Z(z) \log q_Z(z)\, d\beta\, dZ = \int q_Z(z) \log q_Z(z)\, dZ = -\frac{n}{2}(\log(2\pi) + 1) - \sum_{i=1}^{n}\Big\{\Big[\log(\Phi(-\mu_{z_i})) + \frac{\mu_{z_i}\phi(\mu_{z_i})}{2\Phi(-\mu_{z_i})}\Big]^{I(y_i=0)}\Big[\log(1 - \Phi(-\mu_{z_i})) - \frac{\mu_{z_i}\phi(\mu_{z_i})}{2(1 - \Phi(-\mu_{z_i}))}\Big]^{I(y_i=1)}\Big\} = -\frac{n}{2}(\log(2\pi) + 1) - \tfrac{1}{2}\mu_z'\mu_z + \tfrac{1}{2}\mu_{qz}'\mu_z - \sum_{i=1}^{n}\Big\{\big[\log(\Phi(-\mu_{z_i}))\big]^{I(y_i=0)}\big[\log(1 - \Phi(-\mu_{z_i}))\big]^{I(y_i=1)}\Big\} (34)

Again, see Appendix B for the well-known properties of the truncated normal distribution. Now, subtracting (34) from (33), we get

E_{\beta,Z}[\log\{p(Y, Z \mid \beta)\}] - E_{\beta,Z}[\log q_Z(z)] = -\tfrac{1}{2}\mathrm{trace}\big(X'X[\mu_{q\beta}\mu_{q\beta}' + \Sigma_{q\beta}]\big) + \tfrac{1}{2}\mu_z'\mu_z + \sum_{i=1}^{n}\Big\{\big[\log(\Phi(-\mu_{z_i}))\big]^{I(y_i=0)}\big[\log(1 - \Phi(-\mu_{z_i}))\big]^{I(y_i=1)}\Big\}. (35)

Based on the exact expression of the intrinsic prior πI(β), denoting all constant terms by C, we have

Part 3: E_{\beta,Z}[\log p_\beta(\beta)]
= \int\!\!\int q_Z(z) q_\beta(\beta) \log \pi^I(\beta)\, d\beta\, dz = \log C - \frac{j+1}{2}\log(2\pi) - \tfrac{1}{2}\log|\Sigma_\beta| - \tfrac{1}{2}\int q_\beta(\beta)[\beta'\Sigma_\beta^{-1}\beta]\, d\beta (36)

To find the expression for the integral, we have

\int q_\beta(\beta)[\beta'\Sigma_\beta^{-1}\beta]\, d\beta = \int q_\beta(\beta)(\beta - \mu_{q\beta} + \mu_{q\beta})'\Sigma_\beta^{-1}(\beta - \mu_{q\beta} + \mu_{q\beta})\, d\beta = E\big[\mathrm{trace}\big(\Sigma_\beta^{-1}(\beta - \mu_{q\beta})(\beta - \mu_{q\beta})'\big)\big] + \mu_{q\beta}'\Sigma_\beta^{-1}\mu_{q\beta} = \mathrm{trace}(\Sigma_\beta^{-1}\Sigma_{q\beta}) + \mu_{q\beta}'\Sigma_\beta^{-1}\mu_{q\beta}. (37)

Substituting (37) back into (36), we obtain

E_{\beta,Z}[\log p_\beta(\beta)] = \log C - \frac{j+1}{2}\log(2\pi) - \tfrac{1}{2}\log|\Sigma_\beta| - \tfrac{1}{2}\big[\mathrm{trace}(\Sigma_\beta^{-1}\Sigma_{q\beta}) + \mu_{q\beta}'\Sigma_\beta^{-1}\mu_{q\beta}\big]. (38)
Part 4: E_{\beta,Z}[\log q_\beta(\beta)]
= \int\!\!\int q_Z(z) q_\beta(\beta) \log q_\beta(\beta)\, d\beta\, dz = -\frac{j+1}{2}\log(2\pi) - \tfrac{1}{2}\log|\Sigma_{q\beta}| - \tfrac{1}{2}\int q_\beta(\beta)(\beta - \mu_{q\beta})'\Sigma_{q\beta}^{-1}(\beta - \mu_{q\beta})\, d\beta = -\frac{j+1}{2}\log(2\pi) - \tfrac{1}{2}\log|\Sigma_{q\beta}| - \tfrac{1}{2}\mathrm{trace}(\Sigma_{q\beta}^{-1}\Sigma_{q\beta}) = -\frac{j+1}{2}(\log(2\pi) + 1) - \tfrac{1}{2}\log|\Sigma_{q\beta}| (39)

Combining all four parts together, we get

L(q) = E_{\beta,Z}[\log\{p(Y, Z \mid \beta)\}] + E_{\beta,Z}[\log \pi^I(\beta)] - E_{\beta,Z}[\log\{q_\beta(\beta)\}] - E_{\beta,Z}[\log\{q_Z(Z)\}] = \underbrace{-\tfrac{1}{2}\mathrm{trace}\big(X'X[\mu_{q\beta}\mu_{q\beta}' + \Sigma_{q\beta}]\big) + \tfrac{1}{2}\mu_z'\mu_z + \sum_{i=1}^{n}\Big\{\big[\log(\Phi(-\mu_{z_i}))\big]^{I(y_i=0)}\big[\log(1 - \Phi(-\mu_{z_i}))\big]^{I(y_i=1)}\Big\}}_{E_{\beta,Z}[\log\{p(Y, Z \mid \beta)\}] - E_{\beta,Z}[\log\{q_Z(Z)\}]} + \underbrace{\log C - \tfrac{1}{2}\log|\Sigma_\beta| - \tfrac{1}{2}\big[\mathrm{trace}(\Sigma_\beta^{-1}\Sigma_{q\beta}) + \mu_{q\beta}'\Sigma_\beta^{-1}\mu_{q\beta}\big] + \frac{j+1}{2} + \tfrac{1}{2}\log|\Sigma_{q\beta}|}_{E_{\beta,Z}[\log \pi^I(\beta)] - E_{\beta,Z}[\log q_\beta(\beta)]}. (40)
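Since the closed-form expression (40) contains many terms, a convenient sanity check is to estimate the same bound by Monte Carlo from the fitted factors. The sketch below is our own (assuming a recent SciPy, with hypothetical function and variable names); it evaluates E_q[\log p(y, z, \beta) - \log q(\beta, Z)] by sampling from q, dropping the intrinsic-prior normalizing constants, which only shifts the bound by a constant and does not affect convergence monitoring.

```python
import numpy as np
from scipy import stats

def elbo_mc(y, X, mu_qbeta, Sigma_qbeta, prec_beta, n_draws=2000, seed=0):
    """Monte Carlo estimate of L(q), up to the additive constants of the
    intrinsic prior. Sampling z from q_Z keeps the signs consistent with y,
    so log p(y|z) = 0 for every draw."""
    rng = np.random.default_rng(seed)
    mu_z = X @ mu_qbeta
    # Standardized truncation bounds of q_Z: (0, inf) if y = 1, (-inf, 0) if y = 0.
    a = np.where(y == 1, -mu_z, -np.inf)
    b = np.where(y == 1, np.inf, -mu_z)
    total = 0.0
    for _ in range(n_draws):
        beta = rng.multivariate_normal(mu_qbeta, Sigma_qbeta)
        z = stats.truncnorm.rvs(a, b, loc=mu_z, scale=1.0, random_state=rng)
        log_pz = stats.norm.logpdf(z, loc=X @ beta, scale=1.0).sum()   # log p(z|beta)
        log_prior = -0.5 * beta @ prec_beta @ beta                     # prior kernel only
        log_qbeta = stats.multivariate_normal.logpdf(beta, mean=mu_qbeta, cov=Sigma_qbeta)
        log_qz = stats.truncnorm.logpdf(z, a, b, loc=mu_z, scale=1.0).sum()
        total += log_pz + log_prior - log_qbeta - log_qz
    return total / n_draws
```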

5.3. Model Comparison Based on Variational Approximation

Suppose we want to compare two models, M_1 and M_0, where M_0 is the simpler model. An intuitive approach to comparing the two models via variational approximation is simply to compare the lower bounds L(q_1) and L(q_0). However, by comparing the lower bounds, we are implicitly assuming that the KL divergences in the two approximations are the same, so that the bounds alone can serve as a guide. Unfortunately, it is not easy to measure how tight any particular bound is; if this could be done, we could estimate the log marginal likelihood more accurately from the beginning. As clarified in [27], when comparing two exact log marginal likelihoods, we have

\log p_1(X) - \log p_0(X) = [L(q_1) + KL(q_1 \| p_1)] - [L(q_0) + KL(q_0 \| p_0)] (41)
= L(q_1) - L(q_0) + [KL(q_1 \| p_1) - KL(q_0 \| p_0)] (42)
\ne L(q_1) - L(q_0). (43)

The difference in log marginal likelihoods, \log p_1(X) - \log p_0(X), is the quantity we wish to estimate. However, if we base the comparison on the difference of the lower bounds, we are basing our model comparison on (43) rather than (41). Therefore, there exists a systematic bias towards the simpler model when comparing models if KL(q_1 \| p_1) - KL(q_0 \| p_0) is not zero.

Realizing that we have a variational approximation for the posterior distribution of \beta, we propose the following method to estimate the marginal likelihood p(x) based on our variational approximation q_\beta(\beta) in (27). First, writing the marginal likelihood as

p(x) = \int \frac{p(x \mid \beta)\,\pi^I(\beta)}{q_\beta(\beta)}\, q_\beta(\beta)\, d\beta,

we can interpret it as the expectation

p(x) = E\left[\frac{p(x \mid \beta)\,\pi^I(\beta)}{q_\beta(\beta)}\right]

with respect to qβ(β). Next, draw samples β(1),...,β(n) from qβ(β) and obtain the estimated marginal likelihood

\hat{p}(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{p(x \mid \beta^{(i)})\,\pi^I(\beta^{(i)})}{q_\beta(\beta^{(i)})}.

Please note that the proposed method is equivalent to importance sampling with the importance function q_\beta(\beta), for which we know the exact form and from which generating the random draws \beta^{(i)} is easy and inexpensive.
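A minimal sketch of this importance-sampling estimator is given below (our own illustration, not the authors' code). Here log_prior is a user-supplied callable returning \log \pi^I(\beta), including whatever constant factor from Equation (24) one chooses to carry, and the computation is done in log space for numerical stability.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.special import logsumexp

def log_marginal_is(y, X, mu_qbeta, Sigma_qbeta, log_prior, n_draws=2500, seed=1):
    """Importance-sampling estimate of log p(x) with the variational
    posterior q_beta as importance function (Section 5.3)."""
    rng = np.random.default_rng(seed)
    betas = rng.multivariate_normal(mu_qbeta, Sigma_qbeta, size=n_draws)
    q = multivariate_normal(mean=mu_qbeta, cov=Sigma_qbeta)
    log_w = np.empty(n_draws)
    for i, beta in enumerate(betas):
        eta = X @ beta
        # Probit log-likelihood: sum_i [ y_i log Phi(x_i'beta) + (1 - y_i) log Phi(-x_i'beta) ].
        log_lik = np.sum(np.where(y == 1, norm.logcdf(eta), norm.logcdf(-eta)))
        log_w[i] = log_lik + log_prior(beta) - q.logpdf(beta)
    return logsumexp(log_w) - np.log(n_draws)   # log of the average importance weight
```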

6. Modeling Probability of Default Using Lending Club Data

6.1. Introduction

LendingClub (https://www.lendingclub.com/) is the world’s largest peer-to-peer lending platform. LendingClub enables borrowers to create unsecured personal loans between $1000 and $40,000. The standard loan period is three or five years. Investors can search and browse the loan listings on the LendingClub website and select the loans they want to invest in based on the information supplied about the borrower, the amount of the loan, the loan grade, and the loan purpose. Investors make money from interest; LendingClub makes money by charging borrowers an origination fee and investors a service fee. To attract lenders, LendingClub publishes most of the information available in borrowers’ credit reports, as well as information reported by borrowers, for almost every loan issued through its website.

6.2. Modeling Probability of Default—Target Variable and Predictive Features

Publicly available LendingClub data, from June 2007 to 2018 Q4, contain a total of 2,260,668 issued loans. Each loan has a status: Paid-off, Charged-off, or Ongoing. We only adopted loans with a final status, i.e., either paid-off or charged-off, and that loan status is the target variable. We then selected the following loan features as our predictive covariates.

  • Loan term in months (either 36 or 60)

  • FICO

  • Issued loan amount

  • DTI (Debt to income ratio, i.e., customer’s total debt divided by income)

  • Number of credit lines opened in past 24 months

  • Employment length in years

  • Annual income

  • Home ownership type (own, mortgage, or rent)

We took a sample from the original data set with customer yearly income between $15,000 and $60,000, ending up with a data set of 520,947 rows.

6.3. Addressing Uncertainty of Estimated Probit Model Using Variational Inference with Intrinsic Prior

Using the process developed in Section 5, we can update the intrinsic prior for the parameters of the probit model (see Figure 1) using variational inference, and obtain the posterior distribution of the estimated parameters. Based on the derived parameter distributions, questions of interest may be explored while accounting for model uncertainty.

Figure 1. Intrinsic Prior.

Investors will be interested in understanding how each loan feature affects the probability of default, given a certain loan term, either 36 or 60 months. To answer this question, we sampled 6000 cases from the original data set and drew from the derived posterior distribution 100 times. We end up with 6000 × 100 calculated probabilities of default, where each one of the 6000 samples yields 100 different probit estimates based on the 100 posterior draws. We summarize some of our findings in Figure 2, where red represents 36-month loans and green represents 60-month loans.

Figure 2. Effect of term months and other covariates on probability of default.

  • In general, 60 months loans have higher risk of default.

  • Given loan term months, there is a clear trend showing that high FICO means lower risk.

  • Given loan term months, there is a trend showing that higher DTI indicates higher risk.

  • Given loan term months, there is a trend showing that more credit lines opened in the past 24 months indicates higher risk.

  • There is no clear pattern regarding income. This is probably because we only included customers with income between $15,000 and $60,000 in our training data, which may not represent the income distribution of the whole population.

Model uncertainty can also be measured through credible intervals. Again, with the derived posterior distribution, a credible interval is just the range containing a particular percentage of the estimated effect/parameter values. For instance, the 95% credible interval for the estimated parameter of FICO is simply the central portion of the posterior distribution that contains 95% of the estimated values. Contrary to frequentist confidence intervals, Bayesian credible intervals are much more straightforward to interpret. Using the Bayesian framework created in this article, from Figure 3 we can simply state that, given the observed data, the estimated effect of DTI on default has an 89% probability of falling within [8.300, 8.875]. Instead of the conventional 95%, we used 89% following suggestions in [28,29]; it is just as arbitrary as any of the conventions.
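As a small illustration (our own, with hypothetical names), an equal-tailed 89% credible interval for any coefficient can be read off directly from draws of the variational posterior:

```python
import numpy as np

def credible_interval(beta_draws, k, level=0.89):
    """Equal-tailed credible interval for coefficient k, where `beta_draws`
    is an (n_draws x d) array sampled from q_beta."""
    lo = (1.0 - level) / 2.0
    return np.quantile(beta_draws[:, k], [lo, 1.0 - lo])
```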

Figure 3. Credible intervals for estimated coefficients.

One of the main advantages of using variational inference over MCMC is that variational inference is much faster. Comparisons were made between the two approximation frameworks on a 64-bit Windows 10 laptop with 32.0 GB RAM. Using the data set introduced in Section 6.2, we found that

  • with a conjugate prior and following the Gibbs sampling scheme proposed by [17], it took 89.86 s to finish 100 simulations for the Gibbs sampler;

  • following our method proposed in Section 5.2, it took 58.38 s to obtain the approximated posterior distribution and to sample 10,000 times from that posterior.

6.4. Model Comparison

Following the procedure proposed in Section 5.3, we compare the series of nested models listed below. From the data set introduced in Section 6.2, 2000 records were sampled to estimate the likelihood p(x|\beta^{(i)}), where \beta^{(i)} is one of the 2500 draws sampled directly from the approximated posterior distribution q_\beta(\beta), which serves as the importance function used to estimate the marginal likelihood p(x).

  • M2: FICO + Term 36 Indicator

  • M3: FICO + Term 36 Indicator + Loan Amount

  • M4: FICO + Term 36 Indicator + Loan Amount + Annual Income

  • M5: FICO + Term 36 Indicator + Loan Amount + Annual Income + Mortgage Indicator

The estimated log marginal likelihood for each model is plotted in Figure 4. We can see that the model evidence increases as the predictive features Loan Amount and Annual Income are added sequentially. However, if we further add home ownership information, i.e., the Mortgage Indicator, as a predictive feature, the model evidence decreases. We have the Bayes factor

BF_{45} = \frac{p(x \mid M_4)}{p(x \mid M_5)} = e^{-1014.78 - (-1016.42)} = 5.16,

which suggests substantial evidence for model M_4, indicating that home ownership information may be irrelevant in predicting the probability of default given that all the other predictive features are relevant.
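The Bayes factor above is simply the exponentiated difference of the two estimated log marginal likelihoods, e.g.:

```python
import numpy as np

# Bayes factor of M4 versus M5 from the estimated log marginal likelihoods in the text.
bf_45 = np.exp(-1014.78 - (-1016.42))   # approximately 5.16, favouring M4
```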

Figure 4. Log marginal likelihood comparison.

7. Further Work

The authors thank the reviewers for pointing out that mean-field variational Bayes underestimates the posterior variance. This could be an interesting topic for our future research. We plan to study the linear response variational Bayes (LRVB) method proposed in [30] to see whether it can be applied to the framework we proposed in this article. To see whether we can get the approximated posterior variance close enough to the true variance using our proposed method, comparisons should be made between the normal conjugate prior with the MCMC procedure, the normal conjugate prior with LRVB, and the intrinsic prior with LRVB.

Appendix A. Density Function

Suppose X \sim N(\mu, \sigma^2) has a normal distribution and lies within the interval (a, b), a < b. Then X conditional on a < X < b has a truncated normal distribution. Its probability density function f, for a \le x < b, is given by

f(x \mid \mu, \sigma, a, b) = \frac{\frac{1}{\sigma}\,\phi\big(\frac{x-\mu}{\sigma}\big)}{\Phi\big(\frac{b-\mu}{\sigma}\big) - \Phi\big(\frac{a-\mu}{\sigma}\big)}

and by f=0 otherwise. Here

\phi(\xi) = \frac{1}{\sqrt{2\pi}}\exp\big(-\tfrac{1}{2}\xi^2\big)

is the probability density function of the standard normal distribution and \Phi(\cdot) is its cumulative distribution function. If b = \infty, then \Phi\big(\frac{b-\mu}{\sigma}\big) = 1, and similarly, if a = -\infty, then \Phi\big(\frac{a-\mu}{\sigma}\big) = 0. The cumulative distribution function of the truncated normal distribution is

F(x \mid \mu, \sigma, a, b) = \frac{\Phi(\xi) - \Phi(\alpha)}{Z},

where \xi = \frac{x-\mu}{\sigma}, \alpha = \frac{a-\mu}{\sigma}, \beta = \frac{b-\mu}{\sigma}, and Z = \Phi(\beta) - \Phi(\alpha).

Appendix B. Moments and Entropy

Let \alpha = \frac{a-\mu}{\sigma} and \beta = \frac{b-\mu}{\sigma}. For two-sided truncation:

E(X \mid a < X < b) = \mu + \sigma\,\frac{\phi(\alpha) - \phi(\beta)}{\Phi(\beta) - \Phi(\alpha)}, \qquad \mathrm{Var}(X \mid a < X < b) = \sigma^2\left[1 + \frac{\alpha\phi(\alpha) - \beta\phi(\beta)}{\Phi(\beta) - \Phi(\alpha)} - \left(\frac{\phi(\alpha) - \phi(\beta)}{\Phi(\beta) - \Phi(\alpha)}\right)^2\right].

For one sided truncation (upper tail):

E(X \mid X > a) = \mu + \sigma\lambda(\alpha), \qquad \mathrm{Var}(X \mid X > a) = \sigma^2[1 - \delta(\alpha)],

where \alpha = \frac{a-\mu}{\sigma}, \lambda(\alpha) = \frac{\phi(\alpha)}{1 - \Phi(\alpha)}, and \delta(\alpha) = \lambda(\alpha)[\lambda(\alpha) - \alpha].

For one sided truncation (lower tail):

E(X \mid X < b) = \mu - \sigma\,\frac{\phi(\beta)}{\Phi(\beta)}, \qquad \mathrm{Var}(X \mid X < b) = \sigma^2\left[1 - \frac{\beta\phi(\beta)}{\Phi(\beta)} - \left(\frac{\phi(\beta)}{\Phi(\beta)}\right)^2\right].

More generally, the moment generating function of the truncated normal distribution is

e^{\mu t + \sigma^2 t^2/2}\cdot\frac{\Phi(\beta - \sigma t) - \Phi(\alpha - \sigma t)}{\Phi(\beta) - \Phi(\alpha)}.

For a density f(x) defined over a continuous variable, the entropy is given by

H[x] = -\int f(x)\log f(x)\, dx.

The entropy of a truncated normal density is

\log\big(\sqrt{2\pi e}\,\sigma Z\big) + \frac{\alpha\phi(\alpha) - \beta\phi(\beta)}{2Z}.
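These formulas are easy to verify numerically. The short check below is our own, with illustrative values; it compares the upper-tail moments with SciPy's truncnorm, which is parameterized by the standardized bounds (a - \mu)/\sigma and (b - \mu)/\sigma.

```python
import numpy as np
from scipy.stats import norm, truncnorm

# Check E(X | X > a) and Var(X | X > a) against scipy for illustrative values.
mu, sigma, a = 1.3, 2.0, 0.0
alpha = (a - mu) / sigma
lam = norm.pdf(alpha) / (1.0 - norm.cdf(alpha))
delta = lam * (lam - alpha)

mean_formula = mu + sigma * lam
var_formula = sigma**2 * (1.0 - delta)

dist = truncnorm(alpha, np.inf, loc=mu, scale=sigma)
print(mean_formula, dist.mean())   # the two values should agree
print(var_formula, dist.var())     # the two values should agree
```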

Author Contributions

Methodology, A.L., L.P. and K.W.; software, A.L.; writing–original draft preparation, A.L., L.P. and K.W.; writing–review and editing, A.L. and L.P.; visualization, A.L. All authors have read and agreed to the published version of the manuscript.

Funding

The work of L.R.Pericchi was partially funded by NIH grants U54CA096300, P20GM103475 and R25MD010399.

Conflicts of Interest

The authors declare no conflict of interest.

References

  • 1. Salmeron D., Cano J.A., Robert C.P. Objective Bayesian hypothesis testing in binomial regression models with integral prior distributions. Stat. Sin. 2015;25:1009–1023. doi: 10.5705/ss.2013.338.
  • 2. Leon-Novelo L., Moreno E., Casella G. Objective Bayes model selection in probit models. Stat. Med. 2012;31:353–365. doi: 10.1002/sim.4406.
  • 3. Jaakkola T.S., Jordan M.I. Bayesian parameter estimation via variational methods. Stat. Comput. 2000;10:25–37. doi: 10.1023/A:1008932416310.
  • 4. Girolami M., Rogers S. Variational Bayesian multinomial probit regression with Gaussian process priors. Neural Comput. 2006;18:1790–1817. doi: 10.1162/neco.2006.18.8.1790.
  • 5. Consonni G., Marin J.M. Mean-field variational approximate Bayesian inference for latent variable models. Comput. Stat. Data Anal. 2007;52:790–798. doi: 10.1016/j.csda.2006.10.028.
  • 6. Ormerod J.T., Wand M.P. Explaining variational approximations. Am. Stat. 2010;64:140–153. doi: 10.1198/tast.2010.09058.
  • 7. Grimmer J. An introduction to Bayesian inference via variational approximations. Political Anal. 2010;19:32–47. doi: 10.1093/pan/mpq027.
  • 8. Blei D.M., Kucukelbir A., McAuliffe J.D. Variational inference: A review for statisticians. J. Am. Stat. Assoc. 2017;112:859–877. doi: 10.1080/01621459.2017.1285773.
  • 9. Pérez M.E., Pericchi L.R., Ramírez I.C. The Scaled Beta2 distribution as a robust prior for scales. Bayesian Anal. 2017;12:615–637. doi: 10.1214/16-BA1015.
  • 10. Mulder J., Pericchi L.R. The matrix-F prior for estimating and testing covariance matrices. Bayesian Anal. 2018;13:1193–1214. doi: 10.1214/17-BA1092.
  • 11. Berger J.O., Pericchi L.R. Objective Bayesian Methods for Model Selection: Introduction and Comparison. In: Model Selection. Institute of Mathematical Statistics; Beachwood, OH, USA: 2001; pp. 135–207.
  • 12. Pericchi L.R. Model selection and hypothesis testing based on objective probabilities and Bayes factors. Handb. Stat. 2005;25:115–149.
  • 13. Scott J.G., Berger J.O. Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. Ann. Stat. 2010;38:2587–2619. doi: 10.1214/10-AOS792.
  • 14. Jeffreys H. The Theory of Probability. OUP; Oxford, UK: 1961.
  • 15. Berger J.O., Pericchi L.R. The intrinsic Bayes factor for model selection and prediction. J. Am. Stat. Assoc. 1996;91:109–122. doi: 10.1080/01621459.1996.10476668.
  • 16. Leamer E.E. Specification Searches: Ad Hoc Inference with Nonexperimental Data. Volume 53. Wiley; New York, NY, USA: 1978.
  • 17. Albert J.H., Chib S. Bayesian analysis of binary and polychotomous response data. J. Am. Stat. Assoc. 1993;88:669–679. doi: 10.1080/01621459.1993.10476321.
  • 18. Tanner M.A., Wong W.H. The calculation of posterior distributions by data augmentation. J. Am. Stat. Assoc. 1987;82:528–540. doi: 10.1080/01621459.1987.10478458.
  • 19. Berger J.O., Pericchi L.R. The intrinsic Bayes factor for linear models. Bayesian Stat. 1996;5:25–44.
  • 20. Casella G., Moreno E. Objective Bayesian variable selection. J. Am. Stat. Assoc. 2006;101:157–167. doi: 10.1198/016214505000000646.
  • 21. Moreno E., Bertolino F., Racugno W. An intrinsic limiting procedure for model selection and hypotheses testing. J. Am. Stat. Assoc. 1998;93:1451–1460. doi: 10.1080/01621459.1998.10473805.
  • 22. Bishop C.M. Pattern Recognition and Machine Learning. Springer; Berlin/Heidelberg, Germany: 2006.
  • 23. Jordan M.I., Ghahramani Z., Jaakkola T.S., Saul L.K. An introduction to variational methods for graphical models. Mach. Learn. 1999;37:183–233. doi: 10.1023/A:1007665907178.
  • 24. Parisi G., Shankar R. Statistical field theory. Phys. Today. 1988;41:110. doi: 10.1063/1.2811677.
  • 25. Boyd S., Vandenberghe L. Convex Optimization. Cambridge University Press; Cambridge, UK: 2004.
  • 26. Berger J., Pericchi L. Training samples in objective Bayesian model selection. Ann. Stat. 2004;32:841–869. doi: 10.1214/009053604000000229.
  • 27. Beal M.J. Variational Algorithms for Approximate Bayesian Inference. University College London; London, UK: 2003.
  • 28. Kruschke J. Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan. Academic Press; Cambridge, MA, USA: 2014.
  • 29. McElreath R. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Chapman and Hall/CRC; Boca Raton, FL, USA: 2018.
  • 30. Giordano R.J., Broderick T., Jordan M.I. Linear response methods for accurate covariance estimates from mean field variational Bayes. In: Proceedings of Advances in Neural Information Processing Systems; Montreal, QC, Canada, 7–12 December 2015; pp. 1441–1449.
