Author manuscript; available in PMC: 2009 May 11.
Published in final edited form as: J Am Stat Assoc. 2008 Dec 1;103(484):1659–1664. doi: 10.1198/016214508000000779

Properties and Implementation of Jeffreys's Prior in Binomial Regression Models

Ming-Hui Chen 1, Joseph G Ibrahim 1, Sungduk Kim 1
PMCID: PMC2680313  NIHMSID: NIHMS96771  PMID: 19436775

Abstract

We study several theoretical properties of Jeffreys's prior for binomial regression models. We show that Jeffreys's prior is symmetric and unimodal for a class of binomial regression models. We characterize the tail behavior of Jeffreys's prior by comparing it with the multivariate t and normal distributions under the commonly used logistic, probit, and complementary log–log regression models. We also show that the prior and posterior normalizing constants under Jeffreys's prior are linear transformation-invariant in the covariates. We further establish an interesting theoretical connection between the Bayes information criterion and the induced dimension penalty term using Jeffreys's prior for binomial regression models with general links in variable selection problems. Moreover, we develop an importance sampling algorithm for carrying out prior and posterior computations under Jeffreys's prior. We analyze a real data set to illustrate the proposed methodology.

Keywords: Bayes information criterion, Importance sampling, Normalizing constant, Tail behavior

1. INTRODUCTION

Jeffreys's prior is perhaps the most widely used noninformative prior in Bayesian analysis. For the binomial regression model, Jeffreys's prior is attractive because it is proper under mild conditions and requires no elicitation of hyperparameters whatsoever. The only requirement is a likelihood function, from which the prior is derived using Jeffreys's rule, which takes the prior distribution to be proportional to the square root of the determinant of the Fisher information matrix. There is an enormous literature on Jeffreys's prior and its properties for a wide variety of applications and models, as well as its connections to various reference priors proposed in the literature. This literature is too vast to list in its entirety here, but some relevant key references include works of Jeffreys (1946, 1961), Kass (1989, 1990), Ibrahim and Laud (1991), Kass and Wasserman (1996), and Berger (2000, 2006). Two excellent books discussing Jeffreys's prior are those of Box and Tiao (1973) and Berger (1985).

Although the literature on Jeffreys's prior is vast, there is very little discussion of theoretical properties of Jeffreys's prior for models in which it is proper, particularly for logistic regression. Such properties include its potential connections to normal or t distributions, the tail behavior of Jeffreys's prior, unimodality and symmetry properties, techniques for sampling, and its properties in variable selection problems. In Section 2 we carry out a detailed investigation of these theoretical properties of Jeffreys's prior for general binomial regression models. In Section 3 we propose an efficient importance sampling algorithm for computing the prior and posterior normalizing constants for Jeffreys's prior and examine its performance. The development of this importance sampling algorithm completely alleviates the need for more complex, time-consuming, and computationally intensive algorithms, such as Gibbs sampling or other Markov chain Monte Carlo (MCMC) algorithms. In Section 4 we use Jeffreys's prior in analyzing a prostate cancer data set. We give a brief discussion of our results and future research in Section 5. We give all proofs and the computational development in Appendixes A and B of the supplementary document.

2. PROPERTIES OF JEFFREYS'S PRIOR IN BINOMIAL REGRESSION MODELS

Suppose that {(xi, yi, ni), i = 1, 2, . . . , n} are independent observations, where yi is a binomial response variable taking a value between 0 and ni (≥1), and xi = (xi0, xi1, . . . , xik)′ is a (k + 1) × 1 vector of (possibly random) covariates and xi0 = 1 for the intercept term. The binomial regression model is assumed for [yi|xi], which has conditional density given by

$$
p(y_i \mid x_i, n_i, \beta) = \binom{n_i}{y_i} \{F(x_i'\beta)\}^{y_i} \{1 - F(x_i'\beta)\}^{n_i - y_i}, \qquad i = 1, 2, \ldots, n, \tag{1}
$$

where β = (β0, β1, . . . , βk)′ denotes a (k + 1) vector of regression coefficients, F(·) denotes a cumulative distribution function (cdf), and F−1 is called the link function. The likelihood function of β is

$$
L(\beta \mid X, y) = \prod_{i=1}^{n} \binom{n_i}{y_i} \{F(x_i'\beta)\}^{y_i} \{1 - F(x_i'\beta)\}^{n_i - y_i}, \tag{2}
$$

where y = (y1, y2, . . . , yn)′ and X = (x1, x2, . . . , xn)′ is the n × (k + 1) design matrix. Throughout the article, we assume that F(·) is twice differentiable and that f(z) = dF (z)/dz denotes the probability density function (pdf). Then Jeffreys's prior for β under the binomial regression model (1) is given by

$$
\pi(\beta \mid X) \propto |X'W(\beta)X|^{1/2}, \tag{3}
$$

where $|X'W(\beta)X|$ denotes the determinant of the matrix $X'W(\beta)X$, $W(\beta) = \mathrm{diag}(w_1(\beta), w_2(\beta), \ldots, w_n(\beta))$, and

$$
w_i(\beta) = \frac{n_i\{f(x_i'\beta)\}^2}{F(x_i'\beta)\{1 - F(x_i'\beta)\}} \tag{4}
$$

for i = 1, 2, . . . , n. Jeffreys's prior given by (3) does not have a closed form in general, in the sense that it is known only up to a prior normalizing constant. Although Jeffreys's prior is generally improper for most generalized linear models, it is proper for binomial regression models, such as logistic, probit, and complementary log–log regression models, as shown by Ibrahim and Laud (1991). It also has several additional attractive properties, which we formally state as follows.
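To make (3) and (4) concrete, the following minimal Python sketch (ours, not part of the original article) evaluates the unnormalized log kernel $\log|X'W(\beta)X|^{1/2}$ for a user-supplied link; the logistic link and the toy design are illustrative assumptions.

```python
import numpy as np

def jeffreys_log_kernel(beta, X, n_trials, F, f):
    """Unnormalized log Jeffreys prior, log |X' W(beta) X|^{1/2}, from (3)-(4)."""
    eta = X @ beta                                      # linear predictors x_i' beta
    Fi = F(eta)
    w = n_trials * f(eta) ** 2 / (Fi * (1.0 - Fi))      # diagonal of W(beta), eq. (4)
    _, logdet = np.linalg.slogdet(X.T @ (w[:, None] * X))
    return 0.5 * logdet

# Logistic link: F is the logistic cdf and f its density.
F_logit = lambda z: 1.0 / (1.0 + np.exp(-z))
f_logit = lambda z: F_logit(z) * (1.0 - F_logit(z))

# Toy design: intercept plus one covariate, all n_i = 1 (binary responses).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=20)])
print(jeffreys_log_kernel(np.array([0.5, -1.0]), X, np.ones(20), F_logit, f_logit))
```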

Proposition 1

For the binomial regression model (1), assume that X is of full rank. Then Jeffreys's prior (3) for β is proper, and the corresponding moment-generating function of β exists.

Proposition 1 is a special case of theorem 2.1 of Ibrahim and Laud (1991).

Proposition 2

Assume that F(z) is symmetric in the sense that F(−z) = 1 − F(z) and f(−z) = f(z). Then Jeffreys's prior π(β|X) in (3) is symmetric about 0, that is, $\pi(-\beta \mid X) = \pi(\beta \mid X)$ for all $\beta \in \mathbb{R}^{k+1}$, where $\mathbb{R}^{k+1}$ denotes (k + 1)-dimensional Euclidean space.

The proof of Proposition 2 follows directly from the fact that wi(−β) = wi(β) for i = 1, 2, . . . , n.

Theorem 1

Let $q(z) = \log\bigl[\{f(z)\}^2 / \bigl(F(z)\{1 - F(z)\}\bigr)\bigr] = 2\log f(z) - \log F(z) - \log\{1 - F(z)\}$. Assume that (a) X is of full rank; (b) q(z) has a unique mode, $z_{\mathrm{mod}}$; and (c) $q'(z) < 0$ if $z > z_{\mathrm{mod}}$, $q'(z_{\mathrm{mod}}) = 0$, and $q'(z) > 0$ if $z < z_{\mathrm{mod}}$. Then Jeffreys's prior π(β|X) in (3) is unimodal, and its unique mode is $\beta_{\mathrm{mod}} = (z_{\mathrm{mod}}, 0, \ldots, 0)'$.

Assumptions (b) and (c) in Theorem 1 are satisfied for several binomial regression models, including logistic, probit, and complementary log–log regressions. We formally state this result in the next theorem.

Theorem 2

Assumptions (b) and (c) in Theorem 1 hold for F(z) = exp(z)/{1 + exp(z)}, F(z) = Φ(z) [the N(0, 1) cdf], and F(z) = 1 − exp{−exp(z)}, corresponding to logistic, probit, and complementary log–log regression models. Furthermore, Jeffreys's prior π(β|X) has a unique mode βmod = 0 for the logistic and probit regression models, and βmod = (.466, 0, . . . , 0)′ for the complementary log–log regression model.
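As a quick numerical check of Theorem 2 (our illustration, not part of the original development), the mode $z_{\mathrm{mod}} \approx .466$ under the complementary log–log link can be recovered by maximizing $q(z)$ from Theorem 1 directly:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_q_cloglog(z):
    # q(z) = 2 log f(z) - log F(z) - log{1 - F(z)} for the complementary log-log link,
    # where F(z) = 1 - exp{-exp(z)} and f(z) = exp(z - exp(z)).
    F = 1.0 - np.exp(-np.exp(z))
    log_f = z - np.exp(z)
    return -(2.0 * log_f - np.log(F) - np.log(1.0 - F))

res = minimize_scalar(neg_q_cloglog, bounds=(-3.0, 3.0), method="bounded")
print(round(res.x, 3))   # approximately 0.466, in agreement with Theorem 2
```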

Our next result establishes an interesting property of Jeffreys's prior under logistic, probit, and complementary log–log regressions: regardless of sample size, the tails of Jeffreys's prior are always lighter than those of a multivariate t distribution. Toward this end, let g(β|Σ, ν) denote the pdf of a (k + 1)-dimensional multivariate t-distribution with ν degrees of freedom, location 0, and a positive definite dispersion matrix Σ, that is,

$$
g(\beta \mid \Sigma, \nu) = \frac{\Gamma\{(\nu + k + 1)/2\}}{\Gamma(\nu/2)\,(\nu\pi)^{(k+1)/2}\,|\Sigma|^{1/2}} \left(1 + \frac{1}{\nu}\,\beta'\Sigma^{-1}\beta\right)^{-(\nu + k + 1)/2}. \tag{5}
$$

Then we are led to the following theorem, which characterizes the tail behavior of Jeffreys's prior.

Theorem 3

Assume that X is of full rank. Then Jeffreys's prior π(β|X) in (3) under logistic regression, probit regression, and complementary log–log regression has lighter tails than g(β|Σ, ν) for any ν > 0, that is, $\lim_{\|\beta\| \to \infty} \pi(\beta \mid X)/g(\beta \mid \Sigma, \nu) = 0$.
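The following small numerical illustration of Theorem 3 (ours) evaluates the log ratio of the unnormalized Jeffreys kernel to a multivariate t density along a ray under the logistic link; the design, the direction, and ν = 3 are arbitrary illustrative choices, and the log ratio should decrease without bound as r grows.

```python
import numpy as np
from scipy.stats import multivariate_t

def log_jeffreys_logistic(beta, X, n_trials):
    # log |X' W(beta) X|^{1/2}; for the logistic link f = F(1 - F), so w_i = n_i F(1 - F).
    F = 1.0 / (1.0 + np.exp(-(X @ beta)))
    w = n_trials * F * (1.0 - F)
    return 0.5 * np.linalg.slogdet(X.T @ (w[:, None] * X))[1]

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(30), rng.normal(size=30)])
n_trials = np.ones(30)
tdist = multivariate_t(loc=np.zeros(2), shape=np.eye(2), df=3)

d = np.array([1.0, 1.0]) / np.sqrt(2.0)   # an arbitrary unit direction
for r in [1.0, 5.0, 10.0, 20.0, 40.0]:
    beta = r * d
    print(r, log_jeffreys_logistic(beta, X, n_trials) - tdist.logpdf(beta))
```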

In the following theorem, we examine the tail behavior of Jeffreys's prior compared with a (k + 1)-dimensional multivariate normal distribution.

Theorem 4

Let $\phi_{k+1}(\beta \mid \Sigma_N)$ denote the probability density function of the (k + 1)-dimensional normal distribution $N_{k+1}(0, \Sigma_N)$, where $\Sigma_N$ is a (k + 1) × (k + 1) positive definite matrix. Then the following results hold:

  1. Under logistic regression, we have $\lim_{\|\beta\| \to \infty} \pi(\beta \mid X)/\phi_{k+1}(\beta \mid \Sigma_N) = \infty$, which implies that Jeffreys's prior π(β|X) under logistic regression always has heavier tails than the normal distribution, regardless of n.

  2. Let $X_{i_1 i_2 \cdots i_{k+1}} = (x_{i_1}, x_{i_2}, \ldots, x_{i_{k+1}})'$ be a (k + 1) × (k + 1) submatrix of X. If there exists $(i_1, i_2, \ldots, i_{k+1})$ such that $X_{i_1 i_2 \cdots i_{k+1}}$ is of full rank and $\Sigma_N^{-1} - \tfrac{1}{2}(X_{i_1 i_2 \cdots i_{k+1}})'X_{i_1 i_2 \cdots i_{k+1}} > 0$ (i.e., positive definite), then the normal distribution $N_{k+1}(0, \Sigma_N)$ has lighter tails than Jeffreys's prior π(β|X) under probit regression. If $\Sigma_N^{-1} - \tfrac{1}{2}(X_{i_1 i_2 \cdots i_{k+1}})'X_{i_1 i_2 \cdots i_{k+1}} < 0$ (i.e., negative definite) for all (k + 1) × (k + 1) full-rank submatrixes $X_{i_1 i_2 \cdots i_{k+1}}$ of X, then Jeffreys's prior π(β|X) under probit regression has lighter tails than the normal distribution $N_{k+1}(0, \Sigma_N)$.

  3. Let β = rd, where r ≥ 0 and d = (d0, d1, d2, . . . , dk)′ denotes a (k + 1)-dimensional unit direction vector such that $\|d\| = \sqrt{d'd} = 1$. Under complementary log–log regression, Jeffreys's prior π(β|X) has lighter tails than a Nk+1(0, ΣN) distribution in certain directions d, including d = (1, 0, 0, . . . , 0)′, and has heavier tails than a Nk+1(0, ΣN) distribution in some other directions d, including d = (−1, 0, 0, . . . , 0)′.

Next we characterize the conditional Jeffreys's prior distribution of β0.

Proposition 3

For Jeffreys's prior π(β|X) given in (3) for general binomial regression, the conditional prior distribution of β0 (the intercept), given β1 = . . . = βk = 0, is given by

$$
\pi(\beta_0 \mid \beta_1 = \cdots = \beta_k = 0, X) \propto \left[\frac{f^2(\beta_0)}{F(\beta_0)\{1 - F(\beta_0)\}}\right]^{(k+1)/2},
$$

and the conditional posterior distribution of β0, given β1 = . . . = βk = 0, is given by

$$
\pi(\beta_0 \mid \beta_1 = \cdots = \beta_k = 0, X, y) \propto \{f(\beta_0)\}^{k+1} \{F(\beta_0)\}^{\sum_{i=1}^{n} y_i - (k+1)/2} \{1 - F(\beta_0)\}^{\sum_{i=1}^{n}(n_i - y_i) - (k+1)/2}.
$$

The proof of Proposition 3 is straightforward. Note that the results given in Proposition 3 imply that the conditional Jeffreys's prior distribution of β0 does not depend on the sample size n, but the conditional posterior does. This result sheds light on the asymptotic behavior of Jeffreys's prior—namely, that it does not converge to any well-known distribution as n → ∞.
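As an illustration of Proposition 3 (ours), under the logistic link we have $f^2(\beta_0)/[F(\beta_0)\{1 - F(\beta_0)\}] = F(\beta_0)\{1 - F(\beta_0)\}$, so the conditional prior and posterior kernels of β0 can be evaluated and normalized numerically; the dimension k and the aggregate counts below are hypothetical.

```python
import numpy as np
from scipy.integrate import quad

k = 3                      # hypothetical number of slope coefficients
sum_y, sum_n = 4, 10       # hypothetical totals: sum_i y_i and sum_i n_i

F = lambda b0: 1.0 / (1.0 + np.exp(-b0))
# Conditional prior kernel: [f^2 / (F(1-F))]^{(k+1)/2} = [F(1-F)]^{(k+1)/2} for the logistic link.
prior_kernel = lambda b0: (F(b0) * (1.0 - F(b0))) ** ((k + 1) / 2.0)
# Conditional posterior kernel: prior kernel times F^{sum_y} (1-F)^{sum_n - sum_y}.
post_kernel = lambda b0: prior_kernel(b0) * F(b0) ** sum_y * (1.0 - F(b0)) ** (sum_n - sum_y)

prior_const = quad(prior_kernel, -20, 20)[0]   # does not involve the counts
post_const = quad(post_kernel, -20, 20)[0]     # depends on the data through sum_y and sum_n
print(prior_const, post_const)
```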

Let $C_0(X) = \int_{\mathbb{R}^{k+1}} |X'W(\beta)X|^{1/2}\,d\beta$ and $C(X) = \int_{\mathbb{R}^{k+1}} L(\beta \mid X, y)\,|X'W(\beta)X|^{1/2}\,d\beta$, which correspond to the prior and posterior normalizing constants. We present an interesting result: the prior and posterior normalizing constants based on Jeffreys's prior are invariant under a one-to-one linear transformation of the covariates. Let V denote a (k + 1) × (k + 1) matrix. We are led to the following theorem.

Theorem 5

Assume that V is of full rank (k + 1). We have

$$
C_0(X) = C_0(XV) \quad \text{and} \quad C(X) = C(XV). \tag{6}
$$

Note that when V is of full rank, XV denotes a one-to-one linear transformation of the covariates. By taking V = diag(1, v1, v2, . . . , vk) with v1 > 0, v2 > 0, . . . , vk > 0, (6) implies that C0(X) and C(X) are scale-invariant with respect to the covariates. The scale-invariance of C0(X) and C(X) in the covariates is a desirable property in Bayesian variable selection. Because the posterior model probability under a uniform prior on the model space is a function of prior and posterior normalizing constants, the result given in Theorem 5 also implies that the posterior model probability is scale-invariant in the covariates. In this sense Jeffreys's prior is as attractive as Zellner's g-prior (Zellner 1986), which also leads to posterior model probabilities that are scale invariant in the covariates.
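A simple numerical check of Theorem 5 (our sketch, not part of the original development) compares $C_0(X)$ and $C_0(XV)$ by two-dimensional quadrature for a small logistic design with k = 1; the design, the transformation V, and the truncation of the integration region are illustrative assumptions, the last introducing only negligible error.

```python
import numpy as np
from scipy.integrate import dblquad

def jeffreys_kernel(X, n_trials):
    """Return |X' W(beta) X|^{1/2} as a function of (beta0, beta1) for the logistic link."""
    def kern(b0, b1):
        eta = X @ np.array([b0, b1])
        F = 1.0 / (1.0 + np.exp(-eta))
        w = n_trials * F * (1.0 - F)
        return np.sqrt(np.linalg.det(X.T @ (w[:, None] * X)))
    return kern

x1 = np.array([-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0])   # covariate values bounded away from 0
X = np.column_stack([np.ones(len(x1)), x1])
n_trials = np.ones(len(x1))
V = np.array([[1.0, 0.0], [0.3, 2.5]])                         # an arbitrary full-rank transformation

C0_X = dblquad(jeffreys_kernel(X, n_trials), -30, 30, -30, 30)[0]
C0_XV = dblquad(jeffreys_kernel(X @ V, n_trials), -30, 30, -30, 30)[0]
print(C0_X, C0_XV)   # the two values should agree up to quadrature and truncation error
```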

Finally, we examine a theoretical connection between the Bayes information criterion (BIC) (Schwarz 1978) and the ratio of the posterior normalizing constant and the prior normalizing constant under Jeffreys's prior. Toward this end, we assume that there are only k + 1 distinct rows in X, denoted by $\tilde{x}_j$, j = 1, 2, . . . , k + 1. Under this assumption, we combine the binomial counts into k + 1 aggregated counts corresponding to those $\tilde{x}_j$'s, and the aggregated likelihood function of (2) is given by

$$
L(\beta \mid X_A, y) = \left\{\prod_{i=1}^{n}\binom{n_i}{y_i}\right\}\left[\prod_{j=1}^{k+1}\{F(\tilde{x}_j'\beta)\}^{y_{Aj}}\{1 - F(\tilde{x}_j'\beta)\}^{n_{Aj} - y_{Aj}}\right], \tag{7}
$$

where $X_A = (\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_{k+1})'$, $y_{Aj} = \sum_{i: x_i = \tilde{x}_j} y_i$, and $n_{Aj} = \sum_{i: x_i = \tilde{x}_j} n_i$. Using (7), Jeffreys's prior in (3) reduces to $\pi(\beta \mid X_A) \propto |X_A'W_A(\beta)X_A|^{1/2}$, where $W_A(\beta) = \mathrm{diag}(w_{A1}(\beta), \ldots, w_{A,k+1}(\beta))$ and $w_{Aj}(\beta) = n_{Aj}\{f(\tilde{x}_j'\beta)\}^2/[F(\tilde{x}_j'\beta)\{1 - F(\tilde{x}_j'\beta)\}]$. It is easy to see that when $X_A$ is of full rank and $n_{Aj} \ge 1$, $\pi(\beta \mid X_A)$ is proper. Let $p_j = p_j(\beta) = F(\tilde{x}_j'\beta)$ for j = 1, 2, . . . , k + 1. When $X_A$ is of full rank, $p_1, p_2, \ldots, p_{k+1}$ are k + 1 unconstrained binomial probabilities. The aggregated likelihood given in (7) thus can be written as a function of $(p_1, p_2, \ldots, p_{k+1})$, given by $L(p_1, p_2, \ldots, p_{k+1} \mid X_A, y) = \{\prod_{i=1}^{n}\binom{n_i}{y_i}\}[\prod_{j=1}^{k+1} p_j^{y_{Aj}}(1 - p_j)^{n_{Aj} - y_{Aj}}]$. In terms of the binomial probabilities, Jeffreys's prior and the corresponding posterior distribution have closed-form expressions. Specifically, Jeffreys's prior and the corresponding posterior distribution of $p_1, p_2, \ldots, p_{k+1}$ are given by

$$
\pi(p_1, p_2, \ldots, p_{k+1} \mid X_A) = \{B(\tfrac{1}{2}, \tfrac{1}{2})\}^{-(k+1)} \prod_{j=1}^{k+1}\{p_j(1 - p_j)\}^{-1/2} \tag{8}
$$

and

$$
\pi(p_1, p_2, \ldots, p_{k+1} \mid X_A, y) = \prod_{j=1}^{k+1} \frac{p_j^{y_{Aj} - 1/2}(1 - p_j)^{n_{Aj} - y_{Aj} - 1/2}}{B(\tfrac{1}{2} + y_{Aj}, \tfrac{1}{2} + n_{Aj} - y_{Aj})}. \tag{9}
$$

It is easy to show that

$$
\pi(\beta \mid X_A) = \pi(p_1, p_2, \ldots, p_{k+1} \mid X_A)\left|\frac{\partial(p_1, p_2, \ldots, p_{k+1})}{\partial\beta}\right|, \tag{10}
$$

where $\partial(p_1, p_2, \ldots, p_{k+1})/\partial\beta = \mathrm{diag}(f(\tilde{x}_1'\beta), \ldots, f(\tilde{x}_{k+1}'\beta))\,X_A$. Using (8) and (10), we obtain a closed-form expression of the normalizing constant for Jeffreys's prior, given by

$$
C_0(X) = \int_{\mathbb{R}^{k+1}} |X_A'W_A(\beta)X_A|^{1/2}\,d\beta = \left(\prod_{j=1}^{k+1} n_{Aj}\right)^{1/2}\{B(\tfrac{1}{2}, \tfrac{1}{2})\}^{k+1} = \pi^{k+1}\left(\prod_{j=1}^{k+1} n_{Aj}\right)^{1/2}, \tag{11}
$$

where $B(a, b) = \Gamma(a)\Gamma(b)/\Gamma(a + b)$ denotes the beta function. Similarly, using (9) and (10), we obtain the posterior normalizing constant based on Jeffreys's prior as

$$
C(X) = \int_{\mathbb{R}^{k+1}} L(\beta \mid X_A, y)\,|X_A'W_A(\beta)X_A|^{1/2}\,d\beta = \left\{\prod_{i=1}^{n}\binom{n_i}{y_i}\right\}\left(\prod_{j=1}^{k+1} n_{Aj}\right)^{1/2}\left\{\prod_{j=1}^{k+1} B\!\left(\tfrac{1}{2} + y_{Aj}, \tfrac{1}{2} + n_{Aj} - y_{Aj}\right)\right\}. \tag{12}
$$
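The closed-form expressions (11) and (12) are easy to evaluate on the log scale, as in the following sketch (ours); the aggregated counts are hypothetical, and we treat the data as already aggregated, so the binomial coefficients are those of the aggregated counts.

```python
import numpy as np
from scipy.special import betaln, gammaln

def log_norm_constants(n_A, y_A):
    """log C0(X) and log C(X) from (11) and (12), with data treated as already aggregated."""
    n_A, y_A = np.asarray(n_A, float), np.asarray(y_A, float)
    log_C0 = len(n_A) * np.log(np.pi) + 0.5 * np.sum(np.log(n_A))          # eq. (11)
    log_binom = np.sum(gammaln(n_A + 1) - gammaln(y_A + 1) - gammaln(n_A - y_A + 1))
    log_C = (log_binom + 0.5 * np.sum(np.log(n_A))
             + np.sum(betaln(0.5 + y_A, 0.5 + n_A - y_A)))                 # eq. (12)
    return log_C0, log_C

# Hypothetical aggregated counts for k + 1 = 3 distinct covariate rows.
print(log_norm_constants(n_A=[40, 35, 25], y_A=[12, 20, 9]))
```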

Let $N = \sum_{i=1}^{n} n_i$, $\hat{\alpha}_j = n_{Aj}/N$, and $\hat{\mu}_j = y_{Aj}/N$ for j = 1, 2, . . . , k + 1. Also, let $\hat{\beta}$ denote the maximum likelihood estimate of β. Then, using (7), the BIC is given by

$$
\mathrm{BIC} = -2\log L(\hat{\beta} \mid X_A, y) + (k + 1)\log(N) = -2\log\left\{\prod_{i=1}^{n}\binom{n_i}{y_i}\right\} - 2\sum_{j=1}^{k+1}\left\{y_{Aj}\log\!\left(\frac{y_{Aj}}{n_{Aj}}\right) + (n_{Aj} - y_{Aj})\log\!\left(\frac{n_{Aj} - y_{Aj}}{n_{Aj}}\right)\right\} + (k + 1)\log(N). \tag{13}
$$

The following theorem characterizes the connection between the BIC and the normalizing constants.

Theorem 6

Assume that (a) $\lim_{N \to \infty}\hat{\alpha}_j = \alpha_j$ and $\lim_{N \to \infty}\hat{\mu}_j = \mu_j$ exist and (b) $0 < \alpha_j < 1$ and $0 < \mu_j < \alpha_j$ for j = 1, 2, . . . , k + 1. Then for large N, we have

$$
-2\{\log C(X) - \log C_0(X)\} = \mathrm{BIC} + \sum_{j=1}^{k+1}\log\!\left(\frac{\pi}{2}\hat{\alpha}_j\right) + o\!\left(\frac{1}{N}\right), \tag{14}
$$

where C0(X), C(X), and BIC are defined by (11), (12), and (13).

The proof of Theorem 6 follows directly from Stirling's formula, and we omit the details for brevity. The theoretical connection given in (14) is quite interesting. First, −2{logC(X) − logC0(X)} acts quite similarly to the BIC. Second, in addition to the dimensional penalty of (k + 1)log N in the BIC, the dimensional penalty term in −2{logC(X) − logC0(X)} also depends on (α1, α2, . . . , αk+1), which describes the "joint distribution" of the covariate $x_i$ over its k + 1 distinct values $\tilde{x}_j$, j = 1, . . . , k + 1. Thus Bayesian variable selection based on Jeffreys's prior can yield very different results from the BIC, especially when N is small.
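The following numerical check of (14) (ours, with hypothetical aggregated counts) compares $-2\{\log C(X) - \log C_0(X)\}$, computed from (11) and (12), with $\mathrm{BIC} + \sum_{j}\log(\tfrac{\pi}{2}\hat{\alpha}_j)$; the two quantities should nearly coincide for large N.

```python
import numpy as np
from scipy.special import betaln, gammaln

n_A = np.array([400.0, 350.0, 250.0])   # hypothetical aggregated counts n_Aj (large N)
y_A = np.array([120.0, 200.0, 90.0])    # hypothetical aggregated counts y_Aj
N = n_A.sum()
kp1 = len(n_A)                          # k + 1

# log C0 and log C from (11)-(12), treating the data as already aggregated.
log_binom = np.sum(gammaln(n_A + 1) - gammaln(y_A + 1) - gammaln(n_A - y_A + 1))
log_C0 = kp1 * np.log(np.pi) + 0.5 * np.sum(np.log(n_A))
log_C = log_binom + 0.5 * np.sum(np.log(n_A)) + np.sum(betaln(0.5 + y_A, 0.5 + n_A - y_A))

# BIC from (13), with saturated-model MLEs p_hat_j = y_Aj / n_Aj.
loglik_hat = log_binom + np.sum(y_A * np.log(y_A / n_A) + (n_A - y_A) * np.log((n_A - y_A) / n_A))
BIC = -2.0 * loglik_hat + kp1 * np.log(N)

alpha_hat = n_A / N
print(-2.0 * (log_C - log_C0))                           # left side of (14)
print(BIC + np.sum(np.log(np.pi * alpha_hat / 2.0)))     # right side of (14), up to the error term
```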

3. COMPUTATIONAL IMPLEMENTATION OF JEFFREYS'S PRIOR

As shown in Theorem 3, under logistic regression, probit regression, and complementary log–log regression, Jeffreys's prior π(β|X) in (3) has lighter tails than g(β|Σ, ν) in (5) for any ν > 0. In addition, the likelihood L(β|X, y) is bounded above, so it follows that the posterior has lighter tails than the t-distribution. Because of Theorems 3 and 4, we can completely avoid Gibbs sampling or other computationally intensive MCMC schemes for carrying out posterior computations. Toward this goal, we propose an importance sampling approach (see, e.g., Geweke 1989) to compute the prior and posterior quantities of interest. To obtain appropriate importance sampling densities for Jeffreys's prior and the corresponding posterior distribution, we consider a more general form of the multivariate t-distribution with density

$$
g(\beta \mid \mu, \Sigma, \nu) = \frac{\Gamma\{(\nu + k + 1)/2\}}{\Gamma(\nu/2)\,(\nu\pi)^{(k+1)/2}\,|\Sigma|^{1/2}} \left(1 + \frac{1}{\nu}(\beta - \mu)'\Sigma^{-1}(\beta - \mu)\right)^{-(\nu + k + 1)/2}. \tag{15}
$$

For computing the prior normalizing constant, we specify μ = βmod in (15), because Jeffreys's prior has a unique mode at βmod as shown in Theorem 1. To determine Σ in (15), we match the Hessians (curvatures) of Jeffreys's prior and the t-distribution at βmod as

$$
\kappa_0 \left.\frac{\partial^2 \log \pi(\beta \mid X)}{\partial\beta\,\partial\beta'}\right|_{\beta = \beta_{\mathrm{mod}}} = \left.\frac{\partial^2 \log g(\beta \mid \mu = \beta_{\mathrm{mod}}, \Sigma, \nu)}{\partial\beta\,\partial\beta'}\right|_{\beta = \beta_{\mathrm{mod}}}, \tag{16}
$$

where π(β|X) and g(β|μ, Σ, ν) are given by (3) and (15), and κ0 > 0 is a fixed dispersion adjustment parameter. We follow a similar strategy for computing the posterior normalizing constant: we match the mode and the Hessian (curvature) at the mode of the posterior distribution to determine μ and Σ in the importance sampling density (15). In doing so, we are led to $\mu = \hat{\mu} = \arg\max_{\beta \in \mathbb{R}^{k+1}}\{\log[L(\beta \mid X, y)\pi(\beta \mid X)]\}$ and

$$
\Sigma^{-1} = -\kappa_1 \frac{\nu}{\nu + k + 1}\left.\frac{\partial^2 \log\{L(\beta \mid X, y)\pi(\beta \mid X)\}}{\partial\beta\,\partial\beta'}\right|_{\beta = \hat{\mu}}, \tag{17}
$$

where κ1 > 0 is a fixed dispersion adjustment parameter. A detailed discussion of the specification of ν and κ0 (or κ1) in determining the importance sampling density g(β|μ, Σ, ν) is given in Appendix B of the supplementary document.

The steps of the importance sampling method are quite simple. Let $\mathcal{G}(\nu/2, \nu/2)$ denote the gamma distribution with shape parameter ν/2 and rate (inverse scale) parameter ν/2. Then the importance sampling algorithm for computing the prior normalizing constant C0 = C0(X) can be stated as follows:

  • Step 1: Generate a random sample $\{\beta_1, \beta_2, \ldots, \beta_Q\}$ of size Q from $g(\beta \mid \mu = \beta_{\mathrm{mod}}, \Sigma, \nu)$, where for each q we independently generate $\lambda_q \sim \mathcal{G}(\nu/2, \nu/2)$ and $\beta_q \sim N_{k+1}(\beta_{\mathrm{mod}}, \Sigma/\lambda_q)$.

  • Step 2: Compute the Monte Carlo estimate of the prior normalizing constant C0 as $\hat{C}_0 = \frac{1}{Q}\sum_{q=1}^{Q} \frac{|X'W(\beta_q)X|^{1/2}}{g(\beta_q \mid \mu = \beta_{\mathrm{mod}}, \Sigma, \nu)}$.

In step 2 we also may calculate $\log(\hat{C}_0)$ instead of $\hat{C}_0$ for greater stability and accuracy. In addition, we compute the relative Monte Carlo (MC) standard error (SE) as $\mathrm{RSE}(\hat{C}_0) = \frac{1}{\hat{C}_0}\bigl\{\frac{1}{Q(Q-1)}\sum_{q=1}^{Q}\bigl[\frac{|X'W(\beta_q)X|^{1/2}}{g(\beta_q \mid \mu = \beta_{\mathrm{mod}}, \Sigma, \nu)} - \hat{C}_0\bigr]^2\bigr\}^{1/2}$. Note that by the standard delta method, we can show that $\mathrm{RSE}(\hat{C}_0)$ is indeed the MC SE of $\log(\hat{C}_0)$. In practice, we recommend choosing Q and κ0 so that $\mathrm{RSE}(\hat{C}_0)$ is approximately .01 or less. The importance sampling algorithm for computing the posterior normalizing constant C = C(X) is analogous to that for the prior normalizing constant.
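The following sketch (ours, not the authors' code) implements Steps 1 and 2 for the logistic link, using a numerically differentiated Hessian for the matching in (16); the simulated design is an illustrative assumption patterned after the example in the next paragraph, and all function names are our own.

```python
import numpy as np
from scipy.stats import multivariate_t

def log_jeffreys_logistic(beta, X, n_trials):
    """log |X' W(beta) X|^{1/2} with logistic weights w_i = n_i F(1 - F); see (3)-(4)."""
    F = 1.0 / (1.0 + np.exp(-(X @ beta)))
    w = n_trials * F * (1.0 - F)
    return 0.5 * np.linalg.slogdet(X.T @ (w[:, None] * X))[1]

def num_hessian(fun, x0, eps=1e-4):
    """Central-difference Hessian of a scalar function."""
    p = len(x0)
    H = np.zeros((p, p))
    for a in range(p):
        for b in range(p):
            ea, eb = np.eye(p)[a] * eps, np.eye(p)[b] * eps
            H[a, b] = (fun(x0 + ea + eb) - fun(x0 + ea - eb)
                       - fun(x0 - ea + eb) + fun(x0 - ea - eb)) / (4.0 * eps ** 2)
    return H

def log_C0_importance(X, n_trials, nu=3.37, kappa0=1.0, Q=10000, seed=0):
    """Importance sampling estimate of log C0(X) and its relative MC SE (Steps 1 and 2)."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    beta_mod = np.zeros(p)                                   # mode of Jeffreys's prior (logistic)
    # Hessian matching (16): Sigma^{-1} = -kappa0 * nu/(nu + k + 1) * Hessian of log prior at the mode.
    H = num_hessian(lambda b: log_jeffreys_logistic(b, X, n_trials), beta_mod)
    Sigma = np.linalg.inv(-kappa0 * nu / (nu + p) * H)
    tdist = multivariate_t(loc=beta_mod, shape=Sigma, df=nu)
    # Step 1: draw from the t importance density via its gamma scale-mixture representation.
    lam = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=Q)  # rate nu/2, so E(lambda) = 1
    z = rng.multivariate_normal(np.zeros(p), Sigma, size=Q)
    betas = beta_mod + z / np.sqrt(lam)[:, None]
    # Step 2: importance ratios, Monte Carlo estimate of C0, and the relative MC SE.
    log_num = np.array([log_jeffreys_logistic(b, X, n_trials) for b in betas])
    ratio = np.exp(log_num - tdist.logpdf(betas))
    C0_hat = ratio.mean()
    rse = np.sqrt(np.sum((ratio - C0_hat) ** 2) / (Q * (Q - 1.0))) / C0_hat
    return np.log(C0_hat), rse

# Simulated design as in the next paragraph: n = 100, x_i1 ~ Bernoulli(.6), all n_i = 1.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.binomial(1, 0.6, size=100)])
print(log_C0_importance(X, n_trials=np.ones(100)))   # broadly comparable to the log C0 values in Table 1
```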

To examine the performance of the proposed importance sampling algorithm, we consider the logistic regression model with a single binary covariate. We generate a simulated data set of size n = 100 with xi1 ∼ Bernoulli(.6) and yi|xi1 ∼ Bernoulli(pi), where $p_i = \exp(x_i'\beta)/\{1 + \exp(x_i'\beta)\}$, xi = (1, xi1)′, and β = (.5, 1.0)′. We implemented the proposed importance sampling algorithm with various values of κ0 and κ1. The results are given in Table 1. The prior and posterior normalizing constants in this example involve a two-dimensional integral. As discussed in Appendix B, βmod = 0 and ν = 3.37 is a guide value for ν in g(β|μ = 0, Σ, ν) for computing the prior normalizing constant under logistic regression. From Table 1, we see that both ν = 3.37 and ν = 5 work well for computing both the prior and posterior normalizing constants. The MC SEs are all .012 or smaller with Q = 10,000. Moreover, we see that $\log(\hat{C}_0)$, $\log(\hat{C})$, and the MC SE are extremely consistent and robust across several values of ν, κ0, and κ1. In addition, the MC SE for ν = 3.37 is smallest among all values of ν, confirming our theoretical results concerning the choice of ν.

Table 1.

Monte Carlo estimates of log C0 and log C with MC size Q = 10,000

               Jeffreys's prior                      Posterior
  ν      κ0       log Ĉ0     MC SE       κ1       log Ĉ       MC SE
  1      1        6.137      .008        2        −56.907     .007
  3.37            6.131      .002                 −56.900     .003
  5               6.133      .002                 −56.895     .003
  20              6.139      .012                 −56.881     .007
  3.37   .5       6.144      .005        1        −56.884     .006
         2        6.127      .005        3        −56.906     .004
  5      .5       6.129      .004        1        −56.898     .004
         2        6.139      .008        3        −56.890     .005

  True value: log C0 = 6.132; log C = −56.890

NOTE: Blank entries under ν, κ0, and κ1 repeat the value immediately above.

Next we discuss how to compute the posterior estimates of β under Jeffreys's prior via the proposed importance sampling method. Let $\{\beta_1, \beta_2, \ldots, \beta_Q\}$, where $\beta_q = (\beta_{q0}, \beta_{q1}, \ldots, \beta_{qk})'$, q = 1, 2, . . . , Q, be a random sample of size Q from g(β|μ, Σ, ν). Then an MC estimate of the hth posterior moment of βj is given by $\hat{E}(\beta_j^h \mid X, y) = \sum_{q=1}^{Q}\beta_{qj}^h\omega_q \big/ \sum_{q=1}^{Q}\omega_q$, where $\omega_q = \pi(\beta_q \mid X, y)/g(\beta_q \mid \mu, \Sigma, \nu)$ for q = 1, 2, . . . , Q. Using $\hat{E}(\beta_j^h \mid X, y)$, we can easily compute the posterior mean and standard deviation of βj. To compute the highest posterior density (HPD) interval of βj through the importance sampling method, we use the MC method proposed by Chen and Shao (1999). Specifically, for 0 ≤ γ < 1, define $\hat{\beta}_j^{(\gamma)} = \beta_{j(1)}$ if γ = 0 and $\hat{\beta}_j^{(\gamma)} = \beta_{j(q)}$ if $\sum_{l=1}^{q-1}\omega_{(l)} < \gamma \le \sum_{l=1}^{q}\omega_{(l)}$, where $\beta_{j(q)}$ is the qth smallest of $\{\beta_{1j}, \beta_{2j}, \ldots, \beta_{Qj}\}$ and $\omega_{(l)}$ is the normalized weight (so that $\sum_{l=1}^{Q}\omega_{(l)} = 1$) associated with $\beta_{j(l)}$. To obtain a 100(1 − α)% HPD interval for βj, we let $R_q(Q) = \bigl(\hat{\beta}_j^{(q/Q)}, \hat{\beta}_j^{((q + [(1-\alpha)Q])/Q)}\bigr)$ for q = 1, 2, . . . , Q − [(1 − α)Q], where [(1 − α)Q] denotes the integer part of (1 − α)Q. Then the 100(1 − α)% HPD interval is $R_{q^*}(Q)$, the interval with the smallest width among all the $R_q(Q)$'s.
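A minimal sketch (ours) of these importance sampling summaries, including the Chen and Shao (1999) HPD search described above, is given below; the toy target and proposal in the usage example are illustrative assumptions, with $\omega_q$ in practice being the ratio of the (unnormalized) posterior to the importance density.

```python
import numpy as np
from scipy.stats import norm, t as t_dist

def weighted_summaries(beta_samples, omega, j, alpha=0.05):
    """Posterior mean, SD, and 100(1 - alpha)% HPD interval for beta_j from importance
    samples beta_samples (Q x (k+1)) with unnormalized weights omega (Chen and Shao 1999)."""
    w = omega / omega.sum()                      # normalized importance weights
    x = beta_samples[:, j]
    mean = np.sum(w * x)
    sd = np.sqrt(np.sum(w * x ** 2) - mean ** 2)
    order = np.argsort(x)                        # order the draws and accumulate their weights
    x_sorted, cumw = x[order], np.cumsum(w[order])
    def qtl(gamma):                              # weighted quantile beta_hat^(gamma)
        idx = min(np.searchsorted(cumw, gamma, side="left"), len(x_sorted) - 1)
        return x_sorted[idx]
    Q = len(x)
    m = int((1.0 - alpha) * Q)
    lowers = np.array([qtl(q / Q) for q in range(1, Q - m + 1)])
    uppers = np.array([qtl((q + m) / Q) for q in range(1, Q - m + 1)])
    best = np.argmin(uppers - lowers)            # keep the narrowest candidate interval R_q(Q)
    return mean, sd, (lowers[best], uppers[best])

# Toy usage: target is a bivariate standard normal, proposal has independent t_5 components.
rng = np.random.default_rng(3)
draws = rng.standard_t(df=5, size=(5000, 2))
weights = np.prod(norm.pdf(draws), axis=1) / np.prod(t_dist.pdf(draws, df=5), axis=1)
print(weighted_summaries(draws, weights, j=0))   # mean near 0, SD near 1, HPD near (-1.96, 1.96)
```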

4. ANALYSIS OF PROSTATE CANCER DATA

To further motivate the proposed methodology, we consider data from a retrospective cohort study of men treated with radical prostatectomy (n = 968) between 1988 and 2000, which is a subset of the data published by D'Amico et al. (2002). The primary endpoint that D'Amico et al. (2002) considered was prostate-specific antigen (PSA) recurrence-free survival. In our analysis, we consider pathological extracapsular extension (PECE) as a binary response variable (y) that takes values 0 and 1, where 1 denotes that the cancer has penetrated the prostate wall and 0 indicates otherwise. We consider five prognostic factors: age, natural logarithm of PSA (LogPSA), percent positive prostate biopsies (ppb), biopsy Gleason score, and the 1992 American Joint Commission on Cancer (AJCC) clinical tumor category. The covariates age, LogPSA, and ppb are continuous. We code biopsy Gleason score using the indicator variables GS7 and GS8H, where (GS7, GS8H) takes values (0, 0), (1, 0), and (0, 1), corresponding to biopsy Gleason scores of ≤6, 7, and 8–10. Similarly, we code the clinical tumor category using the indicator variables T2b and T2c, where (T2b, T2c) takes values (0, 0), (1, 0), and (0, 1), corresponding to the clinical tumor categories T1c or T2a, T2b, and T2c.

First, we carried out variable selection. For all 32 possible models formed from the five prognostic factors, each model including an intercept, we computed the Akaike information criterion (AIC), the BIC, and the posterior model probabilities based on Jeffreys's prior. To compute the posterior model probabilities based on Jeffreys's prior, we implemented the importance sampling algorithm proposed in Section 3 with an MC sample size of Q = 20,000, κ0 = 1, κ1 = 2, and ν = 3.37 and ν = 5 for the prior and posterior normalizing constants. The model (LogPSA, ppb, GS7, GS8H) is both the best model under the BIC and the model with the highest posterior probability based on Jeffreys's prior. For this best model, the BIC value is 925.627, and the posterior probability is .871. The second-best model under these two criteria is (age, LogPSA, ppb, GS7, GS8H), with a BIC value of 928.574 and posterior probability of .119. The AIC selects the full model (age, LogPSA, ppb, GS7, GS8H, T2b, T2c) as the best model. The AIC values are 899.168 for the full model and 901.251 for the best BIC model. We note that the AIC gives different results from the BIC and the highest posterior probability model, apparently with no good scientific reason.
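Under Jeffreys's prior, the marginal likelihood of a model is the ratio of its posterior to prior normalizing constants, so with a uniform prior on the model space the posterior model probabilities follow directly; the sketch below (ours) shows the log-scale computation with purely illustrative values of log C and log C0.

```python
import numpy as np

def posterior_model_probs(log_C, log_C0):
    """Posterior model probabilities under a uniform prior on the model space,
    with p(m | y) proportional to C_m(X) / C0_m(X), computed stably on the log scale."""
    log_marg = np.asarray(log_C, float) - np.asarray(log_C0, float)  # log marginal likelihoods
    log_marg -= log_marg.max()                                       # log-sum-exp stabilization
    probs = np.exp(log_marg)
    return probs / probs.sum()

# Illustrative (made-up) log C and log C0 values for three competing models.
print(posterior_model_probs(log_C=[-460.2, -459.1, -465.8], log_C0=[12.3, 13.1, 14.0]))
```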

Next, we computed the posterior estimates under Jeffreys's prior for the highest posterior probability model, (LogPSA, ppb, GS7, GS8H), using the MC method with an MC sample of size Q = 20,000, as discussed in Section 3. Table 2 gives the posterior means (Estimates), the posterior standard deviations (SDs), and 95% HPD intervals for the βj's, along with the corresponding maximum likelihood (ML) estimates, standard errors (SEs), and p values. We can see from Table 2 that the posterior estimates based on Jeffreys's prior are very close to the ML estimates, which is intuitively appealing and demonstrates that Jeffreys's prior is indeed noninformative. We also can see from this table that all selected variables are highly significant, which implies that these are the most important prognostic factors in predicting the binary response PECE. We note that under the full model, age, T2b, and T2c all have 95% HPD intervals that contain 0.

Table 2.

Posterior estimates of β for the prostate cancer data

                    ML estimates                         Posterior estimates
  Variable      Estimate    SE       p value        Estimate    SD       95% HPD interval
  Intercept     −3.895      .304     <.0001         −3.896      .307     (−4.586, −3.222)
  LogPSA          .696      .135     <.0001           .696      .135     (.400, 1.004)
  ppb            2.376      .355     <.0001          2.376      .356     (1.612, 3.201)
  GS7             .705      .182      .0001           .706      .182     (.283, 1.098)
  GS8H           1.420      .337     <.0001          1.420      .337     (.639, 2.156)

5. DISCUSSION

We have examined theoretical and computational properties of Jeffreys's prior for inference in binomial regression models. We have derived several theoretical properties of Jeffreys's prior for binomial regression models, including unimodality, symmetry, mode, and tail behavior. For variable selection problems, we have established a theoretical connection between the BIC and the ratio of the posterior and prior normalizing constants with Jeffreys's prior under binomial regression with a general link. As shown in Theorem 6, the dimension penalty under Jeffreys's prior includes the penalty term imposed by the BIC and, in addition, depends on the covariates. The penalty term for the BIC depends only on k and N, and thus the BIC does not have a penalty as sophisticated as that of Jeffreys's prior for variable selection problems. This is an interesting theoretical difference that may have an impact in variable selection problems. Future work includes investigating the performance of Jeffreys's prior in variable selection and carrying out computations for variable selection with Jeffreys's prior for high-dimensional problems.

Acknowledgments

The authors thank the editor, the associate editor, and three referees for helpful comments and suggestions that have improved the article. Dr. Chen's and Dr. Ibrahim's research was supported in part by National Institutes of Health grants GM 70335 and CA 74015.

REFERENCES

  1. Berger JO. Statistical Decision Theory and Bayesian Analysis. 2nd ed. New York: Springer-Verlag; 1985.
  2. Berger JO. Bayesian Analysis: A Look at Today and Thoughts of Tomorrow. Journal of the American Statistical Association. 2000;95:1269–1276.
  3. Berger JO. The Case for Objective Bayesian Analysis. Bayesian Analysis. 2006;1:385–402.
  4. Box GEP, Tiao GC. Bayesian Inference in Statistical Analysis. Reading, MA: Addison-Wesley; 1973.
  5. Chen M-H, Shao Q-M. Monte Carlo Estimation of Bayesian Credible and HPD Intervals. Journal of Computational and Graphical Statistics. 1999;8:69–92.
  6. D'Amico AV, Whittington R, Malkowicz SB, Cote K, Loffredo M, Schultz D, Chen M-H, Tomaszewski JE, Renshaw AA, Wein A, Richie JP. Biochemical Outcome Following Radical Prostatectomy or External Beam Radiation Therapy for Clinically Localized Prostate Cancer in the PSA Era. Cancer. 2002;95:281–286. doi: 10.1002/cncr.10657.
  7. Geweke J. Bayesian Inference in Econometric Models Using Monte Carlo Integration. Econometrica. 1989;57:1317–1340.
  8. Ibrahim JG, Laud PW. On Bayesian Analysis of Generalized Linear Models Using Jeffreys's Prior. Journal of the American Statistical Association. 1991;86:981–986.
  9. Jeffreys H. An Invariant Form for the Prior Probability in Estimation Problems. Proceedings of the Royal Society of London, Ser. A. 1946;186:453–461. doi: 10.1098/rspa.1946.0056.
  10. Jeffreys H. Theory of Probability. 3rd ed. Oxford, U.K.: Oxford University Press; 1961.
  11. Kass RE. The Geometry of Asymptotic Inference (with discussion). Statistical Science. 1989;4:188–234.
  12. Kass RE. Data-Translated Likelihood and Jeffreys's Rules. Biometrika. 1990;77:107–114.
  13. Kass RE, Wasserman L. The Selection of Prior Distributions by Formal Rules. Journal of the American Statistical Association. 1996;91:1343–1370.
  14. Schwarz G. Estimating the Dimension of a Model. The Annals of Statistics. 1978;6:461–464.
  15. Zellner A. On Assessing Prior Distributions and Bayesian Regression Analysis With g-Prior Distributions. In: Goel P, Zellner A, editors. Bayesian Inference and Decision Techniques. Amsterdam: Elsevier Science; 1986. pp. 233–243.
