Bayesian Distance Weighted Discrimination

Eric F Lock

doi:10.1080/10618600.2022.2069778

Author manuscript; available in PMC: 2023 May 26.

Published in final edited form as: J Comput Graph Stat. 2022 May 26;31(4):1177–1188. doi: 10.1080/10618600.2022.2069778


PMCID: PMC9717576 NIHMSID: NIHMS1839266 PMID: 36465095

Abstract

Distance weighted discrimination (DWD) is a linear discrimination method that is particularly well-suited for classification tasks with high-dimensional data. The DWD coefficients minimize an intuitive objective function, which can be solved efficiently using state-of-the-art optimization techniques. However, DWD has not yet been cast into a model-based framework for statistical inference. In this article we show that DWD identifies the mode of a proper Bayesian posterior distribution that results from a particular link function for the class probabilities and a shrinkage-inducing proper prior distribution on the coefficients. We describe a relatively efficient Markov chain Monte Carlo (MCMC) algorithm to simulate from the true posterior under this Bayesian framework. We show that the posterior is asymptotically normal and derive the mean and covariance matrix of its limiting distribution. Through several simulation studies and an application to breast cancer genomics we demonstrate how the Bayesian approach to DWD can be used to (1) compute well-calibrated posterior class probabilities, (2) assess uncertainty in the DWD coefficients and resulting sample scores, (3) improve power via semi-supervised analysis when not all class labels are available, and (4) automatically determine a penalty tuning parameter within the model-based framework. R code to perform Bayesian DWD is available at https://github.com/lockEF/BayesianDWD.

Keywords: Cancer genomics, distance weighted discrimination, high-dimensional data, probabilistic classification

1. Introduction

High-dimensional data occur when a massive number of features are available for each unit of observation in a sample set. Such data are now very common across a variety of application areas, and this has motivated a large number of data analysis methods that were developed specifically for the high-dimensional context. For supervised analysis, where the task is to predict or describe an outcome, methods that are appropriate for the high-dimensional context often begin with an objective function to be optimized but lack a clear model-based framework for statistical inference. One reason for this is that fitting statistical models directly via (e.g.) maximum likelihood is prone to over-fitting and lack of identifiability in the high-dimensional context, especially when the number of features is larger than the sample size. Thus, objective functions that admit more regularization are required, and these often do not have a model-based interpretation in the frequentist framework. For example, a common approach in the high-dimensional context is to incorporate penalty terms that enforce shrinkage or sparsity (e.g., l₂ or l₁ penalization) within a classical predictive model (e.g., generalized linear models).

A Bayesian framework provides more flexibility for model-based supervised analysis of high-dimensional data, as appropriate regularization can be induced through the specified prior distribution. There are a number of instances in which optimizing a commonly used penalized objective function was later shown to give the mode of a posterior distribution under a particular Bayesian model. For example, a Bayesian linear model with a normally distributed outcome and a normal prior on the coefficients corresponds to solving an l₂ penalized least squares objective (i.e., ridge regression) (Hsiang 1975). Similarly, the mode of the posterior under a Bayesian framework with a double-exponential prior on the model coefficients corresponds to l₁ penalization (i.e., the lasso) (Park and Casella 2008). Such equivalences can be very useful to better understand and interpret the original optimization problem, and also provide a potential framework for statistical inference that is based on the original objective.

In this article we introduce a Bayesian formulation for Distance Weighted Discrimination (DWD) (Marron et al. 2007). DWD is a popular approach for supervised analysis with a binary outcome and high-dimensional data. It aims to identify a linear combination of the features that distinguishes the sample groups corresponding to the binary outcome; in this respect, it is comparable to various versions of the Support Vector Machine (SVM) (Cortes and Vapnik 1995) or penalized Fisher’s Linear Discriminant Analysis (LDA) (Witten and Tibshirani 2011). However, relative to its competitors, DWD often has superior generalizability for settings in which the sample size is small relative to the dimension, i.e., the high dimension low sample size (HDLSS) context. In its original formulation, DWD minimizes the inverse distances of the observed units from the separating hyperplane with a penalty term. This circumvents the “data piling” issue of SVM, wherein projections tend to pile up at the margins of the separating hyperplane for HDLSS data as a symptom of over-fitting and lack of generalizability. Since its inception, many other versions of DWD have been developed, including extensions that allow more than two classes (Huang et al. 2013), sparsity (Wang and Zou 2016), non-linear kernels (Wang and Zou 2019), multi-way (i.e., tensor) data (Lyu et al. 2017), and sample weighting for unbalanced data (Qiao et al. 2010). These useful extensions modify the optimization task used for DWD, but remain model-free. Resampling methods such as permutation testing (Wei et al. 2016), cross-validation, and bootstrapping (Lyu et al. 2017) have been used with DWD to address inferential questions that involve uncertainty. However, while various model-based versions of SVM have been proposed (Sollich 2002; Henao et al. 2014), DWD has yet to be cast into a fully specified probability model.

We show that DWD identifies the mode of a proper Bayesian posterior distribution. The corresponding density of the posterior distribution is a monotone function of the DWD objective. The underlying model is given by a particular link function for the class probabilities and a shrinkage-inducing proper prior distribution on the model coefficients. In addition to providing a model-based context for DWD, we demonstrate how this Bayesian framework can be used to extend and enhance present applications of DWD in four specific ways:

(1) Posterior class inclusion probabilities can be used for “soft” classification, and we find that these probabilities are well-calibrated across different contexts.
(2) Posterior credible intervals can be used to visualize uncertainty in the coefficients and the sample scores (i.e., projections).
(3) Semi-supervised analyses, in which only some of the class labels are observed in the training data, fit naturally into the Bayesian framework and can improve power.
(4) A penalty tuning parameter for the DWD objective can be subsumed within the Bayesian framework and given its own prior, to allow for model-based selection of this parameter.

We describe a relatively efficient Markov chain Monte Carlo (MCMC) algorithm to simulate from the posterior distribution. Alternatively, we show that the posterior is asymptotically multivariate normal; we derive the asymptotic mean and covariance, which are readily computable using existing software.

In what follows, we introduce our formal framework and notation in Section 2, formally describe the DWD objective in Section 3, and then present and discuss the Bayesian model and its estimation in Sections 4, 5, 6, and 7. We present simulation results to illustrate and validate different aspects of the approach in Section 8, and we present an application to subtype classification in breast cancer genomics in Section 9.

2. Framework and notation

Throughout this article bold lower-case characters denote vectors, bold upper-case characters denote matrices, and Greek characters denote unknown model parameters. Probability density functions are denoted by p (·), and discrete probabilities by P (·).

Let x_i be a d-dimensional vector and y_i ∈ {−1, 1} be a corresponding class indicator for sampled observations i = 1, …, n. Let X be the d × n matrix X = [x₁ x₂ … x_n]. Let y be the n-dimensional vector of outcomes y = [y₁ y₂ … y_n]. Our task is to predict the outcomes y_i from the data x_i.

3. Distance weighted discrimination

Distance weighted discrimination (DWD) identifies coefficients (weights) β = [β₁ β₂ … β_d] and an intercept $β_{0} \in ℝ$ such that the linear combinations (scores) $β_{0} + x_{i}^{T} β$ discriminate the classes. The optimization problem for DWD was originally formulated as

\underset{β_{0}, β, ξ}{argmin} \sum_{i = 1}^{n} \frac{1}{r_{i}} + C \sum_{i = 1}^{n} ξ_{i}, \quad \text{where} \quad r_{i} = y_{i} (β_{0} + x_{i}^{T} β) + ξ_{i}

(1)

subject to r_i > 0 and ξ_i > 0 for i = 1, … , n and ${‖ β ‖}_{2}^{2} \leq 1$ . In practice the classification decision rule is determined by the sign of $β_{0} + x_{i}^{T} β$ , and thus the ξ_i can be interpreted as a penalty for misclassification. The tuning parameter C controls the relative weight of this misclassification penalty compared to the inverse distances from the separating hyperplane 1/r_i. An appealing property of DWD is that its objective accounts for the projections of all the data observations, rather than just those at the margins. Liu et al. (2011) showed that the DWD objective (1) is equivalent to the following “loss+penalty” formulation

\underset{β_{0}, β}{argmin} \frac{1}{n} \sum_{i = 1}^{n} V (y_{i} (β_{0} + x_{i}^{T} β)) + \frac{λ}{2} {‖ β ‖}_{2}^{2},

(2)

where λ is a penalty tuning parameter, and

V (u) = \begin{cases} 1 - u & \text{if } u \leq 1 / 2, \\ 1 / (4 u) & \text{if } u > 1 / 2. \end{cases}

(3)

The objectives (1) and (2) are equivalent in that, with a one-to-one correspondence between the penalty parameters λ and C, the estimated coefficients and resulting sample scores are proportional.
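As a concrete illustration of the loss (3) and the loss+penalty objective (2), the following minimal Python sketch evaluates both for a given intercept and coefficient vector. The function names `V` and `dwd_objective`, and the choice to store X as a list of its n columns, are assumptions made for this illustration; this is not the author's R implementation and it does not perform the optimization itself.

```python
def V(u):
    """DWD loss (3): linear for u <= 1/2, inverse for u > 1/2."""
    return 1.0 - u if u <= 0.5 else 1.0 / (4.0 * u)


def dwd_objective(beta0, beta, X_cols, y, lam):
    """Evaluate the loss+penalty objective (2).

    X_cols is a list of the n columns x_i of X (each of length d),
    y is a list of labels in {-1, 1}, and lam is the penalty parameter.
    """
    n = len(y)
    loss = 0.0
    for x_i, y_i in zip(X_cols, y):
        score = beta0 + sum(b * x for b, x in zip(beta, x_i))
        loss += V(y_i * score)
    penalty = 0.5 * lam * sum(b * b for b in beta)
    return loss / n + penalty


# Two observations on opposite sides of the hyperplane x_1 = 0.
print(dwd_objective(0.0, [1.0, 0.0], [[1.0, 0.0], [-1.0, 0.0]], [1, -1], 1.0))
```

The two branches of V agree at u = 1/2, so the loss is continuous, and it decays like 1/(4u) rather than vanishing for well-classified points, which is why every observation contributes to the objective.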

4. Bayesian model

Here we show that the DWD objective (2) gives the mode of a posterior distribution for {β₀, β} under a Bayesian framework. Let ψ (·) give the objective function that is minimized by DWD,

ψ (β_{0}, β, y, X, λ) = \frac{1}{n} \sum_{i = 1}^{n} V (y_{i} (β_{0} + x_{i}^{T} β)) + \frac{λ}{2} {‖ β ‖}_{2}^{2} .

The posterior density for {β₀, β} is the following monotone transformation of ψ,

p (β, β_{0} ∣ X, y, λ) \propto e^{- n ψ (β_{0}, β, y, X, λ)} = [\prod_{i = 1}^{n} e^{- V (y_{i} (β_{0} + x_{i}^{T} β))}] e^{- (λ n / 2) {‖ β ‖}_{2}^{2}}

(4)

Under general conditions (4) is integrable over {β₀, β}, and therefore defines a proper probability density function; we state this formally in Theorem 1.

Theorem 1. If $X \in ℝ^{d \times n}$ , λ > 0, and y ∈ {−1, 1}ⁿ where y_i = −1 for some i and y_j = 1 for some j, then p(β, β₀ | X, y, λ) in (4) gives a proper probability density function over {β₀, β}.

In what follows, we “work backward” from this posterior to complete the specification of the model and discuss its implications.

4.1. Conditional distribution of y

Treating y as a discrete random vector, it follows directly from (4) that its entries y_i are conditionally independent and that

P (y_{i} ∣ β, β_{0}, x_{i}) \propto P (Y_{i} = y_{i}) e^{- V (y_{i} (β_{0} + x_{i}^{T} β))} .

The conditional distribution for y_i ∈ {−1, 1} thus depends on the unconditional class probability for observation i, P(Y_i = 1) = 1 − P(Y_i = −1), and its corresponding score $u_{i} ≔ β_{0} + x_{i}^{T} β$ :

P (Y_{i} = 1 ∣ u_{i}) = \frac{P (Y_{i} = 1) e^{- V (u_{i})}}{P (Y_{i} = 1) e^{- V (u_{i})} + P (Y_{i} = - 1) e^{- V (- u_{i})}} .

(5)

Choice of P(Y_i = 1) does not influence posterior inference for {β₀, β} in (4), but it is nevertheless important for the calibration of the predictive model and is discussed further in Section 6.1. Given P(Y_i = 1), (5) defines a link function from $u_{i} \in ℝ$ to a probability in [0, 1]. Figure 1 plots this function under equal class proportions; it is smooth and comparable to a probit or logit link. The full likelihood for y is

P (y ∣ β, β_{0}, X) = \frac{\prod_{i = 1}^{n} P (Y_{i} = y_{i}) e^{- V (y_{i} (β_{0} + x_{i}^{T} β))}}{\prod_{i = 1}^{n} (P (Y_{i} = 1) e^{- V (β_{0} + x_{i}^{T} β)} + P (Y_{i} = - 1) e^{- V (- β_{0} - x_{i}^{T} β)})} .

(6)

Fig. 1 — Class probability as a function of linear score under different links.
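The link function (5) is simple to evaluate numerically. The sketch below is a minimal Python version, assuming a common marginal class probability `p1` for P(Y_i = 1); the function name `class_prob` is hypothetical and chosen only for this illustration.

```python
import math


def V(u):
    """DWD loss (3)."""
    return 1.0 - u if u <= 0.5 else 1.0 / (4.0 * u)


def class_prob(u, p1=0.5):
    """P(Y = 1 | u) from the link (5), with marginal probability p1 = P(Y = 1)."""
    num = p1 * math.exp(-V(u))
    den = num + (1.0 - p1) * math.exp(-V(-u))
    return num / den


# Tabulate the link over a few scores under equal class proportions.
for u in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(u, round(class_prob(u), 4))
```

Because V(u) > 0 for every u, both exponentials lie in (0, 1), so the ratio can be computed directly without overflow. At u = 0 the function returns p1, and with p1 = 1/2 it satisfies P(Y = 1 | u) + P(Y = 1 | −u) = 1.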

4.2. Prior distribution of β

Estimating β₀ and β via maximizing the likelihood (6) would perform poorly in the HDLSS context, analogous to the poor performance of unconstrained probit or logistic regression models. However, the posterior distribution (4) also implies a prior distribution for {β₀, β} that is not conditional on y. That is,

p (β, β_{0} ∣ X, y, λ) \propto P (y ∣ β, β_{0}, X) p (β, β_{0} ∣ X, λ)

implies

p (β, β_{0} ∣ X, λ) \propto p (β, β_{0} ∣ X, y, λ) / P (y ∣ β, β_{0}, X) \propto \underbrace{\left[\prod_{i = 1}^{n} (P (Y_{i} = 1) e^{- V (β_{0} + x_{i}^{T} β)} + P (Y_{i} = - 1) e^{- V (- β_{0} - x_{i}^{T} β)})\right]}_{A} \underbrace{e^{- (λ n / 2) {‖ β ‖}_{2}^{2}}}_{B} .

(7)

This prior facilitates shrinkage in two important ways that improve performance in HDLSS settings. Term B in (7) gives the kernel of a multivariate normal distribution for β, in which the β_j’s are independent with mean 0 and variance 1 / (λ n). This derives from the l₂ penalty in (2). There is a vast literature on the connection between l₂-regularized regression (e.g., ridge regression) and using Gaussian priors on the coefficients in a Bayesian framework, as mentioned in Section 1. This shrinks the inferred coefficients toward 0, which improves performance in the HDLSS setting or in the presence of multicollinearity. Term A in (7) favors β that correspond to directions in X with high variability or bimodality. To illustrate, Figure 2 plots a single term of the product A as a function of the DWD score $u_{i} = β_{0} + x_{i}^{T} β$ when P(Y_i = 1) = 1 / 2,

f (u_{i}) = e^{- V (u_{i})} + e^{- V (- u_{i})} .

(8)

Fig. 2 — Prior term f(u_i) (8) as a function of DWD score u_i.

This term gives higher prior density to directions in X for which the corresponding scores ${\{u_{i}\}}_{i = 1}^{n}$ are not concentrated near 0, and are therefore more useful for discriminating a negative class from a positive class.
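A quick numerical check makes this behavior of the prior term (8) concrete. The short Python sketch below (with the hypothetical function name `f`) evaluates (8) on a few scores; it is an illustration only.

```python
import math


def V(u):
    """DWD loss (3)."""
    return 1.0 - u if u <= 0.5 else 1.0 / (4.0 * u)


def f(u):
    """Single prior term (8), proportional to a factor of A when P(Y_i = 1) = 1/2."""
    return math.exp(-V(u)) + math.exp(-V(-u))


# f is symmetric, smallest at u = 0, and grows toward 1 as |u| increases.
for u in (0.0, 0.5, 1.0, 2.0, 5.0):
    print(u, round(f(u), 4))
```

For |u| ≤ 1/2 the term equals 2e⁻¹cosh(u), so it is minimized at u = 0 where it takes the value 2/e; scores far from 0 in either direction receive more prior weight.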

The function A · B in (7) defines a proper density for β for any fixed intercept β₀, as stated in Theorem 2.

Theorem 2. If $X \in ℝ^{d \times n}$ , λ > 0, and $β_{0} \in ℝ$ , then

p (β ∣ X, β_{0}, λ) \propto \left[\prod_{i = 1}^{n} (P (Y_{i} = 1) e^{- V (β_{0} + x_{i}^{T} β)} + P (Y_{i} = - 1) e^{- V (- β_{0} - x_{i}^{T} β)})\right] e^{- (λ n / 2) {‖ β ‖}_{2}^{2}}

gives a proper probability density function.

The prior (7) does not define a proper density for β₀. Improper priors are commonly used for a variety of Bayesian applications, and still provide a valid framework for inference if the posterior is proper (Taraldsen and Lindqvist 2010; Sun et al. 2001). Nevertheless, it would be straightforward to modify the model to allow for shrinkage of β₀, if one desired it to have a proper probability density without conditioning on y.

5. Asymptotic normality of β

It follows from the Bernstein-von Mises theorem (i.e., the “Bayesian central limit theorem”) (Ghosal 1997) that the posterior distribution for β approximates a multivariate normal (MVN) distribution as n → ∞. The limiting distribution can be derived precisely via a second-order Taylor expansion of the log of (4) about the mode $\hat{β}$ . The resulting mean and covariance matrix are provided in Theorem 3.

Theorem 3. If $X \in ℝ^{d \times n}$ , λ > 0, and y ∈ {−1, 1}ⁿ, then

p (β ∣ X, y, λ, β_{0}) \approx MVN (β ∣ \hat{β}, V_{\hat{β}})

as n → ∞, where $\hat{β}$ is the DWD solution (2), and

V_{\hat{β}} = {\left(\sum_{i \in S_{\hat{β}}} \frac{x_{i} x_{i}^{T}}{2 {(y_{i} ({\hat{β}}_{0} + x_{i}^{T} \hat{β}))}^{3}} + n λ I\right)}^{- 1}

where $S_{\hat{β}} = \{i : y_{i} ({\hat{β}}_{0} + x_{i}^{T} \hat{β}) > 0.5\}$ . That is, $ℒ (V_{\hat{β}}^{- 1 / 2} (β - \hat{β}) ∣ X, y, λ, β_{0}) \to MVN (0, I)$ where $ℒ (\cdot)$ denotes the law of (·).

Note that this asymptotic approximation always exists and is directly computable given the DWD solution $\{{\hat{β}}_{0}, \hat{β}\}$ .
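To show how the covariance in Theorem 3 might be computed once a DWD solution is in hand, here is a minimal pure-Python sketch. It assumes the fitted intercept and coefficients are supplied by the caller (for instance from existing DWD software); the names `asymptotic_precision` and `invert`, and the small Gauss-Jordan routine used in place of a linear algebra library, are assumptions of this illustration. The weights follow from the second derivative of V, which is 1/(2u³) for u > 1/2 and 0 otherwise.

```python
def asymptotic_precision(X_cols, y, beta0_hat, beta_hat, lam):
    """Inverse of the Theorem 3 covariance: sum over S of weighted x_i x_i^T plus n*lam*I.

    X_cols is a list of the n columns x_i of X; beta0_hat and beta_hat are
    assumed to be the DWD solution of (2), supplied by the caller.
    """
    n, d = len(y), len(beta_hat)
    H = [[n * lam if j == k else 0.0 for k in range(d)] for j in range(d)]
    for x_i, y_i in zip(X_cols, y):
        margin = y_i * (beta0_hat + sum(b * x for b, x in zip(beta_hat, x_i)))
        if margin > 0.5:  # only observations in the set S contribute curvature
            w = 1.0 / (2.0 * margin ** 3)
            for j in range(d):
                for k in range(d):
                    H[j][k] += w * x_i[j] * x_i[k]
    return H


def invert(A):
    """Gauss-Jordan inverse with partial pivoting (fine for small d)."""
    n = len(A)
    M = [row[:] + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c:
                factor = M[r][c]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]


# Toy inputs; in practice beta0_hat and beta_hat would come from a DWD fit.
H = asymptotic_precision([[1.0, 0.0], [-1.0, 0.0]], [1, -1], 0.0, [1.0, 0.0], 1.0)
print(invert(H))
```

Since nλI is positive definite for λ > 0 and the remaining terms are positive semi-definite, the matrix being inverted is always invertible, which is consistent with the remark that the approximation always exists.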

6. Extensions

6.1. Marginal class probabilities

The prior class probabilities P(Y_i = 1) and P(Y_i = −1) in Section 4.1 do not affect the posterior for {β₀, β} in (4). Thus, they need not be specified for tasks that do not require a predictive model for y_i, such as exploratory visualization or batch adjustment (Huang et al. 2012). However, their specification is important to obtain appropriately calibrated probabilities for classification. In particular, they must be considered carefully when the class proportions in the training data differ from those of the target population. The ability to adjust the marginal class probabilities for out-of-sample prediction is useful, as the decision boundary for DWD (β₀) is sensitive to the class proportions used for training (Qiao et al. 2010; Qiao and Zhang 2015). We let each observation have the same class membership probability a priori:

P (Y_{i} = 1) = P_{1} for i = 1, \dots, n,

and by default we set P₁ = 0.5. In other situations P₁ may be known and fixed at another value, or estimated as the sample proportion from class 1. Alternatively, one can give P₁ its own prior (e.g., P₁ ~ Uniform (0, 1)) and infer it with the rest of the Bayesian model; such an approach may be useful in, e.g., situations in which the y_i’s are only partially observed and the population class proportions are of interest.

6.2. Inference for λ

Rather than fixing the penalty parameter λ a priori, we recommend inferring its value within the Bayesian framework. For a given prior density λ ~ p(λ), the conditional density for λ is

p (λ ∣ X, y, β, β_{0}) \propto p (λ) P (y ∣ X, β, β_{0}) p (β ∣ X, β_{0}, λ) \propto p (λ) e^{- (λ n / 2) {‖ β ‖}_{2}^{2}} / ϕ (X, β_{0}, λ)

(9)

where

ϕ (X, β_{0}, λ) = \int_{β} \prod_{i = 1}^{n} (P_{1} e^{- V (β_{0} + x_{i}^{T} β)} + (1 - P_{1}) e^{- V (- β_{0} - x_{i}^{T} β)}) e^{- (λ n / 2) {‖ β ‖}_{2}^{2}} d β .

(10)

In practice we use a prior for λ that is uniform over a large range; by default, λ ~ Uniform (1 / 128, 128). The impact of λ depends on the scale of X and the number of observations n, so the prior may need to be suitably modified for other scenarios.

6.3. Semi-supervised learning

The conditional distribution of β on X in (7) is not flat, and therefore observations for which y_i are not observed can still inform β. Consider the context in which n_u observations have missing outcomes ${(x_{i})}_{i = 1}^{n_{u}}$ and the remaining n_o observations are fully observed ${(x_{i}, y_{i})}_{i = n_{u} + 1}^{n}$ where n = n_o + n_u. The full posterior distribution for {β₀, β} is

p (β, β_{0} ∣ X, y, λ) \propto [\prod_{i = 1}^{n_{u}} (P_{1} e^{- V (β_{0} + x_{i}^{T} β)} + (1 - P_{1}) e^{- V (- β_{0} - x_{i}^{T} β)})] [\prod_{i = n_{u} + 1}^{n} e^{- V (y_{i} (β_{0} + x_{i}^{T} β))}] e^{- (λ n / 2) {‖ β ‖}_{2}^{2}} .

This full posterior can be used for semi-supervised learning. In particular, the first term is analogous to term A in (7), and will favor directions with high variability or bimodality in the n_u columns of X with missing outcome. In Section 8.3 we illustrate the advantages of incorporating unlabeled data when the predictors have a bimodal (i.e., clustered) structure, reflecting the cluster assumption in semi-supervised learning (Van Engelen and Hoos 2020).
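The semi-supervised posterior above differs from (4) only in the factor contributed by unlabeled observations. A minimal Python sketch of the unnormalized log posterior follows; the function name `log_posterior_semi` and the split of the data into `X_unlab`, `X_lab`, and `y_lab` are assumptions made for this illustration.

```python
import math


def V(u):
    """DWD loss (3)."""
    return 1.0 - u if u <= 0.5 else 1.0 / (4.0 * u)


def log_posterior_semi(beta0, beta, X_unlab, X_lab, y_lab, lam, p1=0.5):
    """Unnormalized log posterior with n_u unlabeled and n_o labeled columns.

    X_unlab and X_lab are lists of columns x_i; y_lab holds labels in {-1, 1}.
    """
    n = len(X_unlab) + len(X_lab)

    def score(x):
        return beta0 + sum(b * v for b, v in zip(beta, x))

    lp = 0.0
    for x in X_unlab:  # mixture term, analogous to term A in (7)
        u = score(x)
        lp += math.log(p1 * math.exp(-V(u)) + (1.0 - p1) * math.exp(-V(-u)))
    for x, y_i in zip(X_lab, y_lab):  # usual DWD loss for labeled data
        lp -= V(y_i * score(x))
    lp -= 0.5 * lam * n * sum(b * b for b in beta)  # Gaussian shrinkage term
    return lp


print(log_posterior_semi(0.0, [0.0, 0.0], [[1.0, 2.0]], [[0.0, 1.0], [1.0, 0.0]], [1, -1], 2.0))
```

When `X_unlab` is empty this reduces to −nψ from Section 4, so the same sampler can serve both the supervised and the semi-supervised case.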

7. Posterior approximation

7.1. MCMC algorithm, fixed λ

To simulate from the posterior distribution for {β, β₀} (4), we implement a “Metropolis-in-Gibbs” algorithm (Carlin and Louis 2008). That is, we iteratively draw each β_i from its conditional distribution given the current values of the rest of the parameters β[−i], p(β_i | β[−i], X, y, λ); each conditional draw is accomplished via a Metropolis sub-step. For the Metropolis step, we generate proposals from a normal distribution with mean given by the previous value for β_i and a variance that is adaptively scaled over the course of the algorithm. We initialize the Markov chain at the posterior mode, which can be identified using existing software for DWD; we use the sdwd package for this initialization.
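The following Python sketch illustrates the general shape of a Metropolis-in-Gibbs sampler for the posterior (4). It is not the author's implementation: it uses a fixed proposal standard deviation `step` rather than adaptive scaling, it updates the intercept with the same kind of sub-step as the coefficients, and it takes the starting values as arguments instead of calling DWD software for the posterior mode. All function and argument names are hypothetical.

```python
import math
import random


def V(u):
    """DWD loss (3)."""
    return 1.0 - u if u <= 0.5 else 1.0 / (4.0 * u)


def log_post(beta0, beta, X_cols, y, lam):
    """Unnormalized log of the posterior (4)."""
    n = len(y)
    loss = sum(V(y_i * (beta0 + sum(b * x for b, x in zip(beta, x_i))))
               for x_i, y_i in zip(X_cols, y))
    return -loss - 0.5 * lam * n * sum(b * b for b in beta)


def metropolis_in_gibbs(X_cols, y, lam, beta0_init, beta_init,
                        n_iter=1000, step=0.2, seed=0):
    """Cycle through (beta0, beta_1, ..., beta_d), one Metropolis sub-step each."""
    rng = random.Random(seed)
    params = [beta0_init] + list(beta_init)
    current = log_post(params[0], params[1:], X_cols, y, lam)
    draws = []
    for _ in range(n_iter):
        for j in range(len(params)):
            old = params[j]
            params[j] = old + rng.gauss(0.0, step)  # symmetric normal proposal
            proposed = log_post(params[0], params[1:], X_cols, y, lam)
            # Accept with probability min(1, posterior ratio).
            if math.log(1.0 - rng.random()) < proposed - current:
                current = proposed
            else:
                params[j] = old
        draws.append(list(params))
    return draws


X_cols = [[2.0], [1.5], [-2.0], [-1.5]]
y = [1, 1, -1, -1]
draws = metropolis_in_gibbs(X_cols, y, lam=0.1, beta0_init=0.0, beta_init=[0.0],
                            n_iter=500, step=0.5, seed=1)
print(sum(dr[1] for dr in draws) / len(draws))  # posterior mean estimate of beta_1
```

This sketch recomputes the full log posterior at every sub-step for clarity; a practical implementation would cache the scores and update them incrementally when a single coefficient changes.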

7.2. Inference for λ

Inference for λ can be accomplished by iteratively simulating from its full conditional distribution (9) to update its value within the Gibbs sampling algorithm in Section 7.1. However, this is not straightforward, because the integral ϕ(X, β₀, λ) (10) does not have a closed form. In our implementation, we first estimate ϕ(X, β₀, λ) over a grid of values {λ_j : j = 1, …, J} that span the range of the prior for λ. For this, we utilize the fact that $e^{- (λ n / 2) {‖ β ‖}_{2}^{2}}$ resembles a Gaussian kernel to approximate the integral via Monte Carlo with simulated random normal vectors. Given $β^{(t)}$ ~ MVN(0, (1 / (λ n)) I) for t = 1, …, T,

ϕ (X, β_{0}, λ) \approx {\left(\frac{2 π}{n λ}\right)}^{d / 2} \frac{1}{T} \sum_{t = 1}^{T} \prod_{i = 1}^{n} (P_{1} e^{- V (β_{0} + x_{i}^{T} β^{(t)})} + (1 - P_{1}) e^{- V (- β_{0} - x_{i}^{T} β^{(t)})}) .

Using this approximation, we estimate $\hat{ϕ} (X, β_{0}, λ_{j})$ for each j, and then linearly interpolate on the log-scale to approximate the full function $\hat{ϕ} (X, β_{0}, λ)$ over the range of the prior for λ. Posterior estimation then proceeds as in Section 7.1, with an additional Metropolis sub-step to update the value of λ. The full algorithm for posterior sampling is given in Section B of the online supplement.
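The Monte Carlo approximation of ϕ can be sketched in a few lines of Python. This version works on the log scale with a log-sum-exp step, which is a numerical-stability choice of this illustration rather than something stated in the article; the function name `log_phi_mc` and its arguments are hypothetical, and the leading constant uses the number of features d as the dimension of β.

```python
import math
import random


def V(u):
    """DWD loss (3)."""
    return 1.0 - u if u <= 0.5 else 1.0 / (4.0 * u)


def log_phi_mc(X_cols, beta0, lam, p1=0.5, T=2000, seed=0):
    """Monte Carlo estimate of log phi(X, beta0, lam) using T normal draws."""
    rng = random.Random(seed)
    n, d = len(X_cols), len(X_cols[0])
    sd = 1.0 / math.sqrt(lam * n)  # beta^(t) ~ MVN(0, I / (lam * n))
    log_terms = []
    for _ in range(T):
        beta = [rng.gauss(0.0, sd) for _ in range(d)]
        log_prod = 0.0
        for x in X_cols:
            u = beta0 + sum(b * v for b, v in zip(beta, x))
            log_prod += math.log(p1 * math.exp(-V(u)) + (1.0 - p1) * math.exp(-V(-u)))
        log_terms.append(log_prod)
    m = max(log_terms)  # log-sum-exp to average the products stably
    log_mean = m + math.log(sum(math.exp(t - m) for t in log_terms) / T)
    return 0.5 * d * math.log(2.0 * math.pi / (n * lam)) + log_mean


# Estimate log phi on a grid of lambda values for a toy X.
X_cols = [[1.0, -0.5], [-1.0, 0.3], [0.4, 0.9]]
for lam in (0.5, 1.0, 2.0):
    print(lam, log_phi_mc(X_cols, beta0=0.0, lam=lam, T=500))
```

Estimates on such a grid could then be interpolated linearly in log ϕ, in the spirit of the procedure described above.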

7.3. Computational considerations

In practice, we find that 1000 MCMC cycles yield appropriate coverage of the posterior when λ is fixed (see Section 8.1) and 10000 are sufficient when λ is inferred. On a single CPU, 10000 MCMC iterations take approximately 2 minutes for a dataset of size X : 500 × 500 (comparable to the data considered in Section 9), and computing time scales linearly with the dimension d as shown in Section 8.4. Thus, full posterior inference via MCMC is feasible for most applications (with patience), but computing time can be a limitation.

8. Simulations

We consider three distinct simulation scenarios, and multiple conditions under each scenario, to illustrate and validate several aspects of Bayesian DWD.

In Section 8.1 we simulate from a data generating scheme that matches the assumed prior and likelihood exactly. Under this scenario, we confirm that posterior inferences using the MCMC algorithms in Section 7 are appropriately calibrated regardless of the marginal distribution for X. We compare to results using the normal approximation in Section 5 and a bootstrapping approach.

In Section 8.2 we consider a scenario in which the two classes are defined by different multivariate normal distributions. Under this scenario we demonstrate the robustness of Bayesian DWD to its model assumptions, assess the sensitivity of results to the tuning parameter λ, and validate model-based inference for λ.

In Section 8.3 we consider different scenarios for which not all of the outcomes are observed to illustrate potential benefits and limitations of the semi-supervised approach.

8.1. Assumed model simulation

Here we fix β₀ = 0 and simulate data from the assumed model as follows:

(1) Generate X : d × n according to a specified probability distribution.
(2) Generate β_true from its conditional prior (7) (via a Metropolis-Hastings algorithm), given λ.
(3) For i = 1, …, n, compute the score $u_{i} = x_{i}^{T} β_{true}$ and generate y_i according to its class probabilities defined in Section 4.1.
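The first and last steps of this scheme can be sketched in Python as follows. The sketch covers only the Uniform(−1, 1) design for X and takes β_true as an input rather than drawing it from the conditional prior (7), so it is a partial illustration of the simulation and not a reproduction of it; the function names are hypothetical.

```python
import math
import random


def V(u):
    """DWD loss (3)."""
    return 1.0 - u if u <= 0.5 else 1.0 / (4.0 * u)


def class_prob(u, p1=0.5):
    """P(Y = 1 | u) under the DWD link with marginal probability p1."""
    num = p1 * math.exp(-V(u))
    return num / (num + (1.0 - p1) * math.exp(-V(-u)))


def simulate_uniform_X(d, n, seed=0):
    """Return X as a list of n columns, entries independent Uniform(-1, 1)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]


def simulate_labels(X_cols, beta_true, beta0=0.0, p1=0.5, seed=0):
    """Draw y_i in {-1, 1} from the class probabilities implied by the scores."""
    rng = random.Random(seed)
    y = []
    for x in X_cols:
        u = beta0 + sum(b * v for b, v in zip(beta_true, x))
        y.append(1 if rng.random() < class_prob(u, p1) else -1)
    return y


X_cols = simulate_uniform_X(d=20, n=100, seed=1)
y = simulate_labels(X_cols, beta_true=[0.5] * 20, seed=2)
print(y.count(1), y.count(-1))
```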

As manipulated conditions, we consider three values for λ (λ = 0.1, 1, or 10), two configurations of d and n (X : 20 × 100 or X : 100 × 20), and three different probability distributions for X: (1) entries are simulated independently from a Uniform (−1, 1) distribution, (2) entries are simulated independently from an Exponential(1) distribution centered to have mean 0, and (3) each column x_i is simulated from the following mixture of multivariate normal distributions:

x_{i} \sim \begin{cases} MVN (μ_{0}, I) & \text{with probability } 1 / 2, \\ MVN (μ_{1}, I) & \text{with probability } 1 / 2, \end{cases}

(11)

where the entries of μ₀ and μ₁ are generated independently from a N(0, 0.5) distribution.

For each set of manipulated conditions, we generate 100 datasets {X, β_true, y} under the assumed model and approximate the posterior distribution for β using the MCMC algorithm in Section 7.1. We construct 95% Bayesian credible intervals for β_true and the scores u = X^T β_true based on the posterior samples. We also consider credible intervals constructed using the posterior mode and the asymptotic approximation in Theorem 3 (i.e., the CLT approximation), and confidence intervals based on the percentiles of 200 non-parametric bootstrap resamples as in Lyu et al. (2017).

Table 1 gives the average coverage rates for the true values under each set of conditions and the three different approaches to inference. The credible intervals using MCMC all have approximately nominal coverage rates, which validates the algorithm. The CLT approximation gives appropriate coverage in several instances, even when d > n. However, it performs less well when λ is smaller and has relatively lower coverage rates for the scores u = X^T β_true. The bootstrap percentile approach yields narrower intervals and tends to undercover in all cases. As a representative example, Figure 3 plots the mode and 95% intervals for the coefficients for an instance under the Uniform scenario with λ = 0.1, n = 100 and d = 20; this shows close agreement between the MCMC and CLT intervals, and the bootstrap intervals are universally narrower.

Table 1.

Coverage rates for model coefficients (β) and scores (u) using 95% credible intervals under MCMC sampling from the posterior, the asymptotic normal approximation (CLT), and bootstrapping (BOOT).

Distribution	n	d	λ	MCMC β	MCMC u	CLT β	CLT u	BOOT β	BOOT u
Uniform	20	100	0.1	0.94	0.94	0.95	0.85	0.16	0.42
Uniform	20	100	1	0.94	0.94	0.95	0.85	0.23	0.50
Uniform	20	100	10	0.94	0.94	0.95	0.96	0.20	0.45
Uniform	100	20	0.1	0.94	0.94	0.92	0.85	0.46	0.55
Uniform	100	20	1	0.95	0.95	0.95	0.95	0.53	0.65
Uniform	100	20	10	0.94	0.94	0.95	0.95	0.24	0.32
Exponential	20	100	0.1	0.96	0.94	0.92	0.91	0.61	0.69
Exponential	20	100	1	0.95	0.95	0.96	0.95	0.62	0.71
Exponential	20	100	10	0.94	0.94	0.96	0.96	0.34	0.41
Exponential	100	20	0.1	0.94	0.95	0.92	0.85	0.39	0.49
Exponential	100	20	1	0.94	0.95	0.94	0.91	0.53	0.63
Exponential	100	20	10	0.94	0.94	0.95	0.96	0.40	0.52
Bimodal	20	100	0.1	0.95	0.96	0.94	0.91	0.61	0.66
Bimodal	20	100	1	0.94	0.95	0.96	0.95	0.66	0.73
Bimodal	20	100	10	0.93	0.94	0.96	0.96	0.39	0.46
Bimodal	100	20	0.1	0.94	0.94	0.92	0.85	0.37	0.47
Bimodal	100	20	1	0.95	0.95	0.94	0.90	0.51	0.60
Bimodal	100	20	10	0.94	0.95	0.96	0.96	0.40	0.61


Fig. 3 — Inferential results and true values for the coefficients β for a dataset generated under the Uniform scenario when λ = 0.1, n = 100 and d = 20.

To assess the calibration of estimated class probabilities, we generate a new set of 1000 observations under the true model as a “test” set for each instance of the simulation. We compute the class probabilities for each test case, given the posterior fit to the n = 20 or n = 100 training cases. We group the estimated probabilities into narrow bins of length 0.02 ([0, 0.02), [0.02, 0.04), etc.) and consider the proportion of cases with y = 1 in each bin. Figure 4 plots these proportions, aggregated across all conditions, vs. the midpoint of each bin. Using the posterior mean, the empirical proportions match the estimated probabilities perfectly, confirming that the model is well-calibrated and again validating the MCMC procedure. Using the posterior mode, the relationship deviates slightly but still tends to provide a reasonable fit in most cases.

Fig. 4 — Estimated probabilities vs. observed proportions on test observations, aggregated across all conditions, for different simulation scenarios using the posterior mean or mode.
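The binning procedure used for this calibration check can be sketched in Python as follows. The function name `calibration_table` and its return format are assumptions of this illustration; the bin width of 0.02 follows the description in the text, and a probability of exactly 1 is placed in the last bin.

```python
def calibration_table(probs, y, width=0.02):
    """Group estimated P(y = 1) into bins and compare with observed proportions.

    Returns a list of (bin midpoint, observed proportion with y = 1, count)
    for every non-empty bin. Values that fall exactly on a bin edge may land
    in either neighboring bin because of floating-point rounding.
    """
    n_bins = int(round(1.0 / width))
    counts = [0] * n_bins
    positives = [0] * n_bins
    for p, y_i in zip(probs, y):
        b = min(int(p / width), n_bins - 1)
        counts[b] += 1
        positives[b] += 1 if y_i == 1 else 0
    return [((b + 0.5) * width, positives[b] / counts[b], counts[b])
            for b in range(n_bins) if counts[b] > 0]


table = calibration_table([0.01, 0.015, 0.51, 0.99, 1.0], [-1, 1, 1, 1, 1])
for midpoint, proportion, count in table:
    print(round(midpoint, 2), proportion, count)
```

For a well-calibrated model, the observed proportion in each bin should be close to the bin midpoint once the bin contains enough test cases.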

8.2. Two-class multivariate normal simulation

Here we generate data according to a two-class multivariate normal distribution, in which each class has a different mean vector, as follows:

(1) Generate d-dimensional mean vectors μ₀ and μ₁, with independent entries from a N(0, τ²) distribution.
(2) Set y_i = −1 for i = 1, …, n / 2 and y_i = 1 for i = n / 2 + 1, …, n.
(3) For i = 1, …, n, generate x_i via
$x_{i} \sim \begin{cases} MVN (μ_{0}, I) & \text{if } y_{i} = - 1, \\ MVN (μ_{1}, I) & \text{if } y_{i} = 1. \end{cases}$

Note that this scenario differs from the ‘Bimodal’ case in Section 8.1 (11); for that scenario, membership in a given mixture component (defined by μ₀ and μ₁) does not necessarily correspond to the supervised class labels y_i. The present scenario does not precisely match our assumed model. The “true” (i.e., oracle) probabilistic link between y_i and x_i in this scenario corresponds to a logit link with coefficients that are proportional to the mean difference vector μ₀ – μ₁; as shown in Figure 1 the link for Bayesian DWD is not a logit, but it is close.
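A minimal Python sketch of this data-generating scheme and its oracle class probability is given here. The oracle formula is the standard result for two Gaussian classes with identity covariance and equal prior class probabilities (an assumption of this sketch, consistent with the balanced design): the log-odds of class 1 are (μ₁ − μ₀)ᵀx − (‖μ₁‖² − ‖μ₀‖²)/2. Function names are hypothetical.

```python
import math
import random


def simulate_two_class(n, d, tau, seed=0):
    """Generate the two-class design: first half y = -1, second half y = 1."""
    rng = random.Random(seed)
    mu0 = [rng.gauss(0.0, tau) for _ in range(d)]  # entries N(0, tau^2)
    mu1 = [rng.gauss(0.0, tau) for _ in range(d)]
    y = [-1] * (n // 2) + [1] * (n - n // 2)
    X_cols = [[rng.gauss(m, 1.0) for m in (mu0 if y_i == -1 else mu1)] for y_i in y]
    return X_cols, y, mu0, mu1


def oracle_prob(x, mu0, mu1):
    """Oracle P(y = 1 | x) under equal priors and identity covariance (logit link)."""
    logit = sum((m1 - m0) * v for m0, m1, v in zip(mu0, mu1, x))
    logit -= 0.5 * (sum(m * m for m in mu1) - sum(m * m for m in mu0))
    if logit >= 0:  # numerically stable logistic function
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


X_cols, y, mu0, mu1 = simulate_two_class(n=100, d=10, tau=0.5, seed=0)
print(oracle_prob(X_cols[0], mu0, mu1), y[0])
```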

We fix n = 100 and consider different levels of signal (τ = 0, 0.1, 0.2, 0.5) and different dimensions (d = 10, 100, 1000) as manipulated conditions. For each instance, we approximate the posterior separately with λ fixed at each of 15 possible values, which range from 1/128 to 128 and are equally spaced on a log scale. For each case we compute the Kullback-Leibler divergence (KL-divergence) of the estimated class probabilities from the oracle model. For each case we also generate a test set of 5000 samples and evaluate (1) the misclassification rate on the test samples using a posterior predictive probability of 0.5 as a threshold, and (2) the mean squared error (MSE) of the predicted class probabilities for class 1 (p_i) from the true class memberships as

MSE = \frac{1}{n_{test}} \sum_{i = 1}^{n_{test}} \left[{(1 - p_{i})}^{2} 1_{\{y_{i} = 1\}} + p_{i}^{2} 1_{\{y_{i} = - 1\}}\right] .

Figure 5 plots these metrics as a function of λ for the different conditions. In general, lower values of λ perform better when the distinguishing signal is stronger. This is intuitive, as λ can be considered an inverse scale parameter for β, and larger magnitudes for β correspond to probabilities closer to 0 and 1. The misclassification rate is robust to the value of λ, but its specification is important to achieve well-calibrated probabilities.
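The λ grid and the two test-set metrics can be written compactly in Python. In this sketch the MSE measures the squared distance between the predicted probability of class 1 and the 0/1 indicator of class 1, and the grid of 15 log-spaced values is taken to be the powers of 2 from 2⁻⁷ to 2⁷, which matches the stated endpoints and count. Function names are hypothetical.

```python
# 15 values from 1/128 to 128, equally spaced on a log scale (powers of 2).
lambda_grid = [2.0 ** k for k in range(-7, 8)]


def mse(probs, y):
    """Mean squared error of predicted P(y = 1) against true class membership."""
    total = 0.0
    for p, y_i in zip(probs, y):
        target = 1.0 if y_i == 1 else 0.0
        total += (target - p) ** 2
    return total / len(y)


def misclassification_rate(probs, y, threshold=0.5):
    """Proportion of test cases whose thresholded prediction disagrees with y."""
    errors = 0
    for p, y_i in zip(probs, y):
        pred = 1 if p > threshold else -1
        errors += 1 if pred != y_i else 0
    return errors / len(y)


probs = [0.9, 0.2, 0.6]
y = [1, -1, -1]
print(len(lambda_grid), mse(probs, y), misclassification_rate(probs, y))
```

An uninformative prediction of 0.5 for every case gives an MSE of 0.25, which is the value reported in Table 2 for the conditions with no signal (τ = 0).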

Fig. 5 — Different performance metrics (KL-divergence, test MSE, and test misclassification rate) are shown as a function of λ used for estimation under different conditions.

Motivated by the sensitivity of the estimated probabilities to λ, we also approximate the posterior with a uniform prior on λ as in Section 6.2. Table 2 summarizes the posterior mean for λ under each set of conditions, and compares the performance of the model with uniform prior on λ with that of the best performing fixed value of λ for each metric. The model with a uniform prior performs comparatively well across the different scenarios, and the posterior mean for λ appropriately ranges from very high values when the signal is nonexistent to low values when the signal is strong. However, for the conditions with d = 1000 and strong signal the posterior mean is substantially larger than the best performing values of λ; this is likely because the model fits well and perfectly classifies the data with probabilities close to 0 or 1 for a wide range of λ values (see Figure 5), so there is not strong evidence to distinguish between different values of λ in this range.

Table 2.

Comparison of model performance when λ is inferred with a uniform prior ( $\hat{λ}$ ) vs. the best performing fixed value of λ based on KL-divergence (λ_KL) or MSE (λ_MSE). Results for each set of conditions are averages over 10 replications, with standard deviations in parentheses.

d	τ	$\hat{λ}$	λ _KL	λ _MSE	$KL (\hat{λ})$	KL(λ_KL)	$MSE (\hat{λ})$	MSE(λ_MSE)
10	0	70.21 (2.5)	128.00	128.00	0.00 (0.00)	0.00 (0.00)	0.25 (0.00)	0.25 (0.00)
10	0.1	70.82 (3.2)	2.00	2.00	0.02 (0.01)	0.02 (0.01)	0.25 (0.00)	0.25 (0.00)
10	0.2	31.27 (26.1)	0.50	1.00	0.06 (0.03)	0.02 (0.01)	0.24 (0.02)	0.22 (0.02)
10	0.5	0.10 (0.14)	0.12	0.12	0.06 (0.04)	0.05 (0.04)	0.11 (0.03)	0.11 (0.03)
100	0	74.56 (6.4)	128.00	64.00	0.00 (0.00)	0.00 (0.00)	0.25 (0.00)	0.25 (0.00)
100	0.1	22.55 (32.4)	1.00	2.00	0.15 (0.04)	0.10 (0.02)	0.24 (0.01)	0.22 (0.01)
100	0.2	0.14 (0.43)	0.12	0.50	0.11 (0.03)	0.10 (0.03)	0.11 (0.03)	0.10 (0.02)
100	0.5	0.11 (0.03)	0.01	0.02	0.03 (0.00)	0.01 (0.01)	0.00 (0.00)	0.00 (0.00)
1000	0	44.79 (21.9)	128.00	128.00	0.02 (0.01)	0.00 (0.00)	0.25 (0.00)	0.25 (0.00)
1000	0.1	3.11 (4.3)	0.03	0.50	0.21 (0.08)	0.06 (0.04)	0.11 (0.03)	0.09 (0.01)
1000	0.2	0.63 (0.84)	0.01	0.03	0.07 (0.03)	0.01 (0.00)	0.00 (0.00)	0.00 (0.00)
1000	0.5	1.82 (1.1)	0.01	0.01	0.04 (0.01)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)

In Figure 6 we consider the overall calibration of the fitted probabilities across the different conditions, as in Section 8.1. This shows clearly that fixing λ at a large value (λ = 128) can yield overly conservative estimated probabilities and a relatively small value (λ = 1 / 128) can yield overly confident probabilities. The model in which λ is inferred with a uniform prior is mostly well-calibrated but does show some deviations between the estimated probabilities and empirical proportions. These deviations are likely in part due to the fact that the assumed probabilistic link does not perfectly match the true generative model.

Fig. 6 — Estimated probabilities vs. observed proportions on test observations, aggregated across all conditions, for λ = 1 / 128, λ = 128, or inferring λ with a uniform prior.
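
A calibration comparison of this kind can be sketched as below: estimated probabilities are grouped into bins, and the mean estimated probability in each bin is set against the observed proportion of events. The equal-width bins, the default of 10 bins, and the function name are assumptions for illustration; the binning actually used for the figures is described in Section 8.1 and may differ.

```python
def calibration_table(probs, outcomes, n_bins=10):
    """Compare estimated probabilities with observed proportions, bin by bin.

    probs[i]:    estimated probability of the event, in [0, 1].
    outcomes[i]: 1 if the event occurred for observation i, otherwise 0.
    Returns one row per non-empty bin:
    (bin_lower, bin_upper, count, mean_estimated_probability, observed_proportion).
    """
    bins = [[] for _ in range(n_bins)]
    for p, outcome in zip(probs, outcomes):
        # Equal-width bins (an assumption); p == 1.0 is placed in the last bin.
        k = min(int(p * n_bins), n_bins - 1)
        bins[k].append((p, outcome))

    rows = []
    for k, members in enumerate(bins):
        if not members:
            continue  # empty bins carry no calibration information
        count = len(members)
        mean_prob = sum(p for p, _ in members) / count
        observed = sum(o for _, o in members) / count
        rows.append((k / n_bins, (k + 1) / n_bins, count, mean_prob, observed))
    return rows
```

For a well-calibrated model the last two entries of each row should be close; overly conservative probabilities show observed proportions more extreme than the estimates, and overly confident probabilities show the reverse.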

8.3. Semi-supervised simulation

Here we generate data in which some of the class labels y_i may be unknown, to illustrate a semi-supervised approach. We consider two scenarios. For the first scenario, the observations x_i are generated exactly as in Section 8.2 with d = 100 and τ = 0.3. For the second scenario, d = 100 and entries of each x_i are generated independently from a N(0, 1) distribution; then class labels y_i are defined deterministically by the sign of $x_{i}^{T} β_{sep}$ , where the entries of β_sep are generated from a standard normal distribution. The key difference between these two scenarios, for our purpose, is that the first scenario (bimodal) has an underlying cluster structure that distinguishes the two classes whereas the second (unimodal) does not. So, we expect to improve performance by considering unlabeled data for the bimodal scenario but not the unimodal scenario.
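
The unimodal scenario is simple enough to sketch directly: independent N(0, 1) features, a standard normal β_sep, and labels given by the sign of the projection. The function name, the seed handling, and the convention of mapping a score of exactly zero to −1 (an event of probability zero) are assumptions for illustration. The bimodal scenario depends on the generative model of Section 8.2 and is not reproduced here.

```python
import random


def simulate_unimodal(n, d=100, seed=0):
    """Generate n observations for the unimodal scenario.

    Returns (X, y, beta_sep), where X[i] is a list of d independent N(0, 1)
    entries and y[i] is the sign of the inner product of X[i] and beta_sep.
    """
    rng = random.Random(seed)  # local generator so results are reproducible
    beta_sep = [rng.gauss(0.0, 1.0) for _ in range(d)]
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        score = sum(a * b for a, b in zip(x, beta_sep))
        X.append(x)
        # Labels are a deterministic function of the projection; there is no
        # cluster structure in X for unlabeled observations to reveal.
        y.append(1 if score > 0 else -1)
    return X, y, beta_sep
```

Because the marginal distribution of x_i is a single spherical normal here, unlabeled observations carry no information about the separating direction, which is the reason no gain is expected from them in this scenario.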

As manipulated conditions, we vary the number of observations with y_i observed (n_o = 10, 20, 100, 1000) and with y_i unobserved (n_u = 0, 10, 20, 100, 1000) in our training set, for each of the bimodal and unimodal scenarios. For each combination of conditions, we generate 20 datasets, use MCMC to approximate the posterior, and consider the resulting misclassification rates for a test set. The average test misclassification rates under each set of conditions are shown in Table 3. This shows that for the bimodal scenario a semi-supervised analysis with a large number of unlabeled observations can improve performance dramatically; the misclassification rates for n_o = 10 observed and n_u = 1000 unobserved are comparable to those with n_o = 1000 observed. As expected, including data without observed class labels does not help in the unimodal scenario; however, it does not lead to a substantial decrease in performance.

Table 3.

Test misclassification rates using Bayesian DWD with differing number of observations with y_i observed (n_o) or unobserved (n_u).

		N unobserved (n_u)
Scenario	N observed (n_o)	0	10	20	100	1000
Bimodal	10	0.164	0.138	0.145	0.081	0.019
Bimodal	20	0.077	0.077	0.080	0.053	0.014
Bimodal	100	0.038	0.035	0.032	0.036	0.020
Bimodal	1000	0.019	0.021	0.019	0.020	0.017
Unimodal	10	0.444	0.451	0.425	0.410	0.432
Unimodal	20	0.392	0.390	0.402	0.395	0.412
Unimodal	100	0.280	0.266	0.272	0.280	0.297
Unimodal	1000	0.098	0.102	0.091	0.104	0.113

8.4. High-dimensional comparison

Here we further explore the behavior of Bayesian DWD in high-dimensional settings, and compare with other Bayesian linear classifiers. Data are generated from a two-class multivariate normal mixture model as in Section 8.2, with n = 100, τ² = 0.25, and varying dimension d ∈ {10, 100, 1000, 10000}. In addition to Bayesian DWD, we consider two other fully Bayesian methods: (1) a Bayesian implementation of SVM using the data augmentation approach of Polson and Scott (2011) and (2) a logistic regression model with Cauchy priors on the coefficients as suggested in Gelman et al. (2008). The details of these implementations are given in Section D of the online supplement.

For each simulation and for each method, we compute the following:

Cor: The Pearson correlation of the estimated coefficients β with those corresponding to the true Bayes classifier μ₁ − μ₀, as a scale-invariant measure of similarity.
Coeff.Rat: The ratio of the average standard deviation of the coefficients across posterior draws over the standard deviation of the estimated coefficients,
$\frac{1}{d} \sum_{j = 1}^{d} sd ({β_{j}^{(1)}, \dots, β_{j}^{(T)}}) / sd ({{\hat{β}}_{1}, \dots, {\hat{β}}_{d}}),$

as a relative measure of posterior uncertainty in the coefficients.
Prob.Rat: The ratio of the average standard deviation of the class probabilities across posterior samples over the standard deviation of the estimated class probabilities,
$\frac{1}{n} \sum_{i = 1}^{n} sd ({p_{i}^{(1)}, \dots, p_{i}^{(T)}}) / sd ({{\hat{p}}_{1}, \dots, {\hat{p}}_{n}}),$

as a relative measure of posterior uncertainty in the probabilities.
Misclass: The misclassification rate for n_test = 500 new samples.
LogLike: The log-likelihood of the estimated probabilities for the 500 new samples, as $\sum_{i = 1}^{n_{test}} \left[ log (p_{i}) 1_{\{y_{i} = -1\}} + log (1 - p_{i}) 1_{\{y_{i} = 1\}} \right]$ .
Time: The computing time in seconds, on a desktop with an AMD Ryzen Threadripper 3960X 24-Core Processor and 128GB RAM.
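
The correlation and the two uncertainty ratios in the list above can be sketched from a matrix of posterior draws as follows. This is an illustrative sketch only: it assumes the point estimates are posterior means over the draws, it uses the sample standard deviation (divisor T − 1), and the function names are hypothetical rather than taken from the author's code.

```python
import math
import statistics


def pearson(u, v):
    """Pearson correlation between two equal-length sequences (the Cor measure)."""
    mean_u, mean_v = statistics.fmean(u), statistics.fmean(v)
    num = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    den = math.sqrt(sum((a - mean_u) ** 2 for a in u)
                    * sum((b - mean_v) ** 2 for b in v))
    return num / den


def uncertainty_ratio(draws):
    """Average posterior sd of each quantity over the sd of the point estimates.

    draws[t][j] is the value of quantity j (a coefficient for Coeff.Rat, a
    class probability for Prob.Rat) in posterior draw t.
    """
    n_items = len(draws[0])
    columns = [[row[j] for row in draws] for j in range(n_items)]
    # Numerator: sd across draws, averaged over the quantities.
    within = statistics.fmean([statistics.stdev(col) for col in columns])
    # Denominator: sd across quantities of the point estimates
    # (posterior means here, which is an assumption).
    estimates = [statistics.fmean(col) for col in columns]
    return within / statistics.stdev(estimates)
```

A ratio above 1 indicates that the spread of a typical quantity across posterior draws exceeds the spread of the point estimates across quantities, so the same function serves both Coeff.Rat and Prob.Rat depending on which draws are supplied.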

The results, averaged over 100 replications for each scenario, are shown in Table 4. The misclassification rate for Bayesian DWD converges to 0 as the dimension d increases. This is not surprising, as the classification consistency of DWD as d → ∞ has been established for moderate signal-to-noise ratios (Hall et al. 2005; Qiao et al. 2010). However, the correlation of the estimated coefficients to that of the Bayes classifier does not improve as d increases. This is also supported by high-dimensional asymptotic theory; the coefficients of the DWD classifier only converge to the Bayes classifier as d → ∞ if the signal is much higher than the noise (Qiao et al. 2010). As measured by Coeff.Rat and Prob.Rat, as d increases the relative level of posterior uncertainty increases for the coefficients and decreases for the class probabilities. A similar trend is apparent for Bayesian SVM and logistic regression. The logistic regression model tends to suffer from overfitting for moderate and high-dimensional scenarios, performing worse than Bayesian DWD. The performance of Bayesian SVM is more comparable across most metrics, but the log-likelihood of the fitted probabilities is substantially worse when the test samples are not classified perfectly, indicating that the probabilities under Bayesian DWD are better calibrated. The log-likelihood consistently improves with larger d under Bayesian DWD, but gets progressively worse for logistic regression. Bayesian DWD is also more computationally efficient, and computing time scales linearly with d. The implementations of Bayesian SVM and logistic regression were computationally infeasible when d = 10000, and so these results are not shown.

Table 4.

Comparison of posterior uncertainty under different dimensions d and performance for Bayesian DWD, Bayesian SVM and Bayesian logistic regression. The mean over 100 replications is shown for each measure, with standard deviation in parentheses.

d	Cor	Coeff.Rat	Prob.Rat	Misclass	LogLike	Time
Bayesian DWD
10	0.845 (0.11)	0.77 (0.17)	0.77 (0.22)	0.332 (0.05)	−302.5 (23.0)	0.24 (0.02)
100	0.841 (0.03)	1.43 (0.02)	0.50 (0.05)	0.072 (0.02)	−134.5 (17.2)	1.79 (0.03)
1000	0.760 (0.01)	4.16 (0.04)	0.30 (0.01)	0.000 (0.00)	−11.52 (1.1)	17.4 (0.31)
10000	0.424 (0.01)	7.52 (0.04)	0.18 (0.00)	0.000 (0.00)	−0.03 (0.01)	173.4 (0.55)
Bayesian SVM
10	0.804 (0.13)	1.05 (0.02)	0.276 (0.07)	0.341 (0.05)	−534.2 (81.8)	1.77 (0.07)
100	0.771 (0.04)	1.59 (0.05)	0.072 (0.01)	0.089 (0.02)	−486.0 (221)	8.41 (0.07)
1000	0.831 (0.01)	4.91 (0.08)	0.055 (0.01)	0.00 (0.00)	−0.00 (0.00)	4307 (77.9)
Bayesian Logistic
10	0.807 (0.14)	1.44 (0.08)	0.509 (0.14)	0.336 (0.05)	−333 (36)	0.80 (0.28)
100	0.654 (0.06)	1.41 (0.10)	0.084 (0.01)	0.131 (0.04)	−2519 (1331)	13.6 (2.6)
1000	0.271 (0.04)	1.64 (0.21)	0.021 (0.00)	0.072 (0.03)	−4586 (2447)	730.2 (15.5)

9. Application to breast cancer genomics

As a real data application, we consider classifying breast cancer (BRCA) tumor subtypes using data on microRNA (miRNA) abundance that are publicly available from the Cancer Genome Atlas (TCGA). We use miRNA-Seq data for n = 1047 BRCA tumor samples from different individuals, which were previously curated for a pan-cancer clustering application (Hoadley et al. 2018). The tumors are classified into one of 4 canonical subtypes based on the PAM50 classifier for gene (mRNA) expression (Parker et al. 2009): Luminal A (LumA) (n = 572), Luminal B (LumB) (n = 203), Her2 (n = 83), and Basal (n = 189). Previous exploratory analyses have shown that miRNA data share some of the same BRCA subtype distinctions as gene expression (Park and Lock 2020; Lock and Dunson 2013). Here, we examine the power of miRNAs to distinguish the subtypes using a supervised approach.

The miRNA expression read counts were log(1 + x)-transformed and centered to have mean 0. After transformation, we keep miRNAs with standard deviation greater than 0.5, leaving d = 489 miRNAs. We consider separate classification tasks to discriminate each pair of subtypes, giving 6 comparisons. For each comparison, we apply (1) Bayesian DWD (BayesDWD) with uniform prior on λ, (2) DWD (i.e., the posterior mode) via the R package sdwd, (3) SVM via the package e1071, (4) random forest classification (Breiman 2001) via the package randomForest, (5) Bayesian SVM, as described in Section D of the online supplement, and (6) Bayesian logistic regression, as described in Section D of the online supplement. We assess each method by the average misclassification rate on the test sets under 10-fold cross-validation. The results are shown in Table 5. Bayesian DWD, DWD, and SVM had similar classification performance across the different comparisons, random forests and Bayesian SVM had slightly worse performance across most comparisons, and Bayesian logistic regression performed the worst for five of the six comparisons. For methods that give a class probability we consider the calibration of the estimated probabilities on the test sets vs. their empirical class proportions, as in Section 8.1. Figure 7 shows that the probabilities inferred for Bayesian DWD are overall well-calibrated. The probabilities under the random forest are also reasonable, but tend toward being over-conservative. The probabilities for Bayesian SVM and Bayesian logistic regression are not well-calibrated. The majority of inferred class probabilities are below 0.02 or above 0.98 for both Bayesian SVM (97%) and Bayesian logistic regression (98%), leaving limited data for remaining bins.
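
The preprocessing and the enumeration of pairwise comparisons described above can be sketched as follows. This is a sketch under stated assumptions, not the author's pipeline: it assumes a natural logarithm, centering of each miRNA separately, and the sample standard deviation for the filter; the function name and the layout of the input are hypothetical.

```python
import math
import statistics
from itertools import combinations


def preprocess_counts(counts, sd_threshold=0.5):
    """Transform, center, and filter a samples-by-features table of read counts.

    counts[i][j] is the read count for sample i and feature (miRNA) j.
    Returns (X, kept): the processed table restricted to the retained
    features, and the original column indices of those features.
    """
    n_samples, n_features = len(counts), len(counts[0])
    kept, columns = [], []
    for j in range(n_features):
        # log(1 + x) transform (natural log assumed), then center to mean 0.
        col = [math.log1p(counts[i][j]) for i in range(n_samples)]
        mean = statistics.fmean(col)
        col = [v - mean for v in col]
        # Keep features whose standard deviation exceeds the threshold.
        if statistics.stdev(col) > sd_threshold:
            kept.append(j)
            columns.append(col)
    X = [[col[i] for col in columns] for i in range(n_samples)]
    return X, kept


# The four subtypes give 4-choose-2 = 6 pairwise classification tasks.
subtypes = ["LumA", "LumB", "Her2", "Basal"]
comparisons = list(combinations(subtypes, 2))
```

Each pair in `comparisons` corresponds to one row of Table 5, with the classifier fit only to the samples belonging to the two subtypes in that pair.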

Table 5.

Test misclassification rates for different subtype comparisons under cross-validation using Bayesian DWD (BayesDWD), DWD, Bayesian SVM (BayesSVM), SVM, random forests (RF), and Bayesian logistic regression (BayesLogit).

Comparison	BayesDWD	DWD	BayesSVM	SVM	RF	BayesLogit
LumA vs. LumB	0.168	0.156	0.177	0.158	0.186	0.195
LumA vs. Her2	0.054	0.045	0.055	0.059	0.070	0.093
LumA vs. Basal	0.026	0.028	0.037	0.025	0.030	0.100
LumB vs. Her2	0.129	0.124	0.157	0.155	0.169	0.192
LumB vs. Basal	0.046	0.043	0.070	0.044	0.047	0.079
Her2 vs. Basal	0.066	0.064	0.050	0.053	0.075	0.054

Fig. 7 — Estimated probabilities vs. observed proportions on test observations, aggregated across all comparisons, using Bayesian DWD, random forests (RF), Bayesian SVM, and Bayesian logistic regression.

To illustrate the Bayesian DWD results, we focus on the comparison with the best classification accuracy from Table 5 (LumA vs. Basal) and the comparison with the worst classification accuracy (LumA vs. LumB). Figure 8 plots the posterior mean of the DWD scores and their 95% credible intervals for each comparison. Note that the DWD scores are almost perfectly separated for LumA vs. Basal, but not for LumA vs. LumB. The credible intervals for each plot help to understand the level of uncertainty in the discriminating projections.

Fig. 8 — Mean DWD scores and associated 95% credible intervals for LumA vs. Basal and for LumA vs. LumB.
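
Summaries of this kind can be computed from posterior draws of the coefficients, as in the following sketch. It assumes the score for a sample is the projection of its feature vector onto each drawn coefficient vector with any intercept omitted, and that the interval is equal-tailed and obtained from interpolated empirical quantiles; these choices and the function names are assumptions for illustration.

```python
import math
import statistics


def quantile(sorted_vals, q):
    """Empirical quantile by linear interpolation between order statistics."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def score_summaries(X, beta_draws, level=0.95):
    """Posterior mean and equal-tailed credible interval of each sample's score.

    X[i] is the feature vector of sample i; beta_draws[t] is the coefficient
    vector from posterior draw t. Returns a list of (mean, lower, upper).
    """
    alpha = (1.0 - level) / 2.0
    summaries = []
    for x in X:
        # One score per posterior draw: the projection of x onto that draw.
        scores = sorted(sum(a * b for a, b in zip(x, beta)) for beta in beta_draws)
        summaries.append((statistics.fmean(scores),
                          quantile(scores, alpha),
                          quantile(scores, 1.0 - alpha)))
    return summaries
```

Samples whose interval covers the classification boundary are those for which the assigned class is uncertain under the posterior, which is the pattern expected to be more frequent for LumA vs. LumB than for LumA vs. Basal.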

10. Discussion

The Bayesian formulation of DWD is straightforward to modify or extend via changes to the likelihood or prior. As a future direction, we are interested in extending the model to accommodate multiple sources of data (e.g., multiple ‘omics or clinical sources) or data with multiway structure (Lyu et al., 2017). In this article we have focused on the original and most widely used version of DWD, but the Bayesian framework may be extended for other versions including multi-class DWD (Huang et al., 2013), weighted DWD (Qiao et al., 2010), sparse DWD (Wang and Zou, 2016), and kernel DWD (Wang and Zou, 2018, 2019). Additionally, a generalized version of DWD was introduced by Hall et al. (2005), wherein the distances r_i in (2) are raised to a power q. The choice of q modifies the loss function V (·) in (3) (Wang and Zou, 2018), which in turn affects the probabilistic link (4.1); the Bayesian formulation is potentially very useful in this context, because one can put a prior on q and infer it within the model.

Computing time and resources can limit a fully Bayesian treatment of DWD for some applications, as discussed in Section 7.3. In such cases, the DWD solution can still be interpreted as a posterior mode to give point estimates for class probabilities, and the normal approximation in Section 5 can be used to quickly assess uncertainty in the estimated coefficients. Alternative computational approaches that facilitate a fully Bayesian treatment for massive datasets are worth pursuing.

We have established asymptotic normality of the posterior about its mode as n → ∞; large-sample consistency results have been established for traditional DWD (Qiao et al., 2010), which is the posterior mode in the Bayesian context. The asymptotic behavior of DWD in high dimensions (d → ∞) has also been studied extensively (Hall et al., 2005; Huang and Yang, 2020, 2021), and these results similarly apply to the posterior mode in our context. Further high-dimensional asymptotic results that describe the behavior of the full posterior distribution are another direction worth pursuing.

Supplementary Material

Supp 1: NIHMS1839266-supplement-Supp_1.zip (6.8 MB, zip)

Supp 2: NIHMS1839266-supplement-Supp_2.pdf (147.4 KB, pdf)

Acknowledgements

This work was supported by the National Institutes of Health (NIH) grant R01-GM130622.

Footnotes

Supplementary material

A supplemental document available online contains proofs for theoretical results, a description of the MCMC algorithm, a theoretical exploration of the conditional distribution of X given y under the model, and additional details on our implementation of Bayesian SVM and Bayesian logistic regression. A zipped folder contains code to reproduce all of the results in this manuscript.

References

Breiman L (2001), ‘Random forests’, Machine learning 45(1), 5–32.
Carlin BP and Louis TA (2008), Bayesian methods for data analysis, CRC Press.
Cortes C and Vapnik V (1995), ‘Support-vector networks’, Machine learning 20(3), 273–297.
Gelman A, Jakulin A, Pittau MG and Su Y-S (2008), ‘A weakly informative default prior distribution for logistic and other regression models’, The annals of applied statistics 2(4), 1360–1383.
Ghosal S (1997), A review of consistency and convergence of posterior distribution, in ‘Varanashi Symposium in Bayesian Inference, Banaras Hindu University’.
Hall P, Marron JS and Neeman A (2005), ‘Geometric representation of high dimension, low sample size data’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67(3), 427–444.
Henao R, Yuan X and Carin L (2014), Bayesian nonlinear support vector machines and discriminative factor modeling, in ‘Advances in neural information processing systems’, pp. 1754–1762.
Hoadley KA, Yau C, Hinoue T, Wolf DM, Lazar AJ, Drill E, Shen R, Taylor AM, Cherniack AD, Thorsson V et al. (2018), ‘Cell-of-origin patterns dominate the molecular classification of 10,000 tumors from 33 types of cancer’, Cell 173(2), 291–304.
Hsiang T (1975), ‘A Bayesian view on ridge regression’, Journal of the Royal Statistical Society: Series D (The Statistician) 24(4), 267–268.
Huang H, Liu Y, Du Y, Perou CM, Hayes DN, Todd MJ and Marron JS (2013), ‘Multiclass distance-weighted discrimination’, Journal of Computational and Graphical Statistics 22(4), 953–969.
Huang H, Lu X, Liu Y, Haaland P and Marron J (2012), ‘R/DWD: distance-weighted discrimination for classification, visualization and batch adjustment’, Bioinformatics 28(8), 1182–1183.
Huang H and Yang Q (2020), ‘Large scale analysis of generalization error in learning using margin based classification methods’, Journal of Statistical Mechanics: Theory and Experiment 2020(10), 103407.
Huang H and Yang Q (2021), ‘Large dimensional analysis of general margin based classification methods’, Journal of Statistical Mechanics: Theory and Experiment 2021(11), 113401.
Liu Y, Zhang HH and Wu Y (2011), ‘Hard or soft classification? large-margin unified machines’, Journal of the American Statistical Association 106(493), 166–177.
Lock EF and Dunson DB (2013), ‘Bayesian consensus clustering’, Bioinformatics 29(20), 2610–2616.
Lyu T, Lock EF and Eberly LE (2017), ‘Discriminating sample groups with multi-way data’, Biostatistics 18(3), 434–450.
Marron J, Todd MJ and Ahn J (2007), ‘Distance-weighted discrimination’, Journal of the American Statistical Association 102(480), 1267–1271.
Park JY and Lock EF (2020), ‘Integrative factorization of bidimensionally linked matrices’, Biometrics 76(1), 61–74.
Park T and Casella G (2008), ‘The Bayesian lasso’, Journal of the American Statistical Association 103(482), 681–686.
Parker JS, Mullins M, Cheang MC, Leung S, Voduc D, Vickery T, Davies S, Fauron C, He X, Hu Z et al. (2009), ‘Supervised risk predictor of breast cancer based on intrinsic subtypes’, Journal of clinical oncology 27(8), 1160.
Polson NG and Scott SL (2011), ‘Data augmentation for support vector machines’, Bayesian Analysis 6(1), 1–23.
Qiao X, Zhang HH, Liu Y, Todd MJ and Marron JS (2010), ‘Weighted distance weighted discrimination and its asymptotic properties’, Journal of the American Statistical Association 105(489), 401–414.
Qiao X and Zhang L (2015), ‘Flexible high-dimensional classification machines and their asymptotic properties’, The Journal of Machine Learning Research 16(1), 1547–1572.
Sollich P (2002), ‘Bayesian methods for support vector machines: Evidence and predictive class probabilities’, Machine learning 46(1–3), 21–52.
Sun D, Tsutakawa RK and He Z (2001), ‘Propriety of posteriors with improper priors in hierarchical linear mixed models’, Statistica Sinica pp. 77–95.
Taraldsen G and Lindqvist BH (2010), ‘Improper priors are not improper’, The American Statistician 64(2), 154–158.
Van Engelen JE and Hoos HH (2020), ‘A survey on semi-supervised learning’, Machine Learning 109(2), 373–440.
Wang B and Zou H (2016), ‘Sparse distance weighted discrimination’, Journal of Computational and Graphical Statistics 25(3), 826–838.
Wang B and Zou H (2018), ‘Another look at distance-weighted discrimination’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80(1), 177–198.
Wang B and Zou H (2019), ‘A multicategory kernel distance weighted discrimination method for multiclass classification’, Technometrics 61(3), 396–408.
Wei S, Lee C, Wichers L and Marron J (2016), ‘Direction-projection-permutation for high-dimensional hypothesis tests’, Journal of Computational and Graphical Statistics 25(2), 549–569.
Witten DM and Tibshirani R (2011), ‘Penalized classification using Fisher’s linear discriminant’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73(5), 753–772.
