Biometrika. 2020 Jun 17;107(4):1005–1012. doi: 10.1093/biomet/asaa035

Efficient posterior sampling for high-dimensional imbalanced logistic regression

Deborshee Sen 1, Matthias Sachs 2, Jianfeng Lu 2, David B Dunson 1
PMCID: PMC7799181  PMID: 33462537

Summary

Classification with high-dimensional data is of widespread interest and often involves dealing with imbalanced data. Bayesian classification approaches are hampered by the fact that current Markov chain Monte Carlo algorithms for posterior computation become inefficient as the number $p$ of predictors or the number $n$ of subjects to classify gets large, because of the increasing computational time per step and worsening mixing rates. One strategy is to employ a gradient-based sampler to improve mixing while using data subsamples to reduce the per-step computational complexity. However, the usual subsampling breaks down when applied to imbalanced data. Instead, we generalize piecewise-deterministic Markov chain Monte Carlo algorithms to include importance-weighted and mini-batch subsampling. These maintain the correct stationary distribution with arbitrarily small subsamples and substantially outperform current competitors. We provide theoretical support for the proposed approach and demonstrate its performance gains in simulated data examples and an application to cancer data.

Keywords: Imbalanced data, Logistic regression, Piecewise-deterministic Markov process, Scalable inference, Subsampling

1. Introduction

In designing algorithms for large datasets, much of the focus has been on optimization methods that produce a point estimate with no characterization of uncertainty. This motivates the development of scalable Bayesian algorithms. As variational methods and related analytic approximations lack theoretical support and can be inaccurate, this note focuses on posterior sampling algorithms.

One such class of methods consists of divide-and-conquer Markov chain Monte Carlo algorithms, which divide the data into chunks, run Markov chain Monte Carlo independently for each chunk, and then combine the samples (Li et al., 2017; Srivastava et al., 2018). However, combining samples inevitably leads to some bias, and accuracy theorems require sample sizes to increase within each subset.

An alternative strategy uses subsamples to approximate transition probabilities and to reduce bottlenecks in calculating likelihoods and gradients (Welling & Teh, 2011). Such approaches typically rely on uniform random subsamples, which can be highly inefficient, as noted in an expanding frequentist literature on biased subsampling (Fithian & Hastie, 2014; Ting & Brochu, 2018). The Bayesian literature has largely overlooked the use of biased subsampling in efficient algorithm design, though recent coreset approaches address a related problem (Huggins et al., 2016). A problem with subsampling Markov chain Monte Carlo is that it is almost impossible to preserve the correct invariant distribution. While there has been work on quantifying the error (Johndrow et al., 2017; Alquier et al., 2016), it is usually difficult to do so in practice. The pseudo-marginal approach of Andrieu & Roberts (2009) offers a potential solution, but it is generally impossible to obtain the required unbiased estimators of likelihoods using subsamples (Jacob & Thiery, 2015).

A promising recent direction has been the use of nonreversible samplers with subsampling within the framework of piecewise-deterministic Markov processes (Bouchard-Côté et al., 2018; Fearnhead et al., 2018; Bierkens et al., 2019). These approaches use the gradient of the loglikelihood, which can be replaced by a subsample-based unbiased estimator so that the exactly correct invariant distribution is maintained. In this note our goal is to improve the efficiency of such approaches by using nonuniform subsampling motivated concretely by logistic regression.

2. Logistic regression with sparse imbalanced data

2.1. Model

We consider the logistic regression model

$$\mathrm{pr}(y = 1 \mid x, \xi) = \frac{1}{1 + \exp(-x^{\mathrm{T}}\xi)} \qquad (1)$$

where Inline graphic is the response, Inline graphic are predictors, and Inline graphic are coefficients for the predictors. Consider data Inline graphic from model (1), where Inline graphic. For a prior distribution Inline graphic on Inline graphic, the posterior distribution is Inline graphic, where Inline graphic and Inline graphic denotes the potential function. A popular algorithm for sampling from Inline graphic is Pólya–Gamma data augmentation (Polson et al., 2013); however, this performs poorly if there is a large imbalance in the class labels Inline graphic (Johndrow et al., 2019). Similar issues arise when the Inline graphic are imbalanced. Logistic regression is routinely used in a wide range of fields and such imbalance issues are extremely common, giving rise to a great need for more efficient algorithms. While standard Metropolis–Hastings algorithms not relying on data augmentation can perform well despite imbalanced data in settings where both Inline graphic and Inline graphic are small (Johndrow et al., 2019), problems arise in scaling to large datasets due to the increasing computational time per step and slow mixing.

2.2. The zig-zag process

The zig-zag process (Bierkens et al., 2019) is a piecewise-deterministic Markov process that is particularly useful for logistic regression. It is a continuous-time stochastic process Inline graphic on the augmented space Inline graphic, where Inline graphic may be understood as the position and Inline graphic the velocity of the process at time Inline graphic. Under fairly mild conditions, the zig-zag process is ergodic with respect to the product measure Inline graphic, where Inline graphic is the uniform measure on Inline graphic. In other words, Inline graphic holds almost surely for any Inline graphic-integrable function Inline graphic.

For a starting point Inline graphic and velocity Inline graphic, the zig-zag process evolves deterministically according to

$$\xi(t) = \xi + \theta t, \qquad \theta(t) = \theta \qquad (2)$$

At a random time Inline graphic, a bouncing event flips the sign of one component of the velocity Inline graphic. The process then evolves according to (2) with the new velocity until the next change in velocity. The time Inline graphic is the first arrival time of Inline graphic independent Poisson processes with intensity functions Inline graphic; that is, Inline graphic with Inline graphic. The sign flip applies Inline graphic to Inline graphic, with Inline graphic if Inline graphic and Inline graphic if Inline graphic. The intensity functions are of the form Inline graphic, where Inline graphic is a rate function. A sufficient condition for the zig-zag process to preserve Inline graphic as its invariant distribution is the existence of nonnegative functions Inline graphic such that Inline graphic  Inline graphic; here Inline graphic denotes the positive part of Inline graphic. The Inline graphic are known as refreshment rates.
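Because the position moves in a straight line between velocity changes, time averages of simple functions can be computed exactly from the event times and the positions at those times. The following is a minimal sketch for the coordinate functions only; the names `time_average`, `times` and `positions` are illustrative and do not come from the paper.

```python
def time_average(times, positions):
    """Time average of each coordinate along a piecewise-linear path.

    times:     nondecreasing event times t_0 <= t_1 <= ... <= t_K
    positions: positions[k] is the location (list of floats) at times[k]
    """
    dim = len(positions[0])
    total = [0.0] * dim
    for k in range(len(times) - 1):
        dt = times[k + 1] - times[k]
        for i in range(dim):
            # Integral of a linear segment = duration * midpoint value.
            total[i] += dt * 0.5 * (positions[k][i] + positions[k + 1][i])
    horizon = times[-1] - times[0]
    return [s / horizon for s in total]
```

For a general integrand the segment integrals would have to be evaluated numerically, or the path discretized at regular time intervals.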

If Inline graphic has a simple closed form, the arrival times Inline graphic can be sampled as Inline graphic for Inline graphic; otherwise, the Inline graphic are obtained via Poisson thinning (Lewis & Shedler, 1979). Assume that we have continuous functions Inline graphic such that Inline graphic  Inline graphic; here Inline graphic are upper computational bounds. Let Inline graphic denote the first arrival times of nonhomogeneous Poisson processes with rates Inline graphic, respectively, and let Inline graphic. A zig-zag process with intensity Inline graphic is still obtained if Inline graphic is evolved according to (2) for time Inline graphic instead of Inline graphic, and the sign of Inline graphic is flipped at time Inline graphic with probability Inline graphic.
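The thinning step can be illustrated with a short sketch. It assumes the simplest case, in which the intensity along the trajectory is dominated by a positive constant, whereas the text allows general continuous bounding functions; the function and argument names are hypothetical.

```python
import math
import random

def first_arrival_by_thinning(rate, bound, rng, t_max=math.inf):
    """First arrival time of a Poisson process with intensity rate(t).

    Assumes 0 <= rate(t) <= bound for all t >= 0, with bound a positive
    constant. Returns math.inf if no event is accepted before t_max.
    """
    t = 0.0
    while True:
        # Candidate event from the dominating homogeneous process.
        t += rng.expovariate(bound)
        if t > t_max:
            return math.inf
        # Keep the candidate with probability rate(t) / bound.
        if rng.random() < rate(t) / bound:
            return t
```

A tighter bound means fewer rejected candidates, which is why the size of the computational bounds directly affects the cost per unit of simulated time.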

The subsampling approach of Bierkens et al. (2019) uses uniform subsampling of a single data point to obtain an unbiased estimate of the Inline graphicth partial derivative of the potential function Inline graphic, where Inline graphic is from the prior and Inline graphic with Inline graphic  Inline graphic is from the likelihood. Their subsampling algorithm preserves the correct stationary distribution. Bierkens et al. (2019) considered estimates Inline graphic such that Inline graphic, where Inline graphic indexes the sampled data point. This is used to construct a stochastic rate function as Inline graphic. By using upper bounds satisfying Inline graphic for all Inline graphic, the rate functions Inline graphic can be replaced by stochastic versions Inline graphic, with Inline graphic being resampled at every unthinned event.

In addition, control variates can be used to reduce the variance of the estimate Inline graphic, which can lead to dramatic increases in sampling efficiency when the posterior is concentrated around a reference point (Bierkens et al., 2019). Here we use isotropic Gaussian priors and focus on situations where either Inline graphic is large relative to Inline graphic or the covariates and/or responses are imbalanced. In such situations, the posterior is not sufficiently concentrated around a reference point for control variates to perform efficiently. We demonstrate this numerically in § 4.2 for imbalanced responses, and the Supplementary Material contains similar experiments for sparse covariates. For this reason and to save space, our discussion in this note focuses on subsampling techniques that do not rely on control variates; the techniques developed can be combined with the use of control variates as detailed in the Supplementary Material.

3. Improved subsampling

3.1. General framework

We introduce a generalized version of the zig-zag sampler. Our motivation is (i) to increase the sampling efficiency and (ii) to simplify the construction of upper bounds. We achieve (ii) by letting the Poisson process that determines bouncing times in component Inline graphic be a superposition, i.e., sum, of two independent Poisson processes with state-dependent bouncing rates Inline graphic and Inline graphic. This construction allows Poisson thinning of each process separately, which decouples the problem of constructing upper bounds to that of constructing suitable bounds for the prior and for the likelihood. We achieve (i) by using general forms of the estimator Inline graphic in the Poisson thinning step obtained through nonuniform subsampling.

The resulting algorithm is presented as Algorithm 1, where Inline graphic and Inline graphic, assuming that Inline graphic, with Inline graphic, is an unbiased estimator of Inline graphic and Inline graphic is such that for all Inline graphic and Inline graphic, Inline graphic. To keep the presentation simple, we do not explicitly include the Poisson thinning step for the prior in the algorithm. The state-dependent bouncing rate of the resulting zig-zag process is Inline graphic, with Inline graphic having the explicit form Inline graphic. General results on piecewise-deterministic Markov processes imply that such a zig-zag process preserves the target measure Inline graphic (Fearnhead et al., 2018); we nonetheless provide a proof in the Supplementary Material for a self-contained presentation.

Algorithm 1.

Zig-zag algorithm with generalized subsampling.

Input: starting point Inline graphic, initial velocity Inline graphic, maximum number of bouncing attempts Inline graphic.

 Set Inline graphic.

 for Inline graphic do

  Draw Inline graphic and Inline graphic such that Inline graphic and Inline graphic

  Set Inline graphic where Inline graphic and Inline graphic.

  Evolve position: Inline graphic.

  Draw Inline graphic and Inline graphic.

  if Inline graphic then Inline graphic

  else if Inline graphic then Inline graphic

  else Inline graphic

 end for

Output: the path of a zig-zag process specified by skeleton points Inline graphic and bouncing times Inline graphic.
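As a rough illustration of how Algorithm 1 might look in code, the sketch below specializes it to logistic regression under several assumptions that are ours rather than the paper's: an isotropic Gaussian prior with variance `sigma2`, a single subsampled index per attempt, per-observation potentials $\log\{1+\exp(x^{\mathrm{T}}\xi)\} - y\,x^{\mathrm{T}}\xi$, sampling weights proportional to the absolute covariate values, and prior bouncing times drawn exactly by inversion instead of by thinning. All function and variable names are hypothetical.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def zigzag_importance(X, y, sigma2, n_attempts, seed=0):
    """Skeleton of a zig-zag path for logistic regression, N(0, sigma2 I) prior."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    # Bounds M[i] = sum_j |x_i^j|; sampling weights proportional to |x_i^j|.
    absx = [[abs(X[j][i]) for j in range(n)] for i in range(p)]
    M = [sum(col) for col in absx]
    xi, theta, t = [0.0] * p, [1.0] * p, 0.0
    times, points = [t], [xi[:]]
    for _ in range(n_attempts):
        tau, i0, from_prior = math.inf, 0, True
        for i in range(p):
            a = theta[i] * xi[i]
            # Prior rate ((a + s) / sigma2)^+ along the path, inverted exactly.
            e = rng.expovariate(1.0)
            t_prior = -a + math.sqrt(max(a, 0.0) ** 2 + 2.0 * sigma2 * e)
            # Likelihood rate is dominated by the constant M[i].
            t_lik = rng.expovariate(M[i]) if M[i] > 0 else math.inf
            for cand, is_prior in ((t_prior, True), (t_lik, False)):
                if cand < tau:
                    tau, i0, from_prior = cand, i, is_prior
        # Move along the current velocity until the attempted bounce.
        xi = [xi[i] + theta[i] * tau for i in range(p)]
        t += tau
        if from_prior:
            flip = True  # exact prior event, no thinning needed
        else:
            j = rng.choices(range(n), weights=absx[i0])[0]
            resid = sigmoid(sum(X[j][i] * xi[i] for i in range(p))) - y[j]
            # Importance-weighted estimate x_{i0}^j * resid / w, w = |x_{i0}^j| / M[i0].
            estimate = math.copysign(1.0, X[j][i0]) * resid * M[i0]
            flip = rng.random() < max(theta[i0] * estimate, 0.0) / M[i0]
        if flip:
            theta[i0] = -theta[i0]
        times.append(t)
        points.append(xi[:])
    return times, points
```

The returned lists contain one entry per bouncing attempt, including attempts at which no velocity component was flipped; the path between consecutive entries is linear either way.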

Although the focus of this work is on sampling from the Bayesian logistic regression problem presented in § 2.1, the approach can be readily applied to situations where the following assumption on the terms Inline graphic in the loglikelihood holds.

Assumption 1.

The partial derivatives of Inline graphic are bounded; that is, there exist constants Inline graphic such that for all Inline graphic, Inline graphic.

For the logistic regression problem considered, Bierkens et al. (2019) showed that Assumption 1 is satisfied with

graphic file with name Equation3.gif

To keep things simple, we consider the prior to be Inline graphic; we discuss other choices of priors in the Supplementary Material. Then we have that Inline graphic.
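The boundedness in Assumption 1 can be checked numerically for logistic regression. The sketch below assumes that the per-observation potential is the unscaled negative loglikelihood $\log\{1+\exp(x^{\mathrm{T}}\xi)\} - y\,x^{\mathrm{T}}\xi$, whose $i$th partial derivative is bounded in absolute value by $|x_i|$, and that the prior potential is $\|\xi\|^2/(2\sigma^2)$; the scaling conventions and function names are our assumptions.

```python
import math

def logistic_partial(x, y, xi, i):
    """i-th partial derivative of log(1 + exp(x.xi)) - y * x.xi with respect to xi."""
    z = sum(a * b for a, b in zip(x, xi))
    # Stable evaluation of the logistic function.
    s = 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))
    # Since s - y lies in [-1, 1], the result is bounded by |x[i]|.
    return x[i] * (s - y)

def gaussian_prior_partial(xi, i, sigma2):
    """i-th partial derivative of |xi|^2 / (2 * sigma2)."""
    return xi[i] / sigma2
```

The bound on the likelihood term holds uniformly in the position, which is what allows constant computational bounds in the thinning step.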

In what follows, we introduce alternative subsampling schemes and their associated estimators and bounds as variants of the zig-zag sampler. These are designed to improve sampling efficiency by either (I) improving the mixing of the zig-zag process or (II) reducing the computational cost per simulated unit time interval. Specifically, we replace uniform subsampling with importance subsampling, in § 3.2, to address (II), and we allow general mini-batches instead of subsamples of size 1, see § 3.3, to address (I). The Supplementary Material contains an extension to stratified subsampling, which enables mixing to be further improved.

3.2. Improving bounds via importance sampling

A generalization of the estimator obtained using uniform subsampling, Inline graphic with Inline graphic, is to consider the index Inline graphic as being sampled from a nonuniform probability distribution Inline graphic, defined by Inline graphic where Inline graphic are weights satisfying Inline graphic  Inline graphic. It follows that Inline graphic with Inline graphic defines an unbiased estimator of Inline graphic. Moreover, Inline graphic with Inline graphic defines an upper bound for the rate function Inline graphic under Assumption 1. The contribution Inline graphic to the effective bouncing rate is Inline graphic, the same as for uniform subsampling, which corresponds to Inline graphic  Inline graphic.

The magnitudes of the upper bounds Inline graphic can be minimized by choosing the weight vector Inline graphic such that the constants Inline graphic are minimized. This can be verified in the case where Inline graphic  Inline graphic with Inline graphic, so that Inline graphic. This approach can be trivially generalized to allow for importance weights Inline graphic in cases where the respective partial derivative vanishes, i.e., Inline graphic, which is the case, for example, when the respective covariate Inline graphic in the considered logistic regression example is zero. For logistic regression, using optimal importance subsampling thus reduces the bounds from Inline graphic to Inline graphic for the Inline graphicth dimension. This reduction is particularly significant when the Inline graphic are sparse or have outliers; see § 4.1.
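The effect of the weights on the bounds can be made concrete with a small sketch. It takes as given a list `c` of per-observation constants from Assumption 1 for one fixed coordinate, and assumes that the uniform estimator multiplies the sampled partial derivative by the number of observations while the importance estimator divides it by the sampling weight; the function names are hypothetical.

```python
def subsampling_bounds(c):
    """Bounds for one coordinate, given c[j] >= |partial derivative of term j|."""
    n = len(c)
    uniform_bound = n * max(c)      # bound for n * (sampled partial derivative)
    importance_bound = sum(c)       # bound when weights are proportional to c
    if importance_bound > 0:
        weights = [cj / importance_bound for cj in c]
    else:
        weights = [1.0 / n] * n     # all derivatives vanish; any weights work
    return uniform_bound, importance_bound, weights

def importance_estimate(partials, weights, j):
    """Single-index estimate of sum(partials) when j is drawn from weights."""
    return partials[j] / weights[j]
```

With `c = [0.0, 0.0, 1.0, 3.0]`, the uniform bound is 12 and the importance bound is 4; observations with a zero constant receive zero weight and are never drawn, which is harmless because their partial derivatives vanish.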

3.3. Improving mixing via mini-batches

In the context of piecewise-deterministic Markov processes, the motivation for using mini-batches of size larger than 1 is to reduce the effective refreshment rate, which can be expected to improve the mixing of the process if the refreshment rate is high (Andrieu et al., 2019). We consider a mini-batch Inline graphic of random indices Inline graphic, so that Inline graphic is an unbiased estimator of Inline graphic. Entries of the mini-batch are typically sampled uniformly and independently from the dataset. This yields unbiased estimators of the form Inline graphic with Inline graphic.

Since for any function Inline graphic one has that Inline graphic, upper bounds for mini-batches of size Inline graphic are also upper bounds for mini-batches of size Inline graphic. We can also let Inline graphic with Inline graphic, where Inline graphic and Inline graphic and Inline graphic are as defined in § 3.2; by the same arguments, we conclude that the value of Inline graphic does not depend on the size of the mini-batch Inline graphic.

If we consider mini-batches of size Inline graphic, the effective bouncing rate of the zig-zag process when used with the estimators described above can be computed as Inline graphic. The effective refreshment rate Inline graphic decreases with increasing mini-batch size, as stated in the following lemma.

Lemma 1.

For all Inline graphic, Inline graphic and Inline graphic, we have Inline graphic
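The monotonicity in the mini-batch size can be checked by brute force on a toy example. The sketch below computes the expected positive part of the estimator exactly by enumerating every mini-batch, assuming indices drawn uniformly with replacement and an estimator equal to the batch mean of $n$ times the sampled partial derivatives; the function name and the toy values are ours.

```python
from itertools import product

def effective_rate(partials, theta_i, m):
    """Exact E[(theta_i * E)^+], E = mean over a size-m batch of n * partials[J_k]."""
    n = len(partials)
    total = 0.0
    # Enumerate all n**m equally likely batches (feasible only for tiny n and m).
    for batch in product(range(n), repeat=m):
        estimate = sum(n * partials[j] for j in batch) / m
        total += max(theta_i * estimate, 0.0)
    return total / n ** m
```

For `partials = [-3.0, 1.0, 4.0]` and `theta_i = 1.0`, the rate is 5 for batches of size 1 and 11/3 for batches of size 2, while the rate based on the full gradient is 2, so the excess over the full-gradient rate shrinks as the batch grows.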

4. Synthetic data examples

4.1. Scaling of computational efficiency

We evaluate the sampling efficiency using synthetic data generated by sampling the covariates Inline graphic from mixture distributions of the form Inline graphic, where Inline graphic is a point mass at zero, Inline graphic is a smooth density, and Inline graphic determines the degree of sparsity. The responses Inline graphic are sampled from (1) with true Inline graphic; we choose Inline graphic in this section. We further choose a noninformative prior by setting the prior variance to Inline graphic. We repeat the data generation and sampling Inline graphic times. The expected gain in efficiency from using importance subsampling instead of uniform subsampling is estimated as Inline graphic, where Inline graphic and Inline graphic denote the total simulation times after Inline graphic attempted bounces of the zig-zag process in the Inline graphicth run using uniform and importance subsampling, respectively. The Supplementary Material contains further, similar experiments but using control variates.
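A data-generating step of this kind might be sketched as follows. The parameterization is an assumption on our part: each covariate is set to zero with probability `1 - prob_nonzero` and is otherwise drawn from a standard normal, which stands in for the unspecified smooth density. The helper `bound_ratio` compares the two computational bounds for one coordinate and is only a crude proxy for the efficiency gain, not the simulation-time estimator used in the paper.

```python
import math
import random

def simulate_sparse_logistic(n, p, prob_nonzero, xi_true, seed=0):
    """Sparse covariates from a point-mass/normal mixture, logistic responses."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        # Exactly zero with probability 1 - prob_nonzero, else standard normal.
        x = [rng.gauss(0.0, 1.0) if rng.random() < prob_nonzero else 0.0
             for _ in range(p)]
        z = sum(a * b for a, b in zip(x, xi_true))
        # Stable evaluation of the logistic success probability.
        prob_one = (1.0 / (1.0 + math.exp(-z)) if z >= 0
                    else math.exp(z) / (1.0 + math.exp(z)))
        y.append(1 if rng.random() < prob_one else 0)
        X.append(x)
    return X, y

def bound_ratio(X, i):
    """n * max_j |x_i^j| divided by sum_j |x_i^j| for coordinate i."""
    col = [abs(row[i]) for row in X]
    return len(col) * max(col) / sum(col)
```

The ratio is at least 1 by construction and grows as the covariates become sparser or heavier-tailed, in line with the qualitative behaviour described in this section.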

We can expect the expected relative gain in efficiency to behave as

graphic file with name Equation4.gif

Figure 1(a) plots the gain in efficiency for sparse covariates for Inline graphic and decreasing Inline graphic. Indeed, the behaviour of the estimated relative gain in efficiency as approximately Inline graphic is as suggested by the first-order Taylor expansion of Inline graphic when Inline graphic is large. Similarly, the form of Inline graphic suggests that the expected relative gain in efficiency is unbounded as the number Inline graphic of observations increases if Inline graphic has unbounded support. We therefore choose dense covariates for increasing Inline graphic. Figure 1(b) shows that the relative gains in efficiency for Inline graphic and Inline graphic are of order Inline graphic and order Inline graphic, respectively, as Inline graphic, which is what the first-order Taylor approximation of Inline graphic suggests. The more heavy-tailed the density Inline graphic is, the larger the asymptotic growth rate of the efficiency gain becomes. Additionally, when Inline graphic is Student’s Inline graphic distribution, we observe the efficiency gain to be of order Inline graphic.

Fig. 1.

Scaling of the relative gain in efficiency for Inline graphic (red), Inline graphic (blue), Inline graphic (green) and Inline graphic (pink), with: (a) Inline graphic and varying Inline graphic, where the dashed black line is proportional to Inline graphic; (b) Inline graphic and varying number Inline graphic of observations, where the dashed black line is proportional to Inline graphic, the dotted black line is proportional to Inline graphic, and the dot-dashed black line is proportional to Inline graphic.

4.2. Control variates for sparse data

As mentioned in § 2.2, importance subsampling can be combined with the use of control variates. However, this approach fails to be efficient when the data are imbalanced or sparse. To demonstrate this, we generate covariates as described in § 4.1 for Inline graphic and Inline graphic, and generate responses independently of the covariates such that exactly Inline graphic of them are ones. We choose Inline graphic and Inline graphic, and choose the prior variance to be 1. Figure 2(a) plots, as a function of Inline graphic, the ratio of the mixing time of the slowest-mixing component for importance subsampling with control variates to that for importance subsampling without control variates. As the responses become more imbalanced, i.e., as Inline graphic decreases, it can be seen that the efficiency of using control variates decreases relative to not using them.

Fig. 2.

(a) Gain in efficiency of using control variates over not using control variates for imbalanced responses. Auto-correlation function plots for (b) uniform subsampling and (c) importance subsampling in a high-dimensional sparse example.

4.3. High-dimensional sparse example

In this simulation we consider a challenging setting with number of dimensions Inline graphic and number of observations Inline graphic. The data are generated as in § 4.1 with Inline graphic and Inline graphic. Traditional data augmentation and subsampling algorithms are either computationally very expensive or mix slowly in such a scenario. We choose the prior variance to be 1; auto-correlation function plots for uniform subsampling and for importance subsampling are displayed in Fig. 2 panels (b) and (c), respectively. These show that it is necessary to use importance subsampling for the zig-zag sampler to be a feasible sampling method. From a practical point of view, adaptive preconditioning can be of further help; this is described in the Supplementary Material.

5. Real-data example

We consider an imbalanced set of data on cervical cancer (Fernandes et al., 2017) obtained from the UCI Machine Learning Repository (Dua & Graff, 2017). The dataset contains 858 observations with 34 predictors. The responses are whether or not an individual has cancer, with only 18 out of the 858 individuals having cancer. The predictors include the number of sexual partners, use of hormonal contraceptives, and other variables; more than half of the predictors have approximately 80% zeros. Fixing the number of bouncing attempts, the mixing times of the slowest-mixing component for uniform subsampling and for importance subsampling are Inline graphic and Inline graphic, respectively. A stratification scheme described in the Supplementary Material brings this time down to Inline graphic.

6. Discussion

Subsampling for traditional Markov chain Monte Carlo schemes can be tricky, as the resulting chains induce an error in the invariant distribution that can be difficult to quantify. A promising class of recently developed algorithms, known as piecewise-deterministic Markov processes, allow subsampling without modifying the invariant measure. Nonuniform subsampling strategies can improve such algorithms relative to using uniform subsampling, especially for logistic regression with sparse covariate data; however, this can also be the case for other problems to which piecewise-deterministic Markov processes are applicable. After completion of this work, we became aware of a 2016 University of Oxford master’s thesis by Nicholas Galbraith, where a method called informed subsampling is introduced, which is similar to the importance subsampling strategy presented here. While aspects of sparsity in the covariate data are not addressed in that thesis, the author makes similar observations regarding the usefulness of the approach in the setting of covariate data with outliers or distributed according to heavy-tailed distributions.

Acknowledgement

This research was partially supported by the U.S. National Science Foundation and National Institutes of Health. We are grateful to the associate editor and two referees for helping us to improve the paper. Sen and Sachs contributed equally to this paper.

Supplementary material

Supplementary material available at Biometrika online includes more details on the zig-zag process with generalized subsampling, importance subsampling using control variates and associated numerical experiments, and stratified subsampling, as well as a proof of Lemma 1.

References

  1. Alquier, P., Friel, N., Everitt, R. & Boland, A. (2016). Noisy Monte Carlo: Convergence of Markov chains with approximate transition kernels. Statist. Comp. 26, 29–47.
  2. Andrieu, C., Durmus, A., Nüsken, N. & Roussel, J. (2019). Hypercoercivity of piecewise deterministic Markov process-Monte Carlo. arXiv: 1808.08592v2.
  3. Andrieu, C. & Roberts, G. O. (2009). The pseudo-marginal approach for efficient Monte Carlo computations. Ann. Statist. 37, 697–725.
  4. Bierkens, J., Fearnhead, P. & Roberts, G. (2019). The zig-zag process and super-efficient sampling for Bayesian analysis of big data. Ann. Statist. 47, 1288–320.
  5. Bouchard-Côté, A., Vollmer, S. J. & Doucet, A. (2018). The bouncy particle sampler: A nonreversible rejection-free Markov chain Monte Carlo method. J. Am. Statist. Assoc. 113, 855–67.
  6. Dua, D. & Graff, C. (2017). UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences, http://archive.ics.uci.edu/ml.
  7. Fearnhead, P., Bierkens, J., Pollock, M. & Roberts, G. O. (2018). Piecewise deterministic Markov processes for continuous-time Monte Carlo. Statist. Sci. 33, 386–412.
  8. Fernandes, K., Cardoso, J. S. & Fernandes, J. (2017). Transfer learning with partial observability applied to cervical cancer screening. In Iberian Conference on Pattern Recognition and Image Analysis. Cham, Switzerland: Springer, pp. 243–50.
  9. Fithian, W. & Hastie, T. (2014). Local case-control sampling: Efficient subsampling in imbalanced data sets. Ann. Statist. 42, 1693–724.
  10. Huggins, J., Campbell, T. & Broderick, T. (2016). Coresets for scalable Bayesian logistic regression. In NIPS’16: Proc. 30th Int. Conf. Neural Information Processing Systems. New York: Curran Associates, pp. 4080–8.
  11. Jacob, P. E. & Thiery, A. H. (2015). On nonnegative unbiased estimators. Ann. Statist. 43, 769–84.
  12. Johndrow, J. E., Mattingly, J. C., Mukherjee, S. & Dunson, D. (2017). Optimal approximating Markov chains for Bayesian inference. arXiv: 1508.03387v3.
  13. Johndrow, J. E., Smith, A., Pillai, N. & Dunson, D. B. (2019). MCMC for imbalanced categorical data. J. Am. Statist. Assoc. 114, 1394–403.
  14. Lewis, P. W. & Shedler, G. S. (1979). Simulation of nonhomogeneous Poisson processes by thinning. Naval Res. Logist. Quart. 26, 403–13.
  15. Li, C., Srivastava, S. & Dunson, D. B. (2017). Simple, scalable and accurate posterior interval estimation. Biometrika 104, 665–80.
  16. Polson, N. G., Scott, J. G. & Windle, J. (2013). Bayesian inference for logistic models using Pólya–Gamma latent variables. J. Am. Statist. Assoc. 108, 1339–49.
  17. Srivastava, S., Li, C. & Dunson, D. B. (2018). Scalable Bayes via barycenter in Wasserstein space. J. Mach. Learn. Res. 19, 312–46.
  18. Ting, D. & Brochu, E. (2018). Optimal subsampling with influence functions. In NIPS’18: Proc. 32nd Int. Conf. Neural Information Processing Systems. New York: Curran Associates, pp. 3650–9.
  19. Welling, M. & Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In ICML’11: Proc. 28th Int. Conf. Machine Learning. Madison, Wisconsin: Omnipress, pp. 681–8.

Articles from Biometrika are provided here courtesy of Oxford University Press