Summary
Classification with high-dimensional data is of widespread interest and often involves dealing with imbalanced data. Bayesian classification approaches are hampered by the fact that current Markov chain Monte Carlo algorithms for posterior computation become inefficient as the number of predictors or the number of subjects to classify gets large, because of the increasing computational time per step and worsening mixing rates. One strategy is to employ a gradient-based sampler to improve mixing while using data subsamples to reduce the per-step computational complexity. However, the usual subsampling breaks down when applied to imbalanced data. Instead, we generalize piecewise-deterministic Markov chain Monte Carlo algorithms to include importance-weighted and mini-batch subsampling. These maintain the correct stationary distribution with arbitrarily small subsamples and substantially outperform current competitors. We provide theoretical support for the proposed approach and demonstrate its performance gains in simulated data examples and an application to cancer data.
Keywords: Imbalanced data, Logistic regression, Piecewise-deterministic Markov process, Scalable inference, Subsampling
1. Introduction
In designing algorithms for large datasets, much of the focus has been on optimization methods that produce a point estimate with no characterization of uncertainty. This motivates the development of scalable Bayesian algorithms. As variational methods and related analytic approximations lack theoretical support and can be inaccurate, this note focuses on posterior sampling algorithms.
One such class of methods consists of divide-and-conquer Markov chain Monte Carlo algorithms, which divide the data into chunks, run Markov chain Monte Carlo independently for each chunk, and then combine the samples (Li et al., 2017; Srivastava et al., 2018). However, combining samples inevitably leads to some bias, and accuracy theorems require sample sizes to increase within each subset.
An alternative strategy uses subsamples to approximate transition probabilities and to reduce bottlenecks in calculating likelihoods and gradients (Welling & Teh, 2011). Such approaches typically rely on uniform random subsamples, which can be highly inefficient, as noted in an expanding frequentist literature on biased subsampling (Fithian & Hastie, 2014; Ting & Brochu, 2018). The Bayesian literature has largely overlooked the use of biased subsampling in efficient algorithm design, though recent coreset approaches address a related problem (Huggins et al., 2016). A problem with subsampling Markov chain Monte Carlo is that it is almost impossible to preserve the correct invariant distribution. While there has been work on quantifying the error (Johndrow et al., 2017; Alquier et al., 2016), it is usually difficult to do so in practice. The pseudo-marginal approach of Andrieu & Roberts (2009) offers a potential solution, but it is generally impossible to obtain the required unbiased estimators of likelihoods using subsamples (Jacob & Thiery, 2015).
A promising recent direction has been the use of nonreversible samplers with subsampling within the framework of piecewise-deterministic Markov processes (Bouchard-Côté et al., 2018; Fearnhead et al., 2018; Bierkens et al., 2019). These approaches use the gradient of the loglikelihood, which can be replaced by a subsample-based unbiased estimator so that the exactly correct invariant distribution is maintained. In this note our goal is to improve the efficiency of such approaches by using nonuniform subsampling motivated concretely by logistic regression.
2. Logistic regression with sparse imbalanced data
2.1. Model
We consider the logistic regression model
$$\mathrm{pr}(y = 1 \mid x, \xi) = \frac{1}{1 + \exp(-x^{\mathrm{T}}\xi)}, \qquad (1)$$
where $y \in \{0, 1\}$ is the response, $x \in \mathbb{R}^p$ are predictors, and $\xi \in \mathbb{R}^p$ are coefficients for the predictors. Consider data $(x^j, y^j)$, $j = 1, \ldots, n$, from model (1), where $x^j = (x^j_1, \ldots, x^j_p)^{\mathrm{T}}$. For a prior distribution $\pi_0$ on $\xi$, the posterior distribution is $\pi(\xi) = Z^{-1}\exp\{-U(\xi)\}$, where $Z$ is the normalizing constant and $U(\xi) = -\log \pi_0(\xi) - \sum_{j=1}^n \log \mathrm{pr}(y^j \mid x^j, \xi)$ denotes the potential function. A popular algorithm for sampling from $\pi$ is Pólya–Gamma data augmentation (Polson et al., 2013); however, this performs poorly if there is a large imbalance in the class labels $y^j$ (Johndrow et al., 2019). Similar issues arise when the $x^j$ are imbalanced. Logistic regression is routinely used in a wide range of fields and such imbalance issues are extremely common, giving rise to a great need for more efficient algorithms. While standard Metropolis–Hastings algorithms not relying on data augmentation can perform well despite imbalanced data in settings where both $n$ and $p$ are small (Johndrow et al., 2019), problems arise in scaling to large datasets due to the increasing computational time per step and slow mixing.
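As a minimal sketch of the quantities that the samplers below work with, the following Python code evaluates the contribution of a single observation to the potential function under model (1), together with its gradient. The function names (`sigmoid`, `potential_term`, `grad_potential_term`) are illustrative assumptions and are not taken from the paper or from any library.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def potential_term(xi, x, y):
    """Negative log-likelihood of one observation (x, y) with y in {0, 1}."""
    z = sum(a * b for a, b in zip(x, xi))
    # log(1 + exp(z)) evaluated without overflow, minus y * z
    return max(z, 0.0) + math.log1p(math.exp(-abs(z))) - y * z

def grad_potential_term(xi, x, y):
    """Gradient of potential_term in xi, namely x * (sigmoid(x'xi) - y)."""
    z = sum(a * b for a, b in zip(x, xi))
    r = sigmoid(z) - y
    return [r * a for a in x]
```

Because the factor `sigmoid(z) - y` always lies in the interval (-1, 1), the absolute value of the partial derivative in coordinate i never exceeds the absolute value of the corresponding covariate, which is what makes bounded-rate thinning possible for this model.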
2.2. The zig-zag process
The zig-zag process (Bierkens et al., 2019) is a piecewise-deterministic Markov process that is particularly useful for logistic regression. It is a continuous-time stochastic process $\{\xi(t), \theta(t)\}_{t \ge 0}$ on the augmented space $\mathbb{R}^p \times \{-1, 1\}^p$, where $\xi(t)$ may be understood as the position and $\theta(t)$ the velocity of the process at time $t$. Under fairly mild conditions, the zig-zag process is ergodic with respect to the product measure $\pi \otimes \mu$, where $\mu$ is the uniform measure on $\{-1, 1\}^p$. In other words, $\lim_{T \to \infty} T^{-1} \int_0^T f\{\xi(t)\}\,dt = \int f(\xi)\,\pi(\xi)\,d\xi$ holds almost surely for any $\pi$-integrable function $f$.
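For a starting point $\xi^0$ and velocity $\theta^0$, the zig-zag process evolves deterministically according to
$$\xi(t) = \xi^0 + \theta^0 t, \qquad \theta(t) = \theta^0. \qquad (2)$$
At a random time $\tau$, a bouncing event flips the sign of one component of the velocity $\theta^0$. The process then evolves according to (2) with the new velocity until the next change in velocity. The time $\tau$ is the first arrival time of $p$ independent Poisson processes with intensity functions $m_1(t), \ldots, m_p(t)$; that is, $\tau = \tau_{i_0}$ with $i_0 = \arg\min_{i} \tau_i$, where $\tau_i$ is the first arrival time of the $i$th process. The sign flip applies $F_{i_0}$ to $\theta^0$, with $\{F_i(\theta)\}_k = \theta_k$ if $k \ne i$ and $\{F_i(\theta)\}_k = -\theta_k$ if $k = i$. The intensity functions are of the form $m_i(t) = \lambda_i(\xi^0 + \theta^0 t, \theta^0)$, where $\lambda_i$ is a rate function. A sufficient condition for the zig-zag process to preserve $\pi \otimes \mu$ as its invariant distribution is the existence of nonnegative functions $\gamma_i$ such that $\lambda_i(\xi, \theta) = \{\theta_i \partial_i U(\xi)\}^+ + \gamma_i(\xi)$; here $(a)^+ = \max(a, 0)$ denotes the positive part of $a$. The $\gamma_i$ are known as refreshment rates.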
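If $m_i$ has a simple closed form, the arrival times $\tau_i$ can be sampled by solving $\int_0^{\tau_i} m_i(s)\,ds = -\log u_i$ with $u_i \sim \mathrm{Uniform}(0, 1)$; otherwise, the $\tau_i$ are obtained via Poisson thinning (Lewis & Shedler, 1979). Assume that we have continuous functions $M_i$ such that $m_i(t) \le M_i(t)$ for all $t \ge 0$; here $M_i$ are upper computational bounds. Let $\tilde\tau_1, \ldots, \tilde\tau_p$ denote the first arrival times of nonhomogeneous Poisson processes with rates $M_1, \ldots, M_p$, respectively, and let $i_0 = \arg\min_{i} \tilde\tau_i$. A zig-zag process with intensity $m_i(t)$ is still obtained if $\xi$ is evolved according to (2) for time $\tilde\tau_{i_0}$ instead of $\tau$, and the sign of $\theta_{i_0}$ is flipped at time $\tilde\tau_{i_0}$ with probability $m_{i_0}(\tilde\tau_{i_0}) / M_{i_0}(\tilde\tau_{i_0})$.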
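The thinning construction can be sketched in a few lines of Python for the simplest case of a constant upper bound. This is an illustration under stated assumptions rather than the implementation used for the experiments: the function name `first_arrival_by_thinning` is hypothetical, the bound is taken to be a positive constant, and the loop only terminates if the intensity does not vanish for all large times.

```python
import math
import random

def first_arrival_by_thinning(rate, bound, rng):
    """First arrival time of a Poisson process with intensity rate(t) <= bound.

    Candidate times come from a homogeneous Poisson process with constant
    intensity `bound`; a candidate at time t is accepted with probability
    rate(t) / bound, and rejected candidates are simply skipped.
    """
    t = 0.0
    while True:
        t += rng.expovariate(bound)
        if rng.random() < rate(t) / bound:
            return t

# Example: intensity min(t, 2), which is bounded above by 2.
rng = random.Random(0)
draws = [first_arrival_by_thinning(lambda t: min(t, 2.0), 2.0, rng)
         for _ in range(5)]
```

In the zig-zag setting a rejected candidate still moves the position forward by the proposed time; only the velocity flip is skipped.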
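The subsampling approach of Bierkens et al. (2019) uses uniform subsampling of a single data point to obtain an unbiased estimate of the $i$th partial derivative of the potential function $U(\xi) = U^0(\xi) + \sum_{j=1}^n U^j(\xi)$, where $U^0(\xi) = -\log \pi_0(\xi)$ is from the prior and $\sum_{j=1}^n U^j(\xi)$ with $U^j(\xi) = \log\{1 + \exp(\xi^{\mathrm{T}} x^j)\} - y^j \xi^{\mathrm{T}} x^j$ is from the likelihood. Their subsampling algorithm preserves the correct stationary distribution. Bierkens et al. (2019) considered estimates $E_i^j$ such that $\partial_i U(\xi) = n^{-1} \sum_{j=1}^n E_i^j(\xi)$, where $j$ indexes the sampled data point. This is used to construct a stochastic rate function as $m_i^j(t) = \{\theta_i E_i^j(\xi + \theta t)\}^+$. By using upper bounds satisfying $m_i^j(t) \le M_i(t)$ for all $j$, the rate functions $m_i$ can be replaced by stochastic versions $m_i^J$, with $J$ being resampled at every unthinned event.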
In addition, control variates can be used to reduce the variance of the estimate $E_i^j$, which can lead to dramatic increases in sampling efficiency when the posterior is concentrated around a reference point (Bierkens et al., 2019). Here we use isotropic Gaussian priors and focus on situations where either $p$ is large relative to $n$ or the covariates and/or responses are imbalanced. In such situations, the posterior is not sufficiently concentrated around a reference point for control variates to perform efficiently. We demonstrate this numerically in § 4.2 for imbalanced responses, and the Supplementary Material contains similar experiments for sparse covariates. For this reason and to save space, our discussion in this note focuses on subsampling techniques that do not rely on control variates; the techniques developed can be combined with the use of control variates as detailed in the Supplementary Material.
3. Improved subsampling
3.1. General framework
We introduce a generalized version of the zig-zag sampler. Our motivation is (i) to increase the sampling efficiency and (ii) to simplify the construction of upper bounds. We achieve (ii) by letting the Poisson process that determines bouncing times in component $i$ be a superposition, i.e., sum, of two independent Poisson processes with state-dependent bouncing rates $\lambda_i^0$ and $\lambda_i^1$. This construction allows Poisson thinning of each process separately, which decouples the problem of constructing upper bounds into that of constructing suitable bounds for the prior and for the likelihood. We achieve (i) by using general forms of the estimator $E_i^a$ in the Poisson thinning step obtained through nonuniform subsampling.
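The resulting algorithm is presented as Algorithm 1, where $m_i^0(t) = \{\theta_i \partial_i U^0(\xi + \theta t)\}^+$ and $m_i^a(t) = \{\theta_i E_i^a(\xi + \theta t)\}^+$, assuming that $E_i^a$, with $a \sim \nu_i$, is an unbiased estimator of $\partial_i \sum_{j=1}^n U^j$ and $M_i$ is such that for all $a$ and $t \ge 0$, $m_i^a(t) \le M_i(t)$. To keep the presentation simple, we do not explicitly include the Poisson thinning step for the prior in the algorithm. The state-dependent bouncing rate of the resulting zig-zag process is $\lambda_i(\xi, \theta) = \lambda_i^0(\xi, \theta) + \lambda_i^1(\xi, \theta)$, with $\lambda_i^1$ having the explicit form $\lambda_i^1(\xi, \theta) = \mathbb{E}_{a \sim \nu_i}[\{\theta_i E_i^a(\xi)\}^+]$. General results on piecewise-deterministic Markov processes imply that such a zig-zag process preserves the target measure $\pi$ (Fearnhead et al., 2018); we nonetheless provide a proof in the Supplementary Material for a self-contained presentation.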
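Algorithm 1.
Zig-zag algorithm with generalized subsampling.
Input: starting point $\xi^0$, initial velocity $\theta^0$, maximum number of bouncing attempts $N$.
Set $t^0 = 0$.
for $k = 1, \ldots, N$ do
Draw $\tau_i^0$ and $\tau_i^1$ $(i = 1, \ldots, p)$ such that $\mathrm{pr}(\tau_i^0 \ge t) = \exp\{-\int_0^t m_i^0(s)\,ds\}$ and $\mathrm{pr}(\tau_i^1 \ge t) = \exp\{-\int_0^t M_i(s)\,ds\}$, with the rates evaluated along the trajectory started at $(\xi^{k-1}, \theta^{k-1})$.
Set $\tau = \min(\tau^0_{i_0}, \tau^1_{i_1})$ where $i_0 = \arg\min_i \tau_i^0$ and $i_1 = \arg\min_i \tau_i^1$.
Evolve position: $(t^k, \xi^k) = (t^{k-1} + \tau, \xi^{k-1} + \theta^{k-1}\tau)$.
Draw $a \sim \nu_{i_1}$ and $u \sim \mathrm{Uniform}(0, 1)$.
if $\tau^0_{i_0} < \tau^1_{i_1}$ then $\theta^k = F_{i_0}(\theta^{k-1})$
else if $u < m^a_{i_1}(\tau) / M_{i_1}(\tau)$ then $\theta^k = F_{i_1}(\theta^{k-1})$
else $\theta^k = \theta^{k-1}$
end for
Output: the path of a zig-zag process specified by skeleton points $(\xi^k, \theta^k)_{k=0}^N$ and bouncing times $(t^k)_{k=0}^N$.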
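The steps of Algorithm 1 can be sketched in Python for logistic regression with an isotropic Gaussian prior. This is a minimal illustration under stated assumptions, not the authors' code: the names `zigzag_subsampled`, `prior_arrival` and `weights` are hypothetical, the likelihood bounds are taken to be constants built from the absolute covariate values, the prior events are sampled exactly by inversion, and mini-batches have size one.

```python
import math
import random

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def prior_arrival(a, b, rng):
    """First arrival time for intensity (a + b*s)^+ with b > 0, by inversion."""
    e = rng.expovariate(1.0)
    if a >= 0:
        return (-a + math.sqrt(a * a + 2.0 * b * e)) / b
    return -a / b + math.sqrt(2.0 * e / b)

def zigzag_subsampled(X, y, sigma2, n_attempts, weights=None, seed=0):
    """Return skeleton points (t, xi, theta); weights[i][j] is the probability
    of sampling observation j when component i attempts a bounce."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    if weights is None:  # uniform subsampling
        weights = [[1.0 / n] * n for _ in range(p)]
    # Constant bounds max_j |x_ij| / w_ij over observations with w_ij > 0.
    bounds = [max((abs(X[j][i]) / weights[i][j]
                   for j in range(n) if weights[i][j] > 0), default=0.0)
              for i in range(p)]
    xi, theta, t = [0.0] * p, [1.0] * p, 0.0
    skeleton = [(t, xi[:], theta[:])]
    for _ in range(n_attempts):
        # Prior intensity along the path is ((theta_i * xi_i + s) / sigma2)^+.
        tau0 = [prior_arrival(theta[i] * xi[i] / sigma2, 1.0 / sigma2, rng)
                for i in range(p)]
        tau1 = [rng.expovariate(b) if b > 0 else math.inf for b in bounds]
        i0 = min(range(p), key=tau0.__getitem__)
        i1 = min(range(p), key=tau1.__getitem__)
        tau = min(tau0[i0], tau1[i1])
        xi = [xi[i] + theta[i] * tau for i in range(p)]
        t += tau
        if tau0[i0] < tau1[i1]:
            theta[i0] = -theta[i0]  # bounce driven by the prior
        else:
            # Subsample one observation and thin the proposed likelihood bounce.
            j = rng.choices(range(n), weights=weights[i1])[0]
            z = sum(X[j][k] * xi[k] for k in range(p))
            est = X[j][i1] * (sigmoid(z) - y[j]) / weights[i1][j]
            if rng.random() < max(theta[i1] * est, 0.0) / bounds[i1]:
                theta[i1] = -theta[i1]
        skeleton.append((t, xi[:], theta[:]))
    return skeleton
```

Posterior expectations are then estimated by integrating along the piecewise-linear path between consecutive skeleton points rather than by averaging the skeleton points themselves.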
Although the focus of this work is on sampling from the posterior in the Bayesian logistic regression problem presented in § 2.1, the approach can be readily applied to situations where the following assumption on the terms $U^j$ in the loglikelihood holds.
Assumption 1.
The partial derivatives of $U^j$ are bounded; that is, there exist constants $c_i^j \ge 0$ such that for all $\xi \in \mathbb{R}^p$, $|\partial_i U^j(\xi)| \le c_i^j$ $(i = 1, \ldots, p;\ j = 1, \ldots, n)$.
For the logistic regression problem considered, Bierkens et al. (2019) showed that Assumption 1 is satisfied with
$$c_i^j = |x_i^j| \quad (i = 1, \ldots, p;\ j = 1, \ldots, n).$$
To keep things simple, we consider the prior to be $\xi \sim N(0, \sigma^2 I_p)$; we discuss other choices of priors in the Supplementary Material. Then we have that $m_i^0(t) = \{\theta_i(\xi_i + \theta_i t)\}^+ / \sigma^2$.
In what follows, we introduce alternative subsampling schemes and their associated estimators and bounds as variants of the zig-zag sampler. These are designed to improve sampling efficiency by either (I) improving the mixing of the zig-zag process or (II) reducing the computational cost per simulated unit time interval. Specifically, we replace uniform subsampling with importance subsampling, in § 3.2, to address (II), and we allow general mini-batches instead of subsamples of size 1, see § 3.3, to address (I). The Supplementary Material contains an extension to stratified subsampling, which enables mixing to be further improved.
3.2. Improving bounds via importance sampling
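A generalization of the estimator obtained using uniform subsampling, $E_i^a(\xi) = n\,\partial_i U^a(\xi)$ with $a \sim \mathrm{Uniform}\{1, \ldots, n\}$, is to consider the index $a$ as being sampled from a nonuniform probability distribution $\nu_i$, defined by $\nu_i(a = j) = \omega_i^j$ where $\omega_i^1, \ldots, \omega_i^n$ are weights satisfying $\omega_i^j > 0$ and $\sum_{j=1}^n \omega_i^j = 1$. It follows that $E_i^a(\xi) = (\omega_i^a)^{-1} \partial_i U^a(\xi)$ with $a \sim \nu_i$ defines an unbiased estimator of $\partial_i \sum_{j=1}^n U^j(\xi)$. Moreover, $M_i(t) = C_i$ with $C_i = \max_{j} c_i^j / \omega_i^j$ defines an upper bound for the rate function $m_i^a$ under Assumption 1. The contribution $\lambda_i^1$ to the effective bouncing rate is $\lambda_i^1(\xi, \theta) = \sum_{j=1}^n \{\theta_i \partial_i U^j(\xi)\}^+$, the same as for uniform subsampling, which corresponds to $\omega_i^j = 1/n$.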
The magnitudes of the upper bounds $M_i$ can be minimized by choosing the weight vector $\omega_i$ such that the constants $C_i$ are minimized. It can be verified that this is the case when $\omega_i^j = c_i^j / \sum_{k=1}^n c_i^k$ with $c_i^j > 0$, so that $C_i = \sum_{k=1}^n c_i^k$; indeed, for any weights, $\max_j c_i^j / \omega_i^j \ge \sum_j \omega_i^j (c_i^j / \omega_i^j) = \sum_j c_i^j$. This approach can be trivially generalized to allow for importance weights $\omega_i^j = 0$ in cases where the respective partial derivative vanishes, i.e., $c_i^j = 0$, which is the case, for example, when the respective covariate $x_i^j$ in the considered logistic regression example is zero. For logistic regression, using optimal importance subsampling thus reduces the bounds from $n \max_j |x_i^j|$ to $\sum_{j=1}^n |x_i^j|$ for the $i$th dimension. This reduction is particularly significant when the $x_i^j$ are sparse or have outliers; see § 4.1.
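The effect of the weights on the bounds can be checked numerically for a single covariate column. The sketch below assumes the logistic regression bounds given by the absolute covariate values and weights proportional to them; the function names are hypothetical, and the column is assumed to contain at least one nonzero entry.

```python
def uniform_bound(col):
    """Bound n * max_j |x_j| for uniform subsampling of one covariate column."""
    return len(col) * max(abs(v) for v in col)

def importance_weights(col):
    """Weights proportional to |x_j|; zero covariates receive zero weight."""
    total = sum(abs(v) for v in col)
    return [abs(v) / total for v in col]

def importance_bound(col):
    """Bound max_j |x_j| / w_j under the weights above, equal to sum_j |x_j|."""
    w = importance_weights(col)
    return max(abs(v) / wj for v, wj in zip(col, w) if wj > 0)

col = [0.0, 0.0, 3.0, 0.0, -1.0]
print(uniform_bound(col), importance_bound(col))  # prints 15.0 4.0
```

For this sparse column the bound drops from 15 to 4, so fewer than a third as many bounce proposals are needed per unit of simulated time.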
3.3. Improving mixing via mini-batches
In the context of piecewise-deterministic Markov processes, the motivation for using mini-batches of size larger than 1 is to reduce the effective refreshment rate, which can be expected to improve the mixing of the process if the refreshment rate is high (Andrieu et al., 2019). We consider a mini-batch $a = (a_1, \ldots, a_k)$ of random indices $a_l \in \{1, \ldots, n\}$, chosen so that $E_i^a(\xi)$ is an unbiased estimator of $\partial_i \sum_{j=1}^n U^j(\xi)$. Entries of the mini-batch are typically sampled uniformly and independently from the dataset. This yields unbiased estimators of the form $E_i^a(\xi) = k^{-1} \sum_{l=1}^k n\,\partial_i U^{a_l}(\xi)$ with $a_1, \ldots, a_k \sim \mathrm{Uniform}\{1, \ldots, n\}$ independently.
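Since for any function $g$ one has that $\max_{a_1, \ldots, a_k} k^{-1} \sum_{l=1}^k g(a_l) = \max_{j} g(j)$, upper bounds for mini-batches of size $1$ are also upper bounds for mini-batches of size $k$. We can also let $E_i^a(\xi) = k^{-1} \sum_{l=1}^k (\omega_i^{a_l})^{-1} \partial_i U^{a_l}(\xi)$ with $a_1, \ldots, a_k \sim \nu_i$ independently, where $\nu_i$ and $\omega_i^j$ and $M_i$ are as defined in § 3.2; by the same arguments, we conclude that the value of $M_i$ does not depend on the size of the mini-batch $k$.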
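If we consider mini-batches of size $k$, the effective bouncing rate of the zig-zag process when used with the estimators described above can be computed as $\lambda_i^{(k)}(\xi, \theta) = \lambda_i^0(\xi, \theta) + \mathbb{E}[\{\theta_i E_i^a(\xi)\}^+]$, where the expectation is over the mini-batch $a$. The effective refreshment rate $\gamma_i^{(k)}(\xi, \theta) = \lambda_i^{(k)}(\xi, \theta) - \{\theta_i \partial_i U(\xi)\}^+$ decreases with increasing mini-batch size, as stated in the following lemma.
Lemma 1.
For all $\xi \in \mathbb{R}^p$, $\theta \in \{-1, 1\}^p$ and $k \ge 1$, we have $\gamma_i^{(k+1)}(\xi, \theta) \le \gamma_i^{(k)}(\xi, \theta)$ $(i = 1, \ldots, p)$.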
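The monotone behaviour of the effective rate in the mini-batch size can be checked by exact enumeration on a toy example. The sketch below is only an illustration: it assumes uniform sampling with replacement, a single coordinate, and made-up per-observation partial derivatives; the function name `effective_rate` is hypothetical.

```python
from itertools import product

def effective_rate(grads, theta, k):
    """Expected positive part of theta times the mini-batch estimate.

    grads[j] is the partial derivative contributed by observation j, the
    estimate is the average of n * grads[a_l] over a batch of size k, and
    the expectation is taken exactly over all n**k equally likely batches.
    """
    n = len(grads)
    total = 0.0
    for batch in product(range(n), repeat=k):
        est = sum(n * grads[j] for j in batch) / k
        total += max(theta * est, 0.0)
    return total / n ** k

grads = [0.5, -1.0, 0.25, 0.75]  # hypothetical per-observation derivatives
rates = [effective_rate(grads, 1.0, k) for k in (1, 2, 3)]
```

The values in `rates` are nonincreasing in the batch size and never fall below the positive part of the exact derivative, which is the rate of a zig-zag process without subsampling; the gap between the two is the refreshment induced by subsampling.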
4. Synthetic data examples
4.1. Scaling of computational efficiency
We evaluate the sampling efficiency using synthetic data generated by sampling the covariates
from mixture distributions of the form
, where
is a point mass at zero,
is a smooth density, and
determines the degree of sparsity. The responses
are sampled from (1) with true
; we choose
in this section. We further choose a noninformative prior by setting the prior variance to
. We repeat the data generation and sampling
times. The expected gain in efficiency from using importance subsampling instead of uniform subsampling is estimated as
, where
and
denote the total simulation times after
attempted bounces of the zig-zag process in the
th run using uniform and importance subsampling, respectively. The Supplementary Material contains further, similar experiments but using control variates.
We can expect the expected relative gain in efficiency to behave as
![]() |
Figure 1(a) plots the gain in efficiency for sparse covariates for
and decreasing
. Indeed, the behaviour of the estimated relative gain in efficiency as approximately
is as suggested by the first-order Taylor expansion of
when
is large. Similarly, the form of
suggests that the expected relative gain in efficiency is unbounded as the number
of observations increases if
has unbounded support. We therefore choose dense covariates for increasing
. Figure 1(b) shows that the relative gains in efficiency for
and
are of order
and order
, respectively, as
, which is what the first-order Taylor approximation of
suggests. The more heavy-tailed the density
is, the larger the asymptotic growth rate of the efficiency gain becomes. Additionally, when
is Student’s
distribution, we observe the efficiency gain to be of order
.
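The ratio of the uniform-subsampling bound to the importance-subsampling bound from § 3.2 gives a rough proxy for the gain discussed here, and it can be explored with a short simulation. The sketch below is an assumption-laden illustration rather than a reproduction of the experiments: it uses a standard normal density for the nonzero entries, a hypothetical sparsity parameter `alpha` for the probability that an entry is nonzero, and a single covariate column.

```python
import random

def gain(col):
    """Ratio of the uniform bound n * max|x_j| to the importance bound sum|x_j|."""
    return len(col) * max(abs(v) for v in col) / sum(abs(v) for v in col)

def sparse_column(n, alpha, rng):
    """Entries are zero with probability 1 - alpha, otherwise standard normal."""
    return [rng.gauss(0.0, 1.0) if rng.random() < alpha else 0.0
            for _ in range(n)]

rng = random.Random(0)
for alpha in (1.0, 0.1, 0.01):
    col = sparse_column(10000, alpha, rng)
    print(alpha, round(gain(col), 1))
```

The ratio is always at least one, and it grows as the column becomes sparser, because the sum of absolute values shrinks roughly in proportion to the fraction of nonzero entries while the maximum changes comparatively little.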
Fig. 1.
Scaling of the relative gain in efficiency for
(red),
(blue),
(green) and
(pink), with: (a)
and varying
, where the dashed black line is proportional to
; (b)
and varying number
of observations, where the dashed black line is proportional to
, the dotted black line is proportional to
, and the dot-dashed black line is proportional to
.
4.2. Control variates for sparse data
As mentioned in § 2.2, importance subsampling can be combined with the use of control variates. However, this approach fails to be efficient when the data are imbalanced or sparse. To demonstrate this, we generate covariates as described in § 4.1 for
and
, and generate responses independently of the covariates such that exactly
of them are ones. We choose
and
, and choose the prior variance to be 1. Figure 2(a) plots, as a function of
, the ratio of the mixing time of the slowest-mixing component for importance subsampling with control variates to that for importance subsampling without control variates. As the responses become more imbalanced, i.e., as
decreases, it can be seen that the efficiency of using control variates decreases relative to not using them.
Fig. 2.
(a) Gain in efficiency of using control variates over not using control variates for imbalanced responses. Auto-correlation function plots for (b) uniform subsampling and (c) importance subsampling in a high-dimensional sparse example.
4.3. High-dimensional sparse example
In this simulation we consider a challenging setting with number of dimensions
and number of observations
. The data are generated as in § 4.1 with
and
. Traditional data augmentation and subsampling algorithms are either computationally very expensive or mix slowly in such a scenario. We choose the prior variance to be 1; auto-correlation function plots for uniform subsampling and for importance subsampling are displayed in Fig. 2 panels (b) and (c), respectively. These show that it is necessary to use importance subsampling for the zig-zag sampler to be a feasible sampling method. From a practical point of view, adaptive preconditioning can be of further help; this is described in the Supplementary Material.
5. Real-data example
We consider an imbalanced set of data on cervical cancer (Fernandes et al., 2017) obtained from the UCI Machine Learning Repository (Dua & Graff, 2017). The dataset contains 858 observations with 34 predictors. The responses are whether or not an individual has cancer, with only 18 out of the 858 individuals having cancer. The predictors include the number of sexual partners, use of hormonal contraceptives, and other variables; more than half of the predictors have approximately 80% zeros. Fixing the number of bouncing attempts, the mixing times of the slowest-mixing component for uniform subsampling and for importance subsampling are
and
, respectively. A stratification scheme described in the Supplementary Material brings this time down to
.
6. Discussion
Subsampling for traditional Markov chain Monte Carlo schemes can be tricky, as the resulting chains induce an error in the invariant distribution that can be difficult to quantify. A promising class of recently developed algorithms, known as piecewise-deterministic Markov processes, allow subsampling without modifying the invariant measure. Nonuniform subsampling strategies can improve such algorithms relative to using uniform subsampling, especially for logistic regression with sparse covariate data; however, this can also be the case for other problems to which piecewise-deterministic Markov processes are applicable. After completion of this work, we became aware of a 2016 University of Oxford master’s thesis by Nicholas Galbraith, where a method called informed subsampling is introduced, which is similar to the importance subsampling strategy presented here. While aspects of sparsity in the covariate data are not addressed in that thesis, the author makes similar observations regarding the usefulness of the approach in the setting of covariate data with outliers or distributed according to heavy-tailed distributions.
Acknowledgement
This research was partially supported by the U.S. National Science Foundation and National Institutes of Health. We are grateful to the associate editor and two referees for helping us to improve the paper. Sen and Sachs contributed equally to this paper.
Supplementary material
Supplementary material available at Biometrika online includes more details on the zig-zag process with generalized subsampling, importance subsampling using control variates and associated numerical experiments, and stratified subsampling, as well as a proof of Lemma 1.
References
- Alquier, P., Friel, N., Everitt, R. & Boland, A. (2016). Noisy Monte Carlo: Convergence of Markov chains with approximate transition kernels. Statist. Comp. 26, 29–47.
- Andrieu, C., Durmus, A., Nüsken, N. & Roussel, J. (2019). Hypercoercivity of piecewise deterministic Markov process-Monte Carlo. arXiv: 1808.08592v2.
- Andrieu, C. & Roberts, G. O. (2009). The pseudo-marginal approach for efficient Monte Carlo computations. Ann. Statist. 37, 697–725.
- Bierkens, J., Fearnhead, P. & Roberts, G. (2019). The zig-zag process and super-efficient sampling for Bayesian analysis of big data. Ann. Statist. 47, 1288–320.
- Bouchard-Côté, A., Vollmer, S. J. & Doucet, A. (2018). The bouncy particle sampler: A nonreversible rejection-free Markov chain Monte Carlo method. J. Am. Statist. Assoc. 113, 855–67.
- Dua, D. & Graff, C. (2017). UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences, http://archive.ics.uci.edu/ml.
- Fearnhead, P., Bierkens, J., Pollock, M. & Roberts, G. O. (2018). Piecewise deterministic Markov processes for continuous-time Monte Carlo. Statist. Sci. 33, 386–412.
- Fernandes, K., Cardoso, J. S. & Fernandes, J. (2017). Transfer learning with partial observability applied to cervical cancer screening. In Iberian Conference on Pattern Recognition and Image Analysis. Cham, Switzerland: Springer, pp. 243–50.
- Fithian, W. & Hastie, T. (2014). Local case-control sampling: Efficient subsampling in imbalanced data sets. Ann. Statist. 42, 1693–724.
- Huggins, J., Campbell, T. & Broderick, T. (2016). Coresets for scalable Bayesian logistic regression. In NIPS’16: Proc. 30th Int. Conf. Neural Information Processing Systems. New York: Curran Associates, pp. 4080–8.
- Jacob, P. E. & Thiery, A. H. (2015). On nonnegative unbiased estimators. Ann. Statist. 43, 769–84.
- Johndrow, J. E., Mattingly, J. C., Mukherjee, S. & Dunson, D. (2017). Optimal approximating Markov chains for Bayesian inference. arXiv: 1508.03387v3.
- Johndrow, J. E., Smith, A., Pillai, N. & Dunson, D. B. (2019). MCMC for imbalanced categorical data. J. Am. Statist. Assoc. 114, 1394–403.
- Lewis, P. W. & Shedler, G. S. (1979). Simulation of nonhomogeneous Poisson processes by thinning. Naval Res. Logist. Quart. 26, 403–13.
- Li, C., Srivastava, S. & Dunson, D. B. (2017). Simple, scalable and accurate posterior interval estimation. Biometrika 104, 665–80.
- Polson, N. G., Scott, J. G. & Windle, J. (2013). Bayesian inference for logistic models using Pólya–Gamma latent variables. J. Am. Statist. Assoc. 108, 1339–49.
- Srivastava, S., Li, C. & Dunson, D. B. (2018). Scalable Bayes via barycenter in Wasserstein space. J. Mach. Learn. Res. 19, 312–46.
- Ting, D. & Brochu, E. (2018). Optimal subsampling with influence functions. In NIPS’18: Proc. 32nd Int. Conf. Neural Information Processing Systems. New York: Curran Associates, pp. 3650–9.
- Welling, M. & Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In ICML’11: Proc. 28th Int. Conf. Machine Learning. Madison, Wisconsin: Omnipress, pp. 681–8.