Author manuscript; available in PMC: 2013 Jun 3.
Published in final edited form as: Bayesian Anal. 2012 Dec 1;7(4):813–840. doi: 10.1214/12-BA727

Nonparametric Bayesian Segmentation of a Multivariate Inhomogeneous Space-Time Poisson Process

Mingtao Ding, Lihan He, David Dunson, Lawrence Carin
PMCID: PMC3670617  NIHMSID: NIHMS444983  PMID: 23741284

Abstract

A nonparametric Bayesian model is proposed for segmenting time-evolving multivariate spatial point process data. An inhomogeneous Poisson process is assumed, with a logistic stick-breaking process (LSBP) used to encourage piecewise-constant spatial Poisson intensities. The LSBP explicitly favors spatially contiguous segments, and infers the number of segments based on the observed data. The temporal dynamics of the segmentation and of the Poisson intensities are modeled with exponential correlation in time, implemented in the form of a first-order autoregressive model for uniformly sampled discrete data, and via a Gaussian process with an exponential kernel for general temporal sampling. We consider and compare two different inference techniques: a Markov chain Monte Carlo sampler, which has relatively high computational complexity; and an approximate and efficient variational Bayesian analysis. The model is demonstrated with a simulated example and a real example of space-time crime events in Cincinnati, Ohio, USA.

Keywords: Bayesian hierarchical model, spatial segmentation, temporal dynamics, Gaussian process, logistic stick breaking process, inhomogeneous Poisson process

1 Introduction

1.1 Motivating application

Assume access to the locations of various types of crimes occurring in a given city, as a function of time. As a motivating example, in Figure 1(a) data are shown for 3090 crimes (of 17 crime types) in Cincinnati in Jan 2008. Our focus is on obtaining a spatial segmentation, such as that shown in Figure 1(b). In addition to the spatial dependence of point process data, we wish to simultaneously explore time dynamics. For example, in the crime data analysis, the crime intensity in summer may be different statistically from that in winter, and this intensity may change smoothly over seasons; consequently, the spatial segmentation of the city may also vary smoothly over time.

Figure 1.

Crime events and the segmentation of the city. In (a) 3090 crime events are shown as black dots; in (b) each color indexes a segment with associated crime intensities in 17 crime types (see Section 4 for details).

The analysis of time dynamics helps to discover the temporal pattern of the events and to predict the spatial segmentation at an unobserved time instance or in the future. We desire that the analysis provide a simple summary that is useful to police forces and city planners in targeting resources, as well as to researchers in studying crime trends. We would like to obtain this space-time segmentation quickly, utilizing data from different types of events, while allowing temporal interpolation and forecasting.

1.2 Summary of proposed model

Consider the data 𝒟 = {si, vit}i=1, …, M, t=1, …, T, where vit is a d-dimensional vector of the counts of d types of events, occurring in a (small) spatial region Δ(si), with the center of the region being si ∈ ℝ2. In the context of Figure 1, we are interested in d types of crime. The contiguous grid of spatial regions Δ(·) is fixed in advance, and the size of Δ(·) is very small relative to the size of the entire spatial domain, providing justification for an approximation in which we index regions by the center point and assume homogeneity within regions (using the model developed below, in the limit Δ → 0 we have a Poisson process). There are T time points at which data are observed, not necessarily uniformly spaced in time. Although not done here, one may envision aligning the grid Δ(·) with the geometry of the terrain (e.g., roads).

The proposed space-time model may be summarized as

$$v_{it} \sim \prod_{j=1}^{d} \mathrm{Poisson}(\lambda_{ijt}), \qquad \lambda_{it} \sim \sum_{k=1}^{K} w_k(s_i;\theta_{kt})\,\delta_{\lambda_{kt}^*} \tag{1}$$

where $w_k(s_i;\theta_{kt}) \ge 0$ and $\sum_{k=1}^{K} w_k(s_i;\theta_{kt}) = 1$ for all $s_i$, $\delta_{\lambda_{kt}^*}$ is a unit measure concentrated at $\lambda_{kt}^*$, and $\lambda_{ijt}$ is the $j$th component of $\lambda_{it}$. This corresponds to a mixture model, with space-time varying mixture weights $w_k(s_i;\theta_{kt})$ and time-varying atoms $\delta_{\lambda_{kt}^*}$.

Expression $w_k(s;\theta_{kt})$ represents a general parametric function capable of modeling the probability of cluster $k$ at spatial location $s$. In the details of the proposed model, one of the $\{w_k(s;\theta_{kt})\}_{k=1,\dots,K}$ is likely to be dominant (large probability) over a contiguous region, yielding a segmentation. Since the parameters $\theta_{kt}$ change in general with time $t$, a probabilistic space-time segmentation is manifested. Within the proposed model, the prior encourages that $\{\theta_{kt}\}$ and $\lambda^*_{kt}$ vary smoothly as a function of time, and hence the model imposes smooth space-time variation in the shape/form of the segments, and smooth temporal variation of the Poisson rates associated with a given segment.

Two methods are considered for imposing temporal smoothness, representing two perspectives on imposing the same temporal structure. For discrete-time data with uniform temporal spacing, it is natural to consider the first-order autoregressive model, i.e., AR(1), as $\theta_{kpt} \sim \mathcal{N}(\zeta\,\theta_{kp(t-1)}, \alpha_0^{-1})$, with $\theta_{kpt}$ the $p$th component of $\theta_{kt}$, $\zeta$ the AR(1) coefficient (with $|\zeta| < 1$), and $\alpha_0$ a precision to be inferred ($\zeta$ and $\alpha_0$ could also be extended to depend on $k$ and $p$). The log of each component of $\lambda^*_{kt}$ may be similarly modeled.

We also consider a Gaussian process (GP) model (Rasmussen and Williams 2006) in time for each component $\theta_{kpt}$, and for the log of each component of $\lambda^*_{kt}$, thus allowing non-uniform temporal sampling. To make the AR(1) and GP models consistent, we assume an exponential model for the GP covariance between times $t_i$ and $t_l$, $c_0 c_1^{|t_i - t_l|}$, with $c_1$ playing a role analogous to $\zeta$ in the AR(1) model, and the variance $c_0$ corresponding to $[(1-\zeta^2)\alpha_0]^{-1}$ from the AR(1) model. The AR(1) and chosen GP representations are therefore essentially different means of imposing the same temporal prior, with the former restricted to uniform temporal sampling.
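
This equivalence is easy to check numerically. The following sketch (ours, not from the paper) simulates many stationary AR(1) chains and compares their empirical covariance to the exponential GP kernel with $c_1 = \zeta$ and $c_0 = [(1-\zeta^2)\alpha_0]^{-1}$:

    import numpy as np

    # Sketch (ours): a stationary AR(1) chain theta_t = zeta*theta_{t-1} + eps_t,
    # eps_t ~ N(0, 1/alpha0), has covariance c0 * c1**|ti - tl| with
    # c1 = zeta and c0 = 1 / ((1 - zeta**2) * alpha0).
    rng = np.random.default_rng(0)
    zeta, alpha0, T, n_chains = 0.8, 4.0, 10, 200_000
    c0 = 1.0 / ((1.0 - zeta**2) * alpha0)

    theta = rng.normal(0.0, np.sqrt(c0), size=n_chains)  # start at stationarity
    chain = np.empty((T, n_chains))
    for t in range(T):
        theta = zeta * theta + rng.normal(0.0, np.sqrt(1.0 / alpha0), size=n_chains)
        chain[t] = theta

    ti = np.arange(T)
    gp_cov = c0 * zeta ** np.abs(ti[:, None] - ti[None, :])  # exponential GP kernel
    print(np.abs(np.cov(chain) - gp_cov).max())              # ~0 up to Monte Carlo error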

In addition to developing a new model for multivariate inhomogeneous space-time Poisson process data, a contribution of this paper concerns computations, in the form of a detailed comparison of Markov chain Monte Carlo (MCMC) and variational Bayesian (VB) inference for this class of models. The former is widely used, but it can be computationally prohibitive for the motivating large-scale problems considered here. Computations based on VB are attractive for large-scale modeling studies, but many simplifying assumptions must be made.

1.3 Related research

A natural model for exploiting spatial information, and for modeling point process data, is the inhomogeneous Poisson process (Diggle 2003; Møller and Waagepetersen 2004). Researchers have recently studied nonparametric Bayesian approaches for such applications. One of these approaches models the Poisson intensity function by a variation of a Gaussian process (GP) (Adams et al. 2009; Rathbun and Cressie 1994; Møller et al. 1998). The log-Gaussian Cox process (Møller et al. 1998), corresponding to an intensity function modeled as an exponentiated GP, has proven highly successful in point process (Hossain and Lawson 2009) and geostatistical modeling (Diggle et al. 2010; Pati et al. 2010). Mixture models provide another approach to representing the Poisson intensity function (Wolpert and Ickstadt 1998). Kottas and Sansó (2007) proposed a Dirichlet process (DP) mixture model of bivariate beta densities to model heterogeneity in intensity functions. Dirichlet process mixture models of multivariate normal densities can also be found in (Ji et al. 2009; Chakraborty and Gelfand 2010).

In Taddy (2008, 2010); Taddy and Kottas (2012) a dynamic model was proposed for Poisson point processes, based on a novel version of the dependent Dirichlet process. Models of this type have been applied to the data considered in Figure 1, although the problem of segmentation was not considered. In Achcar et al. (2011) a time inhomogeneous Poisson model was proposed, with change-points to estimate the number of times that a given environmental standard is violated in a time interval of interest.

Rather than modeling the Poisson intensity via a GP or a DP mixture model, the model in (1) constitutes a mixture model with space-time mixture weights, and the spatial locations $\{s_i\}$ of the grid are modeled as covariates. The details of how $w_k(s;\theta_{kt})$ is modeled encourage contiguous regions in space and time for which a single component (cluster) dominates, yielding a piecewise-constant Poisson intensity function. In Heikkinen and Arjas (1998) the authors similarly build a piecewise-constant prior model for spatial Poisson intensities, using Voronoi tessellations. We model $w_k(s;\theta_{kt})$ via an extension of the logistic stick-breaking process (LSBP) (Ren et al. 2011). The region of interest is partitioned into a set of contiguous small square cells, with related ideas considered in Hossain and Lawson (2009). Within the context of the aforementioned GP construction for the temporal dependence of $\theta_{kt}$, related ideas were presented in the context of factor analysis (Luttinen and Ilin 2009), where GPs were used to describe smoothness in both space and time. An AR model for temporal dynamics was considered in Taddy (2008, 2010).

2 Model Details

2.1 Basic construction

The proposed space-time model for data 𝒟 = {si, vit}i=1, …, M, t=1, …, T is summarized as

$$v_{it} \sim \prod_{j=1}^{d} \mathrm{Poisson}(\lambda_{ijt}), \qquad \lambda_{it} \sim \sum_{k=1}^{K} w_k(s_{it})\,\delta_{\lambda_{kt}^*} \tag{2}$$
$$w_k(s_{it}) = p_k(s_{it}) \prod_{h=1}^{k-1}\left[1 - p_h(s_{it})\right] \tag{3}$$
$$p_k(s_{it}) = \sigma(g_k(s_{it})) \ \ \text{for}\ \ k = 1, \dots, K-1, \qquad p_K(s_{it}) = 1 \tag{4}$$
$$g_k(s_{it}) = \sum_{j=1}^{J} \beta_{kjt}\,\mathcal{K}(s_{it}, \hat{s}_j; \psi_k) + \beta_{k0t} \tag{5}$$

where (2) is repeated here from (1), for convenience. Below we explain and motivate each term in this construction. Parameters $\theta_{kt}$ from the Introduction correspond here to $\{\beta_{kjt}\}_{j=0,\dots,J}$ and $\psi_k$. In what follows, the notation $s_{it}$ is meant to assign statistics to spatial location $s_i$ at time $t$; for example, $w_k(s_{it})$ is the $k$th mixture weight as observed at $s_i$ and time $t$. The spatial grid defining the regions $\{\Delta(s_i)\}_{i=1,\dots,M}$ does not change with time.

The expression in (3), with $p_k(s_{it}) \in [0,1]$ for all $s_{it}$, is suggestive of the stick-breaking representation of the Dirichlet process (Sethuraman 1994). The function $\sigma(x) = \exp(x)/(1+\exp(x))$ is associated with a logistic model, and $p_K(s_{it}) = 1$ such that $\sum_{k=1}^{K} w_k(s_{it}) = 1$ for all $s_{it}$. By the construction of $g_k(s_{it})$ in (5), the probabilities $p_k(s_{it})$ have space-time variation, with such variation transferred to the mixture weights $w_k(s_{it})$ via (3). Therefore, via the mixture weights $w_k(s_{it})$ in (2) we constitute a multivariate Poisson mixture model, with weights that vary as a function of $s_{it}$.

Function $\mathcal{K}(s, \hat{s}_j; \psi_k)$ denotes a kernel with parameter $\psi_k$. Here we employ the radial basis function $\mathcal{K}(s, \hat{s}_j; \psi_k) = \exp(-\|s - \hat{s}_j\|_2^2/\psi_k)$, with $J$ predefined kernel centers $\{\hat{s}_j\}_{j=1,\dots,J}$; for convenience these $J$ centers are here aligned with the centers of the spatial grid defined by $\Delta(\cdot)$ (recall discussion in the Introduction). The appropriate kernel parameters $\{\psi_k\}$ will be inferred. To ease computations, we assume a discrete set of parameters $\{\psi_1^*, \dots, \psi_L^*\}$ over which a uniform prior is placed; each kernel parameter $\psi_k$ is assumed drawn from this finite library of parameters.
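
To make the construction in (2)-(5) concrete, the following sketch (our own illustration; function and variable names are not from the paper) computes the mixture weights $w_k(s)$ at a set of locations from given kernel weights and widths:

    import numpy as np

    def lsbp_weights(s, centers, beta, beta0, psi):
        """Illustrative sketch of (3)-(5): map locations s (N x 2) to mixture
        weights (N x K) via RBF-kernel logistic regressions. beta: K x J kernel
        weights, beta0: length-K intercepts, psi: length-K kernel widths.
        The last probability is fixed to 1 so the weights sum to one."""
        K = beta.shape[0]
        # squared distances between the N locations and the J kernel centers
        d2 = ((s[:, None, :] - centers[None, :, :]) ** 2).sum(-1)       # N x J
        p = np.empty((s.shape[0], K))
        for k in range(K - 1):
            g = (np.exp(-d2 / psi[k]) * beta[k]).sum(-1) + beta0[k]     # eq. (5)
            p[:, k] = 1.0 / (1.0 + np.exp(-g))                          # eq. (4)
        p[:, K - 1] = 1.0                                               # p_K = 1
        stick = np.cumprod(np.hstack([np.ones((s.shape[0], 1)),
                                      1.0 - p[:, :-1]]), axis=1)
        return p * stick                                                # eq. (3)

    # toy usage: the weights at random 2-D locations sum to one
    rng = np.random.default_rng(1)
    w = lsbp_weights(rng.uniform(0, 1, (5, 2)), rng.uniform(0, 1, (4, 2)),
                     rng.normal(0, 3, (3, 4)), rng.normal(0, 1, 3), np.full(3, 0.2))
    print(w.sum(axis=1))  # all ones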

The space-time dependence of the model is manifested in how $\{\beta_{kjt}\}_{j=0,\dots,J}$ and $\{\lambda^*_{kt}\}$ are modeled.

2.2 Temporal modeling

When the data are sampled uniformly in time, an autoregressive (AR) temporal model is natural. Specifically, we consider

$$\beta_{kjt} \sim \mathcal{N}(\zeta\,\beta_{kj(t-1)}, \alpha_\beta^{-1}), \quad j = 0, \dots, J \tag{6}$$
$$\log\lambda^*_{kjt} \sim \mathcal{N}(\xi\,\log\lambda^*_{kj(t-1)}, \alpha_\lambda^{-1}), \quad j = 1, \dots, d \tag{7}$$

with $\beta_{kj0} = \log\lambda^*_{kj0} = 0$. Gamma priors are placed on $\alpha_\beta$ and $\alpha_\lambda$. Further, $\zeta$ and $\xi$ are drawn from a standard normal truncated to the unit interval, $\mathcal{N}_{(0,1)}(0,1)$, so that $0 < \zeta, \xi < 1$.

The collection of data may be expensive, and there may be situations for which nonuniform temporal sampling is desired (e.g., to provide fine-scale sampling in particular regions – seasons – of time that may be interesting). This suggests using a Gaussian process (GP) model (Rasmussen and Williams 2006) for the temporal variation of $\beta_{kjt}$ and $\log\lambda^*_{kjt}$.

For the kth mixture component, we let

$$B_k \sim \mathcal{N}(B_k \mid 0, \Omega_k) = \prod_{j=0}^{J} \mathcal{N}(\beta_{kj:} \mid 0, \Sigma_{kj}), \qquad [\Sigma_{kj}]_{il} = c_0 c_1^{|t_i - t_l|} \tag{8}$$

where $\beta_{kj:} = [\beta_{kj1}, \dots, \beta_{kjT}]^T$, and $B_k \in \mathbb{R}^{T(J+1)}$ denotes a vector formed by concatenating $\beta_{kj:}$ for $j = 0, \dots, J$. The covariance $\Omega_k$ is a block-diagonal matrix of size $T(J+1) \times T(J+1)$, and each block $\Sigma_{kj}$ is a $T \times T$ covariance matrix; the entry at row $i$ and column $l$, denoted $[\Sigma_{kj}]_{il}$, is evaluated using the GP covariance function with the hyperparameters $\{c_0, c_1\}$. A gamma prior is placed on $c_0$. Since $c_1$ plays the same role as $\zeta$, we also draw $c_1$ from the truncated normal $\mathcal{N}_{(0,1)}(0,1)$, so that $0 < c_1 < 1$.

Gaussian process priors are also placed on $\log\lambda^*_{kjt}$. For mixture component $k$,

$$\log(\lambda^*_{kj:}) \sim \mathcal{N}(0, \Gamma_{kj}), \qquad [\Gamma_{kj}]_{il} = d_0 d_1^{|t_i - t_l|} \tag{9}$$

where $\log(\lambda^*_{kj:}) = [\log(\lambda^*_{kj1}), \dots, \log(\lambda^*_{kjT})]^T$, and the covariance matrix $\Gamma_{kj} \in \mathbb{R}^{T \times T}$ has entries defined by the GP covariance function with the hyperparameters $\{d_0, d_1\}$. A gamma prior is placed on $d_0$ and a truncated normal prior on $d_1$. As discussed in the Introduction, the considered AR(1) and GP priors are consistent, and provide different modeling strategies for the same imposed temporal dynamics.

2.3 Model interpretation

Equations (3)-(5) are of the form of the logistic stick-breaking process (LSBP) introduced in Ren et al. (2011); however, that paper did not consider Poisson data, and space-time processes were not addressed. Recall that $\sigma(x) \approx 1$ for $x > 4$; we refer to this as the "clipping" property of the logistic, as all $x$ larger than about 4 contribute effectively in the same manner to $\sigma(x)$; one may alternatively use a probit model, to achieve the same end. If $\beta_{kjt} > 4$, then $p_k(s) \approx 1$ for $\|s - \hat{s}_j\|_2^2 < \psi_k$. This implies via (3) that within the region $\|s - \hat{s}_j\|_2^2 < \psi_k$, if $\beta_{kjt} > 4$ mixture component $k$ is highly probable (assuming that other clusters $k' \ne k$ do not have large $p_{k'}(s)$ in the vicinity of $\hat{s}_j$). The "clipping" nature of the logistic function, and large values of $\beta_{kjt} > 4$, encourage contiguous regions for which a given cluster $k$ has high space-time probability of being manifested (all locations $s$ at which $g_k(s) > 4$ have similarly high probability of being associated with cluster $k$, regardless of the exact value of $g_k(s)$). The weights $\{\beta_{kjt}\}$ play the role of assigning which regions in space-time are most likely to be associated with a given cluster $k$, and $\psi_k$ defines the size scale of the cluster. Note that while we truncate the model to $K$ mixture components, this does not mean that all components need actually be used to represent the data. For example, if a given $\beta_{k0t}$ is large and negative, then the $k$th mixture component is unlikely to be utilized at all spatial locations at time $t$; $K$ is simply an upper bound on the number of mixture components (segment types).
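
The clipping effect is easy to see numerically; the short check below (ours) shows how quickly the logistic saturates:

    import numpy as np

    # Illustration of the "clipping" property: logistic outputs for inputs above
    # about 4 are all essentially 1, so any location with g_k(s) > 4 is assigned
    # to segment k with near-certainty regardless of the exact value of g_k(s).
    sigma = lambda x: 1.0 / (1.0 + np.exp(-x))
    for x in [2.0, 4.0, 6.0, 10.0]:
        print(x, sigma(x))   # 0.881, 0.982, 0.998, 0.99995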

3 Posterior Inference

The posterior distribution of the model parameters is inferred via an MCMC sampler and via variational Bayesian (VB) inference (Beal 2003). The VB inference typically converges quickly and is computationally efficient; by contrast, MCMC convergence may be difficult to diagnose, and a large number of iterations is required to collect samples representing the joint posterior distribution. The detailed MCMC and VB update equations are provided in the Appendix (we provide equations for the GP model, with minor changes manifested for the AR case). Since VB analysis is not as widely used in the statistics literature, for completeness we provide details on its modeling assumptions.

Let Θ represent a vector of all model parameters; the goal is to infer the posterior p(Θ|𝒟). The likelihood of the data is represented by p(𝒟|Θ) and the prior on the model parameters is denoted by p(Θ). Let q(Θ; Γ) be a parametric distribution with hyperparameters Γ, and consider the variational expression

$$\mathcal{F}(\Gamma) = \int d\Theta\, q(\Theta;\Gamma)\,\ln\frac{q(\Theta;\Gamma)}{p(\mathcal{D}\mid\Theta)\,p(\Theta)} = D_{KL}\left[q(\Theta;\Gamma)\,\|\,p(\Theta\mid\mathcal{D})\right] - \ln p(\mathcal{D}). \tag{10}$$

In VB analysis the goal is to optimize the hyperparameters $\Gamma$ to minimize the Kullback-Leibler divergence between $q(\Theta;\Gamma)$ and the true posterior $p(\Theta\mid\mathcal{D})$; this corresponds to adjusting $\Gamma$ in $q(\Theta;\Gamma)$ such that $\mathcal{F}(\Gamma)$ is minimized. Note that $\int d\Theta\, q(\Theta;\Gamma)\ln\frac{q(\Theta;\Gamma)}{p(\mathcal{D}\mid\Theta)p(\Theta)}$ is only a function of the likelihood $p(\mathcal{D}\mid\Theta)$ and the prior $p(\Theta)$, and not the unknown posterior; with careful selection of $q(\Theta;\Gamma)$, numerical techniques akin to expectation-maximization (EM) (Beal 2003) can be employed to minimize $\mathcal{F}(\Gamma)$, with assurance of convergence to a locally optimal solution.

Focusing on the GP temporal model (the AR case is very similar), the model parameters are

$$\Theta = \left\{\{\lambda^*_{kj:}\}_{j=1,\dots,d;\, k=1,\dots,K},\ \{B_k\}_{k=1,\dots,K},\ \{Z_k(s_{it})\}_{t=1,\dots,T;\, i=1,\dots,M;\, k=1,\dots,K},\ c_0, c_1, d_0, d_1\right\} \tag{11}$$

where $Z_k(s_{it}) \sim \mathrm{Bernoulli}(p_k(s_{it}))$, with $p_k(s_{it})$ defined in (4). Completing the generative process, $v_{it} \sim \prod_{j=1}^{d}\mathrm{Poisson}(\lambda^*_{\ell jt})$ if $Z_k(s_{it}) = 0$ for $k < \ell$ and $Z_\ell(s_{it}) = 1$; $\lambda^*_{\ell jt}$ is the $j$th component of vector $\lambda^*_{\ell t}$.

In VB one typically assumes a factorized form for $q(\Theta;\Gamma)$, i.e., $q(\Theta;\Gamma) = \prod_l q_l(\Theta_l;\Gamma_l)$, where $\Theta_l$ represents the $l$th set of model parameters and $q_l(\Theta_l;\Gamma_l)$ is a parametric density function with hyperparameters $\Gamma_l$; the union of all $\Theta_l$ corresponds to $\Theta$. Through careful selection of $q_l(\Theta_l;\Gamma_l)$ one may iteratively optimize the variational expression $\mathcal{F}(\Gamma)$.
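
For factors with conjugate structure, each iteration uses the standard mean-field coordinate update (Beal 2003), stated here for reference:

$$q_l(\Theta_l;\Gamma_l) \propto \exp\left\{\mathbb{E}_{\prod_{m \ne l} q_m}\left[\ln\left(p(\mathcal{D}\mid\Theta)\,p(\Theta)\right)\right]\right\},$$

with each such update monotonically decreasing $\mathcal{F}(\Gamma)$; the non-conjugate factors in our model are instead handled by the point estimates described next.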

For the proposed model, $q(B_k)$ is a multivariate normal distribution, $q(Z_k(s_{it}))$ is Bernoulli (with Bernoulli probability defined by a logistic function), $q(\psi_k)$ is multinomial based upon a finite library of possible parameters $\{\psi^*_l\}_{l=1,\dots,L}$, and $q(c_0)$ and $q(d_0)$ are gamma distributions. It is not possible to define a $q(\lambda^*_{kj:})$ that yields closed-form updates. Therefore, the parameters $\lambda^*_{kj:}$ within the VB analysis are approximated at each iteration via a point estimate that maximizes the functional $\mathcal{F}(\Gamma)$. Similarly, $q(c_1)$ and $q(d_1)$ cannot be obtained in closed form; the parameters $c_1$ and $d_1$ are updated at each VB iteration with point estimates that maximize the functional $\mathcal{F}(\Gamma)$.

4 Example Results

While the proposed model may appear relatively complicated, the number of hyperparameters that need be set is actually modest. We compare the AR-LSBP and GP-LSBP models for imposing a prior on the temporal dependence with a simpler model in which the priors for each time point t are independent. In the context of this independent LSBP (ind-LSBP), we impose

$$\beta_{kjt} \sim \mathcal{N}(0, \alpha_{kjt}^{-1}), \qquad \alpha_{kjt} \sim \mathrm{Gamma}(a_0, b_0) \tag{12}$$

and we set $a_0 = b_0 = 10^{-6}$ as in the relevance vector machine (RVM) (Tipping 2001). The same gamma priors are placed on $\alpha_\beta$ and $\alpha_\lambda$ for the AR-LSBP model, and on $c_0$ and $d_0$ for the GP-LSBP model. In all examples the truncation level on the LSBP was set at $K = 20$, and the results are insensitive to this parameter, as long as it is large relative to the actual number of clusters/segments inferred by the model. Finally, we must specify the library of kernel parameters $\{\psi^*_l\}_{l=1,\dots,L}$; the manner in which these are specified is discussed when presenting the specific examples.

For uniform temporal sampling, the AR(1) and GP impositions of temporal dynamics are theoretically identical, for the imposed GP covariance. Nevertheless, even for uniform temporal sampling we show results for both implementations, because the details of the numerics dictate that the two models are slightly different in practice. Specifically, within the GP model a point estimate is employed for the kernel hyperparameters, which is unnecessary for the direct AR(1) model. The comparison allows examination of the accuracy of this approximation within the GP inference, relative to the direct AR(1) implementation; this sheds light on the quality of the computations for non-uniform temporal sampling, where the GP implementation is required.

4.1 Simulation Example

We assume the data are observed at a total of 9 equally spaced time instances, $t = 1, 2, \dots, 9$. At each time we randomly draw 50 spatial locations in one-dimensional space from a uniform distribution with support $[0, 20]$, denoted $s_{it} \sim \mathrm{Uniform}[0, 20]$, $i = 1, \dots, 50$, $t = 1, \dots, 9$. For each location, we draw an event count $v_{it}$ from a Poisson distribution with intensity parameter $\lambda_{it}$. To represent the time dynamics, we let $\lambda_{it} = 20$ when $5 + \frac{5}{8}(t-1) \le s_{it} \le 10 + \frac{5}{8}(t-1)$, and $\lambda_{it} = 1$ otherwise. By this setting the high-intensity window moves gradually from $[5, 10]$ to $[10, 15]$ as time $t$ increases. Note that here $s_{it} \in \mathbb{R}^1$ and $v_{it} \in \mathbb{R}^1$. The kernel centers are defined as $\hat{s}_j = 0.5(j-1)$ for $j = 1, \dots, J$. The data are depicted in Figure 2. Within the analysis, the library of kernel parameters is the union of the following two sets: $\{0.05, 0.1, 0.15, \dots, 0.5\}$ and $\{0.5, 1, 1.5, \dots, 5\}$.
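
The following is a short sketch (ours) of this generating process:

    import numpy as np

    # Sketch of the simulated-data generation described above (our reading).
    rng = np.random.default_rng(2)
    T, N = 9, 50
    s = rng.uniform(0.0, 20.0, size=(T, N))            # s_it ~ Uniform[0, 20]
    lo = 5.0 + 5.0 / 8.0 * np.arange(T)                # window start moves 5 -> 10
    lam = np.where((s >= lo[:, None]) & (s <= lo[:, None] + 5.0), 20.0, 1.0)
    v = rng.poisson(lam)                               # event counts v_it, shape (T, N)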

Figure 2.

Simulation example. The high-intensity window moves gradually from [5, 10] to [10, 15] when time increases.

The mean results from VB are shown in Figure 3, in which the inferred Poisson rate is displayed; for these and all VB results the computations were stopped when the relative change in the variational bound fell below $10^{-4}$. Further, all VB results are initialized at random. The VB results presented below represent a locally optimal solution, which forms one source of error, and this is compounded by the factorized approximation to the posterior. Nevertheless, the VB implementation of the GP-LSBP and AR-LSBP models yields results comparable to those of the MCMC implementation. When implementing MCMC, a total of 10,000 iterations are run, with the first 1000 discarded as burn-in. On the same PC (and both codes written in Matlab), the VB GP-LSBP and AR-LSBP results required approximately 158 seconds of CPU time, while the VB ind-LSBP results required approximately 96 seconds. In contrast, the GP-LSBP and AR-LSBP results based on the MCMC sampler required 6517 seconds, and ind-LSBP required 2913 seconds (109 and 48 minutes, respectively). The software was not optimized, and these numbers therefore represent a relative view of the computational expense of the VB and MCMC solutions.

Figure 3.

Segmentation and latent intensity inferred by VB: Comparison between GP-LSBP and ind-LSBP, considering the simulated-data example. The AR-LSBP results are similar to the GP-LSBP results, and are omitted for brevity.

From Figure 3 it is observed that, for the VB solution, incorporation of temporal smoothness in the GP-LSBP model yields significant improvements in the inferred Poisson rate, as compared to the VB ind-LSBP solution (with temporal dependence not accounted for in the prior); the AR-LSBP model performed similarly to GP-LSBP. It appears that the prior constraint imposed by GP/AR within the VB solution plays an important role in mitigating the underlying VB approximations. By contrast, for the MCMC results improvements are manifested via GP-LSBP and AR-LSBP relative to ind-LSBP, but in this case the differences are less dramatic (plots of MCMC results are not shown, for brevity).

We next examine the generative performance of the proposed model. After the model has been learned, either via VB or MCMC, we randomly generate 100 new test data, following the same procedure that generated the training data. We then compute the average log-likelihood and the accuracy rate of segmentation from the learned GP-LSBP, AR-LSBP and ind-LSBP models. The accuracy rate of segmentation is defined as the number of test data points segmented correctly as a fraction of the total number of test data points. The results are summarized in Table 1. We find that the GP-LSBP and AR-LSBP models achieve a higher likelihood and accuracy of segmentation compared to the ind-LSBP. Note that the differences between GP-LSBP, AR-LSBP and ind-LSBP are relatively modest for the MCMC solution, while there are again marked advantages in the GP-LSBP and AR-LSBP solutions relative to ind-LSBP when employing VB inference.

Table 1.

Comparison of generative performance between AR-LSBP, GP-LSBP and ind-LSBP, on simulated data.

Method      Average log-likelihood        Accuracy rate of segmentation
            VB          MCMC              VB          MCMC
AR-LSBP     −3.702      −1.749            0.9796      0.9801
GP-LSBP     −3.882      −2.082            0.9765      0.9757
ind-LSBP    −15.544     −2.274            0.9478      0.9741

Finally we test the prediction performance of the model. We first generate data $\mathcal{D} = \{s_i, v_{it}\}_{i=1,\dots,50,\ t=1,\dots,9}$ as discussed above, and then randomly select $N_{miss}$ time instances $\hat{t}_1, \dots, \hat{t}_{N_{miss}}$ from $t = 1, \dots, 9$; these constitute our test data $\mathcal{D}_{tst}$, and the training data $\mathcal{D}_{trn}$ are composed of the data in $\mathcal{D}$ but not in $\mathcal{D}_{tst}$. We learn the model via VB or MCMC analysis with $\mathcal{D}_{trn}$, and predict the kernel weights $\hat{\beta}_{kj\hat{t}}$ and Poisson intensities $\hat{\lambda}^*_{k\hat{t}}$ at time $\hat{t}$. The average log-likelihood and accuracy of segmentation are evaluated based on the prediction results for $\mathcal{D}_{tst}$, given only the spatial locations $s_{i\hat{t}}$. We perform 100 trials, and at each trial $N_{miss}$ time instances are selected randomly to construct $\mathcal{D}_{tst}$. The average results are shown in Table 2.

Table 2.

Comparison of prediction performance between AR-LSBP, GP-LSBP and ind-LSBP.

Nmiss    Average log-likelihood
         AR-LSBP              GP-LSBP              ind-LSBP
         VB        MCMC       VB        MCMC       VB         MCMC
1        −3.948    −1.975     −4.102    −2.123     −21.194    −2.641
2        −4.211    −2.241     −4.526    −2.473     −27.195    −3.077
3        −4.468    −2.573     −4.718    −2.652     −27.776    −3.507
4        −4.882    −2.740     −5.133    −3.108     −26.682    −3.963
5        −5.801    −3.014     −5.987    −3.521     −31.217    −4.316

Nmiss    Accuracy rate of segmentation
         AR-LSBP              GP-LSBP              ind-LSBP
         VB        MCMC       VB        MCMC       VB         MCMC
1        0.9792    0.9794     0.9767    0.9758     0.7165     0.9545
2        0.9787    0.9786     0.9761    0.9754     0.6669     0.9581
3        0.9787    0.9785     0.9763    0.9752     0.6458     0.9379
4        0.9780    0.9783     0.9752    0.9740     0.6647     0.9274
5        0.9763    0.9770     0.9741    0.9633     0.6131     0.9066

Only the GP-LSBP results are fully principled in this analysis, where we use the learned parameters of the GP covariance matrix to interpolate to new time points (Rasmussen and Williams 2006). The AR model implicitly assumes that the data are sampled uniformly in time, while the ind-LSBP has no principled means of interpolating to missing time points. Nevertheless, as a comparison, for the AR-LSBP computations in this test the AR component was simply applied to consecutive observed time points, essentially assuming that the temporal variation was smooth, even if not sampled uniformly. To interpolate the learned AR-LSBP and ind-LSBP results to a new point $\hat{t}$, we average the learned model parameters from the two closest observed points, before and after $\hat{t}$. From Table 2 it is observed that, again for the VB solution, there is a marked advantage manifested via the GP-LSBP and AR-LSBP priors, as compared to ind-LSBP. For the MCMC solution, there is also a noticeable advantage manifested via the GP-LSBP and AR-LSBP solutions, particularly in segmentation accuracy for relatively large $N_{miss}$. Based upon the average log-likelihood, we note a small but consistent advantage of the AR-LSBP model over its GP-LSBP counterpart, for both VB and MCMC computations. This observation on simulated data carries over to the analysis of real data.
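
For the GP-LSBP prediction, interpolation to a held-out time reduces to the standard GP conditional mean (Rasmussen and Williams 2006, Ch. 2). A minimal sketch, with variable names of our own choosing:

    import numpy as np

    def gp_interpolate(t_obs, f_obs, t_new, c0, c1, jitter=1e-8):
        """Conditional mean of a zero-mean GP with covariance c0 * c1**|ti - tl|;
        f_obs holds e.g. a learned beta_kj: trajectory at the observed times."""
        K = c0 * c1 ** np.abs(t_obs[:, None] - t_obs[None, :])      # train covariance
        k_star = c0 * c1 ** np.abs(t_new[:, None] - t_obs[None, :]) # cross covariance
        return k_star @ np.linalg.solve(K + jitter * np.eye(len(t_obs)), f_obs)

    # toy usage: interpolate a smooth trajectory to a missing time point
    t = np.array([1.0, 2.0, 3.0, 5.0, 6.0])
    f = np.sin(0.5 * t)
    print(gp_interpolate(t, f, np.array([4.0]), c0=1.0, c1=0.7))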

4.2 Crime Data

We investigate crime events in Cincinnati, Ohio, USA; the data are available online at http://www.cincinnati-oh.gov. The data include the date, time, location and other information of all reported crimes in Cincinnati since 2006. This data set was first studied in Taddy (2008, 2010), where a mixture of beta distributions was employed to model the event density ν(s), and to discover the evolution of the density with time. In our problem we seek to segment the city into contiguous regions, with crime events at each region characterized by a common constant Poisson intensity vector.

We consider 117,314 crime events within the city, reported from January 2006 to December 2008. Each crime is assigned a uniform crime reporting (UCR) code; in total, more than 170 different UCR codes describe a variety of crimes. These crime events can be categorized into 17 different crime types, based on the prefix of their UCR codes: 1) murder, 2) rape, 3) robbery, 4) assault with weapon, 5) burglary, 6) nonvehicle theft, 7) vehicle theft, 8) general assault, 9) arson, 10) forgery, 11) fraud, 12) receiving stolen property, 13) vandalism, 14) weapons related but no physical harm, 15) sexual crime, 16) children related, 17) general harassment. As an example, the locations (latitude and longitude coordinates) of the 3090 crime events in January 2008 are shown in Figure 1(a). Based on the locations of all 117,314 crime events, the observation window is taken as the rectangular region of [39.06°, 39.24°] latitude and [−84.70°, −84.35°] longitude.

We construct the data $\mathcal{D} = \{s_i, v_{it}\}_{i=1,\dots,M,\ t=1,\dots,T}$ as follows. All crime events within one month are aggregated into one time instance, and therefore there are in total 36 time points. At each time, the observation window is divided into 15,750 small square cells (90 rows by 175 columns) of size 0.002° × 0.002°, and the event location $s_{it}$ is defined as the center of each small square area, with this denoted $\Delta(s_i)$. The count $v_{ijt}$ is then the number of Type $j$ crimes within $\Delta(s_i)$ over the corresponding month indexed by $t$. This produces a 17-dimensional count vector $v_{it}$ at $s_i$ for $i = 1, \dots, 15750$ and $t = 1, \dots, 36$. Related research in Taddy (2008, 2010) applied marked Poisson processes to address the crime types, regarding each crime type at $s_{it}$ as a random mark. Here we attempt to segment the city by considering all the crime types within a local region $\Delta(s_{it})$ as a correlated variable (a vector), instead of treating each event as a random type.
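
A hedged sketch of this binning step is given below; the column names ('lat', 'lon', 'date', 'crime_type', with crime_type coded 0-16) are our own illustration, and the actual file's schema may differ:

    import numpy as np
    import pandas as pd

    def build_counts(events: pd.DataFrame, lat0=39.06, lon0=-84.70, cell=0.002,
                     n_rows=90, n_cols=175, n_months=36, n_types=17):
        # map each event to a 0.002-degree grid cell and a calendar month
        row = ((events['lat'] - lat0) // cell).astype(int)
        col = ((events['lon'] - lon0) // cell).astype(int)
        month = (events['date'].dt.year - 2006) * 12 + events['date'].dt.month - 1
        keep = ((row >= 0) & (row < n_rows) & (col >= 0) & (col < n_cols)
                & (month >= 0) & (month < n_months))
        v = np.zeros((n_rows * n_cols, n_months, n_types), dtype=int)
        np.add.at(v, ((row[keep] * n_cols + col[keep]).to_numpy(),
                      month[keep].to_numpy(),
                      events.loc[keep, 'crime_type'].to_numpy()), 1)
        return v  # v[i, t, j]: number of type-j events in cell i during month t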

The proposed GP-LSBP, AR-LSBP and ind-LSBP models are inferred via VB and MCMC, with truncation level $K = 20$. The kernel centers are uniformly spaced every 0.04° (latitude and longitude) in the observation window, with a total of 60 kernel centers defined. The library of kernel parameters $\{\psi^*_l\}_{l=1,\dots,L}$ is the union of the following sets: $\{0.006°, 0.012°, 0.018°, \dots, 0.06°\}$ and $\{0.06°, 0.12°, 0.18°, \dots, 0.6°\}$. On the same PC, the VB GP-LSBP and AR-LSBP results required approximately 2.8 hours of CPU time, while the VB ind-LSBP results required approximately 1.3 hours. By contrast, due to the large size of the data, 3000 MCMC samples are employed, with 1000 discarded as burn-in. On the same PC, the MCMC GP-LSBP and AR-LSBP results required approximately 47.5 hours. We also considered 10,000 MCMC samples, with 1000 discarded as burn-in (at very significant computational cost), with little change in the results relative to those presented below.

Figure 4(a) shows the VB-based segmentation of the entire spatial observation window at the 36 time instances, using GP-LSBP (similar results were found using AR-LSBP, omitted for brevity). The city is segmented into 4 regions (inferred by the model), and the segmentation changes smoothly with time. For comparison, Figure 4(b) shows the segmentation results obtained by applying an independent LSBP (VB computations) at each time instance. It is observed that with GP priors the proposed model yields a spatial segmentation that is more consistent over time and more spatially contiguous than that of ind-LSBP.

Figure 4.

Comparison of spatial segmentation for crime data in Cincinnati, Ohio from January 2006 to December 2008 (VB results). Each color represents a segment with an associated intensity vector λkt*, and there are a total of four segments inferred: 1 - dark blue, 2 - light blue, 3 - yellow, and 4 - dark red. (a) GP-LSBP, (b) ind-LSBP.

We are also interested in examining the clustering manifested by the MCMC computations, which is complicated by label switching between samples. We compute an MCMC clustering that may be compared to the VB results as follows. We consider one spatial location from Segment 1 in Figure 4, denoted $s_1^*$. Based upon the collected MCMC samples, for each other spatial location $s \ne s_1^*$ in the scene, we compute the probability that positions $s$ and $s_1^*$ are in the same cluster. All positions $s$ with high probability of such clustering should (ideally) constitute a spatial region similar to Segment 1 inferred via VB. In Figure 5(a) we show MCMC results for Segment 1, and the high-probability regions (red) do indeed align well with the VB results in Figure 4. In Figure 5(b) we compute similar MCMC results for Segment 2, and in this case the high-probability spatial locations align well with the VB results for Segment 2 in Figure 4. We found in general good agreement between the VB and MCMC segmentation results for GP-LSBP and AR-LSBP on these data.
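
This co-clustering summary is invariant to label switching and is simple to compute; a sketch (ours, with argument names of our own choosing) is:

    import numpy as np

    def same_cluster_prob(labels, ref):
        """Given per-sample segment labels (n_samples x M locations) from MCMC,
        return, for every location, the posterior probability of sharing a
        cluster with the reference location `ref`."""
        return (labels == labels[:, [ref]]).mean(axis=0)

    # toy usage: 4 MCMC samples over 5 locations
    labels = np.array([[0, 0, 1, 1, 0],
                       [2, 2, 1, 1, 2],
                       [0, 0, 0, 1, 0],
                       [1, 1, 2, 2, 1]])
    print(same_cluster_prob(labels, ref=0))  # [1. 1. 0.25 0. 1.]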

Figure 5.

Comparison of spatial segmentation for crime data in Cincinnati, Ohio from January 2006 to December 2008 (MCMC results). (a) Segment 1, (b) Segment 2, where these segments are related to the results in Figure 4(a). The color scale is the same in (a) and (b).

Figures 6(a)–(d) show the dynamic change of the VB-inferred Poisson intensities for each segment. To make the figure easier to read, we only plot components 3, 5 and 6 of the 17-dimensional vector $\lambda^*_{kt}$; these components correspond to crime types "robbery", "burglary", and "nonvehicle theft", respectively. From these figures we observe that in all segments the crime intensities fluctuated periodically with the seasons. Generally in summer there were more crime events of all types than in winter. The overall crime intensities varied with region. Segment 4 was in the downtown region, and had many more crime events than the other regions. In all four regions Type 6 crime (nonvehicle theft) was dominant. In addition, the crime patterns differed across regions. For example, Segment 4 had relatively little Type 5 crime (burglary), while in the other 3 segments the intensity of Type 5 crime was almost half that of Type 6 crime. In Segment 4, Type 3 crime (robbery) was prevalent, while Segment 1 had relatively little Type 3 crime. For comparison, we also present the MCMC-inferred Poisson intensities of Segment 3, as a representative (typical) example. It is observed that the MCMC and VB results are in generally good agreement for the GP-LSBP and AR-LSBP models.

Figure 6.

Inferred intensity vector λkt* associated with the segments shown in Figure 4(a). Only 3 crime types are shown here to make the figure easy to read.

These results may be used by police to assign resources (personnel) to segmented regions in a consistent manner, to address varying levels of crime. The segments typically change with season, and the spatial distribution of resources may be temporally adjusted as well. By relating the demographics of regions to the spatial segments (we did not have access to such demographics), one may deduce relationships between types of crimes and the types of people living and working in given regions, of interest to criminologists and city planners.

Following the same procedure as in the simulated example, we now examine the prediction performance of our model on the crime data. We randomly select $N_{miss}$ time instances to construct a test set, and let the remaining data be the training set. Ten random trials are performed, and the comparison of average log-likelihood between GP-LSBP, AR-LSBP and ind-LSBP inferred by VB is shown in Table 3. Since in this real application there is no ground truth, we cannot evaluate the accuracy rate of segmentation as done in the simulated example. From Table 3, GP-LSBP and AR-LSBP consistently achieve higher likelihood than the independent LSBP for various $N_{miss}$ values. Note also that for these real data there is less of a difference between the AR/GP-LSBP and ind-LSBP results for the VB solution, as compared to the synthetic data considered above. We do not perform this experiment for MCMC inference, as the computational requirements needed to perform this many experiments are prohibitive with this large data set (however, in isolated tests, the results were slightly better than the VB-based GP-LSBP and AR-LSBP models, consistent with the simulated example above).

Table 3.

Comparison of average log-likelihood in the prediction for the crime data (VB inference).

Nmiss       1         2         3         4         5         6
AR-LSBP     −6.131    −6.352    −7.204    −7.631    −7.957    −8.338
GP-LSBP     −6.570    −6.762    −7.713    −7.965    −8.426    −8.721
ind-LSBP    −8.666    −9.247    −9.595    −8.840    −9.848    −8.762

4.3 Pearson residuals

Following Taddy (2010), we check model quality via computation of Pearson residuals (see Turner et al. (2005) for a detailed discussion of residuals for spatial point processes). For the modeling framework considered here, the Pearson residual reduces to

$$R(\Delta(s_{it}), \hat{\lambda}_{it}) = \frac{n_{it} - \hat{\lambda}_{it}}{\sqrt{\hat{\lambda}_{it}}} \tag{13}$$

where $n_{it}$ is the number of events in region $\Delta(s_{it})$ and $\hat{\lambda}_{it}$ is the inferred Poisson rate parameter in the small region $\Delta(s_{it})$. Ideally the residual should be close to zero, if the underlying Poisson assumption is valid. Note that within the proposed model we have a vector of counts $v_{it}$, and therefore we may compute the residual for each of the different types of crimes.
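
As a quick sketch (ours) of this check, with toy counts and rates standing in for the inferred quantities:

    import numpy as np

    # Pearson residuals as in (13), per cell/month/crime-type; values near zero
    # support the fitted Poisson model (array names are ours).
    def pearson_residuals(counts, rates):
        return (counts - rates) / np.sqrt(rates)

    counts = np.array([[3, 0], [18, 22]])           # toy n_it for two cells, two months
    rates = np.array([[2.5, 0.8], [20.0, 20.0]])    # toy inferred lambda-hat
    print(pearson_residuals(counts, rates))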

From Figure 7, which is based upon VB inference, we observe that the Pearson residuals tend to decrease substantially based upon a model that explicitly imposes temporal smoothness (note that the residuals are significantly lower for GP-LSBP and AR-LSBP, relative to ind-LSBP). Further, the AR-LSBP residuals are smaller than those of the GP-LSBP. Although we omit the MCMC results for brevity, similar phenomena were observed in that case. The residuals tend to be small, in the range [−2,2], with the larger values manifested on the edges of segments, as might be expected (segment interfaces are characterized typically by abrupt changes in statistical properties).

Figure 7.

Pearson residuals for "nonvehicle theft," using VB inference; best viewed electronically, zoomed in. (a) ind-LSBP, (b) GP-LSBP, (c) AR-LSBP.

5 Conclusions

A Bayesian hierarchical model has been presented for segmenting time-evolving point process data, when the events are in vector form. The spatially dependent point process is modeled using a generalization of a Poisson process, with piecewise-constant Poisson intensities defined within the observation window. The logistic stick-breaking process is employed to favor spatially contiguous segments, and GP and AR models are considered for imposing temporal smoothness of the segmentation and the Poisson intensities.

In addition to developing the model, a contribution of this paper concerns a detailed comparison between MCMC sampling and a VB approximation. For both the synthetic and real data, it was found that the GP-LSBP and AR-LSBP results computed via VB and MCMC were in close agreement, and the imposition of temporal smoothness via GP/AR (compared to treating the different temporal samples independently) yielded significant improvements in the VB results. While the VB results are approximate, and are subject to locally optimal solutions (although the GP/AR models seemed to mitigate this to some extent), the VB approach provides significant advantages with regard to computations. For the large crime data set considered, while the MCMC results are in principle convergent if run for enough samples, this attractiveness is tempered by the very significant computation time required to collect enough samples to assure that we are indeed sampling from the posterior. Given that computational requirements will in practice limit the ability to collect as many MCMC samples as desired (and therefore MCMC is also an approximation), the VB solution appears to be an attractive option. However, the results presented here indicate that imposition of as much prior information as possible (here smoothness via GP/AR) is desirable. In future research it is of interest to consider online VB analysis (Hoffman et al. 2010), which provides further acceleration for large datasets, and which is appropriate for time-dependent data observed in an online/sequential manner, like the time-evolving crime data considered here.

Acknowledgments

The authors wish to thank the reviewers and editors for their comments, which have substantially improved the paper. The research reported here was supported by the Army Research Office (Dr. Liyi Dai) and the Office of Naval Research (Dr. Wen Masters).

Appendix: MCMC and VB Update Equations

A.1 MCMC Inference

The MCMC computations are performed using Gibbs sampling where the conditional density functions are analytic, and samples are drawn from the conditional density functions via Metropolis-Hastings when not analytic. The update equations are summarized as follows.

  • Sample $\lambda^*_{kj:}$ from its posterior conditional on $\{Z_k(s_{it})\}$ and $\{\nu_{ijt}\}$:
    $$p(\lambda^*_{kj:} \mid -) \propto \prod_{t=1}^{T}\prod_{i=1}^{M}\mathrm{Poisson}(\nu_{ijt} \mid \lambda^*_{kjt})^{I(c_i = k)}\ \ln\!\mathcal{N}(\lambda^*_{kj:} \mid 0, \Gamma_{kj}). \tag{14}$$
    It is not possible to sample $\lambda^*_{kj:}$ from the full conditional. We update each $\lambda^*_{kj:}$ by the Metropolis-Hastings algorithm (a simplified numerical sketch of this step appears after this list). When updating $\lambda^*_{kj:}$, the proposal $\lambda^{*(\tau+1)}_{kj:}$ is generated from the distribution
    $$q(\ln\lambda^{*(\tau+1)}_{kj:} \mid \ln\lambda^{*(\tau)}_{kj:}) = \mathcal{N}(\ln\lambda^{*(\tau)}_{kj:},\ (d_0 + d_2)I_T) \tag{15}$$
    where $I_T$ is the $T \times T$ identity matrix, and $T$ denotes the number of time points. The acceptance probability for the proposed $\lambda^{*(\tau+1)}_{kj:}$ is $\min\left(1, \alpha(\lambda^{*(\tau+1)}_{kj:}, \lambda^{*(\tau)}_{kj:})\right)$, where
    $$\alpha(\lambda^{*(\tau+1)}_{kj:}, \lambda^{*(\tau)}_{kj:}) = \exp\left(-\tfrac12(\ln\lambda^{*(\tau+1)}_{kj:})^T\Gamma_{kj}^{-1}\ln\lambda^{*(\tau+1)}_{kj:} + \tfrac12(\ln\lambda^{*(\tau)}_{kj:})^T\Gamma_{kj}^{-1}\ln\lambda^{*(\tau)}_{kj:}\right) \cdot \prod_{t=1}^{T}\left[\left(\frac{\lambda^{*(\tau+1)}_{kjt}}{\lambda^{*(\tau)}_{kjt}}\right)^{\sum_{i=1}^{M} w_k(s_{it})\nu_{ijt}} \exp\left(-\sum_{i=1}^{M} w_k(s_{it})\left(\lambda^{*(\tau+1)}_{kjt} - \lambda^{*(\tau)}_{kjt}\right)\right)\right]. \tag{16}$$
  • Sample $B_k$ from its posterior conditional on $\{Z_k(s_{it})\}$:
    $$p(B_k \mid -) \propto \prod_{t=1}^{T}\prod_{i=1}^{M} p(Z_k(s_{it}) \mid B_k)\prod_{j=0}^{J}\mathcal{N}(\beta_{kj:} \mid 0, \Sigma_{kj}). \tag{17}$$
    Reordering the entries of $B_k$ (and the associated $\Omega_k$) in (8) such that $B_k = [\beta_{k:1}, \dots, \beta_{k:T}]^T$, we obtain
    $$p(B_k \mid -) \propto \exp\left\{-\sum_{t=1}^{T}\sum_{i=1}^{M} f(\eta_{kit})\,\beta_{k:t}^T\phi_{kit}\phi_{kit}^T\beta_{k:t}\right\} \times \exp\left\{-\tfrac12 B_k^T\Omega_k^{-1}B_k + \sum_{t=1}^{T}\sum_{i=1}^{M}\left(Z_k(s_{it}) - \tfrac12\right)\phi_{kit}^T\beta_{k:t}\right\}. \tag{18}$$
    Hence $B_k$ can be drawn from a normal distribution,
    $$p(B_k \mid -) = \mathcal{N}\left(B_k;\ (\Omega_k^{-1} + U_k)^{-1}Y_k,\ (\Omega_k^{-1} + U_k)^{-1}\right), \tag{19}$$
    where $U_k$ is a $(J+1)T \times (J+1)T$ block-diagonal matrix with the $t$th $(J+1) \times (J+1)$ block expressed as $u_{kt} = 2\sum_{i=1}^{M} f(\eta_{kit})\,\phi_{kit}\phi_{kit}^T$, and $Y_k$ is a $(J+1)T \times 1$ vector formed by concatenating the $T$ vectors $y_{kt} = \sum_{i=1}^{M}\left(Z_k(s_{it}) - \tfrac12\right)\phi_{kit}$, $t = 1, \dots, T$. In these expressions $\phi_{kit} = [1, \mathcal{K}(s_{it}, \hat{s}_1; \psi_k), \dots, \mathcal{K}(s_{it}, \hat{s}_J; \psi_k)]^T$, $f(\eta) = \tanh(\eta/2)/(4\eta)$, and $\eta_{kit} = \phi_{kit}^T\beta_{k:t}$.
  • Sample $Z_k(s_{it})$ from its posterior conditional on $B_k$ and $\{\nu_{ijt}\}$. According to the definition of the LSBP,
    $$p(Z_k(s_{it}) = 1 \mid -) = \begin{cases} \dfrac{\sigma(g_k(s_{it}))\,p(\nu_{it}\mid\lambda^*_{kt})}{\sigma(g_k(s_{it}))\,p(\nu_{it}\mid\lambda^*_{kt}) + \left(1 - \sigma(g_k(s_{it}))\right)p(\nu_{it}\mid\lambda^*_{k't})}, & \text{if } Z_l(s_{it}) = 0 \text{ for all } l < k \\ \sigma(g_k(s_{it})), & \text{if } \exists\, l < k \text{ such that } Z_l(s_{it}) = 1 \end{cases} \tag{20}$$
    where $k'$ is the first integer larger than $k$ associated with a non-zero indicator. The equation can be expressed as
    $$p(Z_k(s_{it}) = 1 \mid -) = \frac{1}{1 + \exp(-\rho_{kit})}, \tag{21}$$
    with
    $$\rho_{kit} = \prod_{l<k}\left(1 - Z_l(s_{it})\right)\log p(\nu_{it}\mid\lambda^*_{kt}) - \sum_{k'>k} Z_{k'}(s_{it})\prod_{l<k',\, l \ne k}\left(1 - Z_l(s_{it})\right)\log p(\nu_{it}\mid\lambda^*_{k't}) + \phi_{kit}^T\beta_{k:t}. \tag{22}$$
  • With a uniform prior assumed on the kernel parameter library (a predefined finite set), the posterior distribution for each $\psi_k$ can be represented as
    $$p(\psi_k = \psi_l^*) \propto \prod_{t=1}^{T}\prod_{i=1}^{M}\sigma\left(g_{kl}(s_{it})\right)^{w_k(s_{it})}\prod_{t=1}^{T}\prod_{i=1}^{M}\prod_{k'>k}\left(1 - \sigma(g_{kl}(s_{it}))\right)^{w_{k'}(s_{it})}. \tag{23}$$
    For each specific $k$ from $k = 1, \dots, K$, we have the update equation
    $$\psi_k = \psi^*_{r_k}, \qquad r_k \sim \mathrm{Mult}(p_{k1}, \dots, p_{kL}), \qquad p_{kl} = \frac{p(\psi_k = \psi_l^*)}{\sum_{l'=1}^{L} p(\psi_k = \psi_{l'}^*)}. \tag{24}$$
    We sample the kernel parameters based on these multinomial distributions over the given discrete set in each MCMC iteration.
  • Sample $c_0$ from its posterior conditional on $\{B_k\}$ and $\{a_0, b_0\}$:
    $$p(c_0 \mid -) \propto \mathrm{Gamma}(c_0; a_0, b_0)\prod_{k=1}^{K}\mathcal{N}(B_k; 0, \Omega_k). \tag{25}$$
    Therefore, $c_0$ can be drawn from a gamma distribution,
    $$p(c_0 \mid -) = \mathrm{Gamma}(c_0; \tilde{a}_0, \tilde{b}_0), \tag{26}$$
    where $\tilde{a}_0 = a_0 + 0.5KT(J+1)$ and $\tilde{b}_0 = b_0 + 0.5\sum_{k=1}^{K}\sum_{j=0}^{J}\beta_{kj:}^T\tilde{\Sigma}_{kj}^{-1}\beta_{kj:}$, with $[\tilde{\Sigma}_{kj}]_{il} = c_1^{|t_i - t_l|}$.
  • Sample $c_1$ from its posterior conditional on $\{B_k\}$:
    $$p(c_1 \mid -) \propto \mathcal{N}_{(0,1)}(c_1; 0, 1)\prod_{k=1}^{K}\mathcal{N}(B_k; 0, \Omega_k). \tag{27}$$
    When updating $c_1$, the proposal $c_1^{(\tau+1)}$ is generated from the distribution
    $$q(c_1^{(\tau+1)} \mid c_1^{(\tau)}) = \mathcal{N}_{(0,1)}(c_1^{(\tau+1)}; c_1^{(\tau)}, 1). \tag{28}$$
    The acceptance probability for the proposed $c_1^{(\tau+1)}$ is $\min\left(1, \alpha(c_1^{(\tau+1)}, c_1^{(\tau)})\right)$, where
    $$\alpha(c_1^{(\tau+1)}, c_1^{(\tau)}) = \frac{|\Sigma_{kj}^{-1}(c_1^{(\tau)})|^{K(J+1)/2}}{|\Sigma_{kj}^{-1}(c_1^{(\tau+1)})|^{K(J+1)/2}}\exp\left\{-\tfrac12\left(c_1^{(\tau+1)\,2} - c_1^{(\tau)\,2}\right)\right\} \times \exp\left\{\tfrac12\left(\sum_{k=1}^{K}\sum_{j=0}^{J}\beta_{kj:}^T\Sigma_{kj}^{-1}(c_1^{(\tau)})\beta_{kj:} - \sum_{k=1}^{K}\sum_{j=0}^{J}\beta_{kj:}^T\Sigma_{kj}^{-1}(c_1^{(\tau+1)})\beta_{kj:}\right)\right\}. \tag{29}$$
  • Similarly, $d_0$ can be drawn from a gamma distribution,
    $$p(d_0 \mid -) = \mathrm{Gamma}(d_0; \tilde{a}_0, \tilde{b}_0), \tag{30}$$
    where $\tilde{a}_0 = a_0 + 0.5dKT$ and $\tilde{b}_0 = b_0 + 0.5\sum_{k=1}^{K}\sum_{j=1}^{d}(\ln\lambda^*_{kj:})^T\tilde{\Gamma}_{kj}^{-1}\ln\lambda^*_{kj:}$, with $[\tilde{\Gamma}_{kj}]_{il} = d_1^{|t_i - t_l|}$.
  • Similar to $c_1$, we update $d_1$ by the Metropolis-Hastings algorithm. The proposal $d_1^{(\tau+1)}$ is generated from the distribution
    $$q(d_1^{(\tau+1)} \mid d_1^{(\tau)}) = \mathcal{N}_{(0,1)}(d_1^{(\tau+1)}; d_1^{(\tau)}, 1). \tag{31}$$
    The acceptance probability for the proposed $d_1^{(\tau+1)}$ is $\min\left(1, \alpha(d_1^{(\tau+1)}, d_1^{(\tau)})\right)$, where
    $$\alpha(d_1^{(\tau+1)}, d_1^{(\tau)}) = \frac{|\Gamma_{kj}^{-1}(d_1^{(\tau)})|^{dK/2}}{|\Gamma_{kj}^{-1}(d_1^{(\tau+1)})|^{dK/2}}\exp\left\{-\tfrac12\left(d_1^{(\tau+1)\,2} - d_1^{(\tau)\,2}\right)\right\} \times \exp\left\{\tfrac12\left(\sum_{k=1}^{K}\sum_{j=1}^{d}(\ln\lambda^*_{kj:})^T\Gamma_{kj}^{-1}(d_1^{(\tau)})\ln\lambda^*_{kj:} - \sum_{k=1}^{K}\sum_{j=1}^{d}(\ln\lambda^*_{kj:})^T\Gamma_{kj}^{-1}(d_1^{(\tau+1)})\ln\lambda^*_{kj:}\right)\right\}. \tag{32}$$

A.2 VB Inference

The log-normal priors placed on the Poisson intensities introduce non-conjugacy, which creates difficulty for VB inference. Therefore, we employ a point estimate for the Poisson intensities, obtained by maximizing the lower bound $\mathcal{F}$. For the GP hyperparameters $c_1$ and $d_1$, the truncated normal prior also introduces non-conjugacy, and these parameters are likewise handled by point estimation, maximizing the VB lower bound. The update equations for the posterior inference of $\Theta$ are summarized below. In our model,

$$\Theta = \left\{\{\lambda^*_{kj:}\}_{j=1,\dots,d;\, k=1,\dots,K},\ \{B_k\}_{k=1,\dots,K},\ \{Z_k(s_{it})\}_{t=1,\dots,T;\, i=1,\dots,M;\, k=1,\dots,K},\ c_0, c_1, d_0, d_1\right\}.$$
  • The lower bound for the Poisson intensity $\lambda^*_{kj:}$ may be derived as
    $$\mathcal{F}(\lambda^*_{kj:}) = -\tfrac12\Lambda_{kj}^T\Gamma_{kj}^{-1}\Lambda_{kj} - Q_{kj}^T e^{\Lambda_{kj}} + R_{kj}^T\Lambda_{kj} + \mathrm{constant} \tag{33}$$
    where $\Lambda_{kj} = \log(\lambda^*_{kj:})$, $R_{kj} = \left[\sum_{i=1}^{M}\langle w_k(s_{i1})\rangle\nu_{ij1}, \dots, \sum_{i=1}^{M}\langle w_k(s_{iT})\rangle\nu_{ijT}\right]^T$, and $Q_{kj} = \left[\sum_{i=1}^{M}\langle w_k(s_{i1})\rangle, \dots, \sum_{i=1}^{M}\langle w_k(s_{iT})\rangle\right]^T$, with $\langle\cdot\rangle$ denoting the expectation such that $\langle w_k(s_{it})\rangle = q(w_k(s_{it}) = 1)$ (see Section 2 for details of $w_k(s_{it})$). The point estimate for $\lambda^*_{kj:}$ can be updated at each VB iteration by maximizing the lower bound $\mathcal{F}(\lambda^*_{kj:})$. One may easily verify that $\mathcal{F}(\lambda^*_{kj:})$ is a concave function of $\Lambda_{kj}$, and therefore a global maximum can be obtained by any appropriate convex optimization method (a sketch appears after this list). Note that if $\Gamma_{kj}^{-1} \to 0$ (setting large variance for the prior distribution), by taking the derivative of (33) and setting it to zero, we have $\lambda^*_{kj:} = e^{\Lambda_{kj}} = R_{kj}/Q_{kj}$ (elementwise division), which is consistent with the update equation obtained if independent gamma priors are placed on $\lambda^*_{kjt}$ for $t = 1, \dots, T$. Therefore, the GP prior represented by $\Gamma_{kj}$ introduces correlation among the components of $\lambda^*_{kj:}$.
  • To update the variational distribution for the kernel weights $\beta_{kjt}$, note that the logistic link function $\sigma(\cdot)$ is not within the exponential family and therefore introduces non-conjugacy. We here follow Jaakkola and Jordan (1998) by introducing a variational bound using the inequality
    $$\sigma(y)^z\left[1 - \sigma(y)\right]^{1-z} = \sigma(x) \ge \sigma(\eta)\exp\left(\frac{x - \eta}{2} - f(\eta)(x^2 - \eta^2)\right)$$
    where $x = (2z-1)y$, $f(\eta) = \frac{\tanh(\eta/2)}{4\eta}$, and $\eta$ is a variational parameter. An exact bound is achieved at $\eta = \pm x$. If we reorder the entries of $B_k$ (and the associated $\Omega_k$) in (8) such that $B_k = [\beta_{k:1}, \dots, \beta_{k:T}]^T$, the update equation for $B_k$ can be expressed as
    $$q(B_k) = \mathcal{N}\left((\Omega_k^{-1} + U_k)^{-1}Y_k,\ (\Omega_k^{-1} + U_k)^{-1}\right) \tag{34}$$
    where $U_k$ is a $(J+1)T \times (J+1)T$ block-diagonal matrix with the $t$th $(J+1) \times (J+1)$ block expressed as
    $$u_{kt} = 2\sum_{i=1}^{M} f(\eta_{kit})\,\phi_{kit}\phi_{kit}^T$$
    and $Y_k$ is a $(J+1)T \times 1$ vector formed by concatenating the $T$ vectors
    $$y_{kt} = \sum_{i=1}^{M}\left(\langle Z_k(s_{it})\rangle - \tfrac12\right)\phi_{kit}, \qquad t = 1, \dots, T.$$
    In the above expressions $\phi_{kit} = [1, \mathcal{K}(s_{it}, \hat{s}_1; \psi_k), \dots, \mathcal{K}(s_{it}, \hat{s}_J; \psi_k)]^T$. The variational parameters $\eta_{kit}$ are then updated as
    $$\eta_{kit}^2 = \phi_{kit}^T\langle\beta_{k:t}\beta_{k:t}^T\rangle\phi_{kit} \tag{35}$$
    where $\langle\beta_{k:t}\beta_{k:t}^T\rangle = \mathrm{Cov}(\beta_{k:t}, \beta_{k:t}) + \langle\beta_{k:t}\rangle\langle\beta_{k:t}\rangle^T$ may be evaluated from $q(B_k)$ using the mean and covariance associated with time $t$.
  • The variational distribution for the binary indicator $Z_k(s_{it})$ may be updated as
    $$q(Z_k(s_{it}) = 1) = \frac{1}{1 + \exp(-\langle\rho_{kit}\rangle)} \tag{36}$$
    with
    $$\langle\rho_{kit}\rangle = \prod_{l<k}\left(1 - \langle Z_l(s_{it})\rangle\right)\langle\log p(\nu_{it}\mid\lambda^*_{kt})\rangle - \sum_{k'>k}\langle Z_{k'}(s_{it})\rangle\prod_{l<k',\, l \ne k}\left(1 - \langle Z_l(s_{it})\rangle\right)\langle\log p(\nu_{it}\mid\lambda^*_{k't})\rangle + \sum_{j=1}^{J}\langle\beta_{kjt}\rangle\mathcal{K}(s_{it}, \hat{s}_j; \psi_k) + \langle\beta_{k0t}\rangle$$
    where $\log p(\nu_{it}\mid\lambda^*_{kt})$ is the data log-likelihood from the Poisson distribution, $\log p(\nu_{it}\mid\lambda^*_{kt}) = \log\left(\prod_{j=1}^{d}\mathrm{Poisson}(\nu_{ijt}\mid\lambda^*_{kjt})\right)$, and the expectation $\langle\beta_{kjt}\rangle$ can be obtained from $q(B_k)$.
  • Due to the non-conjugacy of the sigmoid function, we cannot acquire a closed-form variational distribution for $\psi_k$. However, we can sample it from its posterior distribution by establishing a discrete set of potential kernel widths $\{\psi^*_l\}_{l=1,\dots,L}$. The posterior distribution for each $\psi_k$ is represented as
    $$p(\psi_k = \psi_l^*) \propto \exp\left\{\sum_{t=1}^{T}\sum_{i=1}^{M}\langle w_k(s_{it})\rangle\langle\log\sigma(g_{kl}(s_{it}))\rangle\right\} \times \exp\left\{\sum_{t=1}^{T}\sum_{i=1}^{M}\sum_{k'>k}\langle w_{k'}(s_{it})\rangle\langle\log\left(1 - \sigma(g_{kl}(s_{it}))\right)\rangle\right\}, \tag{37}$$
    where $g_{kl}(s_{it}) = \sum_{j=1}^{J}\beta_{kjt}\mathcal{K}(s_{it}, \hat{s}_j; \psi^*_l) + \beta_{k0t}$. The detailed calculations of $\langle\log\sigma(g_{kl}(s_{it}))\rangle$ and $\langle\log(1 - \sigma(g_{kl}(s_{it})))\rangle$ can be found in Ren et al. (2011).
  • The variational distribution for $c_0$ may be updated as
    $$q(c_0) = \mathrm{Gamma}(c_0; \tilde{a}_0, \tilde{b}_0), \tag{38}$$
    with $\tilde{a}_0 = a_0 + 0.5KT(J+1)$ and $\tilde{b}_0 = b_0 + 0.5\sum_{k=1}^{K}\sum_{j=0}^{J}\sum_{i=1}^{T}\sum_{l=1}^{T}[\tilde{\Sigma}_{kj}^{-1}]_{il}\langle\beta_{kji}\beta_{kjl}\rangle$, where $[\tilde{\Sigma}_{kj}]_{il} = c_1^{|t_i - t_l|}$.
  • The VB lower bound for $c_1$ may be derived as
    $$\mathcal{F}(c_1) = \log\mathcal{N}_{(0,1)}(c_1; 0, 1) + \sum_{k=1}^{K}\langle\log\mathcal{N}(B_k; 0, \Omega_k)\rangle + \mathrm{constant}. \tag{39}$$
    The point estimate for $c_1$ can be updated at each VB iteration by maximizing the lower bound $\mathcal{F}(c_1)$.
  • Since a point estimate of $\lambda^*_{kj:}$ is employed at each VB iteration, the variational distribution for $d_0$ takes the same form as (30),
    $$q(d_0) = \mathrm{Gamma}(d_0; \tilde{a}_0, \tilde{b}_0), \tag{40}$$
    where $\tilde{a}_0 = a_0 + 0.5dKT$ and $\tilde{b}_0 = b_0 + 0.5\sum_{k=1}^{K}\sum_{j=1}^{d}(\ln\lambda^*_{kj:})^T\tilde{\Gamma}_{kj}^{-1}\ln\lambda^*_{kj:}$.
  • Similarly, the lower bound for $d_1$ is
    $$\mathcal{F}(d_1) = \log\mathcal{N}_{(0,1)}(d_1; 0, 1) + \sum_{k=1}^{K}\sum_{j=1}^{d}\log\mathcal{N}(\Lambda_{kj}; 0, \Gamma_{kj}) + \mathrm{constant} \tag{41}$$
    and the point estimate for $d_1$ is obtained by maximizing $\mathcal{F}(d_1)$.

By following (33)-(41), the model parameters and GP hyperparameters can be updated iteratively until convergence. In our experiments, we observed fast convergence; typically the relative change of the lower bound falls below $10^{-4}$ within 100 iterations.

References

  1. Achcar JA, Rodrigues ER, Tzintzun G. Using non-homogeneous Poisson models with multiple change-points to estimate the number of ozone exceedances in Mexico City. Environmetrics. 2011;22:1–12.
  2. Adams RP, Murray I, MacKay D. Tractable nonparametric Bayesian inference in Poisson processes with Gaussian process intensities. International Conference on Machine Learning; 2009. pp. 9–16.
  3. Beal MJ. Variational algorithms for approximate Bayesian inference. Ph.D. thesis, Gatsby Computational Neuroscience Unit, University College London; 2003.
  4. Chakraborty A, Gelfand AE. Analyzing spatial point patterns subject to measurement error. Bayesian Analysis. 2010;5:97–122.
  5. Diggle PJ. Statistical Analysis of Spatial Point Patterns. 2nd edition. Arnold; 2003.
  6. Diggle PJ, Menezes R, Su T. Geostatistical inference under preferential sampling (with discussion). Journal of the Royal Statistical Society - Series C. 2010;59:191–232.
  7. Heikkinen J, Arjas E. Non-parametric Bayesian estimation of a spatial Poisson intensity. Scandinavian Journal of Statistics. 1998;25:435–450.
  8. Hoffman M, Blei D, Bach F. Online learning for latent Dirichlet allocation. Advances in Neural Information Processing Systems; Vancouver, Canada; 2010. pp. 993–1022.
  9. Hossain MM, Lawson AB. Approximate methods in Bayesian point process spatial models. Computational Statistics and Data Analysis. 2009;53:2831–2842. doi:10.1016/j.csda.2008.05.017.
  10. Jaakkola T, Jordan MI. Bayesian parameter estimation through variational methods. Statistics and Computing. 1998;10:25–37.
  11. Ji C, Merl D, Kepler TB. Spatial mixture modeling for unobserved point processes: Examples in immunofluorescence histology. Bayesian Analysis. 2009;4:297–315. doi:10.1214/09-BA411.
  12. Kottas A, Sansó B. Bayesian mixture modeling for spatial Poisson process intensities, with applications to extreme value analysis. Journal of Statistical Planning and Inference. 2007;137:3151–3163.
  13. Luttinen J, Ilin A. Variational Gaussian-process factor analysis for modeling spatio-temporal data. Advances in Neural Information Processing Systems; Vancouver, Canada; 2009. pp. 1177–1185.
  14. Møller J, Syversveen AR, Waagepetersen RP. Log Gaussian Cox processes. Scandinavian Journal of Statistics. 1998;25:451–482.
  15. Møller J, Waagepetersen RP. Statistical Inference and Simulation for Spatial Point Processes. Chapman & Hall/CRC; 2004.
  16. Pati D, Reich BJ, Dunson DB. Bayesian geostatistical modeling with informative sampling locations. Biometrika. 2010;98:35–48. doi:10.1093/biomet/asq067.
  17. Rasmussen CE, Williams CKI. Gaussian Processes for Machine Learning. MIT Press; 2006.
  18. Rathbun SL, Cressie N. Asymptotic properties of estimators for the parameters of spatial inhomogeneous Poisson point processes. Advances in Applied Probability. 1994;26:122–154.
  19. Ren L, Du L, Carin L, Dunson DB. Logistic stick-breaking process. Journal of Machine Learning Research. 2011;12:203–239.
  20. Sethuraman J. A constructive definition of Dirichlet priors. Statistica Sinica. 1994;4:639–650.
  21. Taddy M. Autoregressive mixture models for dynamic spatial Poisson processes: Application to tracking intensity of violent crime. Journal of the American Statistical Association. 2010;105:1403–1417.
  22. Taddy M, Kottas A. Mixture modeling for marked Poisson processes. Bayesian Analysis. 2012;7:335–362.
  23. Taddy MA. Bayesian nonparametric analysis of conditional distributions and inference for Poisson point processes. Ph.D. thesis, Statistics and Stochastic Modeling, University of California, Santa Cruz; 2008.
  24. Tipping ME. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research. 2001;1:211–244.
  25. Turner R, Møller J, Hazelton M. Residual analysis for spatial point processes (with discussion). Journal of the Royal Statistical Society - Series B. 2005;67:617–666.
  26. Wolpert R, Ickstadt K. Poisson/Gamma random field models for spatial statistics. Biometrika. 1998;85:251–267.
