Abstract
We introduce a model for a time series of continuous outcomes that can be expressed as a fully nonparametric regression or density regression on lagged terms. The model is based on a dependent Dirichlet process prior on a family of random probability measures indexed by the lagged covariates. The approach is also extended to sequences of binary responses. We discuss implementation and applications of the models to a sequence of waiting times between eruptions of the Old Faithful Geyser, and to a dataset consisting of sequences of recurrence indicators for tumors in the bladder of several patients.
Keywords: binary data, dependent Dirichlet process, hierarchical Bayesian model, latent variables, longitudinal data
1 Introduction
Consider a sequence of continuous random variables {Yt : t ≥ 1}. A very popular class of models for such time series data is autoregressive models that relate Yt with a number of lagged terms Yt−1, Yt−2, …, Yt−p. In the simplest scenario, it is assumed that p = 1, and that conditional on Yt−1, Yt = β + αYt−1 + εt, for t ≥ 2, where {εt} is a conveniently chosen sequence of residuals.
The assumptions made on {εt} are crucial for the specification and statistical analysis of AR(1) models. Consider, for instance, a Gaussian white noise process, i.e. εt i.i.d. ~ N(0, σ2). It then follows that, conditionally on σ2, all random variables Yt are normally distributed. While convenient, such assumptions may be too restrictive in many practical cases.
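Under this Gaussian white-noise assumption, the AR(1) model can be simulated directly; the following is a minimal sketch, with illustrative parameter values:

```python
import numpy as np

def simulate_ar1(beta, alpha, sigma, T, y1=0.0, seed=0):
    """Simulate Y_t = beta + alpha * Y_{t-1} + eps_t, with eps_t ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    y = np.empty(T)
    y[0] = y1
    for t in range(1, T):
        y[t] = beta + alpha * y[t - 1] + rng.normal(0.0, sigma)
    return y

# With |alpha| < 1 the chain is stationary, with stationary mean beta / (1 - alpha)
y = simulate_ar1(beta=1.0, alpha=0.5, sigma=0.1, T=5000)
```

Under these values the long-run mean is 1.0 / (1 − 0.5) = 2, which the sample average approaches.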
We present here a general framework for nonparametric autoregressive modeling that can be easily modified to accommodate the special cases of binary and ordinal outcomes. The main idea is to provide an extension of the usual normal dynamic models. We focus on a joint model for {Yt}, which can be equivalently done by considering the sequence of conditional distributions Yt | Yt−1, …, Y1. To fix ideas, consider again the order-one dependence case, assume that the conditional distribution Yt | Yt−1, …, Y1 depends only on Yt−1 for t ≥ 2, and denote Yt | Yt−1 = y ~ Fy for any t ≥ 2. We also assume homogeneity, in the sense that the distribution of Yt | Yt−1 = y does not change with t. We define a prior probability model for ℱ = {Fy : y ∈ 𝒴}.
We construct the desired family of random probability measures (RPMs) using the Dirichlet Process (DP) introduced by Ferguson (1973). It is well known that the DP is almost surely discrete, and that if G ~ DP(M, G0), a DP with total mass parameter M > 0 and baseline distribution G0, then G can be represented as (Sethuraman 1994)
G(·) = ∑h≥1 wh δθh(·)   (1)
where δθ(·) is a point mass at θ, the weights follow a stick-breaking process, wh = Vh ∏i<h (1 − Vi), with Vi i.i.d. ~ Beta(1, M), and the atoms {θh}h≥1 are i.i.d. draws from G0, independently of the {Vi}.
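A draw from (1) can be approximated by truncating the stick-breaking construction at a large level H; the sketch below uses a standard normal base measure purely as an example:

```python
import numpy as np

def dp_stick_breaking(M, base_sampler, H, rng):
    """Truncated stick-breaking draw from DP(M, G0):
    w_h = V_h * prod_{i<h} (1 - V_i), with V_h ~ Beta(1, M) and theta_h ~ G0."""
    V = rng.beta(1.0, M, size=H)
    V[-1] = 1.0  # force the truncated weights to sum to one
    w = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    theta = base_sampler(H)
    return w, theta

rng = np.random.default_rng(1)
w, theta = dp_stick_breaking(M=1.0, base_sampler=lambda n: rng.normal(0.0, 1.0, n),
                             H=50, rng=rng)
```

Setting the last stick-breaking ratio to one is the standard device (used again in Section 2.1) guaranteeing that the truncated weights sum to one.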
MacEachern (1999, 2000) extended (1) by introducing the dependent Dirichlet process (DDP) as a collection of RPMs of the form Gy = ∑h≥1 wh (y) δθh(y), y ∈ 𝒴, such that each RPM Gy is marginally distributed according to a certain DP, and with the property that Gy varies smoothly with y. In particular, this implies that Gy and Gy′ should be correlated for y ≠ y′ and that Gy → Gy′ in some sense as y → y′. A primary application of dependent models is to the case where y represents some (possibly vector-valued) covariate. De Iorio et al. (2004) explored an ANOVA formulation for categorical covariates, with weights that do not vary with y. Many other variations of DDPs have been proposed for various settings. De Iorio et al. (2009) applied the DDP to survival analysis, and Caron et al. (2008) and Rodríguez and ter Horst (2008) considered a time-dependent version of DDPs. Griffin and Steel (2006) proposed the order-based DDP, where weights are sorted according to the values of covariates. Other approaches that explicitly introduce covariate dependence in the weights include the kernel-stick breaking of Dunson and Park (2008), and the probit-stick breaking of Chung and Dunson (2011). See additional references in Hjort et al. (2010). An early development of dependent Dirichlet models appears in Cifarelli and Regazzini (1978), where the dependence on the covariates is introduced as a regression in the base measure of marginally Dirichlet process distributed random probability measures. Cruz-Marcelo et al. (2010) review and compare some covariate-dependent models. For an approach via parametric mixtures of autoregressive models with a common but unknown lag see Wood et al. (2011).
In practice, the discreteness associated with DPs and the DDP extension is inappropriate for the modeling of continuous data. A common way of addressing this limitation is by introducing an additional convolution with a continuous kernel, so that the resulting model can be expressed as a countable mixture of absolutely continuous distributions.
Our model uses the DDP. We propose modeling a sequence of continuous outcomes by means of a DDP with an additional normal kernel as a prior for the regression on lagged terms in an autoregression. In the general case, denoting y = (yt−1, …, yt−p), with possible values on 𝒴, we assume that Yt | (Yt−1, …, Yt−p) = y ~ Fy where Fy is a location mixture of normals, with a DDP prior on the mixing measures. We further assume that the weights {wh(y)} and atoms {θh(y)} are defined in terms of two independent sequences of stochastic processes defined on 𝒴, as discussed in MacEachern (1999, 2000) and in Barrientos et al. (2012).
Of course, there are similarities between the proposed approach and others in the growing literature on Bayesian nonparametric dependent models. For instance, the common framework between our models and those in Caron et al. (2008), Rodríguez et al. (2010), and Rodríguez and Dunson (2011) consists of the adoption of mixture models with (dependent) stick-breaking random probabilities as mixing measures, where the dependence is on spatial or temporal covariates, and the type of dependence for the observations (or latent processes, or hidden states) is Markovian.
In particular, Rodríguez and Dunson (2011) propose mixture models where the stick-breaking mixing measure has constant atoms, but covariate-varying weights. In contrast, we consider varying atoms. In Rodríguez and Dunson (2011) the weights are defined via a probit transformation of a Gaussian latent process that determines the stick-breaking ratios. Specifications of this model include autoregressive dependence in the Gaussian latent process and random effects models for different population distributions, and the latent variables defining probit weights share information across populations (but apparently there is no autoregression). In contrast, we propose a model where dependence on previous states or observations and on covariates defining population subgroups is considered in the state of the latent process at time t. Rodríguez et al. (2010) assume the stick-breaking framework for the mixing measure of the mixture model; however marginally their stick-breaking covariate-dependent processes share the same dependent distribution. The latent process there drives the selection of the atoms at each location/covariate. In contrast, all the stick-breaking processes we propose have marginal distributions depending on time and covariates. Caron et al. (2008) is one of the first papers where Dirichlet processes were used in the context of dynamic linear models; more recent works include Fox et al. (2011). In particular, they model the hidden/latent state at time t, which is a known linear combination of the hidden state at time t − 1, plus an error which is distributed as a Dirichlet process mixture of Gaussian distributions. This approach is similar to one of our models, where we adopt a Gaussian latent process as well. In both cases, conditioning on the label which identifies the component in the mixture, the latent process at time t is Gaussian distributed, but the temporal assumptions on the means differ. Specifically, in Caron et al. (2008) the mean at time t is the sum of a linear transformation of the previous latent state and a linear transformation of the mean of the Gaussian component of the error. On the other hand, in our case, the mean is a linear transformation of the mean of the Gaussian component of the error, but the transformation matrix depends on the previous latent state.
Our proposed approach also differs from hidden Markov model approaches (see Zucchini and MacDonald 2009, for instance) in that the dependence is directly either on previous outcomes or on latent variables, rather than on a sequence of likelihood parameters. The type of dependence that we consider also includes that implied by the model in Lau and So (2008), who considered the case where the atoms are defined as linear combinations of lagged terms. As we will discuss later, posterior simulation for the proposed model can be carried out using standard techniques for DP mixtures.
The rest of this paper is organized as follows. Section 2 introduces the model and discusses some of its main features. Extensions to ordinal outcomes by means of a latent autoregressive process are also considered. Section 3 illustrates the model in two examples. In Section 3.1 the Old Faithful geyser dataset (Härdle 1991) is analyzed, including more complex alternative models, for which we conclude that no substantial differences arise when compared with the simpler specification. Section 3.2 addresses the bladder cancer example of Quintana and Müller (2004) and Giardina et al. (2011). We conclude with a discussion in Section 4.
2 The Model
2.1 Setup
The class of models that we consider is based on dependent Dirichlet processes (DDP). Given the vector y ∈ 𝒴 of p lagged responses at times t − 1, …, t − p, we consider a model for the conditional distribution of Yt given y, i.e., we assume p[Yt | (Yt−1, …, Yt−p) = y] = Fy. We define a prior distribution on the collection of random probability measures ℱ = {Fy : y ∈ 𝒴}. To do so, we consider two sequences of independent stochastic processes, {Vh(y) : y ∈ 𝒴, h ≥ 1} and {θh(y) : y ∈ 𝒴, h ≥ 1} on 𝒴, such that marginally for every y ∈ 𝒴 and h ≥ 1 we have Vh(y) ~ Beta(1, M) and θh(y) ~ G0,y. We also require continuity of trajectories for all these processes, which is satisfied when they are constructed using suitable families of copulas, as described in Barrientos et al. (2012). Setting wh(y) = Vh(y) ∏j<h (1 − Vj(y)) for h ≥ 1, we define
Gy(·) = ∑h≥1 wh(y) δθh(y)(·),  y ∈ 𝒴   (2)
The above choices guarantee that Gy ~ DP(M, G0,y) for every y ∈ 𝒴. The proposed model can then be expressed in the general case as
Yt | (Yt−1, …, Yt−p) = y ~ Fy(·) = ∫ N(θ, σ2) dGy(θ)   (3)
where N(m, S2) indicates a normal distribution with mean m and variance S2.
We will discuss M and G0,y for specific applications later. Note that the resulting conditional distribution for Yt, given (Yt−1, …, Yt−p) = y, is a location mixture of normals, where the mixing measure Gy comes from the DDP prior. By the discussion around (1), this implies that the model for Yt | (Yt−1, …, Yt−p) = y can be equivalently represented as
Yt | (Yt−1, …, Yt−p) = y ~ ∑h≥1 wh(y) N(θh(y), σ2)   (4)
Assuming common weights, i.e. wh(y) ≡ wh, model (3) can be further simplified to a countable mixture of autoregressive models, where each mixture component has a mean defined by its own stochastic process θh(y), h ≥ 1. Moreover, the exact nature of the dependence on lagged terms y encoded in the random probability measure (2) is very general.
Despite the great generality of the proposed construction, it is in practice useful to resort to simple and manageable specifications. The main motivations for the simplification are easier implementation and parsimony. As we will later demonstrate, inference in the more general model is practically equivalent to inference under the proposed simplification. We just mentioned the simplification with a common-weights DDP (Barrientos et al. 2012), where it is assumed that wh(y) = wh for all h ≥ 1, i.e., we assume the stochastic processes {Vh : h ≥ 1} to have constant trajectories on 𝒴. In addition, we may assume specific forms for the stochastic processes defining the atoms. For instance, we may assume that each θh(y) is a Gaussian process with mean and covariance functions expressed in terms of y. Later in Section 3 we will consider the special case where each θh(y) corresponds to a polynomial function of the first lagged response, or to a linear combination of the p elements of y. A related mixture model approach, with weights depending on previous responses was proposed in Müller et al. (1997). Lau and So (2008) considered similar types of models, where each atom included a formulation involving infinite mixtures of order-p linear autoregressions.
Observe that, when {Gy : y ∈ 𝒴} is a common-weights DDP, i.e., wh(y) = wh, model (4) can be alternatively expressed as a DP mixture (DPM) model as follows. This is best seen in the marginal model for Yt. Marginally, for each t,
Yt | (Yt−1, …, Yt−p) = y ~ ∫ N(θ(y), σ2) dG(θ),  G ~ DP(M, G0)   (5)
where the above integration is interpreted as a marginalization over the stochastic process θ, which does not eliminate the dependence on y in (5).
As is usual in DPM models, computation is simplified by introducing latent variables and breaking the mixture (5). Since details are model-specific, we consider here as an illustration, the case where p = 1, wh(y) = wh, and θh(y) = βh + αhy, i.e., the common-weights DDP where the atoms correspond to linear trajectories of the first lagged response. We call this the AR(1)-DDP model. The model can alternatively be written as
Yt | (βt, αt), σ2 ~ N(βt + αt yt−1, σ2), independently for t ≥ 2,
(βt, αt) | G ~ G, i.i.d. for t ≥ 2,  G ~ DP(M, G0)   (6)
Representation (6) provides a hierarchical definition and also highlights the fact that the dependence is introduced at the level of responses, and not in terms of the latent parameters {(βt, αt)}. The Bayesian model specification would then be completed by assigning a prior distribution to σ2 and a distribution for Y1. Specific prior choices for relevant parameters will be later discussed in Section 3.
A simplified version of models (4) or (5) can sometimes be convenient from a computational viewpoint. This is achieved by truncating the infinite mixture implied by the DP to a finite mixture with a sufficiently large number of components, say H. This simplified model also implies a stick-breaking definition of the mixture weights {wh(y) : h = 1, …, H}, with wh(y) = Vh(y) ∏i<h (1 − Vi(y)), for h = 1, …, H, where each Vh(y) has marginally a Beta(1, M) distribution for h < H, and VH(y) = 1 for all y ∈ 𝒴, which guarantees ∑h=1H wh(y) = 1 for all y ∈ 𝒴 (Ishwaran and James 2001). In the particular case of (6) with wh(y) = wh for all h and y, and introducing latent mixture component indicators {rt}, with P(rt = h) = wh, 1 ≤ h ≤ H, the model becomes
Yt | rt = h ~ N(βh + αh yt−1, σ2),  P(rt = h) = wh,  (βh, αh) ~ G0, i.i.d. for h = 1, …, H   (7)
Finally, it is worth pointing out some properties of the proposed model. There are no constraints enforcing stationarity of the time series. In fact, the prior puts zero probability on stationarity, which would only arise as a special case of the autoregression. Only the regressions p(yt | yt−1) are assumed to be constant through time, which contrasts with the method of Mena and Walker (2005) for constructing strictly stationary AR(1)-type models via nonparametric Bayes. The model inherits the regularity properties of density estimation with a DP mixture of normals.
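To make the truncated representation (7) concrete, the sketch below forward-simulates a path from a countable mixture of AR(1) models. All hyperparameter values are illustrative, and the slopes are drawn from a bounded distribution (rather than the normal base measure used in the application) simply to keep the simulated chain stable:

```python
import numpy as np

rng = np.random.default_rng(2)
H, M, sigma = 20, 1.0, 5.0

# Truncated stick-breaking weights (common weights w_h)
V = rng.beta(1.0, M, size=H)
V[-1] = 1.0
w = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))

# Atoms (beta_h, alpha_h): intercepts and slopes of the component AR(1) models
beta = rng.normal(70.0, 20.0, size=H)
alpha = rng.uniform(-0.9, 0.9, size=H)  # bounded slopes keep the example stable

def simulate_ar1_ddp(T, y1=70.0):
    """Draw r_t with P(r_t = h) = w_h, then Y_t | r_t = h ~ N(beta_h + alpha_h * y_{t-1}, sigma^2)."""
    y = np.empty(T)
    y[0] = y1
    for t in range(1, T):
        h = rng.choice(H, p=w)
        y[t] = rng.normal(beta[h] + alpha[h] * y[t - 1], sigma)
    return y

y = simulate_ar1_ddp(500)
```

Each step selects a mixture component and then evolves the chain with that component's linear autoregression, which is exactly the generative reading of the truncated model.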
2.2 Binary Outcomes
The previous construction can easily be extended to the binary case, using ideas from Albert and Chib (1993), and model (5), or (4), on a latent scale. Assume Yt is binary for all t, and introduce latent scores Zt so that Yt is defined by means of
Yt = 1 if and only if Zt > 0   (8)
and, as a consequence, Yt = 0 if and only if Zt ≤ 0. The extension can now be stated as
Zt | (Yt−1, …, Yt−p) = y ~ Fy = ∑h≥1 wh(y) N(θh(y), σ2)   (9)
In other words, the proposed continuous nonparametric autoregressive model is used to define the distribution of the latent score Zt. Of course, given {Zt}, the observations {Yt} are deterministic, which means that the desired distribution for the observed binary sequence is completely specified. Moreover, in terms of the observables, this model has the following probit-type structure:
Yt | (Yt−1, …, Yt−p) = y ~ Be(P(Zt > 0)) = Be(∑h≥1 wh(y) Φ(θh(y)/σ))   (10)

Here Be(p) indicates a Bernoulli distributed (binary) random variable with success probability p, and Φ denotes the standard normal c.d.f. Note that a truncated version of (9) can also be considered, exactly as in the discussion leading to (7).
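The success probability in the probit-type representation (10) is simply the probability that the latent location mixture of normals exceeds zero; a minimal sketch with toy weights and atoms:

```python
import math

def success_prob(w, theta, sigma):
    """P(Y_t = 1 | lagged y) = P(Z_t > 0) = sum_h w_h * Phi(theta_h(y) / sigma),
    where Phi is the standard normal c.d.f."""
    Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return sum(wh * Phi(th / sigma) for wh, th in zip(w, theta))

# Symmetric two-component toy mixture: atoms +/- 1 with equal weights
p = success_prob(w=[0.5, 0.5], theta=[1.0, -1.0], sigma=1.0)
# by symmetry, p = 0.5
```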
An alternative model specification for the binary case considers the sequence of conditional distributions entirely on the latent scale. By this we mean a nonparametric autoregressive model directly in terms of the latent sequence of scores {Zt}. Paralleling (5), we consider
Zt | (Zt−1, …, Zt−p) = z ~ ∫ N(θ(z), σ2) dG(θ),  G ~ DP(M, G0)   (11)
In this case, the model representation in terms of the observables is not as simple as before; in fact, conditioning on the parameters, the joint distribution of Yt, (Yt−1, …, Yt−p), can be expressed as
P(Yt = yt, Yt−1 = yt−1, …, Yt−p = yt−p) = ∫Ayt ∫Ayt−1 ⋯ ∫Ayt−p p(zt | zt−1, …, zt−p) p(zt−1, …, zt−p) dzt−p ⋯ dzt,  where A1 = (0, ∞) and A0 = (−∞, 0]   (12)
From (10) and (12), it is clear that the two models are different. In particular, the former defines a Markovian process of order p on {Yt}, unlike the latter. A formulation in terms of latent variables has an advantage though: it can be readily extended to ordinal outcomes. Indeed, assume Yt is ordinal, with support {0, 1, …, κ − 1} for some integer κ ≥ 2. The binary case follows when κ = 2. Let −∞ = γ0 < γ1 < ⋯ < γκ−1 < γκ = ∞ be ordered cutoffs. We then assume the {Yt} to be defined through a latent sequence {Zt} by means of
Yt = j if and only if γj < Zt ≤ γj+1,  j = 0, 1, …, κ − 1   (13)
Kottas et al. (2005) argue that the cutoffs can be fixed without loss of generality. In particular, when κ = 2 we take γ1 = 0, and we have Yt = 1 if and only if Zt > 0, just as before. Conditional on the latent variables Zt, the observations are deterministic, and it is therefore natural to consider exactly the same nonparametric autoregressive model (3), or (4), on this latent scale.
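Operationally, the mapping (13) amounts to counting how many interior cutoffs lie strictly below the latent score; a short sketch:

```python
import numpy as np

def latent_to_ordinal(z, cutoffs):
    """Y_t = j iff gamma_j < Z_t <= gamma_{j+1}: count the interior cutoffs
    gamma_1 < ... < gamma_{kappa-1} lying strictly below z."""
    return np.searchsorted(cutoffs, z, side='left')

# Binary case (kappa = 2, gamma_1 = 0): Y_t = 1 iff Z_t > 0
y_bin = latent_to_ordinal(np.array([-0.3, 0.0, 0.7]), cutoffs=[0.0])
# -> [0, 0, 1]

# Ordinal case with kappa = 3 and cutoffs (-1, 1)
y_ord = latent_to_ordinal(np.array([-2.0, 0.5, 3.0]), cutoffs=[-1.0, 1.0])
# -> [0, 1, 2]
```

Note that `side='left'` makes the boundary behave exactly as in (13): a score equal to a cutoff is assigned to the lower category, so Z = 0 gives Y = 0 in the binary case.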
The model can easily be extended to multiple subjects. In particular, in the following section we will fit model (8) with (9) or (11), using a common-weights model with p = 1 and atoms defined as simple linear trajectories, to binary data representing the recurrences of a disease in patients at different times. In this case, since the AR(1)-dependence is on the latent scores, we anticipate that this model can produce a good fit when the data consist of several short sequences, a situation that prevents us from using a higher-order dependence specification. On the other hand, we expect that model (3) will fit one single longer sequence of data well, since it assumes the order p Markovian property directly on the continuous responses.
3 Applications
In this section, we illustrate the class of models with applications to two datasets, the Old Faithful geyser (Sect. 3.1) and the bladder cancer (Sect. 3.2) data. First we summarize some implementation details for the following examples. In the Old Faithful geyser example, inference was implemented in R as Markov chain Monte Carlo (MCMC) posterior simulation, using the first 100,000 iterations as burn-in and saving every 20th iteration after burn-in. On the other hand, all inference for the latter example was coded in JAGS with the same burn-in, but with a larger thinning interval (100 iterations). In all cases, a posterior Monte Carlo sample of size 5,000 was saved. Standard convergence diagnostics, such as those available in the R package CODA (Plummer et al. 2006), were applied to all parameters, indicating that convergence had been achieved.
3.1 Old Faithful Geyser
Inference under the AR(1)-DDP Model
We illustrate the proposed AR(1)-DDP model (6), or its simplified version (7), using the easily accessible Old Faithful geyser data set. For an extensive description of the data, see Härdle (1991) and Azzalini and Bowman (1990). Old Faithful is a geyser in the Yellowstone National Park in Wyoming, USA. The data consist of 299 pairs of measurements, referring to the time interval between the starts of successive eruptions, and the duration of the subsequent eruption. Here we only use the 272 data points that are readily available in the R dataset (Härdle 1991).
We focus on the waiting times {yt, t = 1, …, 272} between the eruptions (yt is the waiting time before the t-th eruption of the geyser), and fit model (7) to the dataset. Figure 1 plots yt versus yt−1. As the lagged data point yt−1 varies across the x-axis one can clearly recognize how the autoregressive model p(Yt | Yt−1 = yt−1) changes from a unimodal distribution around yt−1 = 50 to a bimodal distribution around yt−1 = 80. For later comparison, three pairs of vertical lines pick out three groups of data, with lagged waiting times yt−1 in the interval 50 ± 5, 65 ± 5 and 80 ± 5, respectively. We fitted model (7) using H = 20 and a total mass parameter of M = 1. The point masses in the DDP are assumed to be simple linear functions, θh(y) = βh + αhy. The base measure G0 is a (bivariate) Gaussian distribution with independent components, with mean (0, 0)T, and variances of the βh and αh components equal to 400 and 5, respectively. We consider fixed kernel variance σ2 = 25, and alternatively an inverse gamma prior, p(1/σ2) = Ga(2, 2), i.e., E(1/σ2) = 1, and Var(1/σ2) = 0.5.
Figure 1.
Old Faithful geyser: yt versus yt−1. The pairs of vertical lines pick out groups of data with yt−1 around 50 ± 5, 65 ± 5 and 80 ± 5, respectively. Notice the different form of the empirical distributions of yt within each of the groups.
Figure 2 shows the posterior mean of the autoregressive model Fy(·) in (5) for y corresponding to a first-lag response of ỹ1 = 50, ỹ2 = 65, and ỹ3 = 80, respectively. Let F̄y(·) = E[Fy(·) | y1, …, y272] denote the posterior expectation,
and let f̄y(·) denote the corresponding probability density function. The three panels of Figure 2 show f̄ỹj, j = 1, …, 3. For comparison the figures also show a kernel density estimator (dashed line) using a subset of the data with yt−1 within ỹj ± 5. Figure 3 shows the posterior mean of fyt−1 (·) for yt−1 = 80 under the 1/σ2 ~ Ga(2, 2) prior, M = 1 and H = 50, together with 95% point-wise posterior credible bands.
Figure 2.
Old Faithful geyser data. Posterior means f̄yt−1 (yt) for yt−1 = 50 (left panel), 65 (central) and 80 (right). The continuous (black) line shows inference under the prior 1/σ2 ~ Ga(2, 2), the (red) dash-dotted line shows inference under σ2 = 25 (practically indistinguishable from the solid line), and the dashed (blue) shows a kernel density estimate.
Figure 3.
Old Faithful geyser data. Posterior mean f̄yt−1 (·) for yt−1 = 80 (blue semidashed line), together with pointwise 95% credible bands (red dotted lines) and median (solid black line).
Finally, we carried out a sensitivity analysis to investigate variations in the prior assumptions. For example, we found that substantially increasing the value of the fixed σ2 beyond σ2 = 50 led to poorly mixing MCMC. On the other hand, changing the prior to 1/σ2 ~ Ga(2, 10) leads to only negligible changes in the inference. We also investigated robustness with respect to the parameters of the finite DP prior. Increasing H to 50 and M to 10, we observed little change in the estimated autoregressive models f̄yt−1 (yt). Figure 4 shows the estimates of fyt−1 (·) for yt−1 = 80 under a variety of choices for M and H. The different curves are almost indistinguishable.
Figure 4.
Old Faithful geyser data. Posterior means of fyt−1 (·), for yt−1 = 80. The (red) semi-dashed line is the estimate under M = 1, H = 20, the (orange) dotted line for M = 10, H = 20, the (green) dashed line is for M = 1, H = 50 and the (blue) long dashed line is for M = 10, H = 50. The estimates are almost indistinguishable.
Model Variations
In the construction of the proposed AR(1)-DDP model (6) we made a sequence of simplifying assumptions. The question arises whether a more general model without some of these simplifications could lead to a practically meaningful extension, trading parsimony for more flexibility. The answer, of course, always depends on the particular application. In a sequence of alternative implementations we investigate this question for the Old Faithful geyser example.
We first considered the truncation to the finite DDP. We implemented an alternative model as in (6), without truncating the DP random measure G to finitely many (H) point masses. Figure 5 shows the resulting posterior means of fyt−1 (·), for yt−1 = 50, 65 and 80. Compared with Figure 2 we find virtually the same inference. Another major simplification was the use of simple polynomials for the trajectories θh(y), replacing more flexible alternatives such as a Gaussian process (GP) prior for θh(y). In the special case of lag p = 1 regression the more general GP model is easy to implement. In particular, we considered an Ornstein-Uhlenbeck (OU) process, a GP with covariance function cov[θ(s), θ(t)] = τ2ρ|s−t|, for 0 < ρ ≤ 1. The attraction of the OU process is its Markovian nature, which greatly simplifies posterior computation. We thus implemented (4) with common weights wh and point masses θh(y) = b + ahy + OU(ρ, τ2), where OU(ρ, τ2) denotes the OU process with parameters (ρ, τ2) in the covariance function. A priori, ρ was assumed uniform on (0, 1), 1/τ was given a Ga(0.1, 0.1) prior, b ~ N(110, 1), and the ah were assigned a normal prior centered at a0, with a0 ~ N(−0.5, 1). For a fair comparison we used the same setup as above, now saving 10,000 iterations for the inference. The estimated distributions f̄y(·) for y = 50, 65 and 80 (not shown) after the same number of iterations look very different from Figure 2, including a unimodal f̄50(·) and an f̄65(·) lacking the secondary mode around y = 50. We conclude that there is a serious lack of convergence within the same number of iterations, which may be related to the fact that the GP model is over-parametrized. This leads us to prefer the parsimonious implementation of the AR(1)-DDP.
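The OU covariance function above is straightforward to evaluate on a grid of lagged values; a small sketch (the grid points and parameter values are illustrative):

```python
import numpy as np

def ou_cov(s, tau2, rho):
    """Ornstein-Uhlenbeck covariance matrix: cov[theta(s_i), theta(s_j)] = tau^2 * rho^|s_i - s_j|."""
    s = np.asarray(s, dtype=float)
    return tau2 * rho ** np.abs(s[:, None] - s[None, :])

K = ou_cov([50.0, 65.0, 80.0], tau2=4.0, rho=0.95)
# diagonal entries equal tau^2; correlation decays geometrically in |s - t|
```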
Figure 5.
Old Faithful geyser data. Posterior means f̄yt−1 (·) under the AR(1)-DDP model (6) with H = ∞, i.e., without truncation, for yt−1 = 50 (left), yt−1 = 65 (center) and yt−1 = 80 (right).
Finally we considered a variation with varying weights wh(y). Similar to what was proposed in Rodríguez and Dunson (2011), we used a logit model to replace the beta distributed fractions Vh in (2), with logit(Vh(y)) = ηh1 + ηh2y. We continue to use a finite truncation with H = 20. The resulting estimates f̄y(·) are shown in Figure 6. We see no practically meaningful differences in the inference. We therefore recommend the more parsimonious model (6).
Figure 6.
Old Faithful geyser data. Posterior means f̄yt−1 (·) under the model with varying weights wh(y) (truncated at H = 20), for yt−1 = 50 (left), yt−1 = 65 (center) and yt−1 = 80 (right).
3.2 Bladder Cancer Data
To illustrate the nonparametric autoregressive approach for latent scores, as described in the previous section, we consider many short sequences of binary variables. The dataset is part of a bladder cancer study conducted in the USA by the Veterans Administration Cooperative Urological Research Group (VACURG). The purpose of the study was to compare the effectiveness of three treatments (placebo, pyridoxine, and topical thiotepa) in preventing recurrence of Stage I bladder cancer (Byar et al. 1977).
Many authors, including Quintana and Müller (2004), have analyzed this dataset. The study conducted by VACURG enrolled m = 81 patients, with up to a maximum of ni = 12 observations taken every three months for each patient. We restrict ourselves to the patients in the thiotepa and placebo arms: group T (36 subjects) and group P (45 subjects). See Davis and Wei (1988) for the original dataset. Each observation records an indicator of recurrence of bladder cancer tumors, i.e. yit = 1 if an increased number of tumors was detected at time t for patient i, and yit = 0 otherwise, where i = 1, …, m denotes individuals and t = 1, …, ni denotes the measurement time for each individual i. We record treatment information as a binary covariate: xi = 0 if patient i belongs to the P group, and xi = 1 otherwise, for i = 1, …, m. The binary random variables Yit are modeled as
Yit = 1 if and only if Zit > 0,  i = 1, …, m,  t = 1, …, ni   (14)
We compare two different classes of models for the latent variables, one as described in (11), and the other as in (9), using the AR(1)-DDP specification, which defines a Markovian process on each Yi = (Yi1, …, Yini)′. The covariate xi will be included in the autoregression, together with the past value of the latent score, in both models. We mention that these models can also be considered as nonparametric generalizations of earlier parametric work in Giardina et al. (2011), where more details on data construction can be found. However, the description of the models here is self-contained.
AR(1)-latent model
We consider the following AR(1)-DDP model on the latent variables Z1, …, Zm, where Zi = (Zi1, …, Zini)′:
Zit | (Zi,t−1, xi) = (z, x) ~ ∑h≥1 wh N(β0 + β1x + (α1h + α2h x) z, σ2)   (15)
for i = 1, …, m, t = 2, …, ni, where G0 is a bivariate distribution. To complete the model definition, a prior distribution for the initial latent variables {Zi1, i = 1, …, m} must be given. We further assume that, conditionally on the latent variables Z1, …, Zm, the vectors Y1, …, Ym are independent, with binary components as in (14), where each Zit follows (15). This is not a probit-type model. Analogously to (12), the joint distribution of all observables given parameters and covariates is

P(Y11 = y11, …, Ymnm = ymnm | ⋯) = ∏i=1m ∫Ayi1 ⋯ ∫Ayini p(zi1) ∏t=2ni p(zit | zi,t−1, xi) dzini ⋯ dzi1,

where A1 = (0, ∞) and A0 = (−∞, 0].
Moreover, observe that β0 in (15) is the intercept of the regression model, and β1 represents the treatment effect on the response variable. A finite approximation of equation (15) is
Zit | (Zi,t−1, xi) = (z, x) ~ ∑h=1H wh N(β0 + β1x + (α1h + α2h x) z, σ2)   (16)
i.e. the distribution of Zit − (β0 + β1xi), given Zit−1 = zit−1, is a location mixture of Gaussian distributions with fixed variance, where the mixing distribution is a truncated common-weights DDP (constant weights, as defined in Section 2.1). Since the Zit’s are latent variables representing the observations according to (14), Zi and CZi yield the same distribution of Yi for any positive constant C. Identification may be achieved by fixing σ2; the interested reader can refer to Giardina et al. (2011), Section 3.1, for a discussion of identifiability issues.
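As a sketch of how a patient's sequence can be forward-simulated from the truncated latent model (16), consider the following; the linear form of the atoms and all numerical values are illustrative assumptions, not the fitted specification:

```python
import numpy as np

rng = np.random.default_rng(3)
H, sigma = 30, 0.5  # sigma^2 = 0.25 fixed, as discussed for identifiability

# Truncated common weights and hypothetical atom draws
V = rng.beta(1.0, 1.0, size=H)
V[-1] = 1.0
w = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
a1 = rng.normal(0.0, 0.5, size=H)
a2 = rng.normal(0.0, 0.5, size=H)

def simulate_patient(n_t, x_i, beta0=-0.22, beta1=-0.13, z1=-0.4):
    """Z_it is drawn from a location mixture centered at beta0 + beta1 * x_i plus a
    component-specific term in z_{i,t-1} (an assumed linear form); Y_it = 1 iff Z_it > 0."""
    z = np.empty(n_t)
    z[0] = z1
    for t in range(1, n_t):
        h = rng.choice(H, p=w)
        loc = beta0 + beta1 * x_i + (a1[h] + a2[h] * x_i) * z[t - 1]
        z[t] = rng.normal(loc, sigma)
    return (z > 0).astype(int)

y_i = simulate_patient(n_t=12, x_i=1)
```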
Finally note that model (15) is a slight generalization of (11). The dependence in the random mixing distribution includes both the previous latent variable zit−1 and the covariate xi. Also, although the treatment variable x is the only covariate in this application, the model could easily be adapted for inference with more covariates if desired.
Regarding σ2, β, G0 and M, we assume that σ2 = 0.25, (β0, β1)′ ~ 𝒩2((β00, β01)′, Vβ) with β00 = β01 = 0 and Vβ = I2, the bivariate identity matrix, and G0(α1h, α2h) is determined by
(α1h, α2h)′ | (α01, α02)′ ~ 𝒩2((α01, α02)′, Vα), i.i.d. for h = 1, …, H,  (α01, α02)′ ~ 𝒩2((α001, α002)′, V)   (17)
where α001 = α002 = 0, V = 10I2, Vα = I2, and (β0, β1)′ and {(α1h, α2h)′} are independent. The model is completed by assuming two different prior distributions for Zi1 for T and P patients as follows:
Zi1 | μ0, μ1 ~ 𝒩(μxi, 1), independently for i = 1, …, m,  μ0 = μ1 + D,  μ1 ~ logistic-beta(a, b),  D ~ logN(μD, σD2)   (18)
Prior (18) was proposed for the first latent variables so as to ensure that μ0 ≥ μ1 almost surely, since we assume that patients under treatment have a lower probability of recurrence. The logistic-beta(a, b) prior for μ1 means that (1 + exp(−μ1))−1 is distributed according to a Beta(a, b); the specific hyperparameter values are discussed below. A more standard assumption here would place a Beta(a, b) prior on Φ(μ1); in practice, however, this requires evaluating Φ−1, which is notoriously unstable numerically, unlike the logistic specification. The parametrization of the log-normal distribution for D is such that log D ~ N(μD, σD2), i.e. E(log D) = μD and Var(log D) = σD2. We have fixed a = b = 3, μD = −1, σD = 1.
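The logistic-beta and log-normal specifications can be sampled directly; a sketch with the stated hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(4)
a, b, mu_D, sigma_D = 3.0, 3.0, -1.0, 1.0

# logistic-beta(a, b): (1 + exp(-mu1))^{-1} ~ Beta(a, b), so mu1 is the logit of a Beta draw
u = rng.beta(a, b, size=10000)
mu1 = np.log(u / (1.0 - u))

# D log-normal: log D ~ N(mu_D, sigma_D^2), so D > 0 and hence mu0 = mu1 + D >= mu1 a.s.
D = np.exp(rng.normal(mu_D, sigma_D, size=10000))
mu0 = mu1 + D
```

Since a = b, the Beta draw is symmetric about 0.5 and μ1 is centered at zero, while the positivity of D enforces the desired ordering μ0 ≥ μ1.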
A simpler alternative to (15) is to assume
Zit | (Zi,t−1, xi) = (z, x) ~ ∑h=1H wh N(β0 + β1x + α1h z, σ2)   (19)
for i = 1, …, m, t = 2, …, ni, i.e. the “slope” α1 is constant over the two groups of patients (P and T). Note that G0 denotes a univariate distribution for (19), fixed here as the corresponding marginal of that in (17). We will refer to this latter model as AR(1)-latent 3P, since it includes only 3 regression parameters, while the former model, with 4 regression parameters, will be referred to as AR(1)-latent 4P. Summary posterior inferences and posterior distributions for the regression parameters of both models can be found in Table 1 and Figure 7. Unless otherwise stated, these estimates and those in the following tables were computed with H = 30 and M = 1. Observe that the marginal posterior distributions of β0, β1 and μ1 are concentrated on negative values. This means that the baseline probability of tumor recurrence is less than 0.5 for both groups, and that treated patients have a lower baseline probability than those in the placebo group. The posterior of D confirms that there is a difference between the two treatments.
Table 1.
Posterior means and standard deviations of the parameters of the AR(1)-latent models 3P and 4P.
|  | 3P (M = 1) |  | 4P (M = 1) |  | 4P (M ~ U(0.5, 10)) |  | 4P (M ~ trunc-ℐ𝒢(2, 2)) |  |
|---|---|---|---|---|---|---|---|---|
|  | mean | sd | mean | sd | mean | sd | mean | sd |
| β0 | −0.2171 | 0.0410 | −0.2221 | 0.0439 | −0.2206 | 0.0433 | −0.2207 | 0.0429 |
| β1 | −0.1348 | 0.0749 | −0.1547 | 0.1299 | −0.1301 | 0.1038 | −0.1286 | 0.0995 |
| α01 | 0.0798 | 3.1894 | 0.3576 | 0.9326 | 0.4703 | 0.9552 | 0.4128 | 0.9386 |
| α02 | - | - | −0.2642 | 0.9937 | −0.1596 | 0.9635 | −0.1969 | 0.9562 |
| μ1 | −0.4275 | 0.0890 | −0.4240 | 0.0876 | −0.4252 | 0.0883 | −0.4249 | 0.0882 |
| D | 0.1475 | 0.0811 | 0.1483 | 0.0816 | 0.1482 | 0.0815 | 0.1465 | 0.0809 |
| K | 4.0524 | 1.5484 | 4.2164 | 1.6007 | 3.7666 | 1.6754 | 4.2758 | 1.6719 |
| M | - | - | - | - | 0.8411 | 0.3331 | 1.1115 | 0.2748 |
Figure 7.
Posterior marginal distributions of the AR(1)-latent model parameters when H = 30 and M = 1, for models 4P (continuous) and 3P (dashed).
AR(1)-latent-Y model
As a second model, we assume a finite approximation of (9):
| (20) |
i.e. the distribution of Zit − (β0 + β1xi), given Yit−1 = yit−1, is a location mixture of Gaussian distributions with fixed variance, where the mixing distribution is a truncated single-p (constant weights) DDP. The prior for the “regression” parameters and the initial latent variables Zi1 is as in (17)–(18). Of course, the meaning of β0, β1, α1h and α2h is completely different here, but we can still meaningfully compare the resulting predictive recurrence probabilities of the two classes of models. Summary posterior inferences and posterior distributions for the regression parameters of model AR(1)-latent-Y, as specified in (20), are reported in Table 2 and Figure 8. Note that, as was the case for the AR(1)-latent models, the marginal posterior distributions of β0, β1 and μ1 are all concentrated on negative values.
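To make the structure of (20) concrete, the sketch below simulates a truncated single-p (constant weights) stick-breaking mixture and evaluates the implied conditional density. The α draws here are purely illustrative stand-ins for the α1h, α2h of the text, not fitted values:

```python
import numpy as np

rng = np.random.default_rng(1)
H, M, sigma = 30, 1.0, 1.0  # truncation level, total mass, kernel sd (sigma^2 = 1)

# Truncated stick-breaking weights; constant in y ("single-p" DDP)
v = rng.beta(1.0, M, size=H)
v[-1] = 1.0  # close the truncation so the weights sum to one
w = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))

# Atoms depend on the lagged value y through a linear form
# (illustrative draws standing in for alpha_{1h}, alpha_{2h})
alpha1 = rng.normal(size=H)
alpha2 = rng.normal(size=H)

def cond_density(z, y):
    """Mixture density of Z_it - (beta0 + beta1 x_i) given Y_{i,t-1} = y."""
    theta = alpha1 + alpha2 * y
    kern = np.exp(-0.5 * ((z - theta) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return float(np.sum(w * kern))
```

Setting the last stick-breaking fraction to one is the standard device that makes the H truncated weights sum exactly to one.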
Table 2.
Posterior means and standard deviations of the parameters of the AR(1)-latent-Y model, when H = 30, M = 1 and σ2=1.
|  | M = 1 |  | M ~ U(0.5, 10) |  | M ~ trunc-ℐ𝒢(2, 2) |  |
|---|---|---|---|---|---|---|
|  | mean | sd | mean | sd | mean | sd |
| β0 | −1.0797 | 0.0881 | −1.0818 | 0.0891 | −1.0816 | 0.0891 |
| β1 | −0.4039 | 0.1483 | −0.4009 | 0.1532 | −0.4007 | 0.1497 |
| α01 | 0.8921 | 0.9371 | 0.8870 | 0.9370 | 0.8851 | 0.9219 |
| α02 | 0.2114 | 0.9766 | 0.2234 | 0.9521 | 0.2136 | 0.9411 |
| μ1 | −0.7454 | 0.1656 | −0.7479 | 0.1675 | −0.7465 | 0.1667 |
| D | 0.2143 | 0.1361 | 0.2173 | 0.1376 | 0.2157 | 0.1373 |
| K | 4.3454 | 1.6996 | 3.9334 | 1.8607 | 4.8270 | 2.0100 |
| M | - | - | 0.8615 | 0.3582 | 1.1450 | 0.3103 |
Figure 8.
Posterior marginal distributions of AR(1)-latent-Y model parameters when H = 30 and M = 1, for σ2 = 1.
Comparison between models
For comparison purposes, we report estimates of the predictive probabilities under the two classes of models, corresponding to an additional measurement for already observed patients (Table 3) and for new patients (see Figure 9). Both are reported separately for patients in the placebo (P) and treatment (T) groups.
Table 3.
Estimates of the predictive probabilities of a new measurement for subjects 9, 16, 23, 33 (PLACEBO) and 60, 71, 74 (TREATMENT), including Monte Carlo standard errors.
|  | AR(1)-latent 3P |  | AR(1)-latent 4P |  | AR(1)-latent-Y 4P |  |
|---|---|---|---|---|---|---|
|  | Prob. | MCse | Prob. | MCse | Prob. | MCse |
| Y9,9 | 0.5412 | 0.0070 | 0.5226 | 0.0071 | 0.4550 | 0.0070 |
| Y16,10 | 0.0934 | 0.0041 | 0.1036 | 0.0043 | 0.1354 | 0.0048 |
| Y23,11 | 0.5698 | 0.0070 | 0.5532 | 0.0070 | 0.4530 | 0.0070 |
| Y33,13 | 0.1038 | 0.0043 | 0.0992 | 0.0042 | 0.1392 | 0.0049 |
| Y60,10 | 0.0688 | 0.0036 | 0.0680 | 0.0036 | 0.0744 | 0.0037 |
| Y71,9 | 0.0590 | 0.0033 | 0.0498 | 0.0031 | 0.0724 | 0.0037 |
| Y74,12 | 0.0532 | 0.0032 | 0.0526 | 0.0032 | 0.0684 | 0.0036 |
Figure 9.
Predicted recurrence probabilities for a new placebo and a new treated patient under different models.
Figure 9 displays predicted recurrence probabilities for a new placebo (upper set of lines) and a new treated patient (lower set of lines). We observe no significant differences in these predictions between the three considered models, for both types of patients.
We have also examined the robustness of these results to the choices of H and M. When increasing H to 50, we found no substantial differences in the predictives for new patients or in the posterior distributions of K, the number of components in the mixtures (see Figure 10, first row). The predictive probabilities of additional measurements for already observed patients were likewise very robust, and for this reason we do not report them.
Figure 10.
Posterior distributions of the number of components K in the mixture in the AR(1)-latent 4P model when M = 1 and H = 30 (a) or H = 50 (b), and when H = 30 and M is U(0.5, 10)- (c) or trunc-ℐ𝒢(2, 2)-distributed (d).
As far as the total mass parameter M is concerned, we have also assumed it random, with either a Uniform prior on the interval (0.5, 10) or a shifted inverse-gamma prior with parameters (2, 2) and support (0.5, +∞), i.e. M = 0.5 + X with 1/X ~ gamma(2, 2). The total mass parameter was assumed bounded away from zero because of numerical instability of the posterior simulation algorithms, as implemented in JAGS. In any case, these two choices imply quite different prior assumptions for M. Table 1 reports the regression parameter estimates for the AR(1)-latent 4P model, while Figure 11 displays some of these posterior distributions. Even though the posteriors of M under the two priors are different (see Figure 12), the posteriors of the number of clusters K in Figure 10 (c)–(d) are quite similar.
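The two priors on M are straightforward to sample directly; a sketch (note that numpy parametrizes the gamma by shape and scale, so rate 2 corresponds to scale 0.5):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Prior 1: M ~ Uniform(0.5, 10)
M_unif = rng.uniform(0.5, 10.0, size=n)

# Prior 2: shifted inverse-gamma, M = 0.5 + X with 1/X ~ gamma(shape=2, rate=2)
X = 1.0 / rng.gamma(shape=2.0, scale=0.5, size=n)
M_ig = 0.5 + X
```

The contrast between the two assumptions is clear from these draws: the uniform prior has mean 5.25, while the shifted inverse-gamma has mean 0.5 + 2/(2 − 1) · (1/2) · 2 = 2.5 with heavy right tails (its variance does not exist for shape 2), concentrating most mass on much smaller values of M.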
Figure 11.
Posterior marginal distributions of the AR(1)-latent 4P model parameters when H = 30 and M is U(0.5, 10)- (continuous) or trunc-ℐ𝒢(2, 2)-distributed (dashed).
Figure 12.
Posterior marginal distribution of the total mass parameter M in the AR(1)-latent 4P model when H = 30 and M is U(0.5, 10)- (continuous) or trunc-ℐ𝒢(2, 2)-distributed (dashed).
Figures 13 and 14 display the posterior distributions of the regression parameters and of the number K of components in the mixture, respectively, for the AR(1)-latent-Y model when M is random (as before). The posterior distribution of M is similar to that in Figure 12 and is therefore not shown here. This suggests that inference on M is not affected by the specific choice of autoregressive dependence, i.e. on latent variables or on responses. It is also worth mentioning that, under these priors, the predictive probabilities for “new” placebo and treated patients shown in Figure 15 are very robust across all the model choices explored here. Finally, we remark that, for the AR(1)-latent models only, the MCMC algorithm may fail to converge if σ2 is chosen larger than 0.25: traceplots of the latent variables for some patients with many zero responses diverged to −∞. This suggests that, in this case, the latent variables need to be tightly controlled, because of the identifiability problem mentioned earlier.
Figure 13.
Posterior marginal distributions of the regression parameters in the AR(1)-latent-Y model when H = 30 and M is U(0.5, 10)- (continuous) or trunc-ℐ𝒢(2, 2)-distributed (dashed).
Figure 14.
Posterior distributions of the number of components K in the mixture in the AR(1)-latent-Y model when M is equal to 1 (left), or U(0.5, 10)- (central) or trunc-ℐ𝒢(2, 2)-distributed (right panel).
Figure 15.
Predicted recurrence probabilities for a new placebo (continuous line) and a new treated patient (dashed line), under different models with M random.
To summarize, although the predictions for this particular dataset are quite robust across the proposed models, we point out that the two approaches are actually very different. As mentioned before, the AR(1)-latent model is not Markovian, while the AR(1)-latent-Y model is. Moreover, the former presents similarities with the nonparametric linear dynamic model of Caron et al. (2008), who assume that the hidden state at time t is a known linear combination of the hidden state at time t − 1 plus an error distributed as a Dirichlet process mixture of Gaussian distributions. The AR(1)-latent-Y model does not seem to fit such a structure. From a computational point of view, the latter model yields better mixing of the MCMC algorithm. Setting aside the problem of inference on the random measure G, which was not of interest for these data, this seems to be the only practical difference between the two models.
4 Discussion
We have presented a framework for nonparametric modeling of either one or multiple time series of observations. The model is based on dependent Dirichlet processes (DDPs), where the dependence is on lagged responses. The proposal can be characterized as using nonparametric Bayesian density regression, i.e., fully nonparametric regression, to define the regression on lagged data in an autoregressive model. For the sake of clarity, we have limited the presentation to simple implementations of the nonparametric regression. A simplification of the models to a finite number of mixture components was also discussed. The framework can also be applied to binary or ordinal responses, where the key is to apply the model to sequences of latent variables defining the observations. Applications to both types of data were considered.
We characterized and introduced the model as a DDP. However, it is worth reiterating that the model can alternatively be written as a simple DP mixture. We showed this representation in (5). Recognizing this representation greatly simplifies computation. We still prefer to think of the model as a special case of the DDP because this highlights the nature of the problem as inference about a family of random probability measures ℱ = {Fy}.
The class of models considered here can adopt many different forms. The linear dependence discussed in Section 2.1 is just one example. Higher-order polynomials or other nonlinear functions of lagged terms can be accommodated under the general framework, for instance B-splines (Eilers and Marx 1996). Another option consists of including dependence on p ≥ 2 lagged terms, i.e., a nonparametric AR(p) model. Although computational convenience is achieved by linearity assumptions on the autoregression coefficients, the model for the point masses θh(y) in the DDP can be arbitrarily specified. In practice, however, one would like to retain some interpretability of the mixture components, which poses some practical restrictions on the way lagged terms enter the model.
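For the AR(p) extension, each response is paired with its p predecessors before any regression function (linear, polynomial, or spline) is applied to them. A minimal helper sketching this lag construction (hypothetical code, not from the paper):

```python
import numpy as np

def lagged_design(y, p):
    """Design for a nonparametric AR(p) regression:
    row t contains the lags (y[t-1], ..., y[t-p]), for t = p, ..., n-1,
    paired with the responses y[p], ..., y[n-1]."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # column j holds lag j+1: y[t-(j+1)] for t = p, ..., n-1
    X = np.column_stack([y[p - j - 1 : n - j - 1] for j in range(p)])
    return X, y[p:]

y = np.arange(10.0)
X, resp = lagged_design(y, 2)
# First row pairs the response y[2] = 2.0 with its lags (y[1], y[0]) = (1.0, 0.0)
```

Any basis expansion of the lag columns (polynomials, B-splines) can then be applied to X before it enters the model for the atoms θh.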
In the discussion and the examples we did not focus on inference for the random mixing measure G in (6), as this is usually not an inference target. However, if such inference were desired, it could easily be obtained as part of the MCMC. Even without truncation to a finite mixture with H components, one could report inference on G by means of the slice sampler proposed by Walker (2007) and Kalli et al. (2011).
Finally, extensions to the current approach include further comparison between different ways of specifying the dependence on lagged terms, assessing the number of lagged terms to include in the autoregression, and multivariate formulations of the autoregressive models. These and other topics are the subject of current research.
Acknowledgments
We thank Annalisa Cadonna for running the JAGS code of the bladder cancer application. Alessandra Guglielmi was partially funded by MIUR, grant 2008MK3AFZ, and she would like to thank people at Departamento de Estadística at PUC, Chile, for their kind hospitality. Fernando Quintana was partially funded by grant FONDECYT 1100010. Peter Müller was supported in part by NIH/NCI R01CA075981.
References
- Albert JH, Chib S. Bayesian Analysis of Binary and Polychotomous Response Data. Journal of the American Statistical Association. 1993;88:669–679.
- Azzalini A, Bowman AW. A look at some data on the Old Faithful Geyser. Journal of the Royal Statistical Society, Series C (Applied Statistics). 1990;39:357–365.
- Barrientos AF, Jara A, Quintana FA. On the support of MacEachern’s dependent Dirichlet processes. Bayesian Analysis. 2012;7:277–310.
- Byar DP, Blackard C, the Veterans Administration Cooperative Urological Research Group. Comparisons of placebo, pyridoxine, and topical thiotepa in preventing recurrence of Stage I bladder cancer. Urology. 1977;10:556–561. doi:10.1016/0090-4295(77)90101-7.
- Caron F, Davy M, Doucet A, Duflos E, Vanheeghe P. Bayesian inference for linear dynamic models with Dirichlet process mixtures. IEEE Transactions on Signal Processing. 2008;56:71–84.
- Chung Y, Dunson DB. The local Dirichlet process. Annals of the Institute of Statistical Mathematics. 2011;63:59–80. doi:10.1007/s10463-008-0218-9.
- Cifarelli DM, Regazzini E. Problemi statistici non parametrici in condizioni di scambiabilità parziale: impiego di medie associative. Technical Report, Quaderni Istituto di Matematica Finanziaria, Serie III, n. 12. Università di Torino; 1978.
- Cruz-Marcelo A, Rosner GR, Müller P, Stewart C. Modeling Covariates with Nonparametric Bayesian Methods. Technical Report. 2010. Available at SSRN: http://ssrn.com/abstract=1576665.
- Davis CS, Wei LJ. Nonparametric Methods for Analyzing Incomplete Nondecreasing Repeated Measurements. Statistics in Medicine. 1988;44:1005–1018.
- De Iorio M, Johnson WO, Müller P, Rosner GL. Bayesian non-parametric nonproportional hazards survival modeling. Biometrics. 2009;65:762–771. doi:10.1111/j.1541-0420.2008.01166.x.
- De Iorio M, Müller P, Rosner GL, MacEachern SN. An ANOVA model for dependent random measures. Journal of the American Statistical Association. 2004;99:205–215.
- Dunson DB, Park JH. Kernel stick-breaking processes. Biometrika. 2008;95:307–323. doi:10.1093/biomet/asn012.
- Eilers PHC, Marx BD. Flexible Smoothing with B-splines and penalties. Statistical Science. 1996;11:89–121.
- Ferguson TS. A Bayesian analysis of some nonparametric problems. The Annals of Statistics. 1973;1:209–230.
- Fox E, Sudderth EB, Jordan MI, Willsky AS. Bayesian nonparametric inference for switching dynamic linear models. IEEE Transactions on Signal Processing. 2011;59:1569–1585.
- Giardina F, Guglielmi A, Quintana FA, Ruggeri F. Bayesian first order auto-regressive latent variable models for multiple binary sequences. Statistical Modelling. 2011;11:471–488.
- Griffin JE, Steel M. Order-based dependent Dirichlet processes. Journal of the American Statistical Association. 2006;101:179–194.
- Härdle W. Smoothing Techniques: With Implementation in S. New York: Springer; 1991.
- Hjort N, Holmes C, Müller P, Walker SG, editors. Bayesian Nonparametrics. Cambridge, UK: Cambridge University Press; 2010.
- Ishwaran H, James LF. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association. 2001;96:161–173.
- Kalli M, Griffin JE, Walker SG. Slice Sampling Mixture Models. Statistics and Computing. 2011;21:93–105.
- Kottas A, Müller P, Quintana FA. Nonparametric Bayesian Modeling for Multivariate Ordinal Data. Journal of Computational and Graphical Statistics. 2005;14:610–625.
- Lau JW, So MKP. Bayesian mixture of autoregressive models. Computational Statistics and Data Analysis. 2008;53:38–60.
- MacEachern SN. Dependent nonparametric processes. ASA Proceedings of the Section on Bayesian Statistical Science. Alexandria, VA: American Statistical Association; 1999.
- MacEachern SN. Dependent Dirichlet processes. Technical report, Department of Statistics, The Ohio State University; 2000.
- Mena RH, Walker SG. Stationary autoregressive models via a Bayesian nonparametric approach. Journal of Time Series Analysis. 2005;26:789–805.
- Müller P, West M, MacEachern SN. Bayesian models for non-linear autoregressions. Journal of Time Series Analysis. 1997;18:593–614.
- Plummer M, Best N, Cowles K, Vines K. CODA: Convergence Diagnosis and Output Analysis for MCMC. R News. 2006;6:7–11.
- Quintana FA, Müller P. Optimal Sampling for Repeated Binary Measurements. Canadian Journal of Statistics. 2004;32:73–84.
- Rodríguez A, Dunson DB. Nonparametric Bayesian models through probit stick-breaking processes. Bayesian Analysis. 2011;6:145–178. doi:10.1214/11-BA605.
- Rodríguez A, Dunson DB, Gelfand AE. Latent stick-breaking processes. Journal of the American Statistical Association. 2010;105:647–659. doi:10.1198/jasa.2010.tm08241.
- Rodríguez A, ter Horst E. Bayesian dynamic density estimation. Bayesian Analysis. 2008;3:339–366.
- Sethuraman J. A constructive definition of Dirichlet priors. Statistica Sinica. 1994;4:639–650.
- Walker SG. Sampling the Dirichlet mixture model with slices. Communications in Statistics: Simulation and Computation. 2007;36:45–54.
- Wood S, Rosen O, Kohn R. Bayesian mixtures of autoregressive models. Journal of Computational and Graphical Statistics. 2011;20:174–195.
- Zucchini W, MacDonald IL. Hidden Markov Models for Time Series. London: Chapman & Hall; 2009.