Abstract
Alcohol and other drug abuse are frequently treated in a group therapy setting. If participants are allowed to enroll in therapy on a rolling basis, irregular patterns of participant overlap can induce complex correlations of participant outcomes. Previous work has accounted for common session attendance by modeling random effects for each therapy session, which map to participant outcomes via a multiple membership construction when modeling normally-distributed outcome measures. We build on this earlier work by extending the models to semicontinuous outcomes, or outcomes that are a mixture of continuous and discrete distributions. This results in multivariate session effects, for which we allow temporal dependencies of various orders. We illustrate our methods using data from a group-based intervention to treat substance abuse and depression, focusing on the outcome of average number of drinks per day.
Introduction
Group-based interventions are important in psychology, and have been used to treat patients with many conditions, including drug dependence (Crits-Christoph, Johnson, Connolly Gibbons & Gallop, 2013), depressive symptoms (Watkins et al., 2011), and eating disorders (Tasca et al., 2010). When delivered in real-life or community settings, participants are often admitted into group-based interventions on a rolling (or open) basis, entering and departing the group at different times. However, rolling admissions can make assessing the effectiveness of a group-based intervention difficult. Pairs of participants who attend some of the same sessions may have outcomes that are correlated with one another, relative to pairs who do not attend sessions together. Further, even when two participants do not directly attend sessions together, their outcomes may still be correlated, due to group dynamics that persist over time and can impact participant outcomes (Morgan-Lopez & Fals-Stewart, 2007).
Rolling admission attendance patterns may result in an unbroken chain of clients’ session attendance, such that the sample cannot be divided into two or more distinct clusters where clients attended sessions only with other clients in their same cluster. One approach is to account for clustering with random effects for the sessions of the group and to allow them to be correlated to reflect the expected similarity of effects of sessions with considerable overlap in participants (Paddock, Hunter, Watkins, & McCaffrey, 2011; Paddock & Savitsky, 2013). This is a type of dynamic group model (Bauer, Gottfredson, Dean, & Zucker, 2013; Cafri, Hedeker, & Aarons, 2015), where the effect of the group is allowed to vary over time.
Further, for many interventions in substance use treatment, primary outcome measures might be semicontinuous, i.e., a mixture of continuous and discrete distributions. For example, for a primary alcohol and other drug use outcome, zero use might be achieved by a substantial percentage of study participants while others have a level of use that can be measured as a continuous, average amount per day. Simply dichotomizing these outcome measures wastes information and may tell an incomplete story. A better approach is to jointly model the discrete and continuous portions of the response distribution.
The specific goals of this paper are twofold. First, we present a new Bayesian model for semicontinuous outcomes that is appropriate when analyzing data from rolling admissions groups. There is a large literature around semicontinuous outcomes modeling in general (e.g., Liu, Strawderman, Cowen & Shih, 2010; Neelon, Zhu, & Neelon, 2015) and in the area of substance use in particular (Olsen & Schafer, 2001; Dembo, Wareham, Greenbaum, Childs & Schmeidler, 2009; Liu, Ma & Johnson, 2008). However, our model is the first to combine such an outcome with dynamic group modeling of session random effects in rolling groups (Paddock et al., 2011; Paddock & Savitsky, 2013). Second, we use our model development and its application to data to illustrate how easily new Bayesian models can be constructed in a modular way, combining otherwise disparate modeling strategies in order to accommodate important real-life features of psychological data. More broadly, our goal is to provide many details on the specification, implementation, and interpretation of a non-standard Bayesian model. Many of these considerations will be important for the Bayesian analysis of other types of data that are of interest in psychology and related fields.
Data and modeling overview
In our motivating analysis, we focus on assessing the impact of a rolling admissions group therapy intervention (relative to usual care) on one such semicontinuous outcome: the average number of drinks per day over a 30-day post-treatment period. This scalar outcome nonetheless implies a statistical model that has many of the characteristics of a bivariate outcome model, since we explicitly model the probability of no alcohol use and the amount of use, conditional on some use.
Our model will incorporate information on session attendance via the use of session-level random effects. These random effects accommodate the notion that some group therapy sessions may have more positive associations with post-treatment outcomes than other sessions. For instance, in one session the facilitator may do a better job of addressing the key points in the curriculum than in another. In our models, the individual’s outcome depends on the average of the session-level random effects for the sessions that client attended in what is termed a multiple membership model (Hill & Goldstein, 1998). Given our assumed data-generating process, one dimension of the bivariate session random effect models the probability of any use and the other models the amount of use, conditional on use. Further, we anticipate that group dynamics may persist over time. For example, a disruptive client may adversely impact the treatment for all who attend a session with that client. Because clients attend multiple sessions, such positive or negative group dynamics can carry over from one session to the next. For this reason, we allow the session-level random effects to be correlated over time.
Clients in this study were allowed to join the group at the start of any one of three modules, where each module is a set of six sessions covering a similar theme; in the motivating study the module themes are thoughts, activities, and people interactions. The expectation was that clients would complete all sessions in each module, and the sequence of three module themes is offered repeatedly in order to allow clients to complete three modules. Figure 1 displays hypothetical attendance patterns for this data structure, and demonstrates some of the issues that are possible with rolling admissions data. For instance, although Client 1 and Client 2 enter the therapy group at the same session, they only attend one session together. Hence, if overlap in session attendance drives a correlation in client-level outcomes, we would expect less correlation between these two than if they had attended all sessions together. Even though Clients 2 and 4 started at different times, they still attend most of their sessions together. Finally, although Clients 1 and 5 attend no sessions together, they both attend sessions with Client 3. Group dynamics may therefore induce correlation between the outcomes of Clients 1 and 5. Taken together, these types of attendance patterns make it difficult to capture relevant features of the data with standard random effects models, such as those assuming independence of random session effects or those with a random effect based on when the client entered the group.
Figure 1.
Modules and sessions in the group therapy intervention. Shaded boxes indicate client attendance for hypothetical clients.
Some of the key features of our model are depicted in Figure 2, which is a stylized representation of the assumed data-generating process. Notation will be introduced later, but the bivariate session effects along the top of the figure may be correlated with earlier session effects. We investigate several time series models to describe the autocorrelation of the session effects, including the number of previous session effects that are conditionally correlated with a given session’s effects, and whether to require stationarity for the session effect process in the sense that the joint distribution of a subset of session effects does not depend on when the subset of sessions occurred. Only the random effects of the sessions that the ith individual attended influence that client’s outcome; the activated session effects are denoted with dashed lines that correspond to Client 1’s attendance displayed in Figure 1. The first dimension of the session effect impacts the probability of any use, and the second impacts the amount of use, conditional on use. Client-level covariates can impact the probability of any use and the amount of use conditional on use. Finally, the any use and amount of use conditional on any use variables are combined to produce the observed, scalar outcome.
Figure 2.
Schematic of the assumed data-generating process for the
Bayesian and frequentist approaches
There are practical limitations to likelihood (frequentist)-based approaches to fitting statistical models that account for multiple realistic data features such as the model described above, with a group-based intervention and semicontinuous outcomes. First, dynamic group models often include relatively large numbers of random effects. This poses implementation challenges for likelihood-based computational approaches such as adaptive quadrature for approximating the likelihood. Strategies to mitigate computational difficulties, such as reducing the number of quadrature points used in the computations, can reduce the quality of the approximation to the likelihood (Kiernan, Tao, & Gibbs, 2012). These approximation methods are particularly problematic for random effects modeling where cluster sizes are small (Lin & Breslow, 1996; Capanu, Gonen & Begg, 2013), as is typical for sessions of rolling admissions groups.
Second, existing off-the-shelf software for dynamic group modeling provides limited choices for modeling realistic data features such as group-based interventions and semicontinuous outcomes. For instance, the “lmer” and “glmer” mixed effects models in the widely-used lme4 R software package can be used to fit multiple membership models. This software can also accommodate cross-classified random effects, which allow for random effects for separate, non-nested grouping factors. For cross-sectional data, the multiple membership model is a generalization of the cross-classified random effects model, as the latter only allows a lower-level unit (i.e., client) to be a member of a single higher-level unit (i.e., session) while the former allows the lower-level unit to simultaneously belong to multiple higher-level units; however, important model differences emerge in the longitudinal context given that time is treated implicitly in the multiple membership model and explicitly in the cross-classified random effects model (Cafri et al., 2015). Despite this flexibility, lme4 cannot accommodate our semicontinuous outcome, and only allows one to specify independent, unstructured, or partially independent or partially complete covariance matrices for the random effects.
SAS includes a few procedures useful for dynamic group modeling, particularly PROC GLIMMIX, which allows for cross-classified random-effects and multiple membership modeling of time-varying group membership with several choices for modeling the covariance structures of the time-varying group effects (Cafri et al., 2015). However, PROC GLIMMIX does not allow for semicontinuous outcomes modeling.
Most semicontinuous outcomes routines in standard software packages lack the capability to model random effects. One exception is SAS PROC NLMIXED, which is a popular choice for fitting semicontinuous models with random effects (Su, Tom, & Farewell, 2015; Liu, Strawderman, Cowen, & Shih, 2010). However, it cannot be used for semicontinuous outcomes modeling in the dynamic group context because it would not allow for correlated random effects across group therapy sessions.
More flexibility is possible in a Bayesian framework with computational tools such as Markov Chain Monte Carlo (MCMC). MCMC is not subject to the aforementioned limitations of approximate approaches such as quadrature that are frequently used in non-Bayesian analyses. Moreover, Bayesian models are often modular in the sense that important features of different models can be combined with little effort; in our case we can include semicontinuous outcomes in a model with highly structured random effects. We provide code to fit our analysis model in the freely available Bayesian statistical modeling package Stan (Hoffman & Gelman, 2014), and note WinBUGS (Lunn, Thomas, Best & Spiegelhalter, 2000) is another freely-available Bayesian statistical modeling package that would work for this type of Bayesian model construction.
In the remainder of the paper, we more formally describe the details of our model, with particular attention paid to semicontinuous outcomes, multiple membership models, and the use of time series models to describe the bivariate session effects distributions. Then we present an overview of Bayesian computation, emphasizing the methods that we use to fit our models. We then apply our models to simulated and real data. Finally, we describe extensions and other areas where such models may be of use.
Semicontinuous models of alcohol use following rolling admission group therapy
The data considered in this paper come from a randomized controlled trial that compared group cognitive behavioral therapy against usual care in an outpatient substance abuse treatment program. As indicated in Figure 1, the group therapy curriculum was 18 sessions long, divided into three six-session modules. Participants were allowed to enter the group at the beginning of any of the three modules, resulting in a semi-rolling group therapy structure.
Figure 2 depicts a stylized representation of our model for the hypothetical Client 1 in Figure 1. We will focus on the mean number of drinks per day over a 30-day post-treatment period for individual i = 1, …, n, which we denote yi. We use S to denote the total number of group therapy sessions offered over the course of the study, and refer to the number of sessions that each individual attends as si. For each session t we model two random effects: a session-specific random effect that impacts the probability of any alcohol use among clients who attended the session, φ0,t, and a session-specific random effect that similarly models the mean conditional on some alcohol use, φc,t (Olsen & Schafer, 2001). These session effects account for the possibility that two clients who attend the same sessions may be more likely to have similar drinking behavior than two clients who do not attend any sessions together, and may reflect different group climate and experiences session-to-session or differences in the social dynamics of the group session that may impact participant outcomes. We collect these random effects into the length-S vectors φ0 = (φ0,1, …, φ0,S)⊤ and φc = (φc,1, …, φc,S)⊤.
Multiple membership modeling
When the outcome yi is measured following delivery of the group-based intervention, then yi reflects the effect of all sessions attended by client i. Unlike hierarchically-structured data with a standard nesting structure, yi cannot be assigned to a single clustering unit in the hierarchy; rather there are multiple units – in our case, sessions – associated with yi. Thus, the length-S vectors of session random effects φ0 and φc must be mapped onto the participant outcomes. For this, we use multiple membership modeling to account for the non-nested multilevel structure of the data (Hill & Goldstein, 1998). Let M be an n × S matrix defined such that the (i, j)th element equals 1/si if the ith participant attended the jth session, and equals zero otherwise. If client i attended no sessions, all elements of the ith row are set equal to zero. The ith elements of Mφ0 and Mφc record the averages of the session random effects for those sessions attended by participant i. Hence, if participants i and j attend exactly the same sessions, the ith and jth elements of Mφ0 will be the same, as will those for Mφc. In Figure 2, the presence or absence of ties between the session effects and the outcomes is determined by M. Referencing Figure 1, the first row of the multiple membership matrix would have the form [1/3, 1/3, 0, 0, 1/3, 0, …, 0], reflecting that Client 1’s participation in the rolling group is split across sessions 1, 2, and 5. The second row for Client 2 would be [0.1, 0, 0.1, 0.1, 0, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0, …, 0], reflecting attendance split across 10 sessions.
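The construction of M can be sketched in a few lines of Python. This is only an illustration: the function name `membership_matrix`, the choice S = 18, and the placeholder session effects are assumptions made for the example, while the two attendance lists follow the rows for Clients 1 and 2 given above.

```python
import numpy as np

def membership_matrix(attendance, n_sessions):
    """Build the n x S multiple membership matrix M.

    attendance: one list of 1-based session numbers per client.
    Row i holds 1/s_i for each attended session and 0 elsewhere;
    a client who attended no sessions gets an all-zero row.
    """
    M = np.zeros((len(attendance), n_sessions))
    for i, sessions in enumerate(attendance):
        if sessions:
            cols = np.array(sorted(set(sessions))) - 1  # to 0-based columns
            M[i, cols] = 1.0 / len(cols)
    return M

# Clients 1 and 2 as described in the text; S = 18 is assumed for illustration
attendance = [[1, 2, 5], [1, 3, 4, 6, 7, 8, 9, 10, 11, 12]]
M = membership_matrix(attendance, n_sessions=18)

# Placeholder session effects (not estimates), used only to show the mapping
rng = np.random.default_rng(0)
phi_0 = rng.normal(size=18)
avg_effects = M @ phi_0  # element i averages the effects client i was exposed to

# Sessions attended by both clients (only session 1 in this example)
shared = (M[0] > 0) & (M[1] > 0)
```

Each nonzero row of M sums to one, so `M @ phi_0` is a per-client average rather than a sum, which keeps the scale of the session contribution comparable across clients with different attendance counts.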
The semicontinuous outcomes model
A canonical model for semicontinuous outcomes requires assuming that — after a suitable transformation, if warranted — the distribution of the outcome measure is a mixture of a normal distribution and positive mass corresponding to zero alcohol use. We use a probit model to describe the probability of no alcohol use; the impact of the session effects on this model will be Mφ0. The contribution of the session effects to the continuous portion of the participant outcomes is given by Mφc. Incorporating all of these pieces, we may write our model hierarchically as follows (where “~” is read “is distributed as”):
- Level 3:
- Level 2:
- Level 1 (latent):
- Level 1 (observed):
In words, our hierarchical model can be described as follows. First, Level 1 (observed) represents the level at which each outcome yi is observed. Reflecting the semicontinuous nature of the data, yi takes on a continuous value if ui = 1 and equals 0 if ui = 0. Moving to Level 1 (latent) of the model, ui is a latent variable indicating whether yi is predicted to be equal to 0 or greater than 0. The combination of the model specifications for ui and wi,0 is the Bayesian latent variable-based implementation of the probit model. The probability of any alcohol use is described by this probit model, defined as the probability that a latent normal quantity wi,0 exceeds zero (Albert & Chib, 1993). The probability of use depends on the session random effects via the multiple membership construction applied to the session effects that relate to the use/no use portion of the model. The mean of these session effects is given by Miφ0. We also allow client-level characteristics to predict the probability of any use via the term xi,0β0, where β0 is a vector of probit regression parameters and xi,0 is a row vector that includes characteristics of client i, as well as a leading 1 for the intercept. The continuous portion of the model describes the amount of use conditional on having some use, and it similarly contains characteristics of client i, xi,c (which may or may not be the same as xi,0), an accompanying regression coefficient vector βc, and session random effects associated with the continuous portion, φc, which are also modeled using the multiple membership construction. As errors are assumed to be normally distributed, one extra parameter is needed for the continuous portion of the model: σy is included to account for the error standard deviation. In Level 2 we have the random session effects φ that are drawn from a distribution that depends further on hyperparameters ξ that are drawn from a hyperprior in Level 3. We discuss details of these higher levels below.
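The verbal description above can be collected into a compact hierarchy. The display below is a sketch reconstructed from that description, not a quotation of the original equations. In particular, the symbol w_{i,c} for the latent continuous amount, the unit variance of w_{i,0} (the usual probit identification in the Albert and Chib construction), and the omission of any transformation of the outcome are assumptions.

```latex
\begin{align*}
\text{Level 3:}\quad & \xi \sim p(\xi) \\
\text{Level 2:}\quad & \phi \mid \xi \sim p(\phi \mid \xi) \\
\text{Level 1 (latent):}\quad
  & w_{i,0} \sim \mathrm{normal}\!\left(x_{i,0}\beta_0 + M_i\phi_0,\; 1\right),
    \qquad u_i = 1(w_{i,0} > 0), \\
  & w_{i,c} \sim \mathrm{normal}\!\left(x_{i,c}\beta_c + M_i\phi_c,\; \sigma_y^2\right) \\
\text{Level 1 (observed):}\quad & y_i = u_i\, w_{i,c}
\end{align*}
```

Under this sketch the probability of any use is Φ(x_{i,0}β0 + M_iφ0), where Φ is the standard normal distribution function, and y_i equals the continuous draw only when u_i = 1.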
Vector autoregressive model of session effects
The literature around evaluation of rolling admission group therapy interventions has highlighted the need to account for the tendency of the “history” of the group to impact future participants’ outcomes. Previous work has focused on conditionally autoregressive (CAR) prior specifications to describe the dependence of session random effects that account for this history (e.g., Paddock et al., 2011). The CAR prior’s development was motivated by spatial data analysis, where the notion of what it means for geographic areas to be ‘neighbors’ is intuitive. An analogy can be drawn from spatial data to the rolling group problem: sessions might be defined as neighbors if they are offered at adjacent time points or if there is overlap in the participants who attend those sessions. The CAR prior requires a Markov assumption, which stipulates that a random effect for session t is dependent only on the random effects for sessions that are ‘neighbors’ of that session. This implies that random effects for sessions that are not neighbors are conditionally independent, greatly simplifying computation by reducing the number of other sessions upon which one must condition to model a given session random effect. However, under CAR it is still possible for session random effects to be unconditionally dependent, which is desirable for rolling group data analyses given the nature of client participation. For rolling group studies, most applications have used temporal adjacency to define the CAR covariance structure such that the effect of session t is independent of all others, conditional on the effects of sessions t − 1 and t + 1 (Paddock et al., 2011). For such a model, the session effect at time t is modeled as having a normal distribution with a mean that depends on the average of random effects related to the sessions t − 1 and t + 1.
We model the session effects via a latent, vector autoregressive (VAR) process (e.g., Chapter 9, Prado & West, 2010). We incorporate random session effects into both parts of the model: one random effect models the probability of any use and the second random effect models the amount of use, conditional on having any use. This model allows the bivariate session effects at time t to be directly correlated with the bivariate session effects at times t − 1, t − 2, …, t − p for some integer p, which is the order of the VAR model, thereby allowing longer-range direct dependencies than are possible with the CAR specifications. (The autoregressive model of order 1 is very similar to the CAR model in the time domain, although with different parameterizations and typically different implied prior distributions. See, e.g., Banerjee, Carlin, & Gelfand (2014) for thorough discussion of CAR models.) We investigate multiple orders for the VAR model comparing, e.g., whether it is better to allow the bivariate session effects for session t to depend directly on the effects of session t − 1 and t − 2, or only t − 1. Even in the p = 1 case, we emphasize that the distribution of the session effect at time t is not independent of the session effects at times t − 2 and before. Rather, conditional on the time t − 1 random effect, the time t effect is independent of all earlier random effects under the order 1 VAR model.
Our VAR model is defined as follows. For session t we have the pair of random session effects φt = (φ0,t, φc,t)⊤. For the VAR model of order p these are modeled as φt | φt−1, …, φt−p ~ normal(B1φt−1 + B2φt−2 + … + Bpφt−p, Σ).
Here, Bj are 2×2 matrices whose values are estimated from the data. The diagonal entries in Bj capture autocorrelations within each dimension of the random effect. The off-diagonal entries allow the random effect related to any use to impact the random effect related to the amount of use at a later time, or vice versa. Finally, Σ is a variance/covariance matrix that describes the variation of the random effect distribution, conditional on the previous random effects.
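To make the roles of the Bj and Σ concrete, the following sketch simulates bivariate session effects from a VAR(p) process. The function name `simulate_var`, the numeric values of B1 and Σ, the choice of 18 sessions, and the convention of setting effects before the first session to zero are all assumptions for illustration; none are estimates from the study.

```python
import numpy as np

def simulate_var(B_list, Sigma, n_sessions, rng):
    """Draw phi_1, ..., phi_S from a bivariate VAR(p) process.

    B_list: list of p (2 x 2) coefficient matrices B_1, ..., B_p
            (an empty list gives the independent VAR(0) case).
    Sigma:  2 x 2 covariance of phi_t conditional on earlier effects.
    Effects before the first session are fixed at zero (an assumption).
    """
    p = len(B_list)
    phi = np.zeros((n_sessions + p, 2))
    for t in range(p, n_sessions + p):
        mean = np.zeros(2)
        for j, B in enumerate(B_list, start=1):
            mean += B @ phi[t - j]  # contribution of the effect j sessions back
        phi[t] = rng.multivariate_normal(mean, Sigma)
    return phi[p:]

rng = np.random.default_rng(0)
B1 = np.array([[0.5, 0.1],
               [0.0, 0.3]])        # illustrative values only
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])     # illustrative values only
phi_var1 = simulate_var([B1], Sigma, n_sessions=18, rng=rng)
phi_var0 = simulate_var([], Sigma, n_sessions=18, rng=rng)
```

Column 0 of the output plays the role of the any-use effect and column 1 the amount-of-use effect, so the off-diagonal entries of B1 control how one dimension feeds into the other at the next session.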
In time series analyses, stationarity means that the joint distribution of {φt, φt+1, …, φt+k} is the same as for {φt+q, φt+1+q, …, φt+k+q} for integers q and k, which is desirable because it ensures that the variance of the random effects distribution does not “blow up”. For example, in the order p = 1 VAR model, if B1 were a diagonal matrix with diagonal values 2 and 0.5, the variance of φ0,t would increase with t. Stationarity can be enforced via a prior specification that does not support (i.e., has zero density for) non-stationary models, which will be discussed below.
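A standard way to check whether a given set of coefficient matrices implies a stationary VAR process is to verify that all eigenvalues of the companion matrix lie strictly inside the unit circle. The sketch below applies this check to the example above. The function name `is_stationary` is hypothetical, and this is a diagnostic on fixed coefficients rather than the prior-based enforcement of stationarity used in the model.

```python
import numpy as np

def is_stationary(B_list):
    """Return True if the VAR(p) with coefficients B_1..B_p is stationary.

    Stacks the coefficients into the (k*p) x (k*p) companion matrix and
    checks that its spectral radius is strictly less than one.
    """
    p = len(B_list)
    if p == 0:
        return True  # independent effects are trivially stationary
    k = B_list[0].shape[0]
    companion = np.zeros((k * p, k * p))
    companion[:k, :] = np.hstack(B_list)         # top block row: B_1 ... B_p
    companion[k:, :-k] = np.eye(k * (p - 1))     # identity blocks shift the lags
    spectral_radius = np.max(np.abs(np.linalg.eigvals(companion)))
    return bool(spectral_radius < 1.0)

explosive = is_stationary([np.diag([2.0, 0.5])])   # the example from the text
stable = is_stationary([np.diag([0.5, 0.5])])
stable_var2 = is_stationary([0.5 * np.eye(2), 0.3 * np.eye(2)])
```

For the diagonal B1 with entries 2 and 0.5, the eigenvalue 2 lies outside the unit circle, which is the algebraic counterpart of the growing variance of φ0,t described above.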
We will also consider the “VAR(0)” version of the model where the session effects are mutually independent of one another. In this case, φt ~ normal(0, Σ). Even without the serial dependence, this model allows for some correlation of participant outcomes due to common participant overlap, via the multiple membership specification. However, for participants who attend many sessions, this model would imply a priori that we expect small impacts of common session attendance since averaging over many independent, mean zero random effects would be expected to be close to zero.
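The shrinking influence of session effects under the VAR(0) model can be made explicit with a short calculation. In the sketch below, A_i denotes the set of sessions attended by client i and Σ_{00} denotes the diagonal element of Σ for the any-use dimension; both symbols are introduced here for illustration only.

```latex
\begin{align*}
\operatorname{Var}(M_i \phi_0)
  &= \operatorname{Var}\!\Big(\frac{1}{s_i} \sum_{t \in A_i} \phi_{0,t}\Big)
   = \frac{1}{s_i^{2}} \sum_{t \in A_i} \operatorname{Var}(\phi_{0,t})
   = \frac{\Sigma_{00}}{s_i}
\end{align*}
```

The second equality uses the mutual independence of the session effects. The prior variance of the averaged effect therefore decreases in proportion to 1/s_i, so a client who attends 10 sessions has one tenth of the prior variance of a client who attends a single session.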
In our motivating application, the clients attend sessions in an unbroken chain, with some session-to-session overlap. Other times, there may be two or more distinct rolling admission groups of clients, with no overlap across groups. In such cases, one can model the groups’ session effects to be mutually independent (or conditionally independent) of one another. That is to say, one could assume that ξ is the same for each group; one could assume that ξ is drawn independently from its prior distribution for each group; or, one could further parameterize p(ξ|ζ), where ζ represents additional hyperparameters that are learned from the data, and are shared across all groups.
Introduction to Bayesian computation
To avoid overlap with other papers in this issue, we assume that the reader is conceptually familiar with such basic Bayesian concepts as prior and posterior distributions. We will give some background on Bayesian inference in order to better motivate the new tools that we employ in this paper. In this section, we provide an overview of the Metropolis-Hastings (MH) algorithm, which is the classic approach that underlies much of Bayesian computation. We also describe an advanced variant of the MH algorithm, called Hamiltonian Monte Carlo (HMC). The software that we use to estimate our models, called Stan (Stan Development Team, 2015a; Stan Development Team 2015b), implements HMC. Additionally, we discuss computational efficiency of HMC compared to a more basic MH implementation.
Most Bayesian computations boil down to calculating integrals where the integrand is the product of some known function f of θ (where θ denotes all of the model parameters) and the density p(θ|y). For example, the posterior mean is the most commonly-reported point estimate in Bayesian analyses; the posterior mean for a scalar parameter is given by ∫ θp(θ|y)dθ, with f equal to the identity function. These integrals are typically in relatively high dimensions. In order to approximate these integrals, the standard approaches first draw samples from the posterior distribution, p(θ|y), in a process described below.
After generating M samples from the posterior distribution, we can estimate the integral of interest via simple averages. If samples from the posterior distribution are denoted as (θ(1), θ(2), …, θ(M)), we make use of the following approximation: ∫ f(θ)p(θ|y)dθ ≈ M−1 ∑j f(θ(j)).
For example, we estimate the posterior mean as θ̂ = M−1 ∑j θ(j). The limits of a 90% credible interval (CI) are estimated via the 5th and 95th percentiles of the sampled θ(j) because these points approximate a and b that satisfy P(a ≤ θ ≤ b | y) = 0.90, with posterior probability 0.05 in each tail.
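These Monte Carlo summaries amount to a mean and two percentiles of the sampled values. The sketch below uses independent draws from a known normal distribution as a stand-in for MCMC output; the distribution, its parameters, and the number of draws are assumptions chosen only to make the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for posterior samples theta^(1), ..., theta^(M)
draws = rng.normal(loc=2.0, scale=0.5, size=20_000)

# Posterior mean: the average of the sampled values
post_mean = draws.mean()

# 90% credible interval: 5th and 95th percentiles of the sampled values
ci_low, ci_high = np.percentile(draws, [5, 95])

# Share of draws inside the interval, close to 0.90 by construction
coverage = np.mean((draws >= ci_low) & (draws <= ci_high))
```

The same two lines apply to any scalar function f of the parameters: evaluate f at each draw, then take the mean or percentiles of the results.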
For complex models, the most common way to draw samples from the posterior distribution is through the use of Markov chain Monte Carlo (MCMC) techniques. “Markov” in MCMC refers to the fact that the samples from the posterior are typically not independent of one another – the value of θ(j) is allowed to depend on θ(j−1) though, conditional on θ(j−1), θ(j) must be independent of θ(j−2) and earlier samples.
The Metropolis-Hastings algorithm
The Metropolis-Hastings (MH) algorithm is the basis of many Bayesian estimation schemes. From the current value θ(j), a new value θ* is proposed via a draw from a proposal distribution whose parameters may depend on θ(j). For example, if θ* is drawn from a normal distribution with mean θ(j) and fixed variance/covariance, then θ(j+1) will equal θ* if p(θ(j)|y) < p(θ*|y). On the other hand, if p(θ(j)|y) ≥ p(θ*|y), θ(j+1) will equal θ* with probability p(θ*|y)/p(θ(j)|y) and θ(j+1) = θ(j) otherwise. Thus, in this simple algorithm, there are two sources of potential autocorrelation. First, since the proposal distribution is centered at the current value, even if the proposed value is accepted, the sampled value is likely to be closer to the current value than if an independent value could be drawn. Second, if the proposed new value is rejected, θ(j+1) = θ(j).
From a practical perspective, the autocorrelation between samples can mean that a very large number of samples needs to be drawn in order to reproduce the information that a much smaller sample would provide if the samples could be drawn independently of each other. This, in turn, can impose an excessive computational burden.
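The accept/reject rule just described can be written as a short random-walk sampler. This is a minimal sketch: the function name `random_walk_mh`, the standard normal target, the proposal standard deviation, and the iteration counts are assumptions for illustration and are unrelated to the model in this paper.

```python
import numpy as np

def random_walk_mh(log_post, theta0, n_iter, step_sd, rng):
    """Random-walk Metropolis-Hastings with a symmetric normal proposal.

    log_post returns the log posterior density up to an additive constant.
    Returns the sampled values and the fraction of accepted proposals.
    """
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    current_lp = log_post(theta)
    samples = np.empty((n_iter, theta.size))
    n_accept = 0
    for j in range(n_iter):
        # Proposal centered at the current value
        proposal = theta + rng.normal(scale=step_sd, size=theta.size)
        proposal_lp = log_post(proposal)
        # Accept with probability min(1, p(proposal | y) / p(current | y))
        if np.log(rng.uniform()) < proposal_lp - current_lp:
            theta, current_lp = proposal, proposal_lp
            n_accept += 1
        samples[j] = theta  # a rejection repeats the current value
    return samples, n_accept / n_iter

rng = np.random.default_rng(2)
log_post = lambda th: -0.5 * np.sum(th ** 2)   # standard normal target
samples, accept_rate = random_walk_mh(log_post, [0.0], 20_000, 1.0, rng)
kept = samples[10_000:]                        # discard early iterations
```

Working on the log scale avoids numerical underflow when the posterior density is very small, and both sources of autocorrelation are visible in the loop: proposals stay near the current value, and rejected proposals duplicate it.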
Hamiltonian Monte Carlo
Improving the efficiency of Bayesian computation has been an active research topic in statistics for several decades. This line of research has achieved a high level of both sophistication and performance, with Hamiltonian Monte Carlo (HMC; also called Hybrid Monte Carlo) being one of the most successful classes of MCMC methods (Duane, Kennedy, Pendleton, & Roweth, 1987; Neal, 2011). At a high level, HMC augments the parameter space with an additional set of parameters. These extra parameters have a purely computational utility – they do not change the fitted values or interpretation of the model in any way. Their sole purpose is to improve the efficiency of the MCMC algorithm by reducing the autocorrelation between draws.
Although the technical details are beyond the scope of this paper, HMC draws inspiration from physical systems. Neal (2011) motivates HMC in two dimensions by imagining a frictionless puck that slides over a smooth surface of varying heights. The potential energy of the puck is proportional to the height of the surface at the puck’s current location. In HMC, the model parameters θ are interpreted as a location in a physical system with the negative log posterior density describing the location-specific potential energy. In addition, HMC introduces auxiliary variables (i.e., parameters in the model that are purely used to facilitate computation) that can be interpreted as physical momentum (defined as mass times velocity).
At each step in the MCMC, a momentum variable is sampled for each dimension of the parameter space. Then, on the basis of the sampled momentum and the potential energy function (i.e., the negative log posterior), the new proposal is given by estimating the new location of the hypothetical, frictionless puck. Although the Hamiltonian dynamics are deterministic, the random draws of momentum introduce randomness into the system. Compared to the naïve random walk Metropolis-Hastings algorithm described above (which simply draws proposals from a normal distribution centered at the current state), HMC proposals can be far from the current state of the parameter while still having a high acceptance probability.
HMC relies on two primary tuning parameters in order to efficiently draw samples from the posterior distribution: a step size and a number of steps. Hamiltonian dynamics are defined in continuous time, so in order to implement the algorithm on a computer, time is discretized using a small step size that must be chosen. The discretized Hamiltonian dynamics are then allowed to carry on for some number of steps. Both the step size and the number of steps can impact performance substantially, and should be tailored to the posterior distribution of interest.
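The two tuning parameters appear directly in a basic HMC transition built on the leapfrog integrator. The sketch below assumes unit masses and a bivariate standard normal target; the function name `hmc_step`, the step size of 0.2, and the 10 leapfrog steps are illustrative choices, and this is plain HMC rather than the adaptive sampler implemented in Stan.

```python
import numpy as np

def hmc_step(theta, log_post, grad_log_post, step_size, n_steps, rng):
    """One HMC transition using a leapfrog integrator and unit masses."""
    momentum = rng.normal(size=theta.shape)      # fresh momentum each transition
    theta_new, r = theta.copy(), momentum.copy()
    # Leapfrog: half step for momentum, full steps for position, final half step
    r += 0.5 * step_size * grad_log_post(theta_new)
    for step in range(n_steps):
        theta_new += step_size * r
        if step < n_steps - 1:
            r += step_size * grad_log_post(theta_new)
    r += 0.5 * step_size * grad_log_post(theta_new)
    # Total energy = potential (negative log posterior) + kinetic
    h_old = -log_post(theta) + 0.5 * momentum @ momentum
    h_new = -log_post(theta_new) + 0.5 * r @ r
    # Metropolis correction for the discretization error
    if np.log(rng.uniform()) < h_old - h_new:
        return theta_new, True
    return theta, False

rng = np.random.default_rng(3)
log_post = lambda th: -0.5 * th @ th     # bivariate standard normal target
grad_log_post = lambda th: -th
theta = np.zeros(2)
draws, n_accepted = [], 0
for _ in range(3000):
    theta, accepted = hmc_step(theta, log_post, grad_log_post, 0.2, 10, rng)
    draws.append(theta)
    n_accepted += accepted
draws = np.array(draws)
accept_rate = n_accepted / len(draws)
```

A smaller step size reduces the energy error, and therefore the rejection rate, at the cost of more gradient evaluations per transition, while the number of steps determines how far each proposal travels.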
Stan software
The No-U-Turn Sampler (NUTS) is a variation of HMC that automatically chooses these tuning parameters (Hoffman & Gelman, 2014). Absent the NUTS technology, users would have to perform preliminary fits of the model in order to select tuning parameters that result in good mixing behavior. NUTS makes a fully automated HMC engine possible. Accordingly, much as a driver can safely operate a car without understanding the finer points of internal combustion engines, tools have been developed to quickly estimate complex Bayesian models without needing to understand the subtleties of advanced MCMC techniques, or perform the tedious calculations necessary to implement the model estimation from scratch (such as the gradient of the log posterior for HMC). The newest generation of tools for Bayesian computation allows users remarkable flexibility to explore broad classes of models with efficient computation and relatively little programming. In this paper, we will employ an actively-developed suite of tools for Bayesian computation called Stan, which makes the NUTS HMC algorithm readily accessible to applied analysts and will be used to estimate our models (Stan Development Team, 2015a, 2015b; Hoffman & Gelman, 2014). Details of our code are presented in the Appendix.
Computational efficiency
The Bayesian computational techniques described above are broadly applicable. We now focus on the computational efficiency of Stan software, as applied to the type of models that are the focus of this paper. When comparing the computational efficiency of one Bayesian algorithm against another, two primary factors are at play. First is the amount of time it takes to complete one full iteration of the MCMC algorithm (i.e., to update each of the model parameters once). This depends on both the details of the algorithm and the software implementation. The second factor depends on how strongly – from iteration to iteration – the drawn parameters are autocorrelated with one another (also called the “mixing” behavior of the MCMC), where the ideal algorithm would quickly produce independent draws from the posterior distribution.
We compare Stan to a Metropolis-Hastings (MH) within Gibbs MCMC sampler, which would be more typical of the type of MCMC algorithm that an analyst might program “from scratch” if tools such as Stan or WinBUGS were not available. For simplicity, we use artificial data and assume that all parameter values are known except for blocks of parameters B, β0, βc, and φ. For all blocks of parameters except B, the conditional distribution is multivariate normal. We update B via a Metropolis-Hastings step, where a new value is proposed by adding normally-distributed noise (with standard deviation 0.1) to each element of B. To accommodate the probit portion of the model, we augment the parameter space with latent, normally-distributed variables that – when updated in the MCMC – allow us to use standard normal theory to update the β0 parameters (Albert & Chib, 1993).
To compare the two algorithms, we run both MCMC algorithms for 5,000 iterations. In each case, we discard the first 2,500 samples to minimize influence of the starting values of the parameters in the MCMC. We record the time required to complete all 5,000 iterations. In order to account for both the iteration-to-iteration time of the algorithm, as well as the mixing performance, we report the effective number of Monte Carlo samples per minute. For example, if an MCMC algorithm is run for 1,000 iterations, autocorrelation in the Monte Carlo samples may only yield the information from 300 independent draws from the posterior distribution, which is estimated by the effective number of samples. We use the “coda” package in R to calculate the effective number of samples.
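As a rough illustration of the efficiency metric, the sketch below estimates an effective sample size from the sample autocorrelations of a single chain and converts it to effective samples per minute. This is a simplified estimator (it truncates the autocorrelation sum at the first non-positive lag) and is not the spectral method used by the “coda” package; the function names are hypothetical.

```python
import random

def effective_sample_size(chain):
    """Estimate n / (1 + 2 * sum of positive-lag autocorrelations)."""
    n = len(chain)
    mean = sum(chain) / n
    centered = [x - mean for x in chain]
    var = sum(c * c for c in centered) / n
    if var == 0.0:
        return float(n)
    rho_sum = 0.0
    for lag in range(1, n):
        acov = sum(centered[t] * centered[t + lag] for t in range(n - lag)) / n
        rho = acov / var
        if rho <= 0.0:          # truncate at the first non-positive autocorrelation
            break
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)

def ess_per_minute(chain, elapsed_seconds):
    """Effective samples divided by run time in minutes."""
    return effective_sample_size(chain) / (elapsed_seconds / 60.0)

# Illustration with artificial chains: a strongly autocorrelated AR(1) chain
# carries far less information than the same number of independent draws.
rng = random.Random(0)
independent = [rng.gauss(0.0, 1.0) for _ in range(2000)]
sticky = [0.0]
for _ in range(1999):
    sticky.append(0.95 * sticky[-1] + rng.gauss(0.0, 1.0))
```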
While we find that Stan runs slightly slower on an iteration-to-iteration basis (taking approximately 80% longer than the algorithm coded from scratch to run for 5,000 iterations), it is much more efficient overall. In Table 1, we see that Stan achieves 1,132 effective Monte Carlo samples per minute for the leading element of B versus only 5.2 for the MCMC sampler we programmed from scratch. The efficiency of the two algorithms for the treatment effect parameters is similar; however, the MH within Gibbs estimates of the β parameters should not be considered reliable if the effective number of MCMC samples is so small for B.
Table 1.
Effective number of Monte Carlo samples for Markov chains run for 2,500 iterations and effective samples per minute for Stan and Metropolis Hastings (MH) within Gibbs algorithms
| Method | Parameter | Effective samples | Effective samples/minute |
|---|---|---|---|
| Stan | B1,1 | 2,500 | 1,132 |
| MH in Gibbs | B1,1 | 2.7 | 5.2 |
| Stan | β0 treatment | 2,500 | 1,132 |
| MH in Gibbs | β0 treatment | 1,431 | 2,329 |
| Stan | βc treatment | 2,294 | 1,039 |
| MH in Gibbs | βc treatment | 789 | 1,344 |
There are two primary reasons for Stan’s superior performance. First, Stan creates C++ code to estimate the model which should run far more quickly than the analogous program that is written purely in R. (Stan can be called from R or other statistical packages, but the heavy lifting is done using C++ code that Stan produces.) Second, the more sophisticated HMC algorithm results in better mixing behavior than our MH within Gibbs sampler is able to achieve. The efficiency of the MH within Gibbs sampler could be improved by tuning the proposal distribution for B, but this can be laborious or more complex from a coding perspective; no such tuning is necessary for Stan. Further, this comparison does not capture the amount of programming effort required of the user: the programming required to run the Stan model is far less involved than programming an MCMC from scratch, even when using the much less sophisticated MCMC algorithm that we implemented here.
Simulation studies
We now apply our models to simulated data to illuminate several important considerations when modeling semicontinuous outcomes in rolling admission group therapy studies. We begin by considering the impact on inference when the rolling admission group structure is ignored by the analyst, focusing on coverage rates of 95% credible intervals. We also consider a more challenging scenario where the true data-generating process for the session effects is “almost” non-stationary. Here, we focus on whether one should only consider stationary models (through the lens of credible interval coverage rates and root mean squared error), and how that can be implemented through the use of a Bayesian prior distribution. We also consider what it may mean if – in a situation where one believes that the data generating process is stationary – the posterior distribution has most of its mass in a region of non-stationarity.
Ignoring rolling group structure
We begin by assessing the importance of accounting for rolling group structures when analyzing data that arise from such studies. For our simulation, we assume sample sizes of 100 observations both in the treated and control samples. Treated clients attend between 4 and 8 sessions in a row (with the number of sessions chosen randomly), and the timing of the first session that clients attend is chosen randomly from session 1 through 43; there are 50 sessions total. In our model, the session effects follow a VAR(1) process with φi ~ normal(Bφi−1, Σ), where B is a diagonal matrix with entries equal to 0.75, and Σ is also a diagonal matrix, with non-zero entries equal to 0.25, which corresponds to a standard deviation of 0.5. For both the continuous and point mass portions of the model, we include an intercept and the same four normally-distributed covariates. Both parts of the model also include a treatment indicator. The probit coefficient associated with the treatment indicator is fixed at 0.2 across replications of the simulation; the other elements of β0 are drawn randomly from a normal distribution with mean zero and standard deviation 0.1. The element of βc that corresponds to treatment in the continuous portion of the model is fixed at 0.5; the remaining elements are drawn from a normal distribution with mean zero and standard deviation 0.2.
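The data-generating steps for the session effects and attendance patterns can be sketched as follows. This is a minimal Python illustration under stated assumptions: the initial session effect is drawn from the stationary distribution of the VAR(1) process (the text does not specify the starting value), each row of the multiple membership matrix M places equal weight on the sessions a client attended, and all function names are hypothetical.

```python
import math
import random

def simulate_session_effects(n_sessions, b_diag, innov_sd, rng):
    """Bivariate VAR(1) with diagonal B and diagonal Sigma.

    phi_i ~ normal(B phi_{i-1}, Sigma); the first phi is drawn from the
    stationary distribution (an assumption of this sketch).
    """
    stationary_sd = innov_sd / math.sqrt(1.0 - b_diag ** 2)
    phi = [[rng.gauss(0.0, stationary_sd) for _ in range(2)]]
    for _ in range(1, n_sessions):
        prev = phi[-1]
        phi.append([b_diag * prev[k] + rng.gauss(0.0, innov_sd) for k in range(2)])
    return phi

def simulate_membership(n_clients, n_sessions, rng):
    """Each client attends 4 to 8 consecutive sessions, starting at session 1..43."""
    rows = []
    for _ in range(n_clients):
        n_attended = rng.randint(4, 8)
        start = rng.randint(1, 43)
        row = [0.0] * n_sessions
        for session in range(start, start + n_attended):
            row[session - 1] = 1.0 / n_attended   # average over attended sessions
        rows.append(row)
    return rows

def client_level_effects(membership, phi):
    """Compute M phi: the averaged session effect for each client and model part."""
    return [
        [sum(w * phi[s][k] for s, w in enumerate(row)) for k in range(2)]
        for row in membership
    ]

rng = random.Random(2024)
phi = simulate_session_effects(n_sessions=50, b_diag=0.75, innov_sd=0.5, rng=rng)
M = simulate_membership(n_clients=100, n_sessions=50, rng=rng)
m_phi = client_level_effects(M, phi)
```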
We independently simulate 250 datasets. Because there is no off-the-shelf software tool that we are aware of that can accommodate a semicontinuous outcome in the context of a dynamic group model, we compare the results of the model that correctly accounts for the rolling group structure versus a model that ignores the fact that some clients attended sessions together, but that is otherwise specified in the same manner. We expect that the model that assumes that client outcomes are independent of one another (conditional on covariates) will be anti-conservative, with too-small credible intervals since the correlated outcomes provide less information than an independent sample.
In our simulation study, we see that both methods result in relatively low bias in the sense that the average treatment effect estimates are close to their true values: for the probit portion of the model, the mean (median) across simulated datasets of the posterior means is 0.20 (0.22) for the model with the correctly-specified likelihood, and 0.17 (0.18) for the model that ignores the session effects. Recall that the true value is 0.2. Similarly, for the normal portion of the model, the mean (median) estimate across simulated datasets is 0.50 (0.52) for the correctly-specified likelihood and 0.52 (0.54) for the model that ignores session effects, where the true value is 0.5.
In large sample situations, Bayesian credible intervals often have good frequentist coverage properties (e.g., Efron, 2015). For our relatively small sample size, we do not necessarily expect, e.g., 95% of our 95% CIs to contain the true parameter value. For the probit treatment effect parameter, we find that 90.4% of our 95% credible intervals cover the true value of 0.2 for the correct model. For the model that ignores the session effects, only 74.0% cover the truth. For the treatment effect related to the continuous portion of the model, 94.8% of the CIs cover the true value of 0.50 for the correct model, whereas only 78.0% cover the truth when the session effects are ignored. This confirms our conjecture that ignoring the session effects would be anti-conservative.
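A coverage rate of this kind is simply the proportion of simulated datasets whose credible interval contains the true value. The short Python sketch below computes it, along with a binomial Monte Carlo standard error; the standard error is an addition for illustration and is not reported in the text. The intervals in the example are artificial.

```python
import math

def coverage_rate(intervals, true_value):
    """Proportion of (lower, upper) intervals that contain true_value."""
    hits = sum(1 for lower, upper in intervals if lower <= true_value <= upper)
    return hits / len(intervals)

def monte_carlo_se(rate, n_replications):
    """Binomial standard error of an estimated coverage rate."""
    return math.sqrt(rate * (1.0 - rate) / n_replications)

# Artificial example: 226 of 250 intervals contain the true value 0.2.
example_intervals = [(0.0, 0.4)] * 226 + [(0.3, 0.5)] * 24
rate = coverage_rate(example_intervals, true_value=0.2)
se = monte_carlo_se(rate, len(example_intervals))
```

With 250 replications, a rate of 226/250 corresponds to 90.4%, so differences of a percentage point or two between methods are within simulation noise, while the gap between roughly 90% and 74% is not.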
Stationarity
As discussed above, stationarity can be an important feature of time series. Non-stationary VAR models produce variances that increase as a function of time; especially when making predictions into the future, such variance inflation can yield unreasonable predictions. The VAR(1) model is stationary if the eigenvalues of B are less than one in magnitude. In a standard Metropolis Hastings regime for such a restricted model, any proposal that results in a non-stationary model would be rejected, with the Markov chain remaining at the state that preceded the proposed non-stationary point. However, baking such a restriction into a Stan model can be difficult, because support restrictions on parameters are typically limited to rectangular regions. As far as we are aware, there is no easy way to encode such complex support restrictions into Stan models ahead of time. However, we can modify the support of the prior using a simple post-processing step: we can simply discard all sampled θ(i) whose B matrix does not satisfy the eigenvalue condition (Prado & West, 2010, p. 47).
Although this may seem like an ad-hoc method, it is a legitimate and rigorous approach. Over its support, the posterior with the stationarity restriction is equal to the posterior without the support restriction (up to a constant of proportionality). Hence, this post-processing step is a legitimate means to sample from the constrained posterior and is formally justified by a technique known as accept-reject sampling (e.g., Chapter 2 of Robert & Casella, 2004). Therefore, we consider two methods for analyzing the data. The first is to sample from the posterior of the unconstrained (not necessarily stationary) model. The second method discards draws that include sampled B values that correspond to a non-stationary model.
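The post-processing step can be sketched in a few lines of Python. The sketch assumes a bivariate VAR(1), so that B is a 2 × 2 matrix whose eigenvalues follow from the quadratic formula, and it assumes each posterior draw is stored as a dictionary with a key `"B"`; both the storage format and the function names are hypothetical.

```python
import cmath

def spectral_radius_2x2(b):
    """Largest eigenvalue magnitude of a 2x2 matrix given as nested lists."""
    (b11, b12), (b21, b22) = b
    trace = b11 + b22
    det = b11 * b22 - b12 * b21
    disc = cmath.sqrt(trace * trace - 4.0 * det)   # complex sqrt handles complex pairs
    return max(abs((trace + disc) / 2.0), abs((trace - disc) / 2.0))

def is_stationary(b):
    """A VAR(1) process is stationary if all eigenvalues of B lie inside the unit circle."""
    return spectral_radius_2x2(b) < 1.0

def keep_stationary_draws(draws):
    """Discard draws whose B is non-stationary; also report the share discarded."""
    kept = [d for d in draws if is_stationary(d["B"])]
    frac_nonstationary = 1.0 - len(kept) / len(draws)
    return kept, frac_nonstationary

# Artificial draws for illustration only.
example_draws = [
    {"B": [[0.9, 0.0], [0.0, 0.9]]},    # stationary
    {"B": [[0.5, 0.4], [0.1, 0.5]]},    # stationary (eigenvalues 0.7 and 0.3)
    {"B": [[1.1, 0.0], [0.0, 0.2]]},    # non-stationary
    {"B": [[0.0, 1.2], [-1.2, 0.0]]},   # non-stationary (complex pair, magnitude 1.2)
]
kept, frac = keep_stationary_draws(example_draws)
```

For VAR models of higher order or dimension, the same check would apply to the eigenvalues of the companion matrix, which would require a general eigenvalue routine rather than the closed form used here.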
For this portion of the simulation, we retain all features of the simulations on ignoring the rolling group structure, except here B is a diagonal matrix with both diagonal entries set to 0.9. (For a diagonal matrix in this context, values greater than one would correspond to a non-stationary model.) When fitting the model, we do not assume the diagonal structure.
Results
Across 500 simulated datasets, the proportion of sampled non-stationary B matrices ranges from 0.01% to 95.3%, with a median of 9.2%. We find that, from the perspective of root mean squared error, enforcing stationarity improves parameter estimates of the regression coefficients of interest somewhat. For the treatment effect that impacts the probability of any use, the stationary model’s root mean squared error is 0.84 compared to 0.90 for the unconstrained model. For the continuous parameter, the difference is slightly larger and similarly favors the stationary model: the root mean squared errors are 0.85 versus 0.94. (Paired t-tests comparing the mean absolute errors produced by the two methods are highly significant for both treatment effect parameters.) The estimated coverage rates of the resulting 95% credible intervals are very similar: for the “amount of use” treatment effect, 77.8% for the unconstrained model versus 76.8% for the stationary model. For the “any use” treatment effect, the estimated coverage rates were identical: 76.2%. All of these numbers are substantially below the nominal 95% that would be expected of a properly-calibrated frequentist confidence interval. These simulations are performed in a relatively low-information setting (due to a modest sample size and high serial correlation in the session effects) where the assumed prior is influential on the posterior estimates. (See, e.g., Little (2006).)
In our simulation study, both the median 95% credible interval width and the empirical (Monte Carlo) standard errors are approximately 5% higher for the unconstrained model when focusing on the treatment effect related to amount of use (conditional on any use). For the probability of use treatment effect, the empirical standard error is approximately 8% higher, while the median credible intervals are still only around 5% wider.
A primary caveat to this analysis is that the true, data-generating model for this simulation was a stationary VAR model; it may be the case that a non-stationary VAR model better approximates the true data-generating process than a stationary VAR model can. For instance, one could imagine a situation where the true B is time-varying with eigenvalues increasing in t but still bounded in magnitude by one. In such a case, a non-stationary VAR model may better describe the data than a stationary model, over the limited time domain typical of group therapy studies.
A highly non-stationary example
We give extra scrutiny to the simulated datasets that yielded the most non-stationary samples. In the case of the most extreme of our 500 replications, only 4.7% of the samples correspond to a stationary VAR process. In this case, we see that the posterior means of the elements of Mφ0 (i.e., the estimated session effects, averaged over the sessions each individual attended) have a standard deviation across clients of 3.3, which is very large on the probit scale. Of the 100 treated individuals, 44 had their first session at or before the twentieth session; all of these individuals were found (in the simulation) to be drinking post-treatment. On the other hand, of the last 29 individuals to enroll, none were simulated to drink. Said another way, the time of enrollment appeared in this simulated dataset to be highly predictive of whether the client reported any alcohol use post-treatment. We believe that this temporal trend resulted in the non-stationary estimates.
In practical terms, there are several implications. First, if the quality of the group therapy session improves strongly over time, this should be controlled for explicitly if possible, or else the temporal trend may manifest itself in an estimated non-stationary session effect process which may in turn bias the treatment effect estimates. For example, if the facilitator starts with little experience but gains skills over the course of the study, this example suggests that it would be wise to control for the facilitator’s level of experience. In this way, having a high posterior probability of non-stationarity may point toward model misspecification (though, in this case, the likelihood was correctly specified). Second, with relatively modest sample sizes, data patterns that are consistent with non-stationary covariance may happen by chance. (Setting the diagonal elements of B to 0.9 encourages strong correlations in client outcomes, but the strength of the trend we see with this simulated dataset is rare.)
In this particular case, we find that performance is improved by enforcing stationarity in the prior: Without enforcing it for this dataset, the 95% credible interval for the treatment effect associated with no/any use does not cover the true value, while the stationary model does cover the truth. Moreover, the posterior mean of that parameter is 1.7 units closer to the true value for the stationary model than for the unconstrained (not necessarily stationary) model.
Implementation and recommendations
From an implementation perspective, if a large proportion of the sampled B matrices correspond to a non-stationary model, the MCMC for the stationary model will be less efficient than for the unrestricted model (Prado & West, 2010, p. 47). Accordingly, we recommend running the MCMC for more iterations to compensate for this loss of efficiency. For example, if 10,000 iterations are sufficient to achieve convergence when estimating the unconstrained model and 50% of the sampled B matrices correspond to a non-stationary model, we would, as a rule of thumb, recommend running the MCMC for 20,000 iterations when restricting to the stationary session effect model, though one should check convergence using only the accepted draws. In extreme cases, if the posterior distribution of the unconstrained model places nearly zero probability on the stationary model, it may be infeasible to estimate the stationary model in this manner, and it may be necessary to program a MH algorithm from scratch that incorporates the prior distribution’s support restrictions. However, we did not encounter such a situation in our simulation studies.
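The rule of thumb amounts to dividing the desired number of retained draws by the share of draws expected to be stationary. A small Python sketch of that arithmetic follows; the function name is hypothetical, and the calculation assumes the non-stationary share is roughly constant over the run.

```python
import math

def iterations_needed(target_retained, frac_nonstationary):
    """Total iterations so that about target_retained stationary draws remain."""
    if not 0.0 <= frac_nonstationary < 1.0:
        raise ValueError("frac_nonstationary must be in [0, 1)")
    return math.ceil(target_retained / (1.0 - frac_nonstationary))

# The example from the text: half of the draws are discarded.
total = iterations_needed(10_000, 0.5)   # 20000
```

As the non-stationary share approaches one, the required run length grows without bound, which is the infeasible case described above.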
In the end, we recommend at minimum calculating the percent of sampled session effect variance structures that are non-stationary. If it is more than a small percentage, we recommend comparing estimates that result from both versions of the prior (i.e., enforcing versus not enforcing stationarity). If the two models are substantially different in terms of the resulting estimates of the parameters of interest, the stationary model’s results may be more accurate. However, we believe that it is important that consumers of the research understand sensitivity to such modeling assumptions.
Data description
We now turn to our motivating study. Our data come from a randomized controlled study conducted in a publicly-funded outpatient substance abuse treatment program located in Los Angeles County comparing manualized integrated group cognitive behavioral therapy (CBT) for depression and substance use disorders delivered by substance abuse treatment (SAT) program counselors to usual care offered in the substance abuse treatment program (Hunter, Watkins, Hepner et al., 2012). The intervention consisted of 18 two-hour sessions, divided into three modules of six sessions each. The sequence of the three modules was offered repeatedly, so that new enrollees could enter the group on a semi-open (or semi-rolling) basis at the start of a module.
SAT clients were screened for depressive symptoms and the presence of a probable alcohol or drug disorder. Exclusion criteria and further details about the study design are available elsewhere (Hunter et al., 2012).
Clients who agreed to be randomly assigned were enrolled in the study and completed a 60–90 minute baseline interview (n = 73). Following the interview, the research staff assigned clients to one of the two study conditions: the intervention (group CBT) or control (non-group-based usual care). The outcomes we examine here were collected at three months post-baseline, corresponding to the time that clients in the intervention condition completed the intervention, with n = 64 (42 intervention, 22 comparison) completing the outcomes assessment. We employed an intent-to-treat approach, in which all persons who completed the baseline interview were eligible to be included in the analyses, even though one of the “treated” individuals did not attend any of the courses.
The primary study outcome measure examined here is the average number of alcoholic drinks each client consumes per day, over a 30-day period after treatment has concluded. Alcohol use was measured using the Timeline Followback Method (Sobell & Sobell, 1992) to assess both intensity and frequency of drinking in the past 30 days.
Data Analysis
In this section, we will fit models to the data and interpret the results. Along with the model (likelihood) developed above, we will specify prior distributions to complete the Bayesian model specification. We will compare model fit for VAR models of different orders. We will also assess the sensitivity of our results to modeling decisions such as whether non-stationary VAR processes are allowed in the prior as suggested by the simulation study above. Finally, we will summarize the model results in terms of marginal effects of treatment; the ready availability of such functions of model parameters is another strength of the Bayesian approach.
We use our models to describe the average number of alcoholic drinks each client consumes per day, over a 30-day period after treatment has concluded, as measured in the study. We control for baseline (pre-treatment) alcohol use, baseline substance frequency index, race (black, Hispanic, and White/other), gender and a dichotomous indicator of treatment assignment. Individuals in the usual care group in effect have a row of zeros in the matrix M, and therefore their outcomes are not predicted by any of the session effects.
Figure 3 presents the proportions of individuals in the treated and control conditions who, at the three-month follow-up, reported drinking in the past month. Figure 4 displays the log of the mean number of drinks per day among those who drank over that month. In these unadjusted graphical summaries, fewer of the treated individuals were drinking in the post period (Figure 3), but there is little visual evidence of differences in amount of consumption, given any (Figure 4).
Figure 3.
Histogram of individuals who report any drinking in the 30-day period post-treatment. The treated individuals were offered open enrollment group therapy and the controls were given usual treatment.
Figure 4.
Density histograms of drinks per day in a 30-day period post-treatment, among those who drank (logarithmic scale). The treated individuals (dashed curve) were offered open enrollment group therapy and the controls were given usual treatment.
To reiterate our model, we assume that there are two random effects for each session, with one impacting the probability of any use, and the other impacting the amount of use, conditional on any use. For each treated client, their outcome may be influenced by the average of the session effects that correspond to the sessions they attended. This averaging is encoded in the multiple membership matrix M. Because group dynamics may persist over time, we allow for various vector autoregressive structures in the session effects. For treated and untreated clients, baseline utilization, demographic characteristics and treatment may impact both the probability of any use (via probit regression parameters β0) and the log amount of use, conditional on some use (via regression parameters βc).
Prior distributions
In addition to the model defined above, our Bayesian analysis requires us to specify prior distributions. For each component of the Bj matrices, we use independent standard normal priors. For the covariance matrix Σ we specify an inverse-Wishart prior with three degrees of freedom centered at the identity matrix. For the standard deviation of the error variance in the outcome equation, we specify a flat prior over the interval [0, 10]. Finally, we also specify flat priors for the regression parameters β0 and βc. This is an example of an improper prior distribution because the prior density related to the regression parameters does not have a finite integral.
Model comparison
Although it is possible in the Bayesian framework to average over potential model specifications (Hoeting, Madigan, Raftery, & Volinsky, 1999), analysts typically focus on selecting a single model. A measure called the deviance information criterion (DIC; Spiegelhalter, Best, Carlin, & Van Der Linde, 2002) has been proposed. DIC – like many model selection criteria – consists of a term that measures goodness of fit, with fitted values being close to observed values preferred, and a term that penalizes the effective number of model parameters. Goodness of fit is assessed via the model deviance, defined as D (θ) = −2 log(p(y|θ)). The deviance evaluated at the parameter estimate θ̂ plus two times the number of model parameters is the definition of AIC; BIC is equal to D(θ̂) + k log(n), where k is the number of parameters and n is the sample size.
For DIC, goodness of fit is measured via the expected or average deviance, D̄, the posterior mean of D(θ). In a Bayesian framework, we evaluate the deviance at each sampled θ(j), and average D (θ) over these values. The second half of the measure aims to penalize model complexity. In the classical AIC and BIC measures, we simply count the number of parameters. However, in random effect models (and Bayesian models more generally), counting the number of parameters is not straightforward. For example, consider a model with random effects for each individual. If the count of parameters included each of the random effects, then the model would appear highly parameterized. If, however, the variance of the random effect distribution is small (in terms of explaining little of the observed variance), the model may behave very similarly to an analogous model with random effects set equal to zero which has one fewer parameter for each individual, under the naïve count of parameters. On the other hand, if many observations are available for each individual, the random effects may genuinely behave like separate parameters, making the naïve count reasonable.
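With these issues in mind, DIC penalizes the “effective number of parameters”, which is defined as pD = D̄ − D(θ̄), where D̄ is the deviance averaged over the posterior draws and θ̄ is the posterior mean of θ.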
Spiegelhalter et al. (2002) note that the effective number of parameters can be read as the difference between the posterior mean deviance, and the deviance of the posterior means. Under normality assumptions and vague prior information, pD approximately counts the number of parameters in the sense of AIC or BIC.
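To illustrate how these quantities are obtained from posterior draws, the following Python sketch computes the posterior mean deviance, the effective number of parameters, and DIC (their sum, following Spiegelhalter et al., 2002) for a toy model. The toy model, a normal mean with known unit variance and a flat prior, is an assumption chosen for simplicity and is unrelated to the models fit in this paper; the function names are hypothetical.

```python
import math
import random

def deviance(y, mu, sigma=1.0):
    """D(theta) = -2 log p(y | theta) for independent normal observations."""
    loglik = sum(
        -0.5 * math.log(2.0 * math.pi * sigma ** 2) - 0.5 * ((yi - mu) / sigma) ** 2
        for yi in y
    )
    return -2.0 * loglik

def dic_from_draws(y, mu_draws):
    """Posterior mean deviance, effective number of parameters, and DIC."""
    d_bar = sum(deviance(y, mu) for mu in mu_draws) / len(mu_draws)
    mu_bar = sum(mu_draws) / len(mu_draws)
    p_d = d_bar - deviance(y, mu_bar)       # mean deviance minus deviance at the mean
    return {"D_bar": d_bar, "p_D": p_d, "DIC": d_bar + p_d}

# Toy data and posterior draws: with a flat prior and known variance,
# the posterior for mu is normal(mean(y), 1 / n).
rng = random.Random(3)
y = [rng.gauss(1.0, 1.0) for _ in range(50)]
y_bar = sum(y) / len(y)
mu_draws = [rng.gauss(y_bar, 1.0 / math.sqrt(len(y))) for _ in range(4000)]
result = dic_from_draws(y, mu_draws)
```

In this one-parameter example the effective number of parameters comes out close to one, in line with the remark that pD approximately counts parameters under normality and vague prior information.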
Although DIC has been used widely, Spiegelhalter, Best, Carlin, and Van Der Linde (2014), in revisiting their original paper, highlight some shortcomings of the method, including the primary shortcoming that it can be sensitive to parameterization, even if the implied posterior distributions of the model parameters are the same under the two parameterizations. DIC can also perform poorly when attempting to decide among mixture models with differing numbers of components.
The Watanabe-Akaike (or “widely applicable”) information criterion (WAIC) has been proposed as a remedy for some of the shortcomings of DIC. WAIC approximates Bayesian cross-validated loss (Watanabe, 2010), and is invariant to the parameterization. Like DIC, WAIC is readily computed in environments such as Stan (Vehtari & Gelman, 2014). Spiegelhalter et al. (2014) write that WAIC may be “the most promising innovation” in this area, since their original development of DIC. Moreover, Gelman, Hwang, and Vehtari (2014) find that WAIC “is particularly helpful for models with hierarchical and mixture structures”.
WAIC approximates Bayesian leave-one-out cross-validation. One way to assess model fit would be to fit the model using all but one of the data points and evaluate the log likelihood of the left-out point. This process would then be repeated, leaving out each data point in turn, and averaging the log likelihood for the left out point. If the resulting average predictive log likelihood is high, then the model fits the data well in a predictive sense. Note that this approach balances model fit and penalizes complexity: if the model is not complex enough, the likelihood evaluated at the left out point will be low because the model is “too flat” over all; if the model is over-fitting the data, the predictive likelihood will be low on average because the likelihood will be too high near the observed data points, and not high enough near unobserved (or left-out) observations.
Vehtari and Gelman (2014) describe, and provide code for, WAIC calculations in Stan. In the context of our model, the WAIC calculation depends on the log likelihood of the probit portion of the model for all of the observations and includes the normal log likelihood only for clients who reported some alcohol use. Table 2 gives WAIC values that demonstrate that the VAR model of order 2 is slightly preferred to the order 0, 1, and 3 models (lower WAIC values are preferred).
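A sketch of the WAIC calculation for a two-part likelihood is given below in Python. It is not the Stan or R code of Vehtari and Gelman (2014); it is a minimal illustration that assumes the pointwise log likelihood has already been evaluated at every posterior draw, and it uses the variance-based form of the complexity penalty. The function names and the argument layout are hypothetical.

```python
import math

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def two_part_loglik(y, eta0, mu_c, sigma):
    """Pointwise log likelihood for one client at one posterior draw.

    y     : drinks per day (0 means no use)
    eta0  : probit linear predictor for any use
    mu_c  : mean of log(y) given any use
    sigma : standard deviation of log(y) given any use
    The probit term enters for everyone; the normal term (on the log
    scale) enters only for clients who report some use.
    """
    p_any = norm_cdf(eta0)
    if y == 0:
        return math.log(1.0 - p_any)
    z = (math.log(y) - mu_c) / sigma
    return math.log(p_any) - math.log(sigma) - 0.5 * math.log(2.0 * math.pi) - 0.5 * z * z

def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def waic(loglik):
    """loglik[s][i] is the log likelihood of observation i at posterior draw s."""
    n_draws, n_obs = len(loglik), len(loglik[0])
    lppd, p_waic = 0.0, 0.0
    for i in range(n_obs):
        column = [loglik[s][i] for s in range(n_draws)]
        lppd += log_mean_exp(column)
        mean = sum(column) / n_draws
        p_waic += sum((v - mean) ** 2 for v in column) / (n_draws - 1)
    return {"lppd": lppd, "p_waic": p_waic, "waic": -2.0 * (lppd - p_waic)}
```

The differences among the WAIC values in Table 2 are small, so a calculation like this is best read alongside the sensitivity of the treatment effect estimates across model orders.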
Table 2.
Parameter estimates and 95% credible intervals (CIs) for treatment effect parameters, along with estimated WAIC values for each model
| Model order, p | Treatment estimate (95% CI) on any use | Treatment estimate (95% CI) on amount, given any use | Estimated WAIC |
|---|---|---|---|
| 0 | −1.02 (−1.91, −0.15) | −0.39 (−2.27, 1.40) | 115.3 |
| 1 | −1.01 (−1.98, −0.17) | −0.47 (−2.24, 1.33) | 114.9 |
| 2 | −0.98 (−1.74, −0.14) | −0.42 (−2.16, 1.34) | 113.1 |
| 3 | −1.00 (−1.85, −0.20) | −0.43 (−2.48, 1.49) | 113.8 |
Parameter estimates
The parameter estimates for our preferred model, which uses the order 2 VAR model of session effects, are reported in Table 3. These results align with the visual representations in Figures 3 and 4: We estimate that the group therapy program has a significant, favorable impact on the probability of drinking at all, but we are not able to detect an impact on the amount of drinking among those who do drink. We see that the treatment effect and its 95% CI are negative for the part of the model that describes the probability of any use (i.e., the treated have a lower estimated probability of drinking at all). For the part of the model that describes the amount of use, conditional on any, the treatment effect is negative, but the evidence is weaker given that the 95% CI includes zero.
Table 3.
Parameter estimates and 95% credible intervals for our preferred VAR(2) model for control variables and the error variance
| Covariate | Any use: estimate (95% credible interval) | Amount: estimate (95% credible interval) |
|---|---|---|
| Intercept | −0.53 (−1.56, 0.35) | −1.29 (−3.69, 1.04) |
| Male | −0.72 (−1.51, 0.08) | −0.26 (−1.90, 1.40) |
| Black | −0.17 (−1.13, 0.78) | −1.25 (−2.99, 0.49) |
| Hispanic | 0.14 (−0.86, 1.07) | −0.70 (−2.40, 1.09) |
| Baseline: SFI | 0.46 (0.09, 0.85) | 0.12 (−0.79, 0.96) |
| Baseline: Any drinking | 0.54 (−0.25, 1.43) | 0.52 (−1.14, 2.26) |
| Error variance | -- | 1.76 (1.27, 2.48) |
For comparison, we also fit the model with no session effects included. The credible interval for the treatment effect related to the probability of any use is approximately 11% narrower than that of our preferred model (95% credible interval: (−1.79, −0.19)), which we qualitatively anticipated given our simulation study. The treatment effect for the amount of use, however, is similar to those of the models that include session effects (95% credible interval: (−2.30, 1.44)).
Marginal effects
With non-linear models, marginal effects can be a useful summary of the model. The marginal effect is defined as the derivative or discrete difference with respect to one of the covariates in the mean of the outcome, holding all other covariates constant. For linear models, the marginal effect does not depend on the value of other covariates, but non-linear models typically have non-constant marginal effects. Hence, it is common to be interested in an average marginal effect, which averages the marginal effects across the observed covariate distribution. Particularly for models such as zero-inflated count models and semicontinuous models, it can be difficult to interpret model parameters directly in terms of an intervention’s impact on a measure that spans both parts of the model, such as the mean number of drinks across all individuals (and not just the individuals who drink at least some) (e.g., Long, Preisser & Herring, 2014).
In our observed data, we found that the treatment was associated with a lower probability of any drinking, but had little association with the amount of alcohol consumed by those who did drink. But what if we are interested in summarizing the model in terms such as the impact to the average number of drinks, including non-drinkers? Further, the continuous portion of the model is modeled in the log scale. For some audiences, it may be desirable to report a model summary in the original scale.
For non-Bayesian models, calculating confidence intervals for average marginal effects can be challenging. Confidence intervals and hypothesis tests can be derived using delta method techniques, but the algebra can be extensive. Also, such methods are typically based on normal approximations. In the Bayesian framework, however, it is quite simple to calculate marginal effects. Using the sampled model parameters θ(1), …, θ(M), we can calculate any type of average summary that we want. For example, to obtain the marginal effect of the group-based intervention using the predictive margins approach (Graubard & Korn, 1999), for each posterior sample we predict outcomes for everyone in the data as if they all were in the intervention condition and then predict them assuming everyone was in the control condition and then compare the predictions to get the marginal effect. The uncertainty in the multiple draws of the parameters θ(1), …, θ(M) is reflected in the marginal effect variance estimate.
For our data structure, the random effects only apply to individuals who are exposed to the group therapy treatment. Hence, to make predictions under the treated condition, one has to decide how to treat the session effects. The simplest approach is to ignore the session effects altogether, so that the marginal effect can be interpreted as an estimate for individuals attending “typical” sessions. For our effect summary, we assume that treated individuals each attend the mean number of sessions observed in our data: for each client in the dataset, we simulate outcomes assuming that they attend 12 contiguous sessions, with a random starting session. To calculate a mean for a hypothetical population, we produce 1000 such predictions per person. We repeat this process for each Monte Carlo draw of β0, βc, σ and φ.
Finally, to accommodate the log transformation for the continuous portion of the model, we simulate drinks per day on the log scale and exponentiate the realizations of the simulated drinks per day, rather than recording only the means on the log scale. This achieves the same effect that Duan’s smearing estimator does for back-transformations (Duan, 1983). Incorporating all of these pieces, for our preferred model, we estimate an average change of −1.1 drinks per day, that is, a reduction of 1.1 drinks per day (95% credible interval for the change: (−7.4, 1.5)).
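The predictive margins calculation described above can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions, not the code used for the reported estimates: it ignores the session effects (the “typical session” simplification), and the function name `predictive_margin_draws`, the argument names, and the synthetic “posterior draws” are all hypothetical. For each draw, it simulates drinks per day for every client under the treated and the control condition, exponentiating the simulated log amounts rather than their means.

```python
import numpy as np

def predictive_margin_draws(X, treat_col, beta0_draws, betac_draws, sigma_draws,
                            n_rep=200, seed=0):
    """Posterior draws of the average marginal effect of treatment on drinks per day.

    Assumption: session effects are ignored, so the result describes clients
    attending "typical" sessions.
    """
    rng = np.random.default_rng(seed)
    X_treated = X.copy()
    X_treated[:, treat_col] = 1.0   # everyone assigned to the intervention
    X_control = X.copy()
    X_control[:, treat_col] = 0.0   # everyone assigned to control
    n = X.shape[0]
    effects = np.empty(len(sigma_draws))
    for m, (b0, bc, s) in enumerate(zip(beta0_draws, betac_draws, sigma_draws)):
        arm_means = []
        for X_arm in (X_treated, X_control):
            # Probit part: a client drinks when the latent normal variable is positive.
            latent = X_arm @ b0 + rng.standard_normal((n_rep, n))
            # Continuous part: simulate on the log scale, then exponentiate each realization.
            log_amount = X_arm @ bc + s * rng.standard_normal((n_rep, n))
            drinks = np.where(latent > 0, np.exp(log_amount), 0.0)
            arm_means.append(drinks.mean())
        effects[m] = arm_means[0] - arm_means[1]
    return effects

# Synthetic stand-ins for posterior draws (made up for illustration only).
rng = np.random.default_rng(1)
n_clients, n_draws = 50, 20
X = np.column_stack([np.ones(n_clients), rng.integers(0, 2, n_clients)]).astype(float)
beta0_draws = np.array([0.0, -1.0]) + 0.05 * rng.standard_normal((n_draws, 2))
betac_draws = np.array([1.0, 0.0]) + 0.05 * rng.standard_normal((n_draws, 2))
sigma_draws = np.full(n_draws, 0.5)

effects = predictive_margin_draws(X, 1, beta0_draws, betac_draws, sigma_draws)
point_estimate = effects.mean()
interval = np.percentile(effects, [2.5, 97.5])  # 95% credible interval
```

Because every posterior draw yields its own marginal effect, the spread of `effects` carries the parameter uncertainty directly, with no delta method algebra.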
Stationarity
As discussed in the simulations, we recommend assessing whether the fitted models are sensitive to whether the prior supports only stationary models or also admits non-stationary models. In the case of the VAR(2) model, the support restriction is defined so that B1 and B2 result in all eigenvalues of the following matrix lying inside the unit circle:
$$\begin{pmatrix} B_1 & B_2 \\ I_2 & 0_2 \end{pmatrix}$$
(e.g., equation 9.3 of Prado & West, 2010). In this matrix, I2 is the 2×2 identity matrix and 02 is the 2×2 matrix of zeros. (Eigenvalues of this matrix may include an imaginary component, so by “inside the unit circle” we mean that the sum of the squares of the real and imaginary parts of each eigenvalue is less than 1.)
Nearly all (99.9%) of the sampled parameters for the VAR(1) model are consistent with a stationary model. Hence restricting the prior to the stationary model makes no appreciable difference. The VAR(2) model places slightly more posterior mass on non-stationary models, with 98.3% of sampled parameters yielding stationarity. Even so, there is no appreciable difference in the point or interval estimates associated with the treatment effects when changing to the prior that demands stationarity.
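The share of posterior draws that are consistent with stationarity can be computed directly from the sampled B1 and B2 matrices. The following is a minimal sketch, assuming the draws are available as NumPy arrays; the function names and the two example parameter settings are hypothetical and are not values from the fitted model.

```python
import numpy as np

def var2_is_stationary(B1, B2):
    """True when all eigenvalues of the VAR(2) companion matrix have modulus below 1."""
    k = B1.shape[0]
    companion = np.block([[B1, B2],
                          [np.eye(k), np.zeros((k, k))]])
    moduli = np.abs(np.linalg.eigvals(companion))  # handles complex eigenvalues
    return bool(moduli.max() < 1.0)

def share_stationary(B1_draws, B2_draws):
    """Fraction of posterior draws whose (B1, B2) pair implies a stationary process."""
    flags = [var2_is_stationary(B1, B2) for B1, B2 in zip(B1_draws, B2_draws)]
    return float(np.mean(flags))

# Two illustrative parameter settings (not taken from the fitted model).
stable = var2_is_stationary(0.5 * np.eye(2), 0.2 * np.eye(2))
explosive = var2_is_stationary(0.9 * np.eye(2), 0.3 * np.eye(2))
share = share_stationary([0.5 * np.eye(2), 0.9 * np.eye(2)],
                         [0.2 * np.eye(2), 0.3 * np.eye(2)])
```

Taking the modulus of each eigenvalue is equivalent to the sum-of-squares criterion stated above, since the modulus is less than 1 exactly when its square is.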
Sensitivity analyses
In addition to selecting a model that has an appropriate autoregressive order for the session effects, we can evaluate the sensitivity of our results to different assumptions about the prior distributions. This can be useful because 1) it is good to know if claimed results are particularly sensitive to the specification of the prior distributions and 2) consumers of research may feel that priors other than those used by the producer of the research are appropriate. With this in mind, we specify three alternative priors for the regression parameters in our models. In our main model presented above, we used “flat” improper priors, p(β) ∝ 1. As a first alternative, we specified independent standard normal priors for the regression coefficients. As the remaining two alternatives, we specified the prior mean of the intercept for the probit model to be −1 or 1, which adjusts the prior expected number of individuals who consume at least some alcohol after treatment. We found that the treatment effect estimates are insensitive to these alternative prior distributions.
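To see how the prior mean of the probit intercept shifts the prior expected share of drinkers, a small calculation can help. The sketch below rests on assumptions that are not stated in the text: the intercept prior is taken to be normal with unit standard deviation, and all other covariates and session effects are set to zero. Under those assumptions, if b ~ normal(m, s²), then E[Φ(b)] = Φ(m / √(1 + s²)). The function names are hypothetical.

```python
from math import erf, sqrt

def std_normal_cdf(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def prior_expected_prob(prior_mean, prior_sd=1.0):
    """E[Phi(b)] for b ~ normal(prior_mean, prior_sd^2).

    Uses the identity E[Phi(b)] = Phi(prior_mean / sqrt(1 + prior_sd^2)).
    """
    return std_normal_cdf(prior_mean / sqrt(1.0 + prior_sd ** 2))

# Prior expected probability of any drinking for three intercept prior means.
prior_probs = {m: prior_expected_prob(m) for m in (-1.0, 0.0, 1.0)}
```

Under these assumptions, prior means of −1 and 1 correspond to prior expected probabilities of any drinking of roughly 0.24 and 0.76, compared with 0.5 for a prior centered at zero.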
Discussion and conclusion
In order to appropriately analyze the data considered in this paper, we had to contend with a number of features. First, given the substantial number of clients with no alcohol use at follow-up, we needed to account for discrete mass at zero. Second, we expected that individuals who attended many of the same group therapy sessions would be more likely to have similar outcomes than individuals who attended separate sessions. Finally, the literature suggests that group therapy dynamics can persist over time; we wanted to consider autoregressive structures of differing orders to accommodate this phenomenon. Accordingly, we modeled our semicontinuous outcome with a vector-valued, latent time series model to describe the session effects. These session effects mapped to client outcomes via a multiple membership construction.
These characteristics make Bayesian analysis ideally suited for this analysis. The highly-structured random effect distributions would be difficult to handle in a likelihood-based analysis, where approximating the likelihood can be challenging. On the other hand, as shown in the Appendix, this model can be fit with a modest coding effort in Stan. And, even if modern computation tools such as Stan were not available, it would still not be too much effort to estimate the model “from scratch” in a Bayesian framework, though, as demonstrated with simulated data, such implementations tend to be less efficient computationally when using standard MCMC methods.
We believe that this analysis demonstrates the power of the “modularity” of Bayesian modeling and computation. Each of the components of our model (e.g., semicontinuous outcomes, vector autoregressive processes, and multiple membership models) can be handled in some context in a number of off-the-shelf software packages. However, when all are combined as pieces of a single model, we are not aware of any non-Bayesian software that would facilitate the estimation of such a model. In general, we believe that a Bayesian approach often allows scientists to specify models that better reflect important features of their data than is possible with other approaches.
Relatedly, although we focus on a scalar outcome, the methods in this paper could easily be expanded for multivariate outcomes where there are multiple observations of the same outcome variable over time for each client, or there are semicontinuous observations for different types of alcohol and other drugs. In such a case, we would wish to include participant-level random effects into our model. This would be straightforward in a Bayesian context, whereas many likelihood-based software packages restrict users to a single type of random effect. For example, in our primary model we would have and wi,0 ~ normal(xi,0β0 + Miφ0 + γi,0, 1) where γi = (γi,0, γi,c)⊤ is a bivariate random effect for individuals. Specifying a prior for this new random effect would complete the model specification. In our experience, adding a level to models like this does not entail a large computational cost. The programming change would also be minimal: lines of code would need to be added to declare the new parameters and to specify their prior distribution, but the structure of the program would not need to be substantially altered.
Similarly, if outcomes are measured after each group therapy session, one can drop the multiple membership formulation, having only the session effect immediately preceding the outcome measurement enter the outcomes models. Then we might have: and where and are random intercepts, and and are random slopes. (See Paddock et al. (2011) for a model of this sort for a continuous outcome.) This type of model would also require only a modest coding change from the code used in this paper, removing the multiple membership matrix and adding the four new random effects (and their prior distributions).
In this paper, we chose to select the order of the VAR model according to WAIC. An alternative approach would be to employ model averaging, where the Markov chain used to estimate the model can visit multiple models. For example, we might assume a priori that there are equal chances that the true session effect data generating process is VAR(0), VAR(1), VAR(2), or VAR(3). In a model averaging analysis, the Markov chain can transition from models with one order to models with another. However, model averaging can be computationally difficult for many models (Green, 1995) and tools like Stan and WinBUGS are generally not suited for these calculations.
Although we have focused on semicontinuous models that mix normally-distributed outcomes with discrete mass at zero, the methods we have developed in this paper would apply equally well to a number of other models. For example, our framework could easily be extended to zero-inflated count models by replacing the continuous portion of the model with a Poisson or negative binomial likelihood. Similarly, multivariate normal outcomes can be accommodated by replacing the probit link with the identity. This is another way in which Bayesian methods are modular: One portion of the model can be modified, leaving the rest of the model intact from the perspectives of coding and computation.
Though we have focused on rolling admissions in the context of group therapy, the model presented here has importance in other applied areas. Notably, whereas traditional educational settings have forced students to join classes at well-defined starting points such as starts of semesters or quarters, enrollment in online courses (a rapidly growing portion of today’s higher education system) is not always so orderly. Since students may be able to attend classes that are spread out across longer, less regular time periods, analytic methods such as those developed here may become increasingly important for understanding educational interventions. Another scenario that can be similarly conceptualized is examining outcomes among people participating in a common on-line forum, discussion board or on-line expert panel, given the dynamic nature of the panel of participants and participation over time (Dalal, Khodyakov, Srinivasan, Strauss, & Adams, 2011).
A common difficulty in disseminating results of Bayesian models is that one analyst’s prior may not be the same as a reader’s prior. Just as examining the robustness of results to model specifications (i.e., the likelihood) is a good practice, the same can be done with prior specifications. If the results are highly sensitive to the prior specification, consumers of the research should understand this.
Bayesian methods for most models that are of interest to applied researchers require heavy computational loads to estimate. For this reason, Bayesian methods were not useful for many classes of models before our current age of fast and cheap computers. In tandem, as hardware has become so much more powerful, the theory and practice of Bayesian computation have improved rapidly over the last several decades (e.g., Gelfand & Smith, 1990; Green, 1995; Neal, 2003; Girolami & Calderhead, 2011). Now that Bayesian methods are feasible for a wide range of models, these methods have diffused into the mainstream of applied fields at different rates. Psychology has perhaps been slower to adopt these methods than some other fields, though the change seems to be in full swing at this point: Andrews and Baguley (2013) found that the number of manuscripts that use Bayesian methods in five prominent psychology journals increased by more than a factor of 4 between the 1990s and 2000s.
Specific examples of Bayesian models in psychology include the use of Bayesian structural equation models (SEMs; Lee 2007; Song & Lee, 2012). Song and Lee (2012) cite a number of situations where a Bayesian approach is more straightforward than classical methods including: when the observed variables are a mixture of continuous and discrete, for multilevel or hierarchical SEMs, when there are missing values in the data, and for longitudinal SEMs. Bayesian non-parametric models are also starting to enter the psychological literature. A primary application of such methods is infinite mixture models, where the number of mixture components is learned from the data (e.g., Gershman & Blei, 2012). This approach properly accounts for uncertainty in the number of mixture components for models such as growth mixture models.
While tools like Stan make these sophisticated models and computational methods readily available to empirical scientists, there are some important limitations. First, specifically for Stan, models with discrete (e.g., integer-valued) parameters can be difficult or impossible to handle, making it ill-suited for some types of mixture models. More broadly, even with sophisticated software implementations, Bayesian methods can be too slow to run in many “big data” applications. Methodologies such as variational Bayesian inference seek to reduce the computational burden of estimating such models (e.g., Blei & Jordan, 2006), but unsolved challenges remain, especially with respect to making these methods accessible to a wide audience of applied researchers.
In conclusion, as the theory, computational tools, software and methodology surrounding Bayesian methods continue to improve, we predict that Bayesian inference will gain even more popularity in psychology and other applied, data-driven fields. Bayesian methods are becoming difficult to match in terms of their flexibility to describe complex data structures, while new computational tools reduce the time and effort required to move from a conceptual model to a data-supported conclusion. Because more realistic models can improve our understanding of empirical data, we believe that these currents will improve fields that harness these innovations.
Acknowledgments
This research was supported by Grant Number R01AA019663 from the National Institute On Alcohol Abuse and Alcoholism. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute On Alcohol Abuse And Alcoholism or the National Institutes of Health.
Appendix
Stan Code
To be concrete, we focus on our preferred model, with VAR(2) random effects. The other models we consider are minor variations with respect to coding. The Stan program begins with a declaration of the data elements that will be provided to the estimation algorithm. These are 1) the outcome measure, 2) the multiple membership matrix M that contains row-normalized indicators of session attendance, with one row for each individual, and 3) a matrix of covariates X that includes the control variables. In the case of the outcome measure, we separate it into an indicator of Any Use/No Use, denoted Y0 (anyDrink in the code), and the amount of use, conditional on any use, called Yc (amt in the code). In our coding of the model, Yc does not include elements for individuals who have zero alcohol use, though we also pass Stan a vector of index variables called drinkInd that records the indices of the non-zero elements of Y0. Finally, we also pass Stan a 2×2 matrix called priorCenter that is a hyperparameter for the prior distributions that we specify for the variance/covariance matrices Σ. As before, we must explicitly provide Stan the dimensions of vectors and matrices.
The data block of the Stan model is given here:
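data{
  int<lower=0> nSess;       // number of therapy sessions
  int<lower=0> nTotal;      // number of clients
  int<lower=0> nDrinkers;   // number of clients with any alcohol use
  int<lower=0> betLen;      // number of columns of X
  int<lower=0,upper=1> anyDrink[nTotal];           // Y_0: any use (1) or no use (0)
  int<lower=1,upper=nTotal> drinkInd[nDrinkers];   // indices of the non-zero elements of Y_0
  vector[nDrinkers] amt;    // Y_c: amount of use, drinkers only
  matrix[nTotal, betLen] X; // covariates
  matrix[nTotal, nSess] M;  // multiple membership matrix
  matrix[2,2] priorCenter;  // hyperparameter for the prior on Sigma
}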
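As a rough illustration of how these inputs might be assembled before they are handed to Stan, the sketch below builds a row-normalized multiple membership matrix and the drinker index from raw attendance and outcome data. It is a sketch under assumptions: the function names and the toy data are hypothetical, clients who attend no sessions (for example, control clients) are given rows of zeros, and the amount of use is passed on the log scale because the continuous part of the model is described as being on that scale.

```python
import numpy as np

def build_membership_matrix(attendance):
    """Row-normalize a clients-by-sessions 0/1 attendance matrix.

    Rows with no attended sessions are left as all zeros.
    """
    attendance = np.asarray(attendance, dtype=float)
    counts = attendance.sum(axis=1, keepdims=True)
    return np.divide(attendance, counts,
                     out=np.zeros_like(attendance), where=counts > 0)

def split_outcome(drinks_per_day):
    """Split a semicontinuous outcome into its two parts plus a 1-based index."""
    drinks = np.asarray(drinks_per_day, dtype=float)
    any_drink = (drinks > 0).astype(int)        # indicator of any use
    drink_index = np.flatnonzero(drinks > 0) + 1  # Stan indexes from 1
    log_amount = np.log(drinks[drinks > 0])     # log amount of use, drinkers only
    return any_drink, drink_index, log_amount

# Toy data: three clients, four sessions; the third client attended no sessions.
attendance = [[1, 1, 0, 0],
              [0, 1, 1, 1],
              [0, 0, 0, 0]]
drinks_per_day = [0.0, 2.5, 1.0]

M = build_membership_matrix(attendance)
any_drink, drink_index, log_amount = split_outcome(drinks_per_day)
```

Each non-zero row of `M` sums to one, so multiplying `M` by a vector of session effects gives every treated client the average effect of the sessions that client attended.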
Next, we specify the model parameters. The regression parameters β0 and βc describe the probit model of any use and the normal model of the amount of use, conditional on use, respectively. The 2×2 matrices B1 and B2 are used to describe the VAR(2) serial correlations.
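parameters{
  vector[betLen] betaP;              // coefficients, probit model of any use
  vector[betLen] betaC;              // coefficients, model of amount of use
  matrix[2,2] B_1;                   // VAR coefficients, lag 1
  matrix[2,2] B_2;                   // VAR coefficients, lag 2
  real<lower=0, upper = 10> sigma;   // error standard deviation, amount of use
  vector[2] transSess[nSess];        // bivariate session effects
  cov_matrix[2] Sigma;               // innovation covariance of the session effects
}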
Next, we use a type of code block called “transformed parameters”. In this block, we can declare and define deterministic functions of the parameters and data elements. (By deterministic, we mean that there are no sampling statements, denoted by “~”, only the assignment operator “<-”.)
transformed parameters{
  vector[nTotal] clientLevEffP;
  vector[nTotal] clientLevEffC;
  vector[nTotal] probMeans;
  vector[nSess] sessEffP;
  vector[nSess] sessEffC;
  vector[nTotal] XBetaC;
  // split the bivariate session effects into their two components
  for(i in 1:nSess){
    sessEffP[i] <- transSess[i,1];
    sessEffC[i] <- transSess[i,2];
  }
  // map session effects to clients through the multiple membership matrix
  clientLevEffP <- M * sessEffP;
  clientLevEffC <- M * sessEffC;
  probMeans <- X * betaP + clientLevEffP;
  XBetaC <- X * betaC;
}
Some of these transformed parameters are defined for clarity of coding, and others are defined for more functional reasons.
Next, we define the model itself:
model{
  // priors
  Sigma ~ inv_wishart(3, priorCenter);
  B_1[1,1] ~ normal(0,1); B_1[1,2] ~ normal(0,1);
  B_1[2,1] ~ normal(0,1); B_1[2,2] ~ normal(0,1);
  B_2[1,1] ~ normal(0,1); B_2[1,2] ~ normal(0,1);
  B_2[2,1] ~ normal(0,1); B_2[2,2] ~ normal(0,1);
  // VAR(2) session effects; the first two sessions have fewer than two lags
  transSess[1] ~ multi_normal(rep_vector(0,2), Sigma);
  transSess[2] ~ multi_normal(B_1 * transSess[1], Sigma);
  for(i in 3:nSess){
    transSess[i] ~ multi_normal(B_1 * transSess[i-1] + B_2 * transSess[i-2], Sigma);
  }
  // probit model of any use
  for(i in 1:nTotal)
    anyDrink[i] ~ bernoulli(Phi(probMeans[i]));
  // normal model of the amount of use, drinkers only
  for(i in 1:nDrinkers)
    amt[i] ~ normal(X[drinkInd[i]] * betaC + clientLevEffC[drinkInd[i]], sigma);
}
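To make the generative structure of the session effects concrete, the following Python sketch simulates forward from a VAR(2) process with the same start-up treatment as the model block (no lags for the first session, one lag for the second) and then maps the session effects to clients through a multiple membership matrix. The function name, parameter values, and toy matrix are hypothetical and chosen only for illustration.

```python
import numpy as np

def simulate_var2_sessions(B1, B2, Sigma, n_sess, rng):
    """Simulate bivariate session effects from a VAR(2) process."""
    chol = np.linalg.cholesky(Sigma)
    sess = np.zeros((n_sess, 2))
    for t in range(n_sess):
        mean = np.zeros(2)
        if t >= 1:
            mean = mean + B1 @ sess[t - 1]   # lag-1 contribution
        if t >= 2:
            mean = mean + B2 @ sess[t - 2]   # lag-2 contribution
        sess[t] = mean + chol @ rng.standard_normal(2)
    return sess

# Illustrative parameter values (not estimates from the paper).
B1 = np.array([[0.5, 0.0], [0.1, 0.4]])
B2 = np.array([[0.2, 0.0], [0.0, 0.1]])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])

sess_effects = simulate_var2_sessions(B1, B2, Sigma, 6, np.random.default_rng(42))
sess_repeat = simulate_var2_sessions(B1, B2, Sigma, 6, np.random.default_rng(42))

# Toy multiple membership matrix: two treated clients and one control client.
M_toy = np.array([[0.5, 0.5, 0.0, 0.0, 0.0, 0.0],
                  [0.0, 0.0, 0.25, 0.25, 0.25, 0.25],
                  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
client_eff_p = M_toy @ sess_effects[:, 0]  # enters the probit model of any use
client_eff_c = M_toy @ sess_effects[:, 1]  # enters the model of amount of use
```

The first column of `sess_effects` plays the role of the session effect on the probability of any use and the second the effect on the amount of use; a client with an all-zero row of the membership matrix receives no session effect.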
WAIC in Stan
Stan operates directly on the posterior density of the model parameters, and does not distinguish between features of the posterior that come from the likelihood versus the prior distribution. For WAIC calculations, we need access to the log likelihood values themselves. In order to achieve that, we add another section to the Stan code (before fitting the model), called “generated quantities”:
generated quantities{
  vector[nDrinkers] log_lik_amt;
  vector[nTotal] log_lik_any;
  for(n in 1:nDrinkers){
    log_lik_amt[n] <- normal_log(amt[n], XBetaC[drinkInd[n]] + clientLevEffC[drinkInd[n]], sigma);
  }
  for(i in 1:nTotal){
    // log probability of the observed indicator, not the probability of any use
    log_lik_any[i] <- bernoulli_log(anyDrink[i], Phi(probMeans[i]));
  }
}
This produces the elements needed to calculate WAIC as described by Vehtari and Gelman (2014).
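Once the pointwise log likelihood values have been extracted from the fitted model, WAIC can be computed outside of Stan. The Python sketch below follows the variance-based form of the effective number of parameters, reported on the deviance scale. It is an illustration rather than the code used for the paper: the function names are hypothetical, and summing the two log likelihood parts within each client, so that clients are the pointwise units, is an assumption on our part.

```python
import numpy as np

def log_mean_exp(a, axis=0):
    """Numerically stable log of the mean of exp(a) along an axis."""
    a_max = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(a_max, axis=axis) + np.log(np.mean(np.exp(a - a_max), axis=axis))

def combine_two_part(log_lik_any, log_lik_amt, drink_ind):
    """Add the amount-of-use contributions to the drinkers' columns.

    drink_ind holds 1-based client indices, as passed to Stan.
    """
    total = np.array(log_lik_any, dtype=float, copy=True)
    total[:, np.asarray(drink_ind) - 1] += log_lik_amt
    return total

def waic(log_lik):
    """WAIC from a draws-by-observations matrix of pointwise log likelihoods."""
    lppd_i = log_mean_exp(log_lik, axis=0)         # log pointwise predictive density
    p_waic_i = np.var(log_lik, axis=0, ddof=1)     # effective number of parameters
    elpd = np.sum(lppd_i - p_waic_i)
    return {"lppd": float(np.sum(lppd_i)),
            "p_waic": float(np.sum(p_waic_i)),
            "waic": float(-2.0 * elpd)}

# Toy check: log likelihoods that do not vary across draws give p_waic = 0.
constant = np.full((4, 3), -1.0)
result = waic(constant)

combined = combine_two_part(np.zeros((2, 3)), np.full((2, 2), -1.0), [2, 3])
```

Lower values of `waic` indicate better expected out-of-sample fit, which is the sense in which the autoregressive order was selected.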
Footnotes
Some of the research contained in this paper was presented at the Meeting of the Eastern North American Region of the International Biometric Society in March of 2015 in Miami, FL. This research has not been published previously.
Contributor Information
Lane F. Burgette, Economics, Sociology and Statistics, RAND Corporation, Arlington, Virginia
Susan M. Paddock, Economics, Sociology and Statistics, RAND Corporation, Santa Monica, California
References
- Albert JH, Chib S. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association. 1993;88(422):669–679.
- Andrews M, Baguley T. Prior approval: The growth of Bayesian methods in psychology. British Journal of Mathematical and Statistical Psychology. 2013;66(1):1–7. doi: 10.1111/bmsp.12004.
- Bates D, Maechler M, Bolker B, Walker S. Fitting linear mixed-effects models using lme4. arXiv preprint. 2014 https://arxiv.org/abs/1406.5823.
- Banerjee S, Carlin BP, Gelfand AE. Hierarchical Modeling and Analysis for Spatial Data. Boca Raton, FL: CRC Press; 2014.
- Bauer DJ, Gottfredson NC, Dean D, Zucker RA. Analyzing repeated measures data on individuals nested within groups: Accounting for dynamic group effects. Psychological Methods. 2013;18(1):1–14. doi: 10.1037/a0030639.
- Beck A, Steer R, Brown G. BDI-II, Beck Depression Inventory: Manual. 2nd. Boston: Harcourt Brace; 1996.
- Blei DM, Jordan MI. Variational inference for Dirichlet process mixtures. Bayesian Analysis. 2006;1(1):121–143.
- Cafri G, Hedeker D, Aarons GA. An Introduction and Integration of Cross-Classified, Multiple Membership, and Dynamic Group Random-Effects Models. Psychological Methods. 2015 doi: 10.1037/met0000043. In press.
- Capanu M, Gönen M, Begg CB. An assessment of estimation methods for generalized linear mixed models with binary outcomes. Statistics in Medicine. 2013;32(26):4550–4566. doi: 10.1002/sim.5866.
- Cocco K, Carey K. Psychometric properties of the Drug Abuse Screening Test in psychiatric outpatients. Psychological Assessment. 1998;10(4):408–414.
- Crits-Christoph P, Johnson JE, Connolly Gibbons MB, Gallop R. Process predictors of the outcome of group drug counseling. Journal of Consulting and Clinical Psychology. 2013;81(1):23. doi: 10.1037/a0030101.
- Dalal S, Khodyakov D, Srinivasan R, Strauss S, Adams J. ExpertLens: a system for eliciting opinions from a large pool of non-collocated experts with diverse knowledge. Technological Forecasting and Social Change. 2011;78:1426–1444.
- Dawson DA, Grant BF, Stinson FS, Zhou Y. Effectiveness of the derived Alcohol Use Disorders Identification Test (AUDIT-C) in screening for alcohol use disorders and risk drinking in the US general population. Alcoholism: Clinical and Experimental Research. 2005;29:844–854. doi: 10.1097/01.alc.0000164374.32229.a2.
- Dembo R, Wareham J, Greenbaum PE, Childs K, Schmeidler J. Marijuana use among juvenile arrestees: A two-part growth model analysis. Journal of Child & Adolescent Substance Abuse. 2009;18(4):379–397.
- Duane S, Kennedy AD, Pendleton BJ, Roweth D. Hybrid Monte Carlo. Physics Letters B. 1987;195(2):216–222.
- Duan N. Smearing estimate: a nonparametric retransformation method. Journal of the American Statistical Association. 1983;78(383):605–610.
- Efron B. Frequentist accuracy of Bayesian estimates. Journal of the Royal Statistical Society: Series B. 2015;77(3):617–646. doi: 10.1111/rssb.12080.
- Gelfand AE, Smith AF. Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association. 1990;85(410):398–409.
- Gelman A, Hwang J, Vehtari A. Understanding predictive information criteria for Bayesian models. Statistics and Computing. 2014;24(6):997–1016.
- Gershman SJ, Blei DM. A tutorial on Bayesian nonparametric models. Journal of Mathematical Psychology. 2012;56(1):1–12.
- Girolami M, Calderhead B. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B. 2011;73(2):123–214.
- Graubard BI, Korn EL. Predictive margins with survey data. Biometrics. 1999;55(2):652–659. doi: 10.1111/j.0006-341x.1999.00652.x.
- Green PJ. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika. 1995;82(4):711–732.
- Hill PW, Goldstein H. Multilevel modeling of educational data with cross-classification and missing identification for units. Journal of Educational and Behavioral Statistics. 1998;23(2):117–128.
- Hoeting JA, Madigan D, Raftery AE, Volinsky CT. Bayesian model averaging: a tutorial. Statistical Science. 1999:382–401.
- Hoffman MD, Gelman A. The no-U-turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. The Journal of Machine Learning Research. 2014;15(1):1593–1623.
- Hunter SB, Watkins KE, Hepner KA, Paddock SM, Ewing BA, Osilla KC, Perry S. Treating depression and substance use: A randomized controlled trial. Journal of Substance Abuse Treatment. 2012;43(2):137–151. doi: 10.1016/j.jsat.2011.12.004.
- Kiernan K, Tao J, Gibbs P. Tips and strategies for mixed modeling with SAS/STAT® procedures. SAS Global Forum. 2012;2012:332–2012.
- Lee SY. Structural Equation Modeling: A Bayesian Approach. Vol. 711. John Wiley & Sons; 2007.
- Lin X, Breslow NE. Bias correction in generalized linear mixed models with multiple components of dispersion. Journal of the American Statistical Association. 1996;91(435):1007–1016.
- Little RJ. Calibrated Bayes: a Bayes/frequentist roadmap. The American Statistician. 2006;60(3):213–223.
- Liu L, Ma JZ, Johnson BA. A multi-level two-part random effects model, with application to an alcohol-dependence study. Statistics in Medicine. 2008;27(18):3528–3539. doi: 10.1002/sim.3205.
- Liu L, Strawderman RL, Cowen ME, Shih YCT. A flexible two-part random effects model for correlated medical costs. Journal of Health Economics. 2010;29(1):110–123. doi: 10.1016/j.jhealeco.2009.11.010.
- Long DL, Preisser JS, Herring AH, Golin CE. A marginalized zero-inflated Poisson regression model with overall exposure effects. Statistics in Medicine. 2014;33(29):5151–5165. doi: 10.1002/sim.6293.
- Lunn DJ, Thomas A, Best N, Spiegelhalter D. WinBUGS-a Bayesian modelling framework: concepts, structure, and extensibility. Statistics and Computing. 2000;10(4):325–337.
- Maisto SA, Carey MP, Carey KB, Gordon CM, Gleason JR. Use of the AUDIT and the DAST-10 to identify alcohol and drug use disorders among adults with a severe and persistent mental illness. Psychological Assessment. 2000;12(2):186–192. doi: 10.1037//1040-3590.12.2.186.
- Morgan-Lopez AA, Fals-Stewart W. Analytic methods for modeling longitudinal data from rolling therapy groups with membership turnover. Journal of Consulting and Clinical Psychology. 2007;75(4):580. doi: 10.1037/0022-006X.75.4.580.
- Neal RM. Slice sampling. Annals of Statistics. 2003:705–741.
- Neal RM. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo. 2011;2
- Neelon B, Zhu L, Neelon SEB. Bayesian two-part spatial models for semicontinuous data with application to emergency department expenditures. Biostatistics. 2015;16(3):465–479. doi: 10.1093/biostatistics/kxu062.
- Olsen MK, Schafer JL. A two-part random-effects model for semicontinuous longitudinal data. Journal of the American Statistical Association. 2001;96(454):730–745.
- Paddock SM, Hunter SB, Watkins KE, McCaffrey DF. Analysis of rolling group therapy data using conditionally autoregressive priors. Annals of Applied Statistics. 2011;5(2A):605. doi: 10.1214/10-AOAS434.
- Paddock SM, Leininger TJ, Hunter SB. Bayesian restricted spatial regression for examining session features and patient outcomes in open-enrollment group therapy studies. In press at Statistics in Medicine. 2016 doi: 10.1002/sim.6616.
- Paddock SM, Savitsky TD. Bayesian hierarchical semiparametric modelling of longitudinal post-treatment outcomes from open enrolment therapy groups. Journal of the Royal Statistical Society: Series A. 2013;176(3):795–808. doi: 10.1111/j.1467-985X.2012.12002.x.
- Prado R, West M. Time series: Modeling, computation, and inference. CRC Press; 2010.
- Raudenbush SW. A crossed random effects model for unbalanced data with applications in cross-sectional and longitudinal research. Journal of Educational and Behavioral Statistics. 1993;18(4):321–349.
- Robert C, Casella G. Monte Carlo Statistical Methods. Springer Science & Business Media; 2004.
- Savitsky TD, Paddock SM. Bayesian non-parametric hierarchical modeling for multiple membership data in grouped attendance interventions. Annals of Applied Statistics. 2013;7(2):1074–1094. doi: 10.1214/12-AOAS620.
- Song XY, Lee SY. A tutorial on the Bayesian approach for analyzing structural equation models. Journal of Mathematical Psychology. 2012;56(3):135–148.
- Skinner HA. The drug abuse screening test. Addictive Behaviors. 1982;7:363–371. doi: 10.1016/0306-4603(82)90005-3.
- Sobell LC, Sobell MB. Timeline followback: A technique for assessing self-reported ethanol consumption. In: Allen J, Litten RZ, editors. Measuring alcohol consumption: Psychosocial and biological methods. Totowa, NJ: Humana Press; 1992. pp. 41–72.
- Spiegelhalter DJ, Best NG, Carlin BP, Van Der Linde A. Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B. 2002;64(4):583–639.
- Spiegelhalter DJ, Best NG, Carlin BP, Van Der Linde A. The deviance information criterion: 12 years on. Journal of the Royal Statistical Society: Series B. 2014;76(3):485–493.
- Stan Development Team. RStan: the R interface to Stan, Version 2.7.0. 2015a http://mc-stan.org/rstan.html.
- Stan Development Team. Stan: A C++ Library for Probability and Sampling, Version 2.8.0. 2015b http://mc-stan.org.
- Su L, Tom BD, Farewell VT. A likelihood-based two-part marginal model for longitudinal semicontinuous data. Statistical Methods in Medical Research. 2015;24(2):194–205. doi: 10.1177/0962280211414620.
- Tasca GA, Ramsay T, Corace K, Illing V, Bone M, Bissada H, Balfour L. Modeling longitudinal data from a rolling therapy group program with membership turnover: Does group culture affect individual alliance? Group Dynamics: Theory, Research, and Practice. 2010;14(2):151.
- Tsay RS. Analysis of Financial Time Series. Vol. 543. John Wiley & Sons; 2005.
- Vehtari A, Gelman A. WAIC and cross-validation in Stan. Technical report. 2014. Available at http://www.stat.columbia.edu/~gelman/research/unpublished/waic_stan.pdf. Accessed June 15, 2016.
- Watanabe S. Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. The Journal of Machine Learning Research. 2010;11:3571–3594.
- Watkins KE, Hunter SB, Hepner KA, Paddock SM, de la Cruz E, Zhou AJ, Gilmore J. An effectiveness trial of group cognitive behavioral therapy for patients with persistent depressive symptoms in substance abuse treatment. Archives of General Psychiatry. 2011;68(6):577–584. doi: 10.1001/archgenpsychiatry.2011.53.
- White AM, Hingson RW, Pan IJ. Hospitalizations for alcohol and drug overdoses in young adults ages 18–24 in the United States, 1999–2008: results from the Nationwide Inpatient Sample. Journal of Studies on Alcohol and Drugs. 2011;72(5):774. doi: 10.15288/jsad.2011.72.774.




