Author manuscript; available in PMC: 2016 Oct 14.
Published in final edited form as: Biom J. 2015 Aug 20;58(1):206–221. doi: 10.1002/bimj.201400232

Augmented Beta rectangular regression models: A Bayesian perspective

Jue Wang, Sheng Luo *
PMCID: PMC5064841  NIHMSID: NIHMS820952  PMID: 26289406

Abstract

Mixed effects Beta regression models based on Beta distributions have been widely used to analyze longitudinal percentage or proportional data ranging between zero and one. However, Beta distributions are not flexible to extreme outliers or excessive events around tail areas, and they do not account for the presence of the boundary values zeros and ones because these values are not in the support of the Beta distributions. To address these issues, we propose a mixed effects model using Beta rectangular distribution and augment it with the probabilities of zero and one. We conduct extensive simulation studies to assess the performance of mixed effects models based on both the Beta and Beta rectangular distributions under various scenarios. The simulation studies suggest that the regression models based on Beta rectangular distributions improve the accuracy of parameter estimates in the presence of outliers and heavy tails. The proposed models are applied to the motivating Neuroprotection Exploratory Trials in Parkinson’s Disease (PD) Long-term Study-1 (LS-1 study, n=1741), developed by The National Institute of Neurological Disorders and Stroke Exploratory Trials in Parkinson’s Disease (NINDS NET-PD) network.

Keywords: Augmented Beta, Beta rectangular distribution, GAMLSS family, longitudinal data, Markov chain Monte Carlo, Proportional data

1 Introduction

In many clinical trials or biomedical studies, researchers often collect some outcomes in the form of percentages, proportions, fractions, and rates measured in the open unit interval (0, 1) (referred to as proportional data) (Kieschnick & McCullough, 2003). Examples include the Alzheimer’s disease assessment scale (range from 0–70, which can be transformed into the unit interval) (Rogers et al., 2012), microbial data expressed as percentages (Zhao et al., 2001), and proportion of parasitized eggs in a biological control assay (Vieira et al., 2000). Beta regression models (Ferrari & Cribari-Neto, 2004) are increasingly used by researchers from various fields to directly model the covariate effects on the proportional response through a generalized linear model (GLM) framework. When the outcomes are measured longitudinally, Beta regression models with random effects have been proposed to account for the within-subject correlation (Verkuilen & Smithson, 2012; Figueroa-Zúniga et al., 2013). However, because the support of the Beta distributions is between zero and one, boundary values zero and one are not allowed in Beta regression models. Several previous studies have discussed this issue through various approaches. Smithson & Verkuilen (2006) introduced a Lemon-squeezer (LS) transformation which first linearly transforms the variable from the original scale to the open unit interval (0, 1) and then compresses the range to avoid zeros and ones. The re-scaling might work nicely for small proportions of zeros and/or ones, but the parameter estimates can be sensitive to higher proportions (Galvis et al., 2014). Such inefficiency may be even worse in the presence of within-subject correlation or multi-level clustering in typical longitudinal data.
Ospina & Ferrari (2010) proposed a mixed continuous-discrete distribution to capture the probabilities at zero and one, by using a continuous distribution on the open unit interval (0, 1) and a degenerate distribution that assigns values of zero and one with non-negative probabilities. This approach allows one to directly model the variable without transformation. Using the same idea, Galvis et al. (2014) proposed a generalized linear mixed model framework by augmenting the probabilities of zeros and ones to the Beta regression model via a zero-one-augmented Beta (ZOAB) model. They termed the model as “augmented” model rather than “inflated” model because the Beta distribution does not include zero and one in its support, similar in spirit to Hatfield et al. (2011, 2012).

The Beta regression models are based on Beta distributions, which can take on a variety of shapes to account for non-normality and skewness in proportional data by considering different values of its parameters (Johnson et al., 1994). However, as noted by Hahn (2008), García et al. (2011), and Bayes et al. (2012), the Beta distribution accommodates neither tail-area events nor outlying events and fails to represent excess variability and the over-occurrence of tail-area events, which could limit its application to proportional data. Various methods have been proposed to address the issue of outlying and tail-area events. Rigby & Stasinopoulos (2005) proposed the generalized additive models for location, scale and shape (GAMLSS) family and developed the R package GAMLSS (Stasinopoulos & Rigby, 2007), which allows all parameters of the response variable’s distribution to be modeled as linear and non-linear functions of the covariates. In the current context, the zero and one inflated Beta regression model (BEINF) in the GAMLSS family can be used for modeling proportional data in [0, 1]. Additionally, Hahn (2008) proposed a Beta rectangular distribution, which is a mixture distribution consisting of a Beta distribution and a uniform (rectangular) distribution between 0 and 1. The Beta rectangular distribution reduces to the Beta distribution when the mixture probability is 0. Compared to the Beta distribution, the Beta rectangular distribution assigns more probability to outliers and extremal tail-area events. Bayes et al. (2012) proposed a Beta rectangular regression model for cross-sectional data and obtained more robust inference against outlying observations than the Beta regression. However, the Beta rectangular model accounts for neither the within-subject correlation in longitudinal data nor the boundary values zeros and ones.

In this article, we generalize the model by Bayes et al. (2012) and develop an augmented Beta rectangular regression model to account for the occurrence of boundary values 0 and 1 in the closed unit interval [0, 1]. Moreover, we account for the within-subject correlation in the longitudinal data by introducing random effects under the generalized linear mixed model framework. The rest of the article proceeds as follows. In Section 2, we describe a motivating clinical trial, the proportional outcome variable of interest, and the issue of outlying observations. In Section 3, we briefly review the Beta and Beta rectangular distributions and their regression models, and develop the augmented Beta rectangular regression model. In Section 4, we discuss the Bayesian inference and Bayesian model selection criteria. In Section 5, we conduct an extensive simulation study with three settings to compare the performance of various models. In Section 6, we apply the proposed model to the motivating clinical trial. Concluding remarks and discussions are given in Section 7.

2 A motivating clinical trial

This methodological development is motivated by the Parkinson’s Disease (PD) Long-term Study-1 (LS-1 study, n=1741), developed by The National Institute of Neurological Disorders and Stroke Exploratory Trials in Parkinson’s Disease (NINDS NET-PD) network. The LS-1 study is a multi-center, double-blind, phase III study of creatine in patients with early treated PD to assess whether creatine slows PD clinical decline defined by a combination of cognitive, physical, and quality of life measures. A total of 1741 patients with early PD were randomly assigned to receive either placebo or creatine. Participants were followed until the last enrolled participant had completed 5 years of observation. Thus, many participants had extended follow-up, to a maximum of 6 years. In-person evaluations were conducted at baseline and then annually beginning at 12 months. The LS-1 study represents the largest cohort of patients with early treated PD ever enrolled in a clinical trial (Elm, 2012; Kieburtz et al., 2015). The detailed description of the design of the LS-1 study can be found in Elm (2012).

The primary outcome of interest in this article is the standardized generic instrument EuroQol vertical visual analog scale (EQ-VAS), which is widely used in PD related research (Kieburtz et al., 2015). EQ-VAS is a patient-reported outcome (PRO) which takes values between 100 (best imaginable health) and 0 (worst imaginable health), on which patients provide a global assessment of their health. Participants of the LS-1 study are early PD patients with diagnosis within 2 years. Because PD is a slow progression disease, it is likely that early PD patients in the LS-1 study were still in a very good health condition and considered themselves to be in the best imaginable health. Each patient contributed between 1 and 7 (mean=4.73, sd=1.70) EQ-VAS measurements. There are 24 intermittent missing values observed in EQ-VAS and during the follow-up, 78 and 323 individuals died and dropped out of the study, respectively. The intermittent missing data and the missing data after loss of follow-up or death are assumed to be missing at random (MAR) in this article. Figure 1 (left panel) displays the histogram based on all EQ-VAS observations. Because there is only one patient reporting zero value for EQ-VAS, we add 0.1 to that observation. However, the presence of a substantial number of 100’s (168 out of 8,227 observations, 2.04%), if unaccounted for, is a potential issue for Beta or Beta rectangular regression models.

Figure 1. Histogram of the EQ-VAS scores (left panel) and the longitudinal profiles (right panel) of the EQ-VAS scores of 50 randomly selected subjects (grey lines) and 3 subjects with outlying observations (black dashed, dotdashed and dotted lines) from the LS-1 study and the lowess smooth curve (black solid line).

Figure 1 (right panel) displays the longitudinal profiles of the EQ-VAS scores of 50 randomly selected subjects. Because PD is a slow progression disease, it is unexpected to observe sudden value change in the outcome variables such as EQ-VAS, as indicated by the nearly-horizontal lowess smooth curve (black solid line) (Cleveland, 1979). However, compared to baseline measurements, two patients (denoted by dashed and dotdashed lines) have a sudden value drop at their year 1 follow-up visit, and return to a higher level at their year 2 visit. Similarly, another patient’s EQ-VAS score (denoted by dotted line) at the fifth year is significantly lower than the two adjacent years. Hence, these three observations are potential outliers. We divide the EQ-VAS variable by 100 to rescale it to the interval (0, 1]. We are interested in examining the effect of outliers, as well as the boundary values of 1, on the inference of regression models based on the Beta and Beta rectangular distributions.

3 Model and estimation

3.1 Beta distribution

In this section, we briefly review the reparameterization of Beta distribution (Ferrari & Cribari-Neto, 2004). A random variable Y follows a Beta distribution if the probability density function (pdf) in terms of its mean μ and precision parameter ϕ is given by,

$$f_B(Y = y \mid \mu, \phi) = \frac{\Gamma(\phi)}{\Gamma(\mu\phi)\,\Gamma((1-\mu)\phi)}\, y^{\mu\phi - 1} (1-y)^{(1-\mu)\phi - 1}, \tag{1}$$

where 0 < y < 1, 0 < μ < 1, and ϕ > 0. Then we have E(Y) = μ and Var(Y) = μ(1 −μ)/(1 + ϕ). We adopt the notation Y ~ Beta(μ, ϕ).
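The mean-precision form in (1) maps onto the classical two-shape Beta density by taking the shapes as μϕ and (1 − μ)ϕ. The following is a minimal Python sketch of that mapping, using only the standard library; the function names `beta_logpdf` and `beta_mean_var` are illustrative and do not come from the paper.

```python
import math

def beta_logpdf(y, mu, phi):
    """Log density of Beta(mu, phi) in the mean-precision form of equation (1)."""
    if not (0.0 < y < 1.0 and 0.0 < mu < 1.0 and phi > 0.0):
        raise ValueError("need 0 < y < 1, 0 < mu < 1 and phi > 0")
    a = mu * phi            # first classical shape parameter
    b = (1.0 - mu) * phi    # second classical shape parameter
    log_norm = math.lgamma(phi) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(y) + (b - 1.0) * math.log1p(-y)

def beta_mean_var(mu, phi):
    """Mean and variance implied by the (mu, phi) parameterization."""
    return mu, mu * (1.0 - mu) / (1.0 + phi)
```

For instance, μ = 0.5 and ϕ = 2 give both shapes equal to 1, which is the uniform density on (0, 1) with variance 1/12.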

The Beta regression model can be defined by linking the mean and covariates of interest under a GLM framework as logit(μi) = Xiβ, where Xi is the covariates vector of interest for subject i. The precision parameter ϕ can be either regressed on covariates after logarithm transformation or considered as a constant among different subjects. By specifying different values of μ and ϕ, the Beta distribution is flexible to allow different shapes and skewness.

3.2 Beta Rectangular Distribution

The Beta rectangular (BR) distribution is a mixture distribution consisting of a Beta distribution and a uniform (rectangular) distribution between 0 and 1. Its probability density function with support (0, 1) is given by

$$f_{BR}(Y = y \mid \mu, \phi, q) = q + (1-q)\, f_B(y \mid \mu, \phi),$$

where 0 ≤ q ≤ 1 is the mixture probability, and fB(y|μ, ϕ) is the pdf of the Beta distribution as in (1). We denote the BR distribution as Y ~ BR(μ, ϕ, q). If q = 1, the BR distribution reduces to the uniform (rectangular) distribution between 0 and 1, and if q = 0, it reduces to the Beta distribution fB(y|μ, ϕ). The mean and variance of the BR distribution are $E(Y) = (1-q)\mu + \frac{q}{2} = \gamma$ and $\mathrm{Var}(Y) = \frac{\mu(1-\mu)}{1+\phi}(1-q)\,[1 - q(1+\phi)] + \frac{q}{12}(4-3q)$.
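As a numerical sanity check on the mixture, the sketch below evaluates the BR density and compares its moments, computed directly from the two components by the law of total variance, against a midpoint-rule integration. It is an illustration only: the function names and the grid size are assumptions, and the check is only meaningful when both Beta shape parameters exceed 1 so that the density is bounded.

```python
import math

def beta_pdf(y, mu, phi):
    """Beta density in the mean-precision form of equation (1)."""
    a, b = mu * phi, (1.0 - mu) * phi
    log_f = (math.lgamma(phi) - math.lgamma(a) - math.lgamma(b)
             + (a - 1.0) * math.log(y) + (b - 1.0) * math.log1p(-y))
    return math.exp(log_f)

def br_pdf(y, mu, phi, q):
    """Mixture density: weight q on Uniform(0, 1), weight 1 - q on Beta(mu, phi)."""
    return q + (1.0 - q) * beta_pdf(y, mu, phi)

def br_mean_var(mu, phi, q):
    """Moments of the two-component mixture via the law of total variance."""
    v_beta = mu * (1.0 - mu) / (1.0 + phi)
    mean = q / 2.0 + (1.0 - q) * mu
    var = q / 12.0 + (1.0 - q) * v_beta + q * (1.0 - q) * (mu - 0.5) ** 2
    return mean, var

def midpoint_moments(mu, phi, q, n=20000):
    """Total mass, mean and variance of the BR density by the midpoint rule."""
    h = 1.0 / n
    ys = [(k + 0.5) * h for k in range(n)]
    w = [br_pdf(y, mu, phi, q) * h for y in ys]
    m0 = sum(w)
    m1 = sum(y * wi for y, wi in zip(ys, w))
    m2 = sum(y * y * wi for y, wi in zip(ys, w))
    return m0, m1, m2 - m1 * m1
```

Calling `midpoint_moments(0.7, 10.0, 0.2)` and `br_mean_var(0.7, 10.0, 0.2)` should agree closely on the mean and variance, with the total mass close to 1.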

However, a regression analysis typically models the mean of the response (Ferrari & Cribari-Neto, 2004). To obtain a more appropriate regression structure for the mean of the BR distribution, Bayes et al. (2012) let $\gamma = \frac{q}{2} + (1-q)\mu$ and $\alpha = \frac{q}{1 - (1-q)\,|2\mu - 1|}$, whose parameter space is {0 ≤ γ ≤ 1, 0 ≤ α ≤ 1}. Under this reparameterization, the pdf of the BR distribution is

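$$f_{BR}(Y = y \mid \gamma, \phi, \alpha) = \alpha\,(1 - |2\gamma - 1|) + \bigl(1 - \alpha\,(1 - |2\gamma - 1|)\bigr) \times f_B\!\left(y \,\middle|\, \frac{\gamma - 0.5\,\alpha\,(1 - |2\gamma - 1|)}{1 - \alpha\,(1 - |2\gamma - 1|)},\ \phi\right). \tag{2}$$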

We denote the reparameterized BR distribution as Y ~ BR(γ, ϕ, α), where γ is the mean, ϕ is the precision parameter, and α is a shape parameter controlling the thickness of the tails. When the mixture probability q = 1, we have α = 1 and γ = 0.5, and the BR distribution reduces to the uniform (rectangular) distribution between 0 and 1. When q = 0, we have α = 0 and γ = μ, and the BR distribution reduces to the Beta distribution. In general, when 0 < q < 1, we have 0 < α < 1, and the BR distribution has heavier tails than its Beta counterpart. To visualize this, Figure 2 displays the density functions of various BR distributions with different values of γ, ϕ, and α; for every α > 0, the BR density has heavier tails than the corresponding Beta density (α = 0).
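The two parameterizations are linked by q = α(1 − |2γ − 1|) and μ = (γ − q/2)/(1 − q), which follows from |2γ − 1| = (1 − q)|2μ − 1|. A short Python sketch of the round trip is given below; the function names are illustrative, and the inverse map assumes q < 1 because μ is not identified in the purely uniform case.

```python
def to_gamma_alpha(mu, q):
    """Map (mu, q) to the mean gamma and the shape alpha of Bayes et al. (2012)."""
    gamma = q / 2.0 + (1.0 - q) * mu
    alpha = q / (1.0 - (1.0 - q) * abs(2.0 * mu - 1.0))
    return gamma, alpha

def to_mu_q(gamma, alpha):
    """Recover (mu, q) from (gamma, alpha); assumes the implied q is below 1."""
    q = alpha * (1.0 - abs(2.0 * gamma - 1.0))
    if q >= 1.0:
        raise ValueError("q = 1 is the uniform case; mu is not identified")
    mu = (gamma - 0.5 * q) / (1.0 - q)
    return mu, q
```

With μ = 0.7 and q = 0.2, the forward map gives γ = 0.66 and α = 0.2/0.68, and the inverse map returns the original pair.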

Figure 2. The density functions of various Beta rectangular distributions with different values of γ, ϕ, and α. α = 0 (solid line), α = 0.1 (dashed line), α = 0.3 (dotted line), α = 0.5 (dotdash line).

Similar to the Beta regression model, the Beta rectangular regression model can be defined as logit(γi) = Xiβ and the precision parameter ϕ can be either regressed on covariates or considered as a constant among different subjects.

3.3 One-augmented Beta rectangular random effects model

In this section, we generalize the Beta rectangular regression model to account for the longitudinal data structure and the boundary values zero and one. For ease of illustration, we consider only the boundary value of one (rescaled from 100, as in Figure 1 left panel); we illustrate how to extend the model to account for both zero and one in Web Section 2. Let yij be the observed outcome (e.g., EQ-VAS) at visit j (j = 1, …, Ji, where j = 1 is baseline and Ji is the number of visits for subject i) from subject i (i = 1, …, I, where I is the number of subjects). Let yi = (yi1, …, yiJi)′ be the outcome vector for subject i and let y = (y1, …, yI)′ be the observed outcome matrix. We then propose a one-augmented BR (OABR) model, denoted by Yij ~ OABR(p1ij, γij, ϕij, α), whose probability density function is

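$$f(Y_{ij} = y_{ij} \mid p_{1ij}, \gamma_{ij}, \phi_{ij}, \alpha) = \begin{cases} p_{1ij} & \text{if } y_{ij} = 1, \\ (1 - p_{1ij})\, f_{BR}(Y_{ij} = y_{ij} \mid \gamma_{ij}, \phi_{ij}, \alpha) & \text{if } y_{ij} \in (0, 1), \end{cases} \tag{3}$$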

where p1ij = P(Yij = 1), fBR(Yij = yij |γij, ϕij, α) is the reparameterized BR density function given in (2), γij = E[Yij], and ϕij is the precision parameter of subject i at visit j. Next we propose the OABR regression model by regressing the covariates onto p1ij, γij, and ϕij, which are transformed by some link functions:

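$$\begin{aligned} \mathrm{logit}[p_{1ij} = P(y_{ij} = 1 \mid u_{i0})] &= X_{i0}\,\omega + u_{i0}, \\ \mathrm{logit}(\gamma_{ij} \mid u_{i1}) &= X_{i1}\,\beta + u_{i1}, \\ \log(\phi_{ij} \mid u_{i2}) &= X_{i2}\,\eta + u_{i2}, \end{aligned} \tag{4}$$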

where the covariate vectors Xi0, Xi1 and Xi2 can be identical or different and they include covariates of interest (e.g., treatment assignment) and potential confounding variables (e.g., subjects characteristics and socioeconomic status) from subject i. We adopt the logit link function for both p1ij and γij, while other link functions (e.g., probit and complementary log-log) can also be used. We assume that the random effects vector ui = (ui0, ui1, ui2)′ follows a multivariate normal distribution N3(0, Σ), where

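$$\Sigma = \begin{pmatrix} \sigma_1^2 & \rho_{12}\sigma_1\sigma_2 & \rho_{13}\sigma_1\sigma_3 \\ \rho_{12}\sigma_1\sigma_2 & \sigma_2^2 & \rho_{23}\sigma_2\sigma_3 \\ \rho_{13}\sigma_1\sigma_3 & \rho_{23}\sigma_2\sigma_3 & \sigma_3^2 \end{pmatrix}. \tag{5}$$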
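To make the structure of (5) concrete, the sketch below assembles Σ from the three standard deviations and three correlations and draws one random-effects vector through a Cholesky factor. It is a standard-library illustration under assumed helper names; the Cholesky step also signals when a chosen combination of correlations does not yield a positive definite matrix.

```python
import math
import random

def covariance_matrix(sigmas, rho12, rho13, rho23):
    """Assemble the 3 x 3 covariance matrix of equation (5)."""
    s1, s2, s3 = sigmas
    return [
        [s1 * s1, rho12 * s1 * s2, rho13 * s1 * s3],
        [rho12 * s1 * s2, s2 * s2, rho23 * s2 * s3],
        [rho13 * s1 * s3, rho23 * s2 * s3, s3 * s3],
    ]

def cholesky(a):
    """Lower-triangular factor L with L L' = a; raises if a is not positive definite."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return low

def draw_random_effects(sigma, rng):
    """One draw of (u_i0, u_i1, u_i2)' from N3(0, sigma) as L z with z standard normal."""
    low = cholesky(sigma)
    z = [rng.gauss(0.0, 1.0) for _ in range(3)]
    return [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(3)]
```

For example, `covariance_matrix((1.2, 0.6, 0.4), 0.4, 0.1, 0.4)` reproduces the covariance matrix used later in the simulation study.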

The proposed OABR regression model can be modified to accommodate various features in the data. For example, the one-inflated Beta regression model (BEOI) in the GAMLSS family can be obtained by replacing the BR density fBR(Yij = yij |γij, ϕij, α) with the Beta density in (1) or equivalently by setting α = 0. If the outcome matrix y contains both zeros and ones, the OABR regression model can be generalized to the zero-one augmented BR (ZOABR) regression model by adding to model (3) the probability p0ij = P(Yij = 0) of observing 0 and regressing covariates onto p0ij transformed by some link function. Note that the ZOABR regression model requires an additional constraint, 0 < p0ij + p1ij < 1, and we illustrate in detail how to impose this constraint in Web Section 2. On the other hand, if there are no zeros or ones observed, we can let p0ij = p1ij ≡ 0, and the OABR regression model then reduces to the mixed effects BR regression model. Moreover, we only include random intercepts in model (4) for simplicity; more random effects (e.g., random slopes) can be easily included in the model.

Let the parameter vector Θ = (ω, β, η, Σ, α). Conditional on the random effects ui, all measurements of each subject are assumed to be independent. We have the full likelihood of subject i as follows:

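$$L(\Theta, u_i; y_i) = \left[\prod_{j=1}^{J_i} p(y_{ij} \mid u_i)\right] p(u_i) = \left[\prod_{j=1}^{J_i} p_{1ij}^{\,I(y_{ij}=1)} \bigl\{(1 - p_{1ij})\, f_{BR}(y_{ij} \mid \gamma_{ij}, \phi_{ij}, \alpha)\bigr\}^{1 - I(y_{ij}=1)}\right] p(u_i), \tag{6}$$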

where I(·) denotes the indicator function, fBR(yij |γij, ϕij, α) is the density function of the BR distribution, and p(ui) is the density function of the random effects ui.
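The bracketed product in (6) can be evaluated on the log scale one visit at a time. The sketch below is an illustrative Python version, not the authors' Stan code: it takes the per-visit values of p1ij, γij and ϕij as already computed from the linear predictors, omits the random-effects density p(ui), and assumes 0 ≤ α < 1.

```python
import math

def beta_logpdf(y, mu, phi):
    """Log Beta density in the mean-precision form of equation (1)."""
    a, b = mu * phi, (1.0 - mu) * phi
    return (math.lgamma(phi) - math.lgamma(a) - math.lgamma(b)
            + (a - 1.0) * math.log(y) + (b - 1.0) * math.log1p(-y))

def br_pdf(y, gamma, phi, alpha):
    """Reparameterized BR density of equation (2); assumes 0 <= alpha < 1."""
    q = alpha * (1.0 - abs(2.0 * gamma - 1.0))   # implied mixture weight
    mu = (gamma - 0.5 * q) / (1.0 - q)           # implied Beta mean
    return q + (1.0 - q) * math.exp(beta_logpdf(y, mu, phi))

def oabr_loglik(y, p1, gamma, phi, alpha):
    """Sum over visits of the log of density (3), conditional on the random effects."""
    total = 0.0
    for y_j, p_j, g_j, f_j in zip(y, p1, gamma, phi):
        if y_j == 1.0:
            total += math.log(p_j)               # point mass at one
        else:
            total += math.log1p(-p_j) + math.log(br_pdf(y_j, g_j, f_j, alpha))
    return total
```

Setting `alpha=0.0` gives the corresponding one-inflated Beta (BEOI) contribution, since the BR density then reduces to the Beta density.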

4 Bayesian inference

We adopt Bayesian methods based on Markov chain Monte Carlo (MCMC) algorithms to obtain statistical inference. The fully Bayesian inference has many advantages. First, MCMC algorithms can be used to estimate exact posterior distributions of the parameters, while likelihood-based estimation only produces a point estimate of the parameters, with asymptotic standard errors (Dunson, 2007). Second, Bayesian inference provides better performance in small samples compared to likelihood-based estimation (Lee & Song, 2004). In addition, it is more straightforward to deal with more complicated models using Bayesian inference via MCMC.

4.1 Prior specification

To make inference on the unknown parameter vector Θ, we use MCMC posterior simulations. We use vague prior distributions on all elements of Θ. The prior distributions of all elements in ω, β and η are N(0, τ²). We use the prior distribution Inverse-Gamma(λ1, λ2) for the σ’s to ensure positivity. Specifically, we let τ = 10 and λ1 = λ2 = 0.001. The prior distribution for the ρ’s is Uniform(−1, 1). We have investigated other selections of vague prior distributions with various hyper-parameters (e.g., τ, λ’s) and obtained very similar results. Please refer to Web Section 3 for details of the sensitivity analysis results. The posterior samples are obtained from the full conditional of each unknown parameter using Hamiltonian Monte Carlo (HMC) (Duane et al., 1987; Neal, 1994) and the No-U-Turn Sampler (NUTS) (Hoffman & Gelman, 2013). The HMC and NUTS samplers are implemented in Stan (Stan Development Team, 2014, version 2.6.0), a probabilistic programming language for statistical inference. The model fitting is performed in Stan by specifying the full likelihood function and the prior distributions of all unknown parameters. For large datasets, Stan may be more efficient than the BUGS language, converging faster and requiring fewer samples. To facilitate easy reading and implementation of the proposed OABR regression model, the Stan code has been posted in the Web Supplement.

To monitor Markov chain convergence, we use history plots and view the absence of apparent trend in the plots as evidence of convergence. We run multiple chains with diffuse initial values and ensure that the potential scale reduction factors of all parameters are smaller than 1.1 (Gelman et al., 2013).
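For readers who want to see what the 1.1 threshold is applied to, the sketch below computes the basic potential scale reduction factor from several equal-length chains of one scalar parameter. It is a plain-Python illustration of the textbook between-chain and within-chain variance comparison; software such as Stan reports a split-chain variant, so the values need not match exactly.

```python
import math

def potential_scale_reduction(chains):
    """Basic potential scale reduction factor for m chains of equal length n."""
    m = len(chains)
    n = len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    # Between-chain variance B and average within-chain variance W.
    b = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    w = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m
    var_plus = (n - 1) / n * w + b / n   # pooled estimate of the posterior variance
    return math.sqrt(var_plus / w)
```

Chains that explore the same region give a value near 1, whereas chains stuck around different values inflate the between-chain term and push the factor well above 1.1.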

4.2 Bayesian model selection and influence diagnostics

There are a variety of model selection criteria in Bayesian inference. The widely used conditional predictive ordinate (CPO) criterion (Geisser, 1993; Dey et al., 1997; Sinha & Dey, 1997; Carlin & Louis, 2009; Ghosh & Hanson, 2010) is adopted for model assessment and selection. The CPO for the (ij)th observation (observation j from subject i) is defined as

$$\mathrm{CPO}_{ij} = p(y_{ij} \mid y_{(ij)}) = \int p(y_{ij} \mid \Theta)\, p(\Theta \mid y_{(ij)})\, d\Theta, \tag{7}$$

where y denotes the full data and y(ij) denotes the data after deleting the (ij)th observation. CPO is a form of cross-validation, with a high value indicating that the data for observation (ij) can be accurately predicted by a model based on the data from all other observations. Hence, a model with larger CPOij for all observations suggests a better fit. Although the closed form of CPOij is not available for our proposed model, a Monte Carlo estimator of CPOij can be obtained from MCMC samples $\{\Theta^{(t)}\}_{t=1}^{M}$ from the posterior distribution p(Θ|y), with M being the total number of post burn-in samples. Because p(yij|y(ij)) = p(y)/p(y(ij)) = 1/∫ p(Θ|y)/p(yij|y(ij), Θ)dΘ, a harmonic-mean approximation of CPOij is $\widehat{\mathrm{CPO}}_{ij} = \left(\frac{1}{M}\sum_{t=1}^{M} \frac{1}{p(y_{ij} \mid y_{(ij)}, \Theta^{(t)})}\right)^{-1} = \left(\frac{1}{M}\sum_{t=1}^{M} \frac{1}{p(y_{ij} \mid \Theta^{(t)})}\right)^{-1}$ (Dey et al., 1997). A summary statistic of $\widehat{\mathrm{CPO}}_{ij}$ over all observations is the log pseudo-marginal likelihood (LPML), defined as $\mathrm{LPML} = \sum_{i=1}^{I}\sum_{j=1}^{J_i} \log(\widehat{\mathrm{CPO}}_{ij})$. A larger value of LPML indicates better fit of the model. Moreover, we adopt a pseudo-Bayes factor for comparing two models, defined as PBF21 = exp(LPML2 − LPML1) (Ghosh & Hanson, 2010).
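The harmonic-mean estimator is easiest to compute on the log scale, because the reciprocal likelihoods can overflow. The sketch below assumes that the pointwise log-likelihoods log p(yij | Θ(t)) have been saved for each posterior draw; the function names and the list-of-lists layout (one row per observation, one column per draw) are illustrative choices.

```python
import math

def log_cpo(log_lik_draws):
    """Log of the harmonic-mean CPO estimate from M draws of log p(y_ij | Theta^(t))."""
    m = len(log_lik_draws)
    neg = [-v for v in log_lik_draws]            # log of 1 / p(y_ij | Theta^(t))
    top = max(neg)                               # log-sum-exp shift for stability
    log_mean_inv = top + math.log(sum(math.exp(v - top) for v in neg) / m)
    return -log_mean_inv

def lpml(log_lik_matrix):
    """LPML: sum of log CPO estimates over observations (one row per observation)."""
    return sum(log_cpo(row) for row in log_lik_matrix)
```

As a small check, two draws with likelihoods 0.5 and 0.25 have harmonic mean 1/3, so `log_cpo` returns log(1/3).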

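To detect the occurrence of outliers and extremal events around the tail-area, we consider the Kullback-Leibler (K-L) divergence defined as $K\{p(\Theta \mid y), p(\Theta \mid y_{(ij)})\} = \int p(\Theta \mid y) \log\left(\frac{p(\Theta \mid y)}{p(\Theta \mid y_{(ij)})}\right) d\Theta$. Peng & Dey (1995) pointed out that $K\{p(\Theta \mid y), p(\Theta \mid y_{(ij)})\} = \log E_{\Theta \mid Y}[\{p(y_{ij} \mid \Theta)\}^{-1}] + E_{\Theta \mid Y}[\log\{p(y_{ij} \mid \Theta)\}] = -\log(\mathrm{CPO}_{ij}) + E_{\Theta \mid Y}[\log\{p(y_{ij} \mid \Theta)\}]$, where $E_{\Theta \mid Y}(\cdot)$ denotes the expectation with respect to the joint posterior distribution p(Θ|y). Cancho et al. (2011) proposed the Monte Carlo estimate of the K-L divergence as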

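$$\widehat{K}\{p(\Theta \mid y), p(\Theta \mid y_{(ij)})\} = -\log(\widehat{\mathrm{CPO}}_{ij}) + \frac{1}{M}\sum_{t=1}^{M} \log\{p(y_{ij} \mid \Theta^{(t)})\}. \tag{8}$$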
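Estimate (8) needs only the same saved pointwise log-likelihood draws as the CPO estimate. A minimal Python sketch, with an assumed function name, is:

```python
import math

def kl_divergence(log_lik_draws):
    """Monte Carlo K-L estimate of equation (8) from draws of log p(y_ij | Theta^(t))."""
    m = len(log_lik_draws)
    neg = [-v for v in log_lik_draws]
    top = max(neg)
    # log of the posterior mean of 1 / p(y_ij | Theta), which equals -log(CPO-hat)
    neg_log_cpo = top + math.log(sum(math.exp(v - top) for v in neg) / m)
    return neg_log_cpo + sum(log_lik_draws) / m
```

By Jensen's inequality the estimate is non-negative, and it equals zero when the log-likelihood of the observation is constant across draws, that is, when deleting the observation would not move the posterior.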

5 Simulation studies

In this section, we conduct an extensive simulation study with three settings to compare the performance of the one-inflated Beta (BEOI) regression model in the GAMLSS family and the proposed OABR regression model. In all three settings, we generate 200 datasets with sample size N = 1200 subjects and seven visits (baseline and six follow-up visits, Ji = 7) for each subject. The data structure is similar to that of the motivating LS-1 study, and the continuous proportional outcome is restricted to the interval (0, 1]. We consider one covariate xi taking value 0 or 1 each with probability 1/2 to mimic the treatment assignment. The time vector is ti = (ti1, ti2, …, ti7)′ = (0, 1, 2, 3, 4, 5, 6)′, which is the same as in the motivating LS-1 study.

5.1 Simulation Settings I and II: Data simulated from either the BEOI or the OABR regression model

The aims of simulation Settings I and II are to demonstrate that under model overparameterization, the OABR regression model can reduce to the BEOI regression model, while the BEOI regression model does not perform well when data are generated from the OABR regression model. In these two simulation settings, we assume that there are boundary value ones and generate the datasets from the models

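$$\begin{aligned} \mathrm{logit}[p_{1ij} = P(y_{ij} = 1 \mid u_{i0})] &= \omega_0 + \omega_1 x_i + u_{i0}, \\ \mathrm{logit}(\gamma_{ij} \mid u_{i1}) &= \beta_0 + \beta_1 x_i + \beta_2 t_{ij} + \beta_3 x_i t_{ij} + u_{i1}, \\ \log(\phi_{ij} \mid u_{i2}) &= \eta_0 + \eta_1 x_i + \eta_2 t_{ij} + \eta_3 x_i t_{ij} + u_{i2}, \end{aligned} \tag{9}$$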

where the random effects (ui0, ui1, ui2)′ ~ N3(0, Σ), with σ1 = 1.2, σ2 = 0.6, σ3 = 0.4 and ρ12 = 0.4, ρ13 = 0.1, ρ23 = 0.4 as the components of the covariance matrix Σ shown in (5). We set the regression coefficients ω = (ω0, ω1)′ = (−1.5, −0.5)′, β = (β0, β1, β2, β3)′ = (1.5, −0.5, −0.1, 0.2)′, and η = (η0, η1, η2, η3)′ = (2.5, 0.2, 0.1, 0.1)′. We set either α = 0 (Setting I) or α = 0.2 (Setting II), then the data are simulated from either the BEOI (Setting I) or the OABR (Setting II) regression models. We fit the BEOI and OABR regression models to the simulated datasets in both settings. Web Tables 1 and 2 display bias (the average of the posterior means minus the true values), standard deviation (SD, the standard deviation of the posterior means), coverage probabilities (CP) of 95% equal-tail credible intervals, and root mean squared error (RMSE) from the BEOI and OABR regression models in Settings I and II respectively. The results suggest that when data are simulated from BEOI regression model as in Setting I, both the BEOI and OABR regression models generate comparable results with very small bias, RMSE close to SD, and the coverage probability being reasonably close to 0.95. Under model overparameterization, the estimate of the shape parameter α from the OABR regression model is correctly close to zero, suggesting that it is still a reasonable model in this simulation setting. When the data are simulated from the OABR regression model in Setting II (α = 0.2), the OABR regression model can successfully recover all parameters, including α. In contrast, the BEOI regression model gives biased estimates and poor coverage probabilities for most parameters.
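The data-generating process in (9) can be sketched as follows. This is an illustrative standard-library Python version, not the authors' simulation code: the constant and function names are assumptions, the parameter values are those stated above, and each non-boundary value is drawn from the BR mixture by first choosing the uniform or the Beta component.

```python
import math
import random

OMEGA = (-1.5, -0.5)
BETA = (1.5, -0.5, -0.1, 0.2)
ETA = (2.5, 0.2, 0.1, 0.1)
SD = (1.2, 0.6, 0.4)
RHO = {(0, 1): 0.4, (0, 2): 0.1, (1, 2): 0.4}

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def cholesky_sigma():
    """Lower Cholesky factor of the covariance matrix in (5) for the values above."""
    s = [[SD[i] * SD[j] * (1.0 if i == j else RHO[(min(i, j), max(i, j))])
          for j in range(3)] for i in range(3)]
    low = [[0.0] * 3 for _ in range(3)]
    for i in range(3):
        for j in range(i + 1):
            r = s[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(r) if i == j else r / low[j][j]
    return low

def simulate_subject(rng, x, u, alpha, times=range(7)):
    """Seven visits for one subject; alpha = 0 gives BEOI data, alpha > 0 OABR data."""
    ys = []
    for t in times:
        p1 = expit(OMEGA[0] + OMEGA[1] * x + u[0])
        gamma = expit(BETA[0] + BETA[1] * x + BETA[2] * t + BETA[3] * x * t + u[1])
        phi = math.exp(ETA[0] + ETA[1] * x + ETA[2] * t + ETA[3] * x * t + u[2])
        if rng.random() < p1:
            ys.append(1.0)                       # boundary value one
            continue
        q = alpha * (1.0 - abs(2.0 * gamma - 1.0))
        if rng.random() < q:
            ys.append(rng.random())              # uniform (rectangular) component
        else:
            mu = (gamma - 0.5 * q) / (1.0 - q)   # Beta component mean
            ys.append(rng.betavariate(mu * phi, (1.0 - mu) * phi))
    return ys

def simulate_dataset(n_subjects, alpha, seed=0):
    """List of (treatment indicator, outcome vector) pairs."""
    rng = random.Random(seed)
    low = cholesky_sigma()
    data = []
    for _ in range(n_subjects):
        x = 1 if rng.random() < 0.5 else 0
        z = [rng.gauss(0.0, 1.0) for _ in range(3)]
        u = [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(3)]
        data.append((x, simulate_subject(rng, x, u, alpha)))
    return data
```

Under these assumptions, `simulate_dataset(1200, alpha=0.0)` would mimic one Setting I dataset and `simulate_dataset(1200, alpha=0.2)` one Setting II dataset.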

5.2 Simulation Setting III: Data simulated from the BEOI regression model with outliers

The aim of simulation Setting III is to compare the performance of the BEOI and OABR regression models when data are simulated from the BEOI model but contaminated with outliers, in order to evaluate the influence of outliers and extremal events around the tail-area. Setting III is similar to Setting I, but we contaminate 1% of the observations, randomly selected among those with high scores (between 0.9 and 1), by decreasing them by Δ units (yij is replaced by yij − Δ, with Δ = 0.8). To visualize the outliers, we randomly select one dataset in Setting III and plot it in Figure 3. Figure 3 displays the longitudinal profiles (upper panels) of 50 randomly selected subjects and 3 contaminated subjects (denoted by black dashed, dotted, and dotdash lines), in addition to the boxplots of all subjects (lower panels) before (left panels) and after (right panels) the outlier contamination. Some observations from the three highlighted subjects are contaminated and appear as potential outliers in the form of sudden value drops. After contamination by outliers, the lower tails of the data become heavier and include more extremal values.
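The contamination step can be sketched as below. The reading that the 1% refers to the total number of observations, and that boundary values equal to 1 are excluded from contamination, is an assumption about the procedure; the function name and the seed handling are illustrative.

```python
import random

def contaminate(values, fraction=0.01, delta=0.8, seed=0):
    """Lower a random subset of high scores (0.9 < y < 1) by delta.

    Returns the contaminated copy and the sorted indices that were changed.
    """
    rng = random.Random(seed)
    out = list(values)
    n_target = round(fraction * len(out))               # assumed: 1% of all observations
    eligible = [k for k, y in enumerate(out) if 0.9 < y < 1.0]
    chosen = rng.sample(eligible, min(n_target, len(eligible)))
    for k in chosen:
        out[k] -= delta                                 # y_ij replaced by y_ij - delta
    return out, sorted(chosen)
```

Because eligible values lie above 0.9 and Δ = 0.8, every contaminated value stays inside (0.1, 0.2), so it remains a valid proportion while sitting far in the lower tail.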

Figure 3. The longitudinal profiles (upper panels) of 50 randomly selected subjects and 3 contaminated subjects (black dashed, dotted, and dotdash lines), in addition to the boxplots of all subjects (lower panels) before (left panels) and after (right panels) the outlier contamination.

We then fit the BEOI and OABR regression models in Setting III. Table 1 displays the simulation results. In comparison to the BEOI regression model, the OABR regression model provides similar estimates for parameters β and ω, but much smaller bias (e.g., −0.170 vs. 0.041 for η0 and 0.090 vs. 0.035 for η1) and much smaller RMSE (e.g., 0.180 vs. 0.073 for η0 and 0.123 vs. 0.089 for η1) for parameters η, σ’s and ρ’s. The estimates of the regression parameter vector ω from both models have small bias because the probability of being 1 is not contaminated. The estimate of the shape parameter α is 0.074 in the OABR regression model, assigning some probability to the occurrences of outliers and extremal events.

Table 1. Simulation results for Setting III: data are simulated from the BEOI regression model and contaminated by 1% outliers.

| Parameter (true value) | BEOI Bias | BEOI SD | BEOI SE | BEOI RMSE | OABR Bias | OABR SD | OABR SE | OABR RMSE |
|---|---|---|---|---|---|---|---|---|
| ω0 = −1.5 | −0.001 | 0.071 | 0.068 | 0.071 | −0.007 | 0.070 | 0.068 | 0.070 |
| ω1 = −0.5 | 0.005 | 0.102 | 0.095 | 0.102 | 0.005 | 0.099 | 0.096 | 0.098 |
| β0 = 1.5 | −0.103 | 0.033 | 0.032 | 0.108 | −0.063 | 0.034 | 0.032 | 0.072 |
| β1 = −0.5 | 0.015 | 0.044 | 0.043 | 0.047 | 0.019 | 0.045 | 0.043 | 0.049 |
| β2 = −0.1 | 0.007 | 0.006 | 0.005 | 0.009 | 0.004 | 0.006 | 0.005 | 0.007 |
| β3 = 0.2 | −0.010 | 0.008 | 0.007 | 0.012 | −0.008 | 0.008 | 0.007 | 0.011 |
| η0 = 2.5 | −0.170 | 0.059 | 0.063 | 0.180 | 0.041 | 0.061 | 0.058 | 0.073 |
| η1 = 0.2 | 0.090 | 0.084 | 0.088 | 0.123 | 0.035 | 0.082 | 0.080 | 0.089 |
| η2 = 0.1 | 0.011 | 0.017 | 0.015 | 0.020 | 0.009 | 0.017 | 0.016 | 0.019 |
| η3 = 0.1 | −0.058 | 0.024 | 0.021 | 0.063 | −0.014 | 0.023 | 0.022 | 0.028 |
| σ1 = 1.2 | 0.000 | 0.057 | 0.055 | 0.057 | 0.003 | 0.055 | 0.055 | 0.055 |
| σ2 = 0.6 | −0.043 | 0.016 | 0.015 | 0.045 | −0.029 | 0.015 | 0.014 | 0.032 |
| σ3 = 0.4 | 0.420 | 0.017 | 0.027 | 0.420 | −0.069 | 0.042 | 0.042 | 0.081 |
| ρ12 = 0.4 | −0.024 | 0.046 | 0.044 | 0.052 | −0.006 | 0.042 | 0.042 | 0.042 |
| ρ13 = 0.1 | −0.133 | 0.065 | 0.060 | 0.148 | −0.025 | 0.124 | 0.123 | 0.126 |
| ρ23 = 0.4 | −0.046 | 0.034 | 0.041 | 0.057 | −0.058 | 0.079 | 0.078 | 0.098 |
| α = 0 | | | | | 0.074 | 0.003 | 0.007 | 0.074 |

Figure 4 displays the K-L divergence measures of all observations from a randomly selected simulation dataset, estimated by the BEOI (left panel) and OABR (right panel) regression models. The BEOI regression model identifies many observations as potential outliers, indicated by K-L divergence measures larger than 3. However, using the OABR regression model, there appear to be no influential observations, with all K-L divergence measures smaller than 2. This figure suggests that in comparison to the BEOI model, the OABR regression model can effectively control the potential outlying observations.

Figure 4. Estimated K-L divergence measures from the BEOI (left panel) and OABR (right panel) regression models for a randomly selected simulation dataset in Setting III.

In conclusion, the simulation results suggest that when the data are simulated from the BEOI regression model, the overparameterized OABR regression model generates results comparable to the true BEOI regression model. However, when the data are simulated from the OABR regression model, the BEOI regression model provides parameter estimates with large bias and RMSE and poor coverage probabilities. When the data are simulated from the BEOI regression model but are contaminated by some outliers, the OABR regression model provides parameter estimates with much smaller bias and RMSE than the BEOI regression model.

6 Application to the LS-1 study

In this section, we apply the BEOI and proposed OABR regression models and the Bayesian inference framework to the motivating LS-1 study. For all results in this section, we run two parallel MCMC chains with overdispersed initial values and run each chain for 2,000 iterations. The first 1,000 iterations are discarded as burn-in and the inference is based on the remaining 1,000 iterations from each chain. Good mixing properties of the MCMC chains for all model parameters are observed in the trace plots. The potential scale reduction factors of all parameters are smaller than 1.1.

In the data analysis, we consider the following covariates: the treatment assignment variable xi (1 for treatment, and 0 for placebo), time tij, and time and treatment interaction. In model (4), we let Xi0 = (1, xi), Xi1 = (1, xi, ti, xi × ti) and Xi2 = (1, xi, ti, xi × ti). We divide the EQ-VAS variable by 100 to rescale it to the interval (0, 1]. Table 2 compares the BEOI and OABR regression models using the model selection criteria discussed in Section 4.2. The OABR regression model performs better than the BEOI regression model with larger LPML. The PBF in favor of the OABR over the BEOI regression model is much greater than 100, indicating decisive evidence of choosing the OABR regression model as the final model.

Table 2.

Model comparison statistics for the LS-1 dataset. LPML: log pseudo-marginal likelihood; PBF: pseudo-Bayes factor.

LPML PBF
BEOI 7494.85 Ref
OABR 7504.75 ≫ 100
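As a worked check of Table 2, the sketch below recovers the PBF from the two LPML values. It assumes the usual definition in which the PBF is the ratio of pseudo-marginal likelihoods, that is, the exponentiated difference in LPML; Section 4.2 remains the authoritative definition.

```python
import math

# LPML values reported in Table 2.
lpml_beoi = 7494.85
lpml_oabr = 7504.75

# Assumption: PBF (OABR vs. BEOI) = exp(LPML_OABR - LPML_BEOI).
log_pbf = lpml_oabr - lpml_beoi
pbf = math.exp(log_pbf)

print(f"log PBF = {log_pbf:.2f}")
print(f"PBF     = {pbf:.0f}")
print("PBF > 100:", pbf > 100)
```

Under this definition, an LPML difference of about 9.9 corresponds to a PBF on the order of 10^4, which is consistent with the "≫ 100" entry in the table.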

To determine the presence of possible outlying observations, Figure 5 displays the K-L divergence measures of all observations estimated from the BEOI (left panel) and OABR (right panel) regression models. The BEOI regression model identifies many observations as potential outliers, as indicated by K-L divergence measures larger than 3. However, under the OABR regression model, there seems to be only one influential observation, while all other observations have K-L divergence measures smaller than 3. This figure suggests that the OABR regression model can effectively control the potential outlying observations.

Figure 5.

Estimated K-L divergence measures from the BEOI (left panel) and OABR (right panel) regression models for the LS-1 dataset.
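A per-observation K-L divergence of this kind can be estimated directly from MCMC output. The sketch below assumes the estimator commonly paired with CPO-based diagnostics, K(P, P(−i)) = −log(CPOi) + E[log f(yi | θ) | y], with CPOi estimated by the harmonic mean of the likelihood over posterior draws; the function name and the log-likelihood values are hypothetical and purely illustrative.

```python
import math


def kl_divergence_from_loglik(loglik_draws):
    """Estimate K(P, P_(-i)) for one observation from MCMC output.

    `loglik_draws` holds log f(y_i | theta^(q)) for q = 1..Q posterior draws.
    """
    q = len(loglik_draws)
    # log of the posterior mean of 1 / f, computed stably with a log-sum-exp.
    neg = [-ll for ll in loglik_draws]
    shift = max(neg)
    log_mean_inv = shift + math.log(sum(math.exp(v - shift) for v in neg) / q)
    log_cpo = -log_mean_inv                 # log CPO_i (harmonic-mean estimate)
    mean_loglik = sum(loglik_draws) / q     # estimate of E[log f(y_i | theta) | y]
    return -log_cpo + mean_loglik


# Hypothetical log-likelihood draws for two observations (illustrative only).
typical = [-1.00, -1.10, -0.90, -1.05, -0.95]
unusual = [-2.0, -9.0, -3.0, -12.0, -2.5]

k_typical = kl_divergence_from_loglik(typical)
k_unusual = kl_divergence_from_loglik(unusual)
for name, k in [("typical", k_typical), ("unusual", k_unusual)]:
    print(f"{name}: K-L = {k:.3f}, flagged (> 3): {k > 3}")
```

An observation whose likelihood varies strongly across posterior draws receives a large divergence, which is what the cutoff of 3 used in the text is intended to flag.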

Table 3 displays the posterior means, standard deviations (SD), and 95% equal-tail credible intervals (CI) from the BEOI and OABR regression models. Parameter estimates differ noticeably between the two models. For creatine patients, the odds ratio of reporting 100 in the EQ-VAS score, compared with placebo patients, is 0.604 (exp(−0.505), 95% CI: [0.348, 0.970]) in the BEOI regression model vs. 0.607 (exp(−0.500), 95% CI: [0.361, 1.065]) in the OABR regression model. The parameters in the second part of Table 3 represent the covariate effects on the mean EQ-VAS score conditional on not being one. Thus, negative parameters suggest deterioration in patients’ global assessment of their health as represented by the EQ-VAS score. Conditional on other covariates and the random effects, parameter interpretation can be expressed in terms of the covariate effect on the odds γij/(1 − γij) (rescaled) or 100γij/(100 − 100γij) in the original scale for the OABR regression model. Specifically, for creatine patients, the ratio between the expected EQ-VAS score γij and the difference to perfect health (1 − γij) is 0.941 (exp(−0.061), 95% CI: [0.872, 1.012]) times the ratio of placebo patients. For a one-year increase in time, the ratio between the expected EQ-VAS score and the difference to perfect health decreases by 7.5% (1 − exp(−0.078), 95% CI: [0.066, 0.085]). The regression parameters in the BEOI regression model can be interpreted in a similar way. The parameters in the third part of Table 3 represent the covariate effects on the precision parameter ϕij. The results suggest that the precision parameter is not affected by treatment, but changes over time.
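The ratios quoted above follow from exponentiating the Table 3 summaries. The sketch below reproduces that arithmetic; the dictionary labels are hypothetical, and it mirrors the text in transforming the tabulated posterior means and credible limits (with the full set of posterior draws, one would instead transform each draw before summarizing).

```python
import math

# Posterior means and 95% credible limits on the logit scale, from Table 3.
effects = {
    "P(one), Trt, BEOI": (-0.505, -1.056, -0.030),
    "P(one), Trt, OABR": (-0.500, -1.019, 0.063),
    "Mean, Trt, OABR": (-0.061, -0.137, 0.012),
    "Mean, Time, OABR": (-0.078, -0.089, -0.068),
}

for label, (mean, lower, upper) in effects.items():
    ratio, lo, hi = (math.exp(v) for v in (mean, lower, upper))
    print(f"{label}: ratio = {ratio:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")

# Yearly relative decrease in the odds gamma / (1 - gamma) under the OABR model.
time_mean, time_lower, time_upper = effects["Mean, Time, OABR"]
decrease = 1 - math.exp(time_mean)
decrease_ci = (1 - math.exp(time_upper), 1 - math.exp(time_lower))
print(f"decrease per year = {decrease:.3f}, "
      f"95% CI = [{decrease_ci[0]:.3f}, {decrease_ci[1]:.3f}]")
```

Because the exponential is monotone, transforming the credible limits gives the credible limits of the ratio; note that the limits swap order when converted to a decrease.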

Table 3.

Results of fitting the BEOI and OABR regression models in the LS-1 dataset.

BEOI regression model OABR regression model

Mean SD 95% CI Mean SD 95% CI
For probability of being one
Int −6.237 0.388 −7.053 −5.587 −6.367 0.369 −7.122 −5.687
Trt −0.505 0.255 −1.056 −0.030 −0.500 0.273 −1.019 0.063
For conditional mean
Int 1.573 0.036 1.508 1.635 1.599 0.026 1.547 1.649
Trt −0.041 0.040 −0.121 0.024 −0.061 0.037 −0.137 0.012
Time −0.076 0.005 −0.087 −0.066 −0.078 0.005 −0.089 −0.068
Time:Trt 0.001 0.008 −0.014 0.015 0.002 0.008 −0.013 0.017
For precision parameter
Int 3.133 0.068 3.024 3.251 3.338 0.055 3.227 3.441
Trt −0.018 0.077 −0.179 0.143 −0.075 0.076 −0.231 0.068
Time −0.044 0.018 −0.081 −0.012 −0.057 0.019 −0.095 −0.019
Time:Trt −0.004 0.023 −0.054 0.041 0.004 0.027 −0.049 0.056
σ1 2.555 0.241 2.084 3.081 2.632 0.225 2.212 3.086
σ2 0.670 0.013 0.646 0.697 0.654 0.014 0.627 0.681
σ3 0.802 0.025 0.759 0.853 0.588 0.031 0.524 0.648
ρ12 0.610 0.051 0.522 0.705 0.646 0.053 0.542 0.742
ρ13 −0.118 0.092 −0.258 0.068 −0.041 0.096 −0.226 0.153
ρ23 0.657 0.026 0.603 0.706 0.643 0.037 0.569 0.717
α 0.053 0.008 0.039 0.068

Note that the estimate of the correlation coefficient ρ12 between the two random effects υi0 and υi1 is 0.646 (95% CI: [0.542, 0.742]). The significant positive correlation coefficient indicates that patients with a better global assessment of their health (higher υi1) are more likely to report perfect health (higher υi0). The facts that the estimate of the shape parameter α is 0.053 (95% CI: [0.039, 0.068]) and that the PBF in favor of the OABR regression model over the BEOI regression model is much larger than 100 (as in Table 2) suggest the existence of potential outliers and heavier tails than the Beta distributions can model.
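To give a sense of what a shape parameter of this size implies, the sketch below evaluates a Beta rectangular density at the OABR posterior means for a placebo patient at baseline with random effects set to zero. Several things here are assumptions rather than statements from the text: the reparameterization of Bayes et al. (2012), in which the uniform component has weight θ = α(1 − |2γ − 1|) and the Beta component has mean μ = (γ − θ/2)/(1 − θ); a logit link for γ; and a log link for ϕ. The model specification earlier in the paper remains the authoritative definition.

```python
import math


def beta_pdf(y, a, b):
    """Beta(a, b) density at y in (0, 1), via log-gamma for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(y) + (b - 1) * math.log(1 - y))


def beta_rectangular_pdf(y, gamma, phi, alpha):
    """Beta rectangular density under the assumed (gamma, phi, alpha) form."""
    theta = alpha * (1 - abs(2 * gamma - 1))   # weight of the uniform part
    mu = (gamma - theta / 2) / (1 - theta)     # mean of the Beta part
    return theta + (1 - theta) * beta_pdf(y, mu * phi, (1 - mu) * phi)


# OABR posterior means from Table 3 (intercepts only; links are assumptions).
gamma = 1 / (1 + math.exp(-1.599))   # conditional mean on the (0, 1) scale
phi = math.exp(3.338)                # precision
alpha = 0.053                        # shape parameter

theta = alpha * (1 - abs(2 * gamma - 1))
print(f"gamma = {gamma:.3f}, phi = {phi:.1f}, uniform weight = {theta:.4f}")

# Density in the lower tail, at y = 0.10 (an EQ-VAS score of 10).
y = 0.10
br_tail = beta_rectangular_pdf(y, gamma, phi, alpha)
beta_tail = beta_rectangular_pdf(y, gamma, phi, 0.0)   # alpha = 0 gives a plain Beta
print("Beta rectangular tail density:", br_tail)
print("Plain Beta tail density      :", beta_tail)
```

Even a small uniform weight keeps the density bounded away from zero far from the mean, which is how the model accommodates occasional very low scores that a plain Beta distribution would treat as nearly impossible.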

7 Discussion

In this article, we propose the one-augmented Beta rectangular (OABR) regression model for responses measured in the interval (0, 1], which is extended to data in the closed unit interval [0, 1]. This model accounts for not only the within-subject correlation and the occurrence of boundary values, but also the over-occurrence of tail-area events including heavy tails and outlying observations. We adopt a Bayesian inference framework based on Markov chain Monte Carlo (MCMC) simulation for parameter estimation. The extensive simulation study suggests that the proposed OABR regression model improves the accuracy of parameter estimates when outliers and heavy tails exist. In comparison, the regression model based on the Beta distribution (the BEOI regression model in the GAMLSS family) provides parameter estimates with large bias and RMSE and poor coverage probabilities in the presence of heavy tails or outliers. We apply the proposed model to the motivating LS-1 study dataset. The OABR regression model fits better than the BEOI regression model. The treatment creatine has no significant effect on either the probability of reporting perfect health (100 in the EQ-VAS score) or the mean EQ-VAS score. This finding is not surprising because the LS-1 study was terminated early for futility based on results of a planned interim analysis (Kieburtz et al., 2015). However, the patient-reported global health assessment deteriorates over time. The proposed model and Bayesian inference can be easily implemented in publicly available software packages such as BUGS and Stan, and are therefore readily accessible to applied researchers.

There are some limitations in our proposed OABR regression model that we will address in future studies. First, the parameters of the augmented Beta rectangular distribution (e.g., p1ij, γij, and ϕij) are modeled as linear functions of covariates. However, the linear assumption may not be realistic in some scenarios. As pointed out by one reviewer, the GAMLSS family allows non-linear or smooth functions in the regression models and it can handle smooth effects of covariates. In our future research, we would like to investigate a class of varying-coefficient models (Sun & Wu, 2005) that incorporate the time-dependent covariate effects via penalized splines with a truncated polynomial basis and a fixed number of knots (Ruppert, 2002). Second, both monotone (e.g., dropout) and non-monotone (e.g., missed visit) missing data exist in the LS-1 study. In this article, we assume they are all missing at random (MAR). However, monotone missingness is likely to be caused by terminal events such as dropout or death. The terminal events are often correlated with health outcomes such as EQ-VAS and they create the issue of informative censoring. How to address the informative censoring issue in the proposed augmented Beta rectangular regression model is an interesting direction for future research. Moreover, when the non-monotone missingness due to missed visits is associated with either the unobserved value or the underlying health status (e.g., sicker patients are more likely to miss visits), the non-monotone missing data are missing not at random (MNAR) (Little & Rubin, 2002). Under the MNAR assumption, the missing data mechanism needs to be modeled simultaneously with the outcome variable to avoid biased parameter estimates (Diggle et al., 2002).
In addition, we have chosen a multivariate normal distribution for the random effects vector because it is flexible in modeling the covariance structure within and between various types of recurrent events and it has a meaningful interpretation in terms of correlation. In generalized linear mixed models, misspecification of the random effects distribution has little impact on the parameters that are not associated with the random effects (Jacqmin-Gadda et al., 2007; Rizopoulos et al., 2008; McCulloch et al., 2011). The impact of random effects misspecification in the proposed modeling framework warrants further investigation. We will also investigate the effect of random effects misspecification and relax the normality assumption by considering a Bayesian non-parametric (BNP) framework based on Dirichlet process mixtures (Escobar, 1994).

Supplementary Material

Web Supplement

Acknowledgments

The project described was supported by the National Center for Research Resources and the National Center for Advancing Translational Sciences, National Institutes of Health, through Grant KL2 TR000370. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. The authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing high-performance computing resources that have contributed to the research results reported within this article. URL: http://www.tacc.utexas.edu. The authors thank the editor, associate editor, and two anonymous referees for their reading and valuable comments.

References

  1. Bayes CL, Bazán JL, García C, et al. A new robust regression model for proportions. Bayesian Analysis. 2012;7(4):841–866.
  2. Cancho VG, Dey DK, Lachos VH, Andrade MG. Bayesian nonlinear regression models with scale mixtures of skew-normal distributions: estimation and case influence diagnostics. Computational Statistics & Data Analysis. 2011;55(1):588–602.
  3. Carlin BP, Louis TA. Bayesian Methods for Data Analysis. Chapman & Hall/CRC; 2009.
  4. Cleveland WS. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association. 1979;74(368):829–836.
  5. Dey DK, Chen MH, Chang H. Bayesian approach for nonlinear random effects models. Biometrics. 1997:1239–1252.
  6. Diggle P, Heagerty P, Liang KY, Zeger S. Analysis of Longitudinal Data. Oxford University Press; 2002.
  7. Duane S, Kennedy AD, Pendleton BJ, Roweth D. Hybrid Monte Carlo. Physics Letters B. 1987;195(2):216–222.
  8. Dunson DD. Bayesian methods for latent trait modelling of longitudinal data. Statistical Methods in Medical Research. 2007;16(5):399–415. doi: 10.1177/0962280206075309.
  9. Elm JJ. Design innovations and baseline findings in a long-term Parkinson’s trial: The national institute of neurological disorders and stroke exploratory trials in Parkinson’s Disease Long-Term Study–1. Movement Disorders. 2012;27(12):1513–1521. doi: 10.1002/mds.25175.
  10. Escobar MD. Estimating normal means with a Dirichlet process prior. Journal of the American Statistical Association. 1994;89(425):268–277.
  11. Ferrari S, Cribari-Neto F. Beta regression for modelling rates and proportions. Journal of Applied Statistics. 2004;31(7):799–815.
  12. Figueroa-Zúniga JI, Arellano-Valle RB, Ferrari SLP. Mixed beta regression: A Bayesian perspective. Computational Statistics & Data Analysis. 2013;61:137–147.
  13. Galvis DM, Bandyopadhyay D, Lachos VH. Augmented mixed beta regression models for periodontal proportion data. Statistics in Medicine. 2014;33(21):3759–3771. doi: 10.1002/sim.6179.
  14. García CB, García Pérez J, van Dorp JR. Modeling heavy-tailed, skewed and peaked uncertainty phenomena with bounded support. Statistical Methods & Applications. 2011;20(4):463–486.
  15. Geisser S. Predictive Inference. Vol. 55. CRC Press; 1993.
  16. Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB. Bayesian Data Analysis. CRC Press; 2013.
  17. Ghosh P, Hanson T. A semiparametric Bayesian approach to multivariate longitudinal data. Australian & New Zealand Journal of Statistics. 2010;52(3):275–288. doi: 10.1111/j.1467-842X.2010.00581.x.
  18. Hahn ED. Mixture densities for project management activity times: A robust approach to PERT. European Journal of Operational Research. 2008;188(2):450–459.
  19. Hatfield LA, Boye ME, Carlin BP. Joint modeling of multiple longitudinal patient-reported outcomes and survival. Journal of Biopharmaceutical Statistics. 2011;21(5):971–991. doi: 10.1080/10543406.2011.590922.
  20. Hatfield LA, Boye ME, Hackshaw MD, Carlin BP. Multilevel Bayesian models for survival times and longitudinal patient-reported outcomes with many zeros. Journal of the American Statistical Association. 2012;107(499):875–885.
  21. Hoffman MD, Gelman A. The no-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research. 2013 in press.
  22. Jacqmin-Gadda H, Sibillot S, Proust C, Molina JM, Thiébaut R. Robustness of the linear mixed model to misspecified error distribution. Computational Statistics & Data Analysis. 2007;51(10):5142–5154.
  23. Johnson NL, Kotz S, Balakrishnan N. Continuous univariate distributions. Vol. 2. New York: 1994.
  24. Kieburtz K, Tilley BC, Elm JJ, et al. Effect of creatine monohydrate on clinical progression in patients with Parkinson disease: A randomized clinical trial. JAMA. 2015;313(6):584–593. doi: 10.1001/jama.2015.120.
  25. Kieschnick R, McCullough BD. Regression analysis of variates observed on (0, 1): percentages, proportions and fractions. Statistical Modelling. 2003;3(3):193–213.
  26. Lee SY, Song XY. Evaluation of the Bayesian and maximum likelihood approaches in analyzing structural equation models with small sample sizes. Multivariate Behavioral Research. 2004;39(4):653–686. doi: 10.1207/s15327906mbr3904_4.
  27. Little RJA, Rubin DB. Statistical Analysis With Missing Data. Wiley, John & Sons; 2002.
  28. McCulloch CE, Neuhaus JM, et al. Misspecifying the shape of a random effects distribution: why getting it wrong may not matter. Statistical Science. 2011;26(3):388–402.
  29. Neal RM. An improved acceptance procedure for the hybrid Monte Carlo algorithm. Journal of Computational Physics. 1994;111(1):194–203.
  30. Ospina R, Ferrari SLP. Inflated Beta distributions. Statistical Papers. 2010;51(1):111–126.
  31. Peng F, Dey DK. Bayesian analysis of outlier problems using divergence measures. Canadian Journal of Statistics. 1995;23(2):199–213.
  32. Rigby RA, Stasinopoulos DM. Generalized additive models for location, scale and shape. Journal of the Royal Statistical Society: Series C (Applied Statistics). 2005;54(3):507–554.
  33. Rizopoulos D, Verbeke G, Molenberghs G. Shared parameter models under random effects misspecification. Biometrika. 2008;95(1):63–74.
  34. Rogers JA, Polhamus D, Gillespie WR, Ito K, Romero K, Qiu R, Stephenson D, Gastonguay MR, Corrigan B. Combining patient-level and summary-level data for Alzheimers disease modeling and simulation: a beta regression meta-analysis. Journal of Pharmacokinetics and Pharmacodynamics. 2012;39(5):479–498. doi: 10.1007/s10928-012-9263-3.
  35. Ruppert D. Selecting the number of knots for penalized splines. Journal of Computational and Graphical Statistics. 2002;11(4):735–757.
  36. Sinha D, Dey DK. Semiparametric Bayesian analysis of survival data. Journal of the American Statistical Association. 1997;92(439):1195–1212.
  37. Smithson M, Verkuilen J. A better lemon squeezer? Maximum-likelihood regression with beta-distributed dependent variables. Psychological Methods. 2006;11(1):54. doi: 10.1037/1082-989X.11.1.54.
  38. Stan Development Team. Stan Modeling Language Users Guide and Reference Manual, Version 2.1. 2014.
  39. Stasinopoulos DM, Rigby RA. Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software. 2007;23(7):1–46.
  40. Sun Y, Wu H. Semiparametric time-varying coefficients regression model for longitudinal data. Scandinavian Journal of Statistics. 2005;32(1):21–47.
  41. Verkuilen J, Smithson M. Mixed and mixture regression models for continuous bounded responses using the Beta distribution. Journal of Educational and Behavioral Statistics. 2012;37(1):82–113.
  42. Vieira AMC, Hinde JP, Demétrio CGB. Zero-inflated proportion data models applied to a biological control assay. Journal of Applied Statistics. 2000;27(3):373–389.
  43. Zhao L, Chen Y, Schaffner DW. Comparison of logistic regression and linear regression in modeling percentage data. Applied and Environmental Microbiology. 2001;67(5):2129–2135. doi: 10.1128/AEM.67.5.2129-2135.2001.
