Biostatistics (Oxford, England). 2015 Mar 19;16(3):441–453. doi: 10.1093/biostatistics/kxv005

Bayesian partial linear model for skewed longitudinal data

Yuanyuan Tang 1, Debajyoti Sinha 1, Debdeep Pati 1,*, Stuart Lipsitz 2, Steven Lipshultz 3
PMCID: PMC5963473  PMID: 25792623

Abstract

Unlike the majority of current statistical models and methods for highly skewed longitudinal data, which focus on the mean response, we present a novel model for such data that accommodates a partially linear median regression function, a skewed error distribution, and within-subject association structures. We provide theoretical justifications for our methods, including asymptotic properties of the posterior and the associated semiparametric Bayesian estimators. We also provide simulation studies to investigate the finite sample properties of our methods. Several advantages of our method compared with existing methods are demonstrated via analysis of a cardiotoxicity study of children of HIV-infected mothers.

Keywords: Dirichlet process, Median regression, Partial linear model, Semiparametric, Skewed error

1. Introduction

Existing methods for the analysis of continuous longitudinal responses (Verbeke and Molenberghs, 2009; Diggle and others, 2002) mostly focus on estimating the mean response function. These are appropriate when the outcome variable is either approximately Gaussian or symmetric. When the response variable is heavily skewed, a model based on median regression (e.g., Koenker and Bassett, 1978) may be more appropriate than mean regression-based models. The assumption of symmetric errors is not valid for many real-life biomedical and econometric studies. For example, in the Pediatric Pulmonary and Cardiac Complications (in short, P2C2) study of children of HIV-infected mothers, the longitudinal response of interest is the interval between consecutive R-waves measured via EKG (called RRBO) during clinic visits. This RRBO is known to be heavily skewed even after a log-transformation. In general, the main difficulties of using a transformation-based analysis of a heavily skewed response such as RRBO include determining a suitable transformation to symmetry, evaluating the covariate effects on RRBO from the fitted model of the transformed RRBO, and specifying prior opinions about the regression function of the transformed RRBO using the available prior opinion about the untransformed RRBO. To address these issues, we develop a flexible Bayesian model for skewed longitudinal responses and focus on the median regression function.

Our second important goal is to present a partial linear model (Härdle and Liang, 2007) to incorporate the effect of one of the covariates (say, time) on the median as an unknown nonparametric function, while providing a good physical interpretation of the parametric effects of the rest of the covariates of interest (say, HIV status of a child for the P2C2 study). Existing partial linear models (e.g., Speckman, 1988) for longitudinal data often only allow either mean regression or a parametric error density (Ho and Lin, 2010). The M-estimation procedure of He and others (2002) uses a partial linear model for the median. However, it does not specify the likelihood or any within-subject dependence structure. On the other hand, a likelihood-based Bayesian analysis can ensure accurate estimation, characterize all sources of uncertainty, and is suitable for prediction. Some authors (e.g., Kottas and Gelfand, 2001; Hanson and Johnson, 2002; Lin and others, 2012) implemented Bayesian median regression for univariate responses. In Section 2, we develop a novel Bayesian median regression method which incorporates a partial linear model for skewed longitudinal responses. Unlike existing Bayesian approaches for median regression of longitudinal responses using subject-specific random effects (e.g., Reich and others, 2010; Yue and Rue, 2011), our copula (Pitt and others, 2006)-based model can accommodate a wide range of within-subject dependencies while maintaining any desired form of the marginal error density. For example, in the P2C2 study, the copula structure allows us to maintain the desired stationarity of the marginal error density. Unlike our copula model, a random effects model has various shortcomings, including the assumption of a parametric random effects density and the specification of hyper-priors associated with the parameters of the random effects density.
For longitudinal data, the class of correlation matrices can also be enriched using random effects by parameterizing it via partial autocorrelations (Lee and others, 2013). However, our main objective in using a copula is to have a flexible dependence structure while preserving the median zero property of the marginal residual density. For random effects models, it is not straightforward to enforce a median zero marginal error density unless we make the restrictive assumption of a symmetric random effects density. We also demonstrate the ease of specifying the priors of our model based on the available prior opinion about the marginal behavior of the response and the dependence structure. For example, to specify the prior process on the marginal error density, we only use a “guess” (prior mean) of the error density along with a confidence/precision around the “guess”.

Our other novel contribution is to provide a rigorous theoretical justification for our semiparametric Bayesian method for longitudinal response in Section 3. Although consistency of the frequentist estimators of a partial linear model has been investigated previously (Härdle and Liang, 2007; Bhattacharya and Zhao, 1997), analogous results within a fully Bayesian paradigm are restricted to parametric error distributions (Bickel and others, 2012), semiparametric mean regression for univariate response (Amewou-Atisso and others, 2003), and completely nonparametric regression and density estimation for multivariate response (Shen and others, 2013; Pati and others, 2013). Instead, our consistency results pertain to longitudinal response with within-subject dependencies. We obtain posterior consistency of the regression parameters as well as of the nonparametric effect of time using only minor regularity conditions on the tail of the error density and on the level of approximation of the nonparametric effect of time. Our minor sufficient conditions for posterior consistency assure a practitioner that our posterior inference is driven by the observed data when the data-information is sufficiently large. In addition, existing Bayesian methods for partial linear models, even for univariate skewed response, lack computational tools that are easy to implement in widely available software. Our Bayesian method uses a Markov chain Monte Carlo (MCMC) procedure implementable via free software such as JAGS (code is available from the authors). Section 2 describes the model and prior specifications. Section 3 studies the theoretical properties of the model. Section 4 presents simulation study results. Section 5 illustrates various advantages of our methods via the analysis of the so-called P2C2 study of children of HIV-infected mothers. Some remarks and discussions are in Section 6. Proofs are mostly deferred to supplementary material available at Biostatistics online.

2. Model and prior specification

Let Inline graphic for subject Inline graphic denote the observed vectors of the longitudinal responses measured at irregular time points Inline graphic with Inline graphic. The partial linear regression model for the response variable Inline graphic at time Inline graphic is given by

(2.1)

where Inline graphic is the Inline graphic covariate at Inline graphic, Inline graphic is the vector of regression parameters and Inline graphic is an unknown smooth function of time. Further, Inline graphic is the error vector with median zero and possibly skewed marginal density Inline graphic. This ensures that the median of Inline graphic is Inline graphic. For now, we assume that the marginal error density Inline graphic is stationary (appropriate for our motivating study). However, our model in (2.1) and corresponding methods can straightforwardly deal with non-stationary errors (discussed later). We approximate the true unknown smooth Inline graphic in (2.1) using a piecewise-polynomial B-spline function (Schumaker, 2007) with quantiles of observation time points Inline graphic selected as the knots of the basis functions Inline graphic for a suitably chosen large Inline graphic. As a consequence, the median Inline graphic is approximated by Inline graphic, for some vector Inline graphic. We later prove that Inline graphic can approximate the unknown Inline graphic arbitrarily well either by using a prior for Inline graphic or by allowing Inline graphic to grow at a certain rate with the sample size. We also demonstrate that such a specification facilitates desired asymptotic properties of the Bayesian estimators of Inline graphic and unspecified Inline graphic.
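The spline approximation described above can be sketched in a few lines. This is only an illustration under stated assumptions: the median is taken to be a linear covariate part plus a B-spline expansion of the time effect, observation times lie in [0, 1], and the names `quantile_knots`, `bspline_basis`, `approx_median`, and all numeric values are hypothetical rather than taken from the paper or its JAGS code.

```python
import random

def quantile_knots(times, probs):
    """Interior knots placed at empirical quantiles of the observed times."""
    ordered = sorted(times)
    return [ordered[min(int(p * len(ordered)), len(ordered) - 1)] for p in probs]

def bspline_basis(t, knots, degree, lo=0.0, hi=1.0):
    """All B-spline basis values at t in [lo, hi] (Cox-de Boor recursion)."""
    # Clamped knot vector: each boundary knot is repeated degree + 1 times.
    kv = [lo] * (degree + 1) + list(knots) + [hi] * (degree + 1)
    n_basis = len(kv) - degree - 1
    # Degree-0 basis: indicators of the half-open knot intervals.
    b = [1.0 if kv[i] <= t < kv[i + 1] else 0.0 for i in range(len(kv) - 1)]
    if t >= hi:
        b[n_basis - 1] = 1.0  # close the last non-empty interval at hi
    for d in range(1, degree + 1):
        nxt = []
        for i in range(len(kv) - 1 - d):
            left, right = kv[i + d] - kv[i], kv[i + d + 1] - kv[i + 1]
            val = 0.0
            if left > 0:
                val += (t - kv[i]) / left * b[i]
            if right > 0:
                val += (kv[i + d + 1] - t) / right * b[i + 1]
            nxt.append(val)
        b = nxt
    return b

def approx_median(x, beta, t, alpha, knots, degree):
    """Linear covariate part plus spline approximation of the time effect."""
    basis = bspline_basis(t, knots, degree)
    linear = sum(xi * bi for xi, bi in zip(x, beta))
    return linear + sum(a * v for a, v in zip(alpha, basis))

if __name__ == "__main__":
    rng = random.Random(1)
    times = [rng.random() for _ in range(500)]        # observation times on [0, 1]
    knots = quantile_knots(times, [0.25, 0.5, 0.75])  # three interior knots
    alpha = [0.0, 0.5, 1.0, 0.5, 0.2, 0.0]            # hypothetical spline coefficients
    print(approx_median([1.0, 0.0], [3.0, -2.0], 0.4, alpha, knots, degree=2))
```

With three interior knots and quadratic pieces the basis has six functions, so the coefficient vector has length six. Increasing the number of quantile knots enlarges the basis, which is the sense in which the approximation can be refined.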

We model our skewed median 0 marginal error density Inline graphic, with cumulative distribution function (cdf) Inline graphic, in (2.1) as

(2.2)

where Inline graphic is the skewness parameter and Inline graphic is an unknown non-increasing density function with support Inline graphic. The class of densities in (2.2) is related to the split-density class of Geweke (1989). When Inline graphic, Inline graphic is left (right) skewed. This class of densities encompasses a large subset of skewed densities with unique median and mode at Inline graphic. Using the result of Feller (2008) that any decreasing density Inline graphic with support Inline graphic can be expressed as a mixture of uniforms, Inline graphic in (2.2) can be expressed as a mixture of uniform kernels:

(2.3)

where Inline graphic is an unknown cdf with support Inline graphic. Consequently, a flexible prior for Inline graphic will induce a flexible prior for Inline graphic. We later show that it is straightforward to determine priors for Inline graphic and Inline graphic using the prior opinion about Inline graphic. Kottas and Gelfand (2001) used a sub-class of (2.2) where Inline graphic was assumed to be a mixture of half Gaussian distributions.
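The mixture-of-uniforms representation can be illustrated with a short worked equation. The notation here is assumed for illustration and may differ from the paper's own symbols in (2.3): $f_0$ is a non-increasing density on $(0,\infty)$, $G$ is the mixing cdf, and $\theta$ is the upper endpoint of a uniform kernel.

```latex
% Assumed notation: f_0 non-increasing density on (0, \infty), G mixing cdf.
\[
  f_0(u) \;=\; \int_0^\infty \frac{1}{\theta}\,\mathbf{1}(0 < u < \theta)\,\mathrm{d}G(\theta)
         \;=\; \int_u^\infty \frac{1}{\theta}\,\mathrm{d}G(\theta), \qquad u > 0 .
\]
% Example: G is a Gamma(shape 2, rate \lambda) cdf, so dG = \lambda^2 \theta e^{-\lambda\theta} d\theta.
\[
  f_0(u) \;=\; \int_u^\infty \frac{1}{\theta}\,\lambda^{2}\,\theta\,e^{-\lambda\theta}\,\mathrm{d}\theta
         \;=\; \lambda\,e^{-\lambda u}, \qquad u > 0 .
\]
```

The first line shows why any such mixture is non-increasing: the region of integration shrinks as $u$ grows. The second line shows that a gamma mixing distribution with shape 2 yields an exponential density.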

We use a copula model (Sklar, 1959; Nelsen, 2007), a popular tool for constructing a multivariate density with pre-specified structures of marginal densities, to specify a multivariate density Inline graphic of error vector Inline graphic as

(2.4)

where Inline graphic is a Inline graphic-variate copula-density with corresponding marginal (univariate) cdf and density Inline graphic and Inline graphic respectively. A copula model can incorporate a very flexible class of dependence structures, including non-linear dependence, and any multivariate density can be expressed as a copula. For ease of implementation, we focus on the Gaussian copula (Pitt and others, 2006), where Inline graphic is the standard normal cdf and Inline graphic is the joint multivariate normal density with mean Inline graphic and correlation matrix Inline graphic. However, our methodology can accommodate any other parametric copula structure. We assume Inline graphic to be a known function of an unknown parameter vector Inline graphic. Examples of Inline graphic include the uniform correlation matrix with common Inline graphic for all Inline graphic and the exponentiated correlation matrix with Inline graphic. These Inline graphic and Inline graphic from two different subjects may even have different forms with no common parameters. For example, the two HIV status groups of the P2C2 study have different within-subject associations, accommodated by using a uniform correlation matrix and an exponentiated correlation matrix with different parameters for the two groups. Also note that non-stationarity of Inline graphic in (2.1) can be easily accommodated using different error distributions Inline graphic for time point Inline graphic. This joint density ensures the desired stationarity of the marginal density Inline graphic for RRBO while allowing a flexible within-subject association.
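To make the copula construction concrete, the following sketch draws one error vector whose components share a median-zero skewed marginal while being dependent through a Gaussian copula with uniform (equi-) correlation. It is a simplified illustration, not the authors' implementation: the two-piece exponential marginal, the function names, and the parameter values are all assumptions, and the one-factor construction used for the latent normals requires a non-negative correlation.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the copula transform

def two_piece_exp_quantile(u, scale_neg, scale_pos):
    """Quantile function of a median-zero two-piece exponential (assumed marginal)."""
    if u < 0.5:
        return scale_neg * math.log(2.0 * u)
    return -scale_pos * math.log(2.0 * (1.0 - u))

def sample_errors(m, rho, scale_neg, scale_pos, rng):
    """One m-vector of errors from a Gaussian copula with uniform correlation rho >= 0."""
    shared = rng.gauss(0.0, 1.0)  # common factor inducing equal pairwise correlation
    errors = []
    for _ in range(m):
        z = math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
        u = STD_NORMAL.cdf(z)                 # latent normal -> uniform
        u = min(max(u, 1e-12), 1.0 - 1e-12)   # guard against log(0) in the tails
        errors.append(two_piece_exp_quantile(u, scale_neg, scale_pos))
    return errors

if __name__ == "__main__":
    rng = random.Random(2015)
    print(sample_errors(m=5, rho=0.6, scale_neg=1.0, scale_pos=3.0, rng=rng))
```

Because each component is a monotone transform of a standard normal variate, every marginal keeps median zero regardless of the correlation, which is the property the copula formulation is meant to preserve.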

Given observed data Inline graphic, the likelihood Inline graphic is proportional to Inline graphic, where Inline graphic. The prior distributions for Inline graphic, Inline graphic, Inline graphic, Inline graphic and Inline graphic are assumed to be mutually independent with: (i) Inline graphic and Inline graphic have different independent multivariate normal priors Inline graphic and Inline graphic, respectively; (ii) Inline graphic has a Gamma prior Inline graphic and Inline graphic has an appropriate prior Inline graphic on its range of possible values; (iii) the prior process Inline graphic of the unknown mixing distribution Inline graphic of (2.3) is a Dirichlet process (Ferguson, 1974) Inline graphic, where Inline graphic is the prior mean (“prior guess”) of Inline graphic and Inline graphic is the pre-specified precision (or confidence) around Inline graphic. The resulting posterior density Inline graphic is proportional to Inline graphic.

A major practical advantage of our semiparametric Bayesian model is that the hyperparameter Inline graphic of the prior Inline graphic of Inline graphic in (2.3) is specified using the prior guess Inline graphic of the marginal error density Inline graphic, because Inline graphic corresponds to the unique prior guess Inline graphic for Inline graphic via Inline graphic. The proof is very similar to the result of Khintchine (1938). For example, to obtain an exponential “prior guess” for the density Inline graphic for Inline graphic, we need to choose a gamma distribution for Inline graphic. To obtain a heavy-tailed Inline graphic as a guess, Inline graphic has to be set to an inverse-gamma density. The precision parameter Inline graphic of Inline graphic reflects the a priori uncertainty of the prior guess Inline graphic; small values of Inline graphic imply that the resulting Inline graphic for Inline graphic can be very different from the guess Inline graphic. To facilitate convenient implementation of the MCMC algorithm to sample from the posterior Inline graphic via standard software such as JAGS, we use a finite approximation of the constructive definition (Ishwaran and James, 2001) of the DP mixture prior for Inline graphic as Inline graphic, where Inline graphic are independent draws from Inline graphic, Inline graphic for Inline graphic are independent with Inline graphic density, Inline graphic and Inline graphic, Inline graphic, Inline graphic.
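The finite approximation described above can be sketched as a truncated stick-breaking construction. The sketch assumes the usual form in which the stick proportions are Beta(1, precision) draws, the last proportion is set to one so the weights sum to one, and the atoms are independent draws from the base distribution. The function names, the truncation level, and the gamma base distribution are illustrative choices, not the authors' JAGS code.

```python
import random

def truncated_stick_breaking(n_atoms, precision, base_draw, rng):
    """Weights and atoms of a truncated stick-breaking approximation to one DP draw."""
    weights, remaining = [], 1.0
    for k in range(n_atoms):
        # The final proportion is fixed at 1 so that the weights sum to one.
        v = 1.0 if k == n_atoms - 1 else rng.betavariate(1.0, precision)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    atoms = [base_draw(rng) for _ in range(n_atoms)]
    return weights, atoms

def mixture_of_uniforms_density(u, weights, atoms):
    """Density at u > 0 of the mixture sum_k w_k * Uniform(0, theta_k)."""
    return sum(w / theta for w, theta in zip(weights, atoms) if u < theta)

if __name__ == "__main__":
    rng = random.Random(7)
    # Hypothetical base distribution: gamma with shape 2 and scale 1
    # (random.gammavariate takes shape and scale, in that order).
    base = lambda r: r.gammavariate(2.0, 1.0)
    w, theta = truncated_stick_breaking(n_atoms=50, precision=1.0, base_draw=base, rng=rng)
    print([round(mixture_of_uniforms_density(u, w, theta), 4) for u in (0.1, 0.5, 1.0, 2.0)])
```

A small precision concentrates most of the weight on a few atoms, so individual draws can look quite different from the prior guess, whereas a large precision spreads the weight and keeps draws closer to it.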

3. Prior properties and posterior consistency

Unlike inference under parametric Bayesian models, semiparametric Bayesian methods do not guarantee that the influence of the prior on the posterior will be swamped in the presence of large data (Diaconis and Freedman, 1986), especially when our true function Inline graphic is only approximated by a finite spline basis during analysis. It is important and practical to study sufficient conditions for posterior consistency in such complicated Bayesian semiparametric models because they provide the conditions for the eventual identifiability of the parameters and functions of interest from observed data when the sample size is sufficiently large. For ease of exposition, we assume that the number of repeated observations Inline graphic is the same across all subjects (Inline graphic in (2.1)), the dimension Inline graphic of Inline graphic is Inline graphic and the within-subject dependence is based on a Gaussian copula with a uniform correlation matrix Inline graphic for all Inline graphic.

As a first step of this practically important theory, we investigate whether the “support” of our semiparametric prior distribution, denoted by Inline graphic, on the joint error density is large enough to contain all practically relevant true error densities. The support of the prior provides an idea of the flexibility of the prior process. We define the class Inline graphic to be the set of all univariate, possibly asymmetric, unimodal densities with median zero that satisfy the condition of (2.2) for some Inline graphic. Define Inline graphic to be the set of all Inline graphic-dimensional multivariate densities under a Gaussian copula and with unknown uniform correlation Inline graphic. We denote the prior Inline graphic on Inline graphic as Inline graphic, where Inline graphic and Inline graphic are the priors on Inline graphic and Inline graphic respectively. Recall that the prior Inline graphic is defined via (2.3) with Inline graphic and Inline graphic. Next, we need to specify a suitable topology, or equivalently definitions of some meaningful neighborhoods of the true joint density, to define the support of the prior Inline graphic. The Kullback–Leibler support Inline graphic of any prior Inline graphic on a density space Inline graphic is defined by the subset Inline graphic of Inline graphic satisfying Inline graphic where Inline graphic. Similarly, the weak support Inline graphic of the prior Inline graphic is defined under the weak topology. Define Inline graphic, Inline graphic. In the following Lemma 3.1, with proof in Appendix A of the supplementary material available at Biostatistics online, we characterize the weak and the Kullback–Leibler support of Inline graphic.
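For reference, the Kullback–Leibler support mentioned above is usually defined as follows in the Bayesian nonparametrics literature. The symbols ($\Pi$ for a prior, $\mathcal{F}$ for the density space, $f_0$ for a candidate true density) are assumed here and may not match the paper's own notation.

```latex
% Standard definitions, written in assumed notation.
\[
  \mathrm{KL}(f_0, f) \;=\; \int f_0(y)\,\log\frac{f_0(y)}{f(y)}\,\mathrm{d}y ,
\]
\[
  \mathrm{KL}(\Pi) \;=\; \Bigl\{ f_0 \in \mathcal{F} \;:\;
     \Pi\bigl( f \in \mathcal{F} : \mathrm{KL}(f_0, f) < \epsilon \bigr) > 0
     \ \text{for every } \epsilon > 0 \Bigr\} .
\]
```

In words, a density belongs to the Kullback–Leibler support when the prior assigns positive mass to every Kullback–Leibler neighborhood of it, however small.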

Lemma 3.1 —

(1) Inline graphic; (2) Inline graphic if suppInline graphic, Inline graphic is absolutely continuous with respect to the Lebesgue measure, Inline graphic is the set of all distributions on Inline graphic and Inline graphic.

Next, we show that our B-splines approximation of Inline graphic, the set of continuous functions from Inline graphic to Inline graphic, is adequate to approximate any true continuous Inline graphic if we use appropriately chosen prior for the number of bases Inline graphic. Let Inline graphic denote the prior on Inline graphic based on linear combination of B-splines with Gaussian prior for the basis coefficients. We characterize the support of Inline graphic below in Lemma 3.2 whose proof is in Appendix B of supplementary material available at Biostatistics online.

Lemma 3.2 —

For any Inline graphic, Inline graphic.

Remark 3.3 —

When the true function Inline graphic is a combination of B-splines, the prior for Inline graphic induced through Inline graphic can be even degenerate because for any Inline graphic, Inline graphic for large enough Inline graphic.

We now provide sufficient conditions that guarantee that, as the sample size Inline graphic of the observed data Inline graphic increases, the posterior distributions of the scalar Inline graphic, multivariate error density Inline graphic, and nonparametric Inline graphic concentrate around any arbitrarily small pre-defined neighborhoods around their true values Inline graphic, Inline graphic and Inline graphic, respectively. The choices of topologies for these neighborhoods depend on practical considerations. We use a strong neighborhood around Inline graphic, the primary parameter of interest, and a weak neighborhood Inline graphic around Inline graphic, considered a nuisance function for partial linear model. We consider a neighborhood Inline graphic around Inline graphic because we can only hope to recover the true Inline graphic at the observed time points Inline graphic, unless we impose additional and somewhat unrealistic assumptions on the design and sampled time points (Amewou-Atisso and others, 2003). Consider Inline graphic, for arbitrary Inline graphic. We state our result on posterior consistency in Theorem 3.4 whose proof is in Appendix C of supplementary material available at Biostatistics online.

Theorem 3.4 —

Assume that Inline graphic. Consider the prior Inline graphic on Inline graphic. Assume that the prior for Inline graphic satisfies Inline graphic with Inline graphic for some constant Inline graphic. Then Inline graphic a.s. under Inline graphic, Inline graphic, Inline graphic and Inline graphic is the true distribution generating data Inline graphic.

Remark 3.5 —

For example, when Inline graphic, the condition Inline graphic of Theorem 3.4 holds with Inline graphic. When the true Inline graphic is actually a linear combination of finitely many B-spline basis functions, one can achieve the same conclusion of Theorem 3.4 with a deterministic sequence Inline graphic. Although our proof of posterior consistency uses a Gaussian copula and equi-correlation Inline graphic, our techniques can be used to prove the results for any correlation matrix and for other parametric copula models, including heavy-tailed copulas.

4. Simulation study

Using a variety of simulation studies, we investigate the finite sample properties of our Bayesian methods and compare those with some of the existing methods. We compare estimates of Inline graphic and Inline graphic obtained using the following classes of error distributions and associated methods: (i) the M-estimator (He and others, 2002) of the median functional (called “M-estimator” in Table 1), (ii) Bayesian estimates with a parametric skewed double exponential error density (called “SDE” in Table 1), (iii) Bayesian estimates with the semiparametric skewed density of (2.3) (called “SPMSD” in Table 1). The parametric density of the SDE model in (ii) is Inline graphic with Inline graphic prior for Inline graphic. To compare the performance of these three methods, we use the mean squared error Inline graphic for estimated Inline graphic, and the mean integrated squared error Inline graphic for estimated Inline graphic, where Inline graphic is the estimate of Inline graphic obtained via analyzing dataset Inline graphic. Similarly, Inline graphic for Inline graphic are the estimators of the spline coefficient vectors. The MISE is a measure of the difference between the estimated and unknown Inline graphic at the observed time points.
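The two performance summaries can be computed along the following lines. This is a sketch under assumptions: the MSE is taken as the average squared deviation of the replicate estimates from the true value, and the MISE as the squared difference between fitted and true curves averaged over time points and then over replicates. The paper's exact normalization is not recoverable from the text, so the scaling here may differ, and the function names are hypothetical.

```python
def monte_carlo_mean(estimates):
    """Average of the estimates across simulated data sets."""
    return sum(estimates) / len(estimates)

def mse(estimates, truth):
    """Mean squared error of replicate estimates around the true parameter value."""
    return sum((e - truth) ** 2 for e in estimates) / len(estimates)

def mise(fitted_curves, true_curve):
    """Squared curve error averaged over time points, then over replicates.

    fitted_curves: one list of fitted values per replicate, evaluated at the
    same time points as true_curve.
    """
    total = 0.0
    for curve in fitted_curves:
        sq = sum((f - g) ** 2 for f, g in zip(curve, true_curve))
        total += sq / len(true_curve)
    return total / len(fitted_curves)

if __name__ == "__main__":
    # Toy numbers only, to show the calling convention.
    beta_hats = [2.9, 3.2, 3.0, 2.8]
    print(monte_carlo_mean(beta_hats), mse(beta_hats, truth=3.0))
    print(mise([[0.1, 0.4, 0.9], [0.0, 0.6, 1.1]], true_curve=[0.0, 0.5, 1.0]))
```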

Table 1.

Approximate (via Monte Carlo) sampling mean and mean square error (MSE within parenthesis) of the estimated regression parameters (Inline graphic), and approximate mean integrated squared error (MISE) for function Inline graphic. True parameter values are Inline graphic

Estimation models | Inline graphic (Inline graphic) | Inline graphic (Inline graphic) | MISE
Simulation study 1: common variance
M-estimator | 2.994 (1.209) | Inline graphic (1.320) | 19.020
SDE | 2.985 (0.695) | Inline graphic (0.773) | 13.950
SPMSD | 2.988 (0.591) | Inline graphic (0.717) | 7.412
Simulation study 2: heteroscedastic error
M-estimator | 3.002 (0.301) | Inline graphic (0.217) | 2.721
SDE | 2.996 (0.176) | Inline graphic (0.158) | 2.816
SPMSD | 3.000 (0.228) | Inline graphic (0.205) | 2.521
Simulation study 3: error with outliers
M-estimator | 2.995 (0.687) | Inline graphic (0.673) | 8.137
SDE | 2.981 (0.581) | Inline graphic (0.550) | 6.161
SPMSD | 2.985 (0.526) | Inline graphic (0.626) | 5.787
Simulation study 4: Gumbel error
M-estimator | 3.008 (5.952) | Inline graphic (6.483) | 80.260
SDE | 2.913 (4.878) | Inline graphic (4.219) | 71.230
SPMSD | 2.919 (4.282) | Inline graphic (3.902) | 34.290

We use standard diagnostic tests as well as trace plots of the MCMC samples to monitor the convergence of MCMC. Also, we get essentially identical posterior summaries with different MCMC starting points and for moderate changes of values of the hyperparameters. Each of the simulation settings uses Inline graphic replicates of simulated data sets, each with Inline graphic subjects and every subject measured at five random time points sampled from Inline graphic. All simulation models have the marginal regression structure Inline graphic, with skewed Inline graphic, Inline graphic and quadratic Inline graphic. We sample Inline graphic and Inline graphic independently from Inline graphic.

For Bayesian analysis, we use independent Inline graphic priors for Inline graphic and Inline graphic, Inline graphic prior for Inline graphic, and Inline graphic priors for Inline graphic and Inline graphic. We use quadratic splines for estimation of Inline graphic in all methods. For the M-estimator, we select the knots using BIC as described in He and others (2002). For the SPMSD model with DP prior for Inline graphic, we use Inline graphic and Inline graphic for Inline graphic (fairly noninformative).

The first simulation model has skewed error Inline graphic with Inline graphic, where the diagonal elements Inline graphic and off-diagonal elements Inline graphic for Inline graphic. This implies that Inline graphic has median Inline graphic and expectation Inline graphic. The summary of the results of simulation study 1 given in Table 1 shows that Bayesian methods using the model of (2.1) with both parametric (SDE) and nonparametric error densities (SPMSD) outperform the M-estimator.

Our second simulation setting aims to compare the performance of the estimators when observations are generated from a model with heteroscedastic as well as skewed Inline graphic. This simulation model uses error Inline graphic dependent on the covariate Inline graphic using Inline graphic, where Inline graphic is simulated using the same error distribution as in the first simulation model. In this setting, the estimates based on our model in (2.1) are, somewhat surprisingly, robust even to the heteroscedastic error density. As the residual density in (2.1) is highly flexible, the posterior means for the regression parameters are close to the true values, although the posterior distribution is different under a misspecified error density. Also, the parametric SDE method substantially outperforms the M-estimation method for estimating Inline graphic.

Our third simulation study aims to evaluate the robustness of the methods to the presence of possible large outliers. The error vectors Inline graphic are simulated from a mixture of two distributions, Inline graphic for all Inline graphic with probability Inline graphic, otherwise Inline graphic for all Inline graphic (the resulting Inline graphic has an outlier with probability Inline graphic), where Inline graphic is simulated from the multivariate density used for the first simulation model. In this study, the SPMSD method performs substantially better than the M-estimator in estimating Inline graphic and Inline graphic. In the fourth simulation setting, we generate observations with the marginal density of Inline graphic as the Gumbel with location Inline graphic and scale Inline graphic. This simulation study (like the previous one) aims to evaluate the robustness of our methods to misspecification of the error density in (2.3). In both simulation settings 3 and 4, the Bayesian methods outperform the M-estimator with respect to the MSE of Inline graphic and the MISE of Inline graphic. Overall, it is clear that our semiparametric Bayesian estimators based on the skewed uniform kernel error are very robust to the actual form of the density of Inline graphic. They perform better than the M-estimation method even when the modeling assumptions of M-estimation are more aligned with the true error density. In particular, the gain in the precision of the estimates from the uniform kernel skewed error model is substantial when the error density is highly skewed.

5. Analysis of P2C2 study

To monitor the progression of cardiac abnormalities in children born to HIV-infected women (Lipshultz and others, 1998), a cohort of Inline graphic infants born to HIV-infected women were followed from birth to 7 years of age in the P2C2 study. The continuous response of interest, RRBO (time between consecutive heartbeats), is measured at irregular time points that differ across children. There are Inline graphic observations in total for all the Inline graphic children. There are Inline graphic children with only one observation, and Inline graphic children with as many as Inline graphic observations. The HIV status Inline graphic of a child (equal to 1 for Inline graphic HIV-positive and equal to 0 for Inline graphic HIV-negative children) is the main covariate of interest. This study has been previously analyzed by Parzen and others (2011) and Lipsitz and others (2009) using dichotomized values of RRBO. However, there has not been any previous analysis of the original continuous RRBO response. The main goals of the analysis include developing an easily interpretable and appropriate model of the effects of HIV status and age on the highly skewed longitudinal response RRBO, estimating the HIV status effect with good precision, and predicting future RRBO of the same as well as new (similar) children.

Our model for RRBO Inline graphic of patient Inline graphic at age Inline graphic is the partial linear model of (2.1) with Inline graphic. For our Bayesian analysis, the unknown smooth function Inline graphic is approximated by a cubic B-spline Inline graphic with the (0.25, 0.375, 0.50, 0.625, 0.75) quantiles of age as the fixed knots. We compare the M-estimation method with two Bayesian estimation methods: (1) a Bayesian method with parametric skewed double-exponential error (called the SDE method) and (2) a Bayesian method with the skewed semiparametric error of (2.3) (called the SPMSD method). Based on the exploratory variogram plots in Figure 1, we use different structures of Inline graphic to model the within-patient Gaussian copula of (2.4) for the two HIV status groups. We use a uniform correlation structure with Inline graphic for the HIV-negative group and Inline graphic for the HIV-positive group for Inline graphic, where Inline graphic is the vector of association parameters.

Fig. 1. Variogram plot.

We specify priors for Inline graphic and Inline graphic using prior opinion about the median RRBO of patients from the two groups. For example, our Inline graphic prior for Inline graphic (regression parameter of HIV status) represents the prior opinion that two children of the same age, but from different groups, can have a maximum difference of 1.96 in median RRBO (with Inline graphic95% prior probability). Simulation of possible trajectories of Inline graphic using priors for Inline graphic is used to evaluate whether they represent the prior opinion about possible change in median RRBO over time. We use independent Inline graphic priors for each Inline graphic, Inline graphic and Inline graphic; independent Inline graphic priors for Inline graphic and Inline graphic (different skewness parameters for the two groups); and independent Inline graphic priors for Inline graphic and Inline graphic. For the sake of brevity, we omit the details and associated figures explaining the above choices of prior densities. We would like to emphasize that the Bayesian analysis presented here serves only to illustrate the methodology. The priors chosen above may not correspond to the true prior opinions of the clinical investigators. However, our sensitivity analysis suggests that these priors have only a moderate influence on the estimation. For the DP prior of the mixing distribution Inline graphic in (2.3), the choice of Inline graphic is determined by the prior opinion about the error density Inline graphic. For example, a skewed double exponential “prior guess” of Inline graphic needs a corresponding Gamma density for Inline graphic. The small precision Inline graphic that we use assigns very little confidence to our “prior guess” Inline graphic of Inline graphic. This implies that the actual Inline graphic can be very different from the guess Inline graphic. For the M-estimator (He and others, 2002), we use BIC for knot selection.
The estimators of the regression parameters and their corresponding estimated precision obtained from different methods are given in Table 2. We also present the estimators of the association parameters Inline graphic and of the skewness parameters Inline graphic and Inline graphic for the two groups (available only for the Bayesian analysis). We can see that the standard errors of the estimated effect of HIV status obtained from our two Bayesian methods are roughly 30% smaller than those obtained from the M-estimation method, a substantial gain in precision. All methods show strong evidence that the HIV-positive status group has a lower median RRBO than the HIV-negative status group.

Table 2.

Results of the analysis of the P2C2 study: estimates and corresponding standard errors (SE, given in parentheses) of the regression parameters Inline graphic and other model parameters obtained from different methods

Estimation method                   SDE               SPMSD                     M-estimator
Parameters                          Estimate (SD)     Estimate (SD)             Estimate (SD)
Inline graphic                      0.440 (0.005)     0.448 (0.006)             0.428 (0.038)
Inline graphic                      -0.041 (0.0039)   Inline graphic (0.0046)   -0.042 (0.0063)
Inline graphic for Inline graphic   0.715 (0.043)     0.749 (0.037)
Inline graphic for Inline graphic   0.839 (0.068)     0.864 (0.058)
Inline graphic                      0.234 (0.028)     0.242 (0.028)
Inline graphic                      0.168 (0.031)     0.192 (0.035)
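The precision comparison drawn from Table 2 can be reproduced with simple arithmetic. The sketch below copies the three standard errors of the HIV-status coefficient from the table; the dictionary keys and the function name are illustrative labels, with "He2002" standing for the frequentist estimator of He and others (2002).

```python
# Standard errors of the HIV-status coefficient, copied from Table 2.
se = {"SDE": 0.0039, "SPMSD": 0.0046, "He2002": 0.0063}


def relative_reduction(se_method, se_reference):
    """Fractional reduction in standard error relative to a reference method."""
    return 1.0 - se_method / se_reference


# Reduction of each Bayesian method relative to the frequentist estimator.
reductions = {
    name: relative_reduction(value, se["He2002"])
    for name, value in se.items()
    if name != "He2002"
}
for name, value in reductions.items():
    print(f"{name}: {100 * value:.1f}% smaller SE")
```

The two reductions bracket the figure of about 30% quoted in the text.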

In Figure 2, we illustrate the predictive and goodness-of-fit performance of our semiparametric Bayesian method using plots of the predicted quartile functions (Inline graphic, and Inline graphic, respectively, for the first, second, and third quartile functions) together with scatter plots of the responses from each group. The figure suggests different error densities for the two groups. Also, over a grid of, say, 0.2 years in time, the estimated quantile functions capture nearly the intended proportions (Inline graphic) of observations in all four ranges (Inline graphic and Inline graphic) for each status group, suggesting an excellent fit for our Bayesian methods. The results in Table 2 also give strong evidence of different skewness and different association structures for the two groups.

Fig. 2.

Plot of observations and predictive quantiles (under semiparametric Bayesian model) for HIV groups. Age is scaled to the interval [0, 1].
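The goodness-of-fit check described for Figure 2 can be sketched as follows. Three quartile curves split the responses into four ranges, each of which should hold about 25% of the observations within every time bin. Everything in the sketch is a stand-in: the data are simulated, the median curve and the shifted-exponential error are hypothetical, and the true quartile curves are used in place of fitted ones.

```python
import math
import random


def median_curve(t):
    # Hypothetical smooth median function of scaled age t in [0, 1].
    return 0.4 + 0.1 * math.sin(2.0 * math.pi * t)


def error_quantile(p):
    # Quantile of a right-skewed error: Exp(1) shifted to have median zero.
    return -math.log(1.0 - p) - math.log(2.0)


def quartile_curves(t):
    # First, second, and third quartile of the response at time t.
    return [median_curve(t) + error_quantile(p) for p in (0.25, 0.5, 0.75)]


def coverage_by_bin(ages, responses, width=0.2):
    """Proportion of responses in each of the four quartile ranges, per time bin."""
    n_bins = int(round(1.0 / width))
    counts = [[0, 0, 0, 0] for _ in range(n_bins)]
    for t, y in zip(ages, responses):
        b = min(int(t / width), n_bins - 1)
        q1, q2, q3 = quartile_curves(t)
        k = (y > q1) + (y > q2) + (y > q3)  # index of the range containing y
        counts[b][k] += 1
    return [[c / sum(row) for c in row] for row in counts]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
ages = [rng.random() for _ in range(5000)]
responses = [median_curve(t) + rng.expovariate(1.0) - math.log(2.0) for t in ages]
proportions = coverage_by_bin(ages, responses)
for row in proportions:
    print([round(p, 3) for p in row])
```

With fitted rather than true quartile curves, proportions far from 0.25 in some bin would point to a poor fit in that age range.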

6. Conclusions and remarks

In this paper, we have developed a novel semiparametric Bayesian method, with associated theoretical justification, for longitudinal data with skewed continuous responses. Our method addresses several challenges: it avoids restrictive assumptions on the marginal error density, allows a nonparametric effect of time, gives the regression parameters a simple physical interpretation, and permits prior specification based on prior opinion about the medians and the error density. Our MCMC computations can be implemented even in freely available software such as WinBUGS and JAGS. Our copula-based models can also allow different within-subject association structures, skewness levels, and scales for different treatment groups.

The results of our simulation studies in Table 1 show that our Bayesian estimators give a much lower finite-sample MSE than the M-estimators of He and others (2002), even when the error density and correlation structure of the simulation model are very different from the model assumptions used for Bayesian estimation. Frequentist robust estimators such as M-estimators aim at consistent estimation of the regression parameters under a very wide class of error densities. However, a likelihood-based Bayesian estimator such as ours often enjoys better finite-sample precision than its competitors, and is particularly suitable for the many biomedical studies where prediction is of practical importance. To validate the performance of the methods, we compared out-of-sample prediction errors from competing methods; these comparisons show better predictive performance for our semiparametric Bayesian method. A logical way to choose among Bayesian models with different association structures would be to use the L-criterion of Ibrahim and others (2001). In this paper, however, we compare frequentist and Bayesian methods. For this purpose, we believe that the diagnostic plots of Figure 2 and the estimated precision of the regression parameters in Table 2 are two reasonable tools.

For the analysis of the P2C2 study, our choices of two different structures of Inline graphic for the two HIV-status groups are based on our preliminary data analysis and subject-area knowledge. In principle, our copula-based longitudinal model can allow a very wide class of association structures, including uniform (equicorrelation), AR(1), and exponential, as long as the correlation matrix Inline graphic is positive definite. It is also straightforward to extend our method to other, non-Gaussian copulas. Although the Gaussian copula has several advantages, including computational tractability and good physical interpretation, it also has some limitations. For example, a well-known criticism of the Gaussian copula is its lack of tail dependence. However, this is not a major concern when we are primarily interested in inference about the median and other quartile functions of the marginal density.
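The three association structures named above, and the positive-definiteness requirement, can be illustrated with a short sketch. The function names, the parameter values, and the use of a Cholesky factorization as the check are illustrative choices, not the implementation used for the analysis.

```python
import math


def uniform_corr(m, rho):
    """Equicorrelation: every off-diagonal entry equals rho."""
    return [[1.0 if i == j else rho for j in range(m)] for i in range(m)]


def ar1_corr(m, rho):
    """AR(1): correlation decays as rho ** |i - j| in the visit index."""
    return [[rho ** abs(i - j) for j in range(m)] for i in range(m)]


def exponential_corr(times, phi):
    """Exponential: correlation exp(-|t_i - t_j| / phi) in observation time."""
    return [[math.exp(-abs(s - t) / phi) for t in times] for s in times]


def is_positive_definite(a, tol=1e-12):
    """Attempt a Cholesky factorization; it succeeds only for positive definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= tol:
                    return False  # non-positive pivot: not positive definite
                low[i][i] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return True


checks = {
    "uniform, rho=0.3": is_positive_definite(uniform_corr(4, 0.3)),
    "AR(1), rho=0.7": is_positive_definite(ar1_corr(4, 0.7)),
    "exponential, phi=0.5": is_positive_definite(
        exponential_corr([0.0, 0.1, 0.4, 0.9], 0.5)
    ),
    # Equicorrelation needs rho > -1/(m - 1); with m = 4, rho = -0.6 fails.
    "uniform, rho=-0.6": is_positive_definite(uniform_corr(4, -0.6)),
}
print(checks)
```

The last case shows why the condition matters: an equicorrelation matrix with too negative a common correlation is not a valid correlation matrix for four repeated measurements.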

For the first time in the literature, we provide a theoretical justification of Bayesian methods for skewed longitudinal data under a partial linear model. Our results guarantee consistent regression estimates under very mild regularity conditions, even when the unspecified time effect Inline graphic is approximated by a B-spline. For simplicity of exposition, our posterior consistency results are shown only for the uniform correlation (equicorrelation) structure; they can be extended straightforwardly to other positive definite Inline graphic. One possible criticism is that the class of residual densities in the support of our prior does not contain continuous asymmetric densities. However, our simulation studies 3 and 4 suggest that the assumption of a split density for the error has a negligible effect on the final estimates of Inline graphic, even when the true Inline graphic is both continuous and asymmetric.
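To make the B-spline approximation of the time effect concrete, the sketch below evaluates a B-spline basis with the standard Cox-de Boor recursion and forms a linear combination of the basis functions. The cubic degree, the knot locations, and the coefficient values are hypothetical and are not taken from the analysis.

```python
def bspline_basis(x, knots, degree):
    """Evaluate all B-spline basis functions at x via the Cox-de Boor recursion."""
    n_spans = len(knots) - 1
    # Degree-0 functions: indicators of the half-open knot spans.
    basis = [1.0 if knots[i] <= x < knots[i + 1] else 0.0 for i in range(n_spans)]
    # Close the last non-empty span on the right so x == knots[-1] is covered.
    if x == knots[-1]:
        last = max(i for i in range(n_spans) if knots[i] < knots[i + 1])
        basis[last] = 1.0
    for d in range(1, degree + 1):
        nxt = []
        for i in range(len(knots) - d - 1):
            left = right = 0.0
            den_l = knots[i + d] - knots[i]
            den_r = knots[i + d + 1] - knots[i + 1]
            if den_l > 0:
                left = (x - knots[i]) / den_l * basis[i]
            if den_r > 0:
                right = (knots[i + d + 1] - x) / den_r * basis[i + 1]
            nxt.append(left + right)
        basis = nxt
    return basis


def spline_value(x, coefs, knots, degree):
    """Approximate time effect at x: linear combination of the basis functions."""
    return sum(c * b for c, b in zip(coefs, bspline_basis(x, knots, degree)))


# Cubic basis on scaled age [0, 1] with three hypothetical interior knots.
degree = 3
knots = [0.0] * 4 + [0.25, 0.5, 0.75] + [1.0] * 4
coefs = [0.1, 0.3, 0.2, 0.5, 0.4, 0.6, 0.2]  # hypothetical spline coefficients
for x in (0.0, 0.1, 0.6, 1.0):
    print(x, round(spline_value(x, coefs, knots, degree), 4))
```

With the knots fixed, the unknown time effect is represented by a finite coefficient vector, so it can be given a prior and sampled alongside the regression parameters.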

Instead of using a prior for Inline graphic (as suggested in Section 3), we used a fixed value of Inline graphic for the Bayesian analysis of the P2C2 study. This value of Inline graphic is the same as the one used for the frequentist analysis with the method of He and others (2002), which uses BIC to decide Inline graphic. We did this only to facilitate a comparison between the Bayesian and frequentist results while maintaining a comparable level of approximation of Inline graphic in both methods. Our results on Bayesian consistency are even useful for deriving the rate of posterior convergence of the regression parameters. However, for the sake of brevity, we omit the proof of the optimal rate of convergence of our semiparametric Bayesian estimators. Another possible direction for future research is to develop computationally efficient heteroscedastic median-zero error distributions to accommodate large outliers.

Funding

Dr Sinha acknowledges support from the National Cancer Institute (NCI) of the National Institutes of Health (R01CA69222 and R01CA60679). Dr Pati acknowledges support from the Office of Naval Research (ONR BAA 14-0001). Dr Lipsitz acknowledges support from the National Cancer Institute (NCI) of the National Institutes of Health (R01CA60679). Conflict of Interest: None declared.

Supplementary Materials

Supplementary material is available online at http://biostatistics.oxfordjournals.org

References

  1. Amewou-Atisso M., Ghosal S., Ghosh J. K., Ramamoorthi R. V. (2003). Posterior consistency for semi-parametric regression problems. Bernoulli 9(2), 291–312.
  2. Bhattacharya P. K., Zhao P.-L. (1997). Semiparametric inference in a partial linear model. The Annals of Statistics 25(1), 244–262.
  3. Bickel P. J., Kleijn B. J. K. (2012). The semiparametric Bernstein–von Mises theorem. The Annals of Statistics 40(1), 206–237.
  4. Diaconis P., Freedman D. (1986). On the consistency of Bayes estimates. The Annals of Statistics 14(1), 1–26.
  5. Diggle P., Heagerty P., Liang K.-Y., Zeger S. (2002). Analysis of Longitudinal Data. Oxford, UK: Oxford Statistical Sciences Series.
  6. Feller W. (2008). An Introduction to Probability Theory and Its Applications, Volume 2. Hoboken, NJ, USA: Wiley Series in Probability and Mathematical Statistics.
  7. Ferguson T. (1974). Prior distributions on spaces of probability measures. The Annals of Statistics 2(4), 615–629.
  8. Geweke J. (1989). Bayesian inference in econometric models using Monte Carlo integration. Econometrica: Journal of the Econometric Society 57(6), 1317–1339.
  9. Hanson T., Johnson W. O. (2002). Modeling regression error with a mixture of Polya trees. Journal of the American Statistical Association 97(460), 1020–1033.
  10. Härdle W., Liang H. (2007). Partially Linear Models. New York, USA: Springer.
  11. He X., Zhu Z.-Y., Fung W.-K. (2002). Estimation in a semiparametric model for longitudinal data with unspecified dependence structure. Biometrika 89(3), 579–590.
  12. Ho H. J., Lin T.-I. (2010). Robust linear mixed models using the skew t distribution with application to schizophrenia data. Biometrical Journal 52(4), 449–469.
  13. Ibrahim J. G., Chen M.-H., Sinha D. (2001). Criterion-based methods for Bayesian model assessment. Statistica Sinica 11(2), 419–444.
  14. Ishwaran H., James L. F. (2001). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association 96(453).
  15. Khintchine A. Y. (1938). On unimodal distributions. Izvestiya Nauchno-Issledovatel'skogo Instituta Matematiki i Mekhaniki 2, 1–7.
  16. Koenker R., Bassett G., Jr (1978). Regression quantiles. Econometrica: Journal of the Econometric Society 46(1), 33–50.
  17. Kottas A., Gelfand A. E. (2001). Bayesian semiparametric median regression modeling. Journal of the American Statistical Association 96(456), 1458–1468.
  18. Lee K., Daniels M. J., Joo Y. (2013). Flexible marginalized models for bivariate longitudinal ordinal data. Biostatistics 14, 462–476.
  19. Lin J., Sinha D., Lipsitz S., Polpo A. (2012). Semiparametric Bayesian survival analysis using models with log-linear median. Biometrics 68(4), 1136–1145.
  20. Lipshultz S. E., Easley K. A., Orav E. J., Kaplan S., Starc T. J., Bricker J. T., Lai W. W., Moodie D. S., McIntosh K., Schluchter M. D. and others (1998). Left ventricular structure and function in children infected with human immunodeficiency virus: the prospective P2C2 HIV multicenter study. Circulation 97(13), 1246–1256.
  21. Lipsitz S. R., Fitzmaurice G. M., Ibrahim J. G., Sinha D., Parzen M., Lipshultz S. (2009). Joint generalized estimating equations for multivariate longitudinal binary outcomes with missing data: an application to acquired immune deficiency syndrome data. Journal of the Royal Statistical Society: Series A (Statistics in Society) 172(1), 3–20.
  22. Nelsen R. B. (2007). An Introduction to Copulas. New York, USA: Springer Series.
  23. Parzen M., Ghosh S., Lipsitz S., Sinha D., Fitzmaurice G. M., Mallick B. K., Ibrahim J. G. (2011). A generalized linear mixed model for longitudinal binary data with a marginal logit link function. The Annals of Applied Statistics 5(1), 449.
  24. Pati D., Dunson D. B., Tokdar S. T. (2013). Posterior consistency in conditional distribution estimation. Journal of Multivariate Analysis 116, 456–472.
  25. Pitt M., Chan D., Kohn R. (2006). Efficient Bayesian inference for Gaussian copula regression models. Biometrika 93(3), 537–554.
  26. Reich B. J., Bondell H. D., Wang H. J. (2010). Flexible Bayesian quantile regression for independent and clustered data. Biostatistics 11(2), 337–352.
  27. Schumaker L. (2007). Spline Functions: Basic Theory. Cambridge, UK: Cambridge Mathematical Library.
  28. Shen W., Tokdar S. T., Ghosal S. (2013). Adaptive Bayesian multivariate density estimation with Dirichlet mixtures. Biometrika 100(3), 623–640.
  29. Sklar M. (1959). Fonctions de répartition à n dimensions et leurs marges. Paris, France: Université Paris; 8.
  30. Speckman P. (1988). Kernel smoothing in partial linear models. Journal of the Royal Statistical Society. Series B (Methodological) 50(3), 413–436.
  31. Verbeke G., Molenberghs G. (2009). Linear Mixed Models for Longitudinal Data. New York, USA: Springer Series in Statistics.
  32. Yue Y. R., Rue H. (2011). Bayesian inference for additive mixed quantile regression models. Computational Statistics & Data Analysis 55(1), 84–96.


Articles from Biostatistics (Oxford, England) are provided here courtesy of Oxford University Press
