Author manuscript; available in PMC: 2009 Mar 18.
Published in final edited form as: J Am Stat Assoc. 2008 Dec 1;103(484):1496–1507. doi: 10.1198/016214508000000850

Assessing Sexual Attitudes and Behaviors of Young Women: A Joint Model with Nonlinear Time Effects, Time Varying Covariates, and Dropouts

Pulak Ghosh 1, Wanzhu Tu 1
PMCID: PMC2657729  NIHMSID: NIHMS74345  PMID: 19300533

Abstract

Understanding human sexual behaviors is essential for the effective prevention of sexually transmitted infections. Analysis of longitudinally measured sexual behavioral data, however, is often complicated by zero-inflation of event counts, nonlinear time trends, time-varying covariates, and informative dropouts. Ignoring these complicating factors could undermine the validity of the study findings. In this paper, we put forth a unified joint modeling structure that accommodates these features of the data. Specifically, we propose a pair of simultaneous models for the zero-inflated event counts: each of these models contains an auto-regressive structure for the accommodation of the effect of recent event history, and a nonparametric component for the modeling of nonlinear time effect. Informative dropout and time varying covariates are modeled explicitly in the process. Model fitting and parameter estimation are carried out in a Bayesian paradigm by the use of a Markov Chain Monte Carlo (MCMC) method. Analytical results showed that adolescent sexual behaviors tended to evolve nonlinearly over time and they were strongly influenced by the day-to-day variations in mood and sexual interests. These findings suggest that adolescent sex is to a large extent driven by intrinsic factors rather than being compelled by circumstances, thus highlighting the need for education on self protective measures against infection risks.

Keywords: Joint modeling, Markov Chain Monte Carlo, Mood, Sexually transmitted infections, Zero-inflated Poisson

1 Introduction

Human sexual contacts are the primary pathway for sexually transmitted pathogens such as Chlamydia trachomatis (CT), Neisseria gonorrhoeae (NG), and Trichomonas vaginalis (TV). Despite the existence of efficacious antimicrobial agents against the organisms, these diseases remain prevalent in the US population. The burden of the diseases falls disproportionately on adolescents and young adults. For example, although young people aged 15 to 24 account for only a quarter of the sexually active population, they represent nearly half of new infections (Weinstock, Berman, and Cates 2004). Since the diseases are transmitted through behavior, the development of effective prevention strategies requires an improved understanding of human sexual behaviors, particularly those of the young. Because adolescent sexual behaviors tend to change with time and experience, and are likely influenced by proximal phenomena such as mood and sexual interest, it is essential to model behavioral events longitudinally with full consideration of the contextual information. But the analysis of longitudinal behavioral data collected from observational studies is often complicated by potentially nonlinear time effects, large between-subject variability, time-varying covariates, and informative dropouts. This paper presents a unified analytical framework for the modeling of longitudinally collected counts of human sexual events, with explicit accommodation of these various complications.

1.1 An Epidemiological Study of Sexual Behaviors of Young Women

Young women were recruited for participation in a behavioral epidemiological study from three urban primary care clinics. The overall objective of the study is to examine the behavioral factors related to the sexually transmitted infections (STI). Eligibility criteria included that the young woman be between the ages 14 and 17 years, be able to understand English, not have any serious psychiatric disturbances or mental handicaps, and attend one of the three recruiting clinics. These clinics serve a predominantly urban and lower income population. Individuals who did not plan to continue residence in the area for the next three months or who were pregnant were excluded from the study. At the participating sites, all women who met the enrollment criteria, regardless of prior sexual experience, were identified by clinical schedule and those who agreed to participate were enrolled at the current or subsequent clinical visit. Informed consent and parental permission were obtained at the time of enrollment.

All subjects had quarterly clinic visits for the duration of the study period. In addition to the quarterly clinic visits, the study subjects also completed daily behavioral diaries, which provided detailed records of the subject's sexual behaviors in their original time sequence. Specifically, the diary was a structured mini-survey in which the subject reported sexual intercourse, condom protection, STI symptoms, and daily mood and sexual interest. Since coitus is relatively infrequent in adolescents and may exhibit certain day-of-the-week patterns, we summarized the daily events into weekly event counts and focused on the description of weekly rate of sexual intercourse.

The original study is designed to have a total length of followup of 27 months, and it is currently ongoing. In this analysis we used a subset of 282 subjects who had been enrolled into the study for at least six months (24 weeks), including those who had dropped out of the study before the completion of the six-month interview; recent enrollees who had entered the study in the last five weeks were not considered in the current analysis. The subject characteristics that we considered in this analysis included age, lifetime number of partners, and history of STI, all measured at enrollment. STI history is thought to be a marker of the more risky sexual behaviors in young women, and the lifetime number of partners to be a marker of the subject's sexual experience and partner availability.

The focus of this analysis was to examine whether positive mood and sexual interest were associated with the level of sexual activity in adolescent women. In this study, positive mood was assessed via diary by asking the subject to indicate the percentage of time in the day that she felt “happy”, “cheerful”, and “friendly”. The responses were on a Likert scale ranging from “not at all” (1 point), “some of it” (2 points), “about half” (3 points), and “most of it” (4 points) to “all day” (5 points). The responses to these three items produced a mood score between 3 and 15 points. The purpose of having these correlated items in the scale is to achieve a better representation of the unobserved underlying construct of positive mood. Similarly, daily “sexual interest” was measured by one item in the diary on the same five-point Likert scale. In the present research, we calculated and used the average weekly mood and sexual interest scores in the analysis.
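
The scoring described above can be sketched in a few lines of Python. This is only an illustration, not the study's data-processing code: the function names are hypothetical, and the handling of days without a diary entry (skipped when averaging) is an assumption that the text does not specify.

```python
# Likert coding as described in the text.
LIKERT_POINTS = {
    "not at all": 1, "some of it": 2, "about half": 3, "most of it": 4, "all day": 5,
}

def daily_mood_score(happy, cheerful, friendly):
    """Sum of the three positive-mood items; ranges from 3 to 15."""
    return LIKERT_POINTS[happy] + LIKERT_POINTS[cheerful] + LIKERT_POINTS[friendly]

def weekly_average(daily_scores):
    """Average of the daily scores within one week.
    Assumption: days without a diary entry are passed as None and skipped."""
    observed = [s for s in daily_scores if s is not None]
    return sum(observed) / len(observed) if observed else None

# Hypothetical week of diary entries (not study data).
week = [daily_mood_score("most of it", "about half", "all day"), 9, None, 15, 3, 12, 6]
print(weekly_average(week))
```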

1.2 Analytical Issues of Longitudinally Measured Behavioral Data

The primary response variable of interest of this analysis was the weekly number of coital events. The weekly coital frequency counts of the study cohort are depicted in Figure 1. These counts ranged from 0 to 8 with more than half of the observations being zero. When the number of zeros in the data set exceeds the probability mass that the Poisson distribution allocates to the point of zero, the data are said to be zero-inflated. In the presence of zero inflation, modeling the event count via Poisson regression is no longer appropriate; instead, zero-inflated Poisson (ZIP) regression models are often used in place of the classical Poisson regression models (Lambert 1992). Since Lambert's seminal work on ZIP regression models, a variety of applied ZIP regression models have been successfully used in several important clinical applications (Böhning, Dietz, Schlattmann, Mendonca, and Kirchner, 1999; Yau and Lee, 2001; Cheung, 2002; Lu, Lin, and Shih, 2004; Ghosh, Mukhopadhyay, and Liu, 2006). However, a number of methodological issues have complicated the analysis of the study data.

Figure 1. Weekly coital frequency counts of 282 young women over a 24 week period.

  1. Repeatedly measured event counts contributed by the same subject tend to be correlated, and current event count is likely to be associated with past event counts. Hall and Zhang (2004) discussed model-fitting procedures for marginal ZIP regression models for clustered count data. Min and Agresti (2005) and Hall and Wang (2005) considered mixed-effects ZIP regression models for repeatedly measured or cluster correlated data. But in this application, since the levels of sexual activities vary significantly from person to person, and the event count of the current week is likely to depend on those of the previous weeks, a more natural approach is to develop an autoregressive model where the outcome at current time point depends on its value at previous time points (Diggle et al, 2002, Chapter 10). Previous investigation has indicated the validity of this structure in the context of sexual behavioral research (Fortenberry et al, 2005).

  2. Human behaviors often change gradually over time. Such time effects are typically unobservable but ignoring them could have serious consequences. Additionally, repeated behavioral assessments themselves could have a subtle but non-ignorable impact on the behavior being studied. One of the concerns here is that repeated questioning about sexual activities could “activate” the subject, thus gradually influencing her behavior. Methodologically, it is often impossible to independently verify the existence of such activation effects because of the lack of appropriate control subjects who are not subjected to these assessments. The dilemma is that, had there been such a control group, it would produce no behavioral data for the comparison due to the lack of assessments! But, if the data collection instruments have an activating effect, the effect is likely to express itself over time. For this reason, it becomes critically important for us to account for the time effect explicitly in the model to ensure the validity of inference on the important independent variables. This said, we note two difficulties with the modeling of the time effect: First, there is no guarantee that the time effect is linear; indeed, since the true functional form of the time covariate is unknown, the assumption of a linear time effect may not always be justifiable. Second, the time effect could differ by individual subjects. Examining the event profiles of 20 randomly chosen subjects (Figure 2), it becomes clear that they do not correspond to a particular parametric form, and between-subject variation is quite evident. These considerations have led us to consider a semiparametric approach for the cohort time effect using spline models. We use this approach to explore the features of the population and individual curves within the mixed model framework.

  3. The observation of longitudinal cohorts is often accompanied by dropouts. The probability of a subject's dropping out of a study may be related to the subject's self-reported event rate. In the context of adolescent sexual behaviors, some have suspected that dropout is often a marker of higher risk behavior. A missing data mechanism where a subject's probability of dropping out depends on the rate of the Poisson process was referred to as “informative censoring” by Wu and Carroll (1988) and “informative missingness” by Follman and Wu (1995). We use the term “informative dropout” to describe this situation. In this application, we use a logistic model to depict the drop-out probability as a function of the subject's baseline characteristics, her observed event counts before the dropout time, and a random subject effect. In this example, about 20% of the subjects dropped out of the study during the course of follow-up.

  4. Because mood and sexual interest are collected over time together with coital counts, they can be viewed not only as time varying covariates, but also as realizations of some underlying psychological processes (Fortenberry et al., 2006). Therefore, directly modeling them may enhance our understanding of these effects. For example, an explicit covariate model will allow us to assess the strength of correlations of the mood and sexual interest measures within the subject over time. In addition, these time-varying covariates are not observed at the time of dropout, leading to missingness in the covariates. Roy and Lin (2005) have shown that ignoring this missingness in the covariates will yield inconsistent estimates of the model parameters. Thus we directly model the time-varying covariates using a mixed model framework.

Figure 2. Sample Longitudinal Profiles of Weekly Coital Frequency for 20 Subjects.

It should also be noted that in behavioral data, these complications rarely appear in isolation. Therefore in the present paper, we propose a joint modeling approach for the sexual activity data. Specifically, the behavioral events are modeled by an autoregressive ZIP regression structure with a semiparametric component for the accommodation of a potentially nonlinear time effect. The ZIP regression models share the subject-specific random effect with the logistic model for dropout. Time varying covariates such as mood and sexual interest are modeled via linear mixed model with autoregressive structures. The model is thus semiparametric in nature as the time effect is modeled nonlinearly. Within this unified modeling framework, a Bayesian approach was developed for parameter estimation. As an applied statistical method, this work is in contrast to the previous approaches by providing a joint modeling structure for the depiction of behavioral events in a longitudinal study. This research has incorporated some of the more recent modeling techniques in a ZIP regression set up: (1) it simultaneously models the probability weights of the mixture distributions; (2) it incorporates semiparametric functions for the time effects; (3) it explicitly accounts for informative dropouts; and (4) it accommodates the missing covariates. Since these characteristics are not uncommon in longitudinal studies of behavioral outcomes, the method is potentially useful for a wider class of applications.

2 Model Specification

2.1 Zero-inflated Poisson Model

Let Yij be the count of behavioral events reported by the ith subject in the jth time unit, i = 1, 2, …, m; j = 1, 2, …, n, where m represents the number of subjects in the study, and n is the designed number of time units in the follow-up period. In the context of this research, Yij is the number of sexual episodes reported by the ith subject in the jth week. Depending on the subject's current state of sexual activeness, a large number of zeros may be observed in Y. Following Lambert (1992), Hall (2000), Dagne (2004) and Ghosh, Mukhopadhyay, and Liu (2006), we further assume that for each observed event count, Yij, there is an unobserved random variable for the state of sexual activeness, Uij, where P(Uij = 0) = pij if Yij comes from the degenerate distribution, and P(Uij = 1) = 1 − pij if Yij ∼ Poisson(λij):

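$$Y_{ij} = \begin{cases} 0, & \text{with probability } p_{ij}, \\ \text{Poisson}(\lambda_{ij}), & \text{with probability } (1 - p_{ij}), \end{cases} \tag{1}$$

where Poisson(λij) is defined by the density function $P(Y_{ij} = y_{ij}) = \exp(-\lambda_{ij})\,\lambda_{ij}^{y_{ij}} / y_{ij}!$.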

It should be noted that both the degenerate distribution and the Poisson process can produce zero observations. Such a formulation is often referred to as the zero-inflated Poisson (ZIP) distribution. It then follows that

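$$\Pr(Y_{ij} = 0) = p_{ij} + (1 - p_{ij})\exp(-\lambda_{ij}), \tag{2}$$
$$\Pr(Y_{ij} = y_{ij}) = (1 - p_{ij})\,\frac{\exp(-\lambda_{ij})\,\lambda_{ij}^{y_{ij}}}{y_{ij}!}, \quad y_{ij} = 1, 2, \ldots. \tag{3}$$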

In this research, one could conceptualize the degenerate distribution as representing a “sexually inactive” state with probability pij, while the Poisson process represents a “sexually active” state with λij being the mean weekly number of sexual episodes.
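
As a numerical check on equations (2) and (3), the ZIP probability mass function can be written out directly. The sketch below is illustrative only; the values of p and λ are hypothetical and are not estimates from the study. It confirms that the probabilities sum to one, that the ZIP mean equals (1 − p)λ, and that the zero mass exceeds the zero mass of a plain Poisson distribution with the same λ.

```python
import math

def zip_pmf(y, p, lam):
    """Zero-inflated Poisson mass at y, following equations (2) and (3)."""
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    if y == 0:
        return p + (1.0 - p) * poisson   # zeros arise from both states
    return (1.0 - p) * poisson           # positive counts arise only from the active state

p, lam = 0.5, 1.5                        # hypothetical values, not study estimates
total = sum(zip_pmf(y, p, lam) for y in range(60))
mean = sum(y * zip_pmf(y, p, lam) for y in range(60))
zero_excess = zip_pmf(0, p, lam) - math.exp(-lam)
print(total, mean, zero_excess)
```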

Because the weekly event counts are simultaneously influenced by the state that the subject is in during the week and the weekly event rate given she is in an active state, we consider simultaneous modeling of both λij and pij.

2.2 Simultaneous Models of Behavioral Event Counts

We assume the following logistic and log-linear regression models for pij and λij:

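$$\text{logit}(1 - p_{ij}) = S_i^T\beta_1^p + T_{ij}^T\beta_2^p + \sum_{q=1}^{Q}\beta_{3q}^p Y_{i,j-q} + Z_{ij1}^T b_{i1} + f_p(t_{ij}) + h_{ip}(t_{ij}); \tag{4}$$
$$\log(\lambda_{ij}) = S_i^T\beta_1^\lambda + T_{ij}^T\beta_2^\lambda + \sum_{q=1}^{Q}\beta_{3q}^\lambda Y_{i,j-q} + Z_{ij2}^T b_{i2} + f_\lambda(t_{ij}) + h_{i\lambda}(t_{ij}). \tag{5}$$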

The logistic model in (4) explicitly depicts the probability $1 - p_{ij}$ that the observation comes from the Poisson process rather than from the degenerate distribution; and the loglinear model in (5) quantifies the “intensity” of the Poisson process. Herein, $S_i$ denotes the baseline characteristics and $T_{ij}$ is the time-varying covariate vector for the ith subject at time j. Although we assumed the same set of covariates for $p_{ij}$ and $\lambda_{ij}$ in the above formulation, the models can easily be modified to accommodate different covariates in the two processes. The parameters $\beta_1^p$, $\beta_2^p$, $\beta_1^\lambda$, $\beta_2^\lambda$ are vectors of regression coefficients for the fixed effects. Note that in (4) and (5), the subject's response at time j, $Y_{ij}$, depends on the subject's past events through embedded Qth order autoregressive structures. Parameters $\beta_3^p = (\beta_{31}^p, \ldots, \beta_{3Q}^p)$ and $\beta_3^\lambda = (\beta_{31}^\lambda, \ldots, \beta_{3Q}^\lambda)$ are associated with the autoregressive process. In this application, the reported number of sexual episodes may vary from week to week in an unknown fashion. Thus, the time effects on $p_{ij}$ and $\lambda_{ij}$ are modeled by unspecified nonparametric functions $f_p(t)$ and $f_\lambda(t)$, respectively. These unspecified smooth functions reflect the nonlinear effect of time. However, these functions represent only the population averages; individual trajectories may still vary from subject to subject and the individual pattern may not follow the pattern of the population curve. These subject effects may also contribute to the correlation of the longitudinal measurements within subjects. Therefore we add a subject-specific nonparametric function $h_i(\cdot)$, which represents the subject's deviation from the group curves. The population curve $f(t)$ is important as it describes the overall cohort time effect on the parameter of interest. At the same time, individual curves $h_i(t)$ are introduced to represent subject-specific variations around the population time effect. Together, the population average and individual specific curves serve to improve the fit of the model and to inform the investigators about the nature of the cohort time effect. To accommodate any extra within-subject correlation due to the large within-subject variability in the cohort, we introduce additional random effects $(b_{i1}, b_{i2})$ into the models.
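
To make the autoregressive structure concrete, the following sketch simulates one subject's weekly counts from a first-order (Q = 1) version of models (4) and (5). It is a simplified illustration under stated assumptions: the arguments `eta_active` and `eta_rate` are hypothetical stand-ins that lump together all non-autoregressive terms (covariates, random effects, and spline terms), and all numerical values are invented.

```python
import math
import random

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def poisson_draw(lam, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate_subject(n_weeks, eta_active, eta_rate, ar_active, ar_rate, y0=0, seed=0):
    """Simulate weekly counts from a first-order autoregressive ZIP model.
    eta_active / eta_rate stand in for all other terms of (4) / (5) (an assumption)."""
    rng = random.Random(seed)
    y_prev, series = y0, []
    for _ in range(n_weeks):
        active_prob = inv_logit(eta_active + ar_active * y_prev)  # 1 - p_ij, as in (4)
        lam = math.exp(eta_rate + ar_rate * y_prev)               # lambda_ij, as in (5)
        y = poisson_draw(lam, rng) if rng.random() < active_prob else 0
        series.append(y)
        y_prev = y
    return series

series = simulate_subject(24, eta_active=0.0, eta_rate=0.0, ar_active=0.3, ar_rate=0.1)
print(series)
```

With positive autoregressive coefficients, a week with reported events raises both the probability of being in the active state and the expected count in the following week.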

He, Fung and Zhu (2005), and Zhao, Staudenmayer, Coull, and Wand (2006) discussed the incorporation of semiparametric population curves in generalized linear models. This research further extends those methods by embedding both population average and subject-specific splines in a ZIP regression model. In doing so, the proposed semiparametric ZIP model offers greater flexibility in the modeling of zero-inflated event counts. The model reduces to a parametric ZIP model when $f_p(t)$, $f_\lambda(t)$, $h_{ip}(t)$, and $h_{i\lambda}(t)$ are constants. Following Ruppert et al. (2003), we assume that the spline functions take the following general forms of a piecewise polynomial of degree τ:

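$$\begin{aligned}
f_p(t) &= \nu_1^p t + \nu_2^p t^2 + \cdots + \nu_\tau^p t^\tau + \sum_{d=1}^{D_1} u_{d1}^p (t - \kappa_{d1}^p)_+^\tau; \\
f_\lambda(t) &= \nu_1^\lambda t + \nu_2^\lambda t^2 + \cdots + \nu_\tau^\lambda t^\tau + \sum_{d=1}^{D_1} u_{d1}^\lambda (t - \kappa_{d1}^\lambda)_+^\tau; \\
h_{ip}(t) &= \rho_{1i}^p t + \rho_{2i}^p t^2 + \cdots + \rho_{\tau i}^p t^\tau + \sum_{d=1}^{D_2} u_{id2}^p (t - \kappa_{d2}^p)_+^\tau; \\
h_{i\lambda}(t) &= \rho_{1i}^\lambda t + \rho_{2i}^\lambda t^2 + \cdots + \rho_{\tau i}^\lambda t^\tau + \sum_{d=1}^{D_2} u_{id2}^\lambda (t - \kappa_{d2}^\lambda)_+^\tau;
\end{aligned}$$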

where $x_+ = x$ if $x > 0$ and 0 otherwise, and $\kappa_{d1}^p$, $\kappa_{d1}^\lambda$, $\kappa_{d2}^p$, $\kappa_{d2}^\lambda$ are the known knot points. The choice of the knots will be described in the data analysis section. Note that the population spline has no intercept, to avoid unidentifiability. We assume $u_{d1}^p \sim N(0, \sigma_{up}^2)$, $u_{d1}^\lambda \sim N(0, \sigma_{u\lambda}^2)$, $u_{id2}^p \sim N(0, \sigma_{1p}^2)$, $u_{id2}^\lambda \sim N(0, \sigma_{1\lambda}^2)$. The above spline model of order τ represents adequate fits for most situations. But the number of parameters may not be practical for smaller data sets. In those situations, simpler spline models such as linear splines may be used, or subject specific splines may be dropped. Typically, linear (τ = 1), quadratic (τ = 2) or cubic (τ = 3) splines are common choices in practice as they ensure a certain degree of smoothness in the fitted curve. The above spline models can be embedded in the mixed model framework for a general structure as follows:

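Let $X_{1ij} = (t, \ldots, t^\tau)^T$, $W_{ij1}^p = [(t - \kappa_{11}^p)_+^\tau, \ldots, (t - \kappa_{D_1 1}^p)_+^\tau]^T$, $\beta_4^p = (\nu_1^p, \ldots, \nu_\tau^p)^T$, $u_1^p = (u_{11}^p, \ldots, u_{D_1 1}^p)^T$, $\rho_i^p = (\rho_{1i}^p, \ldots, \rho_{\tau i}^p)^T$, $u_{i2}^p = (u_{i12}^p, \ldots, u_{iD_2 2}^p)^T$. In a similar way, we define $W_{ij2}^\lambda$, $\beta_4^\lambda$, $u_1^\lambda$, $\rho_i^\lambda$, $u_{i2}^\lambda$. Then,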

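$$f_p(t) + h_{ip}(t) = X_{1ij}^T\beta_4^p + (W_{ij1}^p)^T u_1^p + X_{1ij}^T\rho_i^p + (Z_{ij2}^p)^T u_{i2}^p = X_{1ij}^T\beta_4^p + (W_{ij1}^p)^T u_1^p + (V_{ij}^p)^T w_i^p, \tag{6}$$
$$f_\lambda(t) + h_{i\lambda}(t) = X_{1ij}^T\beta_4^\lambda + (W_{ij2}^\lambda)^T u_1^\lambda + X_{1ij}^T\rho_i^\lambda + (Z_{ij2}^\lambda)^T u_{i2}^\lambda = X_{1ij}^T\beta_4^\lambda + (W_{ij2}^\lambda)^T u_1^\lambda + (V_{ij}^\lambda)^T w_i^\lambda, \tag{7}$$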

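where $V_{ij}^p = (X_{1ij}^T, (Z_{ij2}^p)^T)$ and $w_i^p = (\rho_i^p, u_{i2}^p)^T$; $V_{ij}^\lambda$ and $w_i^\lambda$ are defined similarly. Also, $E(u_1^p) = 0$, $\text{cov}(u_1^p) = \sigma_{up}^2 I_{D_1}$, $E(w_i^p) = 0$, $\text{cov}(w_i^p) = \text{diag}(\rho^p, \sigma_{1p}^2 I_{D_1})$, $E(u_1^\lambda) = 0$, $\text{cov}(u_1^\lambda) = \sigma_{u\lambda}^2 I_{D_1}$, $E(w_i^\lambda) = 0$, and $\text{cov}(w_i^\lambda) = \text{diag}(\rho^\lambda, \sigma_{1\lambda}^2 I_{D_1})$.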

The splines above are partitioned into a fixed linear component plus a random component, with zero expectation, representing smooth deviations about the linear trend.
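
The truncated power basis underlying these spline functions is simple to compute. The sketch below evaluates the polynomial terms and the knot terms (t − κ)₊^τ at a single time point; the function names are hypothetical, the coefficients are invented for illustration, and the three knots shown are a subset of the weeks used as knots later in the data analysis.

```python
def truncated_power_basis(t, knots, degree=1):
    """Return the polynomial terms (t, ..., t^degree) and the knot terms (t - kappa)_+^degree."""
    poly = [t ** d for d in range(1, degree + 1)]
    trunc = [max(t - k, 0.0) ** degree for k in knots]
    return poly, trunc

def spline_value(t, nu, u, knots, degree=1):
    """Evaluate a spline with polynomial coefficients nu and knot coefficients u (no intercept)."""
    poly, trunc = truncated_power_basis(t, knots, degree)
    return sum(a * x for a, x in zip(nu, poly)) + sum(b * z for b, z in zip(u, trunc))

# Linear spline (degree 1) evaluated at week 10 with knots at weeks 5, 8, and 11.
poly, trunc = truncated_power_basis(10, [5, 8, 11], degree=1)
value = spline_value(10, nu=[0.5], u=[1.0, 0.0, 2.0], knots=[5, 8, 11], degree=1)
print(poly, trunc, value)
```

A knot contributes nothing until t passes it, which is what lets the fitted curve change slope (or curvature, for higher degrees) at each knot.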

Letting $X_{ij} = (S_i^T, T_{ij}^T, Y_{i,j-1}, \ldots, Y_{i,j-Q}, X_{1ij}^T)^T$, $\beta^p = (\beta_1^p, \beta_2^p, \beta_3^p, \beta_4^p)^T$ ($\beta^\lambda$ defined similarly), and plugging expressions (6)-(7) into equations (4)-(5), we get:

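$$\begin{aligned}
\text{logit}(1 - p_{ij}) &= S_i^T\beta_1^p + T_{ij}^T\beta_2^p + \sum_{q=1}^{Q}\beta_{3q}^p Y_{i,j-q} + X_{1ij}^T\beta_4^p + (W_{ij1}^p)^T u_1^p + (V_{ij}^p)^T w_i^p + Z_{ij1}^T b_{i1} \\
&= X_{ij}^T\beta^p + (W_{ij1}^p)^T u_1^p + (V_{ij}^p)^T w_i^p + Z_{ij1}^T b_{i1},
\end{aligned} \tag{8}$$
$$\begin{aligned}
\log(\lambda_{ij}) &= S_i^T\beta_1^\lambda + T_{ij}^T\beta_2^\lambda + \sum_{q=1}^{Q}\beta_{3q}^\lambda Y_{i,j-q} + X_{1ij}^T\beta_4^\lambda + (W_{ij2}^\lambda)^T u_1^\lambda + (V_{ij}^\lambda)^T w_i^\lambda + Z_{ij2}^T b_{i2} \\
&= X_{ij}^T\beta^\lambda + (W_{ij2}^\lambda)^T u_1^\lambda + (V_{ij}^\lambda)^T w_i^\lambda + Z_{ij2}^T b_{i2}.
\end{aligned} \tag{9}$$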

2.3 Informative dropout in Cohort Studies

Dropouts are not uncommon in observational studies of large cohorts. Here we define the dropout as someone who did not come to a scheduled visit and had not come back by the end of the study. Since the measurements are missing after the last kept visit, analysis of the incomplete data poses additional challenges. If the dropouts are due to a mechanism that is unrelated to the investigation, i.e., the unobserved behaviors are missing completely at random, these dropouts can be ignored. However, this is unlikely to be the case for most of the longitudinal studies of human behavior. In adolescent health studies, there are suspicions that the dropouts may be associated with certain traits that can be characterized as lack of discipline. These traits not only influence the dropout process, but also correlate with the sexual behaviors themselves, thus giving us an incentive for a joint modeling of the outcomes and dropout process.

Specifically, for each Yij we define a missing indicator variable Rij, such that Rij = 1 if Yij was missing, and 0 otherwise. Thus, Ri = (Ri1, …, Rin)T is a vector of missing response indicators for individual i. Then a simple model could be constructed to describe the nonignorable missing response:

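$$R_{ij} \sim \text{Bernoulli}(\eta_{ij}), \quad \text{where } \eta_{ij} = \Pr(R_{ij} = 1 \mid Y_{ij}, Y_{i(j-1)}), \tag{10}$$
$$g(\eta_{ij}) = L_{ij}^T\xi + \psi_1 Y_{ij} + \sum_{s=2}^{Q_1}\psi_s Y_{i,j+1-s} + \sum_{k=1}^{K}\zeta_k T_{ijk} + Z_{ij3}^T b_{i3}, \tag{11}$$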

where $Y_{i(j-1)}$ denotes a subset of the history of the data, e.g., it can be the previous responses ($y_{i,j-1}$) and/or previous time-varying covariates $T_{i,j-1,k}$. Note that $\psi_1 \neq 0$ gives nonignorable missingness. Here $g(\cdot)$ is a link function; we let $g(x) = \text{logit}(x)$. In the above model, $L_{ij}$ is the vector of baseline covariates and $b_{i3}$ is the vector of random subject effects corresponding to the dropout model. $T_{ijk}$ is the kth time-varying covariate. The unknown parameters are $(\xi, \psi_1, \ldots, \psi_{Q_1}, \zeta_1, \ldots, \zeta_K)$. The baseline covariates and the time-varying covariates may be the same as in the response model. The non-ignorable dropout mechanism is modeled by the dependence of the dropout probability on the unobserved outcome $y_{ij}$ at the time of dropout and on the outcome before dropping out. As for the random subject effects vector $b_i = (b_{i1}, b_{i2}, b_{i3})^T$, we assume $b_i \sim N(0, \Delta_b)$. Thus, the correlated random effects allow for the association between dropout and the outcome.
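
The dropout probability implied by the logit-link version of model (11) can be computed as follows. This is a hedged sketch: the function name and argument layout are hypothetical, the baseline and time-varying covariate contributions are passed in as precomputed sums, and the coefficient values are invented rather than taken from the fitted model.

```python
import math

def dropout_probability(baseline_term, y_current, y_history, psi,
                        covariate_term=0.0, random_effect=0.0):
    """eta_ij from a logit-link dropout model.
    psi[0] multiplies the current count Y_ij; psi[1:] multiply Y_{i,j-1}, Y_{i,j-2}, ...
    baseline_term and covariate_term are precomputed linear contributions (an assumption)."""
    linear = baseline_term + psi[0] * y_current + covariate_term + random_effect
    linear += sum(coef * y for coef, y in zip(psi[1:], y_history))
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical coefficients: dropout becomes more likely as the current count rises.
low = dropout_probability(-3.0, y_current=0, y_history=[1], psi=[0.5, 0.2])
high = dropout_probability(-3.0, y_current=4, y_history=[1], psi=[0.5, 0.2])
print(low, high)
```

Setting `psi[0]` to zero removes the dependence on the current, possibly unobserved, count, which corresponds to the ignorable case noted in the text.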

2.4 Modeling Time-varying Covariates

In the above model we have covariates which are also measured over time along with the response variable. It is usual that some of the covariates will be unobserved because of dropout in the data. Due to the presence of this missingness in the time-varying covariates, we need to model the covariate process (Roy and Lin, 2006). We develop a multivariate linear mixed model (Shah et al., 1997) to describe the covariate process.

Let Tijk be the kth covariate for the ith subject measured at time j. We assume the following linear mixed model for the different time-varying covariates:

$$T_{ijk} = A_{ijk}^T\gamma_{0k} + \gamma_{1k} T_{i,j-1,k} + B_{ijk}^T\delta_{ik} + e_{ijk}, \tag{12}$$

where Aijk is the design matrix for the fixed effects. The model assumes that the kth time-varying covariate at the current time depends on its value at the previous time point, δik is the random subject effect for the ith subject in the kth marker, and eijk is the measurement error.

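Let $T_{ij} = (T_{ij1}, \ldots, T_{ijK})^T$, $e_{ij} = (e_{ij1}, \ldots, e_{ijK})^T$, $\gamma_0 = (\gamma_{01}, \ldots, \gamma_{0K})$, $\gamma_1 = (\gamma_{11}, \ldots, \gamma_{1K})$, $\delta_i = (\delta_{i1}, \delta_{i2}, \ldots, \delta_{iK})^T$. Then in matrix notation we have,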

$$T_{ij} = A_{ij}^T\gamma_0 + \gamma_1 T_{i,j-1} + B_{ij}^T\delta_i + e_{ij}, \tag{13}$$

where $e_{ij} \sim N(0, \Sigma = \text{diag}\{\sigma_k^2\})$, $\delta_i \sim N(0, \Sigma_\delta)$, $\Sigma_\delta$ being the variance-covariance matrix for the random effects $\delta_i$. We assume independence between the random effects and error distribution.

The Markovian structure of the model allows for a longitudinal correlation structure for the same covariates over time.
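
The first-order Markov structure of model (12) can be illustrated by simulating a single covariate for one subject. The sketch below makes simplifying assumptions that go beyond the text: the fixed-effect design is reduced to an intercept, the random subject effect is a single additive term, and all parameter values are hypothetical.

```python
import random

def simulate_covariate(n_weeks, gamma0, gamma1, delta_i, sigma, t_first, seed=0):
    """Simulate T_ij = gamma0 + gamma1 * T_{i,j-1} + delta_i + e_ij, with e_ij ~ N(0, sigma^2).
    Simplification (an assumption): intercept-only fixed effects and a scalar random effect."""
    rng = random.Random(seed)
    series = [t_first]
    for _ in range(n_weeks - 1):
        mean = gamma0 + gamma1 * series[-1] + delta_i
        series.append(rng.gauss(mean, sigma))
    return series

# Hypothetical mood trajectory: a larger gamma1 gives stronger week-to-week persistence.
mood = simulate_covariate(24, gamma0=4.0, gamma1=0.6, delta_i=0.5, sigma=1.0, t_first=10.0)
print([round(m, 2) for m in mood[:5]])
```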

3 Bayesian Inference: Likelihood, Priors and Posterior

3.1 The Likelihood Function

Let $Y_{obs,i} = (Y_{i1}, \ldots, Y_{in_i})^T$ and $T_{obs,i} = (T_{i1}, \ldots, T_{in_i})^T$ denote the observed values of $Y_i$ and $T_i$, respectively. We also assume that for subjects who dropped out from the study, $Y_{drop,i} = (Y_{i,n_i+1}, \ldots, Y_{in})^T$ and $T_{drop,i} = (T_{i,n_i+1}, \ldots, T_{in})^T$ represent the missing responses and covariates, respectively. Then $Y_i = (Y_{obs,i}^T, Y_{drop,i}^T)^T$, and $T_i = (T_{obs,i}^T, T_{drop,i}^T)^T$. We define $S_i$ and $Z_{ij}$ similarly. Further, we write $\beta = (\beta_1^\lambda, \beta_2^\lambda, \beta_3^\lambda, \beta_1^p, \beta_2^p, \beta_3^p)^T$, $\psi = (\psi_1, \ldots, \psi_{Q_1})^T$, and $\zeta = (\zeta_1, \ldots, \zeta_K)^T$.

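Let $\Omega = (\Omega_1, \Omega_2, \Omega_3, \Omega_4, \Omega_5)$ be the vector of model parameters. Here $\Omega_1 = (\beta, \sigma_{up}^2, \sigma_{u\lambda}^2, \sigma_{1p}^2, \sigma_{1\lambda}^2)$ is the parameter vector for the joint model, $\Omega_2 = (\gamma = (\gamma_0, \gamma_1)^T, \sigma_1^2, \ldots, \sigma_K^2)$ is the parameter vector for the time-varying covariate model, $\Omega_3 = (\xi, \psi, \zeta)^T$ is the parameter vector for the dropout model, $\Omega_4 (= \Delta_b)$ is the parameter of the random effect $b_i$, and $\Omega_5 (= \Sigma_\delta)$ is the parameter of the random subject effects $\delta_i$.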

Then under the assumption of nonignorable dropout (ψ1 ≠ 0), the joint likelihood can be written as

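$$\begin{aligned}
L(Y_{obs,i}, T_{obs,i}, R_i \mid S_i, T_{i1}, b_i, \delta_i; \Omega) \propto{} & L(Y_i \mid Y_{i1}, S_i, T_i, b_i; \Omega_1)\, L(T_i \mid T_{i1}, S_i, \delta_i; \Omega_2) \\
& \times L(R_i \mid Y_i, T_{obs,i}, S_{obs,i}, b_i; \Omega_3)\, L(b_i; \Omega_4)\, L(\delta_i; \Omega_5),
\end{aligned} \tag{14}$$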

where

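$$L(Y_i \mid Y_{i1}, S_i, T_i, b_i; \Omega_1) = \prod_{j=2}^{n_i} \left[p_{ij} + (1 - p_{ij})e^{-\lambda_{ij}}\right]^{I[Y_{ij}=0]} \times \left[(1 - p_{ij})\frac{e^{-\lambda_{ij}}\lambda_{ij}^{Y_{ij}}}{Y_{ij}!}\right]^{1 - I[Y_{ij}=0]}, \tag{15}$$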

with $p_{ij}$ and $\lambda_{ij}$ given in equations (4)-(5), and

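$$L(T_i \mid T_{i1}, S_i, \delta_i; \Omega_2) \propto \frac{1}{|\Sigma|^{n_i/2}} \exp\left\{-\frac{1}{2}\sum_{j=2}^{n_i}(T_{ij} - \mu_{T_{ij}})^T \Sigma^{-1} (T_{ij} - \mu_{T_{ij}})\right\}, \tag{16}$$

where $\mu_{T_{ij}} = A_{ij}^T\gamma_0 + \gamma_1 T_{i,j-1} + B_{ij}^T\delta_i$; and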

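$$L(R_i \mid Y_i, T_{obs,i}, S_{obs,i}; \Omega_3, b_i) = \prod_{j=2}^{n_i} \left\{\Pr(r_{ij} = 1 \mid Y_i, T_{obs,i}, S_{obs,i}; \Omega_3, b_i)\right\}^{r_{ij}} \times \left\{1 - \Pr(r_{ij} = 1 \mid Y_i, T_{obs,i}, S_{obs,i}; \Omega_3, b_i)\right\}^{1 - r_{ij}}, \tag{17}$$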

and L(bi; Ω4) and L(δi; Ω5) denote the multivariate normal distributions with zero mean vector and variance-covariance matrices Δb and Σδ, respectively.
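
The ZIP contribution in equation (15) is easiest to work with on the log scale. The sketch below computes that log-likelihood for one subject, taking the weekly values of p and λ as given; in the full model these would come from equations (4) and (5). The function name and inputs are hypothetical.

```python
import math

def zip_loglik(y, p, lam):
    """Log of the product in equation (15) over the weeks supplied.
    y, p, lam are equal-length sequences of counts, p_ij, and lambda_ij.
    The caller decides which weeks to pass (the product in (15) starts at j = 2)."""
    total = 0.0
    for y_j, p_j, lam_j in zip(y, p, lam):
        if y_j == 0:
            total += math.log(p_j + (1.0 - p_j) * math.exp(-lam_j))
        else:
            total += (math.log(1.0 - p_j) - lam_j + y_j * math.log(lam_j)
                      - math.lgamma(y_j + 1))   # lgamma(y + 1) = log(y!)
    return total

# Hypothetical counts and parameter values for three weeks.
loglik = zip_loglik([0, 2, 0], [0.4, 0.4, 0.6], [1.0, 1.5, 0.8])
print(loglik)
```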

3.2 Prior Distribution

To complete the Bayesian specification of the model, we must assign priors to the unknown parameters. Since we have no prior information from historical data or from experiment, we take the usual route and assign conjugate priors to the parameters. We assume the elements of $\Omega = (\beta, \xi, \psi, \zeta, \gamma, \Delta_b, \Sigma_\delta, \sigma_{up}^2, \sigma_{u\lambda}^2, \sigma_{1p}^2, \sigma_{2p}^2, \sigma_{1\lambda}^2, \sigma_{2\lambda}^2, \sigma_1^2, \ldots, \sigma_K^2)$ are independently distributed. For each fixed effect we assume a normal prior, for each variance parameter an inverse-Gamma (IG) prior, and for each variance-covariance matrix an inverse Wishart prior. An inverse-Gamma prior with shape parameter a and scale parameter b is denoted by $x \sim IG(a, b)$ and is given by $f(x) \propto x^{-(a+1)}\exp(-b/x)$. Equivalently, we assume a Wishart distribution for the inverse of a variance-covariance matrix, where $W_q(\varrho, S)$ is a q dimensional Wishart distribution with ϱ degrees of freedom and mean $\varrho S^{-1}$. For our analysis, diffuse priors can be chosen so that the analysis is dominated by the data likelihood. Specifically, to represent the vague prior knowledge, we propose to set the degrees of freedom for the Wishart distribution to be the minimum possible, viz., the rank of the variance-covariance matrix.

We specify the following priors on the model parameters for the fixed effects, π(β) ∼ N(μβ, Σβ), π(ξ) ∼ N(μξ, Σξ), π(ψ) ∼ N(μψ, Σψ), π(ζ) ∼ N(μζ, Σζ), π(γ) ∼ N(μγ, Σγ).

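For the variance parameters we assume inverse-gamma priors as follows: $\pi(\sigma_{up}^2) \sim IG(a_{up}, b_{up})$, $\pi(\sigma_{u\lambda}^2) \sim IG(a_{u\lambda}, b_{u\lambda})$, $\pi(\sigma_{1p}^2) \sim IG(a_{1p}, b_{1p})$, $\pi(\sigma_{1\lambda}^2) \sim IG(a_{1\lambda}, b_{1\lambda})$, and $\pi(\sigma_k^2) \sim IG(c_k, d_k)$; k = 1, 2, …, K.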

Finally, the variance-covariance parameters of the random subject effects take the following forms: $\pi(\Delta_b^{-1}) \sim \text{Wishart}(\varrho_b, S_\varrho)$, $\pi(\Sigma_\delta^{-1}) \sim \text{Wishart}(\varrho_\delta, S_\delta)$.

3.3 Posterior Distribution and Inference

The joint posterior distribution of the parameters of the models conditional on the data is obtained by combining the likelihood in (14) and the prior densities using Bayes' theorem:

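$$\begin{aligned}
f(\Omega, b, \delta, u \mid y) \propto{} & \prod_{i=1}^{m}\left\{L(y_{obs,i}, T_{obs,i}, R_i \mid S_i, T_{i1}, b_i, \delta_i; \Omega)\right\} \pi(\beta)\,\pi(\sigma_{up}^2)\,\pi(\sigma_{u\lambda}^2)\,\pi(\sigma_{1p}^2) \\
& \times \pi(\sigma_{1\lambda}^2)\,\pi(\xi)\,\pi(\psi)\,\pi(\zeta)\,\pi(\gamma)\,\pi(\Delta_b^{-1})\,\pi(\Sigma_\delta^{-1}) \prod_{k=1}^{K}\pi(\sigma_k^2).
\end{aligned}$$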

The posterior distributions are analytically intractable. However, models described above can be fitted using the Markov Chain Monte Carlo (MCMC) methods such as the Gibbs sampler (Gelfand and Smith, 1990). Since the full conditional distributions are not standard, a straightforward implementation of the Gibbs sampler using standard sampling techniques may not be possible. However, sampling methods can be performed using adaptive rejection sampling (ARS; Gilks and Wild, 1992). Recently, Ghosh, Mukhopadhyay, and Liu (2006) have advocated the use of ARS for a ZIP model. In this research, we follow their procedure, which first uses a data augmentation step to sample the values of the latent variables (sexual activities) based on the current value of the parameters, and then samples the parameters using the ARS method given the latent variables. Samples were directly obtained from the joint posterior distribution of the parameters as well as the latent variables. Implementation of this method is relatively easy in publicly available software WinBUGS (2005). The samples from the posterior obtained from the MCMC will allow us to achieve summary measures of the parameter estimates, and to obtain credible intervals of the parameters of interest. See Section 4.2 for more computational details of the data analysis.
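
The data augmentation step mentioned above relies on the conditional distribution of the latent state given the observed count. For a ZIP model this conditional is standard: a positive count can only come from the active state, while a zero comes from the inactive state with probability p / (p + (1 − p)e^(−λ)). The sketch below illustrates that single step only; it is not the authors' WinBUGS implementation, and the function names are hypothetical.

```python
import math
import random

def prob_inactive_given_zero(p, lam):
    """P(U = 0 | Y = 0): share of the zero mass in equation (2) due to the degenerate state."""
    return p / (p + (1.0 - p) * math.exp(-lam))

def sample_state(y, p, lam, rng=random):
    """Draw the latent state U (1 = active, 0 = inactive) given the observed count y."""
    if y > 0:
        return 1                      # positive counts can only arise from the Poisson state
    return 0 if rng.random() < prob_inactive_given_zero(p, lam) else 1

rng = random.Random(1)
draws = [sample_state(0, p=0.5, lam=1.5, rng=rng) for _ in range(1000)]
print(prob_inactive_given_zero(0.5, 1.5), 1 - sum(draws) / len(draws))
```

The larger the Poisson rate, the less plausible it is that an observed zero came from the active state, so the probability of the inactive state approaches one.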

4 Data Analysis

4.1 Model Specification

Using the proposed model, we analyzed the data collected from the behavioral epidemiological study. The data set contained weekly coital frequency counts (Yij) of 282 young women measured over a period of 24 weeks, i = 1, 2, …, 282; j = 1, 2, …, 24. The vector of baseline characteristics, Si = (AGEi, STDi, PTRi)T, where AGEi, STDi, and PTRi were respectively the ith subject's age, STD history, and lifetime number of sexual partners, was measured at the time of enrollment. The vector of time-varying covariates had two elements, Tij = (MOODij, SIij)T, where MOODij and SIij were respectively the weekly average mood and sexual interest scores reported by the ith subject in the jth week. Under a first order auto-regressive structure (Q = 1), we had the following semiparametric autoregressive ZIP models:

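$$\begin{aligned}
\text{logit}(1 - p_{ij}) &= \beta_{11}^p + \beta_{12}^p\,\text{AGE}_i + \beta_{13}^p\,\text{STD}_i + \beta_{14}^p\,\text{PTR}_i + \beta_{21}^p\,\text{MOOD}_{ij} + \beta_{22}^p\,\text{SI}_{ij} + \beta_{31}^p Y_{i,j-1} + b_{i1} + f_p(t_{ij}) + h_{ip}(t_{ij}), \\
\log(\lambda_{ij}) &= \beta_{11}^\lambda + \beta_{12}^\lambda\,\text{AGE}_i + \beta_{13}^\lambda\,\text{STD}_i + \beta_{14}^\lambda\,\text{PTR}_i + \beta_{21}^\lambda\,\text{MOOD}_{ij} + \beta_{22}^\lambda\,\text{SI}_{ij} + \beta_{31}^\lambda Y_{i,j-1} + b_{i2} + f_\lambda(t_{ij}) + h_{i\lambda}(t_{ij}).
\end{aligned}$$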

Please note that for the convenience of model interpretation, we chose to model the probability that the ith subject was in a sexually active state (1 − pij) at jth visit in the logistic model.

For the fitting of the models, there is no clear rule on how many knot points to include or where to locate them in the spline functions. More knots are needed in regions where the function is changing rapidly (Ruppert et al. 2003). Sometimes knowledge of subject matter may be relevant in placing knots where a change in the shape of the curve is expected. Using too few knots or poorly sited knots means the approximation to the curve will be degraded. By contrast, a spline using too many knots will be imprecise. Since the subjects were assessed regularly with equally spaced intervals in this study, we selected the knots from the existing values that were equally spaced within the range [min(x), max(x)]. Thus the six knots were placed at weeks 5, 8, 11, 14, 17, 20.
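
The six knot locations listed above are evenly spaced three weeks apart. A small helper reproduces them; the end points (weeks 5 and 20) are taken from the stated knots, and how they relate to the full observation range is not specified in the text, so the helper simply takes them as inputs. The function name is hypothetical.

```python
def equally_spaced_knots(first, last, count):
    """Return `count` equally spaced knots from `first` to `last`, inclusive."""
    step = (last - first) / (count - 1)
    return [first + i * step for i in range(count)]

knots = equally_spaced_knots(5, 20, 6)
print(knots)  # [5.0, 8.0, 11.0, 14.0, 17.0, 20.0]
```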

A model for dropout was assumed in case the dropouts were informative. Preliminary data analysis suggested that the dropout might depend on the current or previous coital frequency counts, as well as on some of the baseline covariates. So we considered the following simple model:

\[
\operatorname{logit}(\eta_{ij}) = \xi_1 + \xi_2\,\mathrm{AGE}_i + \xi_3\,\mathrm{STD}_i + \psi_1 Y_{ij} + \psi_2 Y_{i,j-1} + b_{i3}, \qquad (18)
\]

where bi3 was the random subject effect. The dropout probability was thus modeled as a function of the subject's enrollment age, STD history, and the current and previous weekly coital frequency counts. We chose this dropout model based on the model selection criteria described in Section 4.4; the resulting estimates are reported in Section 4.3.

Similarly, the time-varying covariates mood and sexual interest were modeled as follows:

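\[
\begin{aligned}
\mathrm{MOOD}_{ij} &= \gamma_{01} + \gamma_{11}\,\mathrm{MOOD}_{i,j-1} + W_{i1} + W_{i2}\,t_{ij} + e_{ij1},\\
\mathrm{SI}_{ij} &= \gamma_{02} + \gamma_{12}\,\mathrm{SI}_{i,j-1} + W_{i3} + W_{i4}\,t_{ij} + e_{ij2}.
\end{aligned}
\]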

Again, the autoregressive structures embedded in the time-varying covariate models allowed us to examine the strength of the autocorrelation within the covariates. This was not only of scientific interest to the investigation, but also helpful for the exploration of the modeling structure. For example, a very strong autocorrelation in mood would not only counter the speculation of mood swings in adolescents, but also render it unnecessary to collect mood measures so frequently, or to treat mood as a time-varying variable.

Since the study was still ongoing and data were still being collected at the time of this report, the currently available data set was not large enough to be divided for the purpose of prior elicitation. Prior information based on expert opinion, even if available, is nonetheless user specific. Hence, in this analysis we chose our priors to be proper but weakly informative. Specifically, we took a N(0, 50) prior for each of the regression parameters, and for each variance parameter (= 1/precision) we used an IG(2.01, 1.01) prior, giving rise to a prior mean of 1 and a prior variance of 100. For the inverse variance-covariance matrix \(\Delta_b^{-1}\) we assumed Wishart(3, diag(0.1, 0.1)), and for \(\Delta_w^{-1}\) we assumed Wishart(2, diag(0.1, 0.1)).
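The prior mean and variance quoted for the inverse gamma prior can be checked with a few lines of arithmetic. The sketch below assumes the shape and scale parameterization IG(a, b), with mean b/(a − 1) and variance b²/((a − 1)²(a − 2)); the function name is illustrative.

```python
def inverse_gamma_moments(shape, scale):
    """Mean and variance of an inverse gamma distribution (shape > 2 required)."""
    if shape <= 2:
        raise ValueError("the variance is finite only for shape > 2")
    mean = scale / (shape - 1.0)
    variance = scale ** 2 / ((shape - 1.0) ** 2 * (shape - 2.0))
    return mean, variance

prior_mean, prior_variance = inverse_gamma_moments(2.01, 1.01)
print(round(prior_mean, 4), round(prior_variance, 4))
```

Under this parameterization IG(2.01, 1.01) has mean 1 and variance 100 up to floating-point error, which is why the prior is proper yet only weakly informative.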

4.2 Computational Details

We ran two chains of the Gibbs sampler with widely dispersed initial values. The initial values for the fixed parameters were selected by starting with the prior mean and covering ±3 standard deviations. The initial values for the precision were arbitrarily selected. Initially, some evidence of poor mixing was found regarding the standard deviation of the random effect slope in the spline model. Following Zhao, Staudenmayer, Coull, and Wand (2006), we then tried several other choices of the inverse gamma and the folded-t prior distributions for the standard deviations. The folded-t class of prior densities has been recommended by Gelman (2005) over the commonly used inverse-gamma distribution in hierarchical models. The folded-t prior has the advantage of improving computational efficiency by reducing dependence among parameters (Liu, Rubin, and Wu, 1998; Liu and Wu, 1999) and yields a Gibbs sampler that is less prone to slow mixing when the standard deviations are near zero. We found that the sampler behaved erratically under a moderately to highly dispersed inverse gamma prior. However, the use of the folded-t prior on the standard deviations dramatically improved the mixing, and the fits were stable. Thus, we resorted to the folded-t class of priors for our results. See Zhao, Staudenmayer, Coull, and Wand (2006) and Gelman (2005) for details of this prior. We also centered the covariates about their means to improve convergence. In the MCMC simulation, 25,000 samples were discarded as burn-in, and of the next 75,000 samples, we used every third value to construct the posterior estimates. Convergence was assessed visually by monitoring the dynamic traces of the Gibbs iterations and by computing the Gelman-Rubin convergence statistic (Gelman and Rubin, 1992). To check for sensitivity, we ran the proposed model with different sets of priors and found little evidence of prior sensitivity, although slow mixing was evident in the analysis using a highly diffuse prior.
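The burn-in, thinning, and convergence check described above can be illustrated with a short sketch. The draws below are synthetic stand-ins for the Gibbs output of one parameter, not results from the study, and the function names are hypothetical. The statistic is the basic potential scale reduction factor of Gelman and Rubin (1992), without later refinements.

```python
import numpy as np

def thin_after_burn_in(chain, burn_in=25_000, keep=75_000, step=3):
    """Drop the burn-in draws, then keep every `step`-th of the next `keep` draws."""
    chain = np.asarray(chain)
    return chain[burn_in:burn_in + keep:step]

def gelman_rubin(chains):
    """Potential scale reduction factor for an array of shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    n = chains.shape[1]
    within = chains.var(axis=1, ddof=1).mean()        # W: mean within-chain variance
    between = n * chains.mean(axis=1).var(ddof=1)     # B: between-chain variance
    pooled = (n - 1) / n * within + between / n       # pooled variance estimate
    return float(np.sqrt(pooled / within))

rng = np.random.default_rng(0)
raw = rng.normal(size=(2, 100_000))                   # synthetic stand-in for two chains
kept = np.stack([thin_after_burn_in(c) for c in raw]) # 25,000 retained draws per chain
r_hat = gelman_rubin(kept)
```

Values of the statistic close to 1 are consistent with convergence, while chains that settle at different levels produce values well above 1.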

4.3 Analytical Results

Of the 282 subjects enrolled into the study, 91% were African American. Enrollment age ranged from 14 to 17 with a mean of 15 years and a standard deviation of 1.1 years. Lifetime number of partners reported at the time of enrollment ranged from zero to 28 with a mean of 2.85 (median 2) and a standard deviation of 3.8. Forty-four of the study subjects (15.6%) had a history of STD infection.

Table 1 reports the posterior mean, median, standard deviation (SD), and 95% credible interval (CI) for the parameters in the ZIP regression model. Similarly, the parameter estimates for the dropout and time-varying covariate models are reported in Table 2.

Table 1. Parameter Estimates of the ZIP Regression Models

Parameter           Mean      Median    SD        95% C.I.

Zero-inflated
Logit
β11p (Intercept)    0.4463    0.4402    0.102     (0.046, 1.69)
β12p (AGE)          0.6874    0.7155    0.249     (0.0864, 1.18)
β13p (STD)          0.2583    0.2445    0.2892    (0.0949, 1.923)
β14p (PTR)          0.375     0.337     0.118     (0.1902, 0.462)
β21p (MOOD)         -0.2318   -0.2305   0.077     (-0.3904, -0.0908)
β22p (SI)           0.5769    0.5758    0.4489    (-0.295, 1.55)
β31p (AR(1))        1.306     1.288     0.173     (1.029, 1.713)

Log-linear
β11λ (Intercept)    0.0296    0.0302    0.0035    (-0.0722, 0.1259)
β12λ (AGE)          -0.0251   -0.027    0.14      (-0.031, 0.0306)
β13λ (STD)          0.2307    0.2383    0.258     (-0.3383, 0.6504)
β14λ (PTR)          0.0325    0.0293    0.024     (-0.0101, 0.0802)
β21λ (MOOD)         0.1034    0.105     0.026     (0.0572, 0.1493)
β22λ (SI)           0.4193    0.3945    0.2482    (0.1181, 0.906)
β31λ (AR(1))        0.0355    0.03257   0.00882   (0.0239, 0.0567)

Table 2. Parameter Estimates for the Dropout Model and Time-varying Covariates

Parameter           Mean      Median    SD        95% C.I.

Dropout parameter
ζ1 (Intercept)      0.2651    0.2809    0.8061    (-0.414, 1.79)
ζ2 (AGE)            0.7953    0.7694    0.1051    (0.6348, 1.032)
ζ3 (STD)            0.0392    0.0296    0.16      (0.013, 1.319)
ψ1 (Current obs)    1.977     1.97      0.2052    (1.571, 2.391)
ψ2 (Prev. Obs)      -0.4828   -0.4806   0.1444    (-0.7759, -0.2086)

MOOD (Covariate)
α01 (Intercept)     3.556     3.65      0.182     (2.692, 4.95)
α11 (AR(1))         0.0955    0.0954    0.084     (0.0788, 0.1116)

SI (Covariate)
α02 (Intercept)     1.138     1.137     0.04707   (1.047, 1.232)
α12 (AR(1))         0.2908    0.291     0.0103    (0.2698, 0.3117)

The ZIP regression analysis yielded a number of observations: (1) Older age was associated with an increased probability of the subject being in the sexually active state (odds ratio (OR) = exp(\(\hat{\beta}_{12}^{p}\)) = 1.99, 95% CI = [exp(0.0864), exp(1.180)] = [1.09, 3.25]), although an increase in age did not necessarily increase the weekly rate of coital frequency given that the subject was in a sexually active state. (2) Baseline STD history was a strong indicator of the subject's state of sexual activity (OR = exp(\(\hat{\beta}_{13}^{p}\)) = exp(0.2583) = 1.29, 95% CI = [1.10, 6.84]). A young woman who had a positive STD history at enrollment was more likely to be in the sexually active state during the study period. However, STD history did not appear to affect the rate of coital events. (3) The lifetime number of partners that the subject reported at baseline was also positively associated with the probability of being in a sexually active state (OR = exp(\(\hat{\beta}_{14}^{p}\)) = 1.45, 95% CI = [1.21, 1.59]). Again, no similar effect was observed for the rate of coital frequency. (4) Lower positive mood was associated with an increased probability of being in a sexually active state (OR = exp(\(\hat{\beta}_{21}^{p}\)) = 0.79, 95% CI = [0.68, 0.91]); but for a subject who was in the sexually active state, higher positive mood was associated with increased coital frequency (incidence rate ratio or IRR = exp(\(\hat{\beta}_{21}^{\lambda}\)) = 1.11, 95% CI = [1.06, 1.16]). (5) Higher sexual interest was associated with increased coital frequency (IRR = exp(\(\hat{\beta}_{22}^{\lambda}\)) = 1.52, 95% CI = [1.13, 2.47]). (6) Coital frequency in the prior week was associated with both an increased probability of being in a sexually active state (OR = exp(\(\hat{\beta}_{31}^{p}\)) = 3.69, 95% CI = [2.80, 5.55]) and increased coital frequency in the current week (IRR = exp(\(\hat{\beta}_{31}^{\lambda}\)) = 1.04, 95% CI = [1.02, 1.06]). (7) From Figures 3 and 4, it is evident that both the probability of being in a sexually active state and the rate of coital frequency given that the subject was in an active state exhibit nonlinear time effects, and the effects vary from subject to subject. In particular, the rate parameter was monotonically increasing over time, suggesting either a developmental effect of sexuality in adolescents or a possible activation effect of repeated questionnaires. For example, Figure 3 shows that in the study cohort, the probability of a subject being in a sexually active state is not entirely monotone, but the intensity of sexual activities is steadily increasing over time. (8) Finally, we noted that the correlation between the random subject effects in the logistic and loglinear models was modest (0.29), suggesting a relatively weak positive link between the subject's current state of sexual activity and the intensity of her sexual behaviors given that she was in an active state. This last observation demonstrates the usefulness of the proposed joint modeling structure in assessing the interrelationship among latent states, which may be particularly useful in the analysis of behavioral data from a variety of fields.
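The odds ratios and incidence rate ratios quoted above are obtained by exponentiating the posterior means and credible limits in Table 1. A minimal sketch of that arithmetic, using a few rows of Table 1 and an illustrative helper name, is:

```python
import math

def exp_effect(mean, lower, upper):
    """Exponentiate a coefficient and its interval limits, rounded to 2 decimals."""
    return tuple(round(math.exp(v), 2) for v in (mean, lower, upper))

# (posterior mean, lower limit, upper limit) copied from Table 1
table1_rows = {
    "AGE, logit part (OR)": (0.6874, 0.0864, 1.18),
    "MOOD, logit part (OR)": (-0.2318, -0.3904, -0.0908),
    "MOOD, log-linear part (IRR)": (0.1034, 0.0572, 0.1493),
    "AR(1), logit part (OR)": (1.306, 1.029, 1.713),
}

for label, (mean, lower, upper) in table1_rows.items():
    estimate, low, high = exp_effect(mean, lower, upper)
    print(f"{label}: {estimate} [{low}, {high}]")
```

Because the exponential function is monotone, transforming the two credible limits directly gives a valid interval on the ratio scale.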

Figure 3. Spline Estimates of the Average Time Effects on logit p and log λ.

Figure 4. Spline Estimates of Individual Time Effects on logit p and log λ of Four Individual Subjects.

Similarly, from the estimates of parameters in the dropout and time-varying covariate models (Table 2), we had the following observations: (1) The dropout probability appeared to be related to the baseline covariates AGE, STD, and current and previous coital frequency values. (2) The estimates of the parameters ψ1, ψ2 of the dropout models were 1.977 and -0.4828 respectively, suggesting that dropout might be informative and the missing probability of yij might depend more on the current values of the coital frequency and less on the previous value. Thus, any statistical analysis that ignores the dropout may be biased. (3) Older subjects and those with an STD history were more likely to drop out, perhaps due to competing demands for time in older teens. (4) Both mood and sexual interest measures of the current week were correlated with their corresponding values of the previous week, suggesting continuity in the adolescent mood and sexual interest.

4.4 Model Comparison

To compare candidate models, we computed p(Yobs,i, Tobs,i, Ri|Yobs,−i, Tobs,−i, R−i) (Geisser and Eddy, 1979), which is the posterior predictive density of (Yobs,i, Tobs,i, Ri) for subject i conditional on the observed data with that subject's data deleted. This value is known as the conditional predictive ordinate (CPO; Gelfand, Dey, and Chang, 1992; Chen et al., 2000) and has been widely used for model diagnostics and assessment.

For the ith subject, the CPO statistic under model Ml, 1 ≤ l ≤ L, is defined as:

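\[
\mathrm{CPO}_i = p(Y_{obs,i}, T_{obs,i}, R_i \mid Y_{obs,-i}, T_{obs,-i}, R_{-i}) = E_{\theta_l}\!\left[\, p(Y_{obs,i}, T_{obs,i}, R_i \mid \theta_l) \,\middle|\, Y_{obs,-i}, T_{obs,-i}, R_{-i} \right]
\]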

where −i denotes the exclusion of the data from subject i. The θl is the set of parameters of model Ml and p(Yobs,i, Tobs,i, Ri|θl) is the sampling density of the model evaluated at the ith observation. The expectation above is taken with respect to the posterior distribution of the model parameters θl given the cross-validated data (Yobs,−i, Tobs,−i, R−i). For subject i, the CPOi can be obtained from the MCMC samples by computing the following harmonic mean of the sampling density over the posterior draws:

\[
\widehat{\mathrm{CPO}}_i = \left( \frac{1}{M} \sum_{m=1}^{M} \frac{1}{f(Y_{obs,i}, T_{obs,i}, R_i \mid \theta_l^{(m)})} \right)^{-1}
\]

where M is the number of simulations and \(\theta_l^{(m)}\) denotes the parameter samples at the mth iteration. A larger CPO value indicates a better fit. A useful summary statistic of the CPOi is the logarithm of the pseudomarginal likelihood (LPML), defined as \(\mathrm{LPML} = \sum_{i=1}^{n} \log(\widehat{\mathrm{CPO}}_i)\). Models with greater LPML values represent a better fit. LPML is well defined under the posterior predictive density and it is computationally stable. LPML has been used extensively in Bayesian analysis for model selection among both simpler and more complicated models and has a long history in the statistics literature (see Chen et al., 2000, Chapter 10; Brown and Ibrahim, 2003; Brown, Ibrahim, and DeGruttola, 2005).
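As an illustration of the CPO and LPML formulas, the sketch below computes both from a matrix of log sampling densities, one row per retained MCMC draw and one column per subject. The function names and the input layout are assumptions made for this example; the computation is done on the log scale to avoid overflow when the densities are very small.

```python
import numpy as np

def log_cpo(loglik):
    """Log CPO per subject from an (M draws, n subjects) array of log densities.

    Implements log CPO_i = -log( (1/M) * sum_m exp(-loglik[m, i]) ).
    """
    neg = -np.asarray(loglik, dtype=float)
    shift = neg.max(axis=0)                    # stabilizes the exponentials
    log_mean_inverse = shift + np.log(np.mean(np.exp(neg - shift), axis=0))
    return -log_mean_inverse

def lpml(loglik):
    """Log pseudomarginal likelihood: the sum of log CPO over subjects."""
    return float(np.sum(log_cpo(loglik)))
```

For example, if a subject's sampling density equals 0.5 in one draw and 0.25 in another, the estimated CPO is the harmonic mean 2/(2 + 4) = 1/3.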

We compared the following models using LPML:

  • Model 1: This is the model that we used in the analysis.

  • Model 2: Model 1 without the spline components in ZIP model, i.e., the splines are replaced by a linear time effect (tij).

  • Model 3: Independent model, i.e., the ZIP model is independent of the dropout process.

We then considered several dropout models, keeping other parts of the model unchanged:

  • Model 4: Logit(ηij) = ξ1 + ξ2AGEi + ξ3STDi + ψ1yij

  • Model 5: Logit(ηij) = ξ1 + ψ1yij + ψ2yi,j−1

  • Model 6: Logit(ηij) = ξ1 + ψ1yij

The LPML values for Models 1-6 were −10405.7, −12198.4, −11201.8, −11086.1, −11066.9, and −11132.5, respectively. The proposed model had the highest LPML value, suggesting that it had the best fit among the six candidate models. The large difference between the LPML values of Models 1 and 2 indicated the presence of a nonlinear time effect, and justified the use of the spline-based model for time effects in the analysis.
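The comparison can be reproduced directly from the reported LPML values. The short sketch below ranks the six models and computes each model's gap to the best one; the variable names are illustrative.

```python
# LPML values as reported in the text, keyed by model number
lpml_values = {1: -10405.7, 2: -12198.4, 3: -11201.8,
               4: -11086.1, 5: -11066.9, 6: -11132.5}

best = max(lpml_values, key=lpml_values.get)                      # highest LPML
ranking = sorted(lpml_values, key=lpml_values.get, reverse=True)  # best to worst
gaps = {m: round(lpml_values[best] - v, 1) for m, v in lpml_values.items()}
```

The gap for Model 2, the version with a linear time effect, is by far the largest, which is the basis for the statement about the nonlinear time effect.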

4.5 Simulation

In this section, we present a small simulation study to justify the relative complexity of the proposed model and to verify the performance of the model fitting procedure. We first note that the complexity of the model arises primarily from four aspects: (1) explicit modeling of autoregressive effect of the main outcome variable; (2) explicit inclusion of time-varying covariates; (3) spline-based modeling of non-linear time effects; and (4) accommodation of nonignorable dropout. While it is well known that failure to accommodate informative dropouts may lead to questionable inference (Wu and Carroll, 1988; Schluchter, 1992; Little, 1995; Roy and Lin, 2005; Wu, 2007), the impact of inattention to the first three complicating factors has not been well studied. We therefore focus on these three issues in the simulation study. Additionally, the simulation study has also given us a chance to verify the performance of our model fitting procedure.

Specifically, we consider the following model:

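\[
\begin{aligned}
\operatorname{logit}(1-p_{ij}) &= \beta_{11}^{p} + \beta_{13}^{p} X_i + \beta_{21}^{p} Z_{ij} + \beta_{31}^{p} Y_{i,j-1} + b_{i1} + f_p(t_{ij}),\\
\log(\lambda_{ij}) &= \beta_{11}^{\lambda} + \beta_{13}^{\lambda} X_i + \beta_{21}^{\lambda} Z_{ij} + \beta_{31}^{\lambda} Y_{i,j-1} + b_{i2} + f_\lambda(t_{ij}),
\end{aligned}
\qquad (19)
\]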

where we use \(f_p(t) = \tfrac{1}{2}\cos^2\!\left(\tfrac{t+12}{12}\right)\) and \(f_\lambda(t) = 0.6\,\sin^2\!\left(\tfrac{t-3}{12}\right)\), for t = 1, 2, …, 24, to depict the nonlinear time effects (see Figure 5). Also in this model we consider a subject-specific covariate Xi, random intercepts bi = (bi1, bi2)T, as well as a time-varying covariate Zij, where i = 1, 2, …, 50, j = 1, 2, …, 24.

Figure 5. The True and Estimated Average Time Effects on logit p and log λ from the Simulation Study.

Note: Solid line represents the true curves and dashed line represents the estimated curves.

Data were generated from (19) to mimic the real data presented in the paper. Specifically, for the ith subject, we first generated Xi from a Bernoulli distribution with probability p0(x). For the same subject, we then generated a 24-dimensional vector Zi = (Zi1, …, Zi24)T ∼ MVN(μZ, ΣZ) to represent the values of the time-varying covariate Zij at the 24 time points. We gave ΣZ an AR(1) variance-covariance structure to maintain a correlation between the Z values in adjacent weeks within the subject. Similarly, random intercepts bi = (bi1, bi2)T were generated from a bivariate normal distribution. We then calculated the values of \(f_p(t_{ij}) = \tfrac{1}{2}\cos^2\!\left(\tfrac{t_{ij}+12}{12}\right)\) and \(f_\lambda(t_{ij}) = 0.6\,\sin^2\!\left(\tfrac{t_{ij}-3}{12}\right)\) at each time point. Finally, we generated the baseline value Yi1 from ZIP(p1, λ1). Models in (19) were then used to calculate pi2 and λi2. From pi2 and λi2, we generated Yi2 ∼ ZIP(pi2, λi2). We then repeated this last step to generate the rest of the Y values. Parameter values used in the simulation were chosen to produce data that are similar to the real data. In particular, we took β11p = 0.45, β13p = 0.25, β21p = −0.21, β31p = 1.3 and β11λ = 0.1, β13λ = 0.25, β21λ = 0.2, β31λ = 0.04. One hundred simulated data sets were used in the simulation study.
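The data-generating steps above can be sketched as follows for a single simulated data set. Several inputs are assumptions made only so that the sketch runs: the values of p0, μZ, the variance and correlation of ΣZ, the covariance of the random intercepts, and the baseline (p1, λ1) are not reported in the text, and the time-effect curves are taken as f_p(t) = 0.5·cos²((t + 12)/12) and f_λ(t) = 0.6·sin²((t − 3)/12). The coefficient of Z in the logit part is taken as −0.21, the value listed in Table 3.

```python
import numpy as np

rng = np.random.default_rng(2008)
n_subjects, n_weeks = 50, 24
t = np.arange(1, n_weeks + 1)

# Assumed forms of the nonlinear time effects
f_p = 0.5 * np.cos((t + 12) / 12) ** 2
f_lam = 0.6 * np.sin((t - 3) / 12) ** 2

# Coefficients: intercept, X, Z, and lag-1 outcome
beta_p = {"b0": 0.45, "x": 0.25, "z": -0.21, "lag": 1.3}
beta_lam = {"b0": 0.1, "x": 0.25, "z": 0.2, "lag": 0.04}

# Hypothetical settings not reported in the text
p0, mu_z, sigma_z, rho = 0.5, 0.0, 1.0, 0.5
cov_b = np.array([[0.04, 0.01], [0.01, 0.04]])
p1, lam1 = 0.5, 1.0

def draw_zip(p_zero, lam):
    """Structural zero with probability p_zero, otherwise a Poisson(lam) draw."""
    return 0 if rng.random() < p_zero else int(rng.poisson(lam))

def predictor(beta, x, z, y_prev, b, f):
    """Linear predictor of model (19) for one subject-week."""
    return beta["b0"] + beta["x"] * x + beta["z"] * z + beta["lag"] * y_prev + b + f

# AR(1) covariance: sigma^2 * rho^|j - k|
cov_z = sigma_z ** 2 * rho ** np.abs(t[:, None] - t[None, :])

Y = np.zeros((n_subjects, n_weeks), dtype=int)
for i in range(n_subjects):
    x = rng.binomial(1, p0)
    z = rng.multivariate_normal(np.full(n_weeks, mu_z), cov_z)
    b = rng.multivariate_normal(np.zeros(2), cov_b)
    Y[i, 0] = draw_zip(p1, lam1)                      # baseline count
    for j in range(1, n_weeks):
        eta_p = predictor(beta_p, x, z[j], Y[i, j - 1], b[0], f_p[j])
        eta_l = predictor(beta_lam, x, z[j], Y[i, j - 1], b[1], f_lam[j])
        active = 1.0 / (1.0 + np.exp(-eta_p))         # 1 - p_ij
        Y[i, j] = draw_zip(1.0 - active, np.exp(eta_l))
```

Repeating this loop with different seeds would give the replicate data sets to which the competing models are fitted.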

Using the generated data, we fitted our proposed semiparametric ZIP regression model as well as the ZIP regression model with a linear time effect (i.e., without splines for the time effect). Results are presented in Table 3. We computed the “relative bias” (RB), which is defined as the ratio of the bias to the absolute value of the true parameter, the mean square error (MSE), and the coverage probability (CP). The numbers in parentheses in Table 3 are the true values of the parameters.

Table 3. Simulation Results

                  Model with Spline                Linear Model
Parameter         Mean    RB      MSE     CP       Mean    RB      MSE     CP

β11p (0.45)       0.47    -0.02   0.042   .94      0.43    0.03    0.048   .93
β12p (0.23)       0.25    0.02    0.023   .92      0.21    0.02    0.037   .91
β13p (-0.21)      -0.2    -0.03   0.029   .96      -0.13   0.063   1.89    .89
β14p (1.3)        1.37    -0.02   0.047   .95      1.1     0.05    0.1     .92

β11λ (0.1)        0.1     0.01    0.041   .94      0.12    0.01    0.052   .94
β12λ (0.25)       0.23    0.02    0.08    .92      0.2     0.04    0.11    .90
β13λ (0.2)        0.22    0.02    0.06    .97      0.08    0.07    1.46    .88
β14λ (0.004)      0.004   0.01    0.032   .97      0.004   0.03    0.04    .98
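The three summary measures reported in Table 3 can be computed per parameter from the replicate estimates and their interval limits. The sketch below follows the definition of relative bias given in the text (bias divided by the absolute true value); the function name and the input layout, one entry per simulated data set, are assumptions for illustration.

```python
import numpy as np

def simulation_summary(estimates, lower, upper, truth):
    """Mean, relative bias, MSE, and coverage probability for one parameter."""
    estimates = np.asarray(estimates, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    bias = estimates.mean() - truth
    covered = (lower <= truth) & (truth <= upper)   # interval contains the truth
    return {
        "mean": float(estimates.mean()),
        "RB": float(bias / abs(truth)),
        "MSE": float(np.mean((estimates - truth) ** 2)),
        "CP": float(covered.mean()),
    }
```

For instance, three replicate estimates 0.4, 0.5, and 0.6 of a true value 0.45 give a mean of 0.5 and a relative bias of 0.05/0.45, about 0.11.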

A number of observations can be made from the simulation results: (1) The proposed method is able to produce accurate estimates of the model parameters, with minimal bias, small MSE, and good coverage probabilities. (2) In the presence of nonlinear time effects, traditional ZIP regression models with a linear time effect often produce biased estimates, larger MSEs, and substantially lower coverage probabilities for the time-varying covariates, although other covariates appear to be spared from such effects of the model mis-specification. Figure 5 clearly shows that the proposed model is able to recover the unobserved nonlinear time effects reasonably well. (3) Finally, we observed from the simulation that the proposed semiparametric regression models had larger LPML values than their parametric counterparts, suggesting improved fit of the new model. Based on these observations, we contend that the models used in the analysis perform well in the modeling of zero-inflated behavioral counts. Despite the increased complexity, the new analysis provides a safeguard against potential effects of mis-specification of the time effects, thus preventing the occurrence of large biases in the estimation of time-varying effects.

5 Discussion

Sexually transmitted infections (STI) are spread primarily through sexual intercourse. A young woman is at risk of STI once she becomes sexually active. Yet little is known about the contextual factors that are associated with the occurrence of coitus, and the temporal patterns in which adolescent sexual behaviors evolve. This study is perhaps the first longitudinal examination of these issues based on a sizable cohort. A major strength of this investigation is the inclusion of young teens who were still in their early years of sexual experience. Other strengths of the study include the longitudinal follow-up and the extensive behavioral information that was collected in the process.

Contrary to the alarming anecdotes reported by the lay press in recent years, the findings of this paper reveal a more complicated picture: Sexual behaviors in adolescents are influenced strongly by intrinsic factors such as mood and sexual interest, rather than being driven completely by circumstances over which the teens have little control. This gives us reason to believe that more effective education and promotion of self-protective behaviors might help to reduce the risk of disease transmission. It also suggests that future prevention strategies should take into account the emotional needs of the teens. The increasing levels of sexual activity over time are not surprising, but their individual-specific patterns seem to suggest that there are no uniformly followed patterns in terms of sexual behavioral development. Considering the age range of our study participants (14 to 17 years), we believe that intervention measures must start early to be effective.

Methodologically, the most challenging aspect in the modeling of human behavior is perhaps the incorporation of relevant contextual information. In studies of STD epidemiology and human sexuality, this contextual information often includes the concurrent mood, sexual interest, prior behavior, and subtle time effects that cannot be dismissed. Along the same line, issues such as dropout and autocorrelations existing among the time-varying covariates also complicate the analysis. These various factors form an interactive system in which the behaviors of interest are influenced by the other factors, which in turn are influenced by the observed behaviors. Therefore, a unidimensional modeling approach with a narrower focus often fails to capture the full complexity of the situation, and may produce an overly simplistic depiction of the behavior.

To address these shortcomings, we propose a new analytical framework that takes into account most of the major components in the modeling of human behaviors. Our joint model is a complicated system, but it is also necessary in order to place the behavioral event in its original context. Such an approach is likely to help investigators achieve a more comprehensive understanding of the studied behavior. As an applied statistical tool, this method is motivated by a real epidemiological investigation. Although the data analysis that we presented in this paper is preliminary in nature due to the fact that the full data are still being collected, the initial results are promising and they have revealed some previously under-appreciated characteristics of adolescent behavior. As a result, we feel that the basic construction of the model might be appropriate for other longitudinal studies as well. Although we recognize that this is an initial step in seeking a more comprehensive solution, our effort has demonstrated the feasibility of this general strategy. Preliminary results from the simulation study have provided assurance of the modeling procedure. Additionally, it has also highlighted the potential pitfalls of using mis-specified parametric ZIP regression models.

Technically, the model employs some of the more recent developments in statistical methodology. The proposed joint model is flexible and new in several aspects: (1) It represents a semiparametric development of the ZIP model. The semiparametric approach is useful, particularly when the linearity of the time effect is in question. (2) It incorporates the dropout process in the ZIP model. Without the accommodation of dropout, models may produce biased results. (3) It takes into account time-varying covariates. (4) It accommodates the autocorrelation structure among behaviors and time-varying covariates. Our joint modeling approach deals with both missing responses and missing covariates simultaneously and is built to borrow strength from each of the modeling components. Through a real application, we have demonstrated that the joint modeling increased the LPML values and resulted in a better fit of the data.

A few limitations of our methods must be underlined. One of the major issues is the robustness of the distributional assumption. In our application, we use a parametric normal distribution for the random effects. A broader class of distributions, such as Dirichlet processes, may be a viable alternative. Another issue is the fixed knot points. Random knot points would be more flexible; however, such models would be numerically challenging in this setup. Third, the parametric dropout model may be overly simplistic. Given a sufficient number of dropouts, one can build a more complex parametric or semiparametric structure as suggested by Chen and Ibrahim (2006). It will be worthwhile to see if more complex modeling of the dropout and the use of robust random effects would improve the goodness-of-fit and change the results of the analysis. We are currently exploring these aspects of the modeling. Notwithstanding these limitations, this research has pointed to a new road map for the analysis of longitudinally measured behavioral data.

Acknowledgments

The authors wish to thank Dr. J Dennis Fortenberry, MD, for his many insightful comments on the manuscript. The research is supported by NIH grants RO1 HD042404 and U19 AI031494-14. The authors thank the Editor, Associate Editor, and referees for their valuable suggestions.

References

  1. Böhning D, Dietz E, Schlattmann P, Mendonca L, Kirchner U. The zero-inflated Poisson model and the decayed, missing and filled teeth index in dental epidemiology. Journal of the Royal Statistical Society: Series A. 1999;162:195–209.
  2. Brown ER, Ibrahim JG. Bayesian Approaches to Joint Cure-Rate and Longitudinal Models with Applications to Cancer Vaccine Trials. Biometrics. 2003;59:686–693. doi: 10.1111/1541-0420.00079.
  3. Brown ER, Ibrahim JG, DeGruttola V. A Flexible B-spline Model for Multiple Longitudinal Biomarkers and Survival. Biometrics. 2005;61:64–73. doi: 10.1111/j.0006-341X.2005.030929.x.
  4. Chen Q, Ibrahim JG. Semiparametric Models for Missing Covariate and Response Data in Regression Models. Biometrics. 2006;62:177–184. doi: 10.1111/j.1541-0420.2005.00438.x.
  5. Chen MH, Shao QM, Ibrahim JG. Monte Carlo Methods in Bayesian Computation. Springer-Verlag; New York: 2000.
  6. Cheung YB. Zero-inflated models for regression analysis of count data: a study of growth and development. Statistics in Medicine. 2002;21:1461–1469. doi: 10.1002/sim.1088.
  7. Dagne GA. Hierarchical Bayesian Analysis of Correlated Zero-inflated Count Data. Biometrical Journal. 2004;46:653–663.
  8. Diggle P, Heagerty P, Liang KY, Zeger S. Analysis of Longitudinal Data. Oxford University Press; 2002.
  9. Follmann D, Wu M. An approximate generalized linear model with random effects for informative missing data. Biometrics. 1995;51:151–168.
  10. Fortenberry J, Temkit M, Tu W, Graham C, Katz B, Orr D. Daily mood, partner support, sexual interest, and sexual activity among adolescent women. Health Psychology. 2005;24:252–257. doi: 10.1037/0278-6133.24.3.252.
  11. Fortenberry J, Katz B, Blythe M, Juliar B, Tu W, Orr D. Factors associated with time of day of sexual activity among adolescent women. Journal of Adolescent Health. 2006;38:275–281. doi: 10.1016/j.jadohealth.2005.02.006.
  12. Geisser S, Eddy W. A Predictive Approach to Model Selection. Journal of the American Statistical Association. 1979;74:153–160.
  13. Gelfand AE, Dey DK, Chang H. Model determination using predictive distributions with implementation via sampling-based methods (with discussion). In: Bernardo JM, Berger JO, Dawid AP, Smith AFM, editors. Bayesian Statistics. Vol. 4. Oxford: Oxford University Press; 1992. pp. 147–159.
  14. Gelfand AE, Smith AFM. Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association. 1990;85:398–409.
  15. Gelman A, Rubin D. Inference from iterative simulation using multiple sequences. Statistical Science. 1992;7:457–472.
  16. Ghosh SK, Mukhopadhyay P, Lu JC. Bayesian analysis of zero-inflated regression models. Journal of Statistical Planning and Inference. 2006;136:1360–1375.
  17. Hall DB. Zero-inflated Poisson and binomial regression with random effects: A case study. Biometrics. 2000;56:1030–1039. doi: 10.1111/j.0006-341x.2000.01030.x.
  18. Hall DB, Zhang Z. Marginal models for zero inflated clustered data. Statistical Modelling. 2004;4:161–180.
  19. He X, Fung WK, Zhu Z. Robust Estimation in Generalized Partial Linear Models for Clustered Data. Journal of the American Statistical Association. 2005;100:1176–1184.
  20. Lambert D. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics. 1992;34:1–14.
  21. Lu SE, Lin Y, Shih WJ. Analyzing excessive no changes in clinical trials with clustered data. Biometrics. 2004;60:257–267. doi: 10.1111/j.0006-341X.2004.00155.x.
  22. Min Y, Agresti A. Random effect models for repeated measures of zero-inflated count data. Statistical Modelling. 2005;5:1–19. doi: 10.1177/1471082X0901000404.
  23. Ruppert D, Wand MP, Carroll RJ. Semiparametric Regression. Cambridge Series in Statistical and Probabilistic Mathematics (No. 12). Cambridge University Press; 2003.
  24. Roy J, Lin X. Missing Covariates in Longitudinal Data With Informative Dropouts: Bias Analysis and Inference. Biometrics. 2005;61:837–846. doi: 10.1111/j.1541-0420.2005.00340.x.
  25. Schluchter MD. Methods for the analysis of informatively censored longitudinal data. Statistics in Medicine. 1992;11:1861–1870. doi: 10.1002/sim.4780111408.
  26. Shah A, Laird N, Schoenfeld D. A random-effects model for multiple characteristics with possibly missing data. Journal of the American Statistical Association. 1997;92:775–779.
  27. Spiegelhalter D, Thomas A, Best N, Lunn D. WinBUGS User Manual, Version 1.4. MRC Biostatistics Unit, Institute of Public Health and Department of Epidemiology & Public Health, Imperial College School of Medicine; 2005. Available at http://www.mrc-bsu.cam.ac.uk/bugs.
  28. Weinstock H, Berman S, Cates W Jr. Sexually transmitted diseases among American youth: Incidence and prevalence estimates, 2000. Perspectives on Sexual and Reproductive Health. 2004;36:6–10. doi: 10.1363/psrh.36.6.04.
  29. Wu L. HIV viral dynamic models with dropouts and missing covariates. Statistics in Medicine. 2007;26:3342–3357. doi: 10.1002/sim.2816.
  30. Wu MC, Carroll RJ. Estimation and comparison of changes in the presence of informative right censoring by modeling the censoring process. Biometrics. 1988;44:175–188.
  31. Yau KKW, Lee AH. Zero-inflated Poisson regression with random effects to evaluate an occupational injury prevention programme. Statistics in Medicine. 2001;20:2907–2920. doi: 10.1002/sim.860.
  32. Zhao Y, Staudenmayer J, Coull BA, Wand MP. General Design Bayesian Generalized Linear Mixed Models. Statistical Science. 2006;21:35–51.
