Author manuscript; available in PMC: 2020 Apr 30.
Published in final edited form as: Stat Med. 2018 Nov 28;38(9):1634–1650. doi: 10.1002/sim.8051

Bayesian adaptive group lasso with semiparametric hidden Markov models

Kai Kang 1, Xinyuan Song 1,2, X Joan Hu 3, Hongtu Zhu 4,5
PMCID: PMC6445704  NIHMSID: NIHMS1001724  PMID: 30484887

Abstract

This paper presents a Bayesian adaptive group least absolute shrinkage and selection operator method to conduct simultaneous model selection and estimation under semiparametric hidden Markov models. We specify the conditional regression model and the transition probability model in the hidden Markov model into additive nonparametric functions of covariates. A basis expansion is adopted to approximate the nonparametric functions. We introduce multivariate conditional Laplace priors to impose adaptive penalties on regression coefficients and different groups of basis expansions under the Bayesian framework. An efficient Markov chain Monte Carlo algorithm is then proposed to identify the nonexistent, constant, linear, and nonlinear forms of covariate effects in both conditional and transition models. The empirical performance of the proposed methodology is evaluated via simulation studies. We apply the proposed model to analyze a real data set that was collected from the Alzheimer’s Disease Neuroimaging Initiative study. The analysis identifies important risk factors on cognitive decline and the transition from cognitive normal to Alzheimer’s disease.

Keywords: linear basis expansion, Markov chain Monte Carlo, simultaneous model selection and estimation

1 ∣. INTRODUCTION

Hidden Markov models (HMMs) have been widely used in the medical, behavioral, social, environmental, and psychological sciences where longitudinal data are frequently collected.1-6 Basically, HMMs are designed to have two parts: a transition model to investigate the effects of covariates on the dynamic transition process of hidden states and a conditional regression model to examine state-specific covariate effects on the response of interest. In these two parts, the effect of a covariate on the response or on the transition process can be nonexistent, constant, linear, or nonlinear. Identifying the specific forms of such covariate effects is useful not only in achieving a parsimonious model but also in obtaining enhanced parameter estimation and attractive interpretations.

Conventional studies on HMMs have focused on a parametric framework, wherein the forms of covariate effects on responses and/or on transition probabilities are prespecified. However, one fundamental issue overlooked by these parametric HMMs is that the complex relationships among variables are seldom known a priori, and the parametric form is thus too restrictive to correctly reflect the reality. Several nonparametric approaches have been investigated recently to relax the parametric assumption of HMMs. Yau et al7 developed a Bayesian nonparametric HMM, where the sampling distribution of the observations at each state was assumed unknown and modeled via a mixture of Dirichlet processes. Although their method did not rely on the distributional assumption of the observed process, it cannot reveal the functional effects of potential explanatory variables on the outcome of interest. Song et al8 considered Bayesian P-splines for describing the nonparametric relation among latent variables in HMMs, but they did not consider the model selection problem.

Model selection is an important issue beyond estimation in the application of HMMs. Classical model selection methods are mainly developed on the basis of pairwise comparisons through common model selection criteria, such as the Akaike information criterion and the Bayesian information criterion. However, such a pairwise procedure becomes increasingly computationally demanding as the search dimension grows. An appealing alternative is to adopt least absolute shrinkage and selection operator (lasso)–type variable selection techniques. Choi et al9 applied lasso to correlated HMMs to detect the important parameters in transition models. Städler and Mukherjee10 introduced L1 penalization to obtain a sparse HMM with state-specific graphical models. However, the preceding studies consider only parametric HMMs. Recently, some variants of lasso, such as group lasso, adaptive lasso, and adaptive group lasso, have been developed to manage group variables and to address the appreciable bias from which lasso and group lasso may suffer. Owing to the computational efficiency and stability of the Bayesian approach, the Bayesian analogs of lasso and its variants have been proposed.11,12 However, the available Bayesian lasso-type methods are all developed in the context of cross-sectional models without between-state transitions, thereby making them inapplicable to the proposed semiparametric HMMs.

In this paper, we propose a Bayesian adaptive group lasso (BaGlasso) procedure to conduct simultaneous model selection and estimation for semiparametric HMMs. With the use of basis expansion and appropriate penalties, the nonparametric relationships that subsume nonexistent, constant, linear, and nonlinear relationships between covariates and the response can be automatically identified. The proposed procedure has the following appealing features. First, the group effects and additional correlation within the basis expansion are well addressed by the group lasso, thus ensuring estimation accuracy. Second, adaptive penalties imposed on different groups of coefficients enable us to achieve an efficient variable selection. Finally, the proposed procedure avoids tedious pairwise comparisons among competing models with different combinations of covariates in the conditional and transition models. This entirely data-driven feature not only relaxes the dependence on experts’ knowledge in empirical studies but also reduces the computational burden. To the best of our knowledge, this study is the first to introduce a Bayesian lasso-type procedure into semiparametric HMMs.

The proposed method is motivated by a real study conducted by the Alzheimer’s Disease Neuroimaging Initiative (ADNI). A set of biomarkers, namely, gender, age, educational levels, marital status, hippocampal volume, and apolipoprotein E (APOE)-ϵ4, is collected across several time points in this data set. The purpose of this study is to detect the potential risk factors of Alzheimer’s disease (AD) from two perspectives. First, considering that the pathology of AD usually evolves from cognitive normal (CN) to mild cognitive impairment (MCI) to dementia, characterizing the disease pathology, identifying hidden states that correspond to the diagnosed stages of cognitive decline, and examining the potential risk factors of the neurodegenerative transition are of scientific interest and practical value. Given that the effects of biomarkers on the pathology from one state to another may vary across nonexistent, constant, linear, and nonlinear ones, allowing their forms to be unspecified and introducing penalties to penalize unimportant effects can reveal the patterns of the effects to the greatest extent. Previous studies13 pointed out that the relationships between some biomarkers and cognitive decline are variant across different states. Therefore, identifying the significant state-specific risk factors of cognitive decline and investigating the subtle forms of their effects are of great interest. However, existing relevant research either restricts the examination of the above relationships under a parametric framework or emphasizes only estimation. The proposed methodology enables us to perfectly accommodate all the aforementioned features and provide new insights into the prevention of AD.

The rest of this paper is organized as follows. Section 2 introduces the semiparametric HMM and discusses the associated identifiability issue. Section 3 illustrates the statistical inference of the proposed model. Specifically, BaGlasso for simultaneous variable selection and parameter estimation as well as the deviance information criterion (DIC) for the determination of the number of hidden states are presented. Section 4 investigates the empirical performance of the proposed method via simulation studies. Section 5 presents an application of the proposed method to the aforementioned ADNI study. Several important biomarkers are detected to have significant functional effects on patients’ cognitive decline across neurodegenerative states and/or on transition probabilities. The extension of the model is discussed in Section 6.

2 ∣. MODEL DESCRIPTION

2.1 ∣. Semiparametric HMMs

Let yit denote the observation for subject i = 1, … , n at occasion t = 1, … , T. The hidden-state sequence Zi = (Zi1, … , ZiT)′ is commonly assumed to follow a first-order Markov chain taking values in a finite set {1, … , S}. Given the hidden state Zit, the conditional semiparametric regression model is formulated as follows:

$$[y_{it} \mid Z_{it} = s] = \mu_s + \alpha_s' c_{it} + \sum_{j=1}^{q} f_{sj}(x_{itj}) + \delta_{it}, \qquad (1)$$

where $c_{it} = (c_{it1}, \dots, c_{itp})'$ and $x_{it} = (x_{it1}, \dots, x_{itq})'$ are a p × 1 vector of discrete covariates and a q × 1 vector of continuous covariates, respectively; the intercept $\mu_s$, the fixed effects $\alpha_s = (\alpha_{s1}, \dots, \alpha_{sp})'$, and the unknown smoothing functions $f_{sj}(\cdot)$ are all defined as state-specific to address the heterogeneity underlying the observations; $\delta_{it}$ is a random residual independent of $y_{it}$; and $[\delta_{it} \mid Z_{it} = s] \sim N[0, \psi_s]$.

In addition to the observable process, the hidden process, Zi, is formulated as follows: let pitus denote the transition probability from state Zi,t–1 = u at occasion t – 1 to state Zit = s at occasion t for individual i. Then, we have

$$p_{itus} = P(Z_{it} = s \mid Z_{i1}, Z_{i2}, \dots, Z_{i,t-1} = u) = P(Z_{it} = s \mid Z_{i,t-1} = u). \qquad (2)$$

Notably, model (2) is guaranteed by the assumed Markov property of the chain. A common setting for the initial distribution of $Z_{i1}$ is the multinomial distribution with probability $(\pi_1, \dots, \pi_S)'$, such that $\pi_s \ge 0$ and $\sum_{s=1}^{S} \pi_s = 1$. Thus, the hidden-state sequence $Z_i = (Z_{i1}, \dots, Z_{iT})'$ is fully specified by the initial and transition probabilities.

Considering that the hidden states usually have natural ranking information in empirical studies, we assume the hidden states {1, … , S} to be ordered and consider a continuation-ratio logit model14 as follows: for t = 2, … , T, s = 1, … , S – 1, and u = 1, … , S, we have

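$$\log\left(\frac{P(Z_{it} = s \mid Z_{i,t-1} = u)}{P(Z_{it} > s \mid Z_{i,t-1} = u)}\right) = \log\left(\frac{p_{itus}}{p_{itu,s+1} + \cdots + p_{ituS}}\right) = \zeta_{us} + \tilde{\alpha}' c_{it} + \sum_{j=1}^{q} g_j(x_{itj}), \qquad (3)$$

where the left-hand side is the log odds of transition to state s rather than to a state that is higher than s given $Z_{i,t-1} = u$, $\zeta_{us}$ is a transition-specific intercept, $c_{it} = (c_{it1}, \dots, c_{itp})'$ and $x_{it} = (x_{it1}, \dots, x_{itq})'$ are the covariate vectors defined in (1), $\tilde{\alpha} = (\tilde{\alpha}_1, \dots, \tilde{\alpha}_p)'$ is a p × 1 vector of fixed effects, and the $g_j(\cdot)$ are unknown smoothing functions. Let $\vartheta_{itus} = P(Z_{it} = s \mid Z_{it} \ge s, Z_{i,t-1} = u)$. Then, the continuation-ratio logits in (3) can be rewritten as

$$\begin{aligned}
\log\left(\frac{P(Z_{it} = s \mid Z_{i,t-1} = u)}{P(Z_{it} > s \mid Z_{i,t-1} = u)}\right)
&= \log\left(\frac{P(Z_{it} = s, Z_{i,t-1} = u)}{P(Z_{it} \ge s, Z_{i,t-1} = u) - P(Z_{it} = s, Z_{i,t-1} = u)}\right) \\
&= \log\left(\frac{\dfrac{P(Z_{it} = s, Z_{i,t-1} = u)}{P(Z_{it} \ge s, Z_{i,t-1} = u)}}{1 - \dfrac{P(Z_{it} = s, Z_{i,t-1} = u)}{P(Z_{it} \ge s, Z_{i,t-1} = u)}}\right) \\
&= \log\left(\frac{P(Z_{it} = s \mid Z_{it} \ge s, Z_{i,t-1} = u)}{1 - P(Z_{it} = s \mid Z_{it} \ge s, Z_{i,t-1} = u)}\right)
= \log\left(\frac{\vartheta_{itus}}{1 - \vartheta_{itus}}\right).
\end{aligned}$$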

Thus, the continuation-ratio logit (3) can be rewritten as a conventional logistic regression model as follows:

$$\text{logit}(\vartheta_{itus}) = \zeta_{us} + \tilde{\alpha}' c_{it} + \sum_{j=1}^{q} g_j(x_{itj}), \qquad (4)$$

where $\text{logit}(\vartheta_{itus})$ is the log odds of $Z_{it} = s$ given $Z_{it} \ge s$ and $Z_{i,t-1} = u$. In model (3) or (4), $\tilde{\alpha}$ and the $g_j(\cdot)$ are assumed to be independent of u and s. This proportional odds assumption is compulsory in modeling an ordinal variable because it ensures that P(Zit < 1) < P(Zit < 2) < ⋯ < P(Zit < S) for ordered states 1 < 2 < ⋯ < S.14,15 Moreover, the proportional odds assumption avoids a tedious inference in which every possible pair of origin and destination states elicits its own set of parameters; it thereby greatly reduces the complexity and enhances the interpretability of the transition model.

2.2 ∣. Nonparametric modeling

We use linear basis expansion to estimate the nonparametric functions $f_{sj}(\cdot)$ and $g_j(\cdot)$ in (1) and (3). Given that $g_j(\cdot)$ can be regarded as a special case (without a state-specific setting) of $f_{sj}(\cdot)$, we describe only the modeling of $f_{sj}(\cdot)$ in this section. Specifically, $f_{sj}(x_{itj})$ can be approximated as follows:

$$f_{sj}(x_{itj}) = \sum_{m=1}^{M_j} \beta_{sjm} h_m(x_{itj}) = \beta_{sj}' h_{itj}, \qquad (5)$$

where hm(·)s are basis functions, such as piecewise polynomials or natural cubic splines,16 hitj = (h1(xitj), …, hMj (xitj))′, and Mj is the number of basis functions that are used to estimate the jth unknown smoothing function. For notational simplicity, Mj is set to be invariant to states. An extension to relax this assumption is straightforward.

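An important issue regarding the model selection of (1) and (3) is whether a functional effect, eg, $f_{sj}(\cdot)$, truly exists or not. In this study, we utilize a norm ∥·∥ to quantify the magnitude of the nonparametric function $f_{sj}$. Let $x_{sj}$ and $H_{sj}$ denote the submatrices of $x_j = (x_{11j}, \dots, x_{nTj})'$ and $H_j$, respectively, with the rows corresponding to $Z_{it} \ne s$ deleted, where $H_j$ is formed by

$$H_j = \begin{pmatrix} h(x_{11j})' \\ \vdots \\ h(x_{nTj})' \end{pmatrix} = \begin{pmatrix} h_1(x_{11j}) & \cdots & h_{M_j}(x_{11j}) \\ \vdots & \ddots & \vdots \\ h_1(x_{nTj}) & \cdots & h_{M_j}(x_{nTj}) \end{pmatrix}_{nT \times M_j}. \qquad (6)$$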

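The norm of $f_{sj}$, $\|f_{sj}\|$, is defined as $\{E[f_{sj}^2(x_{sj})]\}^{1/2}$. Then, $f_{sj} = 0$ is equivalent to $\|f_{sj}\| = 0$. On the basis of (5), $\|f_{sj}\|$ can be approximated by $\|\beta_{sj}\|_{G_{sj}} = (\beta_{sj}' G_{sj} \beta_{sj})^{1/2}$ with positive definite matrix $G_{sj} = H_{sj}' H_{sj} / n_s$, where $n_s$ is the number of subjects staying in state s. Denote $\|\hat{f}_{sj}\|$ as the estimator of $\|f_{sj}\|$. In the model selection procedure, if $\|\hat{f}_{sj}\| = 0$, then $f_{sj} = 0$. The nonparametric function $g_j(x_{itj})$ can be similarly approximated by

$$g_j(x_{itj}) = \sum_{m=1}^{M_j} \tilde{\beta}_{jm} h_m(x_{itj}) = \tilde{\beta}_j' h_{itj}, \qquad (7)$$

where $\tilde{\beta}_{jm}$, $h_m(\cdot)$, $h_{itj}$, $M_j$, and $\tilde{\beta}_j$ are defined in the same manner as those in (5). Likewise, $\|g_j\|$ can be approximated by $\|\tilde{\beta}_j\|_{\tilde{G}_j} = (\tilde{\beta}_j' \tilde{G}_j \tilde{\beta}_j)^{1/2}$, where $\tilde{G}_j = H_j' H_j / (n \times (T - 1))$.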
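As a small numerical illustration of this norm approximation, the sketch below checks that (β′Gβ)^(1/2) with G = H′H/n coincides with the root-mean-square of the fitted function values over the sample. The cubic polynomial basis, the sample size, and the coefficient values are hypothetical choices made only for this example; they are not the basis or settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)        # hypothetical covariate values

# Hypothetical basis for illustration: h_1(x) = x, h_2(x) = x^2, h_3(x) = x^3.
H = np.column_stack([x, x**2, x**3])        # n x M basis matrix
G = H.T @ H / H.shape[0]                    # G = H'H / n

beta = np.array([0.5, 0.0, -0.2])           # illustrative coefficients
norm_G = np.sqrt(beta @ G @ beta)           # ||beta||_G = (beta' G beta)^{1/2}
rms_f = np.sqrt(np.mean((H @ beta) ** 2))   # empirical root-mean-square of f(x)

# A null functional effect (all coefficients zero) has norm exactly zero.
zero_norm = np.sqrt(np.zeros(3) @ G @ np.zeros(3))
print(norm_G, rms_f, zero_norm)
```

The two quantities agree because β′Gβ is, by construction, the sample average of the squared fitted values, which is why shrinking the group norm to zero removes the whole function rather than a single basis coefficient.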

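Let $y_i = (y_{i1}, \dots, y_{iT})'$, $Y = (y_1', \dots, y_n')'$, $d_{it} = (c_{it}', x_{it}')'$, $D_i = (d_{i1}', \dots, d_{iT}')'$, $D = (D_1', \dots, D_n')'$, $Z_i = (Z_{i1}, \dots, Z_{iT})'$, $Z = (Z_1', \dots, Z_n')'$, and θ be the vector that includes all the unknown parameters. With the linear basis expansion, the complete-data log-likelihood function is given by

$$\begin{aligned}
\log p(Y, D, Z \mid \theta) &= \sum_{i=1}^{n} \left[\log p(y_i \mid D_i, Z_i, \theta) + \log p(Z_i \mid D_i, \theta)\right] \\
&= \sum_{i=1}^{n}\sum_{t=1}^{T} \log p(y_{it} \mid d_{it}, Z_{it} = s, \theta) + \sum_{i=1}^{n}\sum_{t=2}^{T} \log p(Z_{it} = s \mid Z_{i,t-1} = u, d_{it}, \theta) + \sum_{i=1}^{n} \log p(Z_{i1} = s \mid \theta) \\
&= -\frac{1}{2}\sum_{i=1}^{n}\sum_{t=1}^{T} \left[\log(2\pi\psi_s) + \frac{(y_{it} - \eta_{it})^2}{\psi_s}\right] + \sum_{i=1}^{n}\sum_{t=2}^{T} \log(p_{itus}) + \sum_{i=1}^{n} \log(p_{i10s}),
\end{aligned} \qquad (8)$$

where

$$\begin{aligned}
\eta_{it} &= \mu_s + \alpha_s' c_{it} + \sum_{j=1}^{q} \beta_{sj}' h_{itj}, \\
p_{i10s} &= \pi_s, \quad s = 1, \dots, S, \\
p_{itu1} &= \frac{\exp\{a_{itu1}\}}{1 + \exp\{a_{itu1}\}}, \\
p_{ituS} &= \prod_{j=1}^{S-1} \frac{1}{1 + \exp\{a_{ituj}\}}, \\
p_{itus} &= \frac{\exp\{a_{itus}\}}{1 + \exp\{a_{itus}\}} \prod_{j=1}^{s-1} \frac{1}{1 + \exp\{a_{ituj}\}}, \quad s = 2, \dots, S - 1,
\end{aligned} \qquad (9)$$

with $a_{itus} = \zeta_{us} + \tilde{\alpha}' c_{it} + \sum_{j=1}^{q} \tilde{\beta}_j' h_{itj}$.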
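The mapping from continuation-ratio logits to transition probabilities can be sketched as follows. This is a minimal illustration for one subject, one occasion, and one origin state; the function name `transition_probs` and the numerical logits are hypothetical and not taken from the paper's code.

```python
import numpy as np

def transition_probs(a):
    """Map continuation-ratio logits a[0..S-2] to probabilities p[0..S-1].

    a[k] plays the role of a_{itus} for destination state s = k + 1, with the
    subject i, occasion t, and origin state u held fixed.
    """
    a = np.asarray(a, dtype=float)
    theta = 1.0 / (1.0 + np.exp(-a))          # P(Z = s | Z >= s, origin u)
    # survive[k] = P(Z >= k + 1 | origin u): product of (1 - theta) over lower states.
    survive = np.concatenate(([1.0], np.cumprod(1.0 - theta)))
    p = np.empty(a.size + 1)
    p[:-1] = theta * survive[:-1]             # destination states 1, ..., S-1
    p[-1] = survive[-1]                       # state S takes the remaining mass
    return p

probs = transition_probs([0.5, -1.0, 1.0])    # S = 4 destination states
print(probs, probs.sum())
```

Because each state receives a conditional share of the mass not yet assigned to lower states, the resulting probabilities are nonnegative and sum to one for any real-valued logits, so no further normalization is required.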

2.3 ∣. Related issues

The proposed model is not identifiable because of the following two model indeterminacies. First, the basis functions involved in basis expansion may contain constant parts. When such constant basis functions are applied in every $f_{sj}(\cdot)$ and/or $g_j(\cdot)$, each unknown function is identifiable only up to a constant. To address this issue, we impose the following constraints on the unknown functions to force their integrals over the ranges of the predictors to be zero17,18:

$$\int_{\chi_j} f_{sj}(x)\, dx = 0, \quad \text{for } s = 1, \dots, S,\ j = 1, \dots, q, \qquad (10)$$

where χj is the domain of xj. Second, the label switching problem, which is caused by the invariance of the likelihood function to a random permutation of the state labels, arises and leads to a multimodal posterior under a symmetric prior specification. We address this issue by imposing constraint μ1 < ⋯ < μS on posterior samples.

3 ∣. BAYESIAN ANALYSIS

3.1 ∣. Adaptive group lasso penalties

We explain the key idea of the adaptive group lasso penalties in the context of a simple linear regression model: y = μ1n + Xβ + δ, where y is the response vector, μ is an intercept, 1n is an n-dimensional vector of all elements being 1, X is a standardized design matrix, δ is the vector of residuals, δ ~ N(0, ψIn), and In is an n-dimensional identity matrix. Tibshirani19 first introduced the lasso procedure for simultaneous model selection and parameter estimation of the above linear regression. The lasso estimator of β can be expressed as

$$\arg\min_{\beta} \left\{ (y - \mu 1_n - X\beta)'(y - \mu 1_n - X\beta) + \gamma \sum_{h=1}^{p} |\beta_h| \right\}, \qquad (11)$$

where γ ≥ 0 can be regarded as an L1-penalty that automatically shrinks unimportant covariate effects to 0. Given that the covariates in X are standardized to the same scale, the magnitudes of the coefficients in β can represent the significance of predictors. If some elements of β are close to 0, then the corresponding covariates are unimportant and can be removed from the model.
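To make the shrinkage mechanism concrete, the following sketch solves the lasso criterion in the special case of an orthonormal design (X′X = I), where the minimizer has a closed form: each least-squares coefficient is soft-thresholded at γ/2. The orthonormal design, the omission of the intercept, and the numerical values are simplifying assumptions for illustration only; the function name `lasso_orthonormal` is hypothetical.

```python
import numpy as np

def lasso_orthonormal(X, y, gamma):
    """Minimize (y - X b)'(y - X b) + gamma * sum(|b_h|), assuming X'X = I."""
    z = X.T @ y                               # least-squares solution when X'X = I
    # Soft-thresholding: coefficients with |z_h| <= gamma / 2 are set exactly to 0.
    return np.sign(z) * np.maximum(np.abs(z) - gamma / 2.0, 0.0)

rng = np.random.default_rng(0)
X, _ = np.linalg.qr(rng.normal(size=(50, 3)))     # design with orthonormal columns
y = X @ np.array([3.0, 0.2, -1.0])                # noise-free response, no intercept
beta_hat = lasso_orthonormal(X, y, gamma=1.0)
print(beta_hat)
```

In this example the small second coefficient is removed entirely, while the two larger ones are each pulled toward zero by the same amount γ/2, which also illustrates the bias that a single common tuning parameter introduces.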

However, when simply applying lasso to the proposed semiparametric HMMs, at least two problems exist. First, lasso is originally designed for the selection of individual variables. Yuan and Lin20 showed that lasso tends to select more factors than necessary in the presence of group variables. Moreover, the pairwise correlations among group variables jeopardize the model selection accuracy of the lasso estimator.21 In this study, high correlations exist among the basis functions hm(xit)s in the conditional and transition models because they can be viewed as different transformations of xit. Consequently, the linear basis expansion involves group variables and should not be treated separately. Second, lasso applies the same tuning parameter γ to different regression coefficients, thereby introducing the same amount of shrinkage to different covariate effects. This inflexible setting may add considerable bias to the resulting estimates.22,23

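To address the aforementioned issues, Yuan and Lin20 proposed group lasso to perform model selection among group variables. Wang and Leng24 further developed adaptive group lasso to assign different tuning parameters to different groups of regression coefficients. Let $\alpha = (\alpha_1', \dots, \alpha_S')'$, $\beta_s = (\beta_{s1}', \dots, \beta_{sq}')'$, $\beta = (\beta_1', \dots, \beta_S')'$, $\tilde{\beta} = (\tilde{\beta}_1', \dots, \tilde{\beta}_q')'$, and $\theta = (\alpha', \tilde{\alpha}', \beta', \tilde{\beta}')'$. On the basis of the proposed model defined in (1)–(7), the adaptive group lasso estimator can be formulated as

$$\arg\min_{\theta} \left\{ \sum_{i=1}^{n}\sum_{t=1}^{T} (y_{it} - \eta_{it})'(y_{it} - \eta_{it}) - \sum_{i=1}^{n}\sum_{t=2}^{T} \log(p_{itus}) + P(\theta) \right\}, \qquad (12)$$

where $\eta_{it}$ is the mean of $y_{it}$, $p_{itus}$ is the transition probability defined in (2) and (9), and

$$P(\theta) = \sum_{s=1}^{S}\sum_{h=1}^{p} \gamma_{\alpha sh} |\alpha_{sh}| + \sum_{h=1}^{p} \tilde{\gamma}_{\alpha h} |\tilde{\alpha}_h| + \sum_{s=1}^{S}\sum_{j=1}^{q} \gamma_{\beta sj} \|\beta_{sj}\|_{G_{sj}} + \sum_{j=1}^{q} \tilde{\gamma}_{\beta j} \|\tilde{\beta}_j\|_{\tilde{G}_j}, \qquad (13)$$

in which $\alpha_{sh}$, $\tilde{\alpha}_h$, $\beta_{sj}$, and $\tilde{\beta}_j$ are coefficients of fixed effects and basis functions in the conditional and transition models; $\gamma_{\alpha sh}$, $\tilde{\gamma}_{\alpha h}$, $\gamma_{\beta sj}$, and $\tilde{\gamma}_{\beta j}$ are the corresponding tuning parameters; and the norms $\|\beta_{sj}\|_{G_{sj}}$ and $\|\tilde{\beta}_j\|_{\tilde{G}_j}$ are defined in Section 2.2. Notably, the coefficients of discrete covariates, namely, $\alpha_{sh}$ and $\tilde{\alpha}_h$, are simply assigned adaptive penalties, whereas the coefficients of unknown smooth functions $\beta_{sj}$ and $\tilde{\beta}_j$, which have groupwise features, are assigned adaptive group lasso penalties. The initial probabilities $p_{i10s}$ are excluded from (13) because they are independent of $\tilde{\alpha}_h$ and $\tilde{\beta}_j$.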

Yuan and Lin20 argued that the penalty function in (13) is intermediate between the L1-penalty used in lasso and the L2-penalty used in ridge regression. Therefore, the adaptive group lasso not only has the same advantages of lasso in model selection but also alleviates the problem caused by the existence of high pairwise correlation among basis functions. Furthermore, with the use of different tuning parameters $\gamma_{\beta sj}$ and $\tilde{\gamma}_{\beta j}$, the adaptive group lasso automatically imposes large penalties on groups of unimportant coefficients to efficiently shrink them to 0. Moreover, the penalty terms $\|\beta_{sj}\|_{G_{sj}}$ and $\|\tilde{\beta}_j\|_{\tilde{G}_j}$ can be regarded as the scaled version of the groupwise prediction penalty suggested by Buhlmann and Van De Geer.25 With the great power of adaptive group lasso, the estimation of all unknown parameters and the structure detection for important functional covariate effects on the observed response and on the hidden-state process can be simultaneously and efficiently obtained.
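The structure of the adaptive group lasso penalty can be sketched in a few lines. The example below evaluates the penalty for a single state and a single model part only, with one tuning parameter per scalar coefficient and one per coefficient group; the function names, identity G matrices, and numerical values are hypothetical and chosen so that the result can be checked by hand.

```python
import numpy as np

def group_norm(beta, G):
    """||beta||_G = (beta' G beta)^{1/2}."""
    return float(np.sqrt(beta @ G @ beta))

def adaptive_group_penalty(alpha, gamma_alpha, betas, gamma_beta, Gs):
    """Adaptive L1 terms for scalar effects plus adaptive group-norm terms."""
    scalar_part = float(np.sum(np.asarray(gamma_alpha) * np.abs(alpha)))
    group_part = sum(g * group_norm(b, G) for g, b, G in zip(gamma_beta, betas, Gs))
    return scalar_part + group_part

alpha = np.array([0.0, 0.5])                       # two discrete-covariate effects
gamma_alpha = [2.0, 1.0]                           # one tuning parameter per effect
betas = [np.zeros(3), np.array([1.0, 0.0, 0.0])]   # two groups of basis coefficients
gamma_beta = [5.0, 0.5]                            # one tuning parameter per group
Gs = [np.eye(3), np.eye(3)]                        # identity G for simplicity
penalty = adaptive_group_penalty(alpha, gamma_alpha, betas, gamma_beta, Gs)
print(penalty)
```

A group whose coefficients are all zero contributes nothing regardless of how large its tuning parameter is, which is what allows a heavily penalized, unimportant function to be dropped without affecting the remaining terms.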

3.2 ∣. BaGlasso and prior specification

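Under the Bayesian framework, the adaptive group lasso procedure can be implemented by introducing a multivariate conditional Laplace prior to the regression coefficients in $\theta = (\alpha', \tilde{\alpha}', \beta', \tilde{\beta}')'$ as follows:

$$p(\theta \mid \psi, \sigma^2) \propto \exp\left( -\sum_{h=1}^{p} \left( \frac{\gamma_{\alpha sh}}{\sqrt{\psi_s}} |\alpha_{sh}| + \frac{\tilde{\gamma}_{\alpha h}}{\sqrt{\sigma^2}} |\tilde{\alpha}_h| \right) - \sum_{j=1}^{q} \left( \frac{\gamma_{\beta sj}}{\sqrt{\psi_s}} \|\beta_{sj}\|_{G_{sj}} + \frac{\tilde{\gamma}_{\beta j}}{\sqrt{\sigma^2}} \|\tilde{\beta}_j\|_{\tilde{G}_j} \right) \right), \qquad (14)$$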

where ψ = (ψ1, … , ψS)′. This conditional Laplace prior can be represented as a scale mixture of normals with an exponential mixing density, leading to a hierarchical representation of the full model as follows: for i = 1, … , n, t = 1, … , T, s = 1, … ,S, h = 1, … , p, and j = 1, … , q, we have

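$$\begin{aligned}
& y_{it} \mid Z_{it} = s, \mu_s, c_{it}, \alpha_s, \beta_s, \psi_s \sim N(\eta_{it}, \psi_s), \\
& \alpha_s \mid \psi_s, \tau_{\alpha s1}^2, \dots, \tau_{\alpha sp}^2 \overset{\text{ind}}{\sim} N_p(0, \psi_s \Sigma_{\alpha s}), \quad \Sigma_{\alpha s} = \text{diag}(\tau_{\alpha s1}^2, \dots, \tau_{\alpha sp}^2), \\
& \tilde{\alpha} \mid \sigma^2, \tilde{\tau}_{\alpha 1}^2, \dots, \tilde{\tau}_{\alpha p}^2 \sim N_p(0, \sigma^2 \tilde{\Sigma}_{\alpha}), \quad \tilde{\Sigma}_{\alpha} = \text{diag}(\tilde{\tau}_{\alpha 1}^2, \dots, \tilde{\tau}_{\alpha p}^2), \\
& \beta_{sj} \mid \psi_s, \tau_{\beta sj}^2 \overset{\text{ind}}{\sim} N_{M_j}(0, \psi_s \tau_{\beta sj}^2 G_{sj}^{-1}), \quad \tilde{\beta}_j \mid \sigma^2, \tilde{\tau}_{\beta j}^2 \overset{\text{ind}}{\sim} N_{M_j}(0, \sigma^2 \tilde{\tau}_{\beta j}^2 \tilde{G}_j^{-1}), \\
& \tau_{\alpha sh}^2 \overset{\text{ind}}{\sim} \text{Gamma}\left(1, \frac{\gamma_{\alpha sh}^2}{2}\right), \quad \tilde{\tau}_{\alpha h}^2 \overset{\text{ind}}{\sim} \text{Gamma}\left(1, \frac{\tilde{\gamma}_{\alpha h}^2}{2}\right), \\
& \tau_{\beta sj}^2 \overset{\text{ind}}{\sim} \text{Gamma}\left(\frac{M_j + 1}{2}, \frac{\gamma_{\beta sj}^2}{2}\right), \quad \tilde{\tau}_{\beta j}^2 \overset{\text{ind}}{\sim} \text{Gamma}\left(\frac{M_j + 1}{2}, \frac{\tilde{\gamma}_{\beta j}^2}{2}\right), \\
& \psi_s^{-1} \overset{\text{ind}}{\sim} \text{Gamma}(\alpha_{\psi s0}, \beta_{\psi s0}), \quad \sigma^{-2} \sim \text{Gamma}(\alpha_{\sigma 0}, \beta_{\sigma 0}),
\end{aligned} \qquad (15)$$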

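where ind represents “independently distributed according to” and $\eta_{it}$ is defined in (9). For the tuning parameters $\gamma_{\alpha sh}$, $\tilde{\gamma}_{\alpha h}$, $\gamma_{\beta sj}$, and $\tilde{\gamma}_{\beta j}$, we assign gamma priors as follows:

$$\begin{aligned}
& \gamma_{\alpha sh}^2 \overset{\text{ind}}{\sim} \text{Gamma}(\alpha_{\alpha sh0}, \beta_{\alpha sh0}), \quad \tilde{\gamma}_{\alpha h}^2 \overset{\text{ind}}{\sim} \text{Gamma}(\tilde{\alpha}_{\alpha h0}, \tilde{\beta}_{\alpha h0}), \\
& \gamma_{\beta sj}^2 \overset{\text{ind}}{\sim} \text{Gamma}(\alpha_{\beta sj0}, \beta_{\beta sj0}), \quad \tilde{\gamma}_{\beta j}^2 \overset{\text{ind}}{\sim} \text{Gamma}(\tilde{\alpha}_{\beta j0}, \tilde{\beta}_{\beta j0}),
\end{aligned} \qquad (16)$$

where $\alpha_{\alpha sh0}$, $\tilde{\alpha}_{\alpha h0}$, $\alpha_{\beta sj0}$, $\tilde{\alpha}_{\beta j0}$, $\beta_{\alpha sh0}$, $\tilde{\beta}_{\alpha h0}$, $\beta_{\beta sj0}$, and $\tilde{\beta}_{\beta j0}$ are hyperparameters with prespecified values. We follow a common practice in the literature11,12 to set $\alpha_{\alpha sh0} = \tilde{\alpha}_{\alpha h0} = \alpha_{\beta sj0} = \tilde{\alpha}_{\beta j0} = 1$, $\beta_{\alpha sh0} = \tilde{\beta}_{\alpha h0} = 0.1$, and $\beta_{\beta sj0} = \tilde{\beta}_{\beta j0} = 0.01$ to obtain relatively dispersed gamma priors. The key idea of BaGlasso is to properly update the tuning parameters by using the data, thereby automatically imposing large penalties on unimportant coefficients. This target can be naturally achieved by introducing dispersed priors with small hyperparameters $\beta_{\alpha sh0}$, $\tilde{\beta}_{\alpha h0}$, $\beta_{\beta sj0}$, and $\tilde{\beta}_{\beta j0}$. We explain this regularization procedure further through the posterior distribution of the tuning parameters as follows:

$$\begin{aligned}
& [\tau_{\alpha sh}^{-2} \mid \cdot] \sim \text{In-Gaussian}\left( \sqrt{\frac{\gamma_{\alpha sh}^2 \psi_s}{\alpha_{sh}^2}},\ \gamma_{\alpha sh}^2 \right), \quad
[\tilde{\tau}_{\alpha h}^{-2} \mid \cdot] \sim \text{In-Gaussian}\left( \sqrt{\frac{\tilde{\gamma}_{\alpha h}^2 \sigma^2}{\tilde{\alpha}_h^2}},\ \tilde{\gamma}_{\alpha h}^2 \right), \\
& [\tau_{\beta sj}^{-2} \mid \cdot] \sim \text{In-Gaussian}\left( \frac{\sqrt{\gamma_{\beta sj}^2 \psi_s}}{\|\beta_{sj}\|_{G_{sj}}},\ \gamma_{\beta sj}^2 \right), \quad
[\tilde{\tau}_{\beta j}^{-2} \mid \cdot] \sim \text{In-Gaussian}\left( \frac{\sqrt{\tilde{\gamma}_{\beta j}^2 \sigma^2}}{\|\tilde{\beta}_j\|_{\tilde{G}_j}},\ \tilde{\gamma}_{\beta j}^2 \right), \\
& [\gamma_{\alpha sh}^2 \mid \cdot] \sim \text{Gamma}\left( \alpha_{\alpha sh0} + 1,\ \beta_{\alpha sh0} + \frac{\tau_{\alpha sh}^2}{2} \right), \quad
[\tilde{\gamma}_{\alpha h}^2 \mid \cdot] \sim \text{Gamma}\left( \tilde{\alpha}_{\alpha h0} + 1,\ \tilde{\beta}_{\alpha h0} + \frac{\tilde{\tau}_{\alpha h}^2}{2} \right), \\
& [\gamma_{\beta sj}^2 \mid \cdot] \sim \text{Gamma}\left( \alpha_{\beta sj0} + \frac{M_j + 1}{2},\ \beta_{\beta sj0} + \frac{\tau_{\beta sj}^2}{2} \right), \quad
[\tilde{\gamma}_{\beta j}^2 \mid \cdot] \sim \text{Gamma}\left( \tilde{\alpha}_{\beta j0} + \frac{M_j + 1}{2},\ \tilde{\beta}_{\beta j0} + \frac{\tilde{\tau}_{\beta j}^2}{2} \right),
\end{aligned} \qquad (17)$$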

where “In-Gaussian(·)” denotes the inverse Gaussian distribution. We omit the tedious subscripts and use generic terms τ and γ to simplify notations below. On the basis of (17), if the coefficients are significant, then $\tau^2$ tends to be large. As a result, the corresponding tuning parameter γ is dominated by $\tau^2$, leading γ to be mostly data driven. If the coefficients are insignificant, then $\tau^2$ tends to be small. Consequently, the corresponding tuning parameter γ is dominated by the dispersed prior information, leading to a large value of γ. Thus, the degree of dispersion of the gamma priors in (16) determines the amount of penalties imposed on unimportant predictors. This rationale explains why we assign more dispersed priors to $\gamma_{\beta sj}^2$ and $\tilde{\gamma}_{\beta j}^2$ than to $\gamma_{\alpha sh}^2$ and $\tilde{\gamma}_{\alpha h}^2$: the coefficients of the nonlinear parts of basis functions are more difficult to shrink to 0 than those of the linear parts.
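One sweep of the scale and tuning-parameter updates for a single coefficient group can be sketched as below. The sketch assumes that Gamma(a, b) is parameterized by shape a and rate b and that the inverse Gaussian is parameterized by its mean and shape, which is the parameterization of NumPy's `Generator.wald`; the function name `update_group_scale` and all numerical inputs are hypothetical, and the code is not the authors' implementation.

```python
import numpy as np

def update_group_scale(beta, G, psi, gamma2, a0, b0, rng):
    """Draw tau^2 and then gamma^2 for one group of basis coefficients."""
    M = beta.size
    norm_G = np.sqrt(beta @ G @ beta)         # ||beta||_G, assumed strictly positive
    # 1 / tau^2 | . ~ inverse Gaussian(mean = sqrt(gamma2 * psi) / ||beta||_G, shape = gamma2)
    inv_tau2 = rng.wald(np.sqrt(gamma2 * psi) / norm_G, gamma2)
    tau2 = 1.0 / inv_tau2
    # gamma^2 | . ~ Gamma(shape = a0 + (M + 1) / 2, rate = b0 + tau2 / 2);
    # NumPy's gamma sampler takes a scale argument, so the rate is inverted.
    gamma2_new = rng.gamma(a0 + (M + 1) / 2.0, 1.0 / (b0 + tau2 / 2.0))
    return tau2, gamma2_new

rng = np.random.default_rng(1)
tau2, gamma2 = update_group_scale(
    beta=np.array([0.8, -0.3, 0.1]), G=np.eye(3),
    psi=0.36, gamma2=1.0, a0=1.0, b0=0.01, rng=rng,
)
print(tau2, gamma2)
```

The rate of the gamma update grows with τ², so a group with a small τ² (weak effect) leaves the rate close to the small prior value and tends to produce a large γ², which is the adaptive behavior described above.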

To conduct a full Bayesian analysis, we specify appropriate prior distributions for other unknown parameters, such as μs, πs, and ζus. For u = 1, … ,S and s = 1, … , S, the following Gaussian priors are considered:

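$$\mu_s \overset{\text{ind}}{\sim} N(\mu_{s0}, \sigma_{\mu s0}^2), \quad \pi_s \overset{\text{ind}}{\sim} N(\pi_{s0}, \sigma_{\pi s0}^2), \quad \zeta_{us} \overset{\text{ind}}{\sim} N(\zeta_{us0}, \sigma_{\zeta us0}^2), \qquad (18)$$

where $\mu_{s0}$, $\sigma_{\mu s0}^2$, $\pi_{s0}$, $\sigma_{\pi s0}^2$, $\zeta_{us0}$, and $\sigma_{\zeta us0}^2$ are hyperparameters with preassigned values.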

3.3 ∣. Posterior inference

The Bayesian estimate of θ can be obtained through the mean or mode of the posterior samples drawn from $p(\theta \mid Y)$. However, directly sampling from $p(\theta \mid Y)$ is intractable because of the existence of latent states. To address this issue, we adopt the data augmentation technique to work on $p(\theta, Z \mid Y)$ and utilize the Gibbs sampler to simulate each of the unknowns from its full conditional distribution iteratively. Owing to the nonlinearity of the continuation-ratio logit transition model, the full conditional distributions related to the transition model have complex forms. Thus, Markov chain Monte Carlo (MCMC) methods, such as the forward filtering and backward sampling algorithm26 and the Metropolis-Hastings algorithm,27,28 are used to sample from them. The details of the full conditional distributions are provided in the Appendix.
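For readers unfamiliar with forward filtering and backward sampling, a generic version for one subject is sketched below. It takes the emission log-densities, initial probabilities, and occasion-specific transition matrices as given inputs; the function name `ffbs`, the array layout, and the toy inputs are hypothetical, and the sketch is a textbook form of the algorithm rather than the authors' code.

```python
import numpy as np

def ffbs(log_emis, init, trans, rng):
    """Draw one hidden-state path (states coded 0..S-1) for a single subject.

    log_emis: (T, S) array of log p(y_t | Z_t = s).
    init:     (S,) initial state probabilities.
    trans:    (T, S, S) array, trans[t, u, s] = P(Z_t = s | Z_{t-1} = u); trans[0] is unused.
    """
    T, S = log_emis.shape
    filt = np.empty((T, S))
    # Forward filtering: filt[t] is P(Z_t | y_1, ..., y_t).
    w = init * np.exp(log_emis[0] - log_emis[0].max())
    filt[0] = w / w.sum()
    for t in range(1, T):
        pred = filt[t - 1] @ trans[t]                     # one-step-ahead prediction
        w = pred * np.exp(log_emis[t] - log_emis[t].max())
        filt[t] = w / w.sum()
    # Backward sampling: draw Z_T, then Z_t given Z_{t+1} for t = T-1, ..., 1.
    z = np.empty(T, dtype=int)
    z[-1] = rng.choice(S, p=filt[-1])
    for t in range(T - 2, -1, -1):
        w = filt[t] * trans[t + 1][:, z[t + 1]]
        z[t] = rng.choice(S, p=w / w.sum())
    return z

# Toy check: emissions that identify the path almost surely.
rng = np.random.default_rng(2)
path = np.array([0, 0, 1, 1, 1])
log_emis = np.full((5, 2), -1e3)
log_emis[np.arange(5), path] = 0.0
trans = np.full((5, 2, 2), 0.5)
z = ffbs(log_emis, np.array([0.5, 0.5]), trans, rng)
print(z)
```

Subtracting the row maximum before exponentiating keeps the filtering weights numerically stable; in a covariate-dependent transition model, `trans[t]` would be rebuilt from the current parameter draw at every MCMC iteration.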

For nonparametric functions involved in (1) and (3), as suggested by Li et al,29 a functional effect of a covariate is detected as significant and included in the regression if at least one of its coefficients of the basis expansion has a two-sided 95% credible interval estimate that does not cover zero. The latent state Zit, which usually has actual meaning in empirical studies, is also of great interest for scientists. By using posterior samples, we can estimate the hidden state as follows:

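$$\hat{Z}_{it} = \arg\max_{s \in \{1, \dots, S\}} P(Z_{it} = s \mid y_i, \theta) \approx \arg\max_{s \in \{1, \dots, S\}} \frac{1}{L}\sum_{l=1}^{L} I(Z_{it}^{(l)} = s), \qquad (19)$$

where $Z_{it}^{(l)}$ denotes the latent allocation of $y_{it}$ at the lth iteration, and $\frac{1}{L}\sum_{l=1}^{L} I(Z_{it}^{(l)} = s)$ is the posterior mean of the latent allocations of $y_{it}$ drawn from the MCMC iterations.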
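The state estimate above amounts to taking, for each subject and occasion, the state visited most often across the retained MCMC draws. A minimal sketch follows; the array layout (draws × subjects × occasions), the zero-based state coding, and the function name `estimate_states` are assumptions made for illustration.

```python
import numpy as np

def estimate_states(z_draws, n_states):
    """z_draws: (L, n, T) integer array of sampled states coded 0, ..., S-1."""
    # freq[i, t, s] is the fraction of draws in which subject i is in state s at occasion t.
    freq = np.stack([(z_draws == s).mean(axis=0) for s in range(n_states)], axis=-1)
    return freq.argmax(axis=-1), freq

# Toy input: L = 4 draws, n = 1 subject, T = 2 occasions.
z_draws = np.array([[[0, 1]], [[0, 1]], [[1, 1]], [[0, 0]]])
z_hat, freq = estimate_states(z_draws, n_states=2)
print(z_hat, freq)
```

The returned frequencies can also be inspected directly, since values close to 1/S flag observations whose state allocation is uncertain.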

3.4 ∣. Determination of the number of hidden states

In the analysis of HMMs, the number of hidden states, S, is usually determined a priori. We use a modified DIC, which was developed by Celeux et al,30 for model comparison in the presence of incomplete data, to determine the number of hidden states of the proposed model. The modified DIC is defined as follows:

$$\text{DIC} = \overline{D(\theta)} + p_D, \qquad (20)$$

where $\overline{D(\theta)} = E_{\theta, Z}[-2 \log p(Y, Z \mid \theta) \mid Y]$ is the posterior mean deviance to reflect the goodness of fit of the model, $p_D$ is the effective number of parameters to penalize an overcomplex model, and $p_D = E_{\theta, Z}[-2 \log p(Y, Z \mid \theta) \mid Y] + 2 E_Z[\log p(Y, Z \mid E_{\theta}[\theta \mid Y, Z]) \mid Y]$. The expectations involved in (20) can be approximated by averaging the posterior samples collected through the MCMC algorithm.30,31 The model with the smallest value of DIC is selected.
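Given MCMC output, the Monte Carlo approximation of the modified DIC reduces to two averages. The sketch below assumes that the complete-data log-likelihood has already been evaluated at each draw and at the conditional posterior mean of θ for each sampled Z; obtaining the latter is the model-specific part and is not shown. The function name `modified_dic` and the input values are hypothetical.

```python
import numpy as np

def modified_dic(loglik_draws, loglik_plugin):
    """Monte Carlo approximation of DIC = mean deviance + p_D.

    loglik_draws[l]  = log p(Y, Z^(l) | theta^(l)) at the l-th posterior draw.
    loglik_plugin[l] = log p(Y, Z^(l) | E[theta | Y, Z^(l)]).
    """
    mean_dev = -2.0 * float(np.mean(loglik_draws))          # posterior mean deviance
    p_d = mean_dev + 2.0 * float(np.mean(loglik_plugin))    # effective number of parameters
    return mean_dev + p_d, mean_dev, p_d

dic, mean_dev, p_d = modified_dic([-100.0, -102.0], [-99.0, -100.0])
print(dic, mean_dev, p_d)
```

Comparing the value returned for each candidate number of states and retaining the smallest then mirrors the selection rule stated above.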

4 ∣. SIMULATION STUDY

This section contains two simulations: Simulation 1 assesses the empirical performance of the proposed BaGlasso for simultaneous estimation and variable selection in the context of semiparametric HMMs, and Simulation 2 examines the performance of the DIC in determining the number of hidden states in semiparametric HMMs.

4.1 ∣. Simulation 1

We consider 100 simulated data sets, each consisting of n = 700 subjects and T = 9 time points. For each data set, observations are generated from a two-state semiparametric HMM with a continuous response yit, two discrete covariates cit = (cit1, cit2)′ (p = 2), and three continuous covariates xit = (xit1, xit2, xit3)′ (q = 3). For i = 1, … , 700 and t = 1, … , 9, cit1 and cit2 are independently generated from the Bernoulli distribution with a probability of success of 0.5, and xit1, xit2, and xit3 are generated from U(−1, 1), N(0, 1), and N(t, 1), respectively, and they are standardized to the same scale beforehand. Here, xit1 and xit2 are set as time-invariant covariates, whereas xit3 is set as a time-variant one. The conditional regression model is defined as follows:

$$[y_{it} \mid Z_{it} = s] = \mu_s + \alpha_{s1} c_{it1} + \alpha_{s2} c_{it2} + f_{s1}(x_{it1}) + f_{s2}(x_{it2}) + f_{s3}(x_{it3}) + \delta_{it}, \qquad (21)$$

where f11(xit1) = 0, f12(xit2) = sin(1.5xit2) + xit2 – 0.6, f13(xit3) = −0.8xit3, f21(xit1) = 2.08 – exp(xit1), f22(xit2) = 0, and f23(xit3) = −0.105 + cos(2xit3) + 0.5xit3.

The transition model is defined as

$$\text{logit}(\vartheta_{itus}) = \zeta_{us} + \tilde{\alpha}_1 c_{it1} + \tilde{\alpha}_2 c_{it2} + g_1(x_{it1}) + g_2(x_{it2}) + g_3(x_{it3}), \qquad (22)$$

where g1(xit1) = −log(2 + xit1)/(2 – xit1), g2(xit2) = 1.5xit2, and g3(xit3) = 0. The true population values of the unknown parameters are set as μ = (μ1, μ2)′ = (−1, 1)′, π = (π1, π2)′ = (0.5, 0.5)′, ζ11 = ζ21 = 0.5, α1 = (α11, α12)′ = (0, 0.5)′, α2 = (α21, α22)′ = (−0.5, 0)′, $\tilde{\alpha} = (\tilde{\alpha}_1, \tilde{\alpha}_2)' = (-1, 0)'$, and ψ = (ψ1, ψ2)′ = (0.36, 0.16)′.

In this study, we use a simple version of natural cubic splines derived from a truncated power series basis function16 to approximate the nonparametric functions: $h_{j1}(x_{itj}) = 1$, $h_{j2}(x_{itj}) = x_{itj}$, and $h_{j,m+2}(x_{itj}) = u_{jm}(x_{itj}) - u_{j,M_j-1}(x_{itj})$ for $m = 1, \dots, M_j - 2$, where $u_{jm}(x_{itj}) = [(x_{itj} - \kappa_{jM_j})_+^3 - (x_{itj} - \kappa_{jm})_+^3] / (\kappa_{jM_j} - \kappa_{jm})$, and $\kappa_{jm}$, $m = 1, \dots, M_j$, are the knots taken in the range of $x_{itj}$. The prior inputs in (15), (16), and (18) are assigned as follows: $\mu_{s0} = \zeta_{us0} = \pi_{s0} = 0$, $\sigma_{\mu s0}^2 = \sigma_{\zeta us0}^2 = \sigma_{\pi s0}^2 = 1$, $\alpha_{\psi s0} = \alpha_{\sigma 0} = 9$, $\beta_{\psi s0} = \beta_{\sigma 0} = 4$, $\alpha_{\alpha sh0} = \tilde{\alpha}_{\alpha h0} = \alpha_{\beta sj0} = \tilde{\alpha}_{\beta j0} = 1$, $\beta_{\alpha sh0} = \tilde{\beta}_{\alpha h0} = 0.1$, and $\beta_{\beta sj0} = \tilde{\beta}_{\beta j0} = 0.01$. For each $x_{itj}$, $M_j = 10$ knots are used. We impose the constraint μ1 < μ2 in each MCMC iteration to avoid label switching and check the convergence of the algorithm using the estimated potential scale reduction (EPSR) proposed by Gelman et al.32 The MCMC algorithm converges within 5000 iterations. Thus, we collect posterior samples with a size of 20 000 with the first 10 000 as burn-in iterations. The performance of Bayesian estimates is assessed through the bias (BIAS) and the root-mean-square error (RMSE) between the Bayesian estimates and the true population values of the parameters.
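A truncated-power natural cubic spline basis of this kind can be constructed as sketched below. The sign convention used for the building blocks u_m is an assumption (reversing it only flips the sign of the corresponding columns and leaves the spanned function space unchanged), and the function name `natural_cubic_basis` and the equally spaced knots are hypothetical choices for illustration.

```python
import numpy as np

def natural_cubic_basis(x, knots):
    """Return the n x M matrix with columns 1, x, and u_m(x) - u_{M-1}(x), m = 1..M-2."""
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    M = knots.size

    def u(m):
        # m is a 0-based knot index; (.)_+^3 denotes the truncated cubic.
        num = np.clip(x - knots[-1], 0.0, None) ** 3 - np.clip(x - knots[m], 0.0, None) ** 3
        return num / (knots[-1] - knots[m])

    cols = [np.ones_like(x), x]
    cols += [u(m) - u(M - 2) for m in range(M - 2)]   # u(M - 2) is the (M-1)-th block
    return np.column_stack(cols)

knots = np.linspace(-0.9, 0.9, 10)                    # M = 10 illustrative knots
H = natural_cubic_basis(np.linspace(-1.0, 1.0, 50), knots)

# Beyond the last knot every column should be linear in x,
# so second differences on an equally spaced grid vanish.
tail = natural_cubic_basis(np.array([2.0, 3.0, 4.0]), knots)
second_diff = tail[0] - 2.0 * tail[1] + tail[2]
print(H.shape, np.abs(second_diff).max())
```

With M knots the basis has M columns, the first of which is the constant function; that constant column is the source of the identifiability issue that the zero-integral constraint on each unknown function is meant to resolve.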

Table 1 summarizes the estimation results on the basis of the 100 data sets. The BIAS and RMSE for most of the parameters are close to zero, indicating a satisfactory performance of Bayesian estimation regarding the parametric part. Figure 1 depicts the averages of the pointwise posterior means of the nonparametric functions, along with their 2.5% and 97.5% pointwise quantiles. Three nonexistent functions are successfully shrunk to almost zero by the proposed BaGlasso procedure. The posterior means of other nonzero nonparametric functions are close to their true curves, and all the ranges of the 2.5% and 97.5% pointwise quantiles are small, indicating that the estimated nonparametric curves can correctly recover the complex functional relationships between the response and covariates. Moreover, the average of the correct classification rates calculated from (19) is approximately 95%, implying the good performance of the proposed method in identifying the hidden states of the observations.

TABLE 1.

Bayesian estimates of the parameters in the simulation study

Parameters in the Conditional Regression Model

| State 1: Par | True | Est | RMSE | State 2: Par | True | Est | RMSE |
|---|---|---|---|---|---|---|---|
| μ1 | −1.0 | −0.969 | 0.041 | μ2 | 1.0 | 1.006 | 0.033 |
| α11 | 0.0 | −0.000 | 0.025 | α21 | −0.5 | −0.499 | 0.015 |
| α12 | 0.5 | 0.501 | 0.023 | α22 | 0.0 | 0.001 | 0.015 |
| ψ1 | 0.36 | 0.392 | 0.034 | ψ2 | 0.16 | 0.191 | 0.032 |

Parameters in the Probability Transition Model

| Par | True | Est | RMSE | Par | True | Est | RMSE |
|---|---|---|---|---|---|---|---|
| α̃1 | −1.0 | −0.985 | 0.080 | α̃2 | 0.0 | −0.000 | 0.055 |
| π1 | 0.5 | 0.528 | 0.036 | π2 | 0.5 | 0.472 | 0.036 |
| ζ11 | 0.5 | 0.501 | 0.152 | ζ21 | 0.5 | 0.504 | 0.152 |

Abbreviations: Par, parameter; True, true value; Est, estimate; RMSE, root-mean-square error.

FIGURE 1.

Estimates of the unknown smooth functions in the simulation study. The solid curves represent the true curves, and the dashed curves represent the estimated posterior means and the 2.5% and 97.5% pointwise quantiles on the basis of 100 replications.

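To reveal the sensitivity of Bayesian estimates to the input of prior distributions, we disturb the prior input as follows: $\mu_{s0} = \zeta_{us0} = \pi_{s0} = 2$, $\sigma_{\mu s0}^2 = \sigma_{\zeta us0}^2 = \sigma_{\pi s0}^2 = 2$, $\alpha_{\psi s0} = 3$, $\beta_{\psi s0} = 2$, $\alpha_{\alpha sh0} = \tilde{\alpha}_{\alpha h0} = \alpha_{\beta sj0} = \tilde{\alpha}_{\beta j0} = 1$, $\beta_{\alpha sh0} = \tilde{\beta}_{\alpha h0} = 0.5$, and $\beta_{\beta sj0} = \tilde{\beta}_{\beta j0} = 0.01$. The Bayesian results obtained under the disturbed prior are similar and not reported.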

Notably, this simulation study contains five covariates in the conditional and transition models, which result in a large number ($2^{2 \times 5} = 1024$) of competing models with various combinations of covariates in both models. Traditional Bayesian model selection statistics, such as the Bayes factor and the DIC, are extremely time consuming in performing variable selection because they compare these competing models on a pairwise basis. By contrast, the proposed BaGlasso procedure automatically selects important predictors and avoids the tedious pairwise comparison, thereby greatly reducing the computational time. In this simulation study, the computing time for simultaneous variable selection and parameter estimation in each replication is 48 minutes on a PC with an Intel Core i7-6700 3.40-GHz CPU and 16 GB of RAM.

4.2 ∣. Simulation 2

To examine the performance of the DIC in determining the number of hidden states of a semiparametric HMM, we consider five competing models M1, M2, M3, M4, and M5, where Ms is a model defined by (1)–(3) with S = s, s = 1, … , 5. Here, M4 is the true model, whereas M1, M2, M3, and M5 are models with incorrect numbers of hidden states. To mimic the scenario of the ADNI data set in the subsequent real example, we generate 100 data sets from (1)–(3) with S = 4, n = 633, T = 4, p = 4, and q = 3. For i = 1, … , 633 and t = 1, … , 4, cit1 to cit4 are independently generated from the Bernoulli distribution with a probability of success of 0.5, and xit1, xit2, and xit3 are generated from U(−1, 1), N(0, 1), and N(t, 1), respectively, and they are standardized prior to analysis. The true functions are set as f11(xit1) = 0, f12(xit2) = sin(1.5xit2) + xit2 – 0.6, f13(xit3) = –0.8xit3, f21(xit1) = 2.08 – exp(xit1), f22(xit2) = 0, f23(xit3) = −0.105 + cos(2xit3) + 0.5xit3, f31(xit1) = 0.5xit1, f32(xit2) = 0, f33(xit3) = −xit3, f41(xit1) = 2xit1, f42(xit2) = 1.5xit2, f43(xit3) = 0, g1(xit1) = −log(2 + xit1)/(2 − xit1), g2(xit2) = 1.5xit2, and g3(xit3) = 0. The true population values of the unknown parameters are set as μ = (μ1, μ2, μ3, μ4)′ = (−4, −2, 2, 4)′, π = (π1, π2, π3, π4)′ = (0.25, 0.25, 0.25, 0.25)′, ζ11 = ζ21 = ζ31 = ζ41 = −1, ζ12 = ζ22 = ζ32 = ζ42 = 0, ζ13 = ζ23 = ζ33 = ζ43 = 1, α1 = (α11, α12, α13, α14)′ = (1, 0, 0.5, 1)′, α2 = (α21, α22, α23, α24)′ = (0.5, −0.5, 0, −1)′, α3 = (α31, α32, α33, α34)′ = (0.5, −1, 1, 0)′, α4 = (α41, α42, α43, α44)′ = (0.5, 1, −0.5, 0)′, α̃ = (α̃1, α̃2, α̃3, α̃4)′ = (1, 0.5, 0, 1)′, and ψ = (ψ1, ψ2, ψ3, ψ4)′ = (0.16, 0.16, 0.16, 0.16)′. The prior distributions and other settings are specified in the same manner as in Simulation 1. On the basis of the 100 simulated data sets, the means and standard deviations of the DIC values for M1 to M5 are reported in Table 2, which shows that the true model M4 is selected in each of the 100 replications.

TABLE 2.

Summary of deviance information criterion (DIC) values in the simulation study

Competing Model   DIC (mean)   DIC (std)   No. of Selections
M1                12 018       79          0
M2                10 912       92          0
M3                10 124       461         0
M4                8988         128         100
M5                10 052       158         0

Note: No. of selections represents the number of times that the DIC value of Ms (s = 1, … , 5) is the smallest among all competing models in 100 replications.

The computer code for conducting the preceding analyses is written in R and is freely available at http://www.sta.cuhk.edu.hk/xysong/codes/BaGLassoHMMs.

5 ∣. ADNI STUDY

To demonstrate the empirical utility of our proposed method, we conduct a real data analysis on the basis of the ADNI study. The ADNI study collected imaging, genetic, clinical, and cognitive data from CN control participants and from participants with mild cognitive impairment or AD. ADNI-1 was first conducted in 2004, and several extensions, namely, ADNI-GO, ADNI-2, and ADNI-3, followed afterward. In this study, we focused on 633 participants from ADNI-1 and included their clinical and genetic variables at four time points, namely, baseline, 6 months, 12 months, and 24 months. The Functional Assessment Questionnaire (FAQ), a widely used assessment of the ability to function independently in daily life, was used as the response variable (yit) to reflect cognitive decline over time. Patients with higher FAQ scores have lower cognitive abilities. Three continuous covariates, namely, the logarithm of the ratio of hippocampal volume over whole brain volume (xit1), age at baseline (xit2), and years of education (xit3), were considered. Moreover, we included a genetic variable, APOE-ϵ4 (cit1 and cit2), which was coded as 0, 1, and 2, denoting the number of APOE-ϵ4 alleles. Other discrete demographic characteristics, namely, gender (cit3, 0 = female; 1 = male) and marital status (cit4, 0 = has been married; 1 = has not been married), were also included. The three continuous variables, namely, FAQ score, hippocampus, and age, were standardized prior to analysis. The main objective of this study is to examine the complex effects of potential risk factors on the transition of neurodegenerative states and on the cognitive decline of participants across different states.

We first determined the number of hidden states. We considered five competing models Mk, k = 1, … , 5, where Mk represents a semiparametric HMM defined in (1)–(3) with k states. We used natural cubic splines for hitj and Mj = 10 in approximating the unknown smoothing functions. The hyperparameters were assigned in the same manner as those in the simulation study, and the identifiability constraint μ1 < ⋯ < μ5 was taken to avoid label switching. We generated several MCMC chains with different initial values to monitor the convergence of the MCMC algorithm. The EPSR plot depicted in Figure 2 indicated that the MCMC algorithm converged within 10 000 iterations. Therefore, we collected 10 000 observations after discarding 10 000 burn-in iterations to calculate the DIC values of the competing models.
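The convergence check above relies on the EPSR, which is commonly computed from several parallel chains by comparing between-chain and within-chain variability. The sketch below implements the basic Gelman-Rubin form for one scalar parameter in Python; the function name `epsr` and the synthetic chains are illustrative assumptions, and the exact variant used in the analysis is not stated in this section.

```python
import math
import random
import statistics

def epsr(chains):
    """Estimated potential scale reduction for one scalar parameter.

    chains: list of m >= 2 equal-length lists of MCMC draws.
    """
    n = len(chains[0])
    chain_means = [statistics.mean(ch) for ch in chains]
    # W: average within-chain variance
    W = statistics.mean([statistics.variance(ch) for ch in chains])
    # B / n: variance of the chain means
    B_over_n = statistics.variance(chain_means)
    var_plus = (n - 1) / n * W + B_over_n
    return math.sqrt(var_plus / W)

# Synthetic chains for illustration only (not ADNI output)
rng = random.Random(0)
mixed = [[rng.gauss(0, 1) for _ in range(2000)] for _ in range(3)]
stuck = [[rng.gauss(mu, 1) for _ in range(2000)] for mu in (0, 5, 10)]
r_mixed = epsr(mixed)  # chains share one distribution, so the value is near 1
r_stuck = epsr(stuck)  # chains disagree, so the value is far above 1.2
```

A value below the 1.2 threshold marked in Figure 2 for every parameter is the usual informal criterion for declaring convergence.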

FIGURE 2.

Plot of estimated potential scale reduction (EPSR) values for the parameters in the ADNI-1 (Alzheimer’s Disease Neuroimaging Initiative) data analysis. The horizontal dotted line is for EPSR = 1.2. MCMC, Markov chain Monte Carlo

The values of D(θ), pD, and DIC corresponding to M1 to M4 are reported in Table 3. When fitting the data to M5, the MCMC algorithm broke down after several iterations. After carefully checking the results, we found that one of the states contained fewer than six subjects after several iterations. This phenomenon implies the nonexistence of such a state and the inapplicability of the five-state model in this study. On the basis of the results in Table 3, the four-state model M4, which had the smallest DIC, was selected. Then, we used the proposed BaGlasso procedure to conduct simultaneous estimation and variable selection under M4. Results are presented in Table 4 (parametric part) and Figure 3 (nonparametric part), in which only significant functional effects are reported.

TABLE 3.

Summary of deviance information criterion (DIC) values in the analysis of the Alzheimer’s Disease Neuroimaging Initiative data set

Competing Model   D(θ)   pD    DIC
M1                6294   35    6329
M2                1434   69    1503
M3                1016   97    1113
M4                972    126   1098
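In Table 3, each reported DIC equals the sum of the tabulated D(θ) and pD. The short Python snippet below checks this arithmetic and picks the model with the smallest DIC; the dictionary simply transcribes the table, and the variable names are illustrative.

```python
# Values transcribed from Table 3: model -> (D(theta), pD, reported DIC)
table3 = {
    "M1": (6294, 35, 6329),
    "M2": (1434, 69, 1503),
    "M3": (1016, 97, 1113),
    "M4": (972, 126, 1098),
}

# Each reported DIC should equal the deviance term plus the complexity term pD
for model, (deviance, p_d, dic) in table3.items():
    assert deviance + p_d == dic, model

# The selected model is the one with the smallest DIC
best = min(table3, key=lambda m: table3[m][2])
print(best)  # M4
```

The gap between M3 and M4 (1113 versus 1098) is much smaller than the gaps among M1, M2, and M3.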

TABLE 4.

Estimation results in the ADNI-1 (Alzheimer’s Disease Neuroimaging Initiative) data analysis: parametric part

Parameters in the Conditional Regression Model
State 1                State 2                State 3                State 4
Par   Est     SE       Par   Est     SE       Par   Est     SE       Par   Est     SE
μ1    −0.608  0.005    μ2    −0.200  0.032    μ3    0.948   0.075    μ4    2.466   0.127
α11   0.000   0.005    α21   0.059   0.040    α31   0.113   0.082    α41   0.256   0.151
α12   0.015   0.013    α22   0.012   0.040    α32   0.068   0.086    α42   0.120   0.143
α13   0.003   0.005    α23   0.019   0.031    α33   −0.303  0.107    α43   −0.427  0.157
α14   0.003   0.005    α24   0.008   0.030    α34   −0.047  0.073    α44   −0.115  0.143
ψ1    0.009   0.000    ψ2    0.073   0.008    ψ3    0.173   0.020    ψ4    0.437   0.052

Parameters in the Transition Model
Par   Est     SE       Par   Est     SE       Par   Est     SE       Par   Est     SE
α~1   −0.386  0.174    α~2   −0.821  0.253    α~3   0.012   0.078    α~4   −0.150  0.132
π1    0.592   0.022    π2    0.198   0.022    π3    0.149   0.018    π4    0.060   0.014
ζ11   2.513   0.165    ζ21   −1.459  0.246    ζ31   −3.278  0.451    ζ41   −3.343  0.500
ζ12   2.395   0.418    ζ22   1.498   0.253    ζ32   −1.674  0.331    ζ42   −3.320  0.498
ζ13   1.405   0.740    ζ23   2.840   0.447    ζ33   1.657   0.279    ζ43   −2.017  0.426

FIGURE 3.

ADNI-1 (Alzheimer’s Disease Neuroimaging Initiative) data analysis results: the estimates of significant unknown smooth functions at the corresponding states. The solid curves represent the pointwise mean curves, and the dashed curves represent the 2.5% and 97.5% pointwise quantiles. Line y = 0 is denoted in each picture by a red dot-dash line to illustrate the range of significant effects for each risk factor. FAQ, Functional Assessment Questionnaire [Colour figure can be viewed at wileyonlinelibrary.com]

We make the following observations. First, the intercepts μ1, μ2, μ3, and μ4 were ranked in ascending order. Patients in state 1 had the lowest mean FAQ score, whereas those in state 4 had the highest. That is, patients’ cognitive ability, as reflected by independent functioning in daily life, steadily deteriorated from state 1 to state 4. According to the existing literature,33 state 1 to state 4 can be interpreted as CN, early mild cognitive impairment (EMCI), late mild cognitive impairment (LMCI), and AD, respectively.

Second, BaGlasso selected six significant functional effects across the states. The effect of hippocampus on the FAQ score exhibits a descending trend in all the states. Specifically, in the CN state, participants with a greater hippocampal volume tend to have slightly better memory. This result is consistent with the common understanding that the hippocampus helps consolidate outside information from short-term memory to long-term memory. In the EMCI and LMCI states, the magnitude of the functional effect of the hippocampus on FAQ becomes increasingly large, confirming that atrophy in hippocampal volume continuously impairs patients’ cognitive ability during the progression from EMCI to LMCI. Published medical reports34-36 also revealed a similar result, namely, that the loss of hippocampal volume greatly affects dementia. In the AD state, preventing the loss of hippocampal volume is still beneficial in postponing cognitive decline, but this effect is significant only in a small range of hippocampal volume. The effect of age on FAQ is nonsignificant in the first three states, implying that age influences cognitive function mainly in the AD state. Relatively younger AD patients (around 75 years old) have better functional independence in daily life compared with older ones. This age effect was also revealed by previous research.37,38 The effect of educational level on FAQ is likewise significant only in the AD state. This effect becomes large when educational level is high, indicating that patients with higher educational levels tend to experience more pronounced cognitive decline compared with patients with lower educational levels. This finding is in line with the existing literature.39,40

Third, for the parametric part, gender has a negative effect on FAQ in the LMCI and AD states. Because gender is coded as 1 for male and 0 for female, this result implies that women suffer more serious cognitive decline than men in the late progression period of AD. This result agrees with existing medical reports.41-43

Fourth, in the transition model, the functional effect of the hippocampus exhibits an ascending trend with the growth of hippocampal volume. In the progression of AD, patients with larger hippocampal volumes are more likely to remain in the current state rather than transition to a worse one compared with those with smaller hippocampal volumes. By contrast, patients with APOE-ϵ4 alleles are more likely to transition to a worse state rather than remain in the current one. Thus, APOE-ϵ4 alleles are important risk factors for the development of AD. This result is consistent with an existing finding.44 However, the estimates of other covariates, such as age, educational level, gender, and marital status, were shrunk to nearly zero by BaGlasso, implying that conditional on hippocampus and APOE-ϵ4, the direct effects of age, educational level, gender, and marital status on the transition probability are weak.

For comparison, we reanalyzed the ADNI data set using a parametric HMM as follows:

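$$
\begin{aligned}
[y_{it} \mid Z_{it} = s] &= \mu_s + \alpha_{s1} c_{it1} + \alpha_{s2} c_{it2} + \alpha_{s3} c_{it3} + \alpha_{s4} c_{it4} + \beta_{s1} x_{it1} + \beta_{s2} x_{it2} + \beta_{s3} x_{it3} + \delta_{it}, \\
\mathrm{logit}(\vartheta_{itus}) &= \zeta_{us} + \tilde{\alpha}_1 c_{it1} + \tilde{\alpha}_2 c_{it2} + \tilde{\alpha}_3 c_{it3} + \tilde{\alpha}_4 c_{it4} + \tilde{\beta}_1 x_{it1} + \tilde{\beta}_2 x_{it2} + \tilde{\beta}_3 x_{it3}.
\end{aligned}
$$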

The Bayesian adaptive lasso procedure was used to perform estimation. Table 5 presents the estimates of the parameters βsj and β~j. The results of μs, ζus, αsh, and α~h are similar to those in Table 4 and are not reported. Several differences exist between the results obtained using the parametric and semiparametric HMMs. First, the parametric model shows a negative constant effect of the hippocampus on FAQ in the CN, EMCI, and LMCI states, whereas the semiparametric model reveals that these negative effects have a descending trend. Second, the parametric model indicates that the effects of the hippocampus, age, and educational level on FAQ are all nonsignificant in the AD state, whereas the semiparametric model reveals that these effects are actually significant in certain covariate ranges. Finally, the parametric model shows that the effect of age on FAQ is negative in the CN and EMCI states but positive in the LMCI state. This sign change is hard to interpret and is probably caused by overlooking the subtle structure of the age effect in the parametric model.

TABLE 5.

Estimation results of the parametric hidden Markov model in the ADNI-1 (Alzheimer’s Disease Neuroimaging Initiative) data analysis

Parameters in the Conditional Regression Model
State 1                State 2                State 3                State 4
Par   Est     SE       Par   Est     SE       Par   Est     SE       Par   Est     SE
β11   −0.022  0.004    β21   −0.122  0.023    β31   −0.155  0.039    β41   −0.127  0.065
β12   −0.006  0.003    β22   −0.008  0.017    β32   0.070   0.034    β42   0.088   0.055
β13   −0.004  0.003    β23   −0.014  0.018    β33   0.030   0.029    β43   0.025   0.051

Parameters in the Transition Model
Par   Est     SE       Par   Est     SE       Par   Est     SE       Par   Est     SE
β~1   0.351   0.042    β~2   −0.033  0.034    β~3   0.004   0.023

6 ∣. CONCLUSION

In this paper, we have introduced a BaGlasso procedure to conduct simultaneous variable selection and parameter estimation in the context of semiparametric HMMs. We developed a full Bayesian approach, along with efficient MCMC methods and the basis expansion technique, to implement the procedure and estimate nonparametric functions. The methodology was demonstrated by a simulation study and an application to the analysis of the ADNI data set. In the proposed model, covariates are allowed to affect both responses and transition probabilities. This feature enables the model to cope with general situations where certain covariates simultaneously influence the two stochastic processes in various ways. An alternative method of including covariates in HMMs is to use an exclusion restriction to split the overall set of covariates into two groups: one contains covariates affecting only the responses, and the other contains covariates affecting the hidden-state transition. However, determining such an exclusion restriction may be subjective and difficult to justify in practice, which, in turn, elicits model selection issues.

This study can be extended in several directions. First, in approximating nonparametric functions, we considered only a simple version of natural cubic splines. More sophisticated smoothing techniques, such as splines and local polynomial kernel methods, may be used to enhance the performance of estimation and variable selection. Second, we used only a single indicator, FAQ, to reflect cognitive ability in the ADNI data analysis. A more comprehensive way to characterize cognitive function is to account for other relevant tests, such as the Alzheimer’s Disease Assessment Scale and the Mini-Mental State Examination. Grouping such highly correlated but distinct measures into an integrated latent variable through factor analysis can improve the analytic power and interpretability of the model. Finally, our model framework includes only binary and continuous variables. Given that ordered and unordered categorical data are frequently encountered in medical, social, and psychological sciences, generalizing the existing framework to accommodate a wide variety of data types is of great interest.

ACKNOWLEDGEMENTS

The work of Xinyuan Song was supported by the Research Grants Council of Hong Kong under grant 14303017, The Chinese University of Hong Kong under direct grants, and the National Natural Science Foundation of China under grant 11471277. The work of Joan Hu was supported by the Canadian Institutes of Health Research under grant RN120660 and the Natural Sciences and Engineering Research Council of Canada under grant 177430. The authors are thankful to the Editor, the Associate Editor, and three anonymous reviewers for their valuable comments and suggestions.

Funding information

Research Grants Council of Hong Kong, Grant/Award Number: 14303017; National Natural Science Foundation of China, Grant/Award Number: 11471277; Canadian Institutes of Health Research, Grant/Award Number: RN120660; Natural Sciences and Engineering Research Council of Canada, Grant/Award Number: DAS 177430

APPENDIX A

FULL CONDITIONAL DISTRIBUTIONS

A.1 ∣. Full conditional distributions of Zit

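Let $y_i = (y_{i1}, \ldots, y_{iT})'$, $d_{it} = (c_{it}, x_{it})$, and $D_i = (d_{i1}, \ldots, d_{iT})$. Then, we have

$$
\begin{aligned}
p(Z_{it} \mid \cdot) &\propto p(y_i, D_i, Z_{it} \mid \theta) \\
&= p(y_{i1}, \ldots, y_{it}, d_{i1}, \ldots, d_{it}, Z_{it} \mid \theta) \times p(y_{i,t+1}, \ldots, y_{iT}, d_{i,t+1}, \ldots, d_{iT} \mid Z_{it}, \theta) \\
&= q_{it}(y_i, D_i, Z_{it} \mid \theta) \times \bar{q}_{it}(y_i, D_i \mid Z_{it}, \theta).
\end{aligned}
$$

We first initialize $q_{i1}(y_i, D_i, Z_{i1} \mid \theta) = p(y_{i1}, d_{i1}, Z_{i1} \mid \theta) = p(y_{i1} \mid d_{i1}, Z_{i1}, \theta)\, p(Z_{i1} \mid \theta)$ and calculate $q_{it}(y_i, D_i, Z_{it} \mid \theta)$ for $t = 2, \ldots, T$ recursively as follows:

$$
\begin{aligned}
q_{it}(y_i, D_i, Z_{it} \mid \theta) &= q_{it}(y_{i1}, \ldots, y_{it}, d_{i1}, \ldots, d_{it}, Z_{it} \mid \theta) \\
&= \sum_{u=1}^{S} p(y_{i1}, \ldots, y_{it}, d_{i1}, \ldots, d_{it}, Z_{it}, Z_{i,t-1} = u \mid \theta) \\
&= \sum_{u=1}^{S} \big[ p(y_{i1}, \ldots, y_{i,t-1}, d_{i1}, \ldots, d_{i,t-1}, Z_{i,t-1} = u \mid \theta) \times p(Z_{it} \mid Z_{i,t-1} = u, d_{it}, \theta) \times p(y_{it} \mid d_{it}, Z_{it}, \theta) \big] \\
&= \sum_{u=1}^{S} \big[ q_{i,t-1}(y_i, D_i, Z_{i,t-1} = u \mid \theta) \times p(Z_{it} \mid Z_{i,t-1} = u, d_{it}, \theta) \times p(y_{it} \mid d_{it}, Z_{it}, \theta) \big],
\end{aligned}
\tag{A1}
$$

where $p(Z_{it} \mid Z_{i,t-1} = u, d_{it}, \theta)$ and $p(y_{it} \mid d_{it}, Z_{it}, \theta)$ can be calculated on the basis of (8).

Similarly, we initialize $\bar{q}_{iT}(y_i, D_i \mid Z_{iT}, \theta) = 1$ and calculate $\bar{q}_{it}(y_i, D_i \mid Z_{it}, \theta)$ for $t = T - 1, \ldots, 1$ as follows:

$$
\begin{aligned}
\bar{q}_{it}(y_i, D_i \mid Z_{it}, \theta) &= p(y_{i,t+1}, \ldots, y_{iT}, d_{i,t+1}, \ldots, d_{iT} \mid Z_{it}, \theta) \\
&= \sum_{u=1}^{S} p(y_{i,t+1}, \ldots, y_{iT}, d_{i,t+1}, \ldots, d_{iT}, Z_{i,t+1} = u \mid Z_{it}, \theta) \\
&= \sum_{u=1}^{S} \big[ p(y_{i,t+2}, \ldots, y_{iT}, d_{i,t+2}, \ldots, d_{iT} \mid Z_{i,t+1} = u, \theta) \times p(Z_{i,t+1} = u \mid Z_{it}, d_{i,t+1}, \theta) \times p(y_{i,t+1} \mid d_{i,t+1}, Z_{i,t+1} = u, \theta) \big] \\
&= \sum_{u=1}^{S} \big[ \bar{q}_{i,t+1}(y_i, D_i \mid Z_{i,t+1} = u, \theta) \times p(Z_{i,t+1} = u \mid Z_{it}, d_{i,t+1}, \theta) \times p(y_{i,t+1} \mid d_{i,t+1}, Z_{i,t+1} = u, \theta) \big].
\end{aligned}
\tag{A2}
$$

Thus, $Z_{it}$ can be directly generated from its full conditional distribution once all $q_{it}(\cdot)$ and $\bar{q}_{it}(\cdot)$ defined in (A1) and (A2) have been calculated.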
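The forward and backward recursions above can be sketched in a few lines of Python for a single subject. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the 0-based indexing, and the toy probabilities are hypothetical, the covariate dependence is assumed to be already folded into the supplied transition and emission values, and no rescaling against numerical underflow is applied (reasonable for short series such as T = 4).

```python
import random

def forward_backward(init, trans, emis):
    """Return P(Z_t = s | all data) for one subject, t = 0..T-1.

    init[s]        : p(Z_0 = s)
    trans[t][u][s] : p(Z_t = s | Z_{t-1} = u), used for t = 1..T-1 (trans[0] unused)
    emis[t][s]     : p(y_t | Z_t = s)
    """
    T, S = len(emis), len(init)
    # Forward pass, cf. (A1): q[t][s] = p(y_0..y_t, Z_t = s)
    q = [[init[s] * emis[0][s] for s in range(S)]]
    for t in range(1, T):
        q.append([sum(q[t - 1][u] * trans[t][u][s] for u in range(S)) * emis[t][s]
                  for s in range(S)])
    # Backward pass, cf. (A2): qb[t][s] = p(y_{t+1}..y_{T-1} | Z_t = s)
    qb = [[1.0] * S for _ in range(T)]
    for t in range(T - 2, -1, -1):
        qb[t] = [sum(qb[t + 1][u] * trans[t + 1][s][u] * emis[t + 1][u]
                     for u in range(S)) for s in range(S)]
    # Normalized product of forward and backward terms
    post = []
    for t in range(T):
        w = [q[t][s] * qb[t][s] for s in range(S)]
        total = sum(w)
        post.append([v / total for v in w])
    return post

def sample_states(post, rng):
    """Draw each Z_t from its conditional distribution given the data."""
    return [rng.choices(range(len(p)), weights=p)[0] for p in post]

# Toy two-state example with made-up numbers
init = [0.6, 0.4]
trans = [None, [[0.7, 0.3], [0.2, 0.8]], [[0.5, 0.5], [0.1, 0.9]]]
emis = [[0.9, 0.2], [0.3, 0.6], [0.5, 0.4]]
post = forward_backward(init, trans, emis)
z = sample_states(post, random.Random(0))
```

Each forward or backward step costs on the order of S² operations, so the whole pass is linear in T rather than exponential in the number of possible state paths.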

A.2 ∣. Full conditional distributions of μs, αs, and ψs

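$$
[\mu_s \mid \cdot] \sim N[\mu_s^*, \sigma_{\mu s}^*], \quad [\alpha_s \mid \cdot] \sim N[\alpha_s^*, \Sigma_{\alpha s}^*], \quad [\psi_s^{-1} \mid \cdot] \sim \mathrm{Gamma}[\alpha_{\psi s}, \beta_{\psi s}]
\tag{A3}
$$

In the above equation, $\alpha_{\psi s} = \big(n_s + p + \sum_{j=1}^{q} M_j\big)/2 + \tilde{\alpha}_{s0}$, $\sigma_{\mu s}^* = (n_s \psi_s^{-1} + \sigma_{\mu s0}^{-1})^{-1}$, and

$$
\begin{aligned}
\beta_{\psi s} &= \tilde{\beta}_{s0} + \frac{1}{2} \Bigg[ \sum_{i=1}^{n} \sum_{t=1}^{T} I(Z_{it} = s) \Big( y_{it} - \mu_s - \alpha_s' c_{it} - \sum_{j=1}^{q} \beta_{sj}' h_{itj} \Big)^2 + \sum_{j=1}^{q} \frac{\|\beta_{sj}\|_{G_{sj}}^2}{\tau_{\beta sj}^2} + \sum_{h=1}^{p} \frac{\alpha_{sh}^2}{\tau_{\alpha sh}^2} \Bigg], \\
\Sigma_{\alpha s}^* &= \Bigg( \sum_{i=1}^{n} \sum_{t=1}^{T} c_{it} c_{it}' \psi_s^{-1} I(Z_{it} = s) + D_{\alpha s}^{-1} \Bigg)^{-1}, \qquad D_{\alpha s} = \mathrm{diag}(\tau_{\alpha s1}^2, \ldots, \tau_{\alpha sp}^2), \\
\mu_s^* &= \sigma_{\mu s}^* \Bigg[ \psi_s^{-1} \sum_{i=1}^{n} \sum_{t=1}^{T} I(Z_{it} = s) \Big( y_{it} - \alpha_s' c_{it} - \sum_{j=1}^{q} \beta_{sj}' h_{itj} \Big) + \sigma_{\mu s0}^{-1} \mu_{s0} \Bigg], \\
\alpha_s^* &= \Sigma_{\alpha s}^* \Bigg[ \psi_s^{-1} \sum_{i=1}^{n} \sum_{t=1}^{T} I(Z_{it} = s)\, c_{it} \Big( y_{it} - \mu_s - \sum_{j=1}^{q} \beta_{sj}' h_{itj} \Big) + \Sigma_{\alpha s0}^{-1} \alpha_{s0} \Bigg].
\end{aligned}
$$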

A.3 ∣. Full conditional distributions of βsj

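$$
[\beta_{sj} \mid \cdot] \sim N[\beta_{sj}^*, \Sigma_{sj}^*]\, I(1_{n_s}' H_{sj} \beta_{sj} = 0)
\tag{A4}
$$

In the above equation, $\Sigma_{sj}^* = \psi_s (H_{sj}' H_{sj} + \tau_{\beta sj}^{-1} G_{sj})^{-1}$, $\beta_{sj}^* = \psi_s^{-1} \Sigma_{sj}^* H_{sj}' y_s^*$, and $y_s^* = \{y_{its}^*\}$ is an $n_s \times 1$ vector with

$$
y_{its}^* = y_{it} - \mu_s - \alpha_s' c_{it} - \sum_{l \neq j,\, l = 1}^{q} \beta_{sl}' h_{itl}, \quad \text{for } Z_{it} = s.
$$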

A.4 ∣. Full conditional distributions of πs, ζus, and α~

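$$
\begin{aligned}
p(\pi_s \mid \cdot) &\propto \exp\Bigg\{ \sum_{u=s}^{S} \sum_{i=1}^{n} \log(p_{i10u}) \times I(Z_{i1} = u) - \frac{(\pi_s - \pi_{s0})^2}{2\sigma_{\pi 0}^2} \Bigg\}, \\
p(\zeta_{us} \mid \cdot) &\propto \exp\Bigg\{ \sum_{\nu=s}^{S} \sum_{i=1}^{n} \sum_{t=2}^{T} \log(p_{itu\nu}) \times I(Z_{it} = \nu, Z_{i,t-1} = u) - \frac{(\zeta_{us} - \zeta_{us0})^2}{2\sigma_{\zeta us0}^2} \Bigg\}, \\
p(\tilde{\alpha} \mid \cdot) &\propto \exp\Bigg\{ \sum_{i=1}^{n} \sum_{t=2}^{T} \log(p_{itus}) \times I(Z_{it} = s, Z_{i,t-1} = u) - \frac{1}{2} (\tilde{\alpha} - \tilde{\alpha}_0)' \tilde{D}_{\alpha}^{-1} (\tilde{\alpha} - \tilde{\alpha}_0) \Bigg\}.
\end{aligned}
\tag{A5}
$$

In the above equation, $\tilde{D}_{\alpha} = \sigma^2 \mathrm{diag}(\tilde{\tau}_{\alpha 1}^2, \ldots, \tilde{\tau}_{\alpha p}^2)$, and $p_{i10u}$ and $p_{itus}$ can be calculated on the basis of (9).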

A.5 ∣. Full conditional distributions of β~j

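$$
p(\tilde{\beta}_j \mid \cdot) \propto \exp\Bigg\{ \sum_{i=1}^{n} \sum_{t=2}^{T} \log(p_{itus}) \times I(Z_{it} = s, Z_{i,t-1} = u) - \frac{1}{2} (\tilde{\beta}_j - \tilde{\beta}_{j0})' \tilde{D}_{\beta j}^{-1} (\tilde{\beta}_j - \tilde{\beta}_{j0}) \Bigg\}
\tag{A6}
$$

The above equation is subject to the constraint $1_{n(T-1)}' H_j \tilde{\beta}_j = 0$, where $\tilde{D}_{\beta j} = \sigma^2 \tilde{\tau}_{\beta j}^2 \tilde{G}_j^{-1}$, and $p_{itus}$ can be calculated on the basis of (9).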

Notably, the full conditional distributions in (A5) and (A6) are not standard probability distributions. Therefore, the Metropolis-Hastings algorithm is used to sample from them. In addition, the full conditional distributions in (A4) and (A6) involve constraints, and the procedure for sampling from them can be found in the work of Song and Lu.18
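To illustrate how a nonstandard full conditional of this type can be sampled, the following Python sketch runs a random-walk Metropolis-Hastings sampler on a scalar log-density that, like (A5), combines a log-likelihood term with a Gaussian prior term. The proposal distribution and its step size are assumptions, since the text does not specify them here, and the toy target (a logit-scale parameter with made-up counts) is a hypothetical stand-in rather than the actual conditional of the model.

```python
import math
import random

def metropolis_hastings(log_target, x0, n_iter, step, rng):
    """Random-walk Metropolis-Hastings for a scalar parameter."""
    x, lp = x0, log_target(x0)
    draws, accepted = [], 0
    for _ in range(n_iter):
        prop = x + rng.gauss(0.0, step)       # symmetric Gaussian proposal
        lp_prop = log_target(prop)
        # Accept with probability min(1, target(prop) / target(x))
        if rng.random() < math.exp(min(0.0, lp_prop - lp)):
            x, lp = prop, lp_prop
            accepted += 1
        draws.append(x)
    return draws, accepted / n_iter

def log_target(z, k=30, m=100, z0=0.0, sd0=10.0):
    """Toy log-density: k successes in m trials on the logit scale, N(z0, sd0^2) prior."""
    log_p = -math.log1p(math.exp(-z))         # log of the logistic function
    log_1mp = -math.log1p(math.exp(z))        # log of its complement
    return k * log_p + (m - k) * log_1mp - (z - z0) ** 2 / (2 * sd0 ** 2)

rng = random.Random(2018)
draws, rate = metropolis_hastings(log_target, x0=0.0, n_iter=6000, step=0.5, rng=rng)
kept = draws[1000:]                           # discard burn-in draws
```

In practice the step size is tuned so that the acceptance rate is neither close to 0 nor close to 1; the constrained conditionals in (A4) and (A6) need the additional treatment referenced in the text and are not covered by this sketch.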

REFERENCES

  • 1. Bartolucci F, Farcomeni A. A multivariate extension of the dynamic logit model for longitudinal data based on a latent Markov heterogeneity structure. J Am Stat Assoc. 2009;104:816–831.
  • 2. Chow SM, Grimm KJ, Filteau G, Dolan CV, McArdle JJ. Regime-switching bivariate dual change score model. Multivar Behav Res. 2013;48:463–502.
  • 3. Vermunt JK, Langeheine R, Bockenholt U. Discrete-time discrete-state Latent Markov models with time-constant and time-varying covariates. J Educ Behav Stat. 1999;24:179–207.
  • 4. Schmittmann VD, Dolan CV, van der Maas HL, Neale MC. Discrete latent Markov models for normally distributed response data. Multivar Behav Res. 2005;40:461–488.
  • 5. Scott SL, James GM, Sugar CA. Hidden Markov models for longitudinal comparisons. J Am Stat Assoc. 2005;100:359–369.
  • 6. Bartolucci F, Farcomeni A, Pennoni F. Latent Markov Models for Longitudinal Data. Boca Raton, FL: Chapman & Hall/CRC; 2012.
  • 7. Yau C, Papaspiliopoulos O, Roberts GO, Holmes CC. Bayesian nonparametric hidden Markov models with application to the analysis of copy-number-variation in mammalian genomes. J Royal Stat Soc: Ser B (Stat Methodol). 2011;73:37–57.
  • 8. Song X, Kang K, Ouyang M, Jiang X, Cai J. Bayesian analysis of semiparametric hidden Markov models with latent variables. Struct Equ Model: Multidiscip J. 2018;25:1–20.
  • 9. Choi H, Fermin D, Nesvizhskii AI, Ghosh D, Qin ZS. Sparsely correlated hidden Markov models with application to genome-wide location studies. Bioinformatics. 2013;29:533–541.
  • 10. Städler N, Mukherjee S. Penalized estimation in high-dimensional hidden Markov models with state-specific graphical models. Ann Appl Stat. 2013;7:2157–2179.
  • 11. Guo R, Zhu H, Chow SM, Ibrahim JG. Bayesian lasso for semiparametric structural equation models. Biometrics. 2012;68:567–577.
  • 12. Feng XN, Wang GC, Wang YF, Song XY. Structure detection of semiparametric structural equation models with Bayesian adaptive group lasso. Statist Med. 2015;34:1527–1547.
  • 13. Kang K, Cai J, Song X, Zhu H. Bayesian hidden Markov models for delineating the pathology of Alzheimer’s disease. Stat Methods Med Res. 2018. Online first.
  • 14. Agresti A. Categorical Data Analysis. Hoboken, NJ: John Wiley & Sons; 2002.
  • 15. Song X, Xia Y, Zhu H. Hidden Markov latent variable models with multivariate longitudinal data. Biometrics. 2017;73:313–323.
  • 16. Hastie T, Tibshirani R, Friedman JH. Elements of Statistical Learning: Data Mining, Inference and Prediction. 2nd ed. New York, NY: Springer; 2009.
  • 17. Panagiotelis A, Smith M. Bayesian identification, selection and estimation of semiparametric functions in high-dimensional additive models. J Econom. 2008;143:291–316.
  • 18. Song XY, Lu ZH. Semiparametric latent variable models with Bayesian P-splines. J Comput Graph Stat. 2010;19:590–608.
  • 19. Tibshirani R. Regression shrinkage and selection via the lasso. J Royal Stat Soc: Ser B (Stat Methodol). 1996;58:267–288.
  • 20. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. J Royal Stat Soc: Ser B (Stat Methodol). 2006;68:49–67.
  • 21. Kyung M, Gill J, Ghosh M, Casella G. Penalized regression, standard errors, and Bayesian lassos. Bayesian Anal. 2010;5:369–411.
  • 22. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc. 2001;96:1348–1360.
  • 23. Wang H, Li G, Tsai CL. Regression coefficient and autoregressive order shrinkage and selection via the lasso. J Royal Stat Soc: Ser B (Stat Methodol). 2007;69:63–78.
  • 24. Wang H, Leng C. A note on adaptive group lasso. Comput Stat Data Anal. 2008;52:5277–5286.
  • 25. Bühlmann P, Van De Geer S. Statistics for High-Dimensional Data: Methods, Theory and Applications. New York, NY: Springer Science and Business Media; 2011.
  • 26. Cappé O, Moulines E, Rydén T. Inference in Hidden Markov Models. New York, NY: Springer; 2005.
  • 27. Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E. Equation of state calculations by fast computing machines. J Chem Phys. 1953;21:1087–1092.
  • 28. Hastings WK. Monte Carlo sampling methods using Markov chains and their applications. Biometrika. 1970;57:97–109.
  • 29. Li J, Wang Z, Li R, Wu R. Bayesian group lasso for nonparametric varying coefficient models with application to functional genome-wide association studies. Ann Appl Stat. 2015;9:640–664.
  • 30. Celeux G, Forbes F, Robert CP, Titterington DM. Deviance information criteria for missing data models. Bayesian Anal. 2006;1:651–673.
  • 31. Spiegelhalter DJ, Best NG, Carlin BP, van der Linde A. Bayesian measures of model complexity and fit. J Royal Stat Soc: Ser B (Stat Methodol). 2002;64:583–639.
  • 32. Gelman A, Roberts GO, Gilks WR. Efficient Metropolis jumping rules. In: Bernardo JM, Berger JO, Dawid AP, Smith AFM, eds. Bayesian Statistics. Vol. 5. Oxford, UK: Oxford University Press; 1996:599–607.
  • 33. Kantarci K, Gunter JL, Tosakulwong N, et al. Focal hemosiderin deposits and β-amyloid load in the ADNI cohort. Alzheimer’s Dement. 2013;9:S116–S123.
  • 34. Kesslak JP, Nalcioglu O, Cotman CW. Quantification of magnetic resonance scans for hippocampal and parahippocampal atrophy in Alzheimer’s disease. Neurology. 1991;41:51.
  • 35. Jack CR, Petersen RC, O’Brien PC, Tangalos EG. MR-based hippocampal volumetry in the diagnosis of Alzheimer’s disease. Neurology. 1992;42:183.
  • 36. Dickerson BC, Wolk D. Biomarker-based prediction of progression in MCI: comparison of AD-signature and hippocampal volume with spinal fluid amyloid-β and tau. Front Aging Neurosci. 2013;5:55.
  • 37. Gao S, Hendrie HC, Hall KS, Hui S. The relationships between age, sex, and the incidence of dementia and Alzheimer disease: a meta-analysis. Arch Gen Psychiatry. 1998;55:809–815.
  • 38. Lindsay J, Laurin D, Verreault R, et al. Risk factors for Alzheimer’s disease: a prospective analysis from the Canadian Study of Health and Aging. Am J Epidemiol. 2002;156:445–453.
  • 39. Bruandet A, Richard F, Bombois S, et al. Cognitive decline and survival in Alzheimer’s disease according to education level. Dement Geriatr Cogn Disord. 2008;25:74–80.
  • 40. Stern Y, Albert S, Tang MX, Tsai WY. Rate of memory decline in AD is related to education and occupation. Neurology. 1999;53:1942.
  • 41. Vina J, Lloret A. Why women have more Alzheimer’s disease than men: gender and mitochondrial toxicity of amyloid-β peptide. J Alzheimer’s Dis. 2010;20:S527–S533.
  • 42. Heun R, Kockler M. Gender differences in the cognitive impairment in Alzheimer’s disease. Arch Women’s Ment Health. 2002;4:129–137.
  • 43. Mazure CM, Swendsen J. Sex differences in Alzheimer’s disease and other dementias. Lancet Neurol. 2016;15:451–452.
  • 44. Lee E, Zhu H, Kong D, Wang Y, Giovanello KS, Ibrahim JG. BFLCRM: a Bayesian functional linear Cox regression model for predicting time to conversion to Alzheimer’s disease. Ann Appl Stat. 2015;9:2153–2178.
