Bayesian adaptive group lasso with semiparametric hidden Markov models

Kai Kang; Xinyuan Song; X Joan Hu; Hongtu Zhu

doi:10.1002/sim.8051

Author manuscript; available in PMC: 2020 Apr 30.

Published in final edited form as: Stat Med. 2018 Nov 28;38(9):1634–1650. doi: 10.1002/sim.8051


PMCID: PMC6445704 NIHMSID: NIHMS1001724 PMID: 30484887

Abstract

This paper presents a Bayesian adaptive group least absolute shrinkage and selection operator method to conduct simultaneous model selection and estimation under semiparametric hidden Markov models. We specify the conditional regression model and the transition probability model in the hidden Markov model into additive nonparametric functions of covariates. A basis expansion is adopted to approximate the nonparametric functions. We introduce multivariate conditional Laplace priors to impose adaptive penalties on regression coefficients and different groups of basis expansions under the Bayesian framework. An efficient Markov chain Monte Carlo algorithm is then proposed to identify the nonexistent, constant, linear, and nonlinear forms of covariate effects in both conditional and transition models. The empirical performance of the proposed methodology is evaluated via simulation studies. We apply the proposed model to analyze a real data set that was collected from the Alzheimer’s Disease Neuroimaging Initiative study. The analysis identifies important risk factors on cognitive decline and the transition from cognitive normal to Alzheimer’s disease.

Keywords: linear basis expansion, Markov chain Monte Carlo, simultaneous model selection and estimation

1 ∣. INTRODUCTION

Hidden Markov models (HMMs) have been widely used in the medical, behavioral, social, environmental, and psychological sciences where longitudinal data are frequently collected.^1-6 Basically, HMMs are designed to have two parts: a transition model to investigate the effects of covariates on the dynamic transition process of hidden states and a conditional regression model to examine state-specific covariate effects on the response of interest. In these two parts, the effect of a covariate on the response or on the transition process can be nonexistent, constant, linear, or nonlinear. Identifying the specific forms of such covariate effects is useful not only in achieving a parsimonious model but also in obtaining enhanced parameter estimation and attractive interpretations.

Conventional studies on HMMs have focused on a parametric framework, wherein the forms of covariate effects on responses and/or on transition probabilities are prespecified. However, one fundamental issue overlooked by these parametric HMMs is that the complex relationships among variables are seldom known a priori, and the parametric form is thus too restrictive to correctly reflect the reality. Several nonparametric approaches have been investigated recently to relax the parametric assumption of HMMs. Yau et al⁷ developed a Bayesian nonparametric HMM, where the sampling distribution of the observations at each state was assumed unknown and modeled via a mixture of Dirichlet processes. Although their method did not rely on the distributional assumption of the observed process, it cannot reveal the functional effects of potential explanatory variables on the outcome of interest. Song et al⁸ considered Bayesian P-splines for describing the nonparametric relation among latent variables in HMMs, but they did not consider the model selection problem.

Model selection is an important issue beyond estimation in the application of HMMs. Classical model selection methods are mainly developed on the basis of pairwise comparisons through common model selection criteria, such as the Akaike information criterion and the Bayesian information criterion. However, such a pairwise procedure becomes increasingly computationally demanding when the search dimension is high. An appealing alternative is to adopt least absolute shrinkage and selection operator (lasso)–type variable selection techniques. Choi et al⁹ applied lasso to correlated HMMs to detect the important parameters in transition models. Städler and Mukherjee¹⁰ introduced L₁ penalization to obtain a sparse HMM with state-specific graphical models. However, the preceding studies consider only parametric HMMs. Recently, some variants of lasso, such as group lasso, adaptive lasso, and adaptive group lasso, have been developed to manage group variables and address the issue of lasso and group lasso possibly suffering from appreciable bias. Owing to the computational efficiency and stability of the Bayesian approach, the Bayesian analogs of lasso and its variants have been proposed.^11,12 However, the available Bayesian lasso-type methods are all developed in the context of cross-sectional models without between-state transitions, thereby making them inapplicable to the proposed semiparametric HMMs.

In this paper, we propose a Bayesian adaptive group lasso (BaGlasso) procedure to conduct simultaneous model selection and estimation for semiparametric HMMs. With the use of basis expansion and appropriate penalties, the nonparametric relationships that subsume nonexistent, constant, linear, and nonlinear relationships between covariates and the response can be automatically identified. The proposed procedure has the following appealing features. First, the group effects and additional correlation within the basis expansion are well addressed by the group lasso, thus ensuring estimation accuracy. Second, adaptive penalties imposed on different groups of coefficients enable us to achieve an efficient variable selection. Finally, the proposed procedure avoids tedious pairwise comparisons among competing models with different combinations of covariates in the conditional and transition models. This entirely data-driven feature not only relaxes the dependence on experts’ knowledge in empirical studies but also reduces the computational burden. To the best of our knowledge, this study is the first to introduce a Bayesian lasso-type procedure into semiparametric HMMs.

The proposed method is motivated by a real study conducted by the Alzheimer’s Disease Neuroimaging Initiative (ADNI). A set of biomarkers, namely, gender, age, educational levels, marital status, hippocampal volume, and apolipoprotein E (APOE)-ϵ4, is collected across several time points in this data set. The purpose of this study is to detect the potential risk factors of Alzheimer’s disease (AD) from two perspectives. First, considering that the pathology of AD usually evolves from cognitive normal (CN) to mild cognitive impairment (MCI) to dementia, characterizing the disease pathology, identifying hidden states that correspond to the diagnosed stages of cognitive decline, and examining the potential risk factors of the neurodegenerative transition are of scientific interest and practical value. Given that the effects of biomarkers on the pathology from one state to another may vary across nonexistent, constant, linear, and nonlinear ones, allowing their forms to be unspecified and introducing penalties to penalize unimportant effects can reveal the patterns of the effects to the greatest extent. Previous studies¹³ pointed out that the relationships between some biomarkers and cognitive decline are variant across different states. Therefore, identifying the significant state-specific risk factors of cognitive decline and investigating the subtle forms of their effects are of great interest. However, existing relevant research either restricts the examination of the above relationships under a parametric framework or emphasizes only estimation. The proposed methodology enables us to perfectly accommodate all the aforementioned features and provide new insights into the prevention of AD.

The rest of this paper is organized as follows. Section 2 introduces the semiparametric HMM and discusses the associated identifiability issue. Section 3 illustrates the statistical inference of the proposed model. Specifically, BaGlasso for simultaneous variable selection and parameter estimation as well as the deviance information criterion (DIC) for the determination of the number of hidden states are presented. Section 4 investigates the empirical performance of the proposed method via simulation studies. Section 5 presents an application of the proposed method to the aforementioned ADNI study. Several important biomarkers are detected to have significant functional effects on patients’ cognitive decline across neurodegenerative states and/or on transition probabilities. The extension of the model is discussed in Section 6.

2 ∣. MODEL DESCRIPTION

2.1 ∣. Semiparametric HMMs

Let y_it, for subject i = 1, … , n at occasion t = 1, … , T, denote the observation process. The hidden-state sequence Z_i = (Z_i1, … , Z_iT)′ is commonly assumed to follow a first-order Markov chain taking values in a finite set {1, … , S}. Given the hidden state Z_it, the conditional semiparametric regression model is formulated as follows:

[y_{i t} ∣ Z_{i t} = s] = μ_{s} + α_{s}^{'} c_{i t} + \sum_{j = 1}^{q} f_{s j} (x_{i t j}) + δ_{i t},

(1)

where c_it = (c_it1, … , c_itp)′ and x_it = (x_it1, … , x_itq)′ are a p × 1 vector of discrete covariates and a q × 1 vector of continuous covariates, respectively; the intercept μ_s, fixed effects $α_{s} = (α_{s 1}, \dots, α_{s p})^{'}$ , and unknown smoothing functions f_sj(·) are all defined as state-specific to address the heterogeneity underlying the observations; δ_it is a random residual independent of y_it; and [δ_it∣Z_it = s] ~ N[0, ψ_s].

In addition to the observable process, the hidden process, Z_i, is formulated as follows: let p_itus denote the transition probability from state Z_i,t–1 = u at occasion t – 1 to state Z_it = s at occasion t for individual i. Then, we have

p_{i t u s} = P (Z_{i t} = s ∣ Z_{i 1}, Z_{i 2}, \dots, Z_{i, t - 1} = u) = P (Z_{i t} = s ∣ Z_{i, t - 1} = u) .

(2)

Notably, the second equality in (2) follows from the assumed Markov property. A common setting for the initial distribution of Z_i1 is the multinomial distribution with probabilities (π₁, … , π_S)′, such that π_s ≥ 0 and $\sum_{s = 1}^{S} π_{s} = 1$ . Thus, the hidden-state sequence Z_i = (Z_i1, … , Z_iT)′ is fully specified by the initial and transition probabilities.

Considering that the hidden states usually have natural ranking information in empirical studies, we assume the hidden states {1, … , S} to be ordered and consider a continuation-ratio logit model¹⁴ as follows: for t = 2, … , T, s = 1, … , S – 1, and u = 1, … , S, we have

\log (\frac{P (Z_{i t} = s ∣ Z_{i, t - 1} = u)}{P (Z_{i t} > s ∣ Z_{i, t - 1} = u)}) = \log (\frac{p_{i t u s}}{p_{i t u, s + 1} + \dots + p_{i t u S}}) = ζ_{u s} + {\tilde{α}}^{'} c_{i t} + \sum_{j = 1}^{q} g_{j} (x_{i t j}),

(3)

where the left-hand side is the log odds of transition to state s rather than to a state that is higher than s given Z_i,t–1 = u, ζ_us is a transition-specific intercept, c_it = (c_it1, … , c_itp)′ and x_it = (x_it1, … , x_itq)′ are the covariate vectors defined in (1), $\tilde{α} = ({\tilde{α}}_{1}, \dots, {\tilde{α}}_{p})^{'}$ is a p × 1 vector of fixed effects, and g_j(·)s are unknown smoothing functions. Let ϑ_itus = P(Z_it = s∣Z_it ≥ s, Z_i,t–1 = u). Then, the continuation-ratio logits in (3) can be rewritten as

\begin{aligned} \log (\frac{P (Z_{i t} = s ∣ Z_{i, t - 1} = u)}{P (Z_{i t} > s ∣ Z_{i, t - 1} = u)}) & = \log (\frac{P (Z_{i t} = s, Z_{i, t - 1} = u)}{P (Z_{i t} \geq s, Z_{i, t - 1} = u) - P (Z_{i t} = s, Z_{i, t - 1} = u)}) \\ & = \log (\frac{P (Z_{i t} = s, Z_{i, t - 1} = u) ∕ P (Z_{i t} \geq s, Z_{i, t - 1} = u)}{1 - P (Z_{i t} = s, Z_{i, t - 1} = u) ∕ P (Z_{i t} \geq s, Z_{i, t - 1} = u)}) \\ & = \log (\frac{P (Z_{i t} = s ∣ Z_{i t} \geq s, Z_{i, t - 1} = u)}{1 - P (Z_{i t} = s ∣ Z_{i t} \geq s, Z_{i, t - 1} = u)}) \\ & = \log (\frac{ϑ_{i t u s}}{1 - ϑ_{i t u s}}) . \end{aligned}

Thus, the continuation-ratio logit (3) can be rewritten as a conventional logistic regression model as follows:

\operatorname{logit} (ϑ_{i t u s}) = ζ_{u s} + {\tilde{α}}^{'} c_{i t} + \sum_{j = 1}^{q} g_{j} (x_{i t j}),

(4)

where logit(ϑ_itus) is the log odds of Z_it = s given Z_it ≥ s and Z_i,t–1 = u. In model (3) or (4), $\tilde{α}$ and g_j(·)s are assumed to be independent of u and s. This proportional odds assumption is compulsory in modeling an ordinal variable because it ensures that P(Z_it < 1) < P(Z_it < 2) < ⋯ < P(Z_it < S) for ordered states 1 < 2 < ⋯ < S.^14,15 Moreover, the proportional odds assumption avoids a tedious inference, in which every possible transition of origination and destination elicits a set of parameters, and it, in turn, greatly reduces the complexity and enhances the interpretability of the transition model.

2.2 ∣. Nonparametric modeling

We use linear basis expansion to estimate the nonparametric functions f_sj(·) and g_j(·) in (1) and (3). Given that g_j(·) can be regarded as a special case (without a state-specific setting) of f_sj(·), we describe only the modeling of f_sj(·) in this section. Specifically, f_sj(x_itj) can be approximated as follows:

f_{s j} (x_{i t j}) = \sum_{m = 1}^{M_{j}} β_{s j m} h_{m} (x_{i t j}) = β_{s j}^{'} h_{i t j},

(5)

where h_m(·)s are basis functions, such as piecewise polynomials or natural cubic splines,¹⁶ h_itj = (h₁(x_itj), …, h_{M_j} (x_itj))′, and M_j is the number of basis functions that are used to estimate the jth unknown smoothing function. For notational simplicity, M_j is set to be invariant to states. An extension to relax this assumption is straightforward.

An important issue regarding the model selection of (1) and (3) is whether a functional effect, e.g., f_sj(·), truly exists. In this study, we utilize a norm ∥·∥ to quantify the magnitude of the nonparametric function f_sj. Let x_sj and H_sj denote the submatrices of x_j = (x_11j, … , x_nTj)′ and H_j, respectively, with the rows corresponding to Z_it ≠ s deleted, where H_j is formed by

H_{j} = (\begin{matrix} h^{'} (x_{11 j}) \\ ⋮ \\ h^{'} (x_{n T j}) \end{matrix}) = {(\begin{matrix} h_{1} (x_{11 j}) & \dots & h_{M_{j}} (x_{11 j}) \\ ⋮ & ⋱ & ⋮ \\ h_{1} (x_{n T j}) & \dots & h_{M_{j}} (x_{n T j}) \end{matrix})}_{n T \times M_{j}} .

(6)

The norm of f_sj, ∥f_sj∥, is defined as $\sqrt{E (f_{s j}^{2} (x_{s j}))}$ . Then, f_sj = 0 is equivalent to ∥f_sj∥ = 0. On the basis of (5), ∥f_sj∥ can be approximated by $‖ β_{s j} ‖_{G_{s j}} = (β_{s j}^{'} G_{s j} β_{s j})^{1 ∕ 2}$ with positive definite matrix $G_{s j} = H_{s j}^{'} H_{s j} ∕ n_{s}$ , where n_s is the number of subjects staying in state s. Denote $‖ {\hat{f}}_{s j} ‖$ as the estimator of ∥f_sj∥. In the model selection procedure, if $‖ {\hat{f}}_{s j} ‖ = 0$ , then f_sj = 0. The nonparametric function g_j(x_itj) can be similarly approximated by

g_{j} (x_{i t j}) = \sum_{m = 1}^{M_{j}} {\tilde{β}}_{j m} h_{m} (x_{i t j}) = {\tilde{β}}_{j}^{'} h_{i t j},

(7)

where ${\tilde{β}}_{j m}$ , h_m(·), h_itj, M_j, and ${\tilde{β}}_{j}$ are defined in the same manner as those in (5). Likewise, ∥g_j∥ can be approximated by $‖ {\tilde{β}}_{j} ‖_{{\tilde{G}}_{j}} = ({\tilde{β}}_{j}^{'} {\tilde{G}}_{j} {\tilde{β}}_{j})^{1 ∕ 2}$ , where ${\tilde{G}}_{j} = H_{j}^{'} H_{j} ∕ (n \times (T - 1))$ .
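As a quick numerical check of the norm approximation above, the following minimal sketch (Python with NumPy) verifies that (β′Gβ)^{1/2} with G = H′H/n equals the root mean square of the fitted function values Hβ, and that a zero coefficient vector gives a zero norm. The toy basis, the sample size, and all variable names are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def g_norm(beta, H):
    """Return (beta' G beta)^(1/2) with G = H'H / (number of rows of H)."""
    G = H.T @ H / H.shape[0]
    return float(np.sqrt(beta @ G @ beta))

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
# Toy basis (assumption): centered linear and quadratic terms.
H = np.column_stack([x - x.mean(), x**2 - np.mean(x**2)])
beta = np.array([0.5, -1.2])

norm_val = g_norm(beta, H)
# Root mean square of the fitted function values H @ beta.
rms_fit = float(np.sqrt(np.mean((H @ beta) ** 2)))
# A zero coefficient group corresponds to a zero function.
zero_norm = g_norm(np.zeros(2), H)
```

The identity holds because β′(H′H/n)β is the average of the squared entries of Hβ, which is the sample analog of E(f²).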

Let y_i = (y_i1 , ⋯ , y_iT)′, $Y = (y_{1}^{'}, \dots, y_{n}^{'})^{'}$ , $d_{i t} = (c_{i t}^{'}, x_{i t}^{'})^{'}$ , $D_{i} = (d_{i 1}^{'}, \dots, d_{i T}^{'})^{'}$ , $D = (D_{1}^{'}, \dots, D_{n}^{'})^{'}$ , Z_i = (Z_i1, … , Z_iT)′, $Z = (Z_{1}^{'}, \dots, Z_{n}^{'})^{'}$ , and θ be the vector that includes all the unknown parameters. With the linear basis expansion, the complete-data log-likelihood function is given by

\begin{aligned} \log p (Y, D, Z ∣ θ) & = \sum_{i = 1}^{n} [\log p (y_{i} ∣ D_{i}, Z_{i}, θ) + \log p (Z_{i} ∣ D_{i}, θ)] \\ & = \sum_{i = 1}^{n} \sum_{t = 1}^{T} \log p (y_{i t} ∣ d_{i t}, Z_{i t} = s, θ) + \sum_{i = 1}^{n} \sum_{t = 2}^{T} \log p (Z_{i t} = s ∣ Z_{i, t - 1} = u, d_{i t}, θ) + \sum_{i = 1}^{n} \log p (Z_{i 1} = s ∣ θ) \\ & = - \frac{1}{2} \sum_{i = 1}^{n} \sum_{t = 1}^{T} [\log (2 π ψ_{s}) + (y_{i t} - η_{i t})^{2} ∕ ψ_{s}] + \sum_{i = 1}^{n} \sum_{t = 2}^{T} \log (p_{i t u s}) + \sum_{i = 1}^{n} \log (p_{i 1 0 s}), \end{aligned}

(8)

where

\begin{aligned} η_{i t} & = μ_{s} + α_{s}^{'} c_{i t} + \sum_{j = 1}^{q} β_{s j}^{'} h_{i t j}, \quad p_{i 1 0 s} = π_{s}, \quad s = 1, \dots, S, \\ p_{i t u 1} & = \frac{\exp \{a_{i t u 1}\}}{1 + \exp \{a_{i t u 1}\}}, \quad p_{i t u S} = \prod_{j = 1}^{S - 1} \frac{1}{1 + \exp \{a_{i t u j}\}}, \\ p_{i t u s} & = \frac{\exp \{a_{i t u s}\}}{1 + \exp \{a_{i t u s}\}} \prod_{j = 1}^{s - 1} \frac{1}{1 + \exp \{a_{i t u j}\}}, \quad s = 2, \dots, S - 1, \end{aligned}

(9)

with $a_{i t u s} = ζ_{u s} + {\tilde{α}}^{'} c_{i t} + \sum_{j = 1}^{q} {\tilde{β}}_{j}^{'} h_{i t j}$ .
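The mapping in (9) from the linear predictors a_itus to the transition probabilities can be sketched as follows. This is a minimal Python illustration for a single subject, occasion, and previous state u. The function name and the numerical logits are hypothetical and serve only to show that the resulting probabilities sum to one and reproduce the continuation-ratio logits in (3).

```python
import numpy as np

def transition_probs(a):
    """Map continuation-ratio logits a[0..S-2] to probabilities p[0..S-1]."""
    a = np.asarray(a, dtype=float)
    # theta_s = P(Z = s | Z >= s, previous state) = exp(a_s) / (1 + exp(a_s))
    theta = 1.0 / (1.0 + np.exp(-a))
    # survive[s] = prod_{j < s} 1 / (1 + exp(a_j)), the chance of passing states before s
    survive = np.concatenate(([1.0], np.cumprod(1.0 - theta)))
    p = np.empty(a.size + 1)
    p[:-1] = theta * survive[:-1]
    p[-1] = survive[-1]  # the last state S takes the remaining mass
    return p

# Hypothetical logits for S = 3 states (two continuation-ratio logits).
probs = transition_probs([0.5, -0.3])
logit_1 = np.log(probs[0] / (probs[1] + probs[2]))
logit_2 = np.log(probs[1] / probs[2])
```

Working with the conditional probabilities ϑ first and multiplying by the cumulative product keeps the S probabilities positive and normalized by construction.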

2.3 ∣. Related issues

The proposed model is not identifiable because of the following two model indeterminacies. First, the basis functions involved in the basis expansion may contain constant parts. When such constant basis functions appear in every f_sj(·) and/or g_j(·), each unknown function is identifiable only up to a constant. To address this issue, we impose the following constraints on the unknown functions, which force their integrals over the ranges of the predictors to be zero^17,18:

\int_{χ_{j}} f_{s j} (x) \, d x = 0, \quad \text{for } s = 1, \dots, S, \; j = 1, \dots, q,

(10)

where $χ_{j}$ is the domain of x_j. Second, the label switching problem, which is caused by the invariance of the likelihood function to a random permutation of the state labels, arises and leads to a multimodal posterior under a symmetric prior specification. We address this issue by imposing constraint μ₁ < ⋯ < μ_S on posterior samples.
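One possible way to enforce the ordering constraint on stored posterior draws is to permute the state labels within each draw so that the intercepts are increasing. The sketch below (Python with NumPy) is an illustration under that assumption and is not the paper's implementation. The function name and the toy draws are hypothetical, and in a full analysis every other state-specific quantity (e.g., α_s, β_s, and the sampled states) would need the same permutation.

```python
import numpy as np

def relabel_by_mu(mu_draws, psi_draws):
    """Reorder state labels within each draw so that mu is increasing.

    Both inputs have shape (number of draws, S)."""
    order = np.argsort(mu_draws, axis=1)
    mu_sorted = np.take_along_axis(mu_draws, order, axis=1)
    # Apply the same permutation to the state-specific variances.
    psi_sorted = np.take_along_axis(psi_draws, order, axis=1)
    return mu_sorted, psi_sorted

# Toy draws (hypothetical): the second draw has its labels switched.
mu_draws = np.array([[-1.0, 1.0], [1.1, -0.9], [-1.2, 0.8]])
psi_draws = np.array([[0.36, 0.16], [0.15, 0.35], [0.37, 0.17]])
mu_sorted, psi_sorted = relabel_by_mu(mu_draws, psi_draws)
```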

3 ∣. BAYESIAN ANALYSIS

3.1 ∣. Adaptive group lasso penalties

We explain the key idea of the adaptive group lasso penalties in the context of a simple linear regression model: y = μ1_n + Xβ + δ, where y is the response vector, μ is an intercept, 1_n is an n-dimensional vector of all elements being 1, X is a standardized design matrix, δ is the vector of residuals, δ ~ N(0, ψI_n), and I_n is an n-dimensional identity matrix. Tibshirani¹⁹ first introduced the lasso procedure for simultaneous model selection and parameter estimation of the above linear regression. The lasso estimator of β can be expressed as

\operatorname{argmin}_{β} \{(y - μ 1_{n} - X β)^{'} (y - μ 1_{n} - X β) + γ \sum_{h = 1}^{p} ∣ β_{h} ∣\},

(11)

where γ ≥ 0 is the tuning parameter of the L₁-penalty, which automatically shrinks unimportant covariate effects to 0. Given that the covariates in X are standardized to the same scale, the magnitudes of the coefficients in β can represent the significance of predictors. If some elements of β are close to 0, then the corresponding covariates are unimportant and can be removed from the model.

However, when simply applying lasso to the proposed semiparametric HMMs, at least two problems exist. First, lasso is originally designed for the selection of individual variables. Yuan and Lin²⁰ showed that lasso tends to select more factors than necessary in the presence of group variables. Moreover, the pairwise correlations among group variables jeopardize the model selection accuracy of the lasso estimator.²¹ In this study, high correlations exist among the basis functions h_m(x_it)s in the conditional and transition models because they can be viewed as different transformations of x_it. Consequently, the linear basis expansion involves group variables and should not be treated separately. Second, lasso applies the same tuning parameter γ to different regression coefficients, thereby introducing the same amount of shrinkage to different covariate effects. This inflexible setting may add considerable bias to the resulting estimates.^22,23

To address the aforementioned issues, Yuan and Lin²⁰ proposed group lasso to perform model selection among group variables. Wang and Leng²⁴ further developed adaptive group lasso to assign different tuning parameters to different groups of regression coefficients. Let $α = (α_{1}^{'}, \dots, α_{S}^{'})^{'}$ , $β_{s} = (β_{s 1}^{'}, \dots, β_{s q}^{'})^{'}$ , $β = (β_{1}^{'}, \dots, β_{S}^{'})^{'}$ , $\tilde{β} = ({\tilde{β}}_{1}^{'}, \dots, {\tilde{β}}_{q}^{'})^{'}$ , and $θ^{*} = (α^{'}, {\tilde{α}}^{'}, β^{'}, {\tilde{β}}^{'})^{'}$ . On the basis of the proposed model defined in (1)–(7), the adaptive group lasso estimator can be formulated as

\operatorname{argmin}_{θ^{*}} \{\sum_{i = 1}^{n} \sum_{t = 1}^{T} (y_{i t} - η_{i t})^{'} (y_{i t} - η_{i t}) - \sum_{i = 1}^{n} \sum_{t = 2}^{T} \log (p_{i t u s}) + P (θ^{*})\},

(12)

where η_it is the mean of y_it, p_itus is the transition probability defined in (2) and (9), and

P (θ^{*}) = \sum_{s = 1}^{S} \sum_{h = 1}^{p} γ_{α s h} ∣ α_{s h} ∣ + \sum_{h = 1}^{p} {\tilde{γ}}_{α h} ∣ {\tilde{α}}_{h} ∣ + \sum_{s = 1}^{S} \sum_{j = 1}^{q} γ_{β s j} ‖ β_{s j} ‖_{G_{s j}} + \sum_{j = 1}^{q} {\tilde{γ}}_{β j} ‖ {\tilde{β}}_{j} ‖_{{\tilde{G}}_{j}},

(13)

in which α_sh, ${\tilde{α}}_{h}$ , β_sj, and ${\tilde{β}}_{j}$ are coefficients of fixed effects and basis functions in the conditional and transition models; γ_αsh, ${\tilde{γ}}_{α h}$ , γ_βsj, and ${\tilde{γ}}_{β j}$ are the corresponding tuning parameters; and the norms ∥β_sj∥_{G_sj} and $‖ {\tilde{β}}_{j} ‖_{{\tilde{G}}_{j}}$ are defined in Section 2.2. Notably, the coefficients of discrete covariates, namely, α_sh and ${\tilde{α}}_{h}$ , are simply assigned adaptive lasso penalties, whereas the coefficients of unknown smooth functions β_sj and ${\tilde{β}}_{j}$ , which have groupwise features, are assigned adaptive group lasso penalties. The initial probabilities p_i10s are excluded from (12) and (13) because they are independent of ${\tilde{α}}_{h}$ and ${\tilde{β}}_{j}$ .

Yuan and Lin²⁰ argued that the penalty function in (13) is intermediate between the L₁-penalty used in lasso and the L₂-penalty used in ridge regression. Therefore, the adaptive group lasso not only has the same advantages of lasso in model selection but also alleviates the problem caused by the existence of high pairwise correlation among basis functions. Furthermore, with the use of different tuning parameters γ_βsj and ${\tilde{γ}}_{β j}$ , the adaptive group lasso automatically imposes large penalties on groups of unimportant coefficients to efficiently shrink them to 0. Moreover, the penalty terms ∥β_sj∥_{G_sj} and $‖ {\tilde{β}}_{j} ‖_{{\tilde{G}}_{j}}$ can be regarded as the scaled version of the groupwise prediction penalty suggested by Buhlmann and Van De Geer.²⁵ With the great power of adaptive group lasso, the estimation of all unknown parameters and the structure detection for important functional covariate effects on the observed response and on the hidden-state process can be simultaneously and efficiently obtained.

3.2 ∣. BaGlasso and prior specification

Under the Bayesian framework, the adaptive group lasso procedure can be implemented by introducing a multivariate conditional Laplace prior to the regression coefficients in $θ^{*} = (α^{'}, {\tilde{α}}^{'}, β^{'}, {\tilde{β}}^{'})^{'}$ as follows:

p (θ^{*} ∣ ψ, σ^{2}) \propto \exp (- \sum_{s = 1}^{S} \sum_{h = 1}^{p} \frac{γ_{α s h}}{\sqrt{ψ_{s}}} ∣ α_{s h} ∣ - \sum_{h = 1}^{p} \frac{{\tilde{γ}}_{α h}}{\sqrt{σ^{2}}} ∣ {\tilde{α}}_{h} ∣ - \sum_{s = 1}^{S} \sum_{j = 1}^{q} \frac{γ_{β s j}}{\sqrt{ψ_{s}}} ‖ β_{s j} ‖_{G_{s j}} - \sum_{j = 1}^{q} \frac{{\tilde{γ}}_{β j}}{\sqrt{σ^{2}}} ‖ {\tilde{β}}_{j} ‖_{{\tilde{G}}_{j}}),

(14)

where ψ = (ψ₁, … , ψ_S)′. This conditional Laplace prior can be represented as a scale mixture of normals with an exponential mixing density, leading to a hierarchical representation of the full model as follows: for i = 1, … , n, t = 1, … , T, s = 1, … ,S, h = 1, … , p, and j = 1, … , q, we have

\begin{aligned} & y_{i t} ∣ Z_{i t} = s, μ_{s}, c_{i t}, α_{s}, β_{s}, ψ_{s} \sim N (η_{i t}, ψ_{s}), \\ & α_{s} ∣ ψ_{s}, τ_{α s 1}^{2}, \dots, τ_{α s p}^{2} \overset{ind}{\sim} N_{p} (0, ψ_{s} Σ_{α s}), \quad Σ_{α s} = \operatorname{diag} (τ_{α s 1}^{2}, \dots, τ_{α s p}^{2}), \\ & \tilde{α} ∣ σ^{2}, {\tilde{τ}}_{α 1}^{2}, \dots, {\tilde{τ}}_{α p}^{2} \sim N_{p} (0, σ^{2} {\tilde{Σ}}_{α}), \quad {\tilde{Σ}}_{α} = \operatorname{diag} ({\tilde{τ}}_{α 1}^{2}, \dots, {\tilde{τ}}_{α p}^{2}), \\ & β_{s j} ∣ ψ_{s}, τ_{β s j}^{2} \overset{ind}{\sim} N_{M_{j}} (0, ψ_{s} τ_{β s j}^{2} G_{s j}^{- 1}), \quad {\tilde{β}}_{j} ∣ σ^{2}, {\tilde{τ}}_{β j}^{2} \overset{ind}{\sim} N_{M_{j}} (0, σ^{2} {\tilde{τ}}_{β j}^{2} {\tilde{G}}_{j}^{- 1}), \\ & τ_{α s h}^{2} \overset{ind}{\sim} \operatorname{Gamma} (1, \frac{γ_{α s h}^{2}}{2}), \quad {\tilde{τ}}_{α h}^{2} \overset{ind}{\sim} \operatorname{Gamma} (1, \frac{{\tilde{γ}}_{α h}^{2}}{2}), \\ & τ_{β s j}^{2} \overset{ind}{\sim} \operatorname{Gamma} (\frac{M_{j} + 1}{2}, \frac{γ_{β s j}^{2}}{2}), \quad {\tilde{τ}}_{β j}^{2} \overset{ind}{\sim} \operatorname{Gamma} (\frac{M_{j} + 1}{2}, \frac{{\tilde{γ}}_{β j}^{2}}{2}), \\ & ψ_{s}^{- 1} \overset{ind}{\sim} \operatorname{Gamma} (α_{ψ s 0}, β_{ψ s 0}), \quad σ^{- 2} \sim \operatorname{Gamma} (α_{σ 0}, β_{σ 0}), \end{aligned}

(15)

where $\overset{ind}{\sim}$ represents “independently distributed according to” and η_it is defined in (9). For the tuning parameters γ_αsh, ${\tilde{γ}}_{α h}$ , γ_βsj, and ${\tilde{γ}}_{β j}$ , we assign gamma priors as follows:

\begin{matrix} p (γ_{α s h}^{2}) & \overset{ind}{\sim} Gamma (α_{α s h 0}, β_{α s h 0}), p ({\tilde{γ}}_{α h}^{2}) \overset{ind}{\sim} Gamma ({\tilde{α}}_{α h 0}, {\tilde{β}}_{α h 0}), \\ p (γ_{β s j}^{2}) & \overset{ind}{\sim} Gamma (α_{β s j 0}, β_{β s j 0}), p ({\tilde{γ}}_{β j}^{2}) \overset{ind}{\sim} Gamma ({\tilde{α}}_{β j 0}, {\tilde{β}}_{β j 0}), \end{matrix}

(16)

where α_αsh0, ${\tilde{α}}_{α h 0}$ , α_βsj0, ${\tilde{α}}_{β j 0}$ , β_αsh0, ${\tilde{β}}_{α h 0}$ , β_βsj0, and ${\tilde{β}}_{β j 0}$ are hyperparameters with prespecified values. We follow a common practice in the literature^11,12 to set $α_{α s h 0} = {\tilde{α}}_{α h 0} = α_{β s j 0} = {\tilde{α}}_{β j 0} = 1$ , $β_{α s h 0} = {\tilde{β}}_{α h 0} = 0.1$ , and $β_{β s j 0} = {\tilde{β}}_{β j 0} = 0.01$ to obtain relatively dispersed gamma priors. The key idea of BaGlasso is to properly update the tuning parameters by using the data, thereby automatically imposing large penalties on unimportant coefficients. This target can be naturally achieved by introducing dispersed priors with small hyperparameters β_αsh0, ${\tilde{β}}_{α h 0}$ , β_βsj0, and ${\tilde{β}}_{β j 0}$ . We explain this regularization procedure further through the posterior distribution of the tuning parameters as follows:

\begin{aligned} & p (τ_{α s h}^{- 2} ∣ \cdot) \sim \text{In-Gaussian} (\sqrt{\frac{γ_{α s h}^{2} ψ_{s}}{∣ α_{s h} ∣^{2}}}, γ_{α s h}^{2}), \quad p ({\tilde{τ}}_{α h}^{- 2} ∣ \cdot) \sim \text{In-Gaussian} (\sqrt{\frac{{\tilde{γ}}_{α h}^{2} σ^{2}}{∣ {\tilde{α}}_{h} ∣^{2}}}, {\tilde{γ}}_{α h}^{2}), \\ & p (τ_{β s j}^{- 2} ∣ \cdot) \sim \text{In-Gaussian} (\sqrt{\frac{γ_{β s j}^{2} ψ_{s}}{‖ β_{s j} ‖_{G_{s j}}^{2}}}, γ_{β s j}^{2}), \quad p ({\tilde{τ}}_{β j}^{- 2} ∣ \cdot) \sim \text{In-Gaussian} (\sqrt{\frac{{\tilde{γ}}_{β j}^{2} σ^{2}}{‖ {\tilde{β}}_{j} ‖_{{\tilde{G}}_{j}}^{2}}}, {\tilde{γ}}_{β j}^{2}), \\ & p (γ_{α s h}^{2} ∣ \cdot) \sim \operatorname{Gamma} (α_{α s h 0} + 1, β_{α s h 0} + \frac{τ_{α s h}^{2}}{2}), \quad p ({\tilde{γ}}_{α h}^{2} ∣ \cdot) \sim \operatorname{Gamma} ({\tilde{α}}_{α h 0} + 1, {\tilde{β}}_{α h 0} + \frac{{\tilde{τ}}_{α h}^{2}}{2}), \\ & p (γ_{β s j}^{2} ∣ \cdot) \sim \operatorname{Gamma} (α_{β s j 0} + \frac{M_{j} + 1}{2}, β_{β s j 0} + \frac{τ_{β s j}^{2}}{2}), \\ & p ({\tilde{γ}}_{β j}^{2} ∣ \cdot) \sim \operatorname{Gamma} ({\tilde{α}}_{β j 0} + \frac{M_{j} + 1}{2}, {\tilde{β}}_{β j 0} + \frac{{\tilde{τ}}_{β j}^{2}}{2}), \end{aligned}

(17)

where “In-Gaussian(·)” denotes the inverse Gaussian distribution. We omit the tedious subscripts and use generic terms τ and γ to simplify notations below. On the basis of (17), if the coefficients are significant, then τ² tends to be large. As a result, the corresponding tuning parameter γ is dominated by τ², leading γ to be mostly data driven. If the coefficients are insignificant, then τ² tends to be small. Consequently, the corresponding tuning parameter γ is dominated by the dispersed prior information, leading to a large value of γ. Thus, the degree of dispersion of the gamma priors in (16) determines the amount of penalties imposed on unimportant predictors. This rationale explains why we assign more dispersed priors to $γ_{β s j}^{2}$ and ${\tilde{γ}}_{β j}^{2}$ than to $γ_{α s h}^{2}$ and ${\tilde{γ}}_{α h}^{2}$ : the coefficients of the nonlinear parts of basis functions are more difficult to shrink to 0 than those of the linear parts.
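The regularization mechanism described above can be illustrated with a single conditional update of (τ², γ²) for one coefficient group. The sketch below is a hedged illustration in Python with NumPy, not the authors' sampler. It assumes that Gamma(a, b) is parameterized by shape a and rate b, that the second argument of the inverse Gaussian is its shape parameter, and that the squared group norm enters the inverse Gaussian mean as in the scalar case. The function name and all numerical settings are hypothetical.

```python
import numpy as np

def update_group_penalty(beta_norm_sq, gamma_sq, psi, M, a0, b0, rng):
    """One conditional draw of (tau^2, gamma^2) for a single coefficient group."""
    # tau^{-2} | . ~ inverse Gaussian(mean, shape); NumPy names this law "wald".
    mean = np.sqrt(gamma_sq * psi / beta_norm_sq)
    tau_sq = 1.0 / rng.wald(mean, gamma_sq)
    # gamma^2 | . ~ Gamma(shape, rate); NumPy's gamma expects scale = 1 / rate.
    shape = a0 + (M + 1) / 2.0
    rate = b0 + tau_sq / 2.0
    return tau_sq, rng.gamma(shape, 1.0 / rate)

rng = np.random.default_rng(1)
settings = dict(gamma_sq=1.0, psi=1.0, M=10, a0=1.0, b0=0.01, rng=rng)
# Group with a nearly zero norm versus a group with a large norm.
small = [update_group_penalty(1e-4, **settings)[1] for _ in range(2000)]
large = [update_group_penalty(100.0, **settings)[1] for _ in range(2000)]
median_small, median_large = np.median(small), np.median(large)
```

Under these assumptions, the draws of γ² for the nearly zero group are typically much larger than those for the large group, which matches the qualitative behavior described in the text.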

To conduct a full Bayesian analysis, we specify appropriate prior distributions for other unknown parameters, such as μ_s, π_s, and ζ_us. For u = 1, … ,S and s = 1, … , S, the following Gaussian priors are considered:

p (μ_{s}) \overset{ind}{\sim} N (μ_{s 0}, σ_{μ s 0}^{2}), p (π_{s}) \overset{ind}{\sim} N (π_{s 0}, σ_{π s 0}^{2}), p (ζ_{u s}) \overset{ind}{\sim} N (ζ_{u s 0}, σ_{ζ u s 0}^{2}),

(18)

where μ_s0, $σ_{μ s 0}^{2}$ , π_s0, $σ_{π s 0}^{2}$ , ζ_us0, and $σ_{ζ u s 0}^{2}$ are hyperparameters with preassigned values.

3.3 ∣. Posterior inference

The Bayesian estimate of θ can be obtained through the mean or mode of the posterior samples drawn from p(θ∣Y). However, directly sampling from p(θ∣Y) is intractable because of the existence of latent states. To address this issue, we adopt the data augmentation technique to work on p(θ, Z∣Y) and utilize the Gibbs sampler to simulate each of the unknowns from its full conditional distribution iteratively. Owing to the nonlinearity of the continuation-ratio logit transition model, the full conditional distributions related to the transition model have complex forms. Thus, Markov chain Monte Carlo (MCMC) methods, such as the forward filtering and backward sampling algorithm²⁶ and the Metropolis-Hastings algorithm,^27,28 are used to sample from them. The details of the full conditional distributions are provided in the Appendix.

For nonparametric functions involved in (1) and (3), as suggested by Li et al,²⁹ a functional effect of a covariate is detected as significant and included in the regression if at least one of its coefficients of the basis expansion has a two-sided 95% credible interval estimate that does not cover zero. The latent state Z_it, which usually has actual meaning in empirical studies, is also of great interest for scientists. By using posterior samples, we can estimate the hidden state as follows:

{\hat{Z}}_{i t} = \arg \max_{s \in \{1, \dots, S\}} P (Z_{i t} = s ∣ y_{i}, θ) \approx \arg \max_{s \in \{1, \dots, S\}} \frac{1}{L} \sum_{l = 1}^{L} I (Z_{i t}^{(l)} = s),

(19)

where $Z_{i t}^{(l)}$ denotes the latent allocation of y_it at the lth iteration, and $\frac{1}{L} \sum_{l = 1}^{L} I (Z_{i t}^{(l)} = s)$ is the posterior mean of the latent allocations of y_it drawn from the MCMC iterations.
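The estimator in (19) amounts to taking, for each subject and occasion, the state that was sampled most often. A minimal Python sketch follows, in which the array layout (iterations × subjects × occasions), the function name, and the toy draws are illustrative assumptions.

```python
import numpy as np

def estimate_states(Z_draws, S):
    """Posterior-mode state for each (i, t) from draws of shape (L, n, T)."""
    # freq[s - 1, i, t] = share of iterations in which Z_it was sampled as s
    freq = np.stack([(Z_draws == s).mean(axis=0) for s in range(1, S + 1)])
    return freq.argmax(axis=0) + 1  # convert back to labels 1, ..., S

# Toy draws (hypothetical): L = 3 iterations, n = 1 subject, T = 3 occasions.
Z_draws = np.array([
    [[1, 2, 2]],
    [[1, 1, 2]],
    [[2, 1, 2]],
])
Z_hat = estimate_states(Z_draws, S=2)
```

In case of ties, `argmax` returns the lowest label, which is an arbitrary convention of this sketch.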

3.4 ∣. Determination of the number of hidden states

In the analysis of HMMs, the number of hidden states, S, is usually determined a priori. We use a modified DIC, which was developed by Celeux et al,³⁰ for model comparison in the presence of incomplete data, to determine the number of hidden states of the proposed model. The modified DIC is defined as follows:

DIC = \overline{D (θ)} + p_{D},

(20)

where $\overline{D (θ)} = E_{θ, Z} [- 2 \log p (Y, Z ∣ θ) ∣ Y]$ is the posterior mean deviance, which reflects the goodness of fit of the model; p_D is the effective number of parameters, which penalizes an overcomplex model; and p_D = E_θ,Z[−2 log p(Y, Z∣θ)∣Y] + 2E_Z[log p(Y, Z∣E_θ[θ∣Y, Z])∣Y]. The expectations involved in (20) can be approximated by averaging the posterior samples collected through the MCMC algorithm.^30,31 The model with the smallest value of DIC is selected.
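The arithmetic of (20) can be sketched as follows, assuming the two sets of complete-data log-likelihood values have already been computed from the MCMC output: log p(Y, Z^(l)∣θ^(l)) at each retained draw, and log p(Y, Z^(l)∣E_θ[θ∣Y, Z^(l)]) at the corresponding plug-in estimate. How the plug-in estimate is obtained is not shown here. The function name and the numerical values are hypothetical and for illustration only.

```python
import numpy as np

def modified_dic(loglik_draws, loglik_plugin_draws):
    """Return (DIC, p_D) from Monte Carlo averages of the two expectations."""
    # Posterior mean deviance: E[-2 log p(Y, Z | theta) | Y]
    mean_deviance = -2.0 * np.mean(loglik_draws)
    # p_D = mean deviance + 2 E_Z[log p(Y, Z | E[theta | Y, Z]) | Y]
    p_d = mean_deviance + 2.0 * np.mean(loglik_plugin_draws)
    return mean_deviance + p_d, p_d

# Hypothetical log-likelihood values, for illustration only.
loglik_draws = np.array([-105.0, -103.0, -104.0])
loglik_plugin_draws = np.array([-101.0, -100.0, -102.0])
dic, p_d = modified_dic(loglik_draws, loglik_plugin_draws)
```

With these toy inputs, the mean deviance is 208, p_D is 6, and the DIC is 214. Candidate values of S would be compared by repeating this calculation for each fitted model.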

4 ∣. SIMULATION STUDY

This section contains two simulations: Simulation 1 assesses the empirical performance of the proposed BaGlasso for simultaneous estimation and variable selection in the context of semiparametric HMMs, and Simulation 2 examines the performance of the DIC in determining the number of hidden states in semiparametric HMMs.

4.1 ∣. Simulation 1

We consider 100 simulated data sets, each consisting of n = 700 subjects and T = 9 time points. For each data set, observations are generated from a two-state semiparametric HMM with a continuous response y_it, two discrete covariates c_it = (c_it1, c_it2)′ (p = 2), and three continuous covariates x_it = (x_it1, x_it2, x_it3)′ (q = 3). For i = 1, … , 700 and t = 1, … , 9, c_it1 and c_it2 are independently generated from the Bernoulli distribution with a probability of success of 0.5, and x_it1, x_it2, and x_it3 are generated from U(−1, 1), N(0, 1), and $N (\sqrt{t}, 1)$ , respectively, and they are standardized to the same scale beforehand. Here, x_it1 and x_it2 are set as time-invariant covariates, whereas x_it3 is set as a time-variant one. The conditional regression model is defined as follows:

[y_{i t} ∣ Z_{i t} = s] = μ_{s} + α_{s 1} c_{i t 1} + α_{s 2} c_{i t 2} + f_{s 1} (x_{i t 1}) + f_{s 2} (x_{i t 2}) + f_{s 3} (x_{i t 3}) + δ_{i t},

(21)

where f₁₁(x_it1) = 0, f₁₂(x_it2) = sin(1.5x_it2) + x_it2 – 0.6, f₁₃(x_it3) = −0.8x_it3, f₂₁(x_it1) = 2.08 – exp(x_it1), f₂₂(x_it2) = 0, and f₂₃(x_it3) = −0.105 + cos(2x_it3) + 0.5x_it3.

The transition model is defined as

\operatorname{logit} (ϑ_{i t u s}) = ζ_{u s} + {\tilde{α}}_{1} c_{i t 1} + {\tilde{α}}_{2} c_{i t 2} + g_{1} (x_{i t 1}) + g_{2} (x_{i t 2}) + g_{3} (x_{i t 3}),

(22)

where g₁(x_it1) = −log(2 + x_it1)/(2 – x_it1), g₂(x_it2) = 1.5x_it2, and g₃(x_it3) = 0. The true population values of the unknown parameters are set as μ = (μ₁, μ₂)′ = (−1, 1)′, π = (π₁, π₂)′ = (0.5, 0.5)′, ζ₁₁ = ζ₂₁ = 0.5, α₁ = (α₁₁, α₁₂)′ = (0, 0.5)′, α₂ = (α₂₁, α₂₂)′ = (−0.5, 0)′, $\tilde{α} = ({\tilde{α}}_{1}, {\tilde{α}}_{2})^{'} = (- 1, 0)^{'}$ , and ψ = (ψ₁, ψ₂)′ = (0.36, 0.16)′.

In this study, we use a simple version of natural cubic splines derived from a truncated power series basis function¹⁶ to approximate the nonparametric functions: h_j1(x_itj) = 1, h_j2(x_itj) = x_itj, and h_{j,m+2}(x_itj) = u_jm(x_itj) − u_{j,M_j–1}(x_itj) for m = 1, … , M_j – 2, where $u_{j m} (x_{i t j}) = [(x_{i t j} - κ_{j M_{j}})_{+}^{3} - (x_{i t j} - κ_{j m})_{+}^{3}] ∕ (κ_{j M_{j}} - κ_{j m})$ , and κ_jm, m = 1, … , M_j, are the knots taken in the range of x_itj. The prior inputs in (15), (16), and (18) are assigned as follows: μ_s0 = ζ_us0 = π_s0 = 0, $σ_{μ s 0}^{2} = σ_{ζ u s 0}^{2} = σ_{π 0}^{2} = 1$ , α_ψs0 = α_σ0 = 9, β_ψs0 = β_σ0 = 4, $α_{α s h 0} = {\tilde{α}}_{α h 0} = α_{β s j 0} = {\tilde{α}}_{β j 0} = 1$ , $β_{α s h 0} = {\tilde{β}}_{α h 0} = 0.1$ , and $β_{β s j 0} = {\tilde{β}}_{β j 0} = 0.01$ . For each x_itj, M_j = 10 knots are used. We impose the constraint μ₁ < μ₂ in each MCMC iteration to avoid label switching and check the convergence of the algorithm using the estimated potential scale reduction (EPSR) proposed by Gelman et al.³² The MCMC algorithm converges within 5000 iterations. Thus, we collect posterior samples with a size of 20 000 with the first 10 000 as burn-in iterations. The performance of Bayesian estimates is assessed through the bias (BIAS) and the root-mean-square error (RMSE) between the Bayesian estimates and the true population values of the parameters.
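The basis functions h_j1, … , h_{jM_j} described above can be written out directly. The following Python sketch follows the stated formulas term by term; the function name `natural_cubic_basis` and the equally spaced knots in the usage note are assumptions made for illustration, since the text says only that the knots are taken in the range of x_itj.

```python
import numpy as np

def natural_cubic_basis(x, knots):
    """Evaluate h_1..h_M at x for one covariate, following the text.

    h_1 = 1, h_2 = x, and h_{m+2} = u_m - u_{M-1} for m = 1..M-2, where
    u_m(x) = [(x - k_M)_+^3 - (x - k_m)_+^3] / (k_M - k_m).
    """
    x = np.asarray(x, dtype=float)
    k = np.asarray(knots, dtype=float)
    M = len(k)

    def pos3(v):
        # truncated cubic (v)_+^3
        return np.clip(v, 0.0, None) ** 3

    def u(m):
        # m is a 1-based knot index with 1 <= m <= M - 1
        return (pos3(x - k[M - 1]) - pos3(x - k[m - 1])) / (k[M - 1] - k[m - 1])

    cols = [np.ones_like(x), x]
    cols += [u(m) - u(M - 1) for m in range(1, M - 1)]
    return np.column_stack(cols)  # shape (len(x), M)
```

For example, `natural_cubic_basis(x, np.linspace(x.min(), x.max(), 10))` returns a matrix with M_j = 10 columns. Every column is zero below the first knot (apart from the constant and linear terms) and linear beyond the last knot, which is the defining property of a natural cubic spline.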

Table 1 summarizes the estimation results on the basis of the 100 data sets. The BIAS and RMSE for most of the parameters are close to zero, indicating a satisfactory performance of Bayesian estimation regarding the parametric part. Figure 1 depicts the averages of the pointwise posterior means of the nonparametric functions, along with their 2.5% and 97.5% pointwise quantiles. Three nonexistent functions are successfully shrunk to almost zero by the proposed BaGlasso procedure. The posterior means of other nonzero nonparametric functions are close to their true curves, and all the ranges of the 2.5% and 97.5% pointwise quantiles are small, indicating that the estimated nonparametric curves can correctly recover the complex functional relationships between the response and covariates. Moreover, the average of the correct classification rates calculated from (19) is approximately 95%, implying the good performance of the proposed method in identifying the hidden states of the observations.

TABLE 1.

Bayesian estimates of the parameters in the simulation study

Parameters in the Conditional Regression Model
State 1				State 2
Par	True	Est	RMSE	Par	True	Est	RMSE
μ₁	−1.0	−0.969	0.041	μ₂	1.0	1.006	0.033
α₁₁	0.0	−0.000	0.025	α₂₁	−0.5	−0.499	0.015
α₁₂	0.5	0.501	0.023	α₂₂	0.0	0.001	0.015
ψ₁	0.36	0.392	0.034	ψ₂	0.16	0.191	0.032

Parameters in the Probability Transition Model
Par	True	Est	RMSE	Par	True	Est	RMSE
${\tilde{α}}_{1}$	−1.0	−0.985	0.080	${\tilde{α}}_{2}$	0.0	−0.000	0.055
π₁	0.5	0.528	0.036	π₂	0.5	0.472	0.036
ζ₁₁	0.5	0.501	0.152	ζ₂₁	0.5	0.504	0.152

Abbreviation: RMSE, root-mean-square error.

FIGURE 1.

Estimates of the unknown smooth functions in the simulation study. The solid curves represent the true curves, and the dashed curves represent the estimated posterior means and the 2.5% and 97.5% pointwise quantiles on the basis of 100 replications.

To reveal the sensitivity of Bayesian estimates to the input of prior distributions, we disturb the prior input as follows: μ_s0 = ζ_us0 = π_s0 = 2, $σ_{μ s 0}^{2} = σ_{ζ u s 0}^{2} = σ_{π s 0}^{2} = 2$ , α_ψs0 = 3, β_ψs0 = 2, $α_{α s h 0} = {\tilde{α}}_{α h 0} = α_{β s j 0} = {\tilde{α}}_{β j 0} = 1$ , $β_{α s h 0} = {\tilde{β}}_{α h 0} = 0.5$ , and $β_{β s j 0} = {\tilde{β}}_{β j 0} = 0.01$ . The Bayesian results obtained under the disturbed prior are similar and not reported.

Notably, this simulation study contains five covariates in the conditional and transition models, which results in a large number (2^{2×5} = 1024) of competing models with various combinations of covariates in both models. Traditional Bayesian model selection statistics, such as the Bayes factor and the DIC, are extremely time consuming in performing variable selection because they compare these competing models on a pairwise basis. By contrast, the proposed BaGlasso procedure automatically selects important predictors and avoids the tedious pairwise comparison, thereby greatly reducing the computational time. In this simulation study, the computing time for simultaneous variable selection and parameter estimation in each replication is 48 minutes on a PC with an Intel Core i7-6700 3.40-GHz CPU and 16 GB of RAM.

4.2 ∣. Simulation 2

To examine the performance of the DIC in determining the number of hidden states of a semiparametric HMM, we consider five competing models M₁, M₂, M₃, M₄, and M₅, where M_s is a model defined by (1)–(3) with S = s, s = 1, … , 5. Here, M₄ is the true model, whereas M₁, M₂, M₃, and M₅ are models with incorrect numbers of hidden states. To mimic the scenario of the ADNI data set in the subsequent real example, we generate 100 data sets from (1)–(3) with S = 4, n = 633, T = 4, p = 4, and q = 3. For i = 1, … , 633 and t = 1, … , 4, c_it1 to c_it4 are independently generated from the Bernoulli distribution with a probability of success of 0.5, and x_it1, x_it2, and x_it3 are generated from U(−1, 1), N(0, 1), and $N (\sqrt{t}, 1)$ , respectively, and they are standardized prior to analysis. The true functions are set as f₁₁(x_it1) = 0, f₁₂(x_it2) = sin(1.5x_it2) + x_it2 – 0.6, f₁₃(x_it3) = –0.8x_it3, f₂₁(x_it1) = 2.08 – exp(x_it1), f₂₂(x_it2) = 0, f₂₃(x_it3) = −0.105 + cos(2x_it3) + 0.5x_it3, f₃₁(x_it1) = 0.5x_it1, f₃₂(x_it2) = 0, f₃₃(x_it3) = −x_it3, f₄₁(x_it1) = 2x_it1, f₄₂(x_it2) = 1.5x_it2, f₄₃(x_it3) = 0, g₁(x_it1) = −log(2 + x_it1)/(2 − x_it1), g₂(x_it2) = 1.5x_it2, and g₃(x_it3) = 0. The true population values of the unknown parameters are set as μ = (μ₁, μ₂, μ₃, μ₄)′ = (−4, −2, 2, 4)′, π = (π₁, π₂, π₃, π₄)′ = (0.25, 0.25, 0.25, 0.25)′, ζ₁₁ = ζ₂₁ = ζ₃₁ = ζ₄₁ = −1, ζ₁₂ = ζ₂₂ = ζ₃₂ = ζ₄₂ = 0, ζ₁₃ = ζ₂₃ = ζ₃₃ = ζ₄₃ = 1, α₁ = (α₁₁, α₁₂, α₁₃, α₁₄)′ = (1, 0, 0.5, 1)′, α₂ = (α₂₁, α₂₂, α₂₃, α₂₄)′ = (0.5, −0.5, 0, −1)′, α₃ = (α₃₁, α₃₂, α₃₃, α₃₄)′ = (0.5, −1, 1, 0)′, α₄ = (α₄₁, α₄₂, α₄₃, α₄₄)′ = (0.5, 1, −0.5, 0)′, $\tilde{α} = ({\tilde{α}}_{1}, {\tilde{α}}_{2}, {\tilde{α}}_{3}, {\tilde{α}}_{4})^{'} = (- 1, 0.5, 0, 1)^{'}$ , and ψ = (ψ₁, ψ₂, ψ₃, ψ₄)′ = (0.16, 0.16, 0.16, 0.16)′. The prior distributions and other settings are specified in the same manner as in Simulation 1.
On the basis of the 100 simulated data sets, the means and standard deviations of the DIC values for M₁ to M₅ are reported in Table 2, which suggests that the true model M₄ is consistently selected in each of the 100 replications.

TABLE 2.

Summary of deviance information criterion (DIC) values in the simulation study

Competing Model	DIC (mean)	DIC (std)	No. of Selections
M₁	12 018	79	0
M₂	10 912	92	0
M₃	10 124	461	0
M₄	8988	128	100
M₅	10 052	158	0

Note: No. of selections represents the number of times that the DIC value of M_s (s = 1, … , 5) is the smallest among all competing models in 100 replications.

The computer code for conducting the preceding analyses is written in R and is freely available at http://www.sta.cuhk.edu.hk/xysong/codes/BaGLassoHMMs.

5 ∣. ADNI STUDY

To demonstrate the empirical utility of our proposed method, we conduct real data analysis on the basis of the ADNI study. The data set collected imaging, genetic, clinical, and cognitive data from participants under CN controls and participants with mild cognitive impairment or AD. ADNI-1 was first conducted in 2004, and several extensions, namely, ADNI-GO, ADNI-2, and ADNI-3, followed afterward. In this study, we focused on 633 participants collected from ADNI-1 and included their clinical and genetic variables at four time points, namely, baseline, 6 months, 12 months, and 24 months. Functional Assessment Questionnaire (FAQ), a widely used assessment of abilities to function independently in daily life, was used as a response variable (y_it) to reflect cognitive decline over time. Patients with higher FAQ scores have lower cognitive abilities. Three continuous covariates, namely, the logarithm of the ratio of hippocampal volume over whole brain (x_it1), age at baseline (x_it2), and years of education (x_it3), were considered. Moreover, we included a genetic variable, APOE-ϵ4 (c_it1 and c_it2), which was coded as 0, 1, and 2, denoting the number of APOE-ϵ4 alleles. Other discrete demographic characteristics, such as gender (c_it3, 0 = female; 1 = male) and marital status (c_it4, 0 = has been married; 1 = has not been married), were also included. The three continuous variables, namely, FAQ score, hippocampus, and age, were standardized prior to analysis. The main objective of this study is to examine the complex effects of potential risk factors on the transition of neurodegenerative states and on the cognitive decline of participants across different states.

We first determined the number of hidden states. We considered five competing models M_k, k = 1, … , 5, where M_k represents a semiparametric HMM defined in (1)–(3) with k states. We used natural cubic splines for h_itj and M_j = 10 in approximating the unknown smoothing functions. The hyperparameters were assigned in the same manner as those in the simulation study, and the identifiability constraint μ₁ < ⋯ < μ₅ was taken to avoid label switching. We generated several MCMC chains with different initial values to monitor the convergence of the MCMC algorithm. The EPSR plot depicted in Figure 2 indicated that the MCMC algorithm converged within 10 000 iterations. Therefore, we collected 10 000 observations after discarding 10 000 burn-in iterations to calculate the DIC values of the competing models.

FIGURE 2.

Plot of estimated potential scale reduction (EPSR) values for the parameters in the ADNI-1 (Alzheimer’s Disease Neuroimaging Initiative) data analysis. The horizontal dotted line is for EPSR = 1.2. MCMC, Markov chain Monte Carlo
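For readers who want to reproduce this kind of convergence check, the following Python sketch computes a potential scale reduction value for one scalar parameter from several parallel chains. It uses the common simplified form of the Gelman-Rubin statistic (pooled variance over within-chain variance); whether the authors' R code applies exactly this form or a corrected variant is not stated, and the function name `epsr` is an assumption.

```python
import numpy as np

def epsr(chains):
    """Potential scale reduction for one scalar parameter.

    chains: array of shape (m, n) holding m parallel chains,
            each with n post-burn-in draws.
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()      # mean within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)    # between-chain variance
    var_plus = (n - 1) / n * W + B / n         # pooled variance estimate
    return np.sqrt(var_plus / W)
```

Values near 1 suggest that the chains have mixed; the dotted reference line at 1.2 in the EPSR plot corresponds to the usual rule-of-thumb threshold.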

The values of $\overline{D (θ)}$ , p_D, and DIC corresponding to M₁ to M₄ are reported in Table 3. When fitting the data to M₅, the MCMC algorithm broke down after several iterations. After carefully checking the results, we found that one of the states contained fewer than six subjects after several iterations. This phenomenon implies the nonexistence of such a state and the inapplicability of the five-state model in this study. On the basis of the results in Table 3, the four-state model M₄ with the smallest DIC was selected. Then, we used the proposed BaGlasso procedure to conduct a simultaneous estimation and variable selection under M₄. Results are presented in Table 4 (parametric part) and Figure 3 (nonparametric part), in which only significant functional effects are reported.

TABLE 3.

Summary of deviance information criterion (DIC) values in the analysis of the Alzheimer’s Disease Neuroimaging Initiative data set

Competing Model	$\overline{D (θ)}$	p_D	DIC
M₁	6294	35	6329
M₂	1434	69	1503
M₃	1016	97	1113
M₄	972	126	1098

TABLE 4.

Estimation results in the ADNI-1 (Alzheimer’s Disease Neuroimaging Initiative) data analysis: parametric part

Parameters in the Conditional Regression Model
State 1			State 2			State 3			State 4
Par	Est	SE	Par	Est	SE	Par	Est	SE	Par	Est	SE
μ₁	−0.608	0.005	μ₂	−0.200	0.032	μ₃	0.948	0.075	μ₄	2.466	0.127
α₁₁	0.000	0.005	α₂₁	0.059	0.040	α₃₁	0.113	0.082	α₄₁	0.256	0.151
α₁₂	0.015	0.013	α₂₂	0.012	0.040	α₃₂	0.068	0.086	α₄₂	0.120	0.143
α₁₃	0.003	0.005	α₂₃	0.019	0.031	α₃₃	−0.303	0.107	α₄₃	−0.427	0.157
α₁₄	0.003	0.005	α₂₄	0.008	0.030	α₃₄	−0.047	0.073	α₄₄	−0.115	0.143
ψ₁	0.009	0.000	ψ₂	0.073	0.008	ψ₃	0.173	0.020	ψ₄	0.437	0.052

Parameters in the Transition Model
Par	Est	SE	Par	Est	SE	Par	Est	SE	Par	Est	SE
${\tilde{α}}_{1}$	−0.386	0.174	${\tilde{α}}_{2}$	−0.821	0.253	${\tilde{α}}_{3}$	0.012	0.078	${\tilde{α}}_{4}$	−0.150	0.132
π₁	0.592	0.022	π₂	0.198	0.022	π₃	0.149	0.018	π₄	0.060	0.014
ζ₁₁	2.513	0.165	ζ₂₁	−1.459	0.246	ζ₃₁	−3.278	0.451	ζ₄₁	−3.343	0.500
ζ₁₂	2.395	0.418	ζ₂₂	1.498	0.253	ζ₃₂	−1.674	0.331	ζ₄₂	−3.320	0.498
ζ₁₃	1.405	0.740	ζ₂₃	2.840	0.447	ζ₃₃	1.657	0.279	ζ₄₃	−2.017	0.426

FIGURE 3.

ADNI-1 (Alzheimer’s Disease Neuroimaging Initiative) data analysis results: the estimates of significant unknown smooth functions at the corresponding states. The solid curves represent the pointwise mean curves, and the dashed curves represent the 2.5% and 97.5% pointwise quantiles. Line y = 0 is denoted in each picture by a red dot-dash line to illustrate the range of significant effects for each risk factor. FAQ, Functional Assessment Questionnaire

We obtain the following observations: first, intercepts μ₁, μ₂, μ₃, and μ₄ were ranked in ascending order. Patients in state 1 had the lowest mean score of FAQ, whereas those in state 4 received the highest mean score. That is, patients’ cognitive ability reflected by independent functioning in daily life steadily deteriorated from state 1 to state 4. According to the existing literature,³³ state 1 to state 4 can be explained as CN, early mild cognitive impairment (EMCI), late mild cognitive impairment (LMCI), and AD, respectively.

Second, BaGlasso selected six significant functional effects across the states. The effect of hippocampus on the FAQ score exhibits a descending trend in all the states. Specifically, in the CN state, participants with a greater hippocampal volume tend to have slightly better memory. This result is consistent with the common understanding that the hippocampus helps consolidate outside information from short-term memory to long-term memory. In EMCI and LMCI states, the magnitude of the functional effect of the hippocampus on FAQ becomes increasingly large, confirming that atrophy in hippocampal volume continuously impairs patients’ cognitive ability during the progression from EMCI to LMCI. Published medical reports^34-36 also reported a similar result, namely, that the loss of hippocampal volume greatly affects dementia. In the AD state, preventing the loss of hippocampal volume is still beneficial to postpone cognitive decline, but this effect is significant only in a small range of hippocampal volume. The effect of age on FAQ is nonsignificant in the first three states, implying that age influences cognitive function mainly in the AD state. Relatively younger AD patients (around 75 years old) have better functional independence in daily life compared with older ones. This age effect was also revealed by previous research.^37,38 The effect of educational level on FAQ is likewise significant only in the AD state. Such effect becomes large when educational level is high, indicating that patients with higher educational levels tend to experience more pronounced cognitive decline compared with patients with lower educational levels. This finding is in line with the existing literature.^39,40

Third, for the parametric part, gender has a negative effect on FAQ in the LMCI and AD states, implying that women suffer more serious cognitive decline than men in the late progression period of AD. This result agrees with existing medical reports.^41-43

Fourth, in the transition model, the functional effect of the hippocampus exhibits an ascending trend with the growth of hippocampal volume. In the progression of AD, patients with larger hippocampal volumes are more likely to remain in the current state rather than transit to a worse one compared with those with smaller hippocampal volumes. By contrast, patients with APOE-ϵ4 alleles are more likely to transit to a worse state rather than remain in the current one. Thus, APOE-ϵ4 alleles are important risk factors for the development of AD. This result is consistent with the existing finding.⁴⁴ However, the estimates of other covariates, such as age, educational level, gender, and marital status, were shrunk to nearly zero by BaGlasso, implying that conditional on hippocampus and APOE-ϵ4, the direct effects of age, educational level, gender, and marital status on the transition probability are weak.

For comparison, we reanalyzed the ADNI data set using a parametric HMM as follows:

\begin{matrix} [y_{i t} ∣ Z_{i t} = s] & = μ_{s} + α_{s 1} c_{i t 1} + α_{s 2} c_{i t 2} + α_{s 3} c_{i t 3} + α_{s 4} c_{i t 4} + β_{s 1} x_{i t 1} + β_{s 2} x_{i t 2} + β_{s 3} x_{i t 3} + δ_{i t}, \\ \operatorname{logit} (ϑ_{i t u s}) & = ζ_{u s} + {\tilde{α}}_{1} c_{i t 1} + {\tilde{α}}_{2} c_{i t 2} + {\tilde{α}}_{3} c_{i t 3} + {\tilde{α}}_{4} c_{i t 4} + {\tilde{β}}_{1} x_{i t 1} + {\tilde{β}}_{2} x_{i t 2} + {\tilde{β}}_{3} x_{i t 3} . \end{matrix}

The Bayesian adaptive lasso procedure was used to perform estimation. Table 5 presents the results of parameters β_sj and ${\tilde{β}}_{j}$ . The results of μ_s, ζ_us, α_sh, and ${\tilde{α}}_{h}$ are similar to those in Table 4 and not reported. Several differences exist between the results obtained using the parametric and semiparametric HMMs. First, the parametric model shows a negative constant effect of the hippocampus on FAQ in the CN, EMCI, and LMCI states, whereas the semiparametric model reveals that these negative effects have a descending trend. Second, the parametric model indicates that the effects of the hippocampus, age, and educational level on FAQ are all insignificant in the AD state, whereas the semiparametric model reveals that these effects are actually significant in certain covariate ranges. Finally, the parametric model shows that the effect of age on FAQ is negative in the CN and EMCI states but positive in the LMCI state. This diverse effect is hard to interpret and is probably caused by overlooking the subtle structure of the age effect in the parametric model.

TABLE 5.

Estimation results of the parametric hidden Markov model in the ADNI-1 (Alzheimer’s Disease Neuroimaging Initiative) data analysis

Parameters in the Conditional Regression Model
State 1			State 2			State 3			State 4
Par	Est	SE	Par	Est	SE	Par	Est	SE	Par	Est	SE
β₁₁	−0.022	0.004	β₂₁	−0.122	0.023	β₃₁	−0.155	0.039	β₄₁	−0.127	0.065
β₁₂	−0.006	0.003	β₂₂	−0.008	0.017	β₃₂	0.070	0.034	β₄₂	0.088	0.055
β₁₃	−0.004	0.003	β₂₃	−0.014	0.018	β₃₃	0.030	0.029	β₄₃	0.025	0.051

Parameters in the Transition Model
Par	Est	SE	Par	Est	SE	Par	Est	SE	Par	Est	SE
${\tilde{β}}_{1}$	0.351	0.042	${\tilde{β}}_{2}$	−0.033	0.034	${\tilde{β}}_{3}$	0.004	0.023

6 ∣. CONCLUSION

In this paper, we have introduced a BaGlasso procedure to conduct simultaneous variable selection and parameter estimation in the context of semiparametric HMMs. We developed a full Bayesian approach, along with efficient MCMC methods and the basis expansion technique, to implement the procedure and estimate nonparametric functions. The methodology was demonstrated by a simulation study and an application to the analysis of the ADNI data set. In the proposed model, covariates are allowed to affect both responses and transition probabilities. This feature enables the model to cope with general situations where certain covariates simultaneously influence the two stochastic processes in various ways. An alternative method of including covariates in HMMs is to use an exclusion restriction to split the overall set of covariates into two groups: one contains covariates affecting only the responses, and the other contains covariates affecting the hidden-state transition. However, determining such an exclusion restriction may be subjective and difficult to justify in practice, which, in turn, elicits model selection issues.

This study can be extended in several directions. First, in approximating nonparametric functions, we considered only a simple version of natural cubic splines. Highly sophisticated smoothing techniques, such as splines and local polynomial kernel methods, may be used to enhance the performance of estimation and variable selection. Second, we simply used a single indicator, FAQ, to reflect cognitive ability in the ADNI data analysis. A comprehensive way to characterize cognitive function is to account for other relevant tests, such as the Alzheimer’s Disease Assessment Scale and the Mini-Mental State Examination. Grouping such highly correlated but different perspectives into an integrated latent variable through factor analysis can improve the analytic power and interpretability of the model. Finally, our model framework includes only binary and continuous variables. Given that ordered and unordered categorical data are frequently encountered in medical, social, and psychological sciences, generalizing the existing framework to accommodate a wide variety of data types is of great interest.

ACKNOWLEDGEMENTS

The work of Xinyuan Song was supported by the Research Grants Council of Hong Kong under grant 14303017, The Chinese University of Hong Kong under direct grants, and the National Natural Science Foundation of China under grant 11471277. The work of Joan Hu was supported by the Canadian Institutes of Health Research under grant RN120660 and the Natural Sciences and Engineering Research Council of Canada under grant 177430. The authors are thankful to the Editor, the Associate Editor, and three anonymous reviewers for their valuable comments and suggestions.

Funding information

Research Grants Council of Hong Kong, Grant/Award Number: 14303017; National Natural Science Foundation of China, Grant/Award Number: 11471277; Canadian Institutes of Health Research, Grant/Award Number: RN120660; Natural Sciences and Engineering Research Council of Canada, Grant/Award Number: DAS 177430

APPENDIX A

FULL CONDITIONAL DISTRIBUTIONS

A.1 ∣. Full conditional distributions of Z_it

Let y_i = (y_i1, … , y_iT)′, $d_{i t} = (c_{i t}^{'}, x_{i t}^{'})^{'}$ , and $D_{i} = (d_{i 1}^{'}, \dots, d_{i T}^{'})^{'}$ . Then, we have

p (Z_{i t} ∣ \cdot) \propto p (y_{i}, D_{i}, Z_{i t} ∣ θ) = p (y_{i 1}, \dots, y_{i t}, d_{i 1}, \dots, d_{i t}, Z_{i t} ∣ θ) \times p (y_{i, t + 1}, \dots, y_{i T}, d_{i, t + 1}, \dots, d_{i T} ∣ Z_{i t}, θ) = q_{i t} (y_{i}, D_{i}, Z_{i t} ∣ θ) \times \bar{q}_{i t} (y_{i}, D_{i} ∣ Z_{i t}, θ) .

We first initialize q_i1(y_i, D_i, Z_i1∣θ) = p(y_i1, d_i1, Z_i1∣θ) = p(y_i1∣d_i1, Z_i1, θ)p(Z_i1∣θ) and calculate q_it(y_i, D_i, Z_it∣θ) for t = 2, … , T, recursively as follows:

\begin{matrix} q_{i t} (y_{i}, D_{i}, Z_{i t} ∣ θ) = & p (y_{i 1}, \dots, y_{i t}, d_{i 1}, \dots, d_{i t}, Z_{i t} ∣ θ) \\ = & \sum_{u = 1}^{S} p (y_{i 1}, \dots, y_{i t}, d_{i 1}, \dots, d_{i t}, Z_{i t}, Z_{i, t - 1} = u ∣ θ) \\ = & \sum_{u = 1}^{S} [p (y_{i 1}, \dots, y_{i, t - 1}, d_{i 1}, \dots, d_{i, t - 1}, Z_{i, t - 1} = u ∣ θ) \times p (Z_{i t} ∣ Z_{i, t - 1} = u, d_{i t}, θ) \times p (y_{i t} ∣ d_{i t}, Z_{i t}, θ)] \\ = & \sum_{u = 1}^{S} [q_{i, t - 1} (y_{i}, D_{i}, Z_{i, t - 1} = u ∣ θ) \times p (Z_{i t} ∣ Z_{i, t - 1} = u, d_{i t}, θ) \times p (y_{i t} ∣ d_{i t}, Z_{i t}, θ)], \end{matrix}

(A1)

where p(Z_it∣Z_i,t–1 = u, d_it, θ) and p(y_it∣d_it, Z_it, θ) can be calculated on the basis of (8).

Similarly, we initialize $\bar{q}_{i T} (y_{i}, D_{i} ∣ Z_{i T}, θ) = 1$ and calculate $\bar{q}_{i t} (y_{i}, D_{i} ∣ Z_{i t}, θ)$ for t = T – 1, … , 1 as follows:

\begin{matrix} \bar{q}_{i t} (y_{i}, D_{i} ∣ Z_{i t}, θ) = & p (y_{i, t + 1}, \dots, y_{i T}, d_{i, t + 1}, \dots, d_{i T} ∣ Z_{i t}, θ) \\ = & \sum_{u = 1}^{S} p (y_{i, t + 1}, \dots, y_{i T}, d_{i, t + 1}, \dots, d_{i T}, Z_{i, t + 1} = u ∣ Z_{i t}, θ) \\ = & \sum_{u = 1}^{S} [p (y_{i, t + 2}, \dots, y_{i T}, d_{i, t + 2}, \dots, d_{i T} ∣ Z_{i, t + 1} = u, θ) \times p (Z_{i, t + 1} = u ∣ Z_{i t}, d_{i, t + 1}, θ) \times p (y_{i, t + 1} ∣ d_{i, t + 1}, Z_{i, t + 1} = u, θ)] \\ = & \sum_{u = 1}^{S} [\bar{q}_{i, t + 1} (y_{i}, D_{i} ∣ Z_{i, t + 1} = u, θ) \times p (Z_{i, t + 1} = u ∣ Z_{i t}, d_{i, t + 1}, θ) \times p (y_{i, t + 1} ∣ d_{i, t + 1}, Z_{i, t + 1} = u, θ)] . \end{matrix}

(A2)

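Thus, Z_it can be directly generated from its full conditional distribution, which is proportional to the product of q_it(·) and $\bar{q}_{i t} (\cdot)$, once all the q_it(·) and $\bar{q}_{i t} (\cdot)$ terms defined in (A1) and (A2) have been calculated.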
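The two recursions (A1) and (A2) and the resulting draw of Z_it can be sketched as follows for a single subject. This is a minimal illustration, not the authors' implementation: the function names and the array layouts for the initial, transition, and emission probabilities are assumptions, and the probabilities themselves are taken as given rather than computed from the model equations. No rescaling is applied, which is adequate for short series such as T = 4 or T = 9 but could underflow for long ones.

```python
import numpy as np

def state_posteriors(init, trans, emis):
    """Normalized q_t * qbar_t for one subject, for every t and state s.

    init:  (S,)      initial probabilities p(Z_1 = s)
    trans: (T, S, S) trans[t, u, s] = p(Z_t = s | Z_{t-1} = u, d_t);
                     trans[0] is unused
    emis:  (T, S)    emis[t, s] = p(y_t | d_t, Z_t = s)
    """
    T, S = emis.shape
    q = np.zeros((T, S))      # forward terms, as in (A1)
    qbar = np.ones((T, S))    # backward terms, as in (A2); last row stays 1
    q[0] = init * emis[0]
    for t in range(1, T):
        q[t] = (q[t - 1] @ trans[t]) * emis[t]
    for t in range(T - 2, -1, -1):
        qbar[t] = trans[t + 1] @ (qbar[t + 1] * emis[t + 1])
    post = q * qbar
    return post / post.sum(axis=1, keepdims=True)

def sample_states(post, rng):
    """Draw each Z_t (coded 1..S) from its normalized full conditional."""
    return np.array([rng.choice(len(p), p=p) + 1 for p in post])
```

A call such as `sample_states(state_posteriors(init, trans, emis), np.random.default_rng(0))` returns one allocation per time point for the subject.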

A.2 ∣. Full conditional distributions of μ_s, α_s, and ψ_s

[μ_{s} ∣ \cdot] \sim N [μ_{s}^{*}, σ_{μ s}^{*}], [α_{s} ∣ \cdot] \sim N [α_{s}^{*}, Σ_{α s}^{*}], [Ψ_{s}^{- 1} ∣ \cdot] \sim Gamma [α_{Ψ s}^{*}, β_{Ψ s}^{*}]

(A3)

In the above equation, $α_{Ψ s}^{*} = (n_{s} + p + \sum_{j = 1}^{q} M_{j}) ∕ 2 + {\tilde{α}}_{s 0}$ , $σ_{μ s}^{*} = (n_{s} Ψ_{s}^{- 1} + σ_{μ s 0}^{- 1})^{- 1}$ , and

\begin{matrix} β_{Ψ s}^{*} = {\tilde{β}}_{s 0} + \frac{1}{2} [\sum_{i = 1}^{n} \sum_{t = 1}^{T} I (Z_{i t} = s) {(y_{i t} - μ_{s} - α_{s}^{'} c_{i t} - \sum_{j = 1}^{q} β_{s j}^{'} h_{i t j})}^{2} + \sum_{j = 1}^{q} \frac{‖ β_{s j} ‖_{G_{s j}}^{2}}{τ_{β s j}^{2}} + \sum_{h = 1}^{p} \frac{∣ α_{s h} ∣^{2}}{τ_{α s h}^{2}}], \\ Σ_{α s}^{*} = {(\sum_{i = 1}^{n} \sum_{t = 1}^{T} c_{i t} c_{i t}^{'} Ψ_{s}^{- 1} I (Z_{i t} = s) + D_{α s}^{- 1})}^{- 1}, D_{α s} = diag (τ_{α s 1}^{2}, \dots, τ_{α s p}^{2}), \\ μ_{s}^{*} = σ_{μ s}^{*} [Ψ_{s}^{- 1} \sum_{i = 1}^{n} \sum_{t = 1}^{T} I (Z_{i t} = s) (y_{i t} - α_{s}^{'} c_{i t} - \sum_{j = 1}^{q} β_{s j}^{'} h_{i t j}) + σ_{μ s 0}^{- 1} μ_{s 0}], \\ α_{s}^{*} = Σ_{α s}^{*} [Ψ_{s}^{- 1} \sum_{i = 1}^{n} \sum_{t = 1}^{T} I (Z_{i t} = s) c_{i t} (y_{i t} - μ_{s} - \sum_{j = 1}^{q} β_{s j}^{'} h_{i t j}) + Σ_{α s 0}^{- 1} α_{s 0}] . \end{matrix}

A.3 ∣. Full conditional distributions of β_sj

[β_{s j} ∣ \cdot] \sim N [β_{s j}^{*}, Σ_{s j}^{*}] I (1_{n_{s}}^{'} H_{s j} β_{s j} = 0)

(A4)

In the above equation, $Σ_{s j}^{*} = Ψ_{s} (H_{s j}^{'} H_{s j} + τ_{β s j}^{- 1} G_{s j})^{- 1}$ , $β_{s j}^{*} = Ψ_{s}^{- 1} Σ_{s j}^{*} H_{s j}^{'} y_{s}^{*}$ , and $y_{s}^{*} = \{y_{i t s}^{*}\}$ is an n_s × 1 vector with

y_{i t s}^{*} = y_{i t} - μ_{s} - α_{s}^{'} c_{i t} - \sum_{l \neq j, l = 1}^{q} β_{s l}^{'} h_{i t l}, for Z_{i t} = s .
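One standard way to draw from a multivariate normal distribution subject to a single linear equality constraint, such as the one in (A4), is to draw an unconstrained vector and then project it onto the constraint. The Python sketch below illustrates that generic technique (often called conditioning by kriging); it is not necessarily the procedure of Song and Lu¹⁸ that the paper refers to, and the function name and arguments are assumptions. In the notation of (A4), `mean` would be β*_sj, `cov` would be Σ*_sj, and `a` would be H′_sj 1_{n_s}.

```python
import numpy as np

def sample_constrained_normal(mean, cov, a, rng):
    """Draw beta ~ N(mean, cov) conditional on a' beta = 0.

    Draws an unconstrained vector, then subtracts the correction
    cov a (a' cov a)^{-1} a' beta so that the constraint holds exactly.
    """
    beta = rng.multivariate_normal(mean, cov)
    ca = cov @ a
    return beta - ca * (a @ beta) / (a @ ca)
```

The corrected draw has the conditional distribution of β given a′β = 0, so no rejection step is needed.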

A.4 ∣. Full conditional distributions of π_s, ζ_us, and $\tilde{α}$

\begin{matrix} p (π_{s} ∣ \cdot) \propto \exp \{\sum_{u = s}^{S} \sum_{i = 1}^{n} \log (p_{i 10 u}) \times I (Z_{i 1} = u) - \frac{(π_{s} - π_{s 0})^{2}}{2 σ_{π 0}^{2}}\}, \\ p (ζ_{u s} ∣ \cdot) \propto \exp \{\sum_{ν = s}^{S} \sum_{i = 1}^{n} \sum_{t = 2}^{T} \log (p_{i t u ν}) \times I (Z_{i t} = ν, Z_{i, t - 1} = u) - \frac{(ζ_{u s} - ζ_{u s 0})^{2}}{2 σ_{ζ u s 0}^{2}}\}, \\ p (\tilde{α} ∣ \cdot) \propto \exp \{\sum_{i = 1}^{n} \sum_{t = 2}^{T} \log (p_{i t u s}) \times I (Z_{i t} = s, Z_{i, t - 1} = u) - \frac{1}{2} (\tilde{α} - {\tilde{α}}_{0})^{'} {\tilde{D}}_{α}^{- 1} (\tilde{α} - {\tilde{α}}_{0})\} . \end{matrix}

(A5)

In the above equation, ${\tilde{D}}_{α} = σ^{2} diag ({\tilde{τ}}_{α 1}^{2}, \dots, {\tilde{τ}}_{α p}^{2})$ , and p_itu0 and p_itus can be calculated on the basis of (9).

A.5 ∣. Full conditional distributions of ${\tilde{β}}_{j}$

p ({\tilde{β}}_{j} ∣ \cdot) \propto \exp \{\sum_{i = 1}^{n} \sum_{t = 2}^{T} \log (p_{i t u s}) \times I (Z_{i t} = s, Z_{i, t - 1} = u) - \frac{1}{2} {({\tilde{β}}_{j} - {\tilde{β}}_{j 0})}^{'} {\tilde{D}}_{β j}^{- 1} ({\tilde{β}}_{j} - {\tilde{β}}_{j 0})\}

(A6)

The above equation is with the constraint $1_{n (T - 1)}^{'} H_{j} {\tilde{β}}_{j} = 0$ , where ${\tilde{D}}_{β j} = σ^{2} {\tilde{τ}}_{β j}^{2} {\tilde{G}}_{j}^{- 1}$ , and p_itus can be calculated on the basis of (9).

Notably, the full conditional distributions in (A5) and (A6) are not familiar probability distributions. Therefore, the Metropolis-Hastings algorithm is used to sample from them. Besides, the full conditional distributions in (A4) and (A6) involve constraints, and the procedure for sampling from them can be found in the work of Song and Lu.¹⁸
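A Metropolis-Hastings update for a nonstandard full conditional such as those in (A5) and (A6) needs only the logarithm of the unnormalized density. The Python sketch below uses a Gaussian random-walk proposal, which is an assumption: the text does not state which proposal distribution the authors use, and the function name and arguments are hypothetical.

```python
import numpy as np

def random_walk_mh(log_target, start, n_iter, step, rng):
    """Random-walk Metropolis-Hastings for an unnormalized log density.

    log_target: function mapping a 1-D array to a scalar log density
    start:      starting value (scalar or 1-D array)
    step:       standard deviation of the Gaussian proposal
    Returns the draws, shape (n_iter, dim), and the acceptance rate.
    """
    x = np.atleast_1d(np.asarray(start, dtype=float))
    lp = log_target(x)
    draws = np.empty((n_iter, x.size))
    accepted = 0
    for i in range(n_iter):
        prop = x + step * rng.standard_normal(x.size)
        lp_prop = log_target(prop)
        # The proposal is symmetric, so the acceptance ratio
        # reduces to the ratio of target densities.
        if np.log(rng.uniform()) < lp_prop - lp:
            x, lp = prop, lp_prop
            accepted += 1
        draws[i] = x
    return draws, accepted / n_iter
```

Within a Gibbs sampler, a single iteration of this update would typically be run for each of π_s, ζ_us, $\tilde{α}$, and ${\tilde{β}}_{j}$ in turn, with `log_target` set to the log of the corresponding expression in (A5) or (A6).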

REFERENCES

1.Bartolucci F, Farcomeni A. A multivariate extension of the dynamic logit model for longitudinal data based on a latent Markov heterogeneity structure. J Am Stat Assoc. 2009;104:816–831. [Google Scholar]
2.Chow SM, Grimm KJ, Filteau G, Dolan CV, McArdle JJ. Regime-switching bivariate dual change score model. Multivar Behav Res. 2013;48:463–502. [DOI] [PubMed] [Google Scholar]
3.Vermunt JK, Langeheine R, Bockenholt U. Discrete-time discrete-state Latent Markov models with time-constant and time-varying covariates. J Educ Behav Stat. 1999;24:179–207. [Google Scholar]
4.Schmittmann VD, Dolan CV, van der Maas HL, Neale MC. Discrete latent Markov models for normally distributed response data. Multivar Behav Res. 2005;40:461–488. [DOI] [PubMed] [Google Scholar]
5.Scott SL, James GM, Sugar CA. Hidden Markov models for longitudinal comparisons. J Am Stat Assoc. 2005;100:359–369. [Google Scholar]
6.Bartolucci F, Farcomeni A, Pennoni F. Latent Markov Models for Longitudinal Data. Boca Raton, FL: Chapman & Hall/CRC; 2012. [Google Scholar]
7.Yau C, Papaspiliopoulos O, Roberts GO, Holmes CC. Bayesian nonparametric hidden Markov models with application to the analysis of copy-number-variation in mammalian genomes. J Royal Stat Soc: Ser B (Stat Methodol). 2011;73:37–57. [DOI] [PMC free article] [PubMed] [Google Scholar]
8.Song X, Kang K, Ouyang M, Jiang X, Cai J. Bayesian analysis of semiparametric hidden Markov models with latent variables. Struct Equ Model: Multidiscip J. 2018;25:1–20. [Google Scholar]
9.Choi H, Fermin D, Nesvizhskii AI, Ghosh D, Qin ZS. Sparsely correlated hidden Markov models with application to genome-wide location studies. Bioinformatics. 2013;29:533–541. [DOI] [PMC free article] [PubMed] [Google Scholar]
10.Städler N, Mukheijee S. Penalized estimation in high-dimensional hidden Markov models with state-specific graphical models. Ann Appl Stat. 2013;7:2157–2179. [Google Scholar]
11. Guo R, Zhu H, Chow SM, Ibrahim JG. Bayesian lasso for semiparametric structural equation models. Biometrics. 2012;68:567–577.
12. Feng XN, Wang GC, Wang YF, Song XY. Structure detection of semiparametric structural equation models with Bayesian adaptive group lasso. Statist Med. 2015;34:1527–1547.
13. Kang K, Cai J, Song X, Zhu H. Bayesian hidden Markov models for delineating the pathology of Alzheimer’s disease. Stat Methods Med Res. 2018. Online first.
14. Agresti A. Categorical Data Analysis. Hoboken, NJ: John Wiley & Sons; 2002.
15. Song X, Xia Y, Zhu H. Hidden Markov latent variable models with multivariate longitudinal data. Biometrics. 2017;73:313–323.
16. Hastie T, Tibshirani R, Friedman JH. Elements of Statistical Learning: Data Mining, Inference and Prediction. 2nd ed. New York, NY: Springer; 2009.
17. Panagiotelis A, Smith M. Bayesian identification, selection and estimation of semiparametric functions in high-dimensional additive models. J Econom. 2008;143:291–316.
18. Song XY, Lu ZH. Semiparametric latent variable models with Bayesian P-splines. J Comput Graph Stat. 2010;19:590–608.
19. Tibshirani R. Regression shrinkage and selection via the lasso. J Royal Stat Soc: Ser B (Stat Methodol). 1996;58:267–288.
20. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. J Royal Stat Soc: Ser B (Stat Methodol). 2006;68:49–67.
21. Kyung M, Gill J, Ghosh M, Casella G. Penalized regression, standard errors, and Bayesian lassos. Bayesian Anal. 2010;5:369–411.
22. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc. 2001;96:1348–1360.
23. Wang H, Li G, Tsai CL. Regression coefficient and autoregressive order shrinkage and selection via the lasso. J Royal Stat Soc: Ser B (Stat Methodol). 2007;69:63–78.
24. Wang H, Leng C. A note on adaptive group lasso. Comput Stat Data Anal. 2008;52:5277–5286.
25. Bühlmann P, Van De Geer S. Statistics for High-Dimensional Data: Methods, Theory and Applications. New York, NY: Springer Science and Business Media; 2011.
26. Cappé O, Moulines E, Rydén T. Inference in Hidden Markov Models. New York, NY: Springer; 2005.
27. Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E. Equation of state calculations by fast computing machines. J Chem Phys. 1953;21:1087–1092.
28. Hastings WK. Monte Carlo sampling methods using Markov chains and their applications. Biometrika. 1970;57:97–109.
29. Li J, Wang Z, Li R, Wu R. Bayesian group lasso for nonparametric varying coefficient models with application to functional genome-wide association studies. Ann Appl Stat. 2015;9:640–664.
30. Celeux G, Forbes F, Robert CP, Titterington DM. Deviance information criteria for missing data models. Bayesian Anal. 2006;1:651–673.
31. Spiegelhalter DJ, Best NG, Carlin BP, van der Linde A. Bayesian measures of model complexity and fit. J Royal Stat Soc: Ser B (Stat Methodol). 2002;64:583–639.
32. Gelman A, Roberts GO, Gilks WR. Efficient Metropolis jumping rules. In: Bernardo JM, Berger JO, Dawid AP, Smith AFM, eds. Bayesian Statistics. Vol. 5. Oxford, UK: Oxford University Press; 1996:599–607.
33. Kantarci K, Gunter JL, Tosakulwong N, et al. Focal hemosiderin deposits and β-amyloid load in the ADNI cohort. Alzheimer’s Dement. 2013;9:S116–S123.
34. Kesslak JP, Nalcioglu O, Cotman CW. Quantification of magnetic resonance scans for hippocampal and parahippocampal atrophy in Alzheimer’s disease. Neurology. 1991;41:51.
35. Jack CR, Petersen RC, O’Brien PC, Tangalos EG. MR-based hippocampal volumetry in the diagnosis of Alzheimer’s disease. Neurology. 1992;42:183.
36. Dickerson BC, Wolk D. Biomarker-based prediction of progression in MCI: comparison of AD-signature and hippocampal volume with spinal fluid amyloid-β and tau. Front Aging Neurosci. 2013;5:55.
37. Gao S, Hendrie HC, Hall KS, Hui S. The relationships between age, sex, and the incidence of dementia and Alzheimer disease: a meta-analysis. Arch Gen Psychiatry. 1998;55:809–815.
38. Lindsay J, Laurin D, Verreault R, et al. Risk factors for Alzheimer’s disease: a prospective analysis from the Canadian Study of Health and Aging. Am J Epidemiol. 2002;156:445–453.
39. Bruandet A, Richard F, Bombois S, et al. Cognitive decline and survival in Alzheimer’s disease according to education level. Dement Geriatr Cogn Disord. 2008;25:74–80.
40. Stern Y, Albert S, Tang MX, Tsai WY. Rate of memory decline in AD is related to education and occupation. Neurology. 1999;53:1942.
41. Vina J, Lloret A. Why women have more Alzheimer’s disease than men: gender and mitochondrial toxicity of amyloid-β peptide. J Alzheimer’s Dis. 2010;20:S527–S533.
42. Heun R, Kockler M. Gender differences in the cognitive impairment in Alzheimer’s disease. Arch Women’s Ment Health. 2002;4:129–137.
43. Mazure CM, Swendsen J. Sex differences in Alzheimer’s disease and other dementias. Lancet Neurol. 2016;15:451–452.
44. Lee E, Zhu H, Kong D, Wang Y, Giovanello KS, Ibrahim JG. BFLCRM: a Bayesian functional linear Cox regression model for predicting time to conversion to Alzheimer’s disease. Ann Appl Stat. 2015;9:2153–2178.

