2024 Feb 7;73(3):658–681. doi: 10.1093/jrsssc/qlae006

Unsupervised Bayesian classification for models with scalar and functional covariates

Nancy L Garcia 1,✉,2, Mariana Rodrigues-Motta 2, Helio S Migon 3, Eva Petkova 4,5, Thaddeus Tarpey 6, R Todd Ogden 7, Julio O Giordano 8, Martin M Perez 9
PMCID: PMC11271982  PMID: 39072300

Abstract

We consider unsupervised classification by means of a latent multinomial variable which categorizes a scalar response into one of the L components of a mixture model which incorporates scalar and functional covariates. This process can be thought of as a hierarchical model with the first level modelling a scalar response according to a mixture of parametric distributions and the second level modelling the mixture probabilities by means of a generalized linear model with functional and scalar covariates. The traditional approach of treating functional covariates as vectors not only suffers from the curse of dimensionality, since functional covariates can be measured at very small intervals leading to a highly parametrized model, but also fails to take into account the nature of the data. We use basis expansions to reduce the dimensionality and a Bayesian approach for estimating the parameters while providing predictions of the latent classification vector. The method is motivated by two data examples that are not easily handled by existing methods. The first example concerns identifying placebo responders in a clinical trial (normal mixture model) and the second concerns predicting illness for milking cows (zero-inflated mixture of Poisson distributions).

Keywords: functional covariates, latent vector, unsupervised clustering, variable selection, variational inference

1. Introduction

Mixture models are popular statistical tools for classification purposes in a broad range of applied fields. Although the Gaussian mixture model is by far the most widely used approach for model-based cluster analysis, classification problems based on mixture models often require non-Gaussian mixture distributions. We often have extra information on explanatory variables, and many modern applications routinely involve more complex covariates in the form of vectors, matrices, functions, or images. The prevailing approaches in these cases, whether parametric or nonparametric, model the mean of the distribution as a function of the covariates (Cardot et al., 1999; Ferraty & Vieu, 2009; James, 2002; McLean et al., 2014, among others), but there are other applications where the interest is in the effect of the covariates on the entire distribution of the response, for example, quantile regression (e.g. Park et al., 2019, and references therein). On the other hand, some studies have no interest in modelling the mean of the distribution and instead aim at classification, as in studies related to regression trees, as proposed by Breiman (2017). In this case, modelling the probability of classification through covariates is more important than modelling the mean.

Our approach classifies an outcome as accurately as possible using only the information on the scalar and functional covariates as explanatory variables in the modelling of the mixture probabilities, providing practitioners with directly interpretable cluster classification, that is, the coefficients of the regression equation act directly on the probability of belonging to a class. A latent variable is used to model the mixture components and divides the sample of observations into subgroups according to some similarity measure. Many authors have studied the classification problem, see, for example, Titterington et al. (1985) and McLachlan and Peel (2004) and references therein, and there are R packages (R Core Team, 2021) that perform inference for mixture models, such as mixtools (Benaglia et al., 2009), but these tools cannot accommodate one or more functional covariates. When dealing with functional covariates, it is necessary to reduce the dimensionality of the data. One way of doing so is to consider a variation of principal components analysis built upon a low-rank Candecomp/Parafac (CP) decomposition, a popular dimensionality-reduction method for multiway data (Kolda & Bader, 2009). In particular, a CP decomposition imposes a special low-rank structure on the target regression coefficient matrix that explicitly captures the bilinear row and column effects of the matrix covariate (Jiang et al., 2017). Although this method is powerful, it requires all functional covariates to be observed at the same points across observations. Also, viewing the functional covariates as rows of a matrix and applying multivariate-type procedures does not take advantage of the intrinsic ordering of observations (e.g. in time or space) given by the functional structure of the covariate.

We propose a more flexible classification procedure that can incorporate functional covariates in the regression of the mixture probability. Although the dimension reduction shares some characteristics with the approach of Ciarleglio et al. (2018), we focus on the regression of the mixture probability instead of the mean. We consider functional covariates as acting through linear or nonlinear effects, taking advantage of their functional nature. Additionally, the method does not require the functions to be observed at common points, nor to have the same support, as required in Jiang et al. (2017). We employ Bayesian methods for parameter estimation, considering the classical Markov chain Monte Carlo (MCMC) approach as well as variational Bayes (VB) inference, an alternative to the MCMC algorithm that facilitates approximation of the posterior distribution in complex models using massive data sets (Blei et al., 2017; Hoffman et al., 2013; Ormerod & Wand, 2012).

We now present two data examples.

1.1. Data set 1: identification of early responders using electroencephalography data

Placebo responders are those patients whose response is termed ‘non-specific’, e.g. in a drug trial, an improvement in symptoms that is not due to the effect of the active chemicals in the drug. There is an intense debate about how to identify placebo responders in clinical trials of medications, in particular, for major depressive disorder (MDD; Walsh et al., 2002) since there could be placebo responders among either the control or the treatment group. Furthermore, it is known that there is a high rate of placebo responders among patients in MDD treatment trials and in some experiments with selective serotonin reuptake inhibitors it was found that some patients can have a better response using placebo (Gueorguieva et al., 2011). Identifying such patients using covariates would be an important tool in clinical research. Scalar covariates such as sex and disease severity are typically included in the modelling. On the other hand, there are several studies relating differences in neural processing between placebo and active treatments (see, for example, Ciarleglio et al., 2018; Leuchter et al., 2002; Wager & Atlas, 2015; Watson et al., 2007; Zhang & Luo, 2009, and references therein).

Jiang et al. (2017) analysed data from a randomized placebo controlled depression clinical trial of sertraline in order to identify early responders to treatment (which is indicative of a placebo response since it is believed that response to the active treatment is not immediate). The data set consists of 96 MDD patients, randomized to either a drug or placebo treatment. For each subject, several scalar and categorical covariates are available, as well as their resting state electroencephalography (EEG) under a closed eyes condition. These EEG data contain the current source density amplitude spectrum values (V/m2) (Nunez & Srinivasan, 2006) at a total of 14 electrodes (P9, P10, P7, P8, P5, P6, PO7, PO8, PO3, PO4, O1, O2, POZ, and OZ) located in occipital and parietal brain regions. Each electrode is measured at 45 frequencies at a 0.25 Hz resolution within the theta (4–7 Hz) and alpha (7–15 Hz) frequency bands. The response variable for each subject is the Hamilton Depression Rating Scale (HAM-D), measured before the treatment (baseline) and after one week into the study. It is believed that the active drug treatment can only have an effect on symptoms after two weeks. Therefore, any improvement observed after one week is likely to be due to placebo effect (or spontaneous improvement). For more details, we refer to Jiang et al. (2017) and references therein. For EEG location maps, see Figure 7 in Rupasov et al. (2012).

In order to have a baseline to compare our results, we used the hierarchical model proposed by Jiang et al. (2017) with the same scalar covariates, sex and chronicity, and functional covariates given by data taken from 14 EEG electrodes. The unobserved binary subgroup indicators are modelled via a hierarchical probit model as a function of the baseline EEG measurements and other scalar covariates of interest. In their work, instead of the regression component used in our model (cf. (17)), they propose to use the EEG data in the form of a (14×45) matrix-valued covariate. Moreover, they assume a low-dimensional structure through CP decomposition (Kolda & Bader, 2009), reducing the matrix dimension to 9×4. One disadvantage of this approach is that it does not take advantage of the functional nature of the data and it lacks direct interpretability for the estimated parameters. Also, the analysis uses a tensor product and so the results will likely depend on how the electrodes are ordered in the matrix.

The pattern of EEG data varies from one subject to another. Figure 1 shows the EEG plots for nine subjects. There may exist correlations among EEG signals. However, our research question at this point is more concerned with managing computational complexity, and thus we focus on comparing the MCMC and VB methods while improving our Bayesian approach by incorporating the weakly informative default prior distribution proposed by Gelman et al. (2008). Additionally, by simplifying our analysis to treat the EEG curves as independent, we gain more straightforward insights into individual curve characteristics and trends. In forthcoming research, we intend to address correlation among curves by introducing a shared random curve for each EEG signal recorded from the same subject. Nonetheless, this augmentation would greatly increase the complexity of the model and is beyond the scope of the current study.

Figure 1. EEG patterns for nine subjects. EEG = electroencephalography.

The left panel of Figure 3 shows a histogram of the change in HAM-D (baseline − week 1), showing the amount of improvement in depression symptoms after 1 week; a positive change indicates improvement in symptoms. Notice that there is a strong indication of a mixture of two distributions. Also, the choice of two classes is natural in this context since practitioners want to identify early responders. To explore the data set, we fit a parametric model using the expectation-maximization (EM) algorithm for a mixture of Gaussian distributions (different means and different variances) with no covariates, and the two fitted Gaussian curves are shown in the left panel of Figure 3, with more than 40% of the subjects classified as early responders (rightmost curve of the left panel). Notice that the EM fit has low power to discriminate the subjects into two classes.

Figure 3. Histogram of the change in HAM-D (baseline − week 1) showing the amount of improvement in depression symptoms after 1 week for both drug- and placebo-treated patients. The left panel presents the normal mixture with two components given by the EM algorithm without covariates. The right panel shows the fit using the mixture of two normal distributions $0.84\times N(0.82, 17.08) + 0.16\times N(10.72, 17.08)$ chosen by the linear model with normal prior and logit link. Gaussian curves are the mixture component densities. HAM-D = Hamilton Depression Rating Scale; EM = expectation-maximization.

1.2. Data set 2: predicting illness for milking cows based on functional covariates measured 30 days before lactation

Understanding factors that affect the productivity of dairy cattle is essential for optimizing the profitability and sustainability of dairy farms. The transition from late pregnancy to early lactation is one of the most important stages of the lactation cycle of these animals, as cows are at greatest risk of experiencing health disorders during this period. The occurrence of health disorders affects the productivity of cows and is thus a determining factor of the profitability of dairy herds.

A potential strategy to identify cows with health disorders in early lactation for treatment and other interventions is the automated monitoring of cow behavioural, physiological, and performance parameters with automated health monitoring systems based on sensor data. It has been demonstrated that multiple sensor parameters, such as rumination time, physical activity, resting time, body temperature, milk volume, and component yield, are useful for monitoring cow health, as they are dramatically altered during episodes of health disorders (Stangaferro et al., 2016a, 2016b, 2016c) and thus can be used to predict the health status of cows. Moreover, data from these sensor systems can also be combined with non-sensor data to increase the accuracy of alerts used to identify cows with health disorders. Our data set for this application was collected in order to train, validate, and test machine-learning models created through a combination of sensor data from automated health monitoring systems and non-sensor data available at commercial dairy farms. This project, funded by the United States Department of Agriculture-National Institute of Food and Agriculture, was conducted by the Dairy Cattle Biology and Management Laboratory at Cornell University. We have data on daily clinical examination of 258 cows during the first 40 days after calving (days in milk). Information on whether a cow had a health disorder (1) or not (0) was collected by the research team on a daily basis. The data consist of 41 rows (0–40) per cow with sensor and non-sensor data from the previous lactation (i.e. before calving) and after calving. We focused on four sensor parameters related to physical activity for the 30 days prior to calving: activity (total number of steps in a given day divided by 24), number of resting bouts, average rest time, and total rest time per day, see Figure 2. Moreover, in this analysis, in addition to the four functional covariates described above, we considered four scalar variables related to the previous lactation period known to be associated with health outcomes after calving: Z1 = age at first calving, Z2 = previous lactation days in milk, Z3 = previous lactation health event (0 or 1), and Z4 = number of previous lactation health events.

Figure 2. Functional covariates for 10 cows.

The objective of this application was to model the number of days a cow was sick in the first 40 days after calving. Our main goal was to classify cows into three status classes using the scalar and functional covariates related to activity and resting times. There was high variability in the data set, as seen in Table 1. During the observation period, 175 cows were not diagnosed with a health disorder, whereas 83 cows had at least one health disorder event; over all cows, the number of sick days has mean 1.60 and variance 10.30. If we consider only the non-zero values, we get a mean of 4.96 sick days and variance 15.95. As previously noted, we have 68% of zeros in the sample (175/258), so a zero-inflated distribution is likely appropriate to accommodate the overdispersion caused by the excess zeros. On the other hand, considering only the non-zero observations, we still have a large variance compared with the mean. Therefore, we fitted a zero-inflated mixture of two Poisson distributions to accommodate the large number of zeros as well as the overdispersion beyond that.

Table 1.

Frequency of sick days for lactating cows (training + testing set)

Sick days    0   1   2   3   4   5   6   7   8   9  10  11  12  13  18  Total
Freq       175  22  10   7   7   4   5   4   7   6   2   3   3   2   1    258

In this case, we fitted the model to 208 cows randomly selected as a training set and used the remaining 50 cows to evaluate the prediction using the model.

The remainder of this paper is organized as follows. Section 2 introduces the hierarchical mixture model with a latent variable and regression of the mixture probabilities as a function of functional covariates, and a Bayesian approach is presented in Section 3. Sections 4 and 5 present the results for the two data examples. An extensive simulation study is given in the online supplementary material, with the primary goal of examining the performance of the normal mixture model and the zero-inflated mixture Poisson (ZIMP) model with respect to sample size and the ability of the functional curves to predict the latent variables correctly. The secondary goal of the simulation study is to compare the estimation ability of the MCMC and VB methods, as well as their computational cost.

2. General model

For each subject $i=1,\ldots,n$, we observe: $y_i$, the scalar response; $z_i$, a vector of scalar covariates; and $X_i=\{(t_{ijs}, X_{ij}(t_{ijs})),\ 1\le s\le S_{ij},\ 1\le j\le J\}$, $J$ functional covariates observed at discrete points $t_{ijs}$, $1\le s\le S_{ij}$ (which do not need to be the same across subjects), in closed domains $\tau_1,\ldots,\tau_J$. In general, these sets are closed intervals on the real line, and although we interpret $t_{ijs}$ as time in this study, our proposed model works for functional covariates observed at points in space or time-space. Notice that neither the domains in which we observe the functional covariates nor the observation points need to be the same for all covariates. We model the distribution of $y_i$ hierarchically by means of a latent class model, postulating a mixture distribution for the observed response to classify subjects into $L$ classes. We will assume an $L$-mixture latent class model with unobserved multinomial random variables indicating class membership, $\{\gamma_i=(\gamma_{i1},\ldots,\gamma_{iL}),\ i=1,\ldots,n\}$, where $\sum_{l=1}^{L}\gamma_{il}=1$ and $p_{il}=P(\gamma_{il}=1)$. Let $y=(y_1,\ldots,y_n)$ be the vector of responses with dimension $n\times 1$ and $X=\{X_{ij},\ i=1,\ldots,n,\ j=1,\ldots,J\}$ a vector with components $X_{ij}$ of dimension $1\times n_{ij}$, where $n_{ij}$ is the number of observed points of curve $j$ of subject $i$. Additionally, let $f_l(y_i;\lambda_l)$, $i=1,\ldots,n$, be the pdf of the scalar response $y_i$ in class $l$, where $\lambda_l$ is the vector of parameters in that class. The probability of the random variable $\gamma_{il}$ is modelled as a function of the scalar and functional covariates by

$$g(p_{il}) = z_i^\top\theta_l + \sum_{j=1}^{J}\int_{\tau_j} F_{jl}(X_{ij}(t), t, \phi_{jl})\,dt \qquad (1)$$

where $g$ is a known link function (e.g. probit or logit), $z_i$ is a vector of dimension $p\times 1$ representing scalar covariates, and $\theta_l$ is a vector of parameters of dimension $p\times 1$ that captures the linear additive effect of those covariates in class $l$. Finally, $F_{jl}(\cdot,\cdot)$ is a bivariate smooth function related to the $j$th functional covariate in component $l$ which depends on the vector of parameters $\phi_{jl}$. Notice that $F_{jl}$ can be very general: the unrestricted case is termed the functional generalized additive model by McLean et al. (2014), and it includes as a special case the linear form $F_{jl}(X_{ij}(t),t)=w(t)X_{ij}(t)$ (the functional linear regression model). Then the complete-data likelihood can be written as

$$f(y,\gamma\mid\lambda_1,\ldots,\lambda_L,\theta,\phi,z,X) = \prod_{i=1}^{n}\prod_{l=1}^{L}\left[f_l(y_i;\lambda_l)\,p_{il}\right]^{\gamma_{il}} \qquad (2)$$

where the relationship of $p_{il}$ with $\theta,\phi,z_1,\ldots,z_n$, and $X$ is given by (1). Here, $\theta=(\theta_1,\ldots,\theta_L)$ and $\phi=\{\phi_{jl},\ j=1,\ldots,J,\ l=1,\ldots,L\}$ is a vector with components $\phi_{jl}$ of length $p_{jl}$, where $p_{jl}$ is the number of parameters in $\phi_{jl}$.

In general, functional covariates are used to model the mean or any other parameter in $\lambda_l$ that indexes $f_l(\cdot\,;\cdot)$, but not the mixing probabilities $p_{il}$. However, in our applications, the interest lies in using functional covariates solely to classify subjects into $L$ classes.

2.1. Functional linear model

Model (1) can be restricted to be linear by specifying $F_{jl}(x(t),t)=w_{jl}(t)\,x(t)$, yielding the more common generalized functional linear model

$$g(p_{il}) = z_i^\top\theta_l + \sum_{j=1}^{J}\int_{\tau_j} w_{jl}(t)\,X_{ij}(t)\,dt, \qquad l=1,\ldots,L. \qquad (3)$$

To fit model (3), we consider each weight function $w_{jl}(\cdot)$ to be a smooth function belonging to the finite-dimensional space spanned by B-spline basis functions. This is not the only possibility, as other bases could be chosen, such as Fourier expansions, wavelets, natural splines, etc. (Silverman, 2018). Also, we choose the number of knots and the knot placement in an ad hoc manner. Expressing the weight function as a linear combination of basis functions transforms the challenge of selecting $w_j$ from an infinite-dimensional class of functions into a more manageable finite one. Then, two critical issues must be addressed: determining the number of basis functions and optimizing knot placement. A small number of knots is known to cause oversmoothing, while a large number can yield overly rough estimates. It is noteworthy that, after establishing the number of basis functions, poorly placed knots can lead to undesired multimodality or a failure to capture distinct features. However, addressing knot placement is beyond the scope of this work and will not be discussed here; interested readers can refer to Wahba (1982) and Eilers and Marx (1996) for in-depth information. To address both challenges simultaneously, penalized methods such as the Bayesian Lasso recommended by Park and Casella (2008) and Hans (2009) could be employed; however, they are very expensive to implement along with MCMC methods. In our study, we leverage commonly used procedures in nonparametric regression, substantially reducing computational costs. In the applications, the number of knots was based on visual inspection. In the early responder example, we adopt a heuristic approach to distribute knots based on the suggestion of Wegman and Wright (1983): knots are placed at the endpoints (frequencies 16 and 60) and around inflection points, specifically at the more frequent peaks with frequencies 18, 20, 22, 25, 30, 35, 40, 45, 50, and 55. Conversely, in the investigation of cow illnesses, we choose to space knots equidistantly.

For a positive integer $K_j$ and a vector of $(K_j-4)$ interior knots $\Upsilon_j\subset\tau_j$, we express the weight function as

$$w_{jl}(t) = \sum_{k=1}^{K_j}\phi_{jlk}\,B_k^{(j)}(t) \qquad (4)$$

where $\{B_1^{(j)},\ldots,B_{K_j}^{(j)}\}$ are cubic B-spline basis functions determined by $\Upsilon_j$.

Substituting (4) into (3) yields

$$g(p_{il}) = z_i^\top\theta_l + \sum_{j=1}^{J} R_{ij}^\top\phi_{jl} \qquad (5)$$

where, for each pair $(j,l)$, $\phi_{jl}=(\phi_{jl1},\ldots,\phi_{jlK_j})$ is a vector of parameters of dimension $K_j\times 1$ and $R_{ij}=(R_{ij1},\ldots,R_{ijK_j})$ is a $K_j\times 1$ vector, with $R_{ijk}=\int_{\tau_j}B_k^{(j)}(t)\,X_{ij}(t)\,dt$.
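In practice, the integrals $R_{ijk}$ must be approximated numerically from the discretely observed curve. The following R sketch (function name and arguments are ours) evaluates a cubic B-spline basis on the observed grid and applies trapezoidal quadrature; with $K_j-4$ interior knots, the cubic basis with an intercept has exactly $K_j$ columns.

```r
library(splines)

## Sketch of the scores in equation (5): R_ijk = \int B_k(t) X_ij(t) dt,
## approximated by trapezoidal quadrature on a possibly irregular grid t.
compute_Rij <- function(t, x, interior_knots, boundary = range(t)) {
  B <- bs(t, knots = interior_knots, degree = 3, intercept = TRUE,
          Boundary.knots = boundary)                     # S x K_j basis matrix
  d <- diff(t)
  w <- c(d[1], d[-length(d)] + d[-1], d[length(d)]) / 2  # trapezoid weights
  as.numeric(crossprod(B, w * x))                        # K_j x 1 vector R_ij
}
```

Because the quadrature is computed per subject, the observation grids are free to differ across subjects and covariates, as the model requires.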

The linear case has the advantage of easy interpretability of the weight functions. If the weight $w_{jl}(t)$ is positive (negative) over an interval $(t_a,t_b)$, then the higher the value of $X_{ij}(t)$ in this interval, the higher (lower) the probability that $\gamma_{il}=1$, all other explanatory variables held fixed. In this work, we used functional boxplots to select the functional covariates. An interesting question to be addressed in future work is to use the posterior sample and the Bayes factor to derive a test of $w_{jl}(t)=0$ over an interval.

2.2. Functional nonlinear model

For the more general model (1), we consider each function $F_{jl}(\cdot,\cdot)$ to be a smooth surface generated by a family of tensor products of cubic B-splines (see, for example, Kim et al., 2018). That is, for each function $F_{jl}$, there exist positive integers $K_{1j}$ and $K_{2j}$ and vectors of $(K_{1j}-4)$ interior knots $\Upsilon_{1j}\subset\chi_j$ (the image of $X_{ij}$) and $(K_{2j}-4)$ interior knots $\Upsilon_{2j}\subset\tau_j$, such that

$$F_{jl}(s,t) = \sum_{k_1=1}^{K_{1j}}\sum_{k_2=1}^{K_{2j}}\phi_{jlk_1k_2}\,B_{k_1}^{(\Upsilon_{1j})}(s)\,B_{k_2}^{(\Upsilon_{2j})}(t) \qquad (6)$$

where $\{B_1^{(\Upsilon_{1j})},\ldots,B_{K_{1j}}^{(\Upsilon_{1j})}\}$ and $\{B_1^{(\Upsilon_{2j})},\ldots,B_{K_{2j}}^{(\Upsilon_{2j})}\}$ are B-spline bases determined by $\Upsilon_{1j}$ and $\Upsilon_{2j}$, respectively.

Substituting (6) into (1) yields

$$g(p_{il}) = z_i^\top\theta_l + \sum_{j=1}^{J} R_{ij}^\top\phi_{jl} \qquad (7)$$

where the $R_{ij}$ are vectors of dimension $(K_{1j}K_{2j})\times 1$, with $\phi_{jl}(k_1,k_2)=\phi_{jlk_1k_2}$ and $R_{ij}(k_1,k_2)=\int_{\tau_j}B_{k_1}^{(\Upsilon_{1j})}(X_{ij}(t))\,B_{k_2}^{(\Upsilon_{2j})}(t)\,dt$ properly stacked.

As we can see, from (5) and (7), using basis expansions for both the linear and nonlinear cases, we end up with the same linear structure in terms of parameters θ and ϕ. Notice that the functional covariates are used to compute the Rij and they are not expanded using basis functions.
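The nonlinear scores admit the same numerical treatment: the tensor-product basis is evaluated along the observed trajectory $(X_{ij}(t_s), t_s)$ and integrated over $t$. A hedged R sketch, reusing the quadrature scheme above (function and argument names are ours):

```r
## Sketch of the scores in equation (7): evaluate the surface basis along the
## curve (X_ij(t), t) and integrate in t by trapezoidal quadrature.
compute_Rij_nl <- function(t, x, knots_x, bdry_x, knots_t, bdry_t = range(t)) {
  Bx <- bs(x, knots = knots_x, degree = 3, intercept = TRUE,
           Boundary.knots = bdry_x)      # S x K_1j, basis in the x-direction
  Bt <- bs(t, knots = knots_t, degree = 3, intercept = TRUE,
           Boundary.knots = bdry_t)      # S x K_2j, basis in the t-direction
  d <- diff(t)
  w <- c(d[1], d[-length(d)] + d[-1], d[length(d)]) / 2
  as.vector(crossprod(Bx, w * Bt))       # K_1j x K_2j score matrix, stacked
}
```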

3. A Bayesian approach to the mixture model regression with functional covariates

Model (2) specifies the distribution of the response $y_i$ depending on which mixture component subject $i$ belongs to. The mixture component densities are parameterized by the vector $\lambda=(\lambda_1,\ldots,\lambda_L)$, whose components are row vectors related to the densities $f_1,\ldots,f_L$, respectively, considering category $L$ as the baseline. Therefore, we will denote by $\Theta=(\lambda,\beta)$ the vector of unknown parameters in the mixture model, where $\beta=(\beta_1,\ldots,\beta_{L-1})$ with components $\beta_l=(\theta_l,\phi_{1l},\ldots,\phi_{Jl})$, a column vector of regression coefficients under component $l$, and $x_i:=(z_i,R_{i1},\ldots,R_{iJ})$ is the row vector of covariates for subject $i$. Without loss of generality, we consider the same $x_i$ across the $L$ mixture components in the development of the methodology; the methodology still applies when the covariates differ across the mixture components.

3.1. Hierarchical structure specification and prior specification

A formal Bayesian analysis of a mixture model usually leads to intractable calculations. Data augmentation is an efficient procedure for mixture models that leads to feasible computations using Gibbs sampling (Diebolt & Robert, 1994). The joint augmented posterior distribution is the product of (2) and the prior distributions, which has no closed form. There is no standard solution for the general case. However, in both applications, using suitable prior distributions, all the conditional distributions are available and the Gibbs sampling algorithm is suitable to sample from the posterior distribution of γ, λ, and β.

The nature of the application under study dictates the form of $f_l(y_i;\lambda_l)$, which in turn provides knowledge about the nature of the parameters in $\lambda_l$. There is a very rich family of distributions $f_l(y_i;\lambda_l)$ that may characterize the mixture distribution of $y_i$. Instead of focusing on a specific distribution for the mixture components, we focus on a general solution for posterior sampling of the latent variables $\gamma$ and parameters in $\Theta$, developed for a general $f_l(y_i;\lambda_l)$ without loss of generality. Therefore, prior distributions for $\lambda$ are problem specific. The full conditional posterior distribution of $\beta_l$ cannot be computed explicitly except when using the probit link function, in which case we can apply the simple latent variable method of Albert and Chib (1993). Other methods for calculating the full conditional have been proposed using data augmentation or multiple layers of latent variables, see for example Held and Holmes (2006), Frühwirth-Schnatter and Frühwirth (2010), Gramacy and Polson (2012), and Polson et al. (2013). In our approach, for each of the components of $\beta$, we will assume a Student-t prior distribution with mean 0, degrees-of-freedom parameter $df$, and scale $\varsigma$, with $df$ and $\varsigma$ providing minimal prior information to constrain the coefficients to lie in a reasonable range (see Section 2 of Gelman et al., 2008). An advantage of the t family is that its fat tails allow for flexible inference, and it includes both the normal ($df=\infty$) and the Cauchy ($df=1$) distributions.

3.2. Posterior computation of parameters

We sample from the posterior distribution using a Gibbs sampling scheme, and most of the full conditional posterior distributions of the latent variables $\gamma$ and parameters in $\Theta$ are given by standard methods. The only non-standard computation is the conditional posterior distribution for $\beta$.

3.2.1. Full conditional posterior distribution of βl

We follow Gelman et al. (2008), which incorporates a prior distribution into classical logistic regression computations. As in the classical iteratively weighted least-squares algorithm, at each iteration the algorithm determines the pseudo-data ψil given by

$$\psi_{il} = g(p_{il}) + (\gamma_{il}-p_{il})\,g'(p_{il}), \qquad i=1,\ldots,n \qquad (8)$$

and weights $\sigma_{\psi,il}^2$ satisfying the equation

$$[\sigma_{\psi,il}^2]^{-1} = [g'(p_{il})]^{-2}\,V_{il}^{-1} \qquad (9)$$

where $V_{il}$ is the variance function. In the classical approach of McCullagh and Nelder (1989), at each iteration of the algorithm, equations (8) and (9) are evaluated at $\hat{p}_{il}$, given in equation (7). Under the Gelman et al. (2008) approach, $\psi_{il}$ is approximated by a normal distribution with mean $g(p_{il})=x_i\beta_l$ and variance $\sigma_{\psi,il}^2$. With a normal prior distribution for $\beta$, the full conditional of $\beta$ is conjugate and normal. On the other hand, if independent Student-t distributions are considered for each $\beta_{ld}$ in $\beta_l$, $d=1,\ldots,D$, each expressed as a normal distribution with mean $\mu_{ld,\beta}$ and unknown variance $\sigma_{ld}^2$ following a scaled inverse chi-square distribution with scale parameter $\varsigma_{ld}$ and degrees of freedom $\nu_{ld}$, the approximate logarithm of the full conditional posterior density of $\beta_l$ and $\sigma_l=(\sigma_{l1},\ldots,\sigma_{lD})$ is given by

$$\log p(\beta_l,\sigma_l\mid\psi) \propto -\frac{1}{2}\sum_{i=1}^{n}\frac{1}{\sigma_{\psi,il}^2}\,(\psi_{il}-x_i\beta_l)^2 - \frac{1}{2}\sum_{d=1}^{D}\left[\frac{1}{\sigma_{ld}^2}(\beta_{ld}-\mu_{ld,\beta})^2 + \log(\sigma_{ld}^2) + \log p(\sigma_{ld}\mid\varsigma_{ld},\nu_{ld})\right]. \qquad (10)$$

Notice that (10) has no closed form. Gelman et al. (2008) use an approximate EM algorithm (Dempster et al., 1977) that averages over $\beta_l$ at each step, treating them as missing data and performing EM to estimate $\sigma$. The algorithm starts at some value of $\sigma$ and proceeds by alternating one step of iteratively weighted least squares to update $\beta$ and one step of EM to update $\sigma$. Once enough iterations have been performed to reach approximate convergence, we get a sample of $\beta_l=(\theta_l,\phi_{1l},\ldots,\phi_{Jl})$. This step is performed inside the Gibbs algorithm. In our case, we follow this scheme to obtain a sample from the full conditional of $\beta$ using the bayesglm function implemented in R by Gelman et al. (2008). To use bayesglm, we specify the link function $g$ as either the probit or logit function, and set the degrees of freedom $\nu_{ld}$ appropriately to obtain the normal ($\nu_{ld}=\infty$) and the Cauchy ($\nu_{ld}=1$) prior distributions.
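For concreteness, a hedged sketch of this step in R using bayesglm from the arm package; the data frame dat and its column names are hypothetical placeholders for the current latent indicators, scalar covariates, and basis scores:

```r
library(arm)  # provides bayesglm and sim (Gelman et al., 2008)

## One approximate update of beta_l inside the Gibbs sampler: regress the
## current latent indicators on the scalar covariates and basis scores.
fit <- bayesglm(gamma ~ sex + chronicity + R1 + R2 + R3,
                family = binomial(link = "logit"), data = dat,
                prior.df = 7,        # t prior; Inf gives normal, 1 gives Cauchy
                prior.scale = 2.5)   # weakly informative default scale
beta_draw <- coef(sim(fit, n.sims = 1))  # one approximate draw of beta_l
```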

3.3. Variational inference

To scale the problem described in the introduction of this paper to large samples and more functional covariates, we must resort to approximate posterior inference. VB inference is a machine-learning technique that facilitates approximation of the posterior distribution in complex models using massive data sets (Blei et al., 2017; Hoffman et al., 2013; Ormerod & Wand, 2012). VB inference provides the main alternative to the Markov chain Monte Carlo (MCMC) algorithm. It starts by introducing a variational family of distributions, indexed by some variational parameters $\kappa$, and a criterion function. Then it searches for the member $q(\cdot\mid\kappa)$ of the family that best approximates the posterior distribution. The optimization criterion is derived from the log-marginal posterior distribution of the observed data, a usual model selection criterion, $\log(p(y\mid M))$. This quantity typically requires the solution of an intractable integral. To avoid this tedious calculation, a lower bound, called the ELBO (evidence lower bound), is easily evaluated as

$$\log(p(y\mid M)) = \log\left(\int p(y,\lambda,\beta,\gamma)\,d\gamma\,d\beta\,d\lambda\right) = \log\left(E_q\left[\frac{p(y,\lambda,\beta,\gamma)}{q(\lambda,\beta,\gamma\mid\kappa)}\right]\right) \ge E_q\left[\log\frac{p(y,\lambda,\beta,\gamma)}{q(\lambda,\beta,\gamma\mid\kappa)}\right] \qquad (11)$$

where $\gamma$ and $(\lambda,\beta)$ represent local and global quantities/parameters, respectively. The inequality in (11) follows from Jensen's inequality. It is natural to use this lower bound as a model selection criterion in place of the predictive distribution, avoiding cumbersome high-dimensional integration. Therefore, the VB objective is to maximize the ELBO, which is equivalent to minimizing the Kullback–Leibler divergence up to an additive constant (Blei et al., 2017). We propose the following partition of the joint distribution of local and global parameters, called the mean field family (Parisi, 1988):

$$q(\lambda,\beta,\gamma\mid\kappa) = q(\lambda\mid\kappa)\,q(\beta\mid\lambda,\kappa)\prod_{i=1}^{n}q(\gamma_i\mid\beta,\kappa) \qquad (12)$$

where κ comprises all the parameters of the variational family. To avoid a cumbersome notation, we are using the same notation q for the joint variational distribution of (11) and the conditional distributions in (12).

The approximate conditional inference is viewed as an optimization problem. Given the above setup, the mean field family, and the ELBO criterion, one can find the optimal solution via the coordinate ascent variational inference algorithm (Bishop, 2006). Each factor of the mean field variational density is optimized iteratively, while keeping the others fixed, climbing the ELBO to a local optimum.

Letting $\nu=(\lambda,\beta,\gamma)$, we need to compute $q^*(\nu\mid\kappa)=q^*(\lambda\mid\kappa)\,q^*(\beta\mid\lambda,\kappa)\prod_{i=1}^{n}q^*(\gamma_i\mid\beta,\kappa)$ where

$$q^*(\lambda\mid\kappa) \propto \exp\{E_{(\beta,\gamma)}\log p(\lambda\mid\beta,\gamma,y,\kappa)\} \qquad (13)$$
$$q^*(\beta\mid\kappa) \propto \exp\{E_{(\lambda,\gamma)}\log p(\beta\mid\lambda,\gamma,y,\kappa)\}, \quad\text{and} \qquad (14)$$
$$q^*(\gamma_i\mid\kappa) \propto \exp\{E_{(\gamma_{-i},\lambda,\beta)}\log p(\gamma_i\mid\gamma_{-i},\lambda,\beta,y)\} \qquad (15)$$

It is worth pointing out that the form of the optimal densities involves the full conditional distributions, revealing a link with Gibbs sampling. However, the VB algorithm does not repeatedly simulate from the full conditional distributions as is done by the Gibbs sampler.

4. Identification of early responders using EEG data

Let $y_i$ denote the change in HAM-D (baseline − week 1) for subject $i$, $i=1,\ldots,96$, where a positive change indicates diminished depression symptom severity. In order to compare our results with Jiang et al. (2017), we focus on the same model, given by (16), with the same scalar covariates, sex and chronicity, and functional covariates given by data taken from 14 EEG electrodes. The unobserved binary subgroup indicators are modelled via a hierarchical probit model as a function of the baseline EEG measurements and other scalar covariates of interest.

For this data set, we deal with a mixture of two normal distributions with different means but the same variance; it is straightforward to generalize the computations to $L$ classes and different variances. Let $Y_1,\ldots,Y_n$ be independent random variables with

$$p(y_i\mid\mu_0,\mu_1,\sigma^2,\gamma_i) = \gamma_i\,\phi(y_i;\mu_1,\sigma^2) + (1-\gamma_i)\,\phi(y_i;\mu_0,\sigma^2), \qquad \mu_1>\mu_0 \qquad (16)$$

where $\phi(\cdot\,;\mu,\sigma^2)$ is the normal density with parameters $\mu$ and $\sigma^2$, and $\gamma_1,\ldots,\gamma_n$ are binary latent random variables with $p(\gamma_i\mid\beta)=p_i(\beta)^{\gamma_i}(1-p_i(\beta))^{1-\gamma_i}$, $\gamma_i=0$ or $1$, where

$$p_i(\beta) = g^{-1}(x_i\beta) \qquad (17)$$

$g$ is a link function, and $\beta=(\theta,\phi_1,\ldots,\phi_J)$, with $\theta$ a column vector of parameters associated with the scalar effects and $\phi_j$ column vectors of coefficients of the B-spline expansions given by (5) or (7).

For the parameters in $\lambda=(\mu_0,\mu_1,\sigma^2)$, we use diffuse priors, $\mu_0\sim N(0,\tau_0^2)$ and $\mu_1\sim N(0,\tau_1^2)$ with $\tau_0^2=\tau_1^2=100$.

For each coefficient in $\beta$, we specify a weakly informative t-family prior distribution.

The observed-data likelihood for the hierarchical model is difficult to optimize directly because of the unobserved vector $\gamma=\{\gamma_i\}_{i=1}^{n}$. Denoting $\Theta=(\lambda,\beta)$, the complete likelihood is given by

$$f(y,\gamma\mid\Theta) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(y_i-\mu_0(1-\gamma_i)-\mu_1\gamma_i)^2}{2\sigma^2}\right\}\left[p_i(\beta)\right]^{\gamma_i}\left[1-p_i(\beta)\right]^{(1-\gamma_i)} \qquad (18)$$

4.1. Full conditional posterior distributions

The joint augmented posterior distribution is proportional to the product of the likelihood given by (18) and prior distributions specified in the previous section and has no closed form. Therefore, we adapt the Gibbs sampling algorithm to sample from the full conditional posterior distribution of Θ and the latent variables γ.

4.1.1. Full conditional posterior distribution of γi

Let $\gamma_{-i}$ be the vector $(\gamma_1,\ldots,\gamma_{i-1},\gamma_{i+1},\ldots,\gamma_n)$. The full conditional posterior distribution of $\gamma_i$ is given by

$$P(\gamma_i=1\mid\Theta,y,\gamma_{-i}) = \frac{\phi\!\left(\frac{y_i-\mu_1}{\sigma}\right)p_i(\beta)}{\phi\!\left(\frac{y_i-\mu_1}{\sigma}\right)p_i(\beta) + \phi\!\left(\frac{y_i-\mu_0}{\sigma}\right)(1-p_i(\beta))}$$

since $P(\gamma_i=x\mid\Theta,y,\gamma_{-i}) \propto \phi\!\left(\frac{y_i-\mu_1 x-\mu_0(1-x)}{\sigma}\right)p_i(\beta)^x(1-p_i(\beta))^{1-x}$, where $p_i(\beta)$ depends on the covariates through the regression term and is given by (17).
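A hedged R sketch of this update, vectorized over subjects; p holds the current values of $p_i(\beta)$ from (17):

```r
## One Gibbs update of the latent labels in the two-component normal mixture.
update_gamma <- function(y, mu0, mu1, sigma2, p) {
  num <- dnorm(y, mu1, sqrt(sigma2)) * p        # class-1 weight
  den <- num + dnorm(y, mu0, sqrt(sigma2)) * (1 - p)
  rbinom(length(y), size = 1, prob = num / den)
}
```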

4.1.2. Full conditional posterior distribution of μ0 and μ1

We update $\mu_0$ from a normal distribution with variance $v_0=\left(\frac{1}{\tau_0^2}+\frac{1}{\sigma^2}\sum_{i=1}^{n}(1-\gamma_i)\right)^{-1}$ and mean $v_0\,\frac{1}{\sigma^2}\sum_{i=1}^{n}y_i(1-\gamma_i)$, while $\mu_1$ is updated, conditionally on $\mu_0$, from a normal distribution truncated to $(\mu_0,\infty)$, with variance $v_1=\left(\frac{1}{\tau_1^2}+\frac{1}{\sigma^2}\sum_{i=1}^{n}\gamma_i\right)^{-1}$ and mean $v_1\,\frac{1}{\sigma^2}\sum_{i=1}^{n}y_i\gamma_i$.

4.1.3. Full conditional posterior distribution of σ2

We update $\sigma^2$ using an inverse gamma distribution with parameters $a_0+\frac{n}{2}$ and $b_0+\frac{1}{2}\sum_{i=1}^{n}\left[\gamma_i(y_i-\mu_1)^2+(1-\gamma_i)(y_i-\mu_0)^2\right]$.
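A hedged R sketch of these two conjugate updates (the truncated normal is drawn by inverse-CDF sampling; a0, b0, tau02, tau12 are the prior constants):

```r
## Conjugate updates for (mu0, mu1) and sigma^2 in the normal mixture model.
update_mu_sigma2 <- function(y, gam, sigma2, tau02, tau12, a0, b0) {
  v0  <- 1 / (1 / tau02 + sum(1 - gam) / sigma2)
  mu0 <- rnorm(1, v0 * sum(y * (1 - gam)) / sigma2, sqrt(v0))
  v1  <- 1 / (1 / tau12 + sum(gam) / sigma2)
  m1  <- v1 * sum(y * gam) / sigma2
  u   <- runif(1, pnorm(mu0, m1, sqrt(v1)), 1)   # truncation to (mu0, Inf)
  mu1 <- qnorm(u, m1, sqrt(v1))
  ss  <- sum(gam * (y - mu1)^2 + (1 - gam) * (y - mu0)^2)
  sigma2 <- 1 / rgamma(1, a0 + length(y) / 2, b0 + ss / 2)  # inverse gamma
  list(mu0 = mu0, mu1 = mu1, sigma2 = sigma2)
}
```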

4.1.4. Full conditional posterior distribution of βl

To derive the full conditional posterior distribution of βl, we proceed as described in Section 3.2.1. The distribution will depend on the prior of βl, which can be normal, Cauchy, or t.

4.2. VB for the normal mixed model

For the mixture of normal distributions model, the augmented vector of unknown parameters is $\nu=(\lambda,\beta,\gamma)$, where $\lambda=(\mu_0,\mu_1,\sigma^2)$, $\beta=(\theta,\phi)$, and $\gamma=(\gamma_1,\ldots,\gamma_n)$. For the variational distribution we choose:

  • $\{\gamma_i,\ i=1,\ldots,n\}$: a family of independent Bernoulli random variables with variational parameters $\alpha_i$;

  • $\mu_0$ and $\mu_1$: independent normal distributions with means $m_0$ and $m_1$ and variances $s_0^2$ and $s_1^2$, respectively;

  • $\sigma^2$: inverse gamma with parameters $A_0$ and $B_0$;

  • $\beta$: normal distribution with covariance matrix $V_\beta^*$ and mean $\mu_\beta^*$.

Denote the parameters of the variational distributions as

$$\kappa = (m_0, s_0^2, m_1, s_1^2, \alpha, A_0, B_0, \mu_\beta^*, V_\beta^*)$$

and define $q^*(\nu\mid\kappa)=q^*(\mu_0\mid\kappa)\,q^*(\mu_1\mid\kappa)\,q^*(\sigma^2\mid\kappa)\,q^*(\beta\mid\kappa)\prod_{i=1}^{n}q^*(\gamma_i\mid\kappa)$. According to equations (13), (14), and (15), we have to calculate the variational distributions of $\mu_0,\mu_1,\sigma^2,\beta$, and $\gamma_1,\ldots,\gamma_n$.

In the next sections, to simplify the notation, we will omit the dependence on κ when writing the variational distributions q*. The details of computations can be found in the online supplementary material.

4.2.1. Variational density q*(γi)

If we take $q^*(\mu_0)$ and $q^*(\mu_1)$ in the family of independent normal distributions with means $m_0$ and $m_1$ and variances $s_0^2$ and $s_1^2$, respectively, then $\gamma_i$ is a Bernoulli random variable with variational parameter $\alpha_i=\alpha_{i1}/(\alpha_{i0}+\alpha_{i1})$, where

$$\alpha_{il} = \exp\left\{E_{q^*(\beta)}\log\left[p_i(\beta)^l(1-p_i(\beta))^{1-l}\right] - E_{q^*(\sigma^2)}\!\left(\frac{1}{2\sigma^2}\right)\left[(y_i-m_l)^2+s_l^2\right]\right\}, \qquad l=0,1$$

4.2.2. Variational densities q*(μ0) and q*(μ1)

For $k=0,1$, the variational distribution of $\mu_k$ is Gaussian with mean $m_k$ and variance $s_k^2$ given by

$$m_0 = s_0^2\,E_{q^*(\sigma^2)}[1/\sigma^2]\sum_{i=1}^{n}(1-\alpha_i)y_i, \qquad s_0^2 = \frac{1}{E_{q^*(\sigma^2)}[1/\sigma^2]\sum_{i=1}^{n}(1-\alpha_i) + 1/\tau_0^2}$$

and

$$m_1 = s_1^2\,E_{q^*(\sigma^2)}[1/\sigma^2]\sum_{i=1}^{n}\alpha_i y_i, \qquad s_1^2 = \frac{1}{E_{q^*(\sigma^2)}[1/\sigma^2]\sum_{i=1}^{n}\alpha_i + 1/\tau_1^2}$$

where $\tau_0^2$ and $\tau_1^2$ are the parameters from the prior distribution.

4.2.3. Variational density q*(σ2)

The variational density of $\sigma^2$, considering the likelihood and the prior distribution $\sigma^2\sim IG(a_0,b_0)$, is an inverse gamma with parameters $A_0=a_0+n/2$ and

$$B_0 = b_0 + \sum_{i=1}^{n}\frac{\alpha_i}{2}\left((y_i-m_1)^2+s_1^2\right) + \sum_{i=1}^{n}\frac{1-\alpha_i}{2}\left((y_i-m_0)^2+s_0^2\right)$$

4.2.4. Variational density q*(β)

Assuming a t prior distribution for each component of $\beta_l$ in $\beta$, the variational density $q^*(\beta_l)$ is obtained by taking the expectation of equation (10) with respect to $\gamma_i$, using

$$E_{q^*(\gamma_i)}(\psi_{il}) = g(\hat{p}_{il}) + (\alpha_i-\hat{p}_{il})\,g'(\hat{p}_{il}) \qquad (19)$$

with $\alpha_i=E_{q^*(\gamma_i)}(\gamma_i)$ given in Section 4.2.1. There is no closed form for $q^*(\beta_l)$, and therefore we cannot compute the ELBO.

However, taking the prior distribution of $\beta_l$ to be normal with mean $\mu_\beta$ and covariance matrix $\Sigma_\beta$, the variational $q^*(\beta_l)$ is a normal distribution with covariance matrix $V_\beta^*=(x^\top\Sigma_\psi^{-1}x+\Sigma_\beta^{-1})^{-1}$ and mean $\mu_\beta^*=V_\beta^*(x^\top\Sigma_\psi^{-1}E_{q^*(\gamma)}(\psi_l)+\Sigma_\beta^{-1}\mu_\beta)$, where the elements of $E_{q^*(\gamma)}(\psi_l)$ are given by (19) and $\Sigma_\psi$ is a diagonal matrix of size $n$ whose elements $\sigma_{\psi,il}^2$ have inverses given in equation (9).
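Putting the pieces together, a hedged R sketch of the coordinate ascent loop for this scheme; for clarity it holds the regression term fixed at a logit offset eta (under the logit link, $\log p_i(\beta) - \log(1-p_i(\beta)) = x_i\beta$, so only this difference enters the responsibilities), whereas the full algorithm also refreshes $q^*(\beta)$ each sweep as described above:

```r
## Minimal CAVI sketch for the normal-mixture VB updates of Sections 4.2.1-4.2.3.
cavi_normal_mix <- function(y, eta, tau02, tau12, a0, b0, iters = 100) {
  n <- length(y)
  alpha <- rep(0.5, n); m0 <- 0; m1 <- mean(y); s02 <- s12 <- 1
  A0 <- a0 + n / 2; B0 <- b0 + var(y)            # B0 is an arbitrary start
  for (it in seq_len(iters)) {
    Einv <- A0 / B0                              # E_q[1/sigma^2]
    # q*(gamma_i): only l1 - l0 matters, and the logit collapses it to eta
    l1 <- eta - 0.5 * Einv * ((y - m1)^2 + s12)
    l0 <-      - 0.5 * Einv * ((y - m0)^2 + s02)
    alpha <- 1 / (1 + exp(l0 - l1))
    # q*(mu_0), q*(mu_1): Section 4.2.2
    s02 <- 1 / (Einv * sum(1 - alpha) + 1 / tau02)
    m0  <- s02 * Einv * sum((1 - alpha) * y)
    s12 <- 1 / (Einv * sum(alpha) + 1 / tau12)
    m1  <- s12 * Einv * sum(alpha * y)
    # q*(sigma^2): Section 4.2.3
    B0  <- b0 + 0.5 * sum(alpha * ((y - m1)^2 + s12) +
                          (1 - alpha) * ((y - m0)^2 + s02))
  }
  list(alpha = alpha, m0 = m0, m1 = m1, s02 = s02, s12 = s12, A0 = A0, B0 = B0)
}
```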

4.3. Results

To sample from the posterior distribution, we use the Gibbs sampler scheme with the posterior computations described in Section 3.2. We ran two chains for $M=150{,}000$ steps. We calculated Gelman and Rubin's convergence diagnostic (Gelman & Rubin, 1992) using the R package coda; the statistics and plots showed convergence for all parameters, leading to a burn-in of $M/2=75{,}000$ steps, and samples were taken every 100 steps. Table 2 presents the posterior estimates, using one of the chains, of the parameters using the logit link function and t, Cauchy, and normal priors for the parameters in $\beta$, for the linear and nonlinear functional models. The value $\hat{p}$ in each model represents

Table 2.

Posterior mean (standard deviation) of μ0,μ1,σ2, θ, and p for different prior distributions and linear and nonlinear functional models

Logit link

Parameter    Nonlinear model                                 Linear model
             Student-t       Normal          Cauchy          Student-t       Normal          Cauchy
μ0           0.88 (0.48)     0.87 (0.47)     0.89 (0.48)     0.86 (0.61)     0.82 (0.61)     0.89 (0.60)
μ1           11.52 (1.18)    11.57 (1.18)    11.44 (1.16)    10.81 (1.57)    10.72 (1.52)    10.86 (1.55)
Intercept    0.008 (0.005)   0.008 (0.005)   0.008 (0.005)   −1.07 (0.34)    −1.07 (0.34)    −1.26 (0.79)
Sex          0.34 (0.21)     0.34 (0.21)     0.30 (0.19)     −1.85 (0.48)    −1.85 (0.45)    −1.79 (0.54)
Chronicity   0.43 (0.19)     0.44 (0.19)     0.39 (0.17)     −1.76 (0.45)    −1.80 (0.44)    −1.70 (0.54)
σ2           15.82 (2.30)    15.81 (2.34)    15.95 (2.48)    17.21 (3.00)    17.08 (2.85)    17.32 (2.93)
p̂            0.18            0.16            0.18            0.15            0.16            0.15

$\hat{p} = \frac{1}{96}\sum_{i=1}^{96}\mathbf{1}(\hat{p}_i > 1/2)$.

Notice that all results in Table 2 are very similar. To choose the best model, we compute posterior predictive checks (Bayarri & Berger, 2000; Gelman & Meng, 1996): for each $t=-5,0,5,$ and $10$ and each model, we sample $\Theta^{1,t},\ldots,\Theta^{M,t}$ from the posterior distribution and generate $\tilde{y}_1^{rt},\ldots,\tilde{y}_{96}^{rt}$, $r=1,\ldots,M$, from the likelihood, considering the samples after burn-in and thinning of 100, leading to $M=750$ samples. Then we compute

$$h(\tilde{y}\mid t) = \frac{1}{M}\sum_{r=1}^{750}\frac{1}{96}\sum_{i=1}^{96}\mathbf{1}(\tilde{y}_i^{rt} > t)$$

as shown in Table 3.
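A hedged R sketch of this check for the linear logit model; draws is a hypothetical container holding the 750 retained posterior samples (a matrix beta and vectors mu0, mu1, sigma2), and X is the covariate matrix:

```r
## Posterior predictive check: proportion of replicated responses exceeding t.
ppc <- function(draws, X, t) {
  M <- nrow(draws$beta)
  mean(vapply(seq_len(M), function(r) {
    p  <- plogis(X %*% draws$beta[r, ])          # p_i(beta) under draw r
    g  <- rbinom(nrow(X), 1, p)                  # latent labels
    yr <- rnorm(nrow(X), ifelse(g == 1, draws$mu1[r], draws$mu0[r]),
                sqrt(draws$sigma2[r]))           # replicated responses
    mean(yr > t)
  }, numeric(1)))
}
```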

Table 3.

Posterior predictive checks as a function of the threshold (t) for different prior distributions and linear and nonlinear functional models with logit link

t     Nonlinear model                  Linear model
      Student-t   Normal   Cauchy     Student-t   Normal   Cauchy
−5    0.51        0.51     0.53       0.44        0.45     0.46
0     0.61        0.61     0.63       0.62        0.62     0.63
5     0.34        0.33     0.36       0.47        0.48     0.48
10    0.44        0.47     0.47       0.39        0.39     0.40

Gelman (2013) showed that posterior predictive values do not have a uniform distribution under the null hypothesis, but rather tend to be concentrated near 0.5; therefore, the closer to 0.5, the better the model. Although the nonlinear model is slightly better than the linear model for $t=-5$ and $t=10$, as shown in Table 3, independently of the prior distribution, by parsimony we chose the linear model fitted with the normal prior distribution and consider this model in what follows.

Figure 3 shows histograms of the change in HAM-D (baseline − week 1), showing the amount of improvement in depression symptoms after 1 week; a positive change indicates improvement in symptoms. The right panel presents the fitted distribution $0.84\times N(0.82, 17.08) + 0.16\times N(10.72, 17.08)$ chosen by the linear model with normal prior and logit link. For comparison, the left panel presents the parametric fit from the EM algorithm for a mixture of Gaussian distributions (different means and different variances) with no covariates. Comparing the left and right panels of Figure 3, we can see that the fitted model has a much better ability to discriminate between the two groups, since the two distributions have less overlap.

In order to classify subjects as early responders, we used the maximum posterior estimate of $p(\gamma_i\mid y)$, see Figure 4. Specifically, 10 patients exhibit near-certain early responder characteristics (posterior mean probability greater than 0.9), while an additional 5 patients show a high likelihood of being early responders (posterior mean probability in the range $[0.5, 0.9]$).

Figure 4. Posterior mean for the probability of being an early responder for the 96 patients.

For comparison between MCMC and VB results, Table 4 shows the estimated posterior expectations of the parameters, based on the posterior mean for MCMC and on the expectation of the variational distribution ($E_{q^*}$) for VB, for the linear and nonlinear models with logit link. We present only the results for the normal prior, as we operate VB with the normal prior only (cf. Table 2). Although point estimates obtained by MCMC and VB appear discrepant, comparing results for the linear logit model with normal prior, except for the intercept and the female effect, all VB estimates of the posterior expectations lie inside the MCMC 95% highest posterior density credible interval.

Table 4.

Estimated posterior expectation of the parameters and the computational time for the normal mixture model using MCMC (posterior mean, PM, and highest posterior density interval, HPDI) and VB ($E_{q^*}[\cdot]$) approaches

                 VB                         MCMC
Parameter        Nonlinear    Linear        Nonlinear PM   Linear PM   Linear HPDI
μ0               1.40         1.13          0.87           0.82        (0.26; 2.08)
μ1               13.34        12.38         11.57          10.72       (7.80; 13.56)
σ2               18.14        16.68         15.81          17.08       (11.78; 22.64)
Intercept (θ0)   0.01         0.30          0.008          −1.07       (−1.81; −0.48)
Female (θ1)      0.14         0.93          0.34           −1.85       (−2.73; −1.04)
Chronicity (θ2)  0.13         −1.39         0.44           −1.80       (−2.68; −1.12)
Time             12 min                     80 min

Note. MCMC = Markov chain Monte Carlo; VB = variational Bayes.

For the linear model, we estimate the weight functions $w_j$, $j=1,\ldots,14$, which quantify the extent to which resting-state EEG alpha and theta power in the posterior region of the brain under a closed-eyes condition helps identify a potential early-responder subgroup (believed to consist of subjects susceptible to non-specific placebo effects). To understand the variability in the estimates of the weight functions, we use functional boxplots, the analogue of the usual boxplot: a graphical method displaying five descriptive statistics: the median, the first and third quartiles, and the non-outlying minimum and maximum observations. The functional boxplot proposed by Sun and Genton (2011) is similar in spirit to the classical 'box-and-whisker plot' and displays the functional median, the envelope of the 50% central region, and the maximum envelope, based on the ranks of functional data induced by the modified band depth proposed by López-Pintado and Romo (2009). Given the ranks, it is possible to order the functional data according to decreasing depth values. This ordering is taken from the centre outwards, so the functional observation with the largest depth value has the most central position and is called the functional median. The sample 50% central region is the set $\{(t,y(t)): y(t)$ is sandwiched between the pointwise minimum and maximum of the half of the ranked observations with the largest depths$\}$. Notice that the functional median is one of the curves, whereas the 50% central region consists of pieces from different observations and is not, in general, one of the original observations. The maximum envelope is constructed after removing outliers and is determined by inflating the envelope of the 50% central region by 1.5 times its range for each $t$.
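These plots can be produced with the fbplot function of the R package fda, which implements the Sun and Genton (2011) functional boxplot; in this hedged sketch, W and freq are our placeholder names for a matrix of posterior draws of one weight function evaluated on a common frequency grid:

```r
library(fda)  # fbplot implements the functional boxplot of Sun & Genton (2011)

## Columns of W are posterior draws of one weight function w_j on the grid
## 'freq'; MBD is the modified band depth of Lopez-Pintado & Romo (2009).
fbplot(fit = W, x = freq, method = "MBD",
       xlab = "Frequency (Hz)", ylab = "w(t)")
abline(h = 0, lty = 2)  # reference line for assessing w_j(t) = 0
```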

In the literature, depression is associated with impaired right parietal cortex function, see Bruder et al. (1995) and Heller and Nitscke (1997) and references therein. In particular, Stewart et al. (2011) suggest using asymmetry scores to measure parietal brain asymmetry, in particular, comparing P5 (right) with P6 (left) and P7 (right) with P8 (left). Figure 5 displays functional boxplots for the weight functions $w_j(t)$, $j=3,5,6$. These weight functions were found to be the only significant ones for predicting early responders using the linear model with the logit link and a normal prior. They are associated with the EEGs located at positions P5, P6, and P7, aligning with the observation mentioned earlier. Notably, higher EEG signals in the frequency range of 20–45 Hz for P5, 30–45 Hz for P6, and 16–35 Hz for P7 correspond to an increased probability of being an early responder.

Figure 5. Functional boxplots for the significant weight functions using the linear model with logit link function and normal prior. The shaded area is the 50% central region, and red dashed curves are outlier candidates.

5. Predicting illness for milking cows based on functional covariates measured 30 days before lactation

In this section, we analyse the case where the observed sample is given by $Y_1,Y_2,\ldots,Y_n$, independent non-negative integer-valued random variables, such that

$$P(Y_i=y_i\mid\lambda_1,\lambda_2,\gamma_i) = \left[I(y_i=0)\right]^{\gamma_{i0}}\left[\frac{1}{y_i!}e^{-\lambda_1}\lambda_1^{y_i}\right]^{\gamma_{i1}}\left[\frac{1}{y_i!}e^{-\lambda_2}\lambda_2^{y_i}\right]^{\gamma_{i2}} \qquad (20)$$

for $y_i=0,1,\ldots$, where $\gamma_i=(\gamma_{i0},\gamma_{i1},\gamma_{i2})$ are latent Multinomial$(1,p_{i0},p_{i1},p_{i2})$ random vectors with

$$p_{i0} = \frac{1}{1+\exp(x_i\beta_1)+\exp(x_i\beta_2)}, \qquad p_{il} = \frac{\exp(x_i\beta_l)}{1+\exp(x_i\beta_1)+\exp(x_i\beta_2)}, \quad l=1,2$$

Therefore, the vector of unknowns is $\nu=(\lambda_1,\lambda_2,\beta_1,\beta_2,\gamma)=(\Theta,\gamma)$. Here, $x_i$ is row $i$ of the matrix of covariates.

For the model parameters, we propose using the following priors:

  • $\lambda_1\sim\text{Gamma}(a_1,b_1)$, $\lambda_2\sim\text{Gamma}(a_2,b_2)$.

  • Elements in $\beta_l=(\theta_l,\phi_{1l},\ldots,\phi_{Jl})$ will have independent t prior distributions.

The complete likelihood is given by

$$f(y,\gamma\mid\Theta) = \prod_{i=1}^{n}\left[p_{i0}\,I(y_i=0)\right]^{\gamma_{i0}}\left[p_{i1}\frac{1}{y_i!}e^{-\lambda_1}\lambda_1^{y_i}\right]^{\gamma_{i1}}\left[p_{i2}\frac{1}{y_i!}e^{-\lambda_2}\lambda_2^{y_i}\right]^{\gamma_{i2}} \qquad (21)$$

The joint augmented posterior distribution is the product of the likelihood and priors specified above and has no closed form. Therefore, we adapt the Gibbs sampling algorithm to sample from the posterior distribution of Θ and the latent variables γ.

5.1. Full conditional posterior distributions

The posterior distribution of parameters is obtained based on a Gibbs sampling scheme. For that, we calculate the full conditional posterior distribution of parameters in Θ, similarly to the one described in Section 4.1.

5.1.1. Full conditional posterior distribution of γi

Let $\gamma_{-i,l}$ be the vector $\gamma_l=(\gamma_{1l},\ldots,\gamma_{nl})$ without observation $\gamma_{il}$, $l=0,1,2$. The full conditional posterior distribution of $\gamma_{il}$ is given by

$$P(\gamma_{il}=1\mid\Theta,y,\gamma_{-i,l}) = \frac{e^{-\lambda_l}\lambda_l^{y_i}\exp(x_i\beta_l)}{I(y_i=0)+e^{-\lambda_1}\lambda_1^{y_i}\exp(x_i\beta_1)+e^{-\lambda_2}\lambda_2^{y_i}\exp(x_i\beta_2)} \qquad (22)$$

for $l=1,2$ and

$$P(\gamma_{i0}=1\mid\Theta,y,\gamma_{-i,0}) = \frac{I(y_i=0)}{I(y_i=0)+e^{-\lambda_1}\lambda_1^{y_i}\exp(x_i\beta_1)+e^{-\lambda_2}\lambda_2^{y_i}\exp(x_i\beta_2)}. \qquad (23)$$

5.1.2. Full conditional posterior distribution of λ1 and λ2

We update $\lambda_l$, $l=1,2$, using a gamma distribution with parameters

$$a_l+\sum_{i=1}^{n}y_i\gamma_{il} \qquad\text{and}\qquad b_l+\sum_{i=1}^{n}\gamma_{il}, \qquad l=1,2.$$
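A hedged R sketch of these two updates; eta1 and eta2 stand for the current linear predictors $x_i\beta_1$ and $x_i\beta_2$ (dpois is safe here because the $1/y_i!$ factors either cancel across the two Poisson classes or, when $y_i=0$, equal one):

```r
## Gibbs updates for the ZIMP latent labels (eqs (22)-(23)) and Poisson rates.
update_gamma_zimp <- function(y, lam1, lam2, eta1, eta2) {
  w0 <- as.numeric(y == 0)               # pure-zero class weight
  w1 <- dpois(y, lam1) * exp(eta1)
  w2 <- dpois(y, lam2) * exp(eta2)
  W  <- cbind(w0, w1, w2) / (w0 + w1 + w2)
  t(apply(W, 1, function(p) rmultinom(1, size = 1, prob = p)))
}

update_lambda_zimp <- function(y, G, a, b) {
  # G is the n x 3 indicator matrix returned by update_gamma_zimp
  c(rgamma(1, a[1] + sum(y * G[, 2]), b[1] + sum(G[, 2])),
    rgamma(1, a[2] + sum(y * G[, 3]), b[2] + sum(G[, 3])))
}
```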

5.1.3. Full conditional posterior distribution of $\beta_l=(\theta_l,\phi_{1l},\ldots,\phi_{Jl})$, $l=1,2$

Computations of the full conditional posterior distribution of $\beta_l$ follow from Section 3.2.1.

5.2. VB of the ZIMP model

Analogously to the normal mixture case, we define the variational densities as

$$q^*(\nu\mid\kappa) = q^*(\lambda_1\mid\kappa)\,q^*(\lambda_2\mid\kappa)\,q^*(\beta_1\mid\kappa)\,q^*(\beta_2\mid\kappa)\,q^*(\gamma\mid\kappa)$$

where $\kappa=(\alpha,\psi_1,\zeta_1,\psi_2,\zeta_2,\mu_{\beta_1}^*,\mu_{\beta_2}^*,\text{vec}(V_{\beta_1}^*),\text{vec}(V_{\beta_2}^*))$ is the vector of variational parameters. In all cases, the variational densities $q^*(\beta_1\mid\kappa)$ and $q^*(\beta_2\mid\kappa)$ have exactly the same computations as in the mixture of two normal distributions model; see Section 4.2.4. The vector of unknowns is $\nu=(\lambda_1,\lambda_2,\beta_1,\beta_2,\gamma)=(\Theta,\gamma)$, where $\gamma=(\gamma_1,\ldots,\gamma_n)$.

Again, to simplify the notation, we will omit the dependence on κ when writing the variational distributions q*. The details of computations can be found in the online supplementary material.

5.2.1. Variational density q*(γi)

If we take $q^*(\lambda_1)$ and $q^*(\lambda_2)$ in the gamma family of distributions with parameters $(\psi_1,\zeta_1)$ and $(\psi_2,\zeta_2)$, respectively, then $\gamma_i$ is a multinomial random vector Multinomial$(1,\alpha_{i0},\alpha_{i1},\alpha_{i2})$ with $\alpha_{il}=\rho_{il}/\sum_{j=0}^{2}\rho_{ij}$, for $l=0,1,2$, where

$$\rho_{i0} = I(y_i=0), \qquad \rho_{il} = \exp\left\{-\frac{\psi_l}{\zeta_l} + y_i\left(\Psi(\psi_l)-\log(\zeta_l)\right) + E_{q^*(\beta_l)}[x_i\beta_l]\right\} \quad\text{for } l=1,2$$

5.2.2. Variational density q*(λ1) and q*(λ2)

The variational distribution of $\lambda_l$, $l=1,2$, is a gamma density with parameters

$$\psi_l := a_l+\sum_{i=1}^{n}\alpha_{il}y_i \qquad\text{and}\qquad \zeta_l := b_l+\sum_{i=1}^{n}\alpha_{il}$$
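A hedged R sketch of the two ZIMP variational updates; eta1 and eta2 stand for $E_{q^*(\beta_l)}[x_i\beta_l]$, and the gamma family is parameterized by shape $\psi_l$ and rate $\zeta_l$, so that $E[\lambda_l]=\psi_l/\zeta_l$ and $E[\log\lambda_l]=\Psi(\psi_l)-\log\zeta_l$:

```r
## Variational updates for the ZIMP model (Sections 5.2.1-5.2.2).
vb_responsibilities <- function(y, psi, zeta, eta1, eta2) {
  r0 <- as.numeric(y == 0)
  r1 <- exp(-psi[1] / zeta[1] + y * (digamma(psi[1]) - log(zeta[1])) + eta1)
  r2 <- exp(-psi[2] / zeta[2] + y * (digamma(psi[2]) - log(zeta[2])) + eta2)
  cbind(r0, r1, r2) / (r0 + r1 + r2)   # rows are (alpha_i0, alpha_i1, alpha_i2)
}

vb_lambda_zimp <- function(y, A, a, b) {
  # A is the n x 3 matrix of responsibilities alpha_il
  list(psi  = c(a[1] + sum(A[, 2] * y), a[2] + sum(A[, 3] * y)),
       zeta = c(b[1] + sum(A[, 2]),     b[2] + sum(A[, 3])))
}
```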

5.3. Results

For the MCMC, we ran two chains for 150,000 steps. We calculated Gelman and Rubin's convergence diagnostic (Gelman & Rubin, 1992) using the R package coda; the statistics and plots showed convergence for all parameters, leading to a burn-in of 75,000 steps, and samples were taken every 100 steps, resulting in 750 samples. After fitting the model, Table 5 shows the classification using the posterior mean of the latent variable $\gamma$ (the classification according to the maximum value of the estimated probability is the same using MCMC and VB estimates). Of the 138 cows in the training set that did not get sick during lactation, 136 were classified as belonging to the pure-zero class with high probability, while the other 2 cows were classified as low incidence. All 41 cows that were sick between 1 and 4 days were classified as low incidence, whereas one of the two cows diagnosed with a health disorder for 5 days was classified as low incidence and the other as high incidence. All cows with 6 or more days with a health disorder were classified as high incidence.

Table 5.

Classification for the training set according to the class presenting the maximum mean value of the posterior distribution and VB estimation of the latent variable γ

Class             Number of sick days
                    0   1   2   3   4   5   6   7   8   9  11  12  13  18
Pure zero         136   0   0   0   0   0   0   0   0   0   0   0   0   0
Low incidence       2  20  10   6   5   1   0   0   0   0   0   0   0   0
High incidence      0   0   0   0   0   1   5   4   6   5   2   3   1   1

Note. VB = variational Bayes.

The posterior mean, standard deviation, and quantiles of the posterior distribution, as well as the VB mean, are shown in Table 6 and the left panel of Figure 6. Notice that the estimates of the mean and standard deviation of the posterior distribution of $\lambda_1$ and $\lambda_2$ are very similar under VB and MCMC, with the VB method presenting smaller standard deviations. Moreover, the quantiles from the MCMC method indicate that the 95% credible intervals of $\lambda_1$ (1.20; 2.13) and $\lambda_2$ (7.53; 9.83) clearly discriminate the classes. On the other hand, from Figure 6, we can see that the VB estimates of the intercept and the scalar coefficients are not so good for the coefficients that are not significant, whereas they are similar to the MCMC estimates for the significant ones. Direct comparison of the coefficients of the expansion for the weights of the functional covariates is not meaningful, so we constructed the functional boxplot for the posterior estimates of the weight functions, as shown in Figure 7. To understand these plots, recall that functional boxplots are similar to the usual boxplots: the shaded region comprises the 50% central functional observations and can be thought of as a 50% credible band. Notice that, apparently, none of the functional covariates is significant in the regression term for the low incidence class, since the horizontal line at zero is almost completely contained inside the shaded regions. The covariate activity seems to be significant in determining the probability of the high incidence latent class: high activity between 30 and 20 days prior to calving decreases this probability, whereas high activity between 20 and 5 days prior to calving increases the probability of belonging to the high incidence class.

Table 6.

VB mean, posterior mean, and quantiles of the MCMC sample for the parameters of the zero-inflated mixture of two Poisson distributions, using the logit link and Cauchy prior

                                           Quantiles (MCMC)
Parameter   VB mean (SD)   MCMC mean (SD)   2.5%   25%    50%    75%    97.5%
λ1          1.60 (0.18)    1.63 (0.24)      1.20   1.47   1.62   1.79   2.13
λ2          8.45 (0.53)    8.64 (0.59)      7.53   8.24   8.63   9.03   9.83
Time        20.35 s        2.55 h

Note. VB = variational Bayes; MCMC = Markov chain Monte Carlo.

Figure 6. Histograms of the posterior distributions of λ1 and λ2, and of the intercepts and coefficients related to the scalar variables (θ1, θ2, θ3, and θ4) for the ZIMP model. ZIMP = zero-inflated mixture Poisson.

Figure 7. Functional boxplots for the weight functions corresponding to the covariates 'activity', 'number of rest times', 'average rest time', and 'total rest time', respectively, under the linear model with logit link function and Cauchy prior. Left panels correspond to class 1 (low incidence) and right panels to class 2 (high incidence).

Figure 8 shows, as stacked barplots, the estimated probabilities of belonging to the pure-zero, low incidence, and high incidence classes, $(\hat{p}_{i0},\hat{p}_{i1},\hat{p}_{i2})$, for the 50 cows in the test set. The left panel presents the 37 cows that did not get sick, whereas the right panel displays the 13 cows that were sick for at least a day. It is important to note that the current model does not exhibit strong predictive power. There are various prepartum risk factors linked to postpartum diseases. According to studies by Urton et al. (2005) and Huzzey et al. (2007), reduced feeding times and dry matter intake before parturition serve as predictors of metritis. Additionally, Riekerink et al. (2007) report a seasonal effect associated with mastitis. Further research by Pérez-Báez et al. (2019a, 2019b) explores the connection between prepartum dry matter intake, energy balance, and subsequent postpartum health disorders. Consequently, in future investigations, it is advisable to incorporate some of these factors as additional covariates for a more comprehensive model.

Figure 8. Estimated probability of belonging to the pure-zero, low, and high incidence classes, $(\hat{p}_{i0},\hat{p}_{i1},\hat{p}_{i2})$, $i=1,\ldots,50$, for the cows in the test set using the ZIMP model. The left panel represents the cows that did not get sick. The numbers under the bars in the right panel represent the number of sick days. ZIMP = zero-inflated mixture Poisson.

6. Discussion

In this paper, we consider two applications where the heart of the problem was to classify individuals based solely on the covariates. We proposed a classification procedure based on a mixture model driven by latent variables under the Bayesian paradigm, using a semi-parametric regression model that incorporates functional covariates as predictors for the latent group membership. Other methods could be used to analyse such data, for example regression trees or regression mixture models, but all of them would have to be adapted to deal with functional data. In the case of regression trees, one approach to including the functional covariates would be to discretize each functional predictor by partitioning its domain into several disjoint intervals and using the mean values computed within each interval as covariates; however, this returns to the approach of treating functional data as multivariate vectors. Using regression mixture models would almost double the number of parameters to be estimated. On the other hand, Bayesian analysis not only provides information on the whole distribution of the parameters, going beyond a point or an interval estimate, but also simplifies the estimation of latent models, since the parameter space can be augmented. In this work, we used a Gibbs sampler due to the conjugacy of the priors; other choices, such as the Metropolis-Hastings algorithm, would be possible. Although the VB approach is computationally much faster, the simulations show that MCMC produces more reliable estimates. Computational times are also hard to compare, since they depend on the chain length for MCMC and on the required precision for VB.
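
To make the interval-mean discretization just described concrete, the sketch below illustrates it in R; the time grid, the number of intervals, and the example curve are illustrative assumptions rather than quantities from our applications.

    # Minimal sketch (illustrative data): discretize a functional covariate
    # by partitioning its domain into disjoint intervals and using the
    # within-interval means as scalar covariates.
    set.seed(3)
    t_grid <- seq(0, 1, length.out = 200)
    x_t <- sin(2 * pi * t_grid) + rnorm(200, sd = 0.1)  # one observed curve
    K <- 5                                              # number of intervals
    interval <- cut(t_grid, breaks = K, labels = FALSE) # interval membership
    x_means <- tapply(x_t, interval, mean)              # K scalar covariates

The K interval means could then be fed to any standard classifier, but, as noted above, this reduces the curve to a multivariate vector and discards its functional structure.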

The main features of our methodology are:

  1. The nonparametric approach of expanding the unknown functions wj and Fj into B-spline bases reduces the dimension of the problem (a minimal sketch follows this list).

  2. Linear and nonlinear regressions can be implemented.

  3. It can be used with any link function g.

  4. It incorporates Student-t prior information for the regression coefficients by applying an efficient approximate EM algorithm already implemented in R.

  5. It can be used with arbitrary mixture component distributions.

  6. The functional covariates do not need to be observed concurrently and they can even be different for each subject.

  7. In the case of the linear model, it has the added advantage of interpretability of the weight functions which might naturally incorporate prior information that is available to experts in the field.
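
As a minimal illustration of feature 1, the sketch below projects one observed curve onto a low-dimensional B-spline basis using the base R splines package; the grid, the example curve, and the basis dimension are illustrative assumptions, not our fitted specification.

    # Minimal sketch (illustrative data): represent a functional covariate
    # by a few B-spline basis coefficients instead of its raw observations.
    library(splines)
    set.seed(4)
    t_grid <- seq(0, 1, length.out = 200)
    x_t <- cos(2 * pi * t_grid) + rnorm(200, sd = 0.1)  # one observed curve
    B <- bs(t_grid, df = 8, intercept = TRUE)   # 200 x 8 B-spline design matrix
    coef_x <- coef(lm(x_t ~ B - 1))             # 8 coefficients summarize the curve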

The strength of our method is demonstrated by the simulation study as well as the analysis of the two data sets.


Acknowledgements

A special thanks to Alberto Saa for helping to solve a computational bottleneck and Guilherme J.M. Rosa for fruitful discussions. We thank the referees for their thoughtful comments and suggestions that improved the quality of the manuscript. Any opinions, findings, conclusions, or recommendations expressed in this publication are those of the authors and do not necessarily reflect the view of the National Institute of Food and Agriculture or the United States Department of Agriculture.

Contributor Information

Nancy L Garcia, Department of Statistics, Universidade Estadual de Campinas, Campinas, Brazil.

Mariana Rodrigues-Motta, Department of Statistics, Universidade Estadual de Campinas, Campinas, Brazil.

Helio S Migon, Department of Statistics, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil.

Eva Petkova, Department of Population Health, Grossman School of Medicine, New York University, New York, USA; Department of Child and Adolescent Psychiatry, Grossman School of Medicine, New York University, New York, USA.

Thaddeus Tarpey, Department of Population Health, Grossman School of Medicine, New York University, New York, USA.

R Todd Ogden, Department of Biostatistics, Columbia University, New York, USA.

Julio O Giordano, College of Agriculture and Life Sciences, Cornell University, Ithaca, USA.

Martin M Perez, College of Agriculture and Life Sciences, Cornell University, Ithaca, USA.

Funding

This work was partially supported by NIMH grant 5 R01 MH099003, USDA National Institute of Food and Agriculture Animal Health Program award 2017-67015-26772 to J.O.G., FAPESP grants 2017/15306-9, 2018/06811-4, 2019/10800-0, and 2023/00592-7, and CNPq grants 302598/2014-6, 442012/2014-4, and 304148/2020-2.

Data availability

The code and data files used in the real data applications and simulation studies are available at https://github.com/nancyg-unicamp/Unsupervised.

Supplementary material

Supplementary material is available online at Journal of the Royal Statistical Society: Series C. As a complement to this paper, we provide a file with an extensive simulation study investigating the performance of the proposed method, as well as a comparison between MCMC and variational Bayes inference.

References

  1. Albert, J. H., & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422), 669–679. 10.1080/01621459.1993.10476321
  2. Bayarri, M., & Berger, J. O. (2000). P-values for composite null models. Journal of the American Statistical Association, 95(452), 1127–1142. 10.2307/2669749
  3. Benaglia, T., Chauveau, D., Hunter, D. R., & Young, D. S. (2009). mixtools: An R package for analyzing finite mixture models. Journal of Statistical Software, 32(6), 1–29. 10.18637/jss.v032.i06
  4. Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.
  5. Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518), 859–877. 10.1080/01621459.2017.1285773
  6. Breiman, L. (2017). Classification and regression trees. Routledge.
  7. Bruder, G. E., Tenke, C. E., Stewart, J. W., Towey, J. P., Leite, P., Voglmaier, M., & Quitkin, F. M. (1995). Brain event-related potentials to complex tones in depressed patients: Relations to perceptual asymmetry and clinical features. Psychophysiology, 32(4), 373–381. 10.1111/psyp.1995.32.issue-4
  8. Cardot, H., Ferraty, F., & Sarda, P. (1999). Functional linear model. Statistics & Probability Letters, 45(1), 11–22. 10.1016/S0167-7152(99)00036-X
  9. Ciarleglio, A., Petkova, E., Ogden, T., & Tarpey, T. (2018). Constructing treatment decision rules based on scalar and functional predictors when moderators of treatment effect are unknown. Journal of the Royal Statistical Society Series C: Applied Statistics, 67(5), 1331–1356. 10.1111/rssc.12278
  10. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1), 1–22. 10.1111/j.2517-6161.1977.tb01600.x
  11. Diebolt, J., & Robert, C. P. (1994). Estimation of finite mixture distributions through Bayesian sampling. Journal of the Royal Statistical Society: Series B (Methodological), 56(2), 363–375. 10.1111/j.2517-6161.1994.tb01985.x
  12. Eilers, P. H., & Marx, B. D. (1996). Flexible smoothing with B-splines and penalties. Statistical Science, 11(2), 89–121. 10.1214/ss/1038425655
  13. Ferraty, F., & Vieu, P. (2009). Additive prediction and boosting for functional data. Computational Statistics & Data Analysis, 53(4), 1400–1413. 10.1016/j.csda.2008.11.023
  14. Frühwirth-Schnatter, S., & Frühwirth, R. (2010). Data augmentation and MCMC for binary and multinomial logit models. In T. Kneib & G. Tutz (Eds.), Statistical modelling and regression structures (pp. 111–132). Springer.
  15. Gelman, A. (2013). Two simple examples for understanding posterior p-values whose distributions are far from uniform. Electronic Journal of Statistics, 7(1), 2595–2602. 10.1214/13-EJS854
  16. Gelman, A., Jakulin, A., Pittau, M. G., & Su, Y.-S. (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 2(4), 1360–1383. 10.1214/08-AOAS191
  17. Gelman, A., & Meng, X.-L. (1996). Model checking and model improvement. In W. R. Gilks, S. Richardson, & D. Spiegelhalter (Eds.), Markov chain Monte Carlo in practice (pp. 189–201). Springer.
  18. Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7(4), 457–472. 10.1214/ss/1177011136
  19. Gramacy, R. B., & Polson, N. G. (2012). Simulation-based regularized logistic regression. Bayesian Analysis, 7(3), 567–590. 10.1214/12-BA719
  20. Gueorguieva, R., Mallinckrodt, C., & Krystal, J. H. (2011). Trajectories of depression severity in clinical trials of duloxetine: Insights into antidepressant and placebo responses. Archives of General Psychiatry, 68(12), 1227–1237. 10.1001/archgenpsychiatry.2011.132
  21. Hans, C. (2009). Bayesian lasso regression. Biometrika, 96(4), 835–845. 10.1093/biomet/asp047
  22. Held, L., & Holmes, C. C. (2006). Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Analysis, 1(1), 145–168. 10.1214/06-BA105
  23. Heller, W., & Nitscke, J. B. (1997). Regional brain activity in emotion: A framework for understanding cognition in depression. Cognition & Emotion, 11(5–6), 637–661. 10.1080/026999397379845a
  24. Hoffman, M. D., Blei, D. M., Wang, C., & Paisley, J. (2013). Stochastic variational inference. Journal of Machine Learning Research, 14(5), 1303–1347. https://dl.acm.org/doi/10.5555/2567709.2502622
  25. Huzzey, J., Veira, D., Weary, D., & von Keyserlingk, M. (2007). Prepartum behavior and dry matter intake identify dairy cows at risk for metritis. Journal of Dairy Science, 90(7), 3220–3233. 10.3168/jds.2006-807
  26. James, G. M. (2002). Generalized linear models with functional predictors. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3), 411–432. 10.1111/1467-9868.00342
  27. Jiang, B., Petkova, E., Tarpey, T., & Ogden, R. T. (2017). Latent class modeling using matrix covariates with application to identifying early placebo responders based on EEG signals. The Annals of Applied Statistics, 11(3), 1513–1536. 10.1214/17-AOAS1044
  28. Kim, J. S., Staicu, A.-M., Maity, A., Carroll, R. J., & Ruppert, D. (2018). Additive function-on-function regression. Journal of Computational and Graphical Statistics, 27(1), 234–244. 10.1080/10618600.2017.1356730
  29. Kolda, T. G., & Bader, B. W. (2009). Tensor decompositions and applications. SIAM Review, 51(3), 455–500. 10.1137/07070111X
  30. Leuchter, A. F., Cook, I. A., Witte, E. A., Morgan, M., & Abrams, M. (2002). Changes in brain function of depressed subjects during treatment with placebo. American Journal of Psychiatry, 159(1), 122–129. 10.1176/appi.ajp.159.1.122
  31. López-Pintado, S., & Romo, J. (2009). On the concept of depth for functional data. Journal of the American Statistical Association, 104(486), 718–734. 10.1198/jasa.2009.0108
  32. McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (2nd ed.). Chapman & Hall.
  33. McLachlan, G. J., & Peel, D. (2004). Finite mixture models. John Wiley & Sons.
  34. McLean, M. W., Hooker, G., Staicu, A.-M., Scheipl, F., & Ruppert, D. (2014). Functional generalized additive models. Journal of Computational and Graphical Statistics, 23(1), 249–269. 10.1080/10618600.2012.729985
  35. Nunez, P. L., & Srinivasan, R. (2006). Electric fields of the brain: The neurophysics of EEG. Oxford University Press.
  36. Ormerod, J. T., & Wand, M. P. (2012). Gaussian variational approximate inference for generalized linear mixed models. Journal of Computational and Graphical Statistics, 21(1), 2–17. 10.1198/jcgs.2011.09118
  37. Parisi, G. (1988). Statistical field theory. Addison-Wesley.
  38. Park, T., & Casella, G. (2008). The Bayesian lasso. Journal of the American Statistical Association, 103(482), 681–686. 10.1198/016214508000000337
  39. Park, S. Y., Li, C., Mendoza Benavides, S. M., van Heugten, E., & Staicu, A.-M. (2019). Conditional analysis for mixed covariates, with application to feed intake of lactating sows. Journal of Probability and Statistics, 2(1), 1–14. 10.1155/2019/3743762
  40. Pérez-Báez, J., Risco, C., Chebel, R., Gomes, G., Greco, L., Tao, S., Thompson, I., do Amaral, B., Zenobi, M., Martinez, N., Staples, C., Dahl, G., Hernández, J., Santos, J., & Galvão, K. (2019a). Association of dry matter intake and energy balance prepartum and postpartum with health disorders postpartum: Part I. Calving disorders and metritis. Journal of Dairy Science, 102(10), 9138–9150. 10.3168/jds.2018-15878
  41. Pérez-Báez, J., Risco, C., Chebel, R., Gomes, G., Greco, L., Tao, S., Thompson, I., do Amaral, B., Zenobi, M., Martinez, N., Staples, C., Dahl, G., Hernández, J., Santos, J., & Galvão, K. (2019b). Association of dry matter intake and energy balance prepartum and postpartum with health disorders postpartum: Part II. Ketosis and clinical mastitis. Journal of Dairy Science, 102(10), 9151–9164. 10.3168/jds.2018-15879
  42. Polson, N. G., Scott, J. G., & Windle, J. (2013). Bayesian inference for logistic models using Pólya-gamma latent variables. Journal of the American Statistical Association, 108(504), 1339–1349. 10.1080/01621459.2013.829001
  43. R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing.
  44. Riekerink, R. O., Barkema, H., & Stryhn, H. (2007). The effect of season on somatic cell count and the incidence of clinical mastitis. Journal of Dairy Science, 90(4), 1704–1715. 10.3168/jds.2006-567
  45. Rupasov, V. I., Lebedev, M. A., Erlichman, J. S., Lee, S. L., Leiter, J. C., & Linderman, M. (2012). Time-dependent statistical and correlation properties of neural signals during handwriting. PLoS One, 7(9), e43945. 10.1371/journal.pone.0043945
  46. Silverman, B. W. (2018). Density estimation for statistics and data analysis. Routledge.
  47. Stangaferro, M., Wijma, R., Caixeta, L., Al-Abri, M., & Giordano, J. (2016a). Use of rumination and activity monitoring for the identification of dairy cows with health disorders: Part I. Metabolic and digestive disorders. Journal of Dairy Science, 99(9), 7395–7410. 10.3168/jds.2016-10907
  48. Stangaferro, M., Wijma, R., Caixeta, L., Al-Abri, M., & Giordano, J. (2016b). Use of rumination and activity monitoring for the identification of dairy cows with health disorders: Part III. Metritis. Journal of Dairy Science, 99(9), 7422–7433. 10.3168/jds.2016-11352
  49. Stangaferro, M., Wijma, R., Caixeta, L., Al-Abri, M., & Giordano, J. (2016c). Use of rumination and activity monitoring for the identification of dairy cows with health disorders: Part II. Mastitis. Journal of Dairy Science, 99(9), 7411–7421. 10.3168/jds.2016-10908
  50. Stewart, J. L., Towers, D. N., Coan, J. A., & Allen, J. J. (2011). The oft-neglected role of parietal EEG asymmetry and risk for major depressive disorder. Psychophysiology, 48(1), 82–95. 10.1111/psyp.2010.48.issue-1
  51. Sun, Y., & Genton, M. G. (2011). Functional boxplots. Journal of Computational and Graphical Statistics, 20(2), 316–334. 10.1198/jcgs.2011.09224
  52. Titterington, D. M., Smith, A. F., & Makov, U. E. (1985). Statistical analysis of finite mixture distributions. Wiley.
  53. Urton, G., Von Keyserlingk, M., & Weary, D. (2005). Feeding behavior identifies dairy cows at risk for metritis. Journal of Dairy Science, 88(8), 2843–2849. 10.3168/jds.S0022-0302(05)72965-9
  54. Wager, T. D., & Atlas, L. Y. (2015). The neuroscience of placebo effects: Connecting context, learning and health. Nature Reviews Neuroscience, 16(7), 403–418. 10.1038/nrn3976
  55. Wahba, G. (1982). Constrained regularization for ill posed linear operator equations, with applications in meteorology and medicine. In Statistical decision theory and related topics III (pp. 383–418). Academic Press.
  56. Walsh, B. T., Seidman, S. N., Sysko, R., & Gould, M. (2002). Placebo response in studies of major depression: Variable, substantial, and growing. The Journal of the American Medical Association, 287(14), 1840–1847. 10.1001/jama.287.14.1840
  57. Watson, A., El-Deredy, W., Vogt, B. A., & Jones, A. K. (2007). Placebo analgesia is not due to compliance or habituation: EEG and behavioural evidence. Neuroreport, 18(8), 771–775. 10.1097/WNR.0b013e3280c1e2a8
  58. Wegman, E. J., & Wright, I. W. (1983). Splines in statistics. Journal of the American Statistical Association, 78(382), 351–365. 10.1080/01621459.1983.10477977
  59. Zhang, W., & Luo, J. (2009). The transferable placebo effect from pain to emotion: Changes in behavior and EEG activity. Psychophysiology, 46(3), 626–634. 10.1111/psyp.2009.46.issue-3
