2024 Feb 7;73(3):658–681. doi: 10.1093/jrsssc/qlae006

Unsupervised Bayesian classification for models with scalar and functional covariates

Nancy L Garcia 1,✉,2, Mariana Rodrigues-Motta 2, Helio S Migon 3, Eva Petkova 4,5, Thaddeus Tarpey 6, R Todd Ogden 7, Julio O Giordano 8, Martin M Perez 9
PMCID: PMC11271982  PMID: 39072300

Abstract

We consider unsupervised classification by means of a latent multinomial variable which categorizes a scalar response into one of the L components of a mixture model which incorporates scalar and functional covariates. This process can be thought of as a hierarchical model with the first level modelling a scalar response according to a mixture of parametric distributions and the second level modelling the mixture probabilities by means of a generalized linear model with functional and scalar covariates. The traditional approach of treating functional covariates as vectors not only suffers from the curse of dimensionality, since functional covariates can be measured at very small intervals leading to a highly parametrized model, but also fails to take into account the nature of the data. We use basis expansions to reduce the dimensionality and a Bayesian approach for estimating the parameters while providing predictions of the latent classification vector. The method is motivated by two data examples that are not easily handled by existing methods. The first example concerns identifying placebo responders in a clinical trial (normal mixture model) and the second concerns predicting illness for milking cows (zero-inflated mixture of Poisson distributions).

Keywords: functional covariates, latent vector, unsupervised clustering, variable selection, variational inference

1. Introduction

Mixture models are popular statistical tools for classification purposes in a broad range of applied fields. Although the Gaussian mixture model is by far the most widely used approach for model-based cluster analysis, classification problems based on mixture models often require non-Gaussian mixture distributions. We often have extra information on explanatory variables, and many modern applications routinely involve more complex covariates in the form of vectors, matrices, functions, or images. The prevailing approaches in these cases, whether parametric or nonparametric, model the mean of the distribution as a function of the covariates (Cardot et al., 1999; Ferraty & Vieu, 2009; James, 2002; McLean et al., 2014, among others), but there are other applications where the interest is in the effect of the covariates on the entire distribution of the response, for example, quantile regression (e.g. Park et al., 2019, and references therein). On the other hand, some studies have no interest in modelling the mean of the distribution and instead aim at classification, as in studies related to regression trees, as proposed by Breiman (2017). In this case, modelling the probability of classification through covariates is more important than modelling the mean.

Our approach classifies an outcome as accurately as possible using only the information on the scalar and functional covariates as explanatory variables in the modelling of the mixture probabilities, providing practitioners with directly interpretable cluster classification, that is, the coefficients of the regression equation act directly on the probability of belonging to a class. A latent variable is used to model the mixture components and divides the sample of observations into subgroups according to some similarity measure. Many authors have studied the classification problem, see, for example, Titterington et al. (1985) and McLachlan and Peel (2004) and references therein, and there are R packages (R Core Team, 2021) that perform inference for mixture models, such as mixtools (Benaglia et al., 2009), but these tools cannot accommodate one or more functional covariates. When dealing with functional covariates, it is necessary to reduce the dimensionality of the data. One way of doing so is to consider a variation of principal components analysis built upon a low-rank Candecomp/Parafac (CP) decomposition, a popular dimensionality-reduction method for multiway data (Kolda & Bader, 2009). In particular, a CP decomposition imposes a special low-rank structure on the target regression coefficient matrix that explicitly captures the bilinear row and column effects of the matrix covariate (Jiang et al., 2017). Although this method is powerful, it requires all functional covariates to be observed at the same points across observations. Also, viewing the functional covariates as rows of a matrix and applying multivariate-type procedures does not take advantage of the intrinsic ordering of observations (e.g. in time or space) given by the functional structure of the covariate.

We propose a more flexible classification procedure that can incorporate functional covariates in the regression of the mixture probability. Although the dimension reduction shares some characteristics with the approach of Ciarleglio et al. (2018), we focus on the regression of the mixture probability instead of the mean. We consider functional covariates as acting through linear or nonlinear effects, taking advantage of their functional nature. Additionally, the method does not require the functions to be observed at common points, nor to have the same support, as required in Jiang et al. (2017). We employ Bayesian methods for parameter estimation, considering the classical Markov chain Monte Carlo (MCMC) approach as well as variational Bayes (VB) inference, an alternative to the MCMC algorithm that facilitates approximation of the posterior distribution in complex models using massive data sets (Blei et al., 2017; Hoffman et al., 2013; Ormerod & Wand, 2012).

We now present two data examples.

1.1. Data set 1: identification of early responders using electroencephalography data

Placebo responders are those patients whose response is termed ‘non-specific’, e.g. in a drug trial, an improvement in symptoms that is not due to the effect of the active chemicals in the drug. There is an intense debate about how to identify placebo responders in clinical trials of medications, in particular, for major depressive disorder (MDD; Walsh et al., 2002) since there could be placebo responders among either the control or the treatment group. Furthermore, it is known that there is a high rate of placebo responders among patients in MDD treatment trials and in some experiments with selective serotonin reuptake inhibitors it was found that some patients can have a better response using placebo (Gueorguieva et al., 2011). Identifying such patients using covariates would be an important tool in clinical research. Scalar covariates such as sex and disease severity are typically included in the modelling. On the other hand, there are several studies relating differences in neural processing between placebo and active treatments (see, for example, Ciarleglio et al., 2018; Leuchter et al., 2002; Wager & Atlas, 2015; Watson et al., 2007; Zhang & Luo, 2009, and references therein).

Jiang et al. (2017) analysed data from a randomized placebo controlled depression clinical trial of sertraline in order to identify early responders to treatment (which is indicative of a placebo response since it is believed that response to the active treatment is not immediate). The data set consists of 96 MDD patients, randomized to either a drug or placebo treatment. For each subject, several scalar and categorical covariates are available, as well as their resting state electroencephalography (EEG) under a closed eyes condition. These EEG data contain the current source density amplitude spectrum values (V/m2) (Nunez & Srinivasan, 2006) at a total of 14 electrodes (P9, P10, P7, P8, P5, P6, PO7, PO8, PO3, PO4, O1, O2, POZ, and OZ) located in occipital and parietal brain regions. Each electrode is measured at 45 frequencies at a 0.25 Hz resolution within the theta (4–7 Hz) and alpha (7–15 Hz) frequency bands. The response variable for each subject is the Hamilton Depression Rating Scale (HAM-D), measured before the treatment (baseline) and after one week into the study. It is believed that the active drug treatment can only have an effect on symptoms after two weeks. Therefore, any improvement observed after one week is likely to be due to placebo effect (or spontaneous improvement). For more details, we refer to Jiang et al. (2017) and references therein. For EEG location maps, see Figure 7 in Rupasov et al. (2012).

In order to have a baseline to compare our results, we used the hierarchical model proposed by Jiang et al. (2017) with the same scalar covariates, sex and chronicity, and functional covariates given by data taken from 14 EEG electrodes. The unobserved binary subgroup indicators are modelled via a hierarchical probit model as a function of the baseline EEG measurements and other scalar covariates of interest. In their work, instead of the regression component used in our model (cf. (17)), they propose to use the EEG data in the form of a (14×45) matrix-valued covariate. Moreover, they assume a low-dimensional structure through CP decomposition (Kolda & Bader, 2009), reducing the matrix dimension to 9×4. One disadvantage of this approach is that it does not take advantage of the functional nature of the data and it lacks direct interpretability for the estimated parameters. Also, the analysis uses a tensor product and so the results will likely depend on how the electrodes are ordered in the matrix.

The pattern of EEG data varies from one subject to another. Figure 1 shows the EEG plots for nine subjects. There may exist correlations among EEG signals. However, our research question at this point is more concerned with managing computational complexity, and thus we focus on comparing the MCMC and VB methods while improving our Bayesian approach by incorporating the weakly informative default prior distribution proposed by Gelman et al. (2008). Additionally, by simplifying our analysis to treat the EEG curves as independent, we gain more straightforward insights into individual curve characteristics and trends. In forthcoming research, we intend to address correlation among curves by introducing a shared random curve for each EEG signal recorded from the same subject. Nonetheless, this augmentation would greatly increase the complexity of the model and is beyond the scope of the current study.

Figure 1. EEG patterns for nine subjects. EEG = electroencephalography.

The left panel of Figure 3 shows a histogram of the change in HAM-D (baseline − week 1), showing the amount of improvement in depression symptoms after 1 week; a positive change indicates improvement in symptoms. Notice that there is a strong indication of a mixture of two distributions. Also, the choice of two classes is natural in this context since practitioners want to identify early responders. To explore the data set, we fit a parametric model using the expectation-maximization (EM) algorithm for a mixture of Gaussian distributions (different means and different variances) with no covariates, and the two fitted Gaussian curves are shown in the left panel of Figure 3, with more than 40% of the subjects classified as early responders (rightmost curve of the left panel). Notice that the EM fit has low power to discriminate the subjects into two classes.

Figure 3. Histogram of the change in HAM-D (baseline − week 1) showing the amount of improvement in depression symptoms after 1 week for both drug- and placebo-treated patients. The left panel presents the normal mixture with two components given by the EM algorithm without covariates. The right panel shows the fit using the mixture of two normal distributions $0.84\times N(0.82, 17.08) + 0.16\times N(10.72, 17.08)$ chosen by the linear model with normal prior and logit link. Gaussian curves are the mixture component densities. HAM-D = Hamilton Depression Rating Scale; EM = expectation-maximization.

1.2. Data set 2: predicting illness for milking cows based on functional covariates measured 30 days before lactation

Understanding factors that affect the productivity of dairy cattle is essential for optimizing the profitability and sustainability of dairy farms. The transition from late pregnancy to early lactation is one of the most important stages of the lactation cycle of these animals, as cows are at greatest risk of experiencing health disorders during this period. The occurrence of health disorders affects the productivity of cows and is thus a determining factor of the profitability of dairy herds.

A potential strategy to identify cows with health disorders in early lactation for treatment and other interventions is the automated monitoring of cow behavioural, physiological, and performance parameters with automated health monitoring systems based on sensor data. It has been demonstrated that multiple sensor parameters, such as rumination time, physical activity, resting time, body temperature, milk volume, and component yield, are useful for monitoring cow health, as they are dramatically altered during episodes of health disorders (Stangaferro et al., 2016a, 2016b, 2016c) and thus can be used to predict the health status of cows. Moreover, data from these sensor systems can also be combined with non-sensor data to increase the accuracy of alerts used to identify cows with health disorders. Our data set for this application was collected in order to train, validate, and test machine-learning models created through a combination of sensor data from automated health monitoring systems and non-sensor data available at commercial dairy farms. This project, funded by the United States Department of Agriculture-National Institute of Food and Agriculture, was conducted by the Dairy Cattle Biology and Management Laboratory at Cornell University. We have data on daily clinical examination of 258 cows during the first 40 days after calving (days in milk). Information on whether a cow had a health disorder (1) or not (0) was collected by the research team on a daily basis. The data consist of 41 rows (0–40) per cow with sensor and non-sensor data from the previous lactation (i.e. before calving) and after calving. We focused on four sensor parameters related to physical activity for the 30 days prior to calving: activity (total number of steps in a given day divided by 24), number of resting bouts, average rest time, and total rest time per day, see Figure 2. Moreover, in this analysis, in addition to the four functional covariates described above, we considered four scalar variables related to the previous lactation period known to be associated with health outcomes after calving: Z1 = age at first calving, Z2 = previous lactation days in milk, Z3 = previous lactation health event (0 or 1), and Z4 = number of previous lactation health events.

Figure 2. Functional covariates for 10 cows.

The objective of this application was to model the number of days a cow was sick in the first 40 days after calving. Our main goal was to classify cows into three status classes using the scalar and functional covariates related to activity and resting times. There was high variability in the data set, as seen in Table 1. During the observation period, 175 cows were not diagnosed with a health disorder, whereas 83 cows had at least one health disorder event; over all cows, the number of sick days has mean 1.60 and variance 10.30. If we consider only the non-zero values, we get a mean of 4.96 sick days and variance 15.95. As previously noted, we have 68% of zeros in the sample (175/258), so a zero-inflated distribution is likely appropriate to accommodate the overdispersion caused by the excess zeros. On the other hand, considering only the non-zero observations, we still have a large variance compared with the mean. Therefore, we fitted a zero-inflated mixture of two Poisson distributions to accommodate the large number of zeros as well as the overdispersion beyond that.

Table 1.

Frequency of sick days for lactating cows (training + testing set)

Sick days    0   1   2   3   4   5   6   7   8   9  10  11  12  13  18  Total
Freq       175  22  10   7   7   4   5   4   7   6   2   3   3   2   1    258

In this case, we fitted the model to 208 cows randomly selected as a training set and used the remaining 50 cows to evaluate the prediction using the model.

The remainder of this paper is organized as follows. Section 2 introduces the hierarchical mixture model with a latent variable and regression of the mixture probabilities as a function of functional covariates, and a Bayesian approach is presented in Section 3. Sections 4 and 5 present the results for the two data examples. An extensive simulation study is given in the online supplementary material, with the primary goal of examining the performance of the normal mixture model and the zero-inflated mixture Poisson (ZIMP) model with respect to sample size and the ability of the functional curves to predict the latent variables correctly. The secondary goal of the simulation study is to compare the estimation ability of the MCMC and VB methods, as well as their computational cost.

2. General model

For each subject $i=1,\ldots,n$, we observe: $y_i$, the scalar response; $z_i$, a vector of scalar covariates; and $X_i=\{(t_{ijs}, X_{ij}(t_{ijs})),\ 1\le s\le S_{ij},\ 1\le j\le J\}$, $J$ functional covariates observed at discrete points $t_{ijs}$, $1\le s\le S_{ij}$ (which do not need to be the same across subjects), in closed domains $\tau_1,\ldots,\tau_J$. In general, these sets are closed intervals on the real line, and although we interpret $t_{ijs}$ as time in this study, our proposed model works for functional covariates observed at points in space or time-space. Notice that neither the domains in which we observe the functional covariates nor the observation points need to be the same for all covariates. We model the distribution of $y_i$ hierarchically by means of a latent class model, postulating a mixture distribution for the observed response to classify subjects into $L$ classes. We will assume an $L$-mixture latent class model with unobserved multinomial random variables indicating class membership, $\{\gamma_i=(\gamma_{i1},\ldots,\gamma_{iL}),\ i=1,\ldots,n\}$, where $\sum_{l=1}^{L}\gamma_{il}=1$ and $p_{il}=P(\gamma_{il}=1)$. Let $y=(y_1,\ldots,y_n)$ be the vector of responses with dimension $n\times 1$ and $X=\{X_{ij},\ i=1,\ldots,n,\ j=1,\ldots,J\}$ a vector with components $X_{ij}$ of dimension $1\times n_{ij}$, where $n_{ij}$ is the number of observed points of curve $j$ of subject $i$. Additionally, let $f_l(y_i;\lambda_l)$, $i=1,\ldots,n$, be the pdf of the scalar response $y_i$ in class $l$, where $\lambda_l$ is the vector of parameters in that class. The probability of the random variable $\gamma_{il}$ is modelled as a function of the scalar and functional covariates by

$$g(p_{il}) = z_i^\top\theta_l + \sum_{j=1}^{J}\int_{\tau_j} F_{jl}(X_{ij}(t), t, \phi_{jl})\,dt \qquad (1)$$

where $g$ is a known link function (e.g. probit or logit), $z_i$ is a vector of dimension $p\times 1$ representing scalar covariates, and $\theta_l$ is a vector of parameters of dimension $p\times 1$ that captures the linear additive effect of those covariates in class $l$. Finally, $F_{jl}(\cdot,\cdot)$ is a bivariate smooth function related to the $j$th functional covariate in component $l$ which depends on the vector of parameters $\phi_{jl}$. Notice that $F_{jl}$ can be very general: the unrestricted case is termed the functional generalized additive model by McLean et al. (2014), and it includes as a special case the linear form $F_{jl}(X_{ij}(t),t)=w(t)X_{ij}(t)$ (the functional linear regression model). Then the complete-data likelihood can be written as

$$f(y,\gamma\mid\lambda_1,\ldots,\lambda_L,\theta,\phi,z,X) = \prod_{i=1}^{n}\prod_{l=1}^{L}\left[f_l(y_i;\lambda_l)\,p_{il}\right]^{\gamma_{il}} \qquad (2)$$

where the relationship of $p_{il}$ with $\theta,\phi,z_1,\ldots,z_n$, and $X$ is given by (1). Here, $\theta=(\theta_1,\ldots,\theta_L)$ and $\phi=\{\phi_{jl},\ j=1,\ldots,J,\ l=1,\ldots,L\}$ is a vector with components $\phi_{jl}$ of length $p_{jl}$, where $p_{jl}$ is the number of parameters in $\phi_{jl}$.

In general, functional covariates are used to model the mean or any other parameter in $\lambda_l$ that indexes $f_l(\cdot\,;\cdot)$, but not the mixing probabilities $p_{il}$. However, in our applications, the interest lies in using functional covariates solely to classify subjects into $L$ classes.

2.1. Functional linear model

Model (1) can be restricted to be linear by specifying $F_{jl}(x(t),t)=w_{jl}(t)\,x(t)$, yielding the more common generalized functional linear model

$$g(p_{il}) = z_i^\top\theta_l + \sum_{j=1}^{J}\int_{\tau_j} w_{jl}(t)\,X_{ij}(t)\,dt, \qquad l=1,\ldots,L. \qquad (3)$$

To fit model (3), we consider each weight function $w_{jl}(\cdot)$ to be a smooth function belonging to the finite-dimensional space spanned by B-spline basis functions. This is not the only possibility, as other bases could be chosen, such as Fourier expansions, wavelets, natural splines, etc. (Silverman, 2018). Also, we choose the number of knots and the knot placement in an ad hoc manner. Expressing the weight function as a linear combination of basis functions transforms the challenge of selecting $w_j$ from an infinite-dimensional class of functions into a more manageable finite one. Then, two critical issues must be addressed: determining the number of basis functions and optimizing knot placement. A small number of knots is known to cause oversmoothing, while a large number can yield overly rough estimates. It is noteworthy that, after establishing the number of basis functions, poorly placed knots can lead to undesired multimodality or a failure to capture distinct features. However, addressing knot placement is beyond the scope of this work and will not be discussed here; interested readers can refer to Wahba (1982) and Eilers and Marx (1996) for in-depth information. To address both challenges simultaneously, penalized methods such as the Bayesian Lasso recommended by Park and Casella (2008) and Hans (2009) could be employed; however, they are very expensive to implement along with MCMC methods. In our study, we leverage commonly used procedures in nonparametric regression, substantially reducing computational costs. In the applications, the number of knots was based on visual inspection. In the early responder example, we adopt a heuristic approach to distribute knots based on the suggestion of Wegman and Wright (1983): knots are placed at the endpoints (frequencies 16 and 60) and around inflection points, specifically at the more frequent peaks with frequencies 18, 20, 22, 25, 30, 35, 40, 45, 50, and 55. Conversely, in the investigation of cow illnesses, we choose to space knots equidistantly.

For a positive integer $K_j$ and a vector of $(K_j-4)$ interior knots $\Upsilon_j\subset\tau_j$, we express the weight function as

$$w_{jl}(t) = \sum_{k=1}^{K_j}\phi_{jlk}\,B_k^{(j)}(t) \qquad (4)$$

where $\{B_1^{(j)},\ldots,B_{K_j}^{(j)}\}$ are cubic B-spline basis functions determined by $\Upsilon_j$.

Substituting (4) into (3) yields

$$g(p_{il}) = z_i^\top\theta_l + \sum_{j=1}^{J} R_{ij}^\top\phi_{jl} \qquad (5)$$

where, for each pair $(j,l)$, $\phi_{jl}=(\phi_{jl1},\ldots,\phi_{jlK_j})$ is a vector of parameters of dimension $K_j\times 1$ and $R_{ij}=(R_{ij1},\ldots,R_{ijK_j})$ is a $K_j\times 1$ vector, with $R_{ijk}=\int_{\tau_j}B_k^{(j)}(t)\,X_{ij}(t)\,dt$.
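In practice, the integrals $R_{ijk}$ must be approximated numerically from the discretely observed curve. The following R sketch (function name and arguments are ours) evaluates a cubic B-spline basis on the observed grid and applies trapezoidal quadrature; with $K_j-4$ interior knots, the cubic basis with an intercept has exactly $K_j$ columns.

```r
library(splines)

## Sketch of the scores in equation (5): R_ijk = \int B_k(t) X_ij(t) dt,
## approximated by trapezoidal quadrature on a possibly irregular grid t.
compute_Rij <- function(t, x, interior_knots, boundary = range(t)) {
  B <- bs(t, knots = interior_knots, degree = 3, intercept = TRUE,
          Boundary.knots = boundary)                     # S x K_j basis matrix
  d <- diff(t)
  w <- c(d[1], d[-length(d)] + d[-1], d[length(d)]) / 2  # trapezoid weights
  as.numeric(crossprod(B, w * x))                        # K_j x 1 vector R_ij
}
```

Because the quadrature is computed per subject, the observation grids are free to differ across subjects and covariates, as the model requires.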

The linear case has the advantage of easy interpretability of the weight functions. If the weight $w_{jl}(t)$ is positive (negative) over an interval $(t_a,t_b)$, then the higher the value of $X_{ij}(t)$ in this interval, the higher (lower) the probability that $\gamma_{il}=1$, all other explanatory variables held fixed. In this work, we used functional boxplots to select the functional covariates. An interesting question to be addressed in future work is to use the posterior sample and the Bayes factor to derive a test of $w_{jl}(t)=0$ over an interval.

2.2. Functional nonlinear model

For the more general model (1), we consider each function $F_{jl}(\cdot,\cdot)$ to be a smooth surface generated by a family of tensor products of cubic B-splines (see, for example, Kim et al., 2018). That is, for each function $F_{jl}$, there exist positive integers $K_{1j}$ and $K_{2j}$ and vectors of $(K_{1j}-4)$ interior knots $\Upsilon_{1j}\subset\chi_j$ (the image of $X_{ij}$) and $(K_{2j}-4)$ interior knots $\Upsilon_{2j}\subset\tau_j$, such that

$$F_{jl}(s,t) = \sum_{k_1=1}^{K_{1j}}\sum_{k_2=1}^{K_{2j}}\phi_{jlk_1k_2}\,B_{k_1}^{(\Upsilon_{1j})}(s)\,B_{k_2}^{(\Upsilon_{2j})}(t) \qquad (6)$$

where $\{B_1^{(\Upsilon_{1j})},\ldots,B_{K_{1j}}^{(\Upsilon_{1j})}\}$ and $\{B_1^{(\Upsilon_{2j})},\ldots,B_{K_{2j}}^{(\Upsilon_{2j})}\}$ are B-spline bases determined by $\Upsilon_{1j}$ and $\Upsilon_{2j}$, respectively.

Substituting (6) into (1) yields

$$g(p_{il}) = z_i^\top\theta_l + \sum_{j=1}^{J} R_{ij}^\top\phi_{jl} \qquad (7)$$

where the $R_{ij}$ are vectors of dimension $(K_{1j}K_{2j})\times 1$, with $\phi_{jl}(k_1,k_2)=\phi_{jlk_1k_2}$ and $R_{ij}(k_1,k_2)=\int_{\tau_j}B_{k_1}^{(\Upsilon_{1j})}(X_{ij}(t))\,B_{k_2}^{(\Upsilon_{2j})}(t)\,dt$ properly stacked.

As we can see, from (5) and (7), using basis expansions for both the linear and nonlinear cases, we end up with the same linear structure in terms of parameters θ and ϕ. Notice that the functional covariates are used to compute the Rij and they are not expanded using basis functions.
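The nonlinear scores admit the same numerical treatment: the tensor-product basis is evaluated along the observed trajectory $(X_{ij}(t_s), t_s)$ and integrated over $t$. A hedged R sketch, reusing the quadrature scheme above (function and argument names are ours):

```r
## Sketch of the scores in equation (7): evaluate the surface basis along the
## curve (X_ij(t), t) and integrate in t by trapezoidal quadrature.
compute_Rij_nl <- function(t, x, knots_x, bdry_x, knots_t, bdry_t = range(t)) {
  Bx <- bs(x, knots = knots_x, degree = 3, intercept = TRUE,
           Boundary.knots = bdry_x)      # S x K_1j, basis in the x-direction
  Bt <- bs(t, knots = knots_t, degree = 3, intercept = TRUE,
           Boundary.knots = bdry_t)      # S x K_2j, basis in the t-direction
  d <- diff(t)
  w <- c(d[1], d[-length(d)] + d[-1], d[length(d)]) / 2
  as.vector(crossprod(Bx, w * Bt))       # K_1j x K_2j score matrix, stacked
}
```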

3. A Bayesian approach to the mixture model regression with functional covariates

Model (2) specifies the distribution of the response $y_i$ depending on which mixture component subject $i$ belongs to. The mixture component densities are parameterized by the vector $\lambda=(\lambda_1,\ldots,\lambda_L)$, whose components are row vectors related to the densities $f_1,\ldots,f_L$, respectively, considering category $L$ as the baseline. Therefore, we will denote by $\Theta=(\lambda,\beta)$ the vector of unknown parameters in the mixture model, where $\beta=(\beta_1,\ldots,\beta_{L-1})$ with components $\beta_l=(\theta_l,\phi_{1l},\ldots,\phi_{Jl})$, a column vector of regression coefficients under component $l$, and $x_i:=(z_i,R_{i1},\ldots,R_{iJ})$ is the row vector of covariates for subject $i$. Without loss of generality, we consider the same $x_i$ across the $L$ mixture components in the development of the methodology; the methodology still applies when the covariates differ across the mixture components.

3.1. Hierarchical structure specification and prior specification

A formal Bayesian analysis of a mixture model usually leads to intractable calculations. Data augmentation is an efficient procedure for mixture models that leads to feasible computations using Gibbs sampling (Diebolt & Robert, 1994). The joint augmented posterior distribution is the product of (2) and the prior distributions, which has no closed form. There is no standard solution for the general case. However, in both applications, using suitable prior distributions, all the conditional distributions are available and the Gibbs sampling algorithm is suitable to sample from the posterior distribution of γ, λ, and β.

The nature of the application under study dictates the form of $f_l(y_i;\lambda_l)$, which in turn provides knowledge about the nature of the parameters in $\lambda_l$. There is a very rich family of distributions $f_l(y_i;\lambda_l)$ that may characterize the mixture distribution of $y_i$. Instead of focusing on a specific distribution for the mixture components, we focus on a general solution for posterior sampling of the latent variables $\gamma$ and parameters in $\Theta$, developed for a general $f_l(y_i;\lambda_l)$ without loss of generality. Therefore, prior distributions for $\lambda$ are problem specific. The full conditional posterior distribution of $\beta_l$ cannot be computed explicitly except when using the probit link function, in which case we can apply the simple latent variable method of Albert and Chib (1993). Other methods for calculating the full conditional have been proposed using data augmentation or multiple layers of latent variables, see for example Held and Holmes (2006), Frühwirth-Schnatter and Frühwirth (2010), Gramacy and Polson (2012), and Polson et al. (2013). In our approach, for each of the components of $\beta$, we will assume a Student-t prior distribution with mean 0, degrees-of-freedom parameter $df$, and scale $\varsigma$, with $df$ and $\varsigma$ providing minimal prior information to constrain the coefficients to lie in a reasonable range (see Section 2 of Gelman et al., 2008). An advantage of the t family is that its fat tails allow for flexible inference, and it includes both the normal ($df=\infty$) and the Cauchy ($df=1$) distributions.

3.2. Posterior computation of parameters

We sample from the posterior distribution using a Gibbs sampling scheme, and most of the full conditional posterior distributions of the latent variables $\gamma$ and parameters in $\Theta$ are given by standard methods. The only non-standard computation is the conditional posterior distribution for $\beta$.

3.2.1. Full conditional posterior distribution of βl

We follow Gelman et al. (2008), which incorporates a prior distribution into classical logistic regression computations. As in the classical iteratively weighted least-squares algorithm, at each iteration the algorithm determines the pseudo-data ψil given by

$$\psi_{il} = g(p_{il}) + (\gamma_{il}-p_{il})\,g'(p_{il}), \qquad i=1,\ldots,n \qquad (8)$$

and weights $\sigma_{\psi,il}^2$ satisfying the equation

$$[\sigma_{\psi,il}^2]^{-1} = [g'(p_{il})]^{-2}\,V_{il}^{-1} \qquad (9)$$

where $V_{il}$ is the variance function. In the classical approach of McCullagh and Nelder (1989), at each iteration of the algorithm, equations (8) and (9) are evaluated at $\hat{p}_{il}$, given in equation (7). Under the Gelman et al. (2008) approach, $\psi_{il}$ is approximated by a normal distribution with mean $g(p_{il})=x_i\beta_l$ and variance $\sigma_{\psi,il}^2$. With a normal prior distribution for $\beta$, the full conditional of $\beta$ is conjugate and normal. On the other hand, if independent Student-t distributions are considered for each $\beta_{ld}$ in $\beta_l$, $d=1,\ldots,D$, each expressed as a normal distribution with mean $\mu_{ld,\beta}$ and unknown variance $\sigma_{ld}^2$ following a scaled inverse chi-square distribution with scale parameter $\varsigma_{ld}$ and degrees of freedom $\nu_{ld}$, the approximate logarithm of the full conditional posterior density of $\beta_l$ and $\sigma_l=(\sigma_{l1},\ldots,\sigma_{lD})$ is given by

$$\log p(\beta_l,\sigma_l\mid\psi) \propto -\frac{1}{2}\sum_{i=1}^{n}\frac{1}{\sigma_{\psi,il}^2}\,(\psi_{il}-x_i\beta_l)^2 - \frac{1}{2}\sum_{d=1}^{D}\left[\frac{1}{\sigma_{ld}^2}(\beta_{ld}-\mu_{ld,\beta})^2 + \log(\sigma_{ld}^2) + \log p(\sigma_{ld}\mid\varsigma_{ld},\nu_{ld})\right]. \qquad (10)$$

Notice that (10) has no closed form. Gelman et al. (2008) use an approximate EM algorithm (Dempster et al., 1977) that averages over $\beta_l$ at each step, treating them as missing data and performing EM to estimate $\sigma$. The algorithm starts at some value of $\sigma$ and proceeds by alternating one step of iteratively weighted least squares to update $\beta$ and one step of EM to update $\sigma$. Once enough iterations have been performed to reach approximate convergence, we get a sample of $\beta_l=(\theta_l,\phi_{1l},\ldots,\phi_{Jl})$. This step is performed inside the Gibbs algorithm. In our case, we follow this scheme to obtain a sample from the full conditional of $\beta$ using the bayesglm function implemented in R by Gelman et al. (2008). To use bayesglm, we specify the link function $g$ as either the probit or logit function, and set the degrees of freedom $\nu_{ld}$ appropriately to obtain the normal ($\nu_{ld}=\infty$) and the Cauchy ($\nu_{ld}=1$) prior distributions.
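For concreteness, a hedged sketch of this step in R using bayesglm from the arm package; the data frame dat and its column names are hypothetical placeholders for the current latent indicators, scalar covariates, and basis scores:

```r
library(arm)  # provides bayesglm and sim (Gelman et al., 2008)

## One approximate update of beta_l inside the Gibbs sampler: regress the
## current latent indicators on the scalar covariates and basis scores.
fit <- bayesglm(gamma ~ sex + chronicity + R1 + R2 + R3,
                family = binomial(link = "logit"), data = dat,
                prior.df = 7,        # t prior; Inf gives normal, 1 gives Cauchy
                prior.scale = 2.5)   # weakly informative default scale
beta_draw <- coef(sim(fit, n.sims = 1))  # one approximate draw of beta_l
```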

3.3. Variational inference

To scale the problem described in the introduction of this paper to large samples and more functional covariates, we must resort to approximate posterior inference. VB inference is a machine-learning technique that facilitates approximation of the posterior distribution in complex models using massive data sets (Blei et al., 2017; Hoffman et al., 2013; Ormerod & Wand, 2012). VB inference provides the main alternative to the Markov chain Monte Carlo (MCMC) algorithm. It starts by introducing a variational family of distributions, indexed by some variational parameters $\kappa$, and a criterion function. Then it searches for the member $q(\cdot\mid\kappa)$ of the family that best approximates the posterior distribution. The optimization criterion is derived from the log-marginal posterior distribution of the observed data, a usual model selection criterion, $\log(p(y\mid M))$. This quantity typically requires the solution of an intractable integral. To avoid this tedious calculation, a lower bound, called the ELBO (evidence lower bound), is easily evaluated as

$$\log(p(y\mid M)) = \log\left(\int p(y,\lambda,\beta,\gamma)\,d\gamma\,d\beta\,d\lambda\right) = \log\left(E_q\left[\frac{p(y,\lambda,\beta,\gamma)}{q(\lambda,\beta,\gamma\mid\kappa)}\right]\right) \ge E_q\left[\log\frac{p(y,\lambda,\beta,\gamma)}{q(\lambda,\beta,\gamma\mid\kappa)}\right] \qquad (11)$$

where $\gamma$ and $(\lambda,\beta)$ represent local and global quantities/parameters, respectively. The inequality in (11) follows from Jensen's inequality. It is natural to use this lower bound as a model selection criterion in place of the predictive distribution, avoiding cumbersome high-dimensional integration. Therefore, the VB objective is to maximize the ELBO, which is equivalent to minimizing the Kullback–Leibler divergence up to an additive constant (Blei et al., 2017). We propose the following partition of the joint distribution of local and global parameters, called the mean field family (Parisi, 1988):

$$q(\lambda,\beta,\gamma\mid\kappa) = q(\lambda\mid\kappa)\,q(\beta\mid\lambda,\kappa)\prod_{i=1}^{n}q(\gamma_i\mid\beta,\kappa) \qquad (12)$$

where κ comprises all the parameters of the variational family. To avoid a cumbersome notation, we are using the same notation q for the joint variational distribution of (11) and the conditional distributions in (12).

The approximate conditional inference is viewed as an optimization problem. Given the above setup, the mean field family, and the ELBO criterion, one can find the optimal solution via the coordinate ascent variational inference algorithm (Bishop, 2006). Each factor of the mean field variational density is optimized iteratively, while keeping the others fixed, climbing the ELBO to a local optimum.

Letting $\nu=(\lambda,\beta,\gamma)$, we need to compute $q^*(\nu\mid\kappa)=q^*(\lambda\mid\kappa)\,q^*(\beta\mid\lambda,\kappa)\prod_{i=1}^{n}q^*(\gamma_i\mid\beta,\kappa)$ where

$$q^*(\lambda\mid\kappa) \propto \exp\{E_{(\beta,\gamma)}\log p(\lambda\mid\beta,\gamma,y,\kappa)\} \qquad (13)$$
$$q^*(\beta\mid\kappa) \propto \exp\{E_{(\lambda,\gamma)}\log p(\beta\mid\lambda,\gamma,y,\kappa)\}, \quad\text{and} \qquad (14)$$
$$q^*(\gamma_i\mid\kappa) \propto \exp\{E_{(\gamma_{-i},\lambda,\beta)}\log p(\gamma_i\mid\gamma_{-i},\lambda,\beta,y)\} \qquad (15)$$

It is worth pointing out that the form of the optimal densities involves the full conditional distributions, revealing a link with Gibbs sampling. However, the VB algorithm does not repeatedly simulate from the full conditional distributions as is done by the Gibbs sampler.

4. Identification of early responders using EEG data

Let $y_i$ denote the change in HAM-D (baseline − week 1) for subject $i$, $i=1,\ldots,96$, where a positive change indicates diminished depression symptom severity. In order to compare our results with Jiang et al. (2017), we focus on the same model, given by (16), with the same scalar covariates, sex and chronicity, and functional covariates given by data taken from 14 EEG electrodes. The unobserved binary subgroup indicators are modelled via a hierarchical probit model as a function of the baseline EEG measurements and other scalar covariates of interest.

For this data set, we deal with a mixture of two normal distributions with different means but the same variance; it is straightforward to generalize the computations to $L$ classes and different variances. Let $Y_1,\ldots,Y_n$ be independent random variables with

$$p(y_i\mid\mu_0,\mu_1,\sigma^2,\gamma_i) = \gamma_i\,\phi(y_i;\mu_1,\sigma^2) + (1-\gamma_i)\,\phi(y_i;\mu_0,\sigma^2), \qquad \mu_1>\mu_0 \qquad (16)$$

where $\phi(\cdot\,;\mu,\sigma^2)$ is the normal density with parameters $\mu$ and $\sigma^2$, and $\gamma_1,\ldots,\gamma_n$ are binary latent random variables with $p(\gamma_i\mid\beta)=p_i(\beta)^{\gamma_i}(1-p_i(\beta))^{1-\gamma_i}$, $\gamma_i=0$ or $1$, where

$$p_i(\beta) = g^{-1}(x_i\beta) \qquad (17)$$

$g$ is a link function, and $\beta=(\theta,\phi_1,\ldots,\phi_J)$, with $\theta$ a column vector of parameters associated with the scalar effects and $\phi_j$ column vectors of coefficients of the B-spline expansions given by (5) or (7).

For the parameters in $\lambda=(\mu_0,\mu_1,\sigma^2)$, we use diffuse priors, $\mu_0\sim N(0,\tau_0^2)$ and $\mu_1\sim N(0,\tau_1^2)$ with $\tau_0^2=\tau_1^2=100$.

For each coefficient in $\beta$, we specify a weakly informative t-family prior distribution.

The observed-data likelihood for the hierarchical model is difficult to optimize directly because of the unobserved vector $\gamma=\{\gamma_i\}_{i=1}^{n}$. Denoting $\Theta=(\lambda,\beta)$, the complete likelihood is given by

$$f(y,\gamma\mid\Theta) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(y_i-\mu_0(1-\gamma_i)-\mu_1\gamma_i)^2}{2\sigma^2}\right\}\left[p_i(\beta)\right]^{\gamma_i}\left[1-p_i(\beta)\right]^{(1-\gamma_i)} \qquad (18)$$

4.1. Full conditional posterior distributions

The joint augmented posterior distribution is proportional to the product of the likelihood given by (18) and prior distributions specified in the previous section and has no closed form. Therefore, we adapt the Gibbs sampling algorithm to sample from the full conditional posterior distribution of Θ and the latent variables γ.

4.1.1. Full conditional posterior distribution of γi

Let $\gamma_{-i}$ be the vector $(\gamma_1,\ldots,\gamma_{i-1},\gamma_{i+1},\ldots,\gamma_n)$. The full conditional posterior distribution of $\gamma_i$ is given by

$$P(\gamma_i=1\mid\Theta,y,\gamma_{-i}) = \frac{\phi\!\left(\frac{y_i-\mu_1}{\sigma}\right)p_i(\beta)}{\phi\!\left(\frac{y_i-\mu_1}{\sigma}\right)p_i(\beta) + \phi\!\left(\frac{y_i-\mu_0}{\sigma}\right)(1-p_i(\beta))}$$

since $P(\gamma_i=x\mid\Theta,y,\gamma_{-i}) \propto \phi\!\left(\frac{y_i-\mu_1 x-\mu_0(1-x)}{\sigma}\right)p_i(\beta)^x(1-p_i(\beta))^{1-x}$, where $p_i(\beta)$ depends on the covariates through the regression term and is given by (17).
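A hedged R sketch of this update, vectorized over subjects; p holds the current values of $p_i(\beta)$ from (17):

```r
## One Gibbs update of the latent labels in the two-component normal mixture.
update_gamma <- function(y, mu0, mu1, sigma2, p) {
  num <- dnorm(y, mu1, sqrt(sigma2)) * p        # class-1 weight
  den <- num + dnorm(y, mu0, sqrt(sigma2)) * (1 - p)
  rbinom(length(y), size = 1, prob = num / den)
}
```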

4.1.2. Full conditional posterior distribution of μ0 and μ1

We update $\mu_0$ from a normal distribution with variance $v_0=\left(\frac{1}{\tau_0^2}+\frac{1}{\sigma^2}\sum_{i=1}^{n}(1-\gamma_i)\right)^{-1}$ and mean $v_0\,\frac{1}{\sigma^2}\sum_{i=1}^{n}y_i(1-\gamma_i)$, while $\mu_1$ is updated, conditionally on $\mu_0$, from a normal distribution truncated to $(\mu_0,\infty)$, with variance $v_1=\left(\frac{1}{\tau_1^2}+\frac{1}{\sigma^2}\sum_{i=1}^{n}\gamma_i\right)^{-1}$ and mean $v_1\,\frac{1}{\sigma^2}\sum_{i=1}^{n}y_i\gamma_i$.

4.1.3. Full conditional posterior distribution of σ2

We update $\sigma^2$ using an inverse gamma distribution with parameters $a_0+\frac{n}{2}$ and $b_0+\frac{1}{2}\sum_{i=1}^{n}\left[\gamma_i(y_i-\mu_1)^2+(1-\gamma_i)(y_i-\mu_0)^2\right]$.
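A hedged R sketch of these two conjugate updates (the truncated normal is drawn by inverse-CDF sampling; a0, b0, tau02, tau12 are the prior constants):

```r
## Conjugate updates for (mu0, mu1) and sigma^2 in the normal mixture model.
update_mu_sigma2 <- function(y, gam, sigma2, tau02, tau12, a0, b0) {
  v0  <- 1 / (1 / tau02 + sum(1 - gam) / sigma2)
  mu0 <- rnorm(1, v0 * sum(y * (1 - gam)) / sigma2, sqrt(v0))
  v1  <- 1 / (1 / tau12 + sum(gam) / sigma2)
  m1  <- v1 * sum(y * gam) / sigma2
  u   <- runif(1, pnorm(mu0, m1, sqrt(v1)), 1)   # truncation to (mu0, Inf)
  mu1 <- qnorm(u, m1, sqrt(v1))
  ss  <- sum(gam * (y - mu1)^2 + (1 - gam) * (y - mu0)^2)
  sigma2 <- 1 / rgamma(1, a0 + length(y) / 2, b0 + ss / 2)  # inverse gamma
  list(mu0 = mu0, mu1 = mu1, sigma2 = sigma2)
}
```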

4.1.4. Full conditional posterior distribution of βl

To derive the full conditional posterior distribution of βl, we proceed as described in Section 3.2.1. The distribution will depend on the prior of βl, which can be normal, Cauchy, or t.

4.2. VB for the normal mixed model

For the mixture of normal distributions model, the augmented vector of unknown parameters is $\nu=(\lambda,\beta,\gamma)$, where $\lambda=(\mu_0,\mu_1,\sigma^2)$, $\beta=(\theta,\phi)$, and $\gamma=(\gamma_1,\ldots,\gamma_n)$. For the variational distribution we choose:

  • $\{\gamma_i,\ i=1,\ldots,n\}$: a family of independent Bernoulli random variables with variational parameters $\alpha_i$;

  • $\mu_0$ and $\mu_1$: independent normal distributions with means $m_0$ and $m_1$ and variances $s_0^2$ and $s_1^2$, respectively;

  • $\sigma^2$: inverse gamma with parameters $A_0$ and $B_0$;

  • $\beta$: normal distribution with covariance matrix $V_\beta^*$ and mean $\mu_\beta^*$.

Denote the parameters of the variational distributions as

$$\kappa = (m_0, s_0^2, m_1, s_1^2, \alpha, A_0, B_0, \mu_\beta^*, V_\beta^*)$$

and define $q^*(\nu\mid\kappa)=q^*(\mu_0\mid\kappa)\,q^*(\mu_1\mid\kappa)\,q^*(\sigma^2\mid\kappa)\,q^*(\beta\mid\kappa)\prod_{i=1}^{n}q^*(\gamma_i\mid\kappa)$. According to equations (13), (14), and (15), we have to calculate the variational distributions of $\mu_0,\mu_1,\sigma^2,\beta$, and $\gamma_1,\ldots,\gamma_n$.

In the next sections, to simplify the notation, we will omit the dependence on κ when writing the variational distributions q*. The details of computations can be found in the online supplementary material.

4.2.1. Variational density q*(γi)

If we take $q^*(\mu_0)$ and $q^*(\mu_1)$ in the family of independent normal distributions with means $m_0$ and $m_1$ and variances $s_0^2$ and $s_1^2$, respectively, then $\gamma_i$ is a Bernoulli random variable with variational parameter $\alpha_i=\alpha_{i1}/(\alpha_{i0}+\alpha_{i1})$, where

$$\alpha_{il} = \exp\left\{E_{q^*(\beta)}\log\left[p_i(\beta)^l(1-p_i(\beta))^{1-l}\right] - E_{q^*(\sigma^2)}\!\left(\frac{1}{2\sigma^2}\right)\left[(y_i-m_l)^2+s_l^2\right]\right\}, \qquad l=0,1$$

4.2.2. Variational densities q*(μ0) and q*(μ1)

For $k=0,1$, the variational distribution of $\mu_k$ is Gaussian with mean $m_k$ and variance $s_k^2$ given by

$$m_0 = s_0^2\,E_{q^*(\sigma^2)}[1/\sigma^2]\sum_{i=1}^{n}(1-\alpha_i)y_i, \qquad s_0^2 = \frac{1}{E_{q^*(\sigma^2)}[1/\sigma^2]\sum_{i=1}^{n}(1-\alpha_i) + 1/\tau_0^2}$$

and

$$m_1 = s_1^2\,E_{q^*(\sigma^2)}[1/\sigma^2]\sum_{i=1}^{n}\alpha_i y_i, \qquad s_1^2 = \frac{1}{E_{q^*(\sigma^2)}[1/\sigma^2]\sum_{i=1}^{n}\alpha_i + 1/\tau_1^2}$$

where $\tau_0^2$ and $\tau_1^2$ are the parameters from the prior distribution.

4.2.3. Variational density q*(σ2)

The variational density of $\sigma^2$, considering the likelihood and the prior distribution $\sigma^2\sim IG(a_0,b_0)$, is an inverse gamma with parameters $A_0=a_0+n/2$ and

$$B_0 = b_0 + \sum_{i=1}^{n}\frac{\alpha_i}{2}\left((y_i-m_1)^2+s_1^2\right) + \sum_{i=1}^{n}\frac{1-\alpha_i}{2}\left((y_i-m_0)^2+s_0^2\right)$$

4.2.4. Variational density q*(β)

Assuming a t prior distribution for each component of $\beta_l$ in $\beta$, the variational density $q^*(\beta_l)$ is obtained by taking the expectation of equation (10) with respect to $\gamma_i$, using

$$E_{q^*(\gamma_i)}(\psi_{il}) = g(\hat{p}_{il}) + (\alpha_i-\hat{p}_{il})\,g'(\hat{p}_{il}) \qquad (19)$$

with $\alpha_i=E_{q^*(\gamma_i)}(\gamma_i)$ given in Section 4.2.1. There is no closed form for $q^*(\beta_l)$, and therefore we cannot compute the ELBO.

However, taking the prior distribution of $\beta_l$ to be normal with mean $\mu_\beta$ and covariance matrix $\Sigma_\beta$, the variational $q^*(\beta_l)$ is a normal distribution with covariance matrix $V_\beta^*=(x^\top\Sigma_\psi^{-1}x+\Sigma_\beta^{-1})^{-1}$ and mean $\mu_\beta^*=V_\beta^*(x^\top\Sigma_\psi^{-1}E_{q^*(\gamma)}(\psi_l)+\Sigma_\beta^{-1}\mu_\beta)$, where the elements of $E_{q^*(\gamma)}(\psi_l)$ are given by (19) and $\Sigma_\psi$ is a diagonal matrix of size $n$ whose elements $\sigma_{\psi,il}^2$ have inverses given in equation (9).
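Putting the pieces together, a hedged R sketch of the coordinate ascent loop for this scheme; for clarity it holds the regression term fixed at a logit offset eta (under the logit link, $\log p_i(\beta) - \log(1-p_i(\beta)) = x_i\beta$, so only this difference enters the responsibilities), whereas the full algorithm also refreshes $q^*(\beta)$ each sweep as described above:

```r
## Minimal CAVI sketch for the normal-mixture VB updates of Sections 4.2.1-4.2.3.
cavi_normal_mix <- function(y, eta, tau02, tau12, a0, b0, iters = 100) {
  n <- length(y)
  alpha <- rep(0.5, n); m0 <- 0; m1 <- mean(y); s02 <- s12 <- 1
  A0 <- a0 + n / 2; B0 <- b0 + var(y)            # B0 is an arbitrary start
  for (it in seq_len(iters)) {
    Einv <- A0 / B0                              # E_q[1/sigma^2]
    # q*(gamma_i): only l1 - l0 matters, and the logit collapses it to eta
    l1 <- eta - 0.5 * Einv * ((y - m1)^2 + s12)
    l0 <-      - 0.5 * Einv * ((y - m0)^2 + s02)
    alpha <- 1 / (1 + exp(l0 - l1))
    # q*(mu_0), q*(mu_1): Section 4.2.2
    s02 <- 1 / (Einv * sum(1 - alpha) + 1 / tau02)
    m0  <- s02 * Einv * sum((1 - alpha) * y)
    s12 <- 1 / (Einv * sum(alpha) + 1 / tau12)
    m1  <- s12 * Einv * sum(alpha * y)
    # q*(sigma^2): Section 4.2.3
    B0  <- b0 + 0.5 * sum(alpha * ((y - m1)^2 + s12) +
                          (1 - alpha) * ((y - m0)^2 + s02))
  }
  list(alpha = alpha, m0 = m0, m1 = m1, s02 = s02, s12 = s12, A0 = A0, B0 = B0)
}
```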

4.3. Results

To sample from the posterior distribution, we use the Gibbs sampler scheme with the posterior computations described in Section 3.2. We ran two chains for $M=150{,}000$ steps. We calculated Gelman and Rubin's convergence diagnostic (Gelman & Rubin, 1992) using the R package coda; the statistics and plots showed convergence for all parameters, leading to a burn-in of $M/2=75{,}000$ steps, and samples were taken every 100 steps. Table 2 presents the posterior estimates, using one of the chains, of the parameters using the logit link function and t, Cauchy, and normal priors for the parameters in $\beta$, for the linear and nonlinear functional models. The value $\hat{p}$ in each model represents

Table 2.

Posterior mean (standard deviation) of μ0,μ1,σ2, θ, and p for different prior distributions and linear and nonlinear functional models

Logit link

Parameter    Nonlinear model                                 Linear model
             Student-t       Normal          Cauchy          Student-t       Normal          Cauchy
μ0           0.88 (0.48)     0.87 (0.47)     0.89 (0.48)     0.86 (0.61)     0.82 (0.61)     0.89 (0.60)
μ1           11.52 (1.18)    11.57 (1.18)    11.44 (1.16)    10.81 (1.57)    10.72 (1.52)    10.86 (1.55)
Intercept    0.008 (0.005)   0.008 (0.005)   0.008 (0.005)   −1.07 (0.34)    −1.07 (0.34)    −1.26 (0.79)
Sex          0.34 (0.21)     0.34 (0.21)     0.30 (0.19)     −1.85 (0.48)    −1.85 (0.45)    −1.79 (0.54)
Chronicity   0.43 (0.19)     0.44 (0.19)     0.39 (0.17)     −1.76 (0.45)    −1.80 (0.44)    −1.70 (0.54)
σ2           15.82 (2.30)    15.81 (2.34)    15.95 (2.48)    17.21 (3.00)    17.08 (2.85)    17.32 (2.93)
p̂            0.18            0.16            0.18            0.15            0.16            0.15

$\hat{p} = \frac{1}{96}\sum_{i=1}^{96}\mathbf{1}(\hat{p}_i > 1/2)$.

Notice that all results in Table 2 are very similar. To choose the best model, we compute posterior predictive checks (Bayarri & Berger, 2000; Gelman & Meng, 1996): for each $t=-5,0,5,$ and $10$ and each model, we sample $\Theta^{1,t},\ldots,\Theta^{M,t}$ from the posterior distribution and generate $\tilde{y}_1^{rt},\ldots,\tilde{y}_{96}^{rt}$, $r=1,\ldots,M$, from the likelihood, considering the samples after burn-in and thinning of 100, leading to $M=750$ samples. Then we compute

$$h(\tilde{y}\mid t) = \frac{1}{M}\sum_{r=1}^{750}\frac{1}{96}\sum_{i=1}^{96}\mathbf{1}(\tilde{y}_i^{rt} > t)$$

as shown in Table 3.
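A hedged R sketch of this check for the linear logit model; draws is a hypothetical container holding the 750 retained posterior samples (a matrix beta and vectors mu0, mu1, sigma2), and X is the covariate matrix:

```r
## Posterior predictive check: proportion of replicated responses exceeding t.
ppc <- function(draws, X, t) {
  M <- nrow(draws$beta)
  mean(vapply(seq_len(M), function(r) {
    p  <- plogis(X %*% draws$beta[r, ])          # p_i(beta) under draw r
    g  <- rbinom(nrow(X), 1, p)                  # latent labels
    yr <- rnorm(nrow(X), ifelse(g == 1, draws$mu1[r], draws$mu0[r]),
                sqrt(draws$sigma2[r]))           # replicated responses
    mean(yr > t)
  }, numeric(1)))
}
```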

Table 3.

Posterior predictive checks as a function of the threshold (t) for different prior distributions and linear and nonlinear functional models with logit link

t     Nonlinear model                  Linear model
      Student-t   Normal   Cauchy     Student-t   Normal   Cauchy
−5    0.51        0.51     0.53       0.44        0.45     0.46
0     0.61        0.61     0.63       0.62        0.62     0.63
5     0.34        0.33     0.36       0.47        0.48     0.48
10    0.44        0.47     0.47       0.39        0.39     0.40

Gelman (2013) showed that posterior predictive values do not have a uniform distribution under the null hypothesis, but rather tend to be concentrated near 0.5; therefore, the closer to 0.5, the better the model. Although the nonlinear model is slightly better than the linear model for $t=-5$ and $t=10$, as shown in Table 3, independently of the prior distribution, by parsimony we chose the linear model fitted with the normal prior distribution and consider this model in what follows.

Figure 3 shows histograms of the change in HAM-D (baseline − week 1), showing the amount of improvement in depression symptoms after 1 week; a positive change indicates improvement in symptoms. The right panel presents the fitted distribution $0.84\times N(0.82, 17.08) + 0.16\times N(10.72, 17.08)$ chosen by the linear model with normal prior and logit link. For comparison, the left panel presents the parametric fit from the EM algorithm for a mixture of Gaussian distributions (different means and different variances) with no covariates. Comparing the left and right panels of Figure 3, we can see that the fitted model has a much better ability to discriminate between the two groups, since the two distributions have less overlap.

In order to classify subjects as early responders, we used the maximum posterior estimate of $p(\gamma_i\mid y)$, see Figure 4. Specifically, 10 patients exhibit near-certain early responder characteristics (posterior mean probability greater than 0.9), while an additional 5 patients show a high likelihood of being early responders (posterior mean probability in the range $[0.5, 0.9]$).

Figure 4. Posterior mean for the probability of being an early responder for the 96 patients.

For comparison between MCMC and VB results, Table 4 shows the estimated posterior expectations of the parameters, based on the posterior mean for MCMC and on the expectation of the variational distribution ($E_{q^*}$) for VB, for the linear and nonlinear models with logit link. We present only the results for the normal prior, as we operate VB with the normal prior only (cf. Table 2). Although point estimates obtained by MCMC and VB appear discrepant, comparing results for the linear logit model with normal prior, except for the intercept and the female effect, all VB estimates of the posterior expectations lie inside the MCMC 95% highest posterior density credible interval.

Table 4.

Estimated posterior expectation of the parameters and the computational time for the normal mixture model using MCMC (posterior mean, PM, and highest posterior density interval, HPDI) and VB ($E_{q^*}[\cdot]$) approaches

                 VB                         MCMC
Parameter        Nonlinear    Linear        Nonlinear PM   Linear PM   Linear HPDI
μ0               1.40         1.13          0.87           0.82        (0.26; 2.08)
μ1               13.34        12.38         11.57          10.72       (7.80; 13.56)
σ2               18.14        16.68         15.81          17.08       (11.78; 22.64)
Intercept (θ0)   0.01         0.30          0.008          −1.07       (−1.81; −0.48)
Female (θ1)      0.14         0.93          0.34           −1.85       (−2.73; −1.04)
Chronicity (θ2)  0.13         −1.39         0.44           −1.80       (−2.68; −1.12)
Time             12 min                     80 min

Note. MCMC = Markov chain Monte Carlo; VB = variational Bayes.

For the linear model, we estimate the weight functions $w_j$, $j=1,\ldots,14$, which quantify the extent to which resting-state EEG alpha and theta power in the posterior region of the brain under a closed-eyes condition helps identify a potential early-responder subgroup (believed to consist of subjects susceptible to non-specific placebo effects). To understand the variability in the estimates of the weight functions, we use functional boxplots, the analogue of the usual boxplot: a graphical method displaying five descriptive statistics: the median, the first and third quartiles, and the non-outlying minimum and maximum observations. The functional boxplot proposed by Sun and Genton (2011) is similar in spirit to the classical 'box-and-whisker plot' and displays the functional median, the envelope of the 50% central region, and the maximum envelope, based on the ranks of functional data induced by the modified band depth proposed by López-Pintado and Romo (2009). Given the ranks, it is possible to order the functional data according to decreasing depth values. This ordering is taken from the centre outwards, so the functional observation with the largest depth value has the most central position and is called the functional median. The sample 50% central region is the set $\{(t,y(t)): y(t)$ is sandwiched between the pointwise minimum and maximum of the half of the ranked observations with the largest depths$\}$. Notice that the functional median is one of the curves, whereas the 50% central region consists of pieces from different observations and is not, in general, one of the original observations. The maximum envelope is constructed after removing outliers and is determined by inflating the envelope of the 50% central region by 1.5 times its range for each $t$.
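These plots can be produced with the fbplot function of the R package fda, which implements the Sun and Genton (2011) functional boxplot; in this hedged sketch, W and freq are our placeholder names for a matrix of posterior draws of one weight function evaluated on a common frequency grid:

```r
library(fda)  # fbplot implements the functional boxplot of Sun & Genton (2011)

## Columns of W are posterior draws of one weight function w_j on the grid
## 'freq'; MBD is the modified band depth of Lopez-Pintado & Romo (2009).
fbplot(fit = W, x = freq, method = "MBD",
       xlab = "Frequency (Hz)", ylab = "w(t)")
abline(h = 0, lty = 2)  # reference line for assessing w_j(t) = 0
```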

In the literature, depression is associated with impaired right parietal cortex function, see Bruder et al. (1995) and Heller and Nitscke (1997) and references therein. In particular, Stewart et al. (2011) suggest using asymmetry scores to measure parietal brain asymmetry, in particular, comparing P5 (right) with P6 (left) and P7 (right) with P8 (left). Figure 5 displays functional boxplots for the weight functions $w_j(t)$, $j=3,5,6$. These weight functions were found to be the only significant ones for predicting early responders using the linear model with the logit link and a normal prior. They are associated with the EEGs located at positions P5, P6, and P7, aligning with the observation mentioned earlier. Notably, higher EEG signals in the frequency range of 20–45 Hz for P5, 30–45 Hz for P6, and 16–35 Hz for P7 correspond to an increased probability of being an early responder.

Figure 5. Functional boxplots for the significant weight functions using the linear model with logit link function and normal prior. The shaded area is the 50% central region, and red dashed curves are outlier candidates.

5. Predicting illness for milking cows based on functional covariates measured 30 days before lactation

In this section, we analyse the case where the observed sample is given by $Y_1,Y_2,\ldots,Y_n$, independent non-negative integer-valued random variables, such that

$$P(Y_i=y_i\mid\lambda_1,\lambda_2,\gamma_i) = \left[I(y_i=0)\right]^{\gamma_{i0}}\left[\frac{1}{y_i!}e^{-\lambda_1}\lambda_1^{y_i}\right]^{\gamma_{i1}}\left[\frac{1}{y_i!}e^{-\lambda_2}\lambda_2^{y_i}\right]^{\gamma_{i2}} \qquad (20)$$

for $y_i=0,1,\ldots$, where $\gamma_i=(\gamma_{i0},\gamma_{i1},\gamma_{i2})$ are latent Multinomial$(1,p_{i0},p_{i1},p_{i2})$ random vectors with

$$p_{i0} = \frac{1}{1+\exp(x_i\beta_1)+\exp(x_i\beta_2)}, \qquad p_{il} = \frac{\exp(x_i\beta_l)}{1+\exp(x_i\beta_1)+\exp(x_i\beta_2)}, \quad l=1,2$$

Therefore, the vector of unknowns is $\nu=(\lambda_1,\lambda_2,\beta_1,\beta_2,\gamma)=(\Theta,\gamma)$. Here, $x_i$ is row $i$ of the matrix of covariates.

For the model parameters, we propose using the following priors:

  • $\lambda_1\sim\text{Gamma}(a_1,b_1)$, $\lambda_2\sim\text{Gamma}(a_2,b_2)$.

  • Elements in $\beta_l=(\theta_l,\phi_{1l},\ldots,\phi_{Jl})$ will have independent t prior distributions.

The complete likelihood is given by

$$f(y,\gamma\mid\Theta) = \prod_{i=1}^{n}\left[p_{i0}\,I(y_i=0)\right]^{\gamma_{i0}}\left[p_{i1}\frac{1}{y_i!}e^{-\lambda_1}\lambda_1^{y_i}\right]^{\gamma_{i1}}\left[p_{i2}\frac{1}{y_i!}e^{-\lambda_2}\lambda_2^{y_i}\right]^{\gamma_{i2}} \qquad (21)$$

The joint augmented posterior distribution is the product of the likelihood and priors specified above and has no closed form. Therefore, we adapt the Gibbs sampling algorithm to sample from the posterior distribution of Θ and the latent variables γ.

5.1. Full conditional posterior distributions

The posterior distribution of parameters is obtained based on a Gibbs sampling scheme. For that, we calculate the full conditional posterior distribution of parameters in Θ, similarly to the one described in Section 4.1.

5.1.1. Full conditional posterior distribution of γi

Let $\gamma_{-i,l}$ be the vector $\gamma_l=(\gamma_{1l},\ldots,\gamma_{nl})$ without observation $\gamma_{il}$, $l=0,1,2$. The full conditional posterior distribution of $\gamma_{il}$ is given by

$$P(\gamma_{il}=1\mid\Theta,y,\gamma_{-i,l}) = \frac{e^{-\lambda_l}\lambda_l^{y_i}\exp(x_i\beta_l)}{I(y_i=0)+e^{-\lambda_1}\lambda_1^{y_i}\exp(x_i\beta_1)+e^{-\lambda_2}\lambda_2^{y_i}\exp(x_i\beta_2)} \qquad (22)$$

for $l=1,2$ and

$$P(\gamma_{i0}=1\mid\Theta,y,\gamma_{-i,0}) = \frac{I(y_i=0)}{I(y_i=0)+e^{-\lambda_1}\lambda_1^{y_i}\exp(x_i\beta_1)+e^{-\lambda_2}\lambda_2^{y_i}\exp(x_i\beta_2)}. \qquad (23)$$

5.1.2. Full conditional posterior distribution of λ1 and λ2

We update $\lambda_l$, $l=1,2$, using a gamma distribution with parameters

$$a_l+\sum_{i=1}^{n}y_i\gamma_{il} \qquad\text{and}\qquad b_l+\sum_{i=1}^{n}\gamma_{il}, \qquad l=1,2.$$
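A hedged R sketch of these two updates; eta1 and eta2 stand for the current linear predictors $x_i\beta_1$ and $x_i\beta_2$ (dpois is safe here because the $1/y_i!$ factors either cancel across the two Poisson classes or, when $y_i=0$, equal one):

```r
## Gibbs updates for the ZIMP latent labels (eqs (22)-(23)) and Poisson rates.
update_gamma_zimp <- function(y, lam1, lam2, eta1, eta2) {
  w0 <- as.numeric(y == 0)               # pure-zero class weight
  w1 <- dpois(y, lam1) * exp(eta1)
  w2 <- dpois(y, lam2) * exp(eta2)
  W  <- cbind(w0, w1, w2) / (w0 + w1 + w2)
  t(apply(W, 1, function(p) rmultinom(1, size = 1, prob = p)))
}

update_lambda_zimp <- function(y, G, a, b) {
  # G is the n x 3 indicator matrix returned by update_gamma_zimp
  c(rgamma(1, a[1] + sum(y * G[, 2]), b[1] + sum(G[, 2])),
    rgamma(1, a[2] + sum(y * G[, 3]), b[2] + sum(G[, 3])))
}
```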

5.1.3. Full conditional posterior distribution of $\beta_l=(\theta_l,\phi_{1l},\ldots,\phi_{Jl})$, $l=1,2$

Computations of the full conditional posterior distribution of $\beta_l$ follow from Section 3.2.1.

5.2. VB of the ZIMP model

Analogously to the normal mixture case, we define the variational densities as

$$q^*(\nu\mid\kappa) = q^*(\lambda_1\mid\kappa)\,q^*(\lambda_2\mid\kappa)\,q^*(\beta_1\mid\kappa)\,q^*(\beta_2\mid\kappa)\,q^*(\gamma\mid\kappa)$$

where $\kappa=(\alpha,\psi_1,\zeta_1,\psi_2,\zeta_2,\mu_{\beta_1}^*,\mu_{\beta_2}^*,\text{vec}(V_{\beta_1}^*),\text{vec}(V_{\beta_2}^*))$ is the vector of variational parameters. In all cases, the variational densities $q^*(\beta_1\mid\kappa)$ and $q^*(\beta_2\mid\kappa)$ have exactly the same computations as in the mixture of two normal distributions model; see Section 4.2.4. The vector of unknowns is $\nu=(\lambda_1,\lambda_2,\beta_1,\beta_2,\gamma)=(\Theta,\gamma)$, where $\gamma=(\gamma_1,\ldots,\gamma_n)$.

Again, to simplify the notation, we will omit the dependence on κ when writing the variational distributions q*. The details of computations can be found in the online supplementary material.

5.2.1. Variational density q*(γi)

If we take $q^*(\lambda_1)$ and $q^*(\lambda_2)$ in the gamma family of distributions with parameters $(\psi_1,\zeta_1)$ and $(\psi_2,\zeta_2)$, respectively, then $\gamma_i$ is a multinomial random vector Multinomial$(1,\alpha_{i0},\alpha_{i1},\alpha_{i2})$ with $\alpha_{il}=\rho_{il}/\sum_{j=0}^{2}\rho_{ij}$, for $l=0,1,2$, where

$$\rho_{i0} = I(y_i=0), \qquad \rho_{il} = \exp\left\{-\frac{\psi_l}{\zeta_l} + y_i\left(\Psi(\psi_l)-\log(\zeta_l)\right) + E_{q^*(\beta_l)}[x_i\beta_l]\right\} \quad\text{for } l=1,2$$

5.2.2. Variational density q*(λ1) and q*(λ2)

The variational distribution of $\lambda_l$, $l=1,2$, is a gamma density with parameters

$$\psi_l := a_l+\sum_{i=1}^{n}\alpha_{il}y_i \qquad\text{and}\qquad \zeta_l := b_l+\sum_{i=1}^{n}\alpha_{il}$$
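A hedged R sketch of the two ZIMP variational updates; eta1 and eta2 stand for $E_{q^*(\beta_l)}[x_i\beta_l]$, and the gamma family is parameterized by shape $\psi_l$ and rate $\zeta_l$, so that $E[\lambda_l]=\psi_l/\zeta_l$ and $E[\log\lambda_l]=\Psi(\psi_l)-\log\zeta_l$:

```r
## Variational updates for the ZIMP model (Sections 5.2.1-5.2.2).
vb_responsibilities <- function(y, psi, zeta, eta1, eta2) {
  r0 <- as.numeric(y == 0)
  r1 <- exp(-psi[1] / zeta[1] + y * (digamma(psi[1]) - log(zeta[1])) + eta1)
  r2 <- exp(-psi[2] / zeta[2] + y * (digamma(psi[2]) - log(zeta[2])) + eta2)
  cbind(r0, r1, r2) / (r0 + r1 + r2)   # rows are (alpha_i0, alpha_i1, alpha_i2)
}

vb_lambda_zimp <- function(y, A, a, b) {
  # A is the n x 3 matrix of responsibilities alpha_il
  list(psi  = c(a[1] + sum(A[, 2] * y), a[2] + sum(A[, 3] * y)),
       zeta = c(b[1] + sum(A[, 2]),     b[2] + sum(A[, 3])))
}
```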

5.3. Results

For the MCMC, we ran two chains for 150,000 steps. We calculated Gelman and Rubin's convergence diagnostic (Gelman & Rubin, 1992) using the R package coda; the statistics and plots showed convergence for all parameters, leading to a burn-in of 75,000 steps, and samples were taken every 100 steps, resulting in 750 samples. After fitting the model, Table 5 shows the classification using the posterior mean of the latent variable $\gamma$ (the classification according to the maximum value of the estimated probability is the same using MCMC and VB estimates). Of the 138 cows in the training set that did not get sick during lactation, 136 were classified as belonging to the pure-zero class with high probability, while the other 2 cows were classified as low incidence. All 41 cows that were sick between 1 and 4 days were classified as low incidence, whereas one of the two cows diagnosed with a health disorder for 5 days was classified as low incidence and the other as high incidence. All cows with 6 or more days with a health disorder were classified as high incidence.

Table 5.

Classification for the training set according to the class presenting the maximum mean value of the posterior distribution and VB estimation of the latent variable γ

Class             Number of sick days
                    0   1   2   3   4   5   6   7   8   9  11  12  13  18
Pure zero         136   0   0   0   0   0   0   0   0   0   0   0   0   0
Low incidence       2  20  10   6   5   1   0   0   0   0   0   0   0   0
High incidence      0   0   0   0   0   1   5   4   6   5   2   3   1   1

Note. VB = variational Bayes.

The posterior mean, standard deviation, and quantiles of the posterior distribution, as well as the VB mean, are shown in Table 6 and the left panel of Figure 6. Notice that the estimates of the mean and standard deviation of the posterior distribution of $\lambda_1$ and $\lambda_2$ are very similar under VB and MCMC, with the VB method presenting smaller standard deviations. Moreover, the quantiles from the MCMC method indicate that the 95% credible intervals of $\lambda_1$ (1.20; 2.13) and $\lambda_2$ (7.53; 9.83) clearly discriminate the classes. On the other hand, from Figure 6, we can see that the VB estimates of the intercept and the scalar coefficients are not so good for the coefficients that are not significant, whereas they are similar to the MCMC estimates for the significant ones. Direct comparison of the coefficients of the expansion for the weights of the functional covariates is not meaningful, so we constructed the functional boxplot for the posterior estimates of the weight functions, as shown in Figure 7. To understand these plots, recall that functional boxplots are similar to the usual boxplots: the shaded region comprises the 50% central functional observations and can be thought of as a 50% credible band. Notice that, apparently, none of the functional covariates is significant in the regression term for the low incidence class, since the horizontal line at zero is almost completely contained inside the shaded regions. The covariate activity seems to be significant in determining the probability of the high incidence latent class: high activity between 30 and 20 days prior to calving decreases this probability, whereas high activity between 20 and 5 days prior to calving increases the probability of belonging to the high incidence class.

Table 6.

VB mean, posterior mean, and quantiles of the MCMC sample for the parameters of the zero-inflated mixture of two Poisson distributions, using the logit link and Cauchy prior

                                           Quantiles (MCMC)
Parameter   VB mean (SD)   MCMC mean (SD)   2.5%   25%    50%    75%    97.5%
λ1          1.60 (0.18)    1.63 (0.24)      1.20   1.47   1.62   1.79   2.13
λ2          8.45 (0.53)    8.64 (0.59)      7.53   8.24   8.63   9.03   9.83
Time        20.35 s        2.55 h

Note. VB = variational Bayes; MCMC = Markov chain Monte Carlo.

Figure 6. Histograms of the posterior distributions of λ1 and λ2, and of the intercepts and coefficients related to the scalar variables (θ1, θ2, θ3, and θ4) for the ZIMP model. ZIMP = zero-inflated mixture Poisson.

Figure 7. Functional boxplots for the weight functions corresponding to the covariates 'activity', 'number of rest times', 'average rest time', and 'total rest time', respectively, under the linear model with logit link function and Cauchy prior. Left panels correspond to class 1 (low incidence) and right panels to class 2 (high incidence).

Figure 8 shows, as stacked barplots, the estimated probabilities of belonging to the pure-zero, low incidence, and high incidence classes, $(\hat{p}_{i0},\hat{p}_{i1},\hat{p}_{i2})$, for the 50 cows in the test set. The left panel presents the 37 cows that did not get sick, whereas the right panel displays the 13 cows that were sick for at least a day. It is important to note that the current model does not exhibit strong predictive power. There are various prepartum risk factors linked to postpartum diseases. According to studies by Urton et al. (2005) and Huzzey et al. (2007), reduced feeding times and dry matter intake before parturition serve as predictors of metritis. Additionally, Riekerink et al. (2007) report a seasonal effect associated with mastitis. Further research by Pérez-Báez et al. (2019a, 2019b) explores the connection between prepartum dry matter intake, energy balance, and subsequent postpartum health disorders. Consequently, in future investigations, it is advisable to incorporate some of these factors as additional covariates for a more comprehensive model.

Figure 8. Estimated probability of belonging to the pure-zero, low, and high incidence classes, $(\hat{p}_{i0},\hat{p}_{i1},\hat{p}_{i2})$, $i=1,\ldots,50$, for the cows in the test set using the ZIMP model. The left panel represents the cows that did not get sick. The numbers under the bars in the right panel represent the number of sick days. ZIMP = zero-inflated mixture Poisson.

6. Discussion

In this paper, we consider two applications where the heart of the problem was to classify individuals based solely on the covariates. We proposed a classification procedure based on a mixture model driven by latent variables under the Bayesian paradigm, using a semi-parametric regression model that incorporates functional covariates as predictors for the latent group membership. Other methods could be used to analyse such data, for example regression trees or regression mixture models, but all of them would have to be adapted to deal with functional data. In the case of regression trees, one approach to including the functional covariates would be to discretize each functional predictor by partitioning its domain into several disjoint intervals and using the mean values computed within each interval as covariates; however, this returns to the approach of treating functional data as multivariate vectors. Using regression mixture models would almost double the number of parameters to be estimated. On the other hand, Bayesian analysis not only provides information on the whole distribution of the parameters, going beyond a point or an interval estimate, but also simplifies the estimation of latent models, since the parameter space can be augmented. In this work, we used a Gibbs sampler due to the conjugacy of the priors; other choices, such as the Metropolis-Hastings algorithm, would be possible. Although the VB approach is computationally much faster, the simulations show that MCMC produces more reliable estimates. Computational times are also hard to compare, since they depend on the chain length for MCMC and on the required precision for VB.
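
To make the interval-mean discretization just described concrete, the sketch below illustrates it in R; the time grid, the number of intervals, and the example curve are illustrative assumptions rather than quantities from our applications.

    # Minimal sketch (illustrative data): discretize a functional covariate
    # by partitioning its domain into disjoint intervals and using the
    # within-interval means as scalar covariates.
    set.seed(3)
    t_grid <- seq(0, 1, length.out = 200)
    x_t <- sin(2 * pi * t_grid) + rnorm(200, sd = 0.1)  # one observed curve
    K <- 5                                              # number of intervals
    interval <- cut(t_grid, breaks = K, labels = FALSE) # interval membership
    x_means <- tapply(x_t, interval, mean)              # K scalar covariates

The K interval means could then be fed to any standard classifier, but, as noted above, this reduces the curve to a multivariate vector and discards its functional structure.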

The main features of our methodology are:

  1. The nonparametric approach of expanding the unknown functions wj and Fj into B-spline bases reduces the dimension of the problem (a minimal sketch follows this list).

  2. Linear and nonlinear regressions can be implemented.

  3. It can be used with any link function g.

  4. It incorporates Student-t prior information for the regression coefficients by applying an efficient approximate EM algorithm already implemented in R.

  5. It can be used with arbitrary mixture component distributions.

  6. The functional covariates do not need to be observed concurrently and they can even be different for each subject.

  7. In the case of the linear model, it has the added advantage of interpretability of the weight functions which might naturally incorporate prior information that is available to experts in the field.
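
As a minimal illustration of feature 1, the sketch below projects one observed curve onto a low-dimensional B-spline basis using the base R splines package; the grid, the example curve, and the basis dimension are illustrative assumptions, not our fitted specification.

    # Minimal sketch (illustrative data): represent a functional covariate
    # by a few B-spline basis coefficients instead of its raw observations.
    library(splines)
    set.seed(4)
    t_grid <- seq(0, 1, length.out = 200)
    x_t <- cos(2 * pi * t_grid) + rnorm(200, sd = 0.1)  # one observed curve
    B <- bs(t_grid, df = 8, intercept = TRUE)   # 200 x 8 B-spline design matrix
    coef_x <- coef(lm(x_t ~ B - 1))             # 8 coefficients summarize the curve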

The strength of our method is demonstrated by the simulation study as well as the analysis of the two data sets.


Acknowledgements

A special thanks to Alberto Saa for helping to solve a computational bottleneck and Guilherme J.M. Rosa for fruitful discussions. We thank the referees for their thoughtful comments and suggestions that improved the quality of the manuscript. Any opinions, findings, conclusions, or recommendations expressed in this publication are those of the authors and do not necessarily reflect the view of the National Institute of Food and Agriculture or the United States Department of Agriculture.

Contributor Information

Nancy L Garcia, Department of Statistics, Universidade Estadual de Campinas, Campinas, Brazil.

Mariana Rodrigues-Motta, Department of Statistics, Universidade Estadual de Campinas, Campinas, Brazil.

Helio S Migon, Department of Statistics, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil.

Eva Petkova, Department of Population Health, Grossman School of Medicine, New York University, New York, USA; Department of Child and Adolescent Psychiatry, Grossman School of Medicine, New York University, New York, USA.

Thaddeus Tarpey, Department of Population Health, Grossman School of Medicine, New York University, New York, USA.

R Todd Ogden, Department of Biostatistics, Columbia University, New York, USA.

Julio O Giordano, College of Agriculture and Life Sciences, Cornell University, Ithaca, USA.

Martin M Perez, College of Agriculture and Life Sciences, Cornell University, Ithaca, USA.

Funding

This work was partially supported by NIMH grant 5 R01 MH099003, USDA National Institute of Food and Agriculture Animal Health Program award 2017-67015-26772 to J.O.G., FAPESP grants 2017/15306-9, 2018/06811-4, 2019/10800-0, and 2023/00592-7, and CNPq grants 302598/2014-6, 442012/2014-4, and 304148/2020-2.

Data availability

The code and data files used in the real data applications and simulation studies are available at https://github.com/nancyg-unicamp/Unsupervised.

Supplementary material

Supplementary material is available online at Journal of the Royal Statistical Society: Series C. As a complement to this paper, we provide a file with an extensive simulation study investigating the performance of the proposed method, as well as a comparison between MCMC and variational Bayes inference.

References

  1. Albert, J. H., & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422), 669–679. 10.1080/01621459.1993.10476321
  2. Bayarri, M., & Berger, J. O. (2000). P-values for composite null models. Journal of the American Statistical Association, 95(452), 1127–1142. 10.2307/2669749
  3. Benaglia, T., Chauveau, D., Hunter, D. R., & Young, D. S. (2009). mixtools: An R package for analyzing finite mixture models. Journal of Statistical Software, 32(6), 1–29. 10.18637/jss.v032.i06
  4. Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.
  5. Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518), 859–877. 10.1080/01621459.2017.1285773
  6. Breiman, L. (2017). Classification and regression trees. Routledge.
  7. Bruder, G. E., Tenke, C. E., Stewart, J. W., Towey, J. P., Leite, P., Voglmaier, M., & Quitkin, F. M. (1995). Brain event-related potentials to complex tones in depressed patients: Relations to perceptual asymmetry and clinical features. Psychophysiology, 32(4), 373–381. 10.1111/psyp.1995.32.issue-4
  8. Cardot, H., Ferraty, F., & Sarda, P. (1999). Functional linear model. Statistics & Probability Letters, 45(1), 11–22. 10.1016/S0167-7152(99)00036-X
  9. Ciarleglio, A., Petkova, E., Ogden, T., & Tarpey, T. (2018). Constructing treatment decision rules based on scalar and functional predictors when moderators of treatment effect are unknown. Journal of the Royal Statistical Society Series C: Applied Statistics, 67(5), 1331–1356. 10.1111/rssc.12278
  10. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1), 1–22. 10.1111/j.2517-6161.1977.tb01600.x
  11. Diebolt, J., & Robert, C. P. (1994). Estimation of finite mixture distributions through Bayesian sampling. Journal of the Royal Statistical Society: Series B (Methodological), 56(2), 363–375. 10.1111/j.2517-6161.1994.tb01985.x
  12. Eilers, P. H., & Marx, B. D. (1996). Flexible smoothing with B-splines and penalties. Statistical Science, 11(2), 89–121. 10.1214/ss/1038425655
  13. Ferraty, F., & Vieu, P. (2009). Additive prediction and boosting for functional data. Computational Statistics & Data Analysis, 53(4), 1400–1413. 10.1016/j.csda.2008.11.023
  14. Frühwirth-Schnatter, S., & Frühwirth, R. (2010). Data augmentation and MCMC for binary and multinomial logit models. In T. Kneib & G. Tutz (Eds.), Statistical modelling and regression structures (pp. 111–132). Springer.
  15. Gelman, A. (2013). Two simple examples for understanding posterior p-values whose distributions are far from uniform. Electronic Journal of Statistics, 7(1), 2595–2602. 10.1214/13-EJS854
  16. Gelman, A., Jakulin, A., Pittau, M. G., & Su, Y.-S. (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 2(4), 1360–1383. 10.1214/08-AOAS191
  17. Gelman, A., & Meng, X.-L. (1996). Model checking and model improvement. In W. R. Gilks, S. Richardson, & D. Spiegelhalter (Eds.), Markov chain Monte Carlo in practice (pp. 189–201). Springer.
  18. Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7(4), 457–472. 10.1214/ss/1177011136
  19. Gramacy, R. B., & Polson, N. G. (2012). Simulation-based regularized logistic regression. Bayesian Analysis, 7(3), 567–590. 10.1214/12-BA719
  20. Gueorguieva, R., Mallinckrodt, C., & Krystal, J. H. (2011). Trajectories of depression severity in clinical trials of duloxetine: Insights into antidepressant and placebo responses. Archives of General Psychiatry, 68(12), 1227–1237. 10.1001/archgenpsychiatry.2011.132
  21. Hans, C. (2009). Bayesian lasso regression. Biometrika, 96(4), 835–845. 10.1093/biomet/asp047
  22. Held, L., & Holmes, C. C. (2006). Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Analysis, 1(1), 145–168. 10.1214/06-BA105
  23. Heller, W., & Nitscke, J. B. (1997). Regional brain activity in emotion: A framework for understanding cognition in depression. Cognition & Emotion, 11(5–6), 637–661. 10.1080/026999397379845a
  24. Hoffman, M. D., Blei, D. M., Wang, C., & Paisley, J. (2013). Stochastic variational inference. Journal of Machine Learning Research, 14(5), 1303–1347. https://dl.acm.org/doi/10.5555/2567709.2502622
  25. Huzzey, J., Veira, D., Weary, D., & von Keyserlingk, M. (2007). Prepartum behavior and dry matter intake identify dairy cows at risk for metritis. Journal of Dairy Science, 90(7), 3220–3233. 10.3168/jds.2006-807
  26. James, G. M. (2002). Generalized linear models with functional predictors. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3), 411–432. 10.1111/1467-9868.00342
  27. Jiang, B., Petkova, E., Tarpey, T., & Ogden, R. T. (2017). Latent class modeling using matrix covariates with application to identifying early placebo responders based on EEG signals. The Annals of Applied Statistics, 11(3), 1513–1536. 10.1214/17-AOAS1044
  28. Kim, J. S., Staicu, A.-M., Maity, A., Carroll, R. J., & Ruppert, D. (2018). Additive function-on-function regression. Journal of Computational and Graphical Statistics, 27(1), 234–244. 10.1080/10618600.2017.1356730
  29. Kolda, T. G., & Bader, B. W. (2009). Tensor decompositions and applications. SIAM Review, 51(3), 455–500. 10.1137/07070111X
  30. Leuchter, A. F., Cook, I. A., Witte, E. A., Morgan, M., & Abrams, M. (2002). Changes in brain function of depressed subjects during treatment with placebo. American Journal of Psychiatry, 159(1), 122–129. 10.1176/appi.ajp.159.1.122
  31. López-Pintado, S., & Romo, J. (2009). On the concept of depth for functional data. Journal of the American Statistical Association, 104(486), 718–734. 10.1198/jasa.2009.0108
  32. McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (2nd ed.). Chapman & Hall.
  33. McLachlan, G. J., & Peel, D. (2004). Finite mixture models. John Wiley & Sons.
  34. McLean, M. W., Hooker, G., Staicu, A.-M., Scheipl, F., & Ruppert, D. (2014). Functional generalized additive models. Journal of Computational and Graphical Statistics, 23(1), 249–269. 10.1080/10618600.2012.729985
  35. Nunez, P. L., & Srinivasan, R. (2006). Electric fields of the brain: The neurophysics of EEG. Oxford University Press.
  36. Ormerod, J. T., & Wand, M. P. (2012). Gaussian variational approximate inference for generalized linear mixed models. Journal of Computational and Graphical Statistics, 21(1), 2–17. 10.1198/jcgs.2011.09118
  37. Parisi, G. (1988). Statistical field theory. Addison-Wesley.
  38. Park, T., & Casella, G. (2008). The Bayesian lasso. Journal of the American Statistical Association, 103(482), 681–686. 10.1198/016214508000000337
  39. Park, S. Y., Li, C., Mendoza Benavides, S. M., van Heugten, E., & Staicu, A.-M. (2019). Conditional analysis for mixed covariates, with application to feed intake of lactating sows. Journal of Probability and Statistics, 2(1), 1–14. 10.1155/2019/3743762
  40. Pérez-Báez, J., Risco, C., Chebel, R., Gomes, G., Greco, L., Tao, S., Thompson, I., do Amaral, B., Zenobi, M., Martinez, N., Staples, C., Dahl, G., Hernández, J., Santos, J., & Galvão, K. (2019a). Association of dry matter intake and energy balance prepartum and postpartum with health disorders postpartum: Part I. Calving disorders and metritis. Journal of Dairy Science, 102(10), 9138–9150. 10.3168/jds.2018-15878
  41. Pérez-Báez, J., Risco, C., Chebel, R., Gomes, G., Greco, L., Tao, S., Thompson, I., do Amaral, B., Zenobi, M., Martinez, N., Staples, C., Dahl, G., Hernández, J., Santos, J., & Galvão, K. (2019b). Association of dry matter intake and energy balance prepartum and postpartum with health disorders postpartum: Part II. Ketosis and clinical mastitis. Journal of Dairy Science, 102(10), 9151–9164. 10.3168/jds.2018-15879
  42. Polson, N. G., Scott, J. G., & Windle, J. (2013). Bayesian inference for logistic models using Pólya-gamma latent variables. Journal of the American Statistical Association, 108(504), 1339–1349. 10.1080/01621459.2013.829001
  43. R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing.
  44. Riekerink, R. O., Barkema, H., & Stryhn, H. (2007). The effect of season on somatic cell count and the incidence of clinical mastitis. Journal of Dairy Science, 90(4), 1704–1715. 10.3168/jds.2006-567
  45. Rupasov, V. I., Lebedev, M. A., Erlichman, J. S., Lee, S. L., Leiter, J. C., & Linderman, M. (2012). Time-dependent statistical and correlation properties of neural signals during handwriting. PLoS One, 7(9), e43945. 10.1371/journal.pone.0043945
  46. Silverman, B. W. (2018). Density estimation for statistics and data analysis. Routledge.
  47. Stangaferro, M., Wijma, R., Caixeta, L., Al-Abri, M., & Giordano, J. (2016a). Use of rumination and activity monitoring for the identification of dairy cows with health disorders: Part I. Metabolic and digestive disorders. Journal of Dairy Science, 99(9), 7395–7410. 10.3168/jds.2016-10907
  48. Stangaferro, M., Wijma, R., Caixeta, L., Al-Abri, M., & Giordano, J. (2016b). Use of rumination and activity monitoring for the identification of dairy cows with health disorders: Part III. Metritis. Journal of Dairy Science, 99(9), 7422–7433. 10.3168/jds.2016-11352
  49. Stangaferro, M., Wijma, R., Caixeta, L., Al-Abri, M., & Giordano, J. (2016c). Use of rumination and activity monitoring for the identification of dairy cows with health disorders: Part II. Mastitis. Journal of Dairy Science, 99(9), 7411–7421. 10.3168/jds.2016-10908
  50. Stewart, J. L., Towers, D. N., Coan, J. A., & Allen, J. J. (2011). The oft-neglected role of parietal EEG asymmetry and risk for major depressive disorder. Psychophysiology, 48(1), 82–95. 10.1111/psyp.2010.48.issue-1
  51. Sun, Y., & Genton, M. G. (2011). Functional boxplots. Journal of Computational and Graphical Statistics, 20(2), 316–334. 10.1198/jcgs.2011.09224
  52. Titterington, D. M., Smith, A. F., & Makov, U. E. (1985). Statistical analysis of finite mixture distributions. Wiley.
  53. Urton, G., Von Keyserlingk, M., & Weary, D. (2005). Feeding behavior identifies dairy cows at risk for metritis. Journal of Dairy Science, 88(8), 2843–2849. 10.3168/jds.S0022-0302(05)72965-9
  54. Wager, T. D., & Atlas, L. Y. (2015). The neuroscience of placebo effects: Connecting context, learning and health. Nature Reviews Neuroscience, 16(7), 403–418. 10.1038/nrn3976
  55. Wahba, G. (1982). Constrained regularization for ill posed linear operator equations, with applications in meteorology and medicine. In Statistical decision theory and related topics III (pp. 383–418). Academic Press.
  56. Walsh, B. T., Seidman, S. N., Sysko, R., & Gould, M. (2002). Placebo response in studies of major depression: Variable, substantial, and growing. The Journal of the American Medical Association, 287(14), 1840–1847. 10.1001/jama.287.14.1840
  57. Watson, A., El-Deredy, W., Vogt, B. A., & Jones, A. K. (2007). Placebo analgesia is not due to compliance or habituation: EEG and behavioural evidence. Neuroreport, 18(8), 771–775. 10.1097/WNR.0b013e3280c1e2a8
  58. Wegman, E. J., & Wright, I. W. (1983). Splines in statistics. Journal of the American Statistical Association, 78(382), 351–365. 10.1080/01621459.1983.10477977
  59. Zhang, W., & Luo, J. (2009). The transferable placebo effect from pain to emotion: Changes in behavior and EEG activity. Psychophysiology, 46(3), 626–634. 10.1111/psyp.2009.46.issue-3
