Abstract
The identification of sets of co-regulated genes that share a common function is a key question of modern genomics. Bayesian profile regression is a semi-supervised mixture modelling approach that makes use of a response to guide inference toward relevant clusterings. Previous applications of profile regression have considered univariate continuous, categorical, and count outcomes. In this work, we extend Bayesian profile regression to cases where the outcome is longitudinal (or multivariate continuous) and provide PReMiuMlongi, an updated version of PReMiuM, the R package for profile regression. We consider multivariate normal and Gaussian process regression response models and provide proof-of-principle applications in four simulation studies. The model is applied to budding yeast data to identify groups of genes co-regulated during the Saccharomyces cerevisiae cell cycle. We identify 4 distinct groups of genes associated with specific patterns of gene expression trajectories, along with the bound transcription factors that are likely involved in their co-regulation.
1. Introduction
Understanding gene regulation is a major research challenge in molecular biology (Jacob and Monod, 1961). To this end, a number of model organisms are studied for their relative simplicity in exploring the underlying regulatory mechanisms, including the budding yeast Saccharomyces cerevisiae (Tong et al., 2004). It has been shown that integrating different types of omic data may allow us to further deepen our understanding of the transcription process from DNA to messenger RNA (Savage et al., 2010), including allowing us to identify transcriptional modules (Ihmels et al., 2002): sets of functionally related genes that are regulated by the same transcription factor(s). In this paper, we integrate yeast gene expression time course data with chromatin immunoprecipitation microarray data in order to identify such gene sets. This analysis motivates a new adapted clustering approach.
Clustering is typically formulated as an unsupervised method for identifying homogeneous subgroups in a heterogeneous population (Jain et al., 1999). In addition to the perennial problem of how to determine the appropriate number of clusters (Fraley and Raftery, 2002, Rousseeuw, 1987, Sugar and James, 2003, Tibshirani and Walther, 2005), a key challenge is how to validate a given clustering structure (Brock et al., 2008, Kerr and Churchill, 2001). A common approach is to make use of left-out information to assess if the identified clustering structure provides a relevant stratification of the population (Handl et al., 2005, Yeung et al., 2001). For example, if we had clustered patients on the basis of their blood plasma proteome profiles, then we might check to see if patients allocated to the same cluster also tended to have responses to treatment which are more similar than for patients in different clusters. If this were the case, we might conclude that the identified clustering structure was relevant for the particular aim of identifying clusters associated with differential treatment responses. Of course, assessment of relevance is necessarily context- and application-specific. The left-out information (which we will refer to here as an outcome or response) is typically very closely related to the true aim of the clustering analysis, and implicitly defines the criterion by which a given clustering structure is assessed to be relevant or irrelevant.
When clustering datasets with many variables — some of them possibly not defining a clustering structure (Law et al., 2003, 2004, Tadesse et al., 2005), or when subsets of variables define a variety of different clustering structures (Cui et al., 2007, Guan et al., 2010, Kirk and Richardson, 2021, Li and Shafto, 2011, Niu et al., 2010, 2014) — it may be desirable to make use of (potentially highly informative) outcome information directly, in order to guide inference toward the most relevant clustering structures. That is, we may wish to use the outcome information during the clustering analysis itself, rather than during post-analysis validation. This is one of the principal motivations for Bayesian profile regression (Molitor et al., 2010), a semi-supervised approach for model-based clustering that allows outcome information to be taken into account for the determination of the clustering. Previous applications of profile regression have considered univariate continuous, categorical, or count outcomes (Liverani et al., 2015). In this work, we extend Bayesian profile regression to cases where the outcome is longitudinal (or multivariate continuous) and provide PReMiuMlongi, an updated version of PReMiuM, the R package for profile regression (Liverani et al., 2015).
In the following, we recall the Bayesian profile regression modelling framework in Section 2 and present two extensions to handle a longitudinal outcome in Section 3. In Section 4, we demonstrate through different simulation studies the benefit of integrating a longitudinal outcome in the clustering algorithm and assess the two methods that we propose, in terms of clustering recovery and computation time. Section 5 features the results of the methods applied to budding yeast data, uncovering groups of genes co-regulated during the cell cycle, followed by a discussion in Section 6.
2. Bayesian profile regression: A semi-supervised clustering model
We suppose that we have data comprising observations on a vector of covariates, x, and outcomes, y. Following Molitor et al. (2010), we permit the model for y to depend upon additional covariates, w, referred to as fixed effects. These fixed effects are covariates that may be predictive of the outcome, but do not directly contribute to the clustering, and which come with associated “global” (i.e. not cluster-specific) parameters, β, for the model linking y and w. The general model considered in Molitor et al. (2010) for C mixture components is then:
(1) p(yi, xi | wi, β) = ∑c=1,…,C πc fy(yi | θc, β, wi) fx(xi | ϕc),
with πc the mixture weights, ϕc the cluster-specific parameters for the density fx, θc the cluster-specific parameters for the density fy and β the vector of regression parameters in the outcome model. The original formulation of this model, which we also adopt here, is in terms of infinite (specifically Dirichlet process) mixture models; however, we note that the model is equally applicable in the case of finite C. A stick-breaking construction (Pitman, 2002) is adopted for the Dirichlet Process prior set upon the mixture distribution, also known as a GEM or Griffiths-Engen-McCloskey prior (Pitman, 2002), constructed as follows:
(2) Vc ∼ Beta(1, α), πc = Vc ∏l<c (1 − Vl), for c = 1, 2, …
where α is a positive parameter, for which we adopt a gamma prior. See, for example, Ishwaran and James (2001) and Kalli et al. (2009) for further background, and Liverani et al. (2015) and Hastie et al. (2015) for details of how inference of the πc’s is performed in the Bayesian profile regression model. We then define the component allocation variable z taking values in 1, …, C. Thus we have P(zi = c | π) = πc.
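To illustrate the construction in Equation (2), the following R sketch draws a truncated set of stick-breaking weights for a given concentration parameter α. This is a minimal illustration only; PReMiuMlongi infers these weights as part of the full posterior.

```r
# Minimal sketch: draw C truncated stick-breaking (GEM) weights
# pi_1, ..., pi_C for a given concentration parameter alpha.
stick_breaking <- function(C, alpha) {
  V <- rbeta(C, 1, alpha)   # V_c ~ Beta(1, alpha)
  V[C] <- 1                 # truncation: forces the weights to sum to one
  pi <- numeric(C)
  remaining <- 1
  for (c in seq_len(C)) {
    pi[c] <- V[c] * remaining            # pi_c = V_c * prod_{l<c} (1 - V_l)
    remaining <- remaining * (1 - V[c])
  }
  pi
}

set.seed(1)
round(stick_breaking(C = 10, alpha = 2), 3)  # a few large weights, many small
```

Smaller values of α concentrate the mass on fewer components, which is why a gamma hyperprior on α indirectly controls the number of non-empty clusters.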
Profile regression can accommodate various types of covariates, by specifying the corresponding cluster-specific distributions in Equation (1). In the case of Q categorical covariates, the probability distribution in cluster c for xi = (xi,1, …, xi,Q), for individual i, i = 1, …, N, would be written as

(3) fx(xi | ϕzi) = ∏q=1,…,Q ∏e=1,…,Rq ϕzi,q,e^1{xi,q = e},

with zi = c if the i-th individual is allocated to the c-th cluster, where ϕzi,q,e is the probability that covariate q takes the value e in cluster zi = c, Rq is the number of categories of covariate q, and 1{xi,q = e} = 1 if xi,q = e and 0 otherwise. Conjugate Dirichlet priors are adopted for the parameters ϕc,q = (ϕc,q,1, …, ϕc,q,Rq). Multivariate Gaussian densities can also be used to handle a set of correlated Gaussian covariates, using a multivariate normal prior for the mean vector and an inverse Wishart prior for the covariance matrix. Finally, a product of both types of probability distributions accommodates mixed sets of Gaussian and discrete covariates, assuming independence given the cluster allocations.
Several distributions are proposed in the current profile regression model for the outcome, such as Bernoulli, Poisson and Binomial. In the binary case, for example, the probability distribution fY for individual i is given by:

yi | zi, β, wi ∼ Bernoulli(pi), with logit(pi) = θzi + β⊤wi,

where the cluster-specific parameters θc, as well as the regression parameters β, follow t location-scale prior distributions.
The profile regression model described in Equation (1) implicitly assumes that the covariates, x, and outcomes, y, are linked via a shared clustering structure, but are otherwise conditionally independent. The inclusion of fixed effects w in Equation (1) allows for the possibility that there are additional nuisance covariates that might be predictive of the observed y, but which we do not wish to include in the clustering analysis. For example, in an analysis of data from a genome-wide association study of lung cancer considered in Papathomas et al. (2012), x comprises genetic covariates (SNPs), while w comprises potentially confounding covariates including age, sex, and smoking status. Adopting the terminology of Bair and Tibshirani (2004), we refer to this as a semi-supervised mixture model, since clustering is guided by an outcome variable.
A variable selection procedure identifies the covariates x that carry clustering information. The covariate parameters in Equation (1) are replaced by composite ones defined as:

(4) ϕ*zi,q = γzi,q ϕzi,q + (1 − γzi,q) ϕ0,q,

where ϕ0,q is the empirical proportion of individuals with xq = 1 in the whole sample and γzi,q is a binary relevance indicator for covariate q in cluster zi, with γc,q ∼ Bernoulli(ρq), a sparsity-inducing hyperprior ρq ∼ ωq Beta(0.5, 0.5) + (1 − ωq) δ0, and ωq ∼ Bernoulli(0.5). Values of the hyperparameter ρq close to 1 indicate a difference in the cluster-specific distributions of variable q compared to its distribution in the whole sample. See Liverani et al. (2015) for more details about the alternative “soft variable selection” procedure, also implemented in PReMiuM and PReMiuMlongi.
The profile regression model is fitted via Markov chain Monte Carlo (MCMC) using a blocked Gibbs sampler. A representative clustering structure is identified by post-processing the MCMC output, based on the posterior similarity matrix, which contains the estimated co-clustering probabilities for each pair of individuals and also quantifies the uncertainty associated with the cluster allocations. We refer the reader to Liverani et al. (2015) for further details.
3. Clustering guided by a longitudinal outcome
In previous applications of Bayesian profile regression, the outcome y has been assumed to be univariate (Molitor et al., 2010, Papathomas et al., 2012). Here we extend the original model to consider cases in which y comprises longitudinal continuous data. We present the multivariate normal (MVN) and Gaussian process (GP) regression response models in Sections 3.1 and 3.2, respectively. In Section 3.1, we restrict attention to situations in which we have a longitudinal outcome measured on all individuals at a common set of time points. In Section 3.2, individuals may have different numbers of outcome measurements, at individual-specific time points.
3.1. Multivariate normal outcome model
We propose to model the outcome as multivariate normal, referring to this as the MVN response model, when all individuals/units have response measurements at a common set of time points. Such a model may also be appropriate when each individual’s longitudinal response is summarised through a collection of statistics, as in Hathaway and D’Agostino (1993). For flexibility, we make no assumptions about the structure of the covariance matrix in our model, and defer until Section 3.2 the discussion of a model in which we explicitly try to capture the time-ordering of the data. In this section, the vector of M time points, t = {t1, …, tM}, is assumed to be the same for all individuals. The outcome for the i-th individual is then of the form yi = [yi,1, …, yi,M]⊤ ∈ ℝM, where yi,s denotes the value of the response measured on the i-th individual at the s-th time point ts, and it is assumed that there are no missing outcome values.
For the MVN response model, the component-specific parameters θc in Equation (1) comprise a mean vector, μc, and a covariance matrix, Σc. In the simple case where there are no confounding covariates w, the density fy(y|θc) is simply a multivariate normal density with mean μc and covariance matrix Σc. When we have confounding covariates, we assume that they act on the response via a constant shift of the mean. In particular, if the i-th individual, with covariates wi, is allocated to cluster c, the MVN response model is as follows:

(5) yi | zi = c, wi ∼ NM(μc + (β⊤wi) 1M, Σc),

where 1M denotes a vector of ones of length M and we note that β⊤wi is a scalar. To perform inference, we adopt conjugate normal-inverse Wishart (Gelman et al., 2013) priors μc | Σc ∼ N(μ0, Σc/κ0) and Σc ∼ Inverse-Wishart(ν0, Λ0), where μ0 = (1/N) ∑i=1,…,N yi, where N is the number of participants, κ0 = 0.01, ν0 = M and
(6) Λ0 = (1/N) ∑i=1,…,N (yi − μ0)(yi − μ0)⊤,

the empirical covariance matrix of the outcomes.
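As a concrete illustration of Equation (5), the following R sketch evaluates the log-density of one individual’s response under cluster c, using the mvtnorm package; the function name and arguments are ours, not part of PReMiuMlongi.

```r
# Sketch: MVN response log-density of Equation (5), with the fixed effects
# acting as a scalar shift beta'w_i applied to every time point.
library(mvtnorm)

log_fy_mvn <- function(y_i, mu_c, Sigma_c, beta = NULL, w_i = NULL) {
  shift <- if (is.null(beta)) 0 else sum(beta * w_i)  # scalar beta'w_i
  dmvnorm(y_i, mean = mu_c + shift, sigma = Sigma_c, log = TRUE)
}
```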
3.2. Gaussian process outcome model: a Bayesian functional data approach
In this section, we consider a more general setting in which the number of observations and the time points vary across individuals. In this case, we write yi = yi(ti) = [yi(ti,1), …,yi(ti,Mi)]⊤ with ti the vector of the Mi time points associated with the i-th individual, and yi(t) refers to the value of the response for the i-th individual at time t. In the outcome model, we assume that the complete longitudinal trajectory of the i-th individual’s response is expressed in the following form:
(7) yi(t) = gi(t) + β⊤wi + εi(t),

where gi is a continuous function of time; εi(t) is assumed to be i.i.d. Gaussian noise with variance σi²; and β and wi are the vectors of regression parameters and of the individual covariates, respectively, defined as in the MVN specification. In practice, of course, we only observe yi(t) at a discrete set of time points, {ti,1, …, ti,Mi}; however, this functional approach proves useful when dealing with individuals who have observations at different time points.
If individuals i and j are both in cluster c, then we assume that gi(t) = gj(t) ≔ g(c)(t), where g(c) denotes the function associated with the c-th cluster, and that σi² = σj² ≔ σc², where σc² denotes the noise variance associated with the c-th cluster. The conditional distribution of yi at time t, given that the i-th individual is allocated to the c-th component, is then:

(8) yi(t) | zi = c ∼ N(g(c)(t) + β⊤wi, σc²).
In order to proceed, we could specify a parametric form for the functions g(c) (e.g. we might specify that g(c) is a polynomial of degree d), whose component-specific parameters we would need to infer in order to fit the model. However, as we now describe, here we adopt a Bayesian non-parametric approach, and take a Gaussian process prior for the unknown function g(c).
3.2.1. Gaussian process priors for unknown functions
A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution (Rasmussen and Williams, 2006). This definition simply means that a (potentially infinite) collection v1, v2, … of random variables defines a Gaussian process if and only if any finite subcollection of the variables is jointly distributed according to a Gaussian distribution.
In GP regression, a GP prior is assumed for the outputs of an unknown function g(c) (Rasmussen and Williams, 2006). This means that we assume a priori that {g(c)(t1), g(c)(t2), …, g(c)(tT)} are jointly distributed according to a T-variate Gaussian distribution for any vector of times {t1, t2, …, tT} of finite length T. To fully specify a GP prior, we require a mean function, m(c), and a covariance function, k(c), which define the mean vectors and covariance matrices of the Gaussian distributions associated with each finite subcollection of the variables. We write g(c) ∼ GP(m(c), k(c)) to indicate that we have assumed a Gaussian process prior with mean function m(c) and covariance function k(c) for the function g(c), so that:

[g(c)(t1), …, g(c)(tT)]⊤ ∼ NT([m(c)(t1), …, m(c)(tT)]⊤, K), with Ki,j = k(c)(ti, tj).
There are many possibilities for m(c) and k(c); see, for example, Rasmussen and Williams (2006). In practice, we take the standard default choice of setting m(c) to be the zero function, so that m(c)(tj) = 0 for all tj. Note that if one prefers to use a non-zero mean function, the specification with a null mean function can then be applied to the difference between the observations and the specified non-zero mean function. We also take k(c) to be the widely used squared exponential covariance function (or Gaussian kernel), which is infinitely differentiable (i.e. very smooth), is characterized by only two parameters, and has previously provided robust performance in the context of clustering (Strauss et al., 2019). Thus we have:
(9) k(c)(t, s) = ac exp(−(t − s)² / (2 lc²)),

where the hyperparameters ac and lc are respectively the signal variance and the length-scale of the squared exponential covariance function k(c). Likewise, we define the corresponding function on time vectors, K(c)(t, s), whose output is a matrix with (i, j)-th element equal to k(c)(ti, sj). All hyperparameters (a, l and the variance of the measurement error σ²) must be positive. Following Neal (1999), we deal throughout with the logarithms of these quantities, eliminating the positivity constraint (and thereby making sampling of these quantities more straightforward). For convenience, we adopt independent, standard normal priors for each of log(a), log(l) and log(σ²), as in Kirk et al. (2012a) and McDowell et al. (2018).
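The sketch below implements the covariance function of Equation (9) and draws one trajectory from the corresponding GP prior; the parameterisation (ac multiplying an exponential with 2lc² in the denominator) is one common convention and is assumed here.

```r
# Squared exponential covariance (Equation (9)): a is the signal variance,
# l the length-scale; returns the matrix K with K[i, j] = k(t_i, s_j).
sq_exp_kernel <- function(t, s, a, l) {
  a * exp(-outer(t, s, "-")^2 / (2 * l^2))
}

set.seed(2)
t_grid <- seq(0, 10, length.out = 50)
K <- sq_exp_kernel(t_grid, t_grid, a = 1, l = 2)
g <- MASS::mvrnorm(1, mu = rep(0, 50), Sigma = K + 1e-8 * diag(50))  # jitter
plot(t_grid, g, type = "l")  # one smooth function drawn from the GP prior
```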
In non-parametric Bayesian clustering, the tuning of the variance prior is of particular importance. Depending on the application at hand, standard lognormal priors may be too vague and the initialization may lead to the algorithm getting stuck in states with too few or too many clusters, leading to spurious results. Ross and Dy (2013) proposed to set the hyperparameters using empirical Bayes strategies. The lognormal hyperpriors adopted for the covariance functions can be tuned with an empirical variogram of the data (Diggle, 1990), which quantifies the correlation between observations as a function of the time interval and provides empirical estimates of the signal variance and length-scale parameters. In addition, and based on our experience, it may be necessary to constrain the variance of the measurement error, as large values may lead to a small number of clusters and, conversely, small values may lead to a large number. This can be diagnosed by unstructured posterior similarity matrices (i.e. with unclear separations). We thus propose a specification where the measurement error variance is proportional to the signal variance in each cluster: σc² = ac/r, where the ratio r is fixed by the user. Sensitivity analyses are then necessary to validate this variance ratio value.
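As an illustration of the variogram-based tuning, the sketch below computes a simple empirical variogram from pooled pairs of observations: half the mean squared difference between two measurements, binned by their time lag. The binning scheme is an assumption of this sketch, not the package’s exact implementation.

```r
# Sketch of an empirical variogram (Diggle, 1990): the semivariance at lag u
# estimates sigma^2 + a * (1 - correlation at lag u), so its intercept and
# plateau give rough starting values for the noise and signal variances.
empirical_variogram <- function(times, y, breaks = 10) {
  pairs <- t(combn(length(times), 2))
  lag   <- abs(times[pairs[, 1]] - times[pairs[, 2]])
  vdiff <- 0.5 * (y[pairs[, 1]] - y[pairs[, 2]])^2
  bins  <- cut(lag, breaks = breaks)
  data.frame(lag          = as.numeric(tapply(lag,   bins, mean)),
             semivariance = as.numeric(tapply(vdiff, bins, mean)))
}
```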
3.2.2. Likelihood
We adopt a GP prior, with component-specific hyperparameters ac and lc, for each unknown function g(c). Suppose there are nc individuals associated with component c. For notational convenience, we assume that these are individuals 1, …, nc, i.e. we assume that zi = c for i = 1, …, nc and zi ≠ c for all other i. Recalling that yi = [yi(ti,1), …, yi(ti,Mi)]⊤ and that ti = [ti,1, …, ti,Mi] denotes the vector of the Mi time points associated with the i-th individual, we define the “corrected” vector of observations for the i-th individual to be ỹi = yi − (β⊤wi) 1Mi. It then follows from our model for the response (Equations (7) and (8)) and the GP prior that:

(10) g(c)(t(c)) ∼ N(0, K(c))

(11) ỹ(c) | g(c) ∼ N(g(c)(t(c)), σc² I),

where ỹ(c) = [ỹ1⊤, …, ỹnc⊤]⊤ and t(c) = [t1⊤, …, tnc⊤]⊤ denote the concatenated corrected observations and time points of the individuals in cluster c, and

(12) K(c) = (K(c)i,j)i,j=1,…,nc, with K(c)i,j = K(c)(ti, tj),

with K(c) and K(c)i,j matrices with dimensions M(c) × M(c), where M(c) = ∑i=1,…,nc Mi, and Mi × Mj, respectively, for individuals i and j in cluster c.
This defines the conditional likelihood given the function g(c):

(13) p(ỹ(c) | g(c), σc²) = ∏i=1,…,nc N(ỹi ; g(c)(ti), σc² IMi),

with IMi the Mi × Mi identity matrix. Alternatively, marginalising g(c), we see from Equations (10) and (11) that

(14) ỹ(c) ∼ N(0, K(c) + σc² I).

We can then express the marginal likelihood:

(15) p(ỹ(c) | t(c), ac, lc, σc²) = N(ỹ(c) ; 0, K(c) + σc² I).
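The marginal likelihood in Equation (15) can be evaluated stably via a Cholesky factorisation, as in the following sketch (reusing sq_exp_kernel() from Section 3.2.1; illustrative code, not the package’s internal routine).

```r
# Marginal log-likelihood of Equation (15): after integrating out g^(c),
# the corrected observations are jointly N(0, K^(c) + sigma2 * I).
gp_marginal_loglik <- function(y_tilde, t, a, l, sigma2) {
  n <- length(y_tilde)
  V <- sq_exp_kernel(t, t, a, l) + sigma2 * diag(n)
  R <- chol(V)                                       # V = R'R, R upper triangular
  alpha <- backsolve(R, forwardsolve(t(R), y_tilde)) # alpha = V^{-1} y
  -0.5 * sum(y_tilde * alpha) - sum(log(diag(R))) - 0.5 * n * log(2 * pi)
}
```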
3.3. Model inference
The model is estimated via MCMC. The DP parameter α is updated using a Metropolis-Hastings step and the cluster-specific covariate parameters ϕc are updated using Gibbs steps. For the MVN specification of the outcome model, the mean vectors μc and covariance matrices Σc are updated using Gibbs steps. For the GP specification, the logarithms of the hyperparameters of the squared exponential covariance function are updated in a Metropolis-within-Gibbs step. Details of the complete Gibbs samplers are given in Section A of the Supplementary Materials.
For the GP specification, we implemented two algorithms for sampling the hyperparameters, either marginalising over or explicitly sampling the function g(c).
Marginal algorithm
In the first case, the hyperparameters are sampled from the following posterior distribution:

p(ac, lc, σc² | y(c), t(c), w(c), β) ∝ p(y(c) | t(c), w(c), β, ac, lc, σc²) p(ac, lc, σc²),

with y(c) and t(c) the vectors of observations and time points, respectively, of all the individuals in cluster c, and w(c) the corresponding covariate matrix. The likelihood is defined in Equation (15) and involves inverting an M(c) × M(c) covariance matrix at each update of either z, β or θc. To reduce computation time and, to a greater extent, handle large datasets, we used the Woodbury matrix identity (Woodbury, 1950) to update the allocation variable z in the corresponding Gibbs sampling step (Strauss, 2019) (see Sections B.1 and B.2 of the Supplementary Materials). We also implemented an approximation to the inverses of the cluster-specific covariance matrices K(c), using an intermediate grid of regular time points (Snelson and Ghahramani, 2006) (Section B.3 of the Supplementary Materials). This approximation is used if either the grid of time points or its size is pre-specified by the user. In the latter case, the grid step is defined from the range of observed time points.
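The following generic demonstration shows why the Woodbury identity pays off: when a large covariance matrix is a cheap-to-invert matrix plus a low-rank update, the identity replaces an n × n inversion by an m × m one. This illustrates the identity itself, not the package’s exact allocation update (for which see the Supplementary Materials).

```r
# Woodbury identity: (A + U C U')^{-1} = A^{-1} - A^{-1} U (C^{-1} +
# U' A^{-1} U)^{-1} U' A^{-1}. Here A = 0.5 I_n, so A^{-1} is free, and only
# an m x m system is solved (m = 5 instead of n = 200).
set.seed(3)
n <- 200; m <- 5
A_inv <- diag(n) / 0.5
U <- matrix(rnorm(n * m), n, m)
C <- diag(m)
inner <- solve(solve(C) + t(U) %*% A_inv %*% U)
woodbury_inv <- A_inv - A_inv %*% U %*% inner %*% t(U) %*% A_inv
direct_inv   <- solve(0.5 * diag(n) + U %*% C %*% t(U))
max(abs(woodbury_inv - direct_inv))  # agrees up to rounding error
```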
Conditional algorithm
In the case where this approximation is not sufficient to reduce computation times (when either the size of the time grid or the total number of time points across the individuals in a same cluster is large), we implemented a second algorithm in which the function g(c) is sampled and the conditional posterior distributions of the hyperparameters are given by:

p(ac, lc, σc² | y(c), t(c), w(c), β, g(c)) ∝ p(y(c) | t(c), w(c), β, g(c), σc²) p(g(c) | t(c), ac, lc) p(ac, lc, σc²),

where the likelihood corresponds to an M(c)-variate normal density with null mean and diagonal covariance matrix, with diagonal elements equal to σc², evaluated at ỹ(c) − g(c)(t(c)). The posterior distribution of g(c)(t*), with t* any input time vector, is a multivariate Gaussian density with mean and covariance matrices:

(16) E[g(c)(t*) | ỹ(c)] = K(c)(t*, t(c)) (K(c) + σc² I)−1 ỹ(c)

(17) Cov[g(c)(t*) | ỹ(c)] = K(c)(t*, t*) − K(c)(t*, t(c)) (K(c) + σc² I)−1 K(c)(t(c), t*)
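A direct transcription of Equations (16) and (17) in R (again reusing sq_exp_kernel(); function and argument names are illustrative):

```r
# Posterior mean and covariance of g^(c) at new times t_star, given the
# corrected observations y_tilde at times t (Equations (16) and (17)).
gp_posterior <- function(t_star, t, y_tilde, a, l, sigma2) {
  V   <- sq_exp_kernel(t, t, a, l) + sigma2 * diag(length(t))
  Ks  <- sq_exp_kernel(t_star, t, a, l)       # K^(c)(t*, t^(c))
  Kss <- sq_exp_kernel(t_star, t_star, a, l)  # K^(c)(t*, t*)
  list(mean = Ks %*% solve(V, y_tilde),
       cov  = Kss - Ks %*% solve(V, t(Ks)))
}
```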
When the cluster-specific covariance matrices have low dimensions (equal to the sum of the numbers of time points of all individuals allocated to the given cluster), the marginal algorithm will be faster, as the functions g(c) are not sampled. However, if these dimensions are high, solving the necessary matrix inversions will become prohibitively time consuming or may even encounter numerical difficulties. The computational complexity of matrix inversion is at least O(n^2.37) for state-of-the-art blockwise inversion algorithms (Alman and Williams, 2021); as a rule of thumb, we therefore recommend using the conditional algorithm when the mean number of time points per individual multiplied by the number of individuals is high relative to the available computing power (e.g. higher than 1000 on a laptop with a 2.2 GHz dual-core processor). When, in addition to this, time points are irregular across individuals, we recommend using the approximation mentioned above to compute the inverses of the covariance matrices.
We recommend using at least as many iterations as data points for the sampling phase and 10% of that number for the burn-in phase. In general, one should keep increasing the number of iterations while convergence of the MCMC chain has not been reached. This convergence (or lack thereof) can be checked in such mixture models via the traces of the global parameters (α, β and the number of non-empty clusters) and the marginal partition posterior P(Z|D) defined by Hastie et al. (2015).
Finally, the MVN specification does not handle missing outcome values: all individuals must have the same number of observations, considered to be collected at the same observation times. The GP specification, however, handles missing outcome values and loss to follow-up under the assumption that data are missing either completely at random or at random. The GitHub page for PReMiuMlongi, https://github.com/premium-profile-regression/PReMiuMlongi, contains both the package and detailed documentation.
3.4. Estimated cluster trajectories and individual predictions
A representative partition is obtained using post-processing methods: for each number of clusters from 2 to a specified threshold, a Partitioning Around Medoids (PAM) algorithm is applied to the posterior dissimilarity matrix 1 − S, where S is the N × N matrix estimating the posterior probability of co-clustering for each pair of individuals. The optimal partition is then chosen by maximising the average silhouette width across the best PAM partitions, as in Liverani et al. (2015).
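The sketch below reproduces this post-processing step with standard CRAN tools (cluster for PAM); it is a simplified stand-in for the package’s own post-processing functions.

```r
# Post-processing sketch: PAM on the posterior dissimilarity matrix 1 - S
# for k = 2, ..., kmax, keeping the partition with the largest average
# silhouette width. `psm` is the N x N posterior similarity matrix.
library(cluster)

best_partition <- function(psm, kmax = 10) {
  diss   <- as.dist(1 - psm)
  fits   <- lapply(2:kmax, function(k) pam(diss, k = k, diss = TRUE))
  widths <- vapply(fits, function(f) f$silinfo$avg.width, numeric(1))
  fits[[which.max(widths)]]$clustering
}
```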
The uncertainty in cluster allocations is then integrated into the estimation of the cluster-specific parameters, following Molitor et al. (2010). The GP hyperparameters of the c-th cluster of the optimal partition are estimated by considering the Nc individuals allocated to it, and averaging over the sampled hyperparameters corresponding to the clusters these individuals were allocated to, at each MCMC iteration:

(18) θ̂c = (1/H) ∑h=1,…,H (1/Nc) ∑i: ẑi=c θ(h)zi(h),

with ẑi the cluster of individual i in the optimal partition and θ(h)zi(h) the hyperparameters corresponding to the cluster zi(h) individual i was allocated to at iteration h. Following the same procedure, we can compute individual longitudinal predictions (or predictions for a given profile of individuals) while accounting for the uncertainty in cluster allocations: at each iteration, the given individual is assigned to a cluster based on his/her posterior membership probabilities given the data; the final longitudinal predictions are then computed by averaging over his/her cluster-specific predictions across the MCMC iterations. In the MVN specification, the predicted trajectory is obtained by

ŷi = (1/H) ∑h=1,…,H (β(h)⊤wi 1M + μ(h)zi(h)),

with β(h) the vector of estimated fixed effects at iteration h and μ(h)zi(h) the estimated mean corresponding to the cluster the individual was allocated to at iteration h. In the GP specification, the predicted trajectory is obtained as the point-wise median, over the H MCMC iterations, of the posterior cluster-specific mean functions computed on a specified time grid t of length nt:

ŷi(t) = medianh=1,…,H { μ̂(h)zi(h)(t) + β(h)⊤wi 1nt },

with μ̂(h)zi(h) the posterior mean function (Equation (16)) in the cluster the individual was allocated to at iteration h. The 90% credible intervals were computed by considering the 5th and 95th percentiles. Section F of the Supplementary Materials describes how these models have been incorporated into PReMiuMlongi, and the outputs that are generated.
4. Simulation studies
In this section, we present four simulation studies. The first two demonstrate our motivation in using outcomes in profile regression to guide clustering: first, where the clustering structure in the covariates is weak and, second, where there are multiple clustering structures in the covariates, of which only one has clinical relevance. The third demonstrates the MVN and GP model formulations, comparing them with each other in terms of duration and efficacy in recovering the generated clustering structure. The methods are also assessed as the number of observation times is varied, and as the clustering structures overlap. The fourth study assesses the GP model on unequal time points.
Efficacy in recovering the target clustering structure is assessed via Rand indices (Rand, 1971). The Rand index is a metric ranging from 0 to 1 that reflects the similarity between two clustering structures. The index is 1 for identical structures. The adjusted Rand index corrects for the extent of agreement that would be expected by chance, given the sizes of the clusterings (Hubert and Arabie, 1985); hence, it is possible for the adjusted Rand index to take a negative value. In this section, we use the posterior expectation adjusted Rand (PEAR) criterion (Fritsch and Ickstadt, 2009), which corresponds to the adjusted Rand index between the optimal partition obtained from post-processing the fitted model and the generated partition. This index is computed using the R package mcclust (Fritsch and Ickstadt, 2009).
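For example, with mcclust the (adjusted) Rand index between an inferred and a generated partition is a one-liner; the toy partitions below are illustrative.

```r
# Adjusted Rand index with mcclust (Fritsch and Ickstadt, 2009).
library(mcclust)
z_true <- rep(1:2, each = 50)               # generated partition
z_hat  <- c(rep(1, 45), rep(2, 55))         # inferred partition
arandi(z_hat, z_true)                       # adjusted Rand index (default)
arandi(z_hat, z_true, adjust = FALSE)       # unadjusted Rand index
```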
4.1. Inference
In the following simulation studies, we adopt a GEM prior for the mixture distribution (Equation (2)), with α ∼ Gamma(2, 1). The covariate model is defined by Equation (3), with ϕc,q,e ∼ Dirichlet(1) for each cluster c, covariate q and covariate level e. In Simulations 1, 2 and 3, we use the MVN specification for the outcome model, defined in Equation (5) with κ0 = 0.01 and ν0 equal to the number of time points M. In Simulations 3 and 4, the outcome model with the GP specification is defined by Equations (10) and (11), adopting a null mean function for the GP prior and a squared exponential covariance function, with standard normal priors for the log-hyperparameters log(ac), log(lc) and log(σc²). The initial number of clusters is set to 20.
4.2. Simulation study 1: Semi-supervised vs. unsupervised clustering
Here we demonstrate the importance of using outcome data in identifying latent clustering structures. We define two clusters, each with 50 individuals, and simulate covariate data together with 0 to 4 repeated outcome measurements. We use PReMiuMlongi with an MVN response model and compare the estimated clustering structures with the generated one.
Our focus is on the difference between inference using no outcome data (M = 0), and inference using M ≥ 1 observation times. Inference using no outcome data corresponds to studies in which, first, clustering is performed on covariate data and, second, cluster memberships are leveraged to explain some outcome. The rationale of profile regression is that, when the purpose of clustering is to group individuals according to their outcomes, using outcome data leads to more meaningful cluster labels.
4.2.1. Design
The covariates are discrete variables generated from a categorical distribution. We set the number of covariates to 10, denoted q = 1, …, 10. Each covariate q has Rq = R = 3 categories, with multinomial R-parameter ϕc,q generated as follows: ϕc,q = (1 − v) 1R/R + v δc,q, with δc,q ∼ Dirichlet(α0 1R), where δc,q and 1R are vectors of length R, and α0 = 0.01.
The v parameter influences the cluster separability in terms of covariate profiles. A value of v close to 0 leads to similar profiles, as the probability ϕc,q for each covariate q tends towards 1/R for all clusters c. A value of v close to 1 leads to distinct profiles, as the small value of α0 leads to most individuals in cluster c having the same value for covariate q, which is unlikely to be the same across clusters for all ten covariates. We choose v = 0.4 as an intermediate value to demonstrate the added benefit of integrating repeated outcome measures for recovering the clustering.
The outcome is generated from a multivariate normal distribution with means 1M and 4 · 1M for the first and second clusters, respectively, and covariance matrices 0.5ℐM, where ℐM is the identity matrix of dimension M, the number of observation times.
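Putting the pieces together, a minimal sketch of this generative process follows; the Dirichlet-mixture construction of ϕc,q matches our reconstruction above and should be read as an assumption.

```r
# Sketch of the Simulation 1 design: two clusters of 50 individuals,
# Q = 10 three-category covariates with separability v, and an MVN outcome
# with cluster means 1 and 4 at M common time points.
set.seed(4)
N <- 100; Q <- 10; R <- 3; M <- 4; v <- 0.4; alpha0 <- 0.01
z <- rep(1:2, each = N / 2)

phi <- array(dim = c(2, Q, R))
for (c in 1:2) for (q in 1:Q) {
  delta <- rgamma(R, alpha0); delta <- delta / sum(delta)  # Dirichlet(alpha0)
  phi[c, q, ] <- (1 - v) / R + v * delta   # near-uniform for v close to 0
}

X <- sapply(1:Q, function(q)
  vapply(z, function(zi) sample(1:R, 1, prob = phi[zi, q, ]), integer(1)))
Y <- t(vapply(z, function(zi)
  MASS::mvrnorm(1, mu = rep(c(1, 4)[zi], M), Sigma = 0.5 * diag(M)),
  numeric(M)))
```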
4.2.2. Results
We ran PReMiuMlongi with the MVN response model and max(2000, 2000M) iterations each for burn-in and sampling. The effect of the number of observation times on the resulting PEAR indices is shown in Figure 1. The x axis represents 5 simulation settings, with all individuals having 0 to 4 time points. Each setting is simulated 1000 times. When the outcome is unaccounted for, the PEAR indices are low, as the separability parameter between the covariate profiles is relatively low (v = 0.4). The major PEAR improvement is from zero to one observation time. This is due to the way the longitudinal pattern is simulated: the clustering structure is weak in the covariates and stronger in the outcomes. It is possible to recover, to some extent, the clustering structure in the covariates without use of outcome data. However, inclusion of outcome data improves cluster-structure recovery (Figure 1).
Figure 1.
The effect of number of observation times on PEAR index. Points are median PEAR indices for 1000 simulations with error bars showing 0.05 and 0.95 quantiles. In each simulation, data are generated for 100 individuals belonging to one of two clusters. Each individual has 10 covariates and 0 to 4 time points (or MVN outcomes).
4.3. Simulation study 2: Semi-supervised clustering and variable selection
Here we adapt the simulation study of Section 4.2 to a case in which there are two sets of covariates that correspond to two different clustering structures of the population. The outcome data share the clustering structure of one set of covariates and are independent of the other (which we refer to as the “alternative clustering”). Moreover, we use the variable selection feature of PReMiuMlongi in order to select the covariates that guide the final clustering. In the absence of outcome data, there is no reason for one set of covariates to be selected over the other. Outcome data are therefore required for identifying meaningful clustering structures when more than one structure is present in the data according to different covariate subsets.
4.3.1. Design
Covariate data consist of four variables, two of which correspond to a first clustering structure (which we describe as the “alternative clustering” and denote č) and two of which correspond to a second clustering structure, which retains the label c. We simulate covariates as in Simulation 1, while choosing v = 1 for all covariates, so that each clustering structure has a strong signature when those covariates are viewed alone (see Supplementary Figure 10). Extending the notation from Section 4.2, where each individual has a cluster assignment ži for č and a cluster assignment zi for c, covariate xq,i | ži = č has parameter ϕč,q for q = 1, 2, and xq,i | zi = c has parameter ϕc,q for q = 3, 4.
Response data are simulated as in Section 4.2, according to clustering structure c:

yi | zi = 1 ∼ N(1M, 0.5 ℐM),  yi | zi = 2 ∼ N(4 · 1M, 0.5 ℐM).
The response variable is aligned with clustering structure c, and is independent of the alternative clustering structure č. The two covariates with clustering structure c therefore align with the response variable. Again we vary the number of observation times from 0 to 4 in five simulation scenarios. We use this model to illustrate how inclusion of outcome data influences variable selection, and how these together affect PEAR indices.
4.3.2. Results
Inclusion of response variables enables recovery of the target clustering structure (or generated clustering structure) (Figure 2). The top row of Figure 2 shows density plots of variable selection for each number of observation times. The bottom row represents the corresponding 5th, 50th and 95th percentiles of PEAR indices for the 1000 simulations of each scenario, when comparing the estimated clustering structure with the one that aligns with the outcome. When the response is excluded (number of observation times is zero), covariates 1–4 are not differentially selected, and correspondence between the inferred and generated clustering structures is low (median PEAR index around 0.5). When four time points (or MVN outcomes) are included, covariates one and two are deselected in favour of covariates three and four, resulting in identification of the clustering structure that aligns with the response.
Figure 2.
The effect of number of observation times on PEAR indices when there are two clustering structures in the covariate data. Top: density plots showing selection for the four covariates. Bars show the weights given to the covariate in each of 1000 simulations, where 0 means unselected and 1 selected. Darker blue indicates higher density. Covariates 3 and 4 align with the outcome data. Bottom: points are median PEAR indices for 1000 simulations with error bars showing 0.05 and 0.95 quantiles.
Here we have shown that, for a dataset in which there is more than one clustering structure, use of covariate variable selection with outcome data in profile regression improves recovery of the target clustering structure.
4.4. Simulation study 3: Comparison of MVN and longitudinal specifications
To draw out the differences between the two response models, we use each in turn to simulate data, and use both to infer clusters and profiles. For the sake of comparison, we simulate data that are longitudinal, with all individuals having observations at the same time points, so that both the MVN and GP model formulations may be applied to the data generated. We compare the two methods in terms of PEAR indices and the time taken by the algorithm, and then assess the effects of a) the data-generating model, b) the number of observation times, and c) the extent to which the generated clusters are identifiable.
4.4.1. Design
In this section, we generate two clusters of 30 individuals. Covariate data consist of two discrete variables. To better highlight the difference between the MVN and GP methods, and to prevent the covariates from assisting cluster identification, this simulation study features no cluster structure in the covariate data.
Outcome data for the two clusters are simulated with mean vectors 10 · 1M for the first cluster and (10 + ξ tj)j=1,…,M for the second, with the M time points tj equally spaced over [0, 10] and ξ a gradient (or separability parameter) ranging from −1 to 0. The covariance matrices are the same for both clusters. When generating MVN data, we use a matrix of dimension M with diagonal values of 1 and lag-1 values of 0.5. When generating from a GP response model, we use the covariance function defined in Equation (9) with parameters {log(a), log(l), log(σ²)} = {−0.5, −0.1, −0.5}.
We consider several scenarios by varying the data-generating model (MVN or GP), the number of observation times M (3 to 6) and the gradient ξ (−1 to 0). Some examples of simulated data are given in Figure 3. Both response models are estimated on each scenario, with 5000(M − 2) iterations each in the burn-in and sampling phases for the MVN response model, and 5000 iterations for each phase for the GP one. Each scenario is repeated 1000 times.
Figure 3.
Simulated data. Examples of data generated with the MVN model (left) and the GP model (right). Top line: three data points across the time period 0–10. The mean gradient is -1 for the second (purple) cluster. Bottom line: five data points across the time period 0–10. The mean gradient is -0.5 for the second (purple) cluster
4.4.2. Results
Figure 4 summarises the effects of the three variables of interest on the two response models. A small gradient (|ξ| < 0.5) between the two generated clusters has a strong negative impact on the performance of the MVN response model. Also, we observe that the target clustering structure is better recovered by the MVN response model when there are fewer observation times, for a budgeted computation of 10000(M − 2) iterations. Since the model dimension increases with more observation times, more iterations of the sampler are required to reach convergence, as illustrated in Supplementary Figure 11.
Figure 4.
Simulation study results and timings. The first two columns of red heatmaps show PEAR indices following inference with the MVN response model (top row) and the GP response model (bottom row). Data were generated with a MVN model (first column) and a GP model (second column). The last two columns of heatmaps show the corresponding duration, in seconds, of inference. Each square within each heatmap represents a single experimental set-up: the gradient (which indicates how separable the clusters are) varies on the y axis, and the number of observation times varies on the x axis. The colour in the first two columns of heatmaps corresponds to the average PEAR index of 1000 repetitions, where darker red indicates a higher index. In the last two columns, the colour corresponds to the average duration of 1000 repetitions, where darker blue/green indicates a longer time. PReMiuMlongi was run with 5000 iterations each for burn in and sampling for the GP response model. For the MVN response model, the number of iterations for the burn in and sampling were each 5000(M – 2), where M is the number of observation times
For the GP response model, the generated clustering structure is well recovered for the GP-generated data, as these data have an inherent structure that the response model can uncover. This response model has only three parameters to infer per cluster regardless of the number of observation times, in comparison with M + M(M + 1)/2 for the MVN response model, so fewer iterations are required for the GP response model. Both models perform better when the clusters are more separable, as we would expect.
Figure 4 summarises the time taken for PReMiuMlongi to run. The MVN response model is much quicker than the GP model, by two orders of magnitude, despite a larger number of iterations in the burn-in phase and the sampler for datasets with more than three observation times.
4.5. Simulation study 4: Handling irregularly spaced time points
4.5.1. Design
We generated one dataset of 200 individuals, allocated to 5 clusters of sizes 10, 30, 50, 70 and 40 individuals, respectively. Conditionally on the cluster allocations, a Gaussian outcome was generated from a Gaussian process as Yi,j = gc(ti,j) + ϵi,j, where ϵi,j ∼ N(0, σc²) and gc ∼ GP(mc, kc), with mean values mc at the nominal visit times given in Table 1 and a squared exponential covariance function kc (Equation (9)). Outcome observations were generated for each individual at 7 visits (v = [0, 2, 4, 6, 8, 10, 12] years), with individual-specific time points. Figure 5 displays the individual outcome trajectories for the 200 individuals, coloured according to the cluster allocations. A set of 5 categorical covariates, with 3 categories each, was generated for each individual given the cluster allocations, with the cluster-specific parameters shown in Table 1.
Figure 5. Generated longitudinal trajectories for 200 individuals, coloured by clusters.
Table 1. Hyperparameters for data generation.

| Cluster | mc | θc | ϕc,1 | ϕc,2 | ϕc,3 | ϕc,4 | ϕc,5 |
|---|---|---|---|---|---|---|---|
| 1 | [9, 8.5, 8, 6, 5, 4, 3] | [0.5, 0.1, −0.7] | [0.8, 0.1, 0.1] | [0.8, 0.1, 0.1] | [0.8, 0.1, 0.1] | [0.8, 0.1, 0.1] | [0.8, 0.1, 0.1] |
| 2 | [9, 7, 6, 4, 2, 1, 0] | [0.6, 0.2, −0.3] | [0.1, 0.8, 0.1] | [0.1, 0.8, 0.1] | [0.1, 0.8, 0.1] | [0.1, 0.8, 0.1] | [0.1, 0.1, 0.8] |
| 3 | [7, 5, 8, 9, 10, 11, 9] | [0.1, 0.3, −0.7] | [0.4, 0.5, 0.1] | [0.5, 0.4, 0.1] | [0.1, 0.4, 0.5] | [0.5, 0.1, 0.4] | [0.5, 0.3, 0.2] |
| 4 | [8, 5, 8, 9, 10, 11, 9] | [0.3, 0.4, −0.5] | [0.2, 0.7, 0.1] | [0.7, 0.2, 0.1] | [0.3, 0.6, 0.1] | [0.2, 0.1, 0.7] | [0.6, 0.3, 0.1] |
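Below is a sketch of this generative design for one individual (reusing sq_exp_kernel() from Section 3.2.1). The jitter around the nominal visits and the reading of θc as (log ac, log lc, log σc²) are assumptions of this sketch; the exact scheme is given in the Supplementary Materials.

```r
# Sketch of Simulation 4: irregular individual time points around the 7
# nominal visits, a zero-mean GP deviation, the cluster mean interpolated
# from Table 1, and i.i.d. Gaussian noise.
visits <- c(0, 2, 4, 6, 8, 10, 12)

gen_individual <- function(m_c, log_theta_c) {
  a_c <- exp(log_theta_c[1]); l_c <- exp(log_theta_c[2])
  sigma2_c <- exp(log_theta_c[3])
  t_i <- sort(visits + runif(7, -0.5, 0.5))           # assumed jitter scheme
  K   <- sq_exp_kernel(t_i, t_i, a_c, l_c)
  g   <- MASS::mvrnorm(1, mu = rep(0, 7), Sigma = K + 1e-8 * diag(7))
  mu  <- approx(visits, m_c, xout = t_i, rule = 2)$y  # cluster mean at t_i
  list(t = t_i, y = mu + g + rnorm(7, 0, sqrt(sigma2_c)))
}

set.seed(5)
gen_individual(m_c = c(9, 8.5, 8, 6, 5, 4, 3),        # cluster 1, Table 1
               log_theta_c = c(0.5, 0.1, -0.7))
```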
The model was then estimated using cluster-specific Gaussian processes with a null prior mean function and standard normal hyperpriors for the log-parameters: log(ac), log(lc), log(σc²) ∼ N(0, 1). We adopted a Gamma prior for the concentration parameter of the DP, α ∼ Gamma(2, 1), and Dirichlet priors for the cluster-specific parameters of the covariate models, ϕc,q,e ∼ Dirichlet(1) for cluster c, covariate q and covariate level e. The initial number of clusters was set to 20.
4.5.2. Results
We ran 7 chains on the generated dataset, with 10000 iterations each, and compared the traces of α, the number of clusters (nClus) and the number of non-empty clusters (Figure 6), as well as the posterior similarity matrices (Supplementary Figure 12).
Figure 6. From left to right: Traces of the α DP parameter, of the number of clusters and number of non-empty clusters for the 7 chains.
The traces of α overlap and show convergence of the 7 chains. The traces of the number of clusters across chains also overlap but have different minimal values, bounded by the number of non-empty clusters, shown on the third plot. Even though the number of non-empty clusters varies between 3 and 7 across the iterations, the similarity matrices (Supplementary Figure 12) show good recovery of the generated clustering, represented in the first column of each plot. The posterior similarity matrices are highly contrasted because of the overall clear separability of the clusters. However, the third and fifth chains merge clusters 1 and 5, whose longitudinal trajectories actually overlap (Figure 5). This may imply that these chains are trapped in local modes, a common challenge in mixture models, and the algorithm could be improved by altering the proposal distributions. Split-and-merge moves could also be added to the PReMiuMlongi sampler, as in Jain and Neal (2007), but this is beyond the scope of the current work.
We present a similar simulation study with 500 individuals in Section D in the Supplementary Materials, to mimic the size of the data application.
5. Application
We applied our methodology to budding-yeast data. Yeast is widely studied in genetic research because, like humans, it is a eukaryotic organism, with its DNA information enclosed in a cell nucleus. Additionally, it is unicellular, and therefore easier to study. The genome of the yeast species Saccharomyces cerevisiae has been entirely sequenced and is publicly available. During the cell cycle, gene expression is regulated by transcription factors, proteins inducing or repressing the transcription of DNA into mRNA by binding to specific DNA sequences. Similar expression patterns across genes suggest a co-regulation process, indicating that these genes may be regulated by a common set of transcription factors. The protein-DNA interactions can be investigated using chromatin immunoprecipitation (ChIP) microarrays, which identify the DNA binding sites. Our objective was to jointly analyse ChIP and gene expression data, to uncover groups of genes co-regulated during the budding yeast cell cycle and to identify the transcription factors involved in this process, in order to better understand the complex relationships between genes.
5.1. Data
We analyzed gene expression time course data collected by Granovskaia et al. (2010) and ChIP data provided by Harbison et al. (2004), also analyzed in Kirk et al. (2012a) and Zurauskiene et al. (2016). Gene expression was quantified over time by collecting high-resolution, strand-specific tiling microarray profiles of RNA expression every 5 minutes over three Saccharomyces cerevisiae cell cycles, for a total of 41 measures. The gene measurements were then standardised at the gene level, so that the mean and variance of the expression measures of each gene were 0 and 1, respectively. The binary ChIP data contained binding information for 117 transcription factors, informing which transcription factors bound to which genes, during the cell cycle. We selected 551 genes with periodic expression patterns identified in Granovskaia et al. (2010) for which ChIP data were available and considered the transcription factors which bound to at least one of these genes, totalling 80. To alleviate computational burden, we reduced the set of repeated gene expression measurements to a set of 11 regularly spaced time points. The corresponding trajectories for the 551 genes are depicted in Figure 7.
Figure 7. Standardised gene expression for 551 genes of budding yeast observed over time, in minutes.
5.2. Model
We analysed the gene expression time course Y and the transcription factors Xq, q = 1, …, 80, specifying the following model:

Yi | ti, zi = c ∼ N(g(c)(ti), σc² IMi), with g(c) ∼ GP(mc, kc),
Xi,q | zi = c ∼ Bernoulli(ϕc,q), q = 1, …, 80,
zi | π ∼ Categorical(π), with π ∼ GEM(α),

with Yi, ti the vectors of individual repeated measures and time points, respectively, Xi,q the binary variable indicating whether transcription factor q binds to gene i, α the concentration parameter of the Dirichlet process prior influencing the number of clusters, θc = (ac, lc, σc²) the vector of cluster-specific GP parameters and ϕc,q the probability that transcription factor q equals 1 in cluster c. Hyperpriors were defined as lognormal distributions for the covariance function hyperparameters, with mc defined as the null function and kc a squared exponential function (defined in Equation (9)). The lognormal hyperpriors adopted for the covariance functions were chosen based on an empirical variogram of the data (Diggle, 1990). We constrained the variance of the measurement error, defining it as proportional to the cluster signal variance: σc² = ac/r, with r = 4. Finally, a variable selection approach was used to identify the transcription factors likely involved in the co-regulation process of the genes in each group.
5.3. Results
The model was estimated by MCMC with 10000 iterations. Based on the posterior similarity matrix (Figure 9), we identified a final partition of 4 clusters of 109, 206, 113 and 123 genes, respectively. Based on a 10% lower threshold for the variable-specific relevance indicator ρq (Equation (4)), the variable selection identified 10 transcription factors which had different distributions across the clusters and were thus likely involved in the co-regulation process: MCM1, MBP1, SWI5, SWI6, SWI4, NDD1, ACE2, FKH2, FKH1 and STE12. The gene expression trajectories in these 4 clusters are depicted in Figure 8, exhibiting distinct patterns whose peaks occur in different phases of the cycle. The cell cycle is divided into 4 phases: Gap 1 (G1), during which the cell grows and prepares for duplication; synthesis (S), where DNA replication occurs; Gap 2 (G2), during which the cell keeps growing to prepare for division; and mitosis (M). The 4 clusters seem to be involved in the M/G1, G1, G1 and G2 phases, respectively. The description of the same clusters with regard to all the transcription factors is given in Supplementary Figure 17. The right panels of Figures 8 and 13 were produced with the R package premiumPlots (https://github.com/simisc/premiumPlots).
Figure 9.
Posterior similarity matrix (PSM) representing the probability of pairwise co-clustering for the 551 genes. Each element (i, j) of this 551 × 551 matrix is proportional to the number of times genes i and j were allocated to the same cluster over the 10000 MCMC iterations.
Figure 8.
Trajectories of gene expression and covariate profiles for the 4 clusters of the final partition. On the left panels, trajectories of the genes, coloured according to the cluster they are allocated to. Dashed vertical black lines delimit the three cell cycles and dashed vertical gray ones delimit the different phases within the first cycle: M/G1, G1, S, G2, G2/M, M and M/G1 again. On the right panels, cluster-specific profiles with regard to the 10 selected transcription factors: MCM1, MBP1, SWI5, SWI6, SWI4, NDD1, ACE2, FKH2, FKH1 and STE12. In each cell, associated with cluster k = 1, …, 4 (rows) and transcription factor q = 1, …, 10 (columns), the black pip represents the empirical proportion of genes with xq = 1 in the whole sample. The turquoise filled square and the red one represent the estimated cluster-specific proportions ϕ̂k,q falling below and above this empirical proportion, respectively. Squares are dark filled if the 95% credibility interval of ϕ̂k,q does not contain the empirical proportion, meaning that the considered category is significantly different in the cluster compared to the whole sample: red squares are dark filled if the 5th percentile is above the empirical proportion, and turquoise ones if the 95th percentile is below it.
We assessed the relevance of the clustering solution using Gene Ontology Term Overlap (GOTO) scores (Mistry and Pavlidis, 2008). GOTO scores can be interpreted as quantifying the homogeneity of Gene Ontology annotations within clusters of genes: they represent the average number of shared GO annotations for a pair of genes from the same cluster. Obtaining values similar to previous results on this dataset (see Supplementary Table 2) thus supports the biological-function homogeneity of our identified clusters.
Finally, we analyzed the same dataset using the MVN specification in Section E.2 of the Supplementary Materials. The results show that the GP specification, unlike the MVN specification, recovers the apparent underlying clustering structure of the gene expression data, notably thanks to its flexibility and the reduced number of parameters to estimate compared to the MVN outcome model with an unstructured covariance matrix.
5.4. Convergence diagnosis and sensitivity analysis
We ran two additional chains to assess the convergence of the model, and compared the traces of the concentration parameter α of the Dirichlet process, as well as the number of clusters, across the MCMC iterations, shown in Supplementary Figure 18. The cluster-specific parameters cannot be traced as the number of clusters varies from one iteration to another. Supplementary Figure 19 presents the two posterior similarity matrices, which are very similar to the one in Figure 9, further supporting the convergence of the model.
In a sensitivity analysis, we estimated the same model with different values of the ratio r between the signal variance and the measurement error variance. All the models were specified in the same way, as described above. We compared, in terms of PEAR indices, the partitions obtained with the different values of r (Supplementary Table 3). We observe very stable results when r varies between 2 and 6, showing little sensitivity to the setting of this parameter.
6. Discussion
We extended the semi-supervised profile regression approach to uncover the intrinsic clustering structure in a heterogeneous population from a longitudinal outcome and a set of correlated covariates. Outcome patterns and covariate profiles, defined as combinations of covariate values, are linked through a non-parametric clustering structure. A Dirichlet process prior is used on the mixing distribution to handle an unconstrained number of clusters, also allowing us to quantify the uncertainty in cluster allocations. We implemented two modelling approaches to describe the repeated outcome, based either on a multivariate Gaussian distribution, to handle regular time points common to all individuals, or on Gaussian process regression, accommodating irregular time points and/or a large number of time points. The cluster-specific parameters are estimated while incorporating the uncertainty in the cluster allocations, and individual marginal predictions can be computed, accounting for the population heterogeneity. The simulation studies showed that the MVN model better recovered the clustering structure, and was robust, with a small number of time points, while the GP model performed better with more than 5 or 6 time points, although it required longer computation times regardless of the number of time points and the separability of the clusters. Applying the model to budding yeast data, we integrated gene expression time course measurements and transcription factor binding data to identify clusters of co-regulated genes and associated transcriptional modules.
In the current version of the algorithm, we adopted an inverse Wishart prior distribution for the covariance matrix in the MVN specification, which is advantageous for its conjugacy. However, with a high number of time points, a hierarchical inverse Wishart prior may provide more flexibility (Alvarez et al., 2014). We also implemented a squared exponential kernel for the GP prior covariance function. This is a common choice for such priors, but it can lead to some instability when fitting longitudinal data with highly overlapping trajectories. Indeed, the variance hyperprior has an important impact on the final number of non-empty clusters, with high signal variances leading to a small number, and vice versa. In our application, we fixed the variance ratio to ensure that the signal variance is larger than (or equal to) the measurement error variance for each cluster, in conjunction with weakly informative hyperpriors for the signal variance. Additionally, we performed a sensitivity analysis showing the stability of our results across different values of this variance ratio.
Duvenaud et al. (2013) considered other base kernels, such as linear, quadratic, rational quadratic or periodic ones, and proposed composite kernels defined as combinations of the base ones. Such an extension of profile regression could potentially improve the recovery of known structures in the dataset. Moreover, conditionally on the cluster allocations, the variability of the outcome is entirely modelled by this covariance function. Cluster-specific fixed effects could refine the cluster identification and help disentangle the effect of a covariate across the different clusters. Finally, hierarchical Gaussian process regression (Hensman et al., 2013) would allow the handling of multiple longitudinal markers, all correlated through an underlying cluster-specific process. However, the estimation of such models would require a large amount of data to estimate all the levels of the hierarchical GP model.
The dataset analysed in this paper contained no missing values, but our model with the GP specification could also be applied to incomplete data, such as epidemiological cohorts. However, it relies on the assumption that missing outcome data are missing at random. The MVN specification does not handle missing outcome data, but multiple imputation methods (Rubin, 1987, van Buuren and Groothuis-Oudshoorn, 2011) could be considered for a future extension. For handling not-at-random missing data, we intend to extend the algorithm to jointly model a longitudinal outcome and a time-to-event, using a cluster-specific proportional hazards model. The missingness process would then be linked to the outcome through the clusters, which would entirely capture the correlation between the two quantities. Multiple endpoints, such as time to disease onset and disease progression, could also be accommodated by incorporating a multi-state model, associating specific risks of disease with the different clusters, as in Rouanet et al. (2016).
Finally, profile regression assumes a clustering structure common to all the data types integrated in the analysis. More flexible approaches allow different structures for the longitudinal outcome and the covariates, with possibly shared clusters (Kirk et al., 2012b, Kirk and Richardson, 2021, Savage et al., 2010). This enables the identification of the clusters that are meaningful with regard to all the biomarkers, while discarding irrelevant stratifications of the population that may be linked to the inherent structure of a given set of variables. This would also be an interesting area for development.
Acknowledgements
This work was supported by the National Institute for Health Research [Cambridge Biomedical Research Centre at the Cambridge University Hospitals NHS Foundation Trust]. The views expressed are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care.
Funding
AR, SR and BDT: MRC-funded Dementias Platform UK (RG74590).
SR: MRC programme MC_UU_00002/10. BDT: MRC programme MC_UU_00002/2.
RJ: MRC programme MR/R015600/1.
SRW: MRC programme MC_UU_00002/2 and NIHR BRC-1215-20014.
PDWK: MRC programme MC_UU_00002/13.
References
- Akritas AG, Akritas EK, Malaschonok GI. Various proofs of Sylvester's (determinant) identity. Mathematics and Computers in Simulation. 1996;42(4–6):585–593.
- Alman J, Williams VV. A refined laser method and faster matrix multiplication. Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA); 2021. pp. 522–539.
- Alvarez I, Niemi J, Simpson M. Bayesian inference for a covariance matrix. arXiv preprint. 2014:arXiv:1408.4050.
- Bair E, Tibshirani R. Semi-supervised methods to predict patient survival from gene expression data. PLoS Biology. 2004;2(4):e108. doi: 10.1371/journal.pbio.0020108.
- Brock G, Pihur V, Datta S. clValid: An R package for cluster validation. Journal of Statistical Software. 2008;25(4).
- Cui Y, Fern XZ, Dy JG. Non-redundant multi-view clustering via orthogonalization. Seventh IEEE International Conference on Data Mining (ICDM 2007); 2007. pp. 133–142.
- Diggle PJ. Time Series: A Biostatistical Introduction. Clarendon Press; Oxford: 1990.
- Duvenaud D, Lloyd J, Grosse R, Tenenbaum J, Ghahramani Z. Structure discovery in nonparametric regression through compositional kernel search. International Conference on Machine Learning; 2013. pp. 1166–1174.
- Fraley C, Raftery AE. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association. 2002;97(458):611–631.
- Fritsch A, Ickstadt K. Improved criteria for clustering based on the posterior similarity matrix. Bayesian Analysis. 2009;4(2):367–391.
- Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB. Bayesian Data Analysis. 3rd edition. Chapman and Hall/CRC; 2013.
- Granovskaia MV, Jensen LJ, Ritchie ME, Toedling J, Ning Y, Bork P, Huber W, Steinmetz LM. High-resolution transcription atlas of the mitotic cell cycle in budding yeast. Genome Biology. 2010;11(3):1–11. doi: 10.1186/gb-2010-11-3-r24.
- Guan Y, Dy JG, Niu D, Ghahramani Z. Variational inference for nonparametric multiple clustering. MultiClust Workshop, KDD-2010; 2010.
- Handl J, Knowles JD, Kell DB. Computational cluster validation in post-genomic data analysis. Bioinformatics. 2005;21(15):3201–3212. doi: 10.1093/bioinformatics/bti517.
- Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, Macisaac KD, Danford TW, Hannett NM, Tagne J-B, Reynolds DB, Yoo J, et al. Transcriptional regulatory code of a eukaryotic genome. Nature. 2004;431(7004):99–104. doi: 10.1038/nature02800.
- Hastie DI, Liverani S, Richardson S. Sampling from Dirichlet process mixture models with unknown concentration parameter: mixing issues in large data implementations. Statistics and Computing. 2015;25(5):1023–1037. doi: 10.1007/s11222-014-9471-3.
- Hathaway DK, D'Agostino RB. A technique for summarizing longitudinal data. Statistics in Medicine. 1993;12(23):2169–2178. doi: 10.1002/sim.4780122303.
- Hensman J, Lawrence ND, Rattray M. Hierarchical Bayesian modelling of gene expression time series across irregularly sampled replicates and clusters. BMC Bioinformatics. 2013;14(1):252. doi: 10.1186/1471-2105-14-252.
- Hubert L, Arabie P. Comparing partitions. Journal of Classification. 1985;2(1):193–218.
- Ihmels J, Friedlander G, Bergmann S, Sarig O, Ziv Y, Barkai N. Revealing modular organization in the yeast transcriptional network. Nature Genetics. 2002;31(4):370–377. doi: 10.1038/ng941.
- Ishwaran H, James LF. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association. 2001;96(453):161–173.
- Jacob F, Monod J. Genetic regulatory mechanisms in the synthesis of proteins. Journal of Molecular Biology. 1961;3(3):318–356. doi: 10.1016/s0022-2836(61)80072-7.
- Jain AK, Murty MN, Flynn PJ. Data clustering: a review. ACM Computing Surveys. 1999;31(3):264–323.
- Jain S, Neal RM. Splitting and merging components of a nonconjugate Dirichlet process mixture model. Bayesian Analysis. 2007;2(3):445–472.
- Kalli M, Griffin JE, Walker SG. Slice sampling mixture models. Statistics and Computing. 2009;21(1):93–105.
- Kerr MK, Churchill GA. Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments. Proceedings of the National Academy of Sciences. 2001;98(16):8961–8965. doi: 10.1073/pnas.161273698.
- Kirk PDW, Griffin JE, Savage RS, Ghahramani Z, Wild DL. Bayesian correlated clustering to integrate multiple datasets. Bioinformatics. 2012a;28(24):3290–3297. doi: 10.1093/bioinformatics/bts595.
- Kirk PDW, Griffin JE, Savage RS, Ghahramani Z, Wild DL. Bayesian correlated clustering to integrate multiple datasets. Bioinformatics. 2012b;28(24):3290–3297. doi: 10.1093/bioinformatics/bts595.
- Kirk PDW, Richardson S. Semi-supervised multi-view Bayesian clustering. 2021. In preparation.
- Law MH, Jain AK, Figueiredo M. Feature selection in mixture-based clustering. In: Becker S, Thrun S, Obermayer K, editors. Advances in Neural Information Processing Systems. MIT Press; 2003. pp. 641–648.
- Law MHC, Figueiredo MAT, Jain AK. Simultaneous feature selection and clustering using mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2004;26(9):1154–1166. doi: 10.1109/TPAMI.2004.71.
- Li D, Shafto P. Bayesian hierarchical cross-clustering. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics; Fort Lauderdale, FL, USA. 2011. pp. 443–451.
- Liverani S, Hastie DI, Azizi L, Papathomas M, Richardson S. PReMiuM: An R package for profile regression mixture models using Dirichlet processes. Journal of Statistical Software. 2015;64(7):1–30. doi: 10.18637/jss.v064.i07.
- McDowell IC, Manandhar D, Vockley CM, Schmid AK, Reddy TE, Engelhardt BE. Clustering gene expression time series data using an infinite Gaussian process mixture model. PLoS Computational Biology. 2018;14(1):e1005896. doi: 10.1371/journal.pcbi.1005896.
- Mistry M, Pavlidis P. Gene Ontology term overlap as a measure of gene functional similarity. BMC Bioinformatics. 2008;9(1):327. doi: 10.1186/1471-2105-9-327.
- Molitor J, Papathomas M, Jerrett M, Richardson S. Bayesian profile regression with an application to the National Survey of Children's Health. Biostatistics. 2010;11(3):484–498. doi: 10.1093/biostatistics/kxq013.
- Murphy KP. Conjugate Bayesian analysis of the Gaussian distribution. Technical report; 2007.
- Neal RM. Regression and classification using Gaussian process priors. Bayesian Statistics. 1999;6:475–501.
- Niu D, Dy JG, Jordan MI. Multiple non-redundant spectral clustering views. Proceedings of the 27th International Conference on Machine Learning; 2010. pp. 831–838.
- Niu D, Dy JG, Jordan MI. Iterative discovery of multiple alternative clustering views. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2014;36(7):1340–1353. doi: 10.1109/TPAMI.2013.180.
- Papathomas M, Molitor J, Hoggart C, Hastie D, Richardson S. Exploring data from genetic association studies using Bayesian variable selection and the Dirichlet process: application to searching for gene-gene patterns. Genetic Epidemiology. 2012;36(6):663–674. doi: 10.1002/gepi.21661.
- Pitman J. Combinatorial stochastic processes. Technical report. Department of Statistics, UC Berkeley; 2002.
- Rand WM. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association. 1971;66(336):846–850.
- Rasmussen CE, Williams CKI. Gaussian Processes for Machine Learning. The MIT Press; 2006.
- Reid JE, Wernisch L. Pseudotime estimation: deconfounding single cell time series. Bioinformatics. 2016;32(19):2973–2980. doi: 10.1093/bioinformatics/btw372.
- Ross J, Dy J. Nonparametric mixture of Gaussian processes with constraints. In: Dasgupta S, McAllester D, editors. Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research; Atlanta, GA, USA. 2013. pp. 1346–1354.
- Rouanet A, Joly P, Dartigues J-F, Proust-Lima C, Jacqmin-Gadda H. Joint latent class model for longitudinal data and interval-censored semi-competing events: application to dementia. Biometrics. 2016;72(4):1123–1135. doi: 10.1111/biom.12530.
- Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics. 1987;20:53–65.
- Rubin DB. Multiple Imputation for Nonresponse in Surveys. New York: Wiley; 1987.
- Savage RS, Ghahramani Z, Griffin JE, de la Cruz BJ, Wild DL. Discovering transcriptional modules by Bayesian data integration. Bioinformatics. 2010;26(12):158–167. doi: 10.1093/bioinformatics/btq210.
- Shen R, Olshen AB, Ladanyi M. Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics. 2009;25(22):2906–2912. doi: 10.1093/bioinformatics/btp543.
- Snelson E, Ghahramani Z. Sparse Gaussian processes using pseudo-inputs. Advances in Neural Information Processing Systems. 2006. pp. 1257–1264.
- Strauss ME. Bayesian modelling and sampling strategies for ordering and clustering problems with a focus on next-generation sequencing data. Doctoral thesis; 2019.
- Strauss ME, Kirk PDW, Reid JE, Wernisch L. GPseudoClust: deconvolution of shared pseudo-profiles at single-cell resolution. Bioinformatics. 2019;36(5):1484–1491. doi: 10.1093/bioinformatics/btz778.
- Sugar CA, James GM. Finding the number of clusters in a dataset. Journal of the American Statistical Association. 2003;98(463):750–763.
- Tadesse MG, Sha N, Vannucci M. Bayesian variable selection in clustering high-dimensional data. Journal of the American Statistical Association. 2005;100(470):602–617.
- Tibshirani R, Walther G. Cluster validation by prediction strength. Journal of Computational and Graphical Statistics. 2005;14(3):511–528.
- Tong AHY, Lesage G, Bader GD, Ding H, Xu H, Xin X, Young J, Berriz GF, Brost RL, Chang M, et al. Global mapping of the yeast genetic interaction network. Science. 2004;303(5659):808–813. doi: 10.1126/science.1091317.
- van Buuren S, Groothuis-Oudshoorn K. mice: Multivariate imputation by chained equations in R. Journal of Statistical Software. 2011;45(3):1–67.
- Woodbury MA. Inverting modified matrices. Memorandum Report 42, Statistical Research Group, Princeton University; 1950. p. 4.
- Yeung KY, Haynor DR, Ruzzo WL. Validating clustering for gene expression data. Bioinformatics. 2001;17(4):309–318. doi: 10.1093/bioinformatics/17.4.309.
- Zurauskiene J, Kirk PDW, Stumpf MPH. A graph theoretical approach to data fusion. Statistical Applications in Genetics and Molecular Biology. 2016;15(2):107–122. doi: 10.1515/sagmb-2016-0016.