NIHPA Author Manuscripts
Author manuscript; available in PMC: 2025 Sep 11.
Published in final edited form as: Ann Appl Stat. 2025 Aug 28;19(3):2193–2217. doi: 10.1214/25-aoas2045

BAYESIAN LEARNING OF CLINICALLY MEANINGFUL SEPSIS PHENOTYPES IN NORTHERN TANZANIA

Alexander Dombowsky 1,a, David B Dunson 1,2, Deng B Madut 3,4,5, Matthew P Rubach 3,4,5, Amy H Herring 1,3,6
PMCID: PMC12422288  NIHMSID: NIHMS2082701  PMID: 40937183

Abstract

Sepsis is a life-threatening condition caused by a dysregulated host response to infection. Recently, researchers have hypothesized that sepsis consists of a heterogeneous spectrum of distinct subtypes, motivating several studies to identify clusters of sepsis patients that correspond to subtypes, with the long-term goal of using these clusters to design subtype-specific treatments. Clinicians therefore rely on clusters having a concrete medical interpretation, usually corresponding to clinically meaningful regions of the sample space with direct implications for practitioners. In this article, we propose Clustering Around Meaningful Regions (CLAMR), a Bayesian clustering approach that explicitly models the medical interpretation of each cluster center. CLAMR favors clusterings that can be summarized via meaningful feature values, leading to medically significant sepsis patient clusters. We also provide details on measuring the effect of each feature on the clustering using Bayesian hypothesis tests, so one can assess which features are relevant for cluster interpretation. Our focus is on clustering sepsis patients from Moshi, Tanzania, where patients are younger and the prevalence of HIV infection is higher than in previous sepsis subtyping cohorts.

Keywords: Clustering, Informative prior, Meta analysis, Prior elicitation, Unsupervised learning

1. Introduction.

For hospitals across the world, the diagnosis and rapid treatment of sepsis is of paramount importance for improving in-hospital outcomes. Broadly, sepsis is defined as organ dysfunction resulting from an immune response to an infection (Singer et al., 2016) and is the cause of one-third to one-half of in-hospital deaths in the USA (Liu et al., 2014). The catalyst for sepsis is usually a bacterial infection (Bennett, Dolin and Blaser, 2019), located in the lungs, brain, urinary tract, or digestive system (Tintinalli et al., 2016), but sepsis can also be a generalized systemic infection (e.g., bloodstream infection), and microbes causing sepsis also include fungal, viral, and protozoal pathogens. The clinical manifestations are dynamic, changing over the course of the illness. If not recognized and treated early, a patient is more likely to progress to a state of circulatory collapse, known as septic shock, which significantly increases the risk of mortality (Singer et al., 2016). Sepsis is life-threatening: even minor organ dysfunction in the presence of infection has a mortality rate of at least 10% (Singer et al., 2016), and septic shock is estimated to have as high as 25% mortality (Rowan et al., 2017). Once detected, prompt interventions, such as antibiotics, vasopressors (Rhodes et al., 2017), intravenous fluids (Rowan et al., 2017), or adjunctive corticosteroids (Annane et al., 2018) can improve outcomes.

However, despite its prevalence and risk, therapies specifically targeted towards treatment of sepsis, including early goal-directed therapy (EGDT), have not led to improved outcomes (Opal et al., 2014; Rowan et al., 2017; van der Poll et al., 2017). This phenomenon has been attributed to the hypothesis that the general condition of sepsis comprises heterogeneous and distinct subtypes, which may respond better to treatment plans oriented to their respective pathophysiologies. For example, randomized trials of adjunctive corticosteroids in adults with sepsis had mixed results, and in pediatric patients adjunctive corticosteroids were associated with increased mortality in one subtype of septic shock (Wong et al., 2017). Sepsis subtypes have been derived by several studies that clustered in-hospital cohorts and found distinct patient groups, or phenotypes, using comparative mRNA transcriptional analysis of the patients' immune responses (Davenport et al., 2016; Scicluna et al., 2017; Wong et al., 2017; Sweeney et al., 2018). Given that these descriptions of sepsis subtypes were primarily based upon immunologic characteristics, the Sepsis Endotyping in Emergency Care (SENECA) project hypothesized that subtypes could also be derived from routine clinical information that is readily extractable from patient electronic health record sources (Seymour et al., 2019). They described four distinct phenotypes characterized by age and renal dysfunction (β), inflammation and pulmonary dysfunction (γ), liver dysfunction and septic shock (δ), and the absence of these characteristics (α). Moreover, they found evidence that EGDT was beneficial for patients in the α subtype, whereas the δ subtype responded negatively to the treatment.

Our interest is in inferring phenotypes present within a patient population not adequately represented by previous studies. The Investigating Febrile Deaths in Tanzania (INDITe) study sought to describe the etiologies of fatal febrile illness in sub-Saharan Africa (sSA) with the over-arching goal of identifying priorities for the prevention and treatment of severe infectious diseases (Snavely et al., 2018). From September 2016 to May 2019, the study collected detailed transcriptome and clinical information from 599 febrile patients across two hospitals in Moshi, Tanzania, the administrative seat of the Kilimanjaro Region. Studies have estimated that the in-hospital mortality rate of severe sepsis in sSA is between 20 and 60% (Jacob et al., 2009; Andrews et al., 2014, 2017), and the region has the highest health-related burden of sepsis in the world (Rudd et al., 2020). In contrast to the primarily European and North American populations in previous phenotype classifications such as SENECA, the INDITe cohort had markedly different immunological profiles and sepsis etiologies. First, the average age of the INDITe cohort was 41.3, whereas for SENECA the average age was 64 (Seymour et al., 2019). Second, the prevalence of HIV infection among the INDITe participants was 38.2%, and of the persons living with HIV, 69.0% had a CD4 T-cell lymphocyte percentage indicative of advanced HIV; this prevalence of HIV and advanced HIV is much higher than in the sepsis subtyping populations from Europe and North America. In terms of etiologies, febrile patients in Kilimanjaro exhibit pathogens not common in Europe and North America, including cryptococcosis, chikungunya, and histoplasmosis (Crump et al., 2013; Rubach et al., 2015). Due to these substantial differences, it is possible that there were endotypes present in the INDITe cohort that have not been previously cataloged.

Clustering methods utilized by the aforementioned phenotype analyses include agglomerative hierarchical clustering (Jain and Dubes, 1988), consensus k-means (Wilkerson and Hayes, 2010), and a combined mapping of clustering algorithms (Sweeney, Chen and Gevaert, 2015). Such methods can be categorized as algorithmic in that they use an iterative mechanism to minimize a clustering loss function or meet a clustering criterion. Algorithmic clustering methods can work well in certain cases but need the number of clusters to be pre-specified and typically lack a characterization of uncertainty in clustering. These points are addressed by model-based methods, which compute clusters using tools from statistical inference. Model-based clustering methods rely on mixture models to simultaneously sort individuals into groups while inferring the probability distribution of data within each group. The Gaussian mixture model (GMM) results when one assumes that the features of individuals in each cluster follow a multivariate Gaussian distribution. A wide range of approaches are used to estimate the clustering of the data under a GMM using either frequentist or Bayesian inference; see McLachlan and Peel (2000), Frühwirth-Schnatter, Celeux and Robert (2019), and Wade (2023) for an overview.

This article will take a Bayesian approach to clustering. We observe data $y = (y_1, \dots, y_n)$, where for each individual $i$ we have features $y_i = (y_{i1}, \dots, y_{ip})$. Our goal is to group patients into $L$ phenotype clusters, denoted by $c = (c_1, \dots, c_n)$, where $c_i \in \{1, \dots, L\}$. To express uncertainty in $c$ before observing $y$, we assign a prior $\pi(c)$, which is a probability distribution over the space of all clusterings. Incorporating information in the data, the prior is updated to obtain the posterior $\pi(c \mid y)$. Since this posterior tends to be analytically intractable, Bayesian inference typically relies on Markov chain Monte Carlo (MCMC) algorithms that produce samples $c^{(t)}$ from $\pi(c \mid y)$ (McLachlan and Peel, 2000; Wade, 2023). From the samples $c^{(t)}$, one obtains standard quantities in statistical inference, including a point estimate $c^*$ of $c$ (Binder, 1978; Lau and Green, 2007) and uncertainty quantification using a credible ball (Wade and Ghahramani, 2018) or posterior similarity matrix (PSM), the latter of which quantifies $\Pr(c_i = c_j \mid y)$. Endotypes have been derived using Bayesian clustering for illnesses such as cancer (Savage et al., 2010; Yuan, Savage and Markowetz, 2011; Lock and Dunson, 2013; Wang et al., 2023), Alzheimer's disease (Poulakis et al., 2022), emphysema (Ross et al., 2016), and type 2 diabetes (Udler et al., 2018). The Bayesian paradigm can accommodate a broad variety of feature types, including time-to-event (Bair and Tibshirani, 2004; Raman et al., 2010; Ahmad and Fröhlich, 2017) and longitudinal data (Poulakis et al., 2020, 2022; Lu and Lou, 2022).

A key step in Bayesian clustering is modeling the distribution of observations within each cluster. Within cluster $l$, data are assumed to be distributed as $f(\cdot\,; \theta^{(l)})$, where $f$ is some parametric probability kernel and $\theta^{(l)}$ is the cluster-specific parameter. In a GMM, $f$ is the Gaussian kernel and $\theta^{(l)} = (\mu^{(l)}, \Sigma^{(l)})$ comprises the mean and covariance of the $l$th cluster. An often under-discussed aspect of Bayesian clustering is the prior specification for $\theta^{(l)}$. The general consensus is to assume that $\pi(\theta^{(l)})$ is conjugate to $f$ because $\pi(c \mid y)$ can then be sampled using a Gibbs sampler. For GMMs, the conjugate family is the Gaussian-inverse-Wishart prior. It is well known that improper priors for the cluster-specific parameters lead to improper posteriors (Wasserman, 2000), so the cluster-specific priors can be weakly informative on likely feature ranges for each cluster (Richardson and Green, 1997) or data-based (Raftery, 1995; Bensmail et al., 1997). Cluster-specific priors are inappropriate for our application because we expect to observe new phenotypes explained by the immunoprofiles of the INDITe cohort, and label-switching is still likely to occur even if substantial prior information on the phenotypes were available. Additionally, data-based priors forfeit the probabilistic updating of information that occurs in subjectivist Bayesian inference and can potentially lead to underestimation of uncertainty.

To analyze the INDITe data, we propose a simple but novel prior for θ(l) that is motivated by phenotype inference. Essential to understanding our prior is the notion of meaningful regions (MRs), areas of the feature domain that have a clear clinical interpretation. For example, MRs of the jth feature could correspond to intervals in which clinicians can classify an individual as having diminished, neutral, or elevated expression of that feature. MRs are assumed to be known in advance, usually identified by investigators with expert knowledge of the application. We focus on the cluster center, or the mean, in a GMM. We specify the prior for the centers to be itself a mixture model comprised of non-overlapping and informative components which cover the MRs. This prior highly favors cluster centers located inside the meaningful regions, ultimately leading to clinically relevant clusters. For that reason, we call our approach CLustering Around Meaningful Regions (CLAMR). In Section 2, we introduce CLAMR, discuss default prior settings, show how our prior can test the influence of features on the clustering, and provide computational details. In Section 3, CLAMR is evaluated by repeatedly sampling synthetic data. We display all results for applying CLAMR to the INDITe data in Section 4 and provide interpretations of the phenotypes we uncover. Finally, we present concluding remarks, extensions, and generalizations in Section 5.

2. Clustering Around Meaningful Regions.

2.1. Applied Context and Motivation.

In this article, we aim to catalog sepsis phenotypes in the INDITe data which reflect the subtypes of sepsis present in northern Tanzania. The INDITe study enrolled 599 febrile patients from two Moshi hospitals and measured both detailed RNA sequencing and routine clinical information on each participant. The clinical data are derived from two different sources. The first is a case report form (CRF) that details relevant signs and demographic information on the individual, and the second is comprehensive laboratory testing. From a statistical perspective, the INDITe data presents several challenges for cluster analysis. An integral factor is the lack of any clear separation in the data into isolated clusters, as can be seen in the principal component projections in Figure 1. While there are some noticeable patterns, including a clear outlier and skewness in the density of the data, we expect there to be substantial uncertainty in cluster membership. Moreover, the sample size of the INDITe study is markedly smaller than previous phenotype analyses, particularly SENECA. For these reasons, we use Bayesian methods that can coherently incorporate uncertainty quantification to cluster the INDITe data. Furthermore, despite the lack of separation, we opt to distinguish clusters by accounting for differences in clinical interpretation. The intuition here is that clusters should reflect a relevant medical designation that can inform treatment. Accounting for interpretations involves incorporating prior information on the features into our analysis. Finally, an additional challenge is the selection of clinical variables that explain fundamental differences between the clusters, which can be useful for phenotype assignment of a patient. In the following sections, we discuss an approach that can account for each of these factors while leveraging the mathematical benefits of a Bayesian clustering model.

Fig 1.


Projections onto the first two principal components for a subset of the numeric features from the original INDITe dataset.

2.2. Bayesian Probabilistic Clustering.

Let $y_{ij} \in \mathbb{R}$ denote the value of the $j$th feature for the $i$th individual, $j = 1, \dots, p$ and $i = 1, \dots, n$, which we collect into the matrix $y = (y_{ij})_{i,j}$. We are primarily interested in inferring the phenotype clustering $c = (c_1, \dots, c_n)$, where $c_i \in \{1, \dots, L\}$ and $L > 0$, with $C_l = \{i : c_i = l\}$ denoting the $l$th cluster. Generally, the likelihood $\pi(y \mid c)$ depends on some kernel family $\mathcal{F} = \{f(\cdot\,; \theta) : \theta \in \Theta\}$ and cluster-specific parameters $\theta = (\theta^{(1)}, \dots, \theta^{(L)})$, so that $\theta^{(l)} \in \Theta$ for all $l$ and $y_i \mid c_i = l \sim f(\cdot\,; \theta^{(l)})$. In many applications, the cluster-specific parameters are dimension dependent, so we write $\theta^{(l)} = (\theta^{(l)}_1, \dots, \theta^{(l)}_p)$. After specifying a prior distribution $\pi(\theta)$ for the cluster-specific parameters, the marginal posterior distribution of $c$ is

$$\pi(c \mid y) \propto \pi(c) \prod_{l=1}^{L} \int_{\theta^{(l)}} \Big\{ \prod_{i : c_i = l} f(y_i; \theta^{(l)}) \Big\}\, \pi(\theta^{(l)})\, d\theta^{(l)}. \tag{1}$$

There are many possible choices for the clustering prior, including the Dirichlet-multinomial, the Dirichlet process (Ferguson, 1973), and the Pitman-Yor process (Pitman and Yor, 1997). Alternatively, when prior information on the clustering $c$ is available, one can use informative clustering priors such as constrained clustering (Lu and Leen, 2004) and centered partition processes (Paganin et al., 2021). Also relevant to the posterior of $c$ in (1) are the priors for the cluster-specific parameters, $\pi(\theta^{(l)})$. One approach is to calibrate $\pi(\theta^{(l)})$ to be weakly informative for each cluster using prior information. For example, $\pi(\theta^{(l)})$ can be specified to have support on a likely range of spread for cluster $l$ if such information is available (Richardson and Green, 1997). However, a more common approach is to assume that the $\theta^{(l)}$ are identically distributed for all $l$ from some base measure $g(\cdot\,; \nu)$ with hyperparameters $\nu$, i.e., $\pi(\theta^{(l)}) = g(\theta^{(l)}; \nu)$ for all $l$ (Wade, 2023). The choice of $\nu$ in such models is a frequent topic of discussion in the literature. If $\nu$ is chosen so that $g(\theta^{(l)}; \nu)$ is improper, one can show that the posterior distribution is also improper (McLachlan and Peel, 2000; Wasserman, 2000). For that reason, noninformative or weakly-informative base measures have been implemented with modifications of Jeffreys' prior (Diebolt and Robert, 1994; Rubio and Steel, 2014), non-informative hierarchical priors (Grazian and Robert, 2018), and global reparameterizations (Robert and Mengersen, 1995; Kamary, Lee and Robert, 2018). Another strategy is to calibrate $\nu$ via the data itself to create a data-based prior; see Raftery (1995) and Wasserman (2000) for applications of such methodology to the priors in a GMM.

We will focus on the case when the data are modeled as a GMM with diagonal covariance. The cluster-specific parameters are $\theta^{(l)} = (\mu^{(l)}, \Sigma^{(l)})$, where $\mu^{(l)} = (\mu^{(l)}_1, \dots, \mu^{(l)}_p)$ and $\Sigma^{(l)} = \mathrm{diag}(\sigma^{(l)2}_1, \dots, \sigma^{(l)2}_p)$, i.e., $\theta^{(l)}_j = (\mu^{(l)}_j, \sigma^{(l)2}_j)$. The data are generated by

$$(y_{ij} \mid c_i = l, \mu^{(l)}_j, \sigma^{(l)2}_j) \sim \mathcal{N}(\mu^{(l)}_j, \sigma^{(l)2}_j), \quad \text{for } j = 1, \dots, p; \tag{2}$$

with $\mu^{(l)}_j$ being the mean, or center, and $\sigma^{(l)2}_j$ being the variance of feature $j$ in cluster $l$, for $l = 1, \dots, L$. The natural choice for $\pi(\mu^{(l)}_j, \sigma^{(l)2}_j)$ is the Gaussian-inverse-gamma base measure because it is conjugate to the normal likelihood, leading to a simple Gibbs sampler for $\pi(c \mid y)$. This choice of base measure has several pitfalls in practice, including the aforementioned ambiguity in the choice of hyperparameters and degenerate clustering behavior in high dimensional regimes (Chandra, Canale and Dunson, 2023). In this article, we present an alternative prior for (2) that is computationally tractable but favors interpretability in clustering. Similar to the approach of Richardson and Green (1997), we use expert prior information to construct the prior for $(\mu^{(l)}_j, \sigma^{(l)2}_j)$, but rather than setting a different weakly-informative prior for each cluster, we collate the information into a simple base measure.

2.3. Assumptions and Prior on the Cluster Centers.

Suppose that for variable $j$, $y_{ij} \in \chi_j \subseteq \mathbb{R}$ for all individuals $i = 1, \dots, n$. The meaningful regions of $\chi_j$ are a set of subsets $\mathcal{M}_j = \{\mathcal{M}_j^{(1)}, \dots, \mathcal{M}_j^{(K_j)}\}$, so that $\bigcup_{k=1}^{K_j} \mathcal{M}_j^{(k)} = \chi_j$ and the event that $y_{ij} \in \mathcal{M}_j^{(k)}$ has a concrete meaning for the investigator. Typically, $\mathcal{M}_j$ is a set of disjoint intervals that comprise $\chi_j$. For instance, in the case where $\mathcal{M}_j$ reflects diminished, neutral, and elevated levels of variable $j$, we have that $K_j = 3$, $\mathcal{M}_j^{(k)} = [a_j^{(k)}, b_j^{(k)}]$, and $a_j^{(1)} < b_j^{(1)} \leq a_j^{(2)} < b_j^{(2)} \leq a_j^{(3)} < b_j^{(3)}$. In medical applications, knowing that $y_{ij} \in \mathcal{M}_j^{(k_j)}$ for $j = 1, \dots, p$ gives the clinician a clear interpretation of individual $i$'s overall health, which can then be used to inform decisions such as treatment plans. We discuss alternative examples of MRs that arise in medicine in the Supplement (Dombowsky et al., 2025).
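As a concrete illustration (a toy sketch, not the authors' software), the diminished/neutral/elevated MRs can be encoded as disjoint intervals and used to classify a feature value; the interval endpoints below reuse the Figure 3 example, where the feature domain is [−3, 1.5].

```python
# Hypothetical meaningful regions (MRs) for one feature, matching the
# diminished/neutral/elevated example; endpoints reuse the Figure 3
# illustration, where chi_j = [-3, 1.5].
MRS = [(-3.0, -1.0), (-1.0, 0.0), (0.0, 1.5)]  # [a_j^(k), b_j^(k)], k = 1, 2, 3
LABELS = ["diminished", "neutral", "elevated"]

def mr_index(y, mrs=MRS):
    """Return the index k (1-based) of the MR containing the value y."""
    for k, (a, b) in enumerate(mrs, start=1):
        # Treat a shared endpoint b^(k) = a^(k+1) as belonging to the
        # lower region, so every value gets exactly one label.
        inside = a <= y <= b if k == 1 else a < y <= b
        if inside:
            return k
    raise ValueError(f"{y} lies outside the feature domain")

print(mr_index(-2.0), LABELS[mr_index(-2.0) - 1])  # 1 diminished
print(mr_index(0.7), LABELS[mr_index(0.7) - 1])    # 3 elevated
```

Because the MRs partition the feature domain, the classification is unambiguous once a convention for the shared endpoints is fixed.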

We assume that the data are distributed as a GMM with diagonal covariance, as in (2), with $0 < L < \infty$ and $\psi \sim \mathrm{Dir}(\gamma/L, \dots, \gamma/L)$, where $\psi = (\psi_1, \dots, \psi_L)$ are the mixture weights, i.e., $\psi_l = \pi(c_i = l \mid \psi)$. $L$ is set to be a large fixed number, such as $L = 10$ or $L = 50$, so as to capitalize on the theoretical guarantees of such overfitted mixture models for determining the number of clusters (Rousseau and Mengersen, 2011; Van Havre et al., 2015). We set independent prior densities for $\mu^{(l)}_j$ and $\sigma^{(l)2}_j$, i.e., $\pi(\mu^{(l)}_j, \sigma^{(l)2}_j) = \pi(\mu^{(l)}_j)\pi(\sigma^{(l)2}_j)$. For the variance, we opt for the conjugate prior $1/\sigma^{(l)2}_j \sim \mathrm{G}(\lambda_j, \beta_j)$, where $\mathrm{G}$ is the gamma distribution, with $\lambda_j$ and $\beta_j$ chosen so that $\pi(\sigma^{(l)2}_j)$ is weakly informative, though our approach is compatible with any variance prior, such as the hierarchical model given in Richardson and Green (1997).

We now turn towards specifying a novel prior for $\mu^{(l)}_j$ based on careful consideration of the MRs. For variable $j$ in cluster $l$, we assume that, a priori,

$$\mu^{(l)}_j \sim \sum_{k=1}^{K_j} \phi_j^{(k)} h_j^{(k)}, \tag{3}$$

where $0 < \phi_j^{(k)} < 1$, $\sum_{k=1}^{K_j} \phi_j^{(k)} = 1$, and the $h_j^{(k)}$ are continuous probability distributions with support on $\mathcal{M}_j^{(k)}$. The shapes of the component distributions in (3) are determined using expert prior knowledge so that $h_j^{(k)}$ is itself an informative distribution on the $k$th MR. The resulting density function $\pi(\mu^{(l)}_j)$ is multimodal, with modes contained inside each meaningful region. Our prior choice is a base measure for the centers, i.e., $\pi(\mu^{(l)}_j) = g(\mu^{(l)}_j; \nu)$ with hyperparameters $\nu = (\phi_j^{(1)}, \dots, \phi_j^{(K_j)}, h_j^{(1)}, \dots, h_j^{(K_j)})$. We refer to (3) as the CLAMR prior for the cluster centers. Importantly, CLAMR is a prior for the cluster-specific parameters and not an assumption about the distribution of the data. This distinguishes the approach from mixtures of mixtures that model the within-cluster distributions as GMMs (Malsiner-Walli, Frühwirth-Schnatter and Grün, 2017).

The interpretation of (3) with respect to the data generating process can be better understood after the introduction of labels $s^{(l)}_j \in \{1, \dots, K_j\}$, with the property that the prior for $\mu^{(l)}_j$ is equivalent to sampling

$$(\mu^{(l)}_j \mid s^{(l)}_j = k) \sim h_j^{(k)}; \qquad \pi(s^{(l)}_j = k) = \phi_j^{(k)}; \tag{4}$$

such as in the typical mixture likelihood in (2). The height of each mode in (3) is controlled by $\phi_j = (\phi_j^{(1)}, \dots, \phi_j^{(K_j)})$. If $\phi_j^{(k)} \approx 1$ for some $k$, then all cluster centers will concentrate inside the $k$th MR. To model uncertainty in $\phi_j$, we assume that $\phi_j \sim \mathrm{Dir}(\rho_j/K_j, \dots, \rho_j/K_j)$, where $\rho_j > 0$. The smaller the value of $\rho_j$, the more likely it is that the cluster centers in feature $j$ are all representative of the same MR a priori.
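The two-stage draw in (4), combined with the Dirichlet prior on $\phi_j$, can be sketched for a single feature as follows; the dimensions and kernel hyperparameters are illustrative placeholders, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative settings (placeholders): L overfitted clusters, one
# feature j with K_j = 3 MRs and Gaussian kernels h_j^(k).
L, K, rho = 5, 3, 1.0
xi = np.array([-2.0, -0.5, 0.75])   # kernel means xi_j^(k), one per MR
tau = np.array([0.5, 0.25, 0.35])   # kernel sds tau_j^(k)

# phi_j ~ Dir(rho_j/K_j, ..., rho_j/K_j): prior weights over the MRs.
phi = rng.dirichlet(np.full(K, rho / K))

# s_j^(l) ~ Categorical(phi_j): each cluster's MR label, as in (4)
# (0-based here, so label k corresponds to MR k + 1).
s = rng.choice(K, size=L, p=phi)

# (mu_j^(l) | s_j^(l) = k) ~ N(xi_j^(k), tau_j^(k)^2): centers within MRs.
mu = rng.normal(xi[s], tau[s])

print("MR labels per cluster:", s)
print("cluster centers:", np.round(mu, 2))
```

With a small $\rho_j$, the sampled $\phi_j$ is typically near a vertex of the simplex, so most clusters share the same MR label, matching the interpretation in the text.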

The generating process of the centers can be interpreted in terms of a profile matrix $s = (s^{(l)}_j)_{l,j}$. Sampling (4) associates the $l$th cluster center with a $p$-dimensional vector of MRs, $s^{(l)} = (s^{(l)}_1, \dots, s^{(l)}_p)$, which we refer to as the clinical profile of cluster $l$. The profiles provide a concise medical description of the cluster centers. If, say, $p = 3$ and the MRs correspond to expression levels, then $s^{(l)} = (2, 1, 3)$ implies that cluster $l$ is characterized by neutral values of feature one, diminished measurements of feature two, and elevated levels of feature three. Since we expect label-switching to occur, $s$ is identifiable up to permutations of the rows. The columns of $s$, denoted $s_j = (s^{(1)}_j, \dots, s^{(L)}_j)$, are groupings of the $L$ clusters into the $K_j$ MRs, and give an idea of the effect of feature $j$ on heterogeneity between the clusters. We will expand on this specific property in the following section.

In practice, the component kernels $h_j^{(k)}$ can be any continuous probability distribution, but a natural choice is the location-scale Gaussian mixture across the MRs,

$$\mu^{(l)}_j \sim \sum_{k=1}^{K_j} \phi_j^{(k)} \mathcal{N}(\xi_j^{(k)}, \tau_j^{(k)2}), \tag{5}$$

where $\xi_j^{(k)} \in \mathcal{M}_j^{(k)}$ and $\tau_j^{(k)2} > 0$. The data generating process of $y$ using Gaussian components in the CLAMR prior is visualized with a directed acyclic graph (DAG) in Figure 2. The three main pieces of the generation of $y_{ij}$ are represented by three boxes, with each box being independent of the others a priori. The box on the bottom left of the DAG details the prior on the cluster center.

Fig 2.


The data generating process according to the CLAMR prior after marginalizing out the prior distributions for $\psi$ and $\phi_j$. The observation $y_i$ is allocated into cluster $l$ with probability determined by $\gamma$, with mean and variance equal to $\mu^{(l)}_j$ and $\sigma^{(l)2}_j$ for variable $j$. The cluster center is defined by the cluster's profile $s^{(l)}$, with $\xi_j^{(k)}$ and $\tau_j^{(k)2}$ chosen so that the kernels have support on the meaningful regions. The cluster-specific variance is drawn from an inverse-gamma conjugate prior.

The component means and variances of (5) must be chosen carefully so that the bulk of the probability mass of each Gaussian kernel is non-overlapping. Clearly, $\xi_j^{(k)}$ must be a real number contained inside $\mathcal{M}_j^{(k)}$. In the case where $\mathcal{M}_j^{(k)} = [a_j^{(k)}, b_j^{(k)}]$, $\xi_j^{(k)}$ can be any point in that interval. A natural default version of (5) is the following specification based on controlling the symmetric tails of the Gaussian kernel,

$$\xi_j^{(k)} = \frac{a_j^{(k)} + b_j^{(k)}}{2}, \qquad \tau_j^{(k)2} = \left\{ \frac{b_j^{(k)} - a_j^{(k)}}{2\,\Phi^{-1}\!\big(\frac{1+\omega}{2}\big)} \right\}^2, \tag{6}$$

where $0 < \omega < 1$ is a constant. Note that $\xi_j^{(k)}$ is the midpoint of $\mathcal{M}_j^{(k)}$, similar to the prior specification in Richardson and Green (1997). The free parameter $\omega$ controls the amount of probability mass that overlaps between MR $k$ and MRs $k-1$ and $k+1$; in particular, $\pi(\mu^{(l)}_j \in \mathcal{M}_j^{(k)} \mid s^{(l)}_j = k) = \omega$. Increasing $\omega$ toward 1 increases the concentration of each kernel about its midpoint, causing $\pi(\mu^{(l)}_j) \to \sum_{k=1}^{K_j} \phi_j^{(k)} \delta_{\xi_j^{(k)}}$. Hence, a sensible choice is to set $\omega = 0.95$ so that 95% of the probability density of each kernel is contained within its respective region and centered at the midpoint. If investigators do not wish to place $\xi_j^{(k)}$ at the midpoint of $\mathcal{M}_j^{(k)}$, they can choose $\tau_j^{(k)2}$ using a criterion similar to (6) to ensure that $\mathcal{M}_j^{(k)}$ contains 95% of the probability density of component $k$. The framework can also be extended to multimodality within MRs by setting $h_j^{(k)}$ to be a mixture of Gaussian kernels, combining CLAMR with mixtures of mixtures (Malsiner-Walli, Frühwirth-Schnatter and Grün, 2017).
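A minimal sketch of the default specification (6): given an interval MR $[a, b]$ and tail-mass parameter $\omega$, compute the component midpoint and variance. The function name is ours, and the interval endpoints reuse the Figure 3 example.

```python
from statistics import NormalDist

def clamr_kernel_params(a, b, omega=0.95):
    """Default CLAMR component hyperparameters from (6): xi is the
    midpoint of the MR [a, b], and tau^2 is chosen so that a fraction
    omega of the Gaussian's mass falls inside [a, b]."""
    xi = (a + b) / 2.0
    z = NormalDist().inv_cdf((1.0 + omega) / 2.0)  # Phi^{-1}((1 + omega)/2)
    tau2 = ((b - a) / (2.0 * z)) ** 2
    return xi, tau2

# The three MRs from the Figure 3 example, where chi_j = [-3, 1.5].
for a, b in [(-3.0, -1.0), (-1.0, 0.0), (0.0, 1.5)]:
    xi, tau2 = clamr_kernel_params(a, b)
    print(f"MR [{a}, {b}]: xi = {xi:.3f}, tau^2 = {tau2:.4f}")
```

With $\omega = 0.95$, the half-width of each interval equals $1.96\tau$, so 95% of each kernel's mass sits inside its MR; larger $\omega$ shrinks $\tau^2$ toward a point mass at the midpoint, as described above.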

An example comparison between the CLAMR prior, the weakly-informative case (as mentioned in Richardson and Green (1997)), and the noninformative case is given in Figure 3 when $\chi_j = [-3, 1.5]$, $\mathcal{M}_j = \{[-3, -1], [-1, 0], [0, 1.5]\}$, and the MRs are expression levels. The CLAMR prior favors clusters with centers located at the midpoint of each MR, and the prior probability of the MRs is controlled by $\phi_j$. For this visualization, $\phi_j = (1/3, 1/3, 1/3)$, but the height of each mode can be lowered or raised by altering these values. In the case where $\phi_j^{(k)} \approx 1$ for some $k$, the CLAMR prior in Figure 3 reduces to an informative prior on the $k$th MR. The weakly-informative prior, while leading to a proper posterior, favors neutral levels of feature $j$, and the prior probability of each MR is fixed.

Fig 3.


Possible prior choices for the cluster center when the MRs are given by $\mathcal{M}_j = \{[-3, -1], [-1, 0], [0, 1.5]\}$ and $\chi_j = [-3, 1.5]$. The non-informative prior is flat on $\chi_j$, whereas the weakly-informative choice is concentrated at the midpoint of $\chi_j$. The CLAMR prior has hyperparameters given by (6), where each component of the mixture is a non-overlapping Gaussian with mean at $(a_j^{(k)} + b_j^{(k)})/2$. In this specific example, the prior probability under CLAMR of observing a cluster center in each MR is 1/3, though that need not be the case in general.

2.4. Testing Feature Influence and Pre-Training.

The clinical profiles $s$ will be useful for assessing the influence of feature $j$ on $c$. In phenotype analyses, investigators are interested in which features help explain the phenotypes and which variables do not. The intuition here is that some features tend to contribute more than others to meaningful differences between the groups. Suppose then, for example, that for some $k$, $s^{(l)}_j = k$ for all clusters $l = 1, \dots, L$. This would mean that the profiles of each cluster are all identical for the $j$th variable. If $k = 2$ and the MRs are expression levels, this would amount to concluding that individuals in all clusters tend to have neutral values of variable $j$. Therefore, variable $j$ on its own does not explain the heterogeneity between groups.

We can formulate this intuition in terms of a Bayesian hypothesis test. Recall that for each cluster $l$ and variable $j$, there is a label $s^{(l)}_j \in \{1, \dots, K_j\}$. In practice, there may be some clusters that are empty, so for the remainder of this section, we only focus on the clusters $C_l$ that are non-empty, i.e., $|C_l| \geq 1$. Without loss of generality and for ease of notation, we will assume that clusters $l = 1, \dots, L_n \leq L$ are non-empty. The labels induce $p$ partitions of $[L_n] = \{1, \dots, L_n\}$, denoted $S_j = \{S_j^{(1)}, \dots, S_j^{(K_j)}\}$, where $S_j^{(k)} = \{l : s^{(l)}_j = k, |C_l| \geq 1\}$ for $j = 1, \dots, p$. That is, each $S_j^{(k)}$ is the set of cluster indices that are associated with the $k$th MR. For example, suppose there are $L_n = 3$ clusters and $K_j = 3$ MRs. Then by sampling (4) there are three possibilities for the allocation of clusters to regions: all clusters are in the same MR ($S_j = \{[3]\}$), two clusters are in the same MR while one cluster is in a different MR ($S_j = \{\{1, 2\}, \{3\}\}$, $S_j = \{\{1\}, \{2, 3\}\}$, or $S_j = \{\{1, 3\}, \{2\}\}$), or all clusters are in different MRs ($S_j = \{\{1\}, \{2\}, \{3\}\}$).
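The mapping from a profile column $s_j$ to the induced partition $S_j$ can be sketched as follows (a toy helper, not the authors' implementation), reproducing the three-cluster example above.

```python
from collections import defaultdict

def induced_partition(s_col):
    """Group the non-empty cluster indices by their MR label, giving
    the partition S_j = {S_j^(1), ..., S_j^(K_j)} with empty blocks
    dropped. `s_col` maps each non-empty cluster index l to s_j^(l)."""
    groups = defaultdict(set)
    for l, k in s_col.items():
        groups[k].add(l)
    # Sort blocks by their smallest element for a canonical ordering.
    return sorted(map(frozenset, groups.values()), key=min)

# Ln = 3 non-empty clusters: clusters 1 and 2 share an MR, cluster 3 differs.
print(induced_partition({1: 2, 2: 2, 3: 3}))
# All clusters with the same label corresponds to S_j = {[Ln]}.
print(induced_partition({1: 1, 2: 1, 3: 1}))
```

The first call returns the two-block partition {{1, 2}, {3}}, and the second returns the trivial partition {[3]}, matching the enumeration in the text.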

From our argument above, if $S_j = \{[L_n]\}$, then the influence of variable $j$ on $c$ is minimal because each cluster is located in the same MR. This motivates setting the null hypothesis to be $H_{0j} : S_j = \{[L_n]\}$. However, in this article, we set the interval null for feature influence to be $H_{0j} : d(S_j, \{[L_n]\}) < \epsilon_j$, where $\epsilon_j > 0$ is a constant and $d$ is a distance metric over the partition space. We implement our approach with $d$ set to be Binder's loss (Binder, 1978), but other partition metrics exist that could be used instead, e.g., the Variation of Information (VI) (Meilă, 2007).

In accordance with formal Bayesian hypothesis testing, we focus on the prior and posterior probability of $H_{0j} : d(S_j, \{[L_n]\}) < \epsilon_j$ versus $H_{1j} : d(S_j, \{[L_n]\}) \geq \epsilon_j$. $H_{0j}$ is tested via computation of the Bayes factor ($\mathrm{BF}_j$), or the ratio of the marginal likelihoods of $y$ given $H_{1j}$ and $H_{0j}$ (Jeffreys, 1935, 1961; Kass and Raftery, 1995). When $\pi(H_{0j}) = \pi(H_{1j}) = 1/2$, the Bayes factor can be further simplified to $\mathrm{BF}_j = \pi(H_{1j} \mid y)/\pi(H_{0j} \mid y) = \{1 - \pi(d(S_j, \{[L_n]\}) < \epsilon_j \mid y)\}/\pi(d(S_j, \{[L_n]\}) < \epsilon_j \mid y)$, which is the ratio of the posterior probabilities of $H_{1j}$ and $H_{0j}$. We can conclude in favor of either $H_{0j}$ or $H_{1j}$ using the magnitude of $\mathrm{BF}_j$, e.g., the guidelines given in Kass and Raftery (1995). The smaller $\mathrm{BF}_j$ is, the smaller the influence of feature $j$ on $c$. Note that $\pi(H_{0j}) = 1/2$ only occurs for specific values of $\rho_j$ and $\gamma$, so we fix $\gamma = 1$, which favors a small number of clusters, then choose $\rho_j$ using Monte Carlo simulation.

In practice, testing feature influence can be implemented as a pre-training step in order to reduce dimension in y and identify relevant features for clustering. This involves first fitting the model, computing the Bayes factors, and removing the non-influential features. We then run the CLAMR model on the reduced set of variables and infer the clustering c. In Section 4 we implement our hypothesis testing scheme as a pre-training procedure to remove noisy features in the INDITe data.

2.5. Computational Details.

We derive a simple Gibbs sampler for CLAMR, which is detailed in the Supplement (Dombowsky et al., 2025). Once the $T$ posterior samples $\{c^{(t)}\}$ and $\{s^{(t)}\}$ are obtained, we summarize our findings using MCMC estimation. A key advantage of the Bayesian framework is the ability to characterize uncertainty in point estimation using the posterior distribution. In Bayesian cluster analysis, the point estimate $c^* = (c_1^*, \dots, c_n^*)$ of $c$ is a fixed clustering of $y$ based on $\pi(c \mid y)$, and is derived by minimizing the posterior expectation of an appropriate clustering loss function. We minimize the Variation of Information (VI) loss (Meilă, 2007), which compares the entropy of $c^*$ and $c$ as well as their mutual information (Wade and Ghahramani, 2018). However, CLAMR can be implemented with other losses if desired, including Binder's pair-counting loss (Binder, 1978) and the robustifying loss FOLD (Dombowsky and Dunson, 2024). For uncertainty quantification, we compute a credible ball or PSM using the MCMC samples. To test the influence of the $j$th feature, $\mathrm{BF}_j$ can be consistently estimated as $T \to \infty$ via $\pi(d(S_j, \{[L_n]\}) < \epsilon_j \mid y) \approx (1/T) \sum_{t=1}^{T} 1\{d(S_j^{(t)}, \{[L_n^{(t)}]\}) < \epsilon_j\}$, which is computed by post-processing the samples $\{c^{(t)}\}$ and $\{s^{(t)}\}$.
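The Monte Carlo estimator of $\mathrm{BF}_j$ can be sketched as follows, using an unnormalized Binder's loss between label vectors; the draws below are toy placeholders rather than output from the CLAMR Gibbs sampler.

```python
import numpy as np
from itertools import combinations

def binder_distance(z1, z2):
    """Unnormalized Binder's loss between two partitions given as label
    vectors: the number of pairs on which they disagree about
    co-clustering."""
    return sum(
        (z1[a] == z1[b]) != (z2[a] == z2[b])
        for a, b in combinations(range(len(z1)), 2)
    )

def bayes_factor(s_col_draws, eps):
    """Monte Carlo estimate of BF_j: compare each MCMC draw of the
    profile column s_j (labels of the non-empty clusters) to the
    trivial partition {[Ln]}, assuming pi(H_0j) = pi(H_1j) = 1/2."""
    p0 = np.mean([
        binder_distance(s, [0] * len(s)) < eps  # [0]*Ln encodes {[Ln]}
        for s in s_col_draws
    ])
    return (1.0 - p0) / p0 if p0 > 0 else float("inf")

# Toy draws (hypothetical, not from a fitted model): the clusters mostly
# occupy different MRs, so the estimate favors H_1j.
draws = [[1, 1, 2], [1, 2, 2], [1, 1, 1], [1, 2, 3]]
print(bayes_factor(draws, eps=1))  # 3.0
```

Here only one of the four draws is within Binder distance 1 of the trivial partition, so the estimated posterior probability of $H_{0j}$ is 1/4 and the Bayes factor is 3.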

2.6. Cluster Interpretations.

The non-identifiability of cluster labels in Bayesian mixture models makes estimating the profiles $s$ challenging without applying a relabeling algorithm. However, once a point estimate $c^*$ is obtained, we can utilize these fixed labels to infer MR specifications. Denote $C^* = \{C_1^*, \dots, C_M^*\}$, where $C_m^* = \{i : c_i^* = m\}$ is the set of participants allocated to cluster $m$ by the point estimate $c^*$. For participant $i$, define the quantity $\hat{s}_j(i) = s^{(l)}_j$ if and only if $c_i = l$. Recall that $c_i$ is the mixture label of participant $i$ (which is random), whereas $c_i^*$ is the cluster label in the point estimate (which is fixed). We compute

$$\Delta_j(m, k) = \frac{1}{|C_m^*|} \sum_{i \in C_m^*} 1\{\hat{s}_j(i) = k\} \quad \text{for all } m \in [M],\; k \in [K_j],\; j \in [p]. \tag{7}$$

This quantity can be interpreted as the empirical probability of association with the kth MR in variable j for the individuals allocated to cluster m of c*.

We compute (7) for every iteration in the Gibbs sampler, then summarize our findings using

$$\Delta_j^{m,*} = \max_{k = 1, \dots, K_j} \bar{\Delta}_j(m, k) \quad \text{and} \quad s_j^{m,*} = \operatorname*{arg\,max}_{k = 1, \dots, K_j} \bar{\Delta}_j(m, k) \tag{8}$$

for all clusters $m$ in $c^*$ and features $j$; here, $\bar{\Delta}_j(m, k)$ is the Monte Carlo average of $\Delta_j(m, k)$ over the Gibbs iterations. We interpret $s_j^{m,*}$ as the MR most associated with the participants in the $m$th group of $c^*$, with $\Delta_j^{m,*}$ quantifying the strength of this association. Alternatively, uncertainty in MR assignment can be quantified by inspecting the posterior distributions of $\Delta_j(m, k)$.
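A minimal sketch of how (7) and (8) might be computed from Gibbs output for a single feature $j$ and cluster $m$; the data structure (a per-iteration map from participant to sampled MR label) is an illustrative assumption, not the paper's implementation.

```python
from collections import Counter

def delta_bar(shat_samples, Cm_indices, K):
    """Monte Carlo average of Delta_j(m, k) in (7), then the max/argmax in (8).

    shat_samples : list over Gibbs iterations; each entry maps participant i
                   to its sampled MR label s_hat_j(i) in {1, ..., K}
    Cm_indices   : participants allocated to cluster m by the point estimate c*
    K            : number of MRs K_j for this feature
    """
    T = len(shat_samples)
    avg = [0.0] * (K + 1)                       # 1-indexed over MRs
    for shat in shat_samples:
        counts = Counter(shat[i] for i in Cm_indices)
        for k in range(1, K + 1):               # empirical probability of MR k
            avg[k] += counts.get(k, 0) / len(Cm_indices) / T
    s_star = max(range(1, K + 1), key=lambda k: avg[k])   # dominant MR, as in (8)
    return avg[s_star], s_star
```

The returned pair corresponds to $(\Delta_j^{m,*}, s_j^{m,*})$ for the chosen feature and cluster.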

3. Simulation Studies.

3.1. Clustering Under Misspecification.

We now evaluate CLAMR on synthetic data, focusing on the case in which the statistician has obtained expert knowledge of the MRs but the model is misspecified. Our simulation scenario explores kernel misspecification, by which we mean the regime in which the true data-generating process is a mixture model but with non-Gaussian components. Kernel misspecification is a key issue in model-based clustering because statistical methods tend to overestimate the number of clusters when the assumed kernels do not match the true ones. A number of methods have recently been proposed to robustify Bayesian clustering results to model misspecification, such as coarsening (Miller and Dunson, 2019), decision-theoretic schemes (Dombowsky and Dunson, 2024; Buch, Dewaskar and Dunson, 2024), and hierarchical component merging algorithms (Aragam et al., 2020; Do et al., 2024). However, the effect of the prior distribution on robustifying inference has received little discussion in this nascent literature.

We find that CLAMR can reduce the over-clustering tendency of misspecified models, leading to more accurate estimation and fewer clusters. The synthetic data comprise $p = 6$ features with $K_j = 3$ true MRs each, which are assumed to be known. The number of true clusters is set to $\tilde{L} = 3$, and we simulate the true within-cluster center parameters, denoted $\tilde{\mu}_j(l)$, from the CLAMR prior (5), where the true profile matrix $\tilde{s}$ is given by

$$\tilde{s} = \begin{pmatrix} 1 & 3 & 1 & 2 & 1 & 3 \\ 1 & 3 & 2 & 1 & 1 & 3 \\ 1 & 3 & 3 & 3 & 1 & 3 \end{pmatrix},$$ with rows indexing the true clusters and columns indexing the features.

Observe that this implies that only features 3 and 4 will exhibit meaningful between-cluster variation. To replicate our application to the INDITe data, we give each feature a different scale. For example, the MRs of feature 1 are {[−1, 1], [1, 2], [2, 4]}, and the MRs of feature 6 are {[0, 10], [10, 30], [30, 200]}. True cluster labels $\tilde{c}_i$ are simulated uniformly over the integers $[\tilde{L}]$. The cluster-specific scales $\tilde{\sigma}_j(l)$ are chosen so that, for the $l$th true cluster, the simulated data are contained within the MR given by the corresponding entry of $\tilde{s}$. This calibration uses the true data-generating process, given by $y_{ij} \mid \tilde{c}_i = l \sim t(\tilde{\nu}, \tilde{\mu}_j(l), \tilde{\sigma}_j(l))$, where $t(\nu, \mu, \sigma)$ denotes the Student's $t$-distribution with $\nu$ degrees of freedom, center $\mu$, and scale $\sigma$. For our simulations, we fix $\tilde{\nu} = 5$, so the true kernels have significantly heavier tails than the Gaussian. We vary $n \in \{100, 500, 750, 1000\}$ and simulate 100 independent replications for each sample size. For a given simulated dataset $y$, we fit a GMM with the CLAMR prior and default hyperparameters (6) with $\omega = 0.95$ and $\gamma = 1$. We compare the results to clusterings yielded by a standard Bayesian GMM with weakly informative priors (BGMM), an implementation of the EM algorithm, complete-linkage hierarchical clustering (HCA), and k-means. Details on hyperparameter choice for all approaches are given in the Supplement (Dombowsky et al., 2025). For both CLAMR and the BGMM, the clustering point estimate is selected by minimizing the VI loss.
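The data-generating mechanism just described can be sketched as follows; the centers and scales passed in are illustrative placeholders rather than the paper's calibrated values, which are drawn from the CLAMR prior.

```python
import math
import random

def sample_t(nu, mu, sigma, rng):
    """Draw from Student's t with nu degrees of freedom, center mu, scale sigma."""
    z = rng.gauss(0.0, 1.0)
    v = rng.gammavariate(nu / 2.0, 2.0)        # chi-squared with nu df
    return mu + sigma * z / math.sqrt(v / nu)

def simulate(n, mu, sigma, nu=5, seed=0):
    """Synthetic data with uniform true labels and heavy-tailed t kernels.

    mu, sigma : per-cluster lists of per-feature centers and scales
    """
    rng = random.Random(seed)
    L, p = len(mu), len(mu[0])
    labels = [rng.randrange(L) for _ in range(n)]   # uniform over [L]
    y = [[sample_t(nu, mu[c][j], sigma[c][j], rng) for j in range(p)]
         for c in labels]
    return y, labels
```

With `nu=5`, the simulated components have markedly heavier tails than any Gaussian kernel fit to them, which is what drives the over-clustering studied here.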

Results for the simulation study are displayed in Table 1, which shows the average adjusted Rand index (ARI) (Rand, 1971; Hubert and Arabie, 1985) between each point estimate and the true cluster labels $\tilde{c}$, as well as the number of clusters in the point estimates ($\hat{L}$). For $n > 100$, CLAMR achieves the highest ARI of all approaches. In addition, CLAMR consistently yields fewer clusters on average than the standard BGMM. The EM algorithm excels at inferring the number of clusters but is notably less accurate at recovering the true cluster memberships, as evidenced by its smaller ARI values. Despite being fixed at the true number of clusters, k-means and HCA fail to recover the true cluster structure. The Supplemental Material (Dombowsky et al., 2025) details a similar simulation study in which the model is well specified, i.e., we simulate the data from Gaussian distributions.
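For reference, the adjusted Rand index reported throughout Tables 1 and 2 can be computed directly from two label vectors; below is a standard, self-contained implementation of the Hubert-Arabie formula (this is the textbook definition, not code from the paper).

```python
from collections import Counter
from math import comb

def adjusted_rand_index(u, v):
    """Adjusted Rand index (Hubert and Arabie, 1985) between clusterings u and v."""
    n = len(u)
    nij = Counter(zip(u, v))                    # contingency table counts
    a, b = Counter(u), Counter(v)               # marginal cluster sizes
    sum_ij = sum(comb(c, 2) for c in nij.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)       # expected index under chance
    max_index = (sum_a + sum_b) / 2.0
    if max_index == expected:                   # degenerate clusterings
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```

The index equals 1 for identical partitions (up to label permutation) and is near 0 for independent ones, which is why it is a natural accuracy metric here.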

Table 1.

Average adjusted Rand index (ARI) with $\tilde{c}$ and number of clusters ($\hat{L}$) for CLAMR, the BGMM, EM, k-means, and HCA applied to 100 independent replications for each n ∈ {100, 500, 750, 1000}. Both k-means and HCA have the number of clusters fixed at L = 3, the true value. Standard deviations of the metrics across the replications are given in parentheses.

n CLAMR BGMM EM k-means HCA
ARI 100 0.96 (0.10) 0.99 (0.02) 0.99 (0.03) 0.32 (0.17) 0.22 (0.17)
500 0.98 (0.05) 0.98 (0.01) 0.94 (0.06) 0.35 (0.17) 0.20 (0.15)
750 0.98 (0.04) 0.97 (0.02) 0.94 (0.05) 0.34 (0.18) 0.17 (0.15)
1000 0.98 (0.01) 0.97 (0.02) 0.93 (0.06) 0.34 (0.16) 0.17 (0.12)

L^ 100 3.14 (0.47) 3.34 (0.55) 3.14 (0.38) 3.00 (0) 3.00 (0)
500 4.41 (1.00) 4.60 (0.89) 3.90 (0.63) 3.00 (0) 3.00 (0)
750 4.64 (1.08) 4.93 (0.89) 4.10 (0.58) 3.00 (0) 3.00 (0)
1000 5.07 (1.01) 5.44 (0.92) 4.27 (0.49) 3.00 (0) 3.00 (0)

3.2. Impact of MR Choice.

In this simulation study, we evaluate the impact of specifying MRs when analyzing a dataset with no true MR structure. That is, we do not assume that the true cluster centers are simulated from a GMM with approximately non-overlapping components. Instead, we generate synthetic data $y$ from a standard mixture model, i.e., $y_{ij} \sim \sum_{l=1}^{\tilde{L}} \tilde{\lambda}_l \, t(\tilde{\nu}, \tilde{\mu}_j(l), \tilde{\sigma}_j(l)^2)$. We simulate the true cluster variances in the same manner that we sampled the true cluster scales in the previous simulation study. However, the cluster centers are sampled without MR structure, i.e., $\tilde{\mu}_j(l) \sim \mathcal{N}(\xi_j, \tau_j^2)$, where $\xi_j$ and $\tau_j^2$ are chosen so that 95% of the Gaussian density encompasses all of the specified MRs. Continuing with our earlier example, if the MRs of feature 1 are {[−1, 1], [1, 2], [2, 4]}, then $(\xi_j, \tau_j^2)$ are chosen according to (6) with $K_j = 1$, $a_j(1) = -1$, and $b_j(1) = 4$. In the context of our application, this is a setting in which only approximate bounds for the features are known. As before, we set the degrees of freedom to $\tilde{\nu} = 5$. All other hyperparameters and settings for CLAMR, the BGMM, EM, k-means, and HCA are kept the same as in the previous simulation study.
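The construction of $(\xi_j, \tau_j^2)$ from the combined MR bounds can be sketched as below, assuming (our assumption; the exact mapping is in the paper's equation (6)) that the Gaussian is centered at the midpoint of the pooled MRs and its central 95% mass spans them.

```python
from statistics import NormalDist

def center_prior_hyperparams(a, b, coverage=0.95):
    """Choose (xi, tau) so the central `coverage` mass of N(xi, tau^2) spans [a, b].

    a, b : lower and upper bounds of the combined MRs for one feature
    """
    z = NormalDist().inv_cdf((1.0 + coverage) / 2.0)   # roughly 1.96 for 95%
    xi = (a + b) / 2.0                                  # midpoint of the MRs
    tau = (b - a) / (2.0 * z)                           # scale matching coverage
    return xi, tau
```

For feature 1 above, `center_prior_hyperparams(-1.0, 4.0)` centers the prior at 1.5 with roughly 95% of its mass inside [−1, 4].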

We display the results in Table 2. Interestingly, we find that CLAMR can improve over the BGMM in both accuracy and estimation of the number of clusters, despite the MRs having no bearing on the data-generating process beyond approximate bounds on the centers. As before, both CLAMR and the BGMM attain higher ARI than the other methods, though the EM algorithm is generally more robust to over-clustering in this case. The performance of k-means and HCA improves in this setting, but their average ARI scores remain markedly lower than those of the model-based methods.

Table 2.

ARI with $\tilde{c}$ and $\hat{L}$ for CLAMR, the BGMM, EM, k-means, and HCA applied to 100 independent replications for each n ∈ {100, 500, 750, 1000} (parentheses indicate standard deviation).

n CLAMR BGMM EM k-means HCA
ARI 100 0.98 (0.08) 0.98 (0.08) 0.99 (0.03) 0.71 (0.20) 0.67 (0.21)
500 0.99 (0.01) 0.99 (0.01) 0.96 (0.05) 0.77 (0.19) 0.62 (0.24)
750 0.99 (0.01) 0.98 (0.01) 0.92 (0.07) 0.75 (0.21) 0.61 (0.24)
1000 0.98 (0.04) 0.98 (0.02) 0.91 (0.06) 0.79 (0.18) 0.62 (0.22)

L^ 100 3.09 (0.40) 3.06 (0.34) 3.15 (0.44) 3.00 (0) 3.00 (0)
500 4.09 (0.84) 4.20 (0.79) 3.77 (0.69) 3.00 (0) 3.00 (0)
750 4.43 (0.99) 4.49 (0.81) 4.24 (0.74) 3.00 (0) 3.00 (0)
1000 4.92 (0.93) 5.00 (0.89) 4.47 (0.61) 3.00 (0) 3.00 (0)

4. Sepsis Phenotypes in Northern Tanzania.

4.1. Data and Pretraining.

Our main interest is using CLAMR to derive interpretable phenotypes from a cohort of sepsis patients. While the original dataset consists of a vast collection of variables, we opt for features that have been used in SENECA, which we enumerate in Table 3. The clinical variables can be categorized as belonging to four distinct classes: clinical signs, markers of inflammation, markers of organ dysfunction, and additional immunological markers. The original INDITe cohort consisted of febrile patients who were not necessarily sepsis positive, so we only include individuals with a systemic inflammatory response syndrome (SIRS) score greater than or equal to 2 in order to derive phenotypes that correspond to the sepsis definition, ultimately resulting in n = 265. The following features have missing observations: bilirubin (2.3%), platelets (1.1%), BUN (1.1%), WBC (1.1%), and bicarbonate (0.8%). Based on our knowledge of the data collection process, we assume that these variables are missing completely at random (MCAR), and we impute these entries using the Gibbs sampler. We also have information on the in-patient outcome (e.g., death), which we do not use for clustering but will refer to in our interpretations.

Table 3.

The complete set of 15 variables from the INDITe data with accompanying Bayes factors from the pre-training phase. Features whose Bayes factors indicate strong evidence against H0 are shown in bold.

Feature | Type | BFj
Temperature | Clinical Signs | 6.392
Respiratory Rate | Clinical Signs | 0.024
Pulse Rate | Clinical Signs | 0.402
Systolic Blood Pressure (SBP) | Clinical Signs | 2.765
White Blood Cell Count (WBC) | Inflammation Markers | 14.553
C-Reactive Protein (CRP) | Inflammation Markers | 2.636
Aspartate Transaminase (AST) | Organ Dysfunction Markers | 250.736
Bilirubin | Organ Dysfunction Markers | 169.488
Blood Urea Nitrogen (BUN) | Organ Dysfunction Markers | 12607.696
Creatinine | Organ Dysfunction Markers | 45.208
Platelets | Organ Dysfunction Markers | 3.500
Glucose | Additional Markers | 0.661
Sodium | Additional Markers | 1.746
Bicarbonate | Additional Markers | 6.139
Albumin | Additional Markers | 4.600

MRs correspond to expression levels specified by our clinical science collaborators, who have expert knowledge of the patient population in northern Tanzania. Thirteen of the characteristics have expression levels written in terms of D, N, and E (Kj = 3), while the remaining two are expressed in D and N or N and E (Kj = 2). For example, systolic blood pressure, which is measured in mm Hg, has the following MRs: D = [65, 111], N = [111, 220], and E = [220, 300]. In contrast, creatinine, measured in μmol/L, has only two MRs: N = [0, 90] and E = [90, 250]. These cutoffs are specific to northern Tanzania and are non-overlapping. For more general implementations of the CLAMR algorithm, overlapping MRs may be considered, which we discuss in Section 4.6 and the Supplement (Dombowsky et al., 2025). However, these D/N/E MRs are the most straightforward MR specifications for our clinical collaborators to elicit, and their clinical relevance makes them sensible choices for creating subgroups that could be used for treatment design. These MRs are then used to define the hyperparameters $\xi_j(k)$ and $\sigma_j(l)^2$ through (6) with the overlap parameter set to $\omega = 0.95$. No other preprocessing is applied to the data, and features are kept in their original units for the analysis.

Before computing the clustering point estimate, we implement a pre-training phase to select influential features for clustering. We set $\gamma = 1$ and choose $\rho_j = (1.1)\,1\{K_j = 2\} + (0.7)\,1\{K_j = 3\}$, which corresponds to $\pi(d(S_j, L_n) < 0.1) = 1/2$. The resulting Bayes factors for all variables are displayed in the third column of Table 3, with bold indicating features with substantial evidence of clustering influence (Kass and Raftery, 1995). We select p = 4 of the original 15 features for our analysis, corresponding to AST, bilirubin, BUN, and creatinine, all of which are organ dysfunction markers. We then calculate the point estimate using the reduced set of features.

4.2. Clustering Point Estimate and Interpretations.

We compute the point estimate $c^*$ by minimizing the VI loss using the MCMC samples from $\pi(c \mid y)$. We also calculate the PSM for the INDITe data to express uncertainty about the point estimate. The point estimate consists of 7 clusters; however, the seventh cluster contains only one individual, who has abnormally high levels of bilirubin, so we focus on the 6 clusters formed by the remaining 264 participants. Table 4 summarizes relevant clinical characteristics of the participants by cluster, including the proportions of females, persons living with HIV, malaria infection, and in-hospital mortality, as well as age; Figure 4 provides a pairs plot of the influential features. For convenience, the clusters are labeled in decreasing order of size. No cluster contains a majority of participants, and the smallest, cluster 6, comprises just 18 individuals (7% of the cohort). In Figure 4, observe that cluster 6 displays multimodality for some features (e.g., the bilirubin measurements), largely because its sample size is small relative to the cohort.

Table 4.

Basic demographic information consisting of sex, HIV infection, advanced HIV status, malaria status, inpatient outcome, and age within each cluster, with overall values included for comparison. In addition, the estimated profiles $s_j^{m,*}$, with $\Delta_j^{m,*}$ in parentheses, are included for each cluster and feature.

Overall C1 C2 C3 C4 C5 C6
Size 264 87 49 44 39 27 18
Pr(Female) 0.485 0.690 0.388 0.409 0.436 0.185 0.500
Pr(HIV+) 0.383 0.414 0.306 0.523 0.359 0.185 0.444
Pr(Adv. HIV) 0.258 0.241 0.224 0.386 0.231 0.148 0.333
Pr(Malaria) 0.098 0.092 0.163 0.045 0.051 0.185 0.056
Pr(Deceased) 0.125 0.057 0.122 0.182 0.179 0.148 0.167
Avg. Age (SD) 42.8 (17) 37 (14.2) 47.5 (17.5) 38.2 (12.1) 48.5 (19) 40.4 (16.8) 60.2 (16.6)

BUN N (0.98) N (0.81) N (0.95) E (0.68) E (0.89) E (0.98)
Creatinine N (0.98) N (0.91) N (0.95) E (0.54) N (0.63) E (0.91)
AST N (0.93) N (0.93) E (0.90) N (0.84) E (0.93) N (0.87)
Bilirubin D (0.89) N (0.86) D (0.89) D (0.66) N (0.89) N (0.77)

Fig 4.

Fig 4.

Pairs plots for the influential features BUN, creatinine, AST, and bilirubin. All features are log-transformed.

To aid interpretation, we compute the estimated MR specifications in (8) while also inspecting descriptive statistics for the clusters post hoc. Participants in cluster 1 are generally female (69%), have the best inpatient outcome (5.7% mortality), and have the lowest average age among the clusters (37 years). These participants tend to have neutral feature expression, with the exception of bilirubin, which tends to exhibit diminished values. In comparison, cluster 2 is majority male (61%), has a lower proportion of participants living with HIV (31%, versus 41% for cluster 1), and has more than double the mortality (12%). Cluster 2 is also generally associated with neutral feature expression, though it is more inclined toward higher levels of BUN, bilirubin, and AST than cluster 1. Cluster 3 is the only group in which a majority of patients live with HIV (52%; the overall prevalence is 38%), and 74% of these patients have a CD4 percentage indicative of advanced HIV. This group has the lowest proportion of subjects with malaria (4.5%; the prevalence among all 264 is 9.8%); its participants are generally male and younger (average age 38.2 years) and have elevated levels of AST. Cluster 4 is characterized by elevated BUN and creatinine levels (though the $\Delta_j^{4,*}$ values of these features are both < 0.7), but neutral AST levels and generally diminished bilirubin. Clusters 3 and 4 are similar in size (44 and 39 participants, respectively) and have the two highest mortality rates (18.2% and 17.9%, respectively). The highest levels of AST are observed in cluster 5, which is also associated with neutral bilirubin and the highest proportion of men (81.5%). Furthermore, this group has the lowest proportion of participants living with HIV (18.5%), although 80% of these individuals have advanced HIV. Cluster 6, the smallest cluster, is characterized by markedly high levels of both BUN and creatinine, but neutral levels of AST and bilirubin.
In addition, patients in this group tend to be older (average age 60.2 years), and cluster 6 has the second-highest proportion of people living with HIV (44.4%); of these individuals, 75% have advanced HIV.

Another defining characteristic of clusters 5 and 6 is their shape. Despite our assumption of diagonal covariance, Figure 4 shows that BUN and creatinine are highly correlated within clusters 5 and 6 (with correlations of 0.69 and 0.58, respectively). In contrast, the corresponding values for the first four clusters are 0.40, 0.44, 0.24, and 0.24. Although this reflects the inherent correlation between these variables, the magnitude of the correlation helps distinguish the clusters.
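The within-cluster correlations cited above are straightforward to reproduce post hoc from the data and the point estimate; a minimal sketch (argument names are ours) of the Pearson correlation restricted to one cluster:

```python
def within_cluster_corr(x, y, labels, m):
    """Pearson correlation of features x and y among points in cluster m.

    x, y   : feature vectors (e.g., log BUN and log creatinine)
    labels : point-estimate cluster labels c*
    m      : cluster of interest
    """
    xs = [xi for xi, l in zip(x, labels) if l == m]
    ys = [yi for yi, l in zip(y, labels) if l == m]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5
```

Comparing this quantity across clusters is what reveals the stronger BUN-creatinine coupling in clusters 5 and 6.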

Upon comparison, some clinical characteristics are shared across clusters, particularly between clusters 4 and 6. Participants in both clusters tend to have neutral AST and elevated levels of BUN and creatinine. Despite these similarities in MR specifications, other factors differentiate the subgroups. The average age of cluster 4 is 48.5 years, more than a decade younger than that of cluster 6 (60.2 years). Furthermore, Figure 4 shows that, in general, the BUN and creatinine levels of cluster 6 are markedly higher than those of cluster 4. This distinction is picked up by the CLAMR model: the relatively low values of $\Delta_j^{4,*}$ for BUN and creatinine signal that cluster 4 is associated with several MRs. These differences may be useful for developing customized treatments for these clusters. In the general case, similar clusters can be combined after a post hoc clinical analysis or by using a Gaussian kernel merging algorithm, e.g., the FOLD post-processing method (Dombowsky and Dunson, 2024).

In light of these differential characteristics, we summarize the interpretation of the clusters as follows. Cluster 1 comprises young, predominantly female patients with generally neutral markers, low bilirubin (not a clinical concern), and low mortality; cluster 2 has modestly elevated levels of BUN suggesting possible renal dysfunction and skews older and more male than cluster 1; cluster 3 is characterized by a high frequency of HIV and elevated AST levels indicative of liver dysfunction; cluster 4 has elevated levels of BUN and creatinine suggesting septic shock or renal dysfunction; cluster 5 is composed mainly of men with low HIV prevalence and extremely high levels of AST indicative of liver dysfunction; and cluster 6 is composed of older individuals with extremely elevated BUN and creatinine indicative of septic shock or renal dysfunction.

4.3. Comparison to SENECA Clusters.

Next, we qualitatively compare the INDITe clusters $c^*$ with the four clusters reported in the SENECA study (Seymour et al., 2019), denoted α, β, γ, and δ. Cluster 1 is comparable to the α cluster due to the lack of organ dysfunction and generally low mortality, and cluster 6 closely resembles the β cluster, as it is characterized by advanced age and renal dysfunction. The δ cluster is associated with liver dysfunction; the elevated levels of AST in clusters 3 and, especially, 5 mean that liver dysfunction is also a defining factor in these INDITe clusters. However, we also find that factors specific to sSA explain heterogeneity between the INDITe clusters. For example, high malaria prevalence and low HIV prevalence are key characteristics of cluster 5, while cluster 3 is in part driven by low malaria prevalence and high HIV prevalence. Although cluster 1 is similar to cluster α, there are notable differences in demographics, including average age (37 years for INDITe, 60 years for SENECA) and sex (69% female for INDITe, 49% female for SENECA). We may nonetheless interpret clusters 1 and 6 as clinical manifestations of the α and β groups, respectively, in the northern Tanzanian population. Regarding participant outcomes, we observe comparable but distinct associations with those of the SENECA study, which found that the δ group (characterized by liver dysfunction) had the highest mortality rate in their cohort. We find that cluster 3 (characterized by high HIV prevalence and possible liver dysfunction) has the highest observed mortality at 18.2%, although cluster 4 (characterized by renal dysfunction) has a similar mortality of 17.9%.

4.4. Comparison to Standard Methods.

First, we compare our results with the clustering point estimate obtained after fitting a Bayesian GMM (BGMM), as in (2), with weakly informative Gaussian and inverse-gamma priors. Explicit details on hyperparameters are given in the Supplement (Dombowsky et al., 2025). The adjusted Rand index between $c^*$, the CLAMR point estimate, and the BGMM point estimate is 0.273, indicating only a modest resemblance between the two clusterings. Both point estimates have 7 clusters and create singleton clusters (though not of the same participant). In contrast to the cluster sizes shown in Table 4, the sizes of the BGMM clusters are 125, 77, 26, 15, 11, 10, and 1. An explicit comparison between the two point estimates is given in Figure 6, which shows that the CLAMR clusters are split and subsequently merged to create those of the BGMM. For example, cluster 1 in CLAMR is spread over three BGMM clusters, and cluster 1 of the BGMM comprises participants from clusters 1, 2, 3, and 4 in CLAMR. The clusterings differ in interpretation as well. Clusters 3 and 4 of the BGMM, for instance, have very low HIV prevalence (10% and 9.1%), whereas the HIV prevalence in every CLAMR cluster is larger than these proportions. Malaria prevalence is generally higher in the BGMM clustering, with three of the seven clusters having more than 20% malaria prevalence; all CLAMR-derived clusters have lower malaria prevalence. Finally, the BGMM places 202 individuals (76% of the cohort) in just two clusters, while the CLAMR clusters divide the dataset more evenly, making it easier to provide definitive interpretations. For these reasons, we find the results of CLAMR preferable from a clinical perspective. We further evaluate our methodology in the Supplemental Material (Dombowsky et al., 2025) by comparing the results from the CLAMR model to a latent class model (LCM) (Linzer and Lewis, 2011).
The latter model is implemented by discretizing the observed feature values according to the MR in which they fall.

Fig 6.

Fig 6.

A comparison between the point estimate yielded by CLAMR and that by a standard BGMM. Singleton clusters have been included.

Furthermore, we compare the results from CLAMR to several easily implemented clustering algorithms: the EM algorithm for a GMM, k-means, and complete-linkage HCA. The GMM used in the EM algorithm is selected by minimizing the Bayesian information criterion (BIC) over a range of covariance structures and numbers of components. The number of clusters in k-means is chosen by an elbow plot of the total within-cluster sum of squares, and we fix the number of clusters in HCA at 6, the number in the CLAMR point estimate (with the singleton removed). The adjusted Rand indices between $c^*$ and the clusterings from EM, k-means, and HCA are 0.385, 0.042, and 0.087, respectively. A visual comparison between these methods is given in Figure 7. Some general patterns emerge, such as a tendency for all methods to cluster together points with high values on the first two principal components (i.e., in the upper right quadrant of the plots in Figure 7), as well as points with low values on the first principal component (i.e., the middle sections of the plots). It is between these two extremes that the methods differ, and this region is where clusters 2-5 of the CLAMR clustering $c^*$ occur. The differences in cluster definitions across methods in this region demonstrate the impact of using prior information when the data are not well separated.

Fig 7.

Fig 7.

Comparison of the clusterings derived by EM, k-means, and complete linkage HCA on the INDITe data to the CLAMR point estimate. Singleton clusters have been included.

4.5. Uncertainty Quantification.

The PSM for the INDITe data is shown in Figure 5, where each box along the diagonal indicates a cluster. Observe that clusters 5 and 6 are notably distinct from cluster 1, i.e., $\Pr(c_i = c_j \mid y)$ is very close to 0 for $i \in C_1^*$ and $j \in C_5^* \cup C_6^*$. In fact, these clusters appear well separated in the influential variables in Figure 4, suggesting differing underlying subtypes and possibly differing treatment regimens. The PSM also indicates similarity between clusters 1, 2, 3, and 4 via the off-diagonal boxes in Figure 5. Comparing again to the within-cluster distributions in Figure 4, we can see that these clusters have similar distributions for the septic shock and renal dysfunction markers (BUN and creatinine). Hence, one would expect the underlying subtypes to manifest similarly in the patient population, although differences in key factors such as HIV status and age, as well as AST and bilirubin levels, still distinguish these clusters clinically; alternatively, it may be that individuals can embody characteristics of multiple subtypes. In the Supplement (Dombowsky et al., 2025), we display histograms of the posterior distribution of $\Delta_j(m, k)$ for $k = s_j^{m,*}$ to convey uncertainty in our interpretive quantities. These plots shed further light on our interpretations: for example, the posteriors for cluster 4 exhibit substantial uncertainty in MR association, whereas the posteriors for clusters 1 and 6 are more concentrated. We also see multimodality in this quantity for creatinine in cluster 5, likely because these participants lie close to the boundary between MRs.
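The PSM itself is straightforward to assemble from the MCMC label draws: each entry estimates $\Pr(c_i = c_j \mid y)$ by the fraction of iterations in which participants $i$ and $j$ share a label. A minimal sketch:

```python
def posterior_similarity(label_samples):
    """Posterior similarity matrix estimated from MCMC clustering draws.

    label_samples : list of clusterings, one per MCMC iteration;
                    each is a list of n cluster labels.
    """
    T = len(label_samples)
    n = len(label_samples[0])
    psm = [[0.0] * n for _ in range(n)]
    for c in label_samples:
        for i in range(n):
            for j in range(i, n):
                if c[i] == c[j]:                # i and j co-clustered this draw
                    psm[i][j] += 1.0 / T
                    psm[j][i] = psm[i][j]       # keep the matrix symmetric
    return psm
```

Ordering the rows and columns by point-estimate membership, as in Figure 5, is what makes the diagonal blocks and between-cluster similarity visible.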

Fig 5.

Fig 5.

Posterior similarity matrix (PSM) for the INDITe data, ordered by point estimate membership.

4.6. Posterior Predictive Samples and Sensitivity Analysis.

To gain insight into the ability of the CLAMR model to recapture patterns in the data, we sample from the posterior predictive distribution. For every measurement $y_{ij}$, we generate a posterior predictive sample $y_{ij}^{(t)}$ from (2) using the Gibbs samples of the cluster centers, scales, and labels. The density functions for 500 equally spaced posterior predictive samples across all chains are shown in Figure 8, with the density of the observed data overlaid. Here, we can see that the CLAMR model is most accurate in predicting measurements below the 95th percentile of each feature. For measurements above the 95th percentile, the CLAMR model is notably less accurate, with a general tendency to cluster together measurements far in the tails of the empirical distribution. This behavior is explained by the MR specifications: since many of these tail measurements fall in the same MR, they are likely to cluster together. For example, measurements in the right tail for AST all lie above the cut-off point for the second MR, and the CLAMR model rewards co-clustering these observations. This results in a mode in the posterior predictive samples midway through the right tail.
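The posterior predictive step for a single feature can be sketched as below, drawing one Gaussian observation per retained MCMC iteration; the argument names are ours, and the paper's sampler additionally handles multivariate features and imputation.

```python
import random

def posterior_predictive(center_draws, scale_draws, weight_draws, rng=None):
    """One posterior predictive draw per MCMC iteration for a single feature.

    Each iteration t supplies cluster centers, scales, and mixture weights;
    we draw a component label from the weights and then a Gaussian observation,
    mirroring the mixture likelihood in (2).
    """
    rng = rng or random.Random(0)
    draws = []
    for mus, sigmas, w in zip(center_draws, scale_draws, weight_draws):
        l = rng.choices(range(len(w)), weights=w)[0]   # sample a cluster label
        draws.append(rng.gauss(mus[l], sigmas[l]))     # then an observation
    return draws
```

Overlaying the density of such draws on the empirical density of the data is exactly the diagnostic shown in Figure 8.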

Fig 8.

Fig 8.

Summary of posterior predictive samples for the INDITe participants. The dark line indicates the density of the data, whereas the light curves are densities from 500 posterior predictive samples. For visualization, missing features are imputed using multivariate imputation (van Buuren and Groothuis-Oudshoorn, 2011).

Figure 8 also highlights the differences between CLAMR and the standard BGMM. Generally, Bayesian Gaussian mixtures can excel in density estimation (Escobar and West, 1995), although the interpretations of the resulting clusters may not be relevant to the application of interest. The specification of the CLAMR prior is motivated specifically by clustering and not density estimation, which means that the modes in the posterior predictive samples correspond to subgroups relevant to our clinical application and not necessarily actual Gaussian-shaped modes in the empirical distribution.

In the Supplemental Material (Dombowsky et al., 2025), we apply CLAMR to the INDITe data with different MR specifications based on the SENECA results. These MRs are derived by translating the characteristics of the SENECA clusters into values relevant to the INDITe cohort (e.g., defining which expression levels of variable $j$ characterize the clusters α, β, γ, and δ). Unlike the expression-level MRs presented above, these MR specifications were allowed to be identical. There are differences and similarities in the results of the two approaches. The SENECA MR specifications lead to more influential variables, but the expression-level MRs lead to more clusters. However, clusters driven by BUN, creatinine, bilirubin, and AST are derived under both MR specifications, meaning that common clinical factors drive the results regardless of the MRs chosen.

5. Discussion.

In this article, we present CLAMR, a novel enhancement of the Bayesian Gaussian mixture model motivated by disease subtype analysis. The main idea behind our approach is to incorporate commonly used cut-offs in feature values, or meaningful regions, into the prior for the cluster centers, with the ultimate goal of distinguishing clusters via their medical interpretations. We accomplish this mathematically by assuming that the prior on the cluster centers is a mixture model over the MRs. When the CLAMR prior is a Gaussian mixture, a simple Gibbs sampler can be derived for posterior computation. The algorithm performs well on synthetic data in which the Gaussian model is misspecified. In addition to clustering, our prior can be used in a pre-training procedure to select relevant clustering features via Bayesian hypothesis tests.

CLAMR is primarily motivated by the clustering challenges presented by the INDITe sepsis data from northern Tanzania, such as the lack of separation in the data and the small sample size. We find that the influential features are all markers of organ dysfunction and, using these, we derive 6 clusters of varying size and interpretation. These groups are related to clinical factors and signs, including HIV prevalence, malaria status, age, sex, and kidney and liver dysfunction, and we also observe varying mortality rates across the groups. In addition, several of our clusters are similar in interpretation to the clusters of the SENECA study, despite the fact that the sepsis patient population in sSA has markedly different distributions of age, comorbidities and etiologies than those in North America.

There are multiple possible extensions of CLAMR that generalize our approach. The cluster profiles $S_j$ can be estimated after applying label-switching algorithms to the samples $c^{(t)}$ and $s^{(t)}$ and then minimizing a clustering loss such as Binder's or VI. The CLAMR prior is presented here as a GMM, but any continuous probability distribution can be used to model the prior density within the MRs. If investigators require the distribution within MRs to be multimodal, for example, one can set the component kernels of $\pi(\mu_j(l))$ to be GMMs themselves, and posterior computation can be carried out using a Gibbs sampler similar to the one we utilize. We assume that the cluster-specific variances follow weakly informative inverse-gamma priors, but this can easily be extended to a mixture of inverse-gamma kernels over the MRs in order to model varying shapes, not just centers, between profiles. Finally, in this article we implement the CLAMR prior for cluster centers in a GMM with diagonal covariance matrices. The CLAMR prior could also be applied to a GMM with non-diagonal covariance, and we provide analytic details of this extension in the Supplementary Material (Dombowsky et al., 2025).

Our framework is a two-step procedure: we first screen for influential variables and then report the clustering results for the reduced dataset. We perform these tasks in two separate steps to (a) formalize the choice of influential features as a hypothesis testing problem and (b) avoid the adverse theoretical properties of Bayesian clustering that arise even in modest dimensions (Chandra, Canale and Dunson, 2023). However, CLAMR could be extended to a one-step procedure in which an appropriate variable selection prior is used to reduce the feature set. For example, we could incorporate latent discrimination indicators α_j ∈ {0, 1} for all characteristics j = 1, …, p (Tadesse, Sha and Vannucci, 2005; Kim, Tadesse and Vannucci, 2006). When α_j = 1, we include feature j in the GMM likelihood and draw its cluster-specific means from the CLAMR prior; when α_j = 0, feature j is modeled with a single Gaussian distribution with conjugate Gaussian-inverse-gamma priors.
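To give a feel for how such a discrimination indicator might be updated, the toy function below (our simplified illustration, not the authors' sampler) compares the fit of one feature under cluster-specific means versus a single global mean, plugging in empirical means rather than integrating over the CLAMR prior, and then draws the indicator with posterior-odds-style weighting:

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_logpdf(x, loc, scale):
    # log density of N(loc, scale^2), evaluated elementwise
    return -0.5 * np.log(2 * np.pi * scale**2) - 0.5 * ((x - loc) / scale) ** 2

def update_discrimination(x_j, labels, sigma=1.0, prior_incl=0.5):
    """Toy draw of the indicator alpha_j for a single feature.

    Compares cluster-specific means (alpha_j = 1) to one global
    mean (alpha_j = 0) using plug-in empirical means, instead of
    the marginal likelihoods a full sampler would integrate.
    """
    # alpha_j = 0: one Gaussian for all observations
    ll0 = gauss_logpdf(x_j, x_j.mean(), sigma).sum()
    # alpha_j = 1: a separate mean within each current cluster
    ll1 = sum(
        gauss_logpdf(x_j[labels == k], x_j[labels == k].mean(), sigma).sum()
        for k in np.unique(labels)
    )
    # Bernoulli draw for alpha_j
    log_odds = ll1 - ll0 + np.log(prior_incl / (1 - prior_incl))
    p_incl = 1.0 / (1.0 + np.exp(-log_odds))
    return bool(rng.random() < p_incl)
```

A feature whose cluster-specific means differ sharply is almost surely included, whereas an uninformative feature is included roughly at the prior rate; a full Gibbs step would also account for the CLAMR prior mass within each MR.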

In addition to the MRs, other sources of prior information can be incorporated into the CLAMR algorithm. Our current approach relies on overfitting the GMM, but if a reliable estimate of the number of clusters is available (either from expert information or from a pretraining step using the EM algorithm), it can be used either by (a) fixing L at this estimate or (b) using the estimate to choose the hyperparameters in a clustering loss function. If, instead, clinicians have an informed guess of the clustering, one could use a centered partition process (Paganin et al., 2021) as the prior on c. The CLAMR prior can also be extended to more complicated regimes that frequently arise in clinical analyses, such as multiview clustering (Franzolini et al., 2023).
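The effect of overfitting the mixture can be illustrated with an off-the-shelf analogue: scikit-learn's variational BayesianGaussianMixture with a sparse prior on the weights empties superfluous components during fitting, leaving an effective number of clusters well below the number specified. This is only an analogue of the Gibbs-sampled GMM used in the paper, run here on synthetic data:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(1)
# two well-separated synthetic clusters (stand-ins for patient features)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(100, 2)),
    rng.normal(10.0, 0.5, size=(100, 2)),
])

# deliberately overfitted mixture: 10 components, but a sparse
# weight prior lets the superfluous ones empty out during fitting
bgm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior=1e-2,
    max_iter=500,
    random_state=0,
).fit(X)

# effective number of clusters: components with non-negligible weight
n_effective = int(np.sum(bgm.weights_ > 0.05))
```

On data like this, `n_effective` lands near the true number of clusters even though ten components were fitted, mirroring the behavior of overfitted Bayesian mixtures under sparse Dirichlet weights.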

Supplementary Material

Supplement A

Supplement A: Supplement to “Bayesian learning of clinically meaningful sepsis phenotypes in northern Tanzania”

The PDF supplement includes (a) extended computational details including a Gibbs sampler; (b) a description of alternate notions of MRs; (c) technical details including hyperparameter choices for the simulation study; (d) a simulation study for a well-specified Gaussian mixture; (e) hyperparameter choices, MCMC diagnostics, and additional figures for the application to the INDITe data; (f) a comparison of our results to a latent class model (LCM); (g) an additional analysis of the INDITe data using MRs derived from the patient clusters described by SENECA; (h) an extension of the CLAMR prior for categorical data; and (i) technical details on extending CLAMR to non-diagonal covariance matrices.

Supplement B

Supplement B: Code for “Bayesian learning of clinically meaningful sepsis phenotypes in northern Tanzania”

Code for (a) the main functions used in the analysis of INDITe data; and (b) reproducible code for the simulation studies can be accessed at https://github.com/adombowsky/CLAMR, and is also included as a ZIP file.

Funding.

This work is supported in part by funds from the National Institutes of Health under grant numbers R01AI12137, R01AI155733, and R01ES035625; the Office of Naval Research under grant number N00014-21-1-2510; and the National Institute of Environmental Health Sciences under grant number R01ES027498. Dombowsky was funded by the Myra and William Waldo Boone Fellowship.

REFERENCES

  1. Ahmad A and Fröhlich H (2017). Towards clinically more relevant dissection of patient heterogeneity via survival-based Bayesian clustering. Bioinformatics 33 3558–3566. [DOI] [PubMed] [Google Scholar]
  2. Andrews B, Muchemwa L, Kelly P, Lakhi S, Heimburger DC and Bernard GR (2014). Simplified severe sepsis protocol: a randomized controlled trial of modified early goal-directed therapy in Zambia. Critical Care Medicine 42 2315. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Andrews B, Semler MW, Muchemwa L, Kelly P, Lakhi S, Heimburger DC, Mabula C, Bwalya M and Bernard GR (2017). Effect of an early resuscitation protocol on in-hospital mortality among adults with sepsis and hypotension: a randomized clinical trial. JAMA 318 1233–1240. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Annane D, Renault A, Brun-Buisson C, Megarbane B, Quenot J-P, Siami S, Cariou A, Forceville X, Schwebel C, Martin C et al. (2018). Hydrocortisone plus fludrocortisone for adults with septic shock. New England Journal of Medicine 378 809–818. [DOI] [PubMed] [Google Scholar]
  5. Aragam B, Dan C, Xing EP and Ravikumar P (2020). Identifiability of nonparametric mixture models and Bayes optimal clustering. The Annals of Statistics 48 2277–2302. [Google Scholar]
  6. Bair E and Tibshirani R (2004). Semi-supervised methods to predict patient survival from gene expression data. PLoS Biology 2 e108. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Bennett JE, Dolin R and Blaser MJ (2019). Mandell, Douglas, and Bennett’s Principles and Practice of Infectious Diseases E-Book: 2-Volume Set. Elsevier Health Sciences. [Google Scholar]
  8. Bensmail H, Celeux G, Raftery AE and Robert CP (1997). Inference in model-based cluster analysis. Statistics and Computing 7 1–10. [Google Scholar]
  9. Binder DA (1978). Bayesian cluster analysis. Biometrika 65 31–38. [Google Scholar]
  10. Buch D, Dewaskar M and Dunson DB (2024). Bayesian Level-Set Clustering. arXiv preprint arXiv:2403.04912. [Google Scholar]
  11. Chandra NK, Canale A and Dunson DB (2023). Escaping the curse of dimensionality in Bayesian model-based clustering. Journal of Machine Learning Research 24 1–42. [Google Scholar]
  12. Crump JA, Morrissey AB, Nicholson WL, Massung RF, Stoddard RA, Galloway RL, Ooi EE, Maro VP, Saganda W, Kinabo GD et al. (2013). Etiology of severe non-malaria febrile illness in Northern Tanzania: a prospective cohort study. PLoS Neglected Tropical Diseases 7 e2324. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Davenport EE, Burnham KL, Radhakrishnan J, Humburg P, Hutton P, Mills TC, Rautanen A, Gordon AC, Garrard C, Hill AV et al. (2016). Genomic landscape of the individual host response and outcomes in sepsis: a prospective cohort study. The Lancet Respiratory Medicine 4 259–271. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Diebolt J and Robert CP (1994). Estimation of finite mixture distributions through Bayesian sampling. Journal of the Royal Statistical Society Series B: Statistical Methodology 56 363–375. [Google Scholar]
  15. Do D, Do L, McKinley SA, Terhorst J and Nguyen X (2024). Dendrogram of mixing measures: Learning latent hierarchy and model selection for finite mixture models. arXiv preprint arXiv:2403.01684. [Google Scholar]
  16. Dombowsky A and Dunson DB (2024). Bayesian Clustering via Fusing of Localized Densities. Journal of the American Statistical Association In press. [Google Scholar]
  17. Dombowsky A, Dunson DB, Madut DB, Rubach MP and Herring AH (2025). Supplement to “Bayesian Learning of Clinically Meaningful Sepsis Phenotypes in Northern Tanzania”. https://doi.org/DOIprovidedbythetypesetter. [Google Scholar]
  18. Escobar MD and West M (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association 90 577–588. [Google Scholar]
  19. Ferguson TS (1973). A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics 1 209–230. [Google Scholar]
  20. Franzolini B, Cremaschi A, van den Boom W and De Iorio M (2023). Bayesian clustering of multiple zero-inflated outcomes. Philosophical Transactions of the Royal Society A 381 20220145. [Google Scholar]
  21. Frühwirth-Schnatter S, Celeux G and Robert CP (2019). Handbook of Mixture Analysis. CRC Press. [Google Scholar]
  22. Grazian C and Robert CP (2018). Jeffreys priors for mixture estimation: Properties and alternatives. Computational Statistics & Data Analysis 121 149–163. [Google Scholar]
  23. Hubert L and Arabie P (1985). Comparing partitions. Journal of Classification 2 193–218. [Google Scholar]
  24. Jacob ST, Moore CC, Banura P, Pinkerton R, Meya D, Opendi P, Reynolds SJ, Kenya-Mugisha N, Mayanja-Kizza H, Scheld WM et al. (2009). Severe sepsis in two Ugandan hospitals: a prospective observational study of management and outcomes in a predominantly HIV-1 infected population. PloS One 4 e7782. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Jain AK and Dubes RC (1988). Algorithms for Clustering Data. Prentice-Hall, Inc., Upper Saddle River, NJ, USA. [Google Scholar]
  26. Jeffreys H (1935). Some tests of significance, treated by the theory of probability. In Mathematical Proceedings of the Cambridge Philosophical Society 31 203–222. Cambridge University Press. [Google Scholar]
  27. Jeffreys H (1961). The Theory of Probability. Oxford University Press. [Google Scholar]
  28. Kamary K, Lee JE and Robert CP (2018). Weakly informative reparameterizations for location-scale mixtures. Journal of Computational and Graphical Statistics 27 836–848. [Google Scholar]
  29. Kass RE and Raftery AE (1995). Bayes factors. Journal of the American Statistical Association 90 773–795. [Google Scholar]
  30. Kim S, Tadesse MG and Vannucci M (2006). Variable selection in clustering via Dirichlet process mixture models. Biometrika 93 877–893. [Google Scholar]
  31. Lau JW and Green PJ (2007). Bayesian model-based clustering procedures. Journal of Computational and Graphical Statistics 16 526–558. [Google Scholar]
  32. Linzer DA and Lewis JB (2011). poLCA: An R package for polytomous variable latent class analysis. Journal of Statistical Software 42 1–29. [Google Scholar]
  33. Liu V, Escobar GJ, Greene JD, Soule J, Whippy A, Angus DC and Iwashyna TJ (2014). Hospital deaths in patients with sepsis from 2 independent cohorts. JAMA 312 90–92. [DOI] [PubMed] [Google Scholar]
  34. Lock EF and Dunson DB (2013). Bayesian consensus clustering. Bioinformatics 29 2610–2616. [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Lu Z and Leen T (2004). Semi-supervised learning with penalized probabilistic clustering. Advances in Neural Information Processing Systems 17 849–856. [Google Scholar]
  36. Lu Z and Lou W (2022). Bayesian consensus clustering for multivariate longitudinal data. Statistics in Medicine 41 108–127. [DOI] [PubMed] [Google Scholar]
  37. McLachlan G and Peel D (2000). Finite Mixture Models. Wiley, New York, USA. [Google Scholar]
  38. Malsiner-Walli G, Frühwirth-Schnatter S and Grün B (2017). Identifying mixtures of mixtures using Bayesian estimation. Journal of Computational and Graphical Statistics 26 285–295. [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. Meilă M (2007). Comparing clusterings—an information based distance. Journal of Multivariate Analysis 98 873–895. [Google Scholar]
  40. Miller JW and Dunson DB (2019). Robust Bayesian Inference via Coarsening. Journal of the American Statistical Association 114 1113–1125. [DOI] [PMC free article] [PubMed] [Google Scholar]
  41. Opal SM, Dellinger RP, Vincent J-L, Masur H and Angus DC (2014). The next generation of sepsis trials: What’s next after the demise of recombinant human activated Protein C? Critical Care Medicine 42 1714. [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. Paganin S, Herring AH, Olshan AF and Dunson DB (2021). Centered partition processes: Informative priors for clustering (with discussion). Bayesian Analysis 16 301. [DOI] [PMC free article] [PubMed] [Google Scholar]
  43. Pitman J and Yor M (1997). The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. The Annals of Probability 25 855–900. [Google Scholar]
  44. Poulakis K, Ferreira D, Pereira JB, Smedby Ö, Vemuri P and Westman E (2020). Fully Bayesian longitudinal unsupervised learning for the assessment and visualization of AD heterogeneity and progression. Aging (Albany NY) 12 12622. [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Poulakis K, Pereira JB, Muehlboeck J-S, Wahlund L-O, Smedby Ö, Volpe G, Masters CL, Ames D, Niimi Y, Iwatsubo T et al. (2022). Multi-cohort and longitudinal Bayesian clustering study of stage and subtype in Alzheimer’s disease. Nature Communications 13 4566. [Google Scholar]
  46. Raftery AE (1995). Hypothesis testing and Model Selection. Markov Chain Monte Carlo in Practice 163. [Google Scholar]
  47. Raman S, Fuchs TJ, Wild PJ, Dahl E, Buhmann JM and Roth V (2010). Infinite mixture-of-experts model for sparse survival regression with application to breast cancer. BMC Bioinformatics 11 1–10. [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. Rand WM (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66 846–850. [Google Scholar]
  49. Rhodes A, Evans LE, Alhazzani W, Levy MM, Antonelli M, Ferrer R, Kumar A, Sevransky JE, Sprung CL, Nunnally ME et al. (2017). Surviving sepsis campaign: international guidelines for management of sepsis and septic shock: 2016. Intensive Care Medicine 43 304–377. [DOI] [PubMed] [Google Scholar]
  50. Richardson S and Green PJ (1997). On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society Series B: Statistical Methodology 59 731–792. [Google Scholar]
  51. Robert C and Mengersen K (1995). Reparameterization Issues in Mixture Modelling and their Bearings on the Gibbs Sampler Technical Report No. 9538, CREST, Insee Paris. [Google Scholar]
  52. Ross JC, Castaldi PJ, Cho MH, Chen J, Chang Y, Dy JG, Silverman EK, Washko GR and Estépar RSJ (2016). A Bayesian nonparametric model for disease subtyping: application to emphysema phenotypes. IEEE Transactions on Medical Imaging 36 343–354. [Google Scholar]
  53. Rousseau J and Mengersen K (2011). Asymptotic behaviour of the posterior distribution in overfitted mixture models. Journal of the Royal Statistical Society Series B: Statistical Methodology 73 689–710. [Google Scholar]
  54. Rowan KM, Angus DC, Bailey M, Barnato AE, Bellomo R, Canter RR, Coats TJ, Delaney A, Gimbel E, Grieve RD et al. (2017). Early, goal-directed therapy for septic shock–a patient-level meta-analysis. New England Journal of Medicine 376 2223–2234. [DOI] [PubMed] [Google Scholar]
  55. Rubach MP, Maro VP, Bartlett JA and Crump JA (2015). Etiologies of illness among patients meeting integrated management of adolescent and adult illness district clinician manual criteria for severe infections in northern Tanzania: implications for empiric antimicrobial therapy. The American Journal of Tropical Medicine and Hygiene 92 454. [DOI] [PMC free article] [PubMed] [Google Scholar]
  56. Rubio FJ and Steel MF (2014). Inference in two-piece location-scale models with Jeffreys priors. Bayesian Analysis 9 1–22. [Google Scholar]
  57. Rudd KE, Johnson SC, Agesa KM, Shackelford KA, Tsoi D, Kievlan DR, Colombara DV, Ikuta KS, Kissoon N, Finfer S et al. (2020). Global, regional, and national sepsis incidence and mortality, 1990–2017: analysis for the Global Burden of Disease Study. The Lancet 395 200–211. [Google Scholar]
  58. Savage RS, Ghahramani Z, Griffin JE, De la Cruz BJ and Wild DL (2010). Discovering transcriptional modules by Bayesian data integration. Bioinformatics 26 i158–i167. [DOI] [PMC free article] [PubMed] [Google Scholar]
  59. Scicluna BP, Van Vught LA, Zwinderman AH, Wiewel MA, Davenport EE, Burnham KL, Nürnberg P, Schultz MJ, Horn J, Cremer OL et al. (2017). Classification of patients with sepsis according to blood genomic endotype: a prospective cohort study. The Lancet Respiratory Medicine 5 816–826. [DOI] [PubMed] [Google Scholar]
  60. Seymour CW, Kennedy JN, Wang S, Chang C-CH, Elliott CF, Xu Z, Berry S, Clermont G, Cooper G, Gomez H et al. (2019). Derivation, validation, and potential treatment implications of novel clinical phenotypes for sepsis. JAMA 321 2003–2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  61. Singer M, Deutschman CS, Seymour CW, Shankar-Hari M, Annane D, Bauer M, Bellomo R, Bernard GR, Chiche J-D, Coopersmith CM et al. (2016). The third international consensus definitions for sepsis and septic shock (Sepsis-3). JAMA 315 801–810. [DOI] [PMC free article] [PubMed] [Google Scholar]
  62. Snavely ME, Maze MJ, Muiruri C, Ngowi L, Mboya F, Beamesderfer J, Makupa GF, Mwingwa AG, Lwezaula BF, Mmbaga BT et al. (2018). Sociocultural and health system factors associated with mortality among febrile inpatients in Tanzania: a prospective social biopsy cohort study. BMJ Global Health 3 e000507. [Google Scholar]
  63. Sweeney TE, Chen AC and Gevaert O (2015). Combined mapping of multiple clustering algorithms (communal): a robust method for selection of cluster number, K. Scientific Reports 5 16971. [DOI] [PMC free article] [PubMed] [Google Scholar]
  64. Sweeney TE, Azad TD, Donato M, Haynes WA, Perumal TM, Henao R, Bermejo-Martin JF, Almansa R, Tamayo E, Howrylak JA et al. (2018). Unsupervised analysis of transcriptomics in bacterial sepsis across multiple datasets reveals three robust clusters. Critical Care Medicine 46 915. [DOI] [PMC free article] [PubMed] [Google Scholar]
  65. Tadesse MG, Sha N and Vannucci M (2005). Bayesian variable selection in clustering high-dimensional data. Journal of the American Statistical Association 100 602–617. [Google Scholar]
  66. Tintinalli JE, Stapczynski JS, Ma O, Yealy D, Meckler G and Cline D (2016). Tintinalli’s Emergency Medicine: A Comprehensive Study Guide, 8e. McGraw Hill Education. [Google Scholar]
  67. Udler MS, Kim J, von Grotthuss M, Bonàs-Guarch S, Cole JB, Chiou J, on behalf of METASTROKE, C. D. A., THE ISGC, Boehnke M, Laakso M, Atzmon G et al. (2018). Type 2 diabetes genetic loci informed by multi-trait associations point to disease mechanisms and subtypes: a soft clustering analysis. PLoS Medicine 15 e1002654. [DOI] [PMC free article] [PubMed] [Google Scholar]
  68. van Buuren S and Groothuis-Oudshoorn K (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software 45 1–67. [Google Scholar]
  69. van der Poll T, van de Veerdonk FL, Scicluna BP and Netea MG (2017). The immunopathology of sepsis and potential therapeutic targets. Nature Reviews Immunology 17 407–420. [Google Scholar]
  70. Van Havre Z, White N, Rousseau J and Mengersen K (2015). Overfitting Bayesian mixture models with an unknown number of components. PloS One 10 e0131739. [DOI] [PMC free article] [PubMed] [Google Scholar]
  71. Wade S (2023). Bayesian cluster analysis. Philosophical Transactions of the Royal Society A 381 20220149. [Google Scholar]
  72. Wade S and Ghahramani Z (2018). Bayesian cluster analysis: Point estimation and credible balls (with discussion). Bayesian Analysis 13 559 – 626. [Google Scholar]
  73. Wang D, Quesnel-Vallieres M, Jewell S, Elzubeir M, Lynch K, Thomas-Tikhonenko A and Barash Y (2023). A Bayesian model for unsupervised detection of RNA splicing based subtypes in cancers. Nature Communications 14 63. [Google Scholar]
  74. Wasserman L (2000). Asymptotic inference for mixture models by using data-dependent priors. Journal of the Royal Statistical Society Series B: Statistical Methodology 62 159–180. [Google Scholar]
  75. Wilkerson MD and Hayes DN (2010). ConsensusClusterPlus: a class discovery tool with confidence assessments and item tracking. Bioinformatics 26 1572–1573. [DOI] [PMC free article] [PubMed] [Google Scholar]
  76. Wong HR, Sweeney TE, Hart KW, Khatri P and Lindsell CJ (2017). Pediatric sepsis endotypes among adults with sepsis. Critical Care Medicine 45 e1289. [DOI] [PMC free article] [PubMed] [Google Scholar]
  77. Yuan Y, Savage RS and Markowetz F (2011). Patient-specific data fusion defines prognostic cancer subtypes. PLoS Computational Biology 7 e1002227. [DOI] [PMC free article] [PubMed] [Google Scholar]
