Author manuscript; available in PMC 2023 Nov 13. Published in final edited form as: Biometrics. 2021 Nov 10;79(1):264–279. doi:10.1111/biom.13580

Integrating sample similarities into latent class analysis: a tree-structured shrinkage approach

Mengbing Li 1, Daniel E Park 3, Maliha Aziz 3, Cindy M Liu 3, Lance B Price 3, Zhenke Wu 1,2

Abstract

This paper is concerned with using multivariate binary observations to estimate the probabilities of unobserved classes with scientific meanings. We focus on the setting where additional information about sample similarities is available and represented by a rooted weighted tree. Every leaf in the given tree contains multiple samples. Shorter distances over the tree between the leaves indicate a priori higher similarity in class probability vectors. We propose a novel data integrative extension to classical latent class models with tree-structured shrinkage. The proposed approach enables (1) borrowing of information across leaves, (2) estimating data-driven leaf groups with distinct vectors of class probabilities, and (3) individual-level probabilistic class assignment given the observed multivariate binary measurements. We derive and implement a scalable posterior inference algorithm in a variational Bayes framework. Extensive simulations show more accurate estimation of class probabilities than alternatives that suboptimally use the additional sample similarity information. A zoonotic infectious disease application is used to illustrate the proposed approach. The paper concludes with a brief discussion on model limitations and extensions.

Keywords: Gaussian diffusion, latent class models, phylogenetic tree, spike-and-slab prior, variational Bayes, zoonotic infectious diseases

1 |. INTRODUCTION

1.1 |. Motivating application

The fields of infectious disease epidemiology and microbial ecology need better tools for tracing the transmission of microbes between humans and other vertebrate animals (i.e., zoonotic transmissions), especially for colonizing opportunistic pathogens (COPs). Unlike frank zoonotic pathogens (e.g., Salmonella, SARS-CoV-2), the epidemiology of COPs, such as Escherichia coli (E. coli), Staphylococcus aureus (S. aureus), and Enterococcus spp., can be particularly cryptic due to their ability to asymptomatically colonize the human body for indefinite periods prior to initiating an infection, transmitting to another person, or being shed without a negative outcome (e.g., Price et al., 2017). Some COPs can colonize many different vertebrate hosts and cross-species transmissions can go unrecognized. Estimating the probability of zoonotic origin for a population of isolates and for each isolate would provide important insights into the natural history of infections and inform more effective intervention strategies, such as eliminating high-risk clones from livestock via vaccination.

Scientists often have two complementary sources of information: (i) a phylogenetic tree constructed based on a few single nucleotide polymorphisms (SNPs) in the core genome shared by all isolates, where the leaves represent distinct core-genome multi-locus sequence types (STs, Maiden et al., 1998); the tree is useful for identifying a recent common ancestor for isolates that comprise an infectious disease outbreak; (ii) presence or absence of multiple mobile genetic elements (MGE) that provide selective advantages in particular hosts and may be lost and gained as COPs transmit among hosts (e.g., Lindsay and Holden, 2004).

Recent research on two COP species, E. coli and S. aureus, has demonstrated the utility of complementing core-genome phylogenetic trees with host-associated MGEs to resolve host origins (e.g., Liu et al., 2018). However, in both cases only a single host-associated MGE was used. Analyses were largely limited to visual inspection of how each element fell on the scaffold of the evolutionary tree. For this approach to reach its full potential, we would need a statistical model that can (1) integrate phylogenetic information with the presence and absence of multiple host-associated MGEs, and (2) estimate the probability with which the isolates were derived from a particular host in each ST-specific population and for each individual isolate.

1.2 |. Integrating sample similarities into latent class analysis

Based on multivariate binary data (e.g., presence or absence of multiple MGEs), we use latent class models (LCMs; e.g., Lazarsfeld, 1950; Goodman, 1974) to achieve the scientific goal of estimating the probabilities of unobserved host origins and perform individual-level probabilistic assignment of host origin. LCMs are examples of latent variable models that assume the observed dependence among multivariate discrete responses is induced by variation among unobserved or “latent” variables. It is well known that any multivariate discrete data distribution can be approximated arbitrarily closely by an LCM with a sufficiently large number of classes (Dunson and Xing, 2009, corollary 1). The most commonly used LCMs assume the class membership indicators for the observations are drawn from a population with the same vector of class probabilities.

Trees or hierarchies are useful and intuitive for representing and reasoning about similarity or relation among objects in many real-world domains. We assume known entities at the leaves. In our context, each leaf may contain multiple observations or samples, each associated with the multivariate binary responses that are then combined to form the rows of a binary data matrix Y. In the motivating application, the latent class indicates the unobserved host origin (human or nonhuman) to be inferred by the presence or absence of multiple MGEs. The additional sample similarity information is represented by a maximum likelihood phylogenetic tree (e.g., Scornavacca et al., 2020). The leaves represent distinct contemporary core-genome E. coli STs.

To integrate tree-encoded sample similarity information into a latent class analysis, ad hoc groupings of the leaves may be adopted. From the finest to the coarsest leaf grouping, one may (1) analyze data from distinct lineages one at a time, (2) manually form groups of at least one leaf node and fit separate LCMs, or (3) fit all the data by a single LCM. However, all these methods pose significant statistical challenges. First, separate latent class analyses may have low accuracy in estimating latent class probabilities and other model parameters for rare lineages. Second, observations of similar lineages may have similar propensities for host jump, resulting in similar host origin class probabilities; modeling these similarities could lead to gains in statistical efficiency. Third, approaches based on coarse ad hoc groupings may obscure the study of the variation in the latent class probabilities across different parts of the tree. Finally, based on a single LCM or other approaches that use ad hoc leaf groupings, individual-specific posterior class probabilities can be averaged within each leaf to produce a local estimate of the vector of class probabilities. However, this ad hoc post-processing cannot fully address the assessment of posterior uncertainty, nor data-driven grouping of leaves, necessitating the development of an integrative probabilistic modeling framework for uncertainty quantification and adaptive formation of leaf groups.

In this paper, we focus on integrating the tree-encoded sample similarity information into latent class analysis. We assume the tree information is given and not computed from the multivariate binary measurements. Observations in nearby leaves are assumed to have a priori similar propensities of being members of a particular class, as characterized by the latent class probabilities; for example, higher similarities are indicated by shorter pairwise distances between observations. More generally, classical covariate-dependent latent class models (e.g., Formann, 1992; Bandeen-Roche et al., 1997) let the latent class probabilities vary explicitly as functions of observed covariates, so that observations with more similar covariate values are assumed to have more similar latent class probabilities. Fully probabilistic tree-integrative methods have appeared in the machine learning literature (e.g., Roy et al., 2006; Ghahramani et al., 2010; Ranganath et al., 2015) and in statistics for modeling hierarchical topic annotations (e.g., Airoldi and Bischof, 2016) or hierarchical outcome annotations based on given trees (e.g., Thomas et al., 2019). In epidemiology, Avila et al. (2014) proposed a two-stage approach to link patient clusters estimated from the tree with those from the LCM, which, however, remains ad hoc. The existing literature does not address probabilistic tree-integrative latent class analysis or adaptive formation of leaf groups for dimension reduction.

1.3 |. Primary contributions

In this paper, we propose an unsupervised, tree-integrative LCM framework to (1) discover groups of leaves where multivariate binary measurements in distinct leaf groups have distinct vectors of latent class probabilities, and observations nested in any leaf group may belong to a pre-specified number of latent classes; (2) accurately estimate the latent class probabilities for each discovered leaf group and assign probabilities of an individual sample belonging to the latent classes; and (3) leverage the relationship among the observations as encoded by the tree to boost the accuracy of the estimation of latent class probabilities. Without pre-specifying the leaf groups, the automatic data-driven approach enjoys robustness by avoiding potential misspecification of the grouping structure. On the other hand, the discovered data-driven leaf groups dramatically reduce the many leaves to fewer subgroups, hence improving interpretation. In addition, the proposed approach shows better accuracy in estimating the latent class probabilities in terms of root mean squared errors, indicating the advantage of the shrinkage. On posterior computation, we derive a scalable inference algorithm based on variational inference (VI).

The rest of the paper is organized as follows. Section 2 defines tree-related terminologies and formulates LCMs. Section 3 proposes the prior for tree-structured shrinkage in LCMs. Section 4 derives a variational Bayes algorithm for inference. Section 5 compares the performances of the proposed and alternative approaches via simulations. Section 6 illustrates the approach by analyzing an E. coli data set. The paper concludes with a brief discussion.

2 |. MODEL

We first introduce necessary terminologies and notations to describe a rooted weighted tree. LCMs are then formulated for data on the leaves of the tree.

2.1 |. Rooted weighted trees

A rooted tree is a graph 𝒯 = (𝒱, E) with node set 𝒱 and edge set E, where there is a root u0 and each node has at most one parent node. Let p = |𝒱| represent the total number of leaf and nonleaf nodes. Let 𝒱_L ⊂ 𝒱 be the set of leaves, and p_L = |𝒱_L| < p. We typically use u to denote any node (u ∈ 𝒱) and v to denote any leaf (v ∈ 𝒱_L). Each edge in a rooted tree defines a clade: the group of leaves below it. Splitting the tree at an edge creates a partition of the leaves into two groups. For any node u ∈ 𝒱, the following notations apply: c(u) is the set of offspring of u, pa(u) is the parent of u, d(u) is the set of descendants of u including u, and a(u) is the set of ancestors of u including u. In Figure 3(a), if u = 2, then c(u) = {6, 7, 8}, pa(u) = {1}, d(u) = {2, 6, 7, 8}, and a(u) = {1, 2}. The phylogenetic tree in our motivating application is a nested hierarchy of 133 STs for N = 2663 observations, where the p_L = |𝒱_L| = 133 leaves represent distinct STs and the p − p_L = 132 internal (nonleaf) nodes represent ancestral E. coli strains leading up to the observed leaf descendants.
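To make these notions concrete, the following minimal base-R sketch encodes a rooted tree by a parent vector and recovers c(u), a(u), and d(u); the 8-node topology is a hypothetical fragment chosen only to be consistent with the Figure 3(a) example above (it is not the simulation tree itself):

    # Toy tree: node 1 is the root; nodes 6-8 are children of node 2.
    parent <- c(NA, 1, 1, 1, 1, 2, 2, 2)     # parent[u] = pa(u); NA marks the root
    children <- function(u) which(parent == u)              # c(u)
    ancestors <- function(u) {                              # a(u), including u
      anc <- u
      while (!is.na(parent[u])) { u <- parent[u]; anc <- c(u, anc) }
      anc
    }
    descendants <- function(u) {                            # d(u), including u
      desc <- u; frontier <- u
      while (length(frontier) > 0) {
        kids <- which(parent %in% frontier)
        desc <- c(desc, kids); frontier <- kids
      }
      sort(desc)
    }
    children(2)     # 6 7 8
    ancestors(2)    # 1 2
    descendants(2)  # 2 6 7 8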

FIGURE 3

Simulation studies show the proposed model produces grouped estimates π̂_v^dgrp with similar or smaller RMSEs than alternatives (see Section 5). This figure appears in color in the electronic version of this article, and any mention of color refers to that version

Edge-weighted graphs appear as a model for numerous problems where nodes are linked with edges of different weights. In particular, the edges in 𝒯 are attached with weights by a weight function w: E → ℝ+. Let 𝒯_w = (𝒯, w) be a rooted weighted tree. A path in a graph is a sequence of edges that joins a sequence of distinct vertices. For a path P in the tree connecting two nodes, w(P) is defined as the sum of all the edge weights along the path, often referred to as the "length" of P. The distance between two vertices u and u′, denoted by dist_{𝒯_w}(u, u′), is the length of a shortest (minimum-length) (u, u′)-path. dist_{𝒯_w} is a distance: it is symmetric and satisfies the triangle inequality. In our motivating application, the edge length represents the number of nucleotide substitutions per position; the distance between two nodes provides a measurement of the similarity or divergence between any two core-genome sequences of the input set. In this paper, we use w_u to represent the length of the edge between a node u and its parent node pa(u); w_u is fully determined by 𝒯_w. The root u0 has no parent, that is, pa(u0) = ∅; we set w_{u0} = 1.
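Under these definitions, the between-node distance reduces to summing the w_u along the two branches below the most recent common ancestor. A small sketch, reusing the hypothetical parent vector above with made-up edge weights:

    parent <- c(NA, 1, 1, 1, 1, 2, 2, 2)
    w <- c(1, 0.5, 0.7, 0.3, 0.4, 0.2, 0.6, 0.1)  # w[u] = weight of edge (u, pa(u)); w[root] = 1
    anc <- function(u) { p <- u; while (!is.na(parent[u])) { u <- parent[u]; p <- c(u, p) }; p }
    dist_tree <- function(u1, u2) {
      a1 <- anc(u1); a2 <- anc(u2)
      common <- intersect(a1, a2)   # shared ancestors, up to and including the MRCA
      sum(w[setdiff(a1, common)]) + sum(w[setdiff(a2, common)])
    }
    dist_tree(6, 7)  # 0.8: edges (6,2) and (7,2)
    dist_tree(6, 3)  # 1.4: edges (6,2), (2,1), and (3,1)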

2.2 |. Latent class models for data on the leaves

Although LCMs can handle multiple categorical responses in general, for simplicity of presentation we focus on the model and algorithm for multivariate binary responses and their application to the motivating data.

Notations

Let Y_i^(v) = (Y_{i1}^(v), …, Y_{iJ}^(v))′ ∈ {0,1}^J be the vector of binary responses for observation i ∈ [n_v] nested within leaf node v ∈ 𝒱_L, where n_v is the number of observations in leaf v. Throughout this paper, [Q] ≔ {1, …, Q} denotes the set of positive integers smaller than or equal to a positive integer Q. Let Y^(v) = {Y_1^(v), …, Y_{n_v}^(v)} be the data from observations in leaf v. Let Y = {Y^(1), …, Y^(p_L)} represent the binary data matrix with N = Σ_{v∈𝒱_L} n_v rows and J columns. Let 𝒞 = (v_1, …, v_N) be the "sample-to-leaf indicators" that map every row of the data Y into a leaf in 𝒯_w. Sample similarities are then characterized by between-leaf distances in 𝒯_w. In this paper, we assume 𝒞 and 𝒯_w are given and focus on incorporating (𝒞, 𝒯_w) into a statistical model for Y.

LCM for Data on the Leaves

The LCM is specified in two steps:

class indicator: I_i^(v) | π_v ~ Categorical_K(π_v), π_v ∈ S^{K−1},  (1)
data: Y_ij^(v) | I_i^(v) ~ Bernoulli(θ_{j, I_i^(v)}), independently for feature j ∈ [J],  (2)

and independently for observation i ∈ [n_v] and leaf node v ∈ 𝒱_L. Here K is a pre-specified number of latent classes in the context of the application, that is, K = 2 for unobserved human and nonhuman hosts; see Section 4.1 for a simple strategy in applications where a data-driven K is desired. In addition, I = {I_i^(v) : i ∈ [n_v], v ∈ 𝒱_L} represents the latent class indicators and Z_{ik}^(v) = 1{I_i^(v) = k}, k ∈ [K], where 1{A} is an indicator function that equals 1 if statement A is true and 0 otherwise; let Z = {Z_{ik}^(v)}. We have assumed observations in different leaves have potentially different vectors of class probabilities π_v = (π_{v1}, …, π_{vK})′ ∈ S^{K−1}, v ∈ 𝒱_L, where S^{K−1} = {r ∈ [0,1]^K : Σ_{k=1}^K r_k = 1} is the probability simplex. θ_{jk} ∈ [0,1] is the positive response probability for feature j ∈ [J] in class k ∈ [K]. In our motivating application, the MGEs adapt to the unobserved type of host origin (i.e., latent class) that can be characterized by class-specific response probability profiles θ_k = (θ_{1k}, …, θ_{Jk})′, k ∈ [K]; let Θ = {θ_1, …, θ_K}. Because the latent class indicators I_i^(v) are assumed unobserved, the observed data likelihood for the N observations is ∏_{v∈𝒱_L} ∏_{i=1}^{n_v} Σ_{k=1}^K π_{vk} P(Y_i^(v) | I_i^(v) = k, θ_k).
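As a concrete illustration of the generative model (1)–(2) and the observed data likelihood, the sketch below simulates two leaves and evaluates the log-likelihood at the truth; the leaves, π_v, and Θ are all made up for illustration (K = 2, J = 5), and none of this is the lotR package API:

    set.seed(1)
    K <- 2; J <- 5
    theta <- cbind(c(.9, .8, .7, .2, .1),        # theta[j, k] = P(Y_ij = 1 | class k)
                   c(.1, .2, .3, .8, .9))
    pi_v <- list(v1 = c(.9, .1), v2 = c(.3, .7)) # leaf-specific class probabilities
    n_v  <- c(v1 = 50, v2 = 50)
    sim_leaf <- function(n, pi) {
      I <- sample(K, n, replace = TRUE, prob = pi)             # class indicators, (1)
      Y <- t(sapply(I, function(k) rbinom(J, 1, theta[, k])))  # binary responses, (2)
      list(Y = Y, I = I)
    }
    dat <- Map(sim_leaf, n_v, pi_v)
    # Observed-data log-likelihood: sum_i log sum_k pi_vk * P(Y_i | class k)
    loglik <- sum(mapply(function(d, pi) {
      like_k <- sapply(1:K, function(k)
        apply(d$Y, 1, function(y) prod(theta[, k]^y * (1 - theta[, k])^(1 - y))))
      sum(log(like_k %*% pi))
    }, dat, pi_v))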

Throughout this paper, we assume that we wish to classify individuals into K classes with the same set of profiles θ_1, …, θ_K so that the classes have a coherent interpretation across leaves. However, we do not assume that observations are drawn from a population with a single vector of latent class probabilities. Figure 1 provides a schematic of the data generating mechanism given π_v for three leaves.

FIGURE 1

Schematic representation of a hypothetical rooted weighted tree with three leaves and data generated based on the proposed model with K = 3 latent classes, n_{v1} = 2, n_{v2} = 4, n_{v3} = 2, and J = 8. This figure appears in color in the electronic version of this article, and any mention of color refers to that version

3 |. PRIOR DISTRIBUTION

We first specify a prior distribution for πv:v𝒱L. Because leaf-specific sample sizes may vary, we propose a tree-structured prior to borrow information across nearby leaves. The prior encourages collapsing certain parts of the tree so that observations within a collapsed leaf group share the same vector of latent class probabilities. In particular, we extend Thomas et al. (2019) to deal with rooted weighted trees in an LCM setting. The prior specification is completed by priors for the class-specific response probabilities Θ.

Tree-structured prior for latent class probabilities πv

We specify a spike-and-slab Gaussian diffusion process prior along the rooted weighted tree based on a logistic stick-breaking parameterization of π_v. We first reparameterize π_v with a stick-breaking representation: π_vk = V_vk ∏_{s<k} (1 − V_vs) for k ∈ [K], where 0 ≤ V_vk ≤ 1 for k ∈ [K−1] and V_vK = 1.

We further logit-transform V_vk, k ∈ [K−1], to facilitate the specification of a Gaussian diffusion process prior without range constraints. In particular, let η_vk = σ^{-1}(V_vk), k ∈ [K−1], v ∈ 𝒱_L, where σ(x) = 1/{1 + exp(−x)} is the sigmoid function. The logistic stick-breaking parameterization is completed by

π_vk = {σ(η_vk)}^{1{k<K}} ∏_{s<k} {1 − σ(η_vs)}, k ∈ [K],  (3)

which affords simple and accurate posterior inference via variational Bayes (see Section 4).
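For reference, a direct transcription of (3) in R, mapping a free vector η_v ∈ ℝ^{K−1} to a probability vector π_v on the simplex:

    sigmoid <- function(x) 1 / (1 + exp(-x))
    eta_to_pi <- function(eta) {
      K <- length(eta) + 1
      V <- c(sigmoid(eta), 1)                  # stick-breaking proportions; V_vK = 1
      sapply(1:K, function(k) V[k] * prod(1 - V[seq_len(k - 1)]))
    }
    eta_to_pi(c(0, 0))   # 0.50 0.25 0.25; the output always sums to 1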

For a leaf v𝒱L, let

η_vk = Σ_{u ∈ a(v)} ξ_uk, k ∈ [K−1].  (4)

Here η_vk is defined for leaves only, whereas ξ_uk is defined for all nodes. Suppose v and v′ are sibling leaves, that is, pa(v) = pa(v′); setting ξ_vk = ξ_{v′k} = 0 implies η_vk = η_{v′k} for k ∈ [K−1], and hence π_v = π_{v′}. More generally, a sufficient condition for M leaves v ∈ {v_1, …, v_M} to fuse (η_{v_1 k} = ⋯ = η_{v_M k}) is ξ_uk = 0 for every u that is an ancestor of some but not all of v_1, …, v_M. That is, in our model, grouping observations so that they share the same vector of latent class probabilities is equivalent to parameter fusing. In the following, we specify a prior on the ξ_uk that a priori encourages sparsity, so that closely related observations are likely grouped to have the same vector of class probabilities. The fewer distinct (noncommon) ancestors two leaves have, the more likely their parameters η_vk are fused, because the prior then needs fewer auxiliary variables ξ_uk to be set to zero. In particular, we specify

ξ_uk = s_u α_uk, u ∈ 𝒱,  (5)
α_uk ~ N(0, τ_{1kℓ_u} w_u), independently for k ∈ [K−1], u ∈ 𝒱,  (6)
s_{u0} = 1, and s_u ~ Bernoulli(ρ_{ℓ_u}), independently for u ∈ 𝒱 ∖ {u0},  (7)
ρ_ℓ ~ Beta(a_ℓ, b_ℓ), independently for ℓ ∈ [L],  (8)

where N(m, s) represents a Gaussian density with mean m and variance s. τ_{1kℓ_u} is the unit-length variance that controls the degree of diffusion along the tree; it may differ by dimension k and by node level ℓ_u ∈ [L], where ℓ_u represents the "level" or "hyperparameter set indicator" of node u. For example, in simulations and data analysis, we will assume that the root of the diffusion process has a prior unit-length variance distinct from the other, nonroot nodes. For the root u0, s_{u0} = 1 and α_{u0k} initializes the diffusion of the η_{·k}.

Leaf groups are formed by selecting a subset of nodes 𝒰 = {u ∈ 𝒱 : s_u = 1}. Except on a probability-zero set, two leaves v and v′ are grouped, or "fused," if and only if a(v) ∩ 𝒰 = a(v′) ∩ 𝒰. In particular, when a(v) ∩ 𝒰 ≠ a(v′) ∩ 𝒰, the event {η_vk = η_{v′k}, k ∈ [K−1]} requires Σ_{u ∈ a(v) ∖ a(v′)} ξ_uk = Σ_{u ∈ a(v′) ∖ a(v)} ξ_uk, which has probability zero. In Section 4.1, we will estimate 𝒰 using the posterior median model.
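A small sketch of this fusing rule: given the toy tree from Section 2.1 and a hypothetical selected set 𝒰, leaves are grouped by the signature a(v) ∩ 𝒰:

    parent <- c(NA, 1, 1, 1, 1, 2, 2, 2)
    leaves <- c(3, 4, 5, 6, 7, 8)
    anc <- function(u) { p <- u; while (!is.na(parent[u])) { u <- parent[u]; p <- c(u, p) }; p }
    U <- c(1, 2)                                 # hypothetical nodes with s_u = 1
    sig <- sapply(leaves, function(v) paste(intersect(anc(v), U), collapse = "-"))
    split(leaves, sig)   # two groups: {3, 4, 5} (signature "1") and {6, 7, 8} ("1-2")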

Remark 1. Equations (4)–(8) define a Gaussian diffusion process initiated at α_{u0k}:

η_uk | {ξ_{u′k} : u′ ∈ a(pa(u))} ~ N(Σ_{u′ ∈ a(pa(u))} ξ_{u′k}, s_u τ_{1kℓ_u} w_u), independently for k ∈ [K−1],  (9)

for any nonroot node u ≠ u0; also see the seminal formulation by Felsenstein (1985). To aid the understanding of this Gaussian diffusion prior, it is helpful to consider the special case of s_u = 1 and ℓ_u = 1 for all u ∈ 𝒱. For two leaves v, v′ ∈ 𝒱_L, the prior correlation between η_vk and η_{v′k} is

Corr(η_vk, η_{v′k}) = Σ_{u ∈ a(v) ∩ a(v′)} w_u / {dist_{𝒯_w}(u0, v) · dist_{𝒯_w}(u0, v′)}^{1/2}.  (10)

When v and v′ have the same number of ancestors, |a(v)| = |a(v′)|, and all edges have identical weight w_u = c for all u, the prior correlation is the fraction of common ancestors. Note that η_v fully determines π_v in (3) and hence induces correlations among {π_v, v ∈ 𝒱_L}.
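Formula (10) can be checked by direct simulation. Below is a Monte Carlo sketch on the toy tree, under s_u = 1, a single hyperparameter level, and unit τ, so that η_vk = Σ_{u∈a(v)} α_uk with α_uk ~ N(0, w_u); note that the sums over a(v) used here include the root weight w_{u0} = 1, matching the variance of η_vk:

    set.seed(2)
    parent <- c(NA, 1, 1, 1, 1, 2, 2, 2)
    w <- c(1, 0.5, 0.7, 0.3, 0.4, 0.2, 0.6, 0.1)
    anc <- function(u) { p <- u; while (!is.na(parent[u])) { u <- parent[u]; p <- c(u, p) }; p }
    B <- 1e5
    alpha <- sapply(seq_along(parent), function(u) rnorm(B, 0, sqrt(w[u])))
    eta_v  <- rowSums(alpha[, anc(6)])           # leaf v = 6
    eta_vp <- rowSums(alpha[, anc(7)])           # leaf v' = 7
    cor(eta_v, eta_vp)                           # empirical correlation, ~0.79
    shared <- intersect(anc(6), anc(7))          # a(v) ∩ a(v') = {1, 2}
    sum(w[shared]) / sqrt(sum(w[anc(6)]) * sum(w[anc(7)]))   # matches the formula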

Remark 2. One reviewer raised an important question on the choice of encouraging prior correlation among the π_v rather than among the latent class indicators I_i^(v). In the present prior distribution, by integrating out π_v, we have induced prior marginal correlation among the I_i^(v) for observations in nearby leaves. Additional prior correlation among the I_i^(v) could be introduced via an additional layer of prior over the I_i^(v) conditional on π_v, for example, through clustered sampling. The absence of such a clustered sampling structure in the motivating application points us toward the former, simpler strategy.

Priors for class-specific response probabilities

Let γ_jk = log{θ_jk / (1 − θ_jk)}. We specify

γ_jk ~ N(0, τ_{2jk}), independently for feature j ∈ [J] and class k ∈ [K].  (11)

Joint distribution

Let β = (Z, s, γ, α, ϱ) collect all the unknown parameters, where

s = {s_u : u ∈ 𝒱}, γ = {γ_jk : j ∈ [J], k ∈ [K]}, α = {α_uk : u ∈ 𝒱, k ∈ [K−1]}, ϱ = (ρ_1, …, ρ_L), a = (a_1, …, a_L),

and b = (b_1, …, b_L). Hereafter we use pr(A | B) to denote a probability density or mass function of quantities in A with parameters B; when B represents hyperparameters or given information in this paper, we simply write pr(A); for example, we will use pr(Y, β) to represent pr(Y, β | τ_1, τ_2, a, b, 𝒯_w, 𝒞). The joint distribution of data and unknown quantities can thus be written as

pr(Y | β) pr(β) = ∏_{v∈𝒱_L} ∏_{i=1}^{n_v} ∏_{k=1}^K [ {σ(η_vk)}^{1{k<K}} ∏_{s<k} {1 − σ(η_vs)} × ∏_{j=1}^J σ(X_ij^(v) γ_jk) ]^{Z_ik^(v)}  (12)
× ∏_{u∈𝒱} ∏_{k=1}^{K−1} (2π τ_{1kℓ_u} w_u)^{−1/2} exp{ −α_uk² / (2 τ_{1kℓ_u} w_u) } × ∏_{j=1}^J ∏_{k=1}^K (2π τ_{2jk})^{−1/2} exp{ −γ_jk² / (2 τ_{2jk}) } × ∏_{u∈𝒱} ρ_{ℓ_u}^{s_u} (1 − ρ_{ℓ_u})^{1−s_u} × ∏_{ℓ=1}^L {Beta(a_ℓ, b_ℓ)}^{−1} ρ_ℓ^{a_ℓ−1} (1 − ρ_ℓ)^{b_ℓ−1},  (13)

where X_ij^(v) = 2Y_ij^(v) − 1. Tree information 𝒯_w enters the joint distribution through the definition of the η_v (Equation 4); the sample-to-leaf indicators 𝒞 choose among {η_v, v ∈ 𝒱_L} for every observation in Equation (12). By setting s_u = 0 for all the nonroot nodes in Equation (5), the classical LCM with a single π = π_{u0} results. Figure 2 shows a directed acyclic graph (DAG) that represents the model likelihood and prior specifications.

FIGURE 2

The directed acyclic graph (DAG) representing the structure of the model likelihood and priors. The quantities in squares are either data or hyperparameters; the unknown quantities are shown in circles. The arrows connecting variables indicate that the parent parameterizes the distribution of the child node (solid lines) or completely determines the value of the child node (double-stroke arrows). The rectangular "plates" enclosing variables indicate that a similar graphical structure is repeated over an index; the indices of the plates run over nodes, hyperparameter levels, leaves, subjects, classes, and features. This figure appears in color in the electronic version of this article, and any mention of color refers to that version

4 |. VARIATIONAL INFERENCE ALGORITHM

Calculating a posterior distribution often involves intractable high-dimensional integration over the unknowns in the model. Traditional sequential sampling approaches such as Markov chain Monte Carlo (MCMC) remain widely used inferential tools based on approximate samples from the posterior distribution. They can be powerful for evaluating multidimensional integrals, but they do not produce closed-form posterior distributions. Variational inference (VI) is a popular alternative to MCMC for approximating the posterior distribution; it has been widely used in machine learning and is gaining interest in statistics (e.g., Ormerod and Wand, 2010; Blei et al., 2017). In particular, VI has also been used for fitting classical LCMs (e.g., Grimmer, 2011). VI requires a user-specified family of distributions that can be expressed in tractable forms while being flexible enough to approximate the true posterior; the approximating distributions and their parameters are referred to as "variational distributions" and "variational parameters," respectively. VI algorithms find the best variational distribution by minimizing the Kullback-Leibler (KL) divergence to the true posterior distribution. VI has been widely applied to Gaussian (Titsias and Lázaro-Gredilla, 2011; Carbonetto and Stephens, 2012) and binary likelihoods (e.g., Jaakkola and Jordan, 2000; Thomas et al., 2019); also see Blei et al. (2017) for a detailed review. We use VI because it is fast, bypasses infeasible analytic integration or data augmentation that would otherwise be needed for MCMC under Dirac spike components and prior-likelihood nonconjugacy (Tüchler, 2008), and enables data-driven selection of hyperparameters via approximate empirical Bayes (Equation (S8), Supporting Information). These advantages of VI come at the cost of slight variance-covariance underestimation, the degree of which we assess in Section 5.

We use a VI algorithm to conduct inference, using variational distributions factorized as

q(β) = q(γ) ∏_{u∈𝒱} q(s_u, α_u) ∏_{v∈𝒱_L} ∏_{i=1}^{n_v} q(Z_i^(v)) ∏_{ℓ=1}^L q(ρ_ℓ),  (14)

where q(Z_i^(v)) is a multinomial distribution with variational parameters r_i^(v) = (r_{i1}^(v), …, r_{iK}^(v)); r_{ik}^(v) represents the approximate posterior probability of observation i in leaf v belonging to class k, with Σ_{k=1}^K r_{ik}^(v) = 1. Importantly, we make no other assumptions about the particular parametric forms of the variational distributions, which by the VI updating rules can be shown to take familiar distributional forms (see Section A1, Supporting Information).

VI finds the q that minimizes the KL divergence to the true posterior: KL(q(β) || pr(β | Y)) = −∫ q(β) log{pr(β | Y)/q(β)} dβ. However, the KL divergence depends on the intractable posterior distribution and is not easily computed. Fortunately, by a well-known identity, log pr(Y) = ℒ(q) + KL(q(β) || pr(β | Y)), where ℒ(q) = ∫ q(β) log{pr(Y, β)/q(β)} dβ is referred to as the evidence lower bound (ELBO) because log pr(Y) ≥ ℒ(q). Because pr(Y) is a constant, minimizing the KL divergence is equivalent to maximizing ℒ(q). The VI algorithm updates each component of q(β) in turn while holding the other components fixed. However, because of the nonlinear sigmoid functions in Equation (12), generic VI updates for q(s_u, α_u) and q(γ) involve integrating over random variables inside the sigmoid function and hence lack closed forms. To make the updates analytically tractable, we replace Equation (12) with an analytically tractable lower bound. In particular, we use a technique introduced by Jaakkola and Jordan (2000) that bounds the sigmoid function from below by a Gaussian kernel with a tuning parameter, hence affording closed-form VI updates; also see Durante and Rigon (2019) for a modern view of this technique as a bona fide mean-field approximation with Pólya-Gamma data augmentation. In particular, we will use the inequality

σ(x) ≥ σ(ψ) exp{ (x − ψ)/2 − g(ψ)(x² − ψ²) } ≕ h(x, ψ),  (15)

with g(ψ) = {σ(ψ) − 1/2}/(2ψ), where ψ is a tuning parameter.
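The bound is easy to verify numerically; a short sketch (the tuning value ψ = 2 is arbitrary):

    sigmoid <- function(x) 1 / (1 + exp(-x))
    g <- function(psi) (sigmoid(psi) - 0.5) / (2 * psi)
    h <- function(x, psi) sigmoid(psi) * exp((x - psi) / 2 - g(psi) * (x^2 - psi^2))
    x <- seq(-6, 6, by = 0.1)
    all(h(x, psi = 2) <= sigmoid(x) + 1e-12)   # TRUE: h lower-bounds the sigmoid
    h(2, 2) - sigmoid(2)                       # ~0: the bound is tight at x = psi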

We approximate the ELBO ℒ(q) by ℒ*(q):

ℒ*(q) = ∫ q(β) log [ h*(X, ψ, γ, Z) h**(ϕ, s, α, Z) pr(s, γ, α, ϱ) / q(β) ] dβ ≤ ℒ(q),  (16)

where

h*(X, ψ, γ, Z) = ∏_{v∈𝒱_L} ∏_{i=1}^{n_v} ∏_{k=1}^K { ∏_{j=1}^J h(X_ij^(v) γ_jk, ψ_jk) }^{Z_ik^(v)},

and

h**(ϕ, s, α, Z) = ∏_{v∈𝒱_L} ∏_{i=1}^{n_v} ∏_{k=1}^K [ {h(η_vk, ϕ_k^(v))}^{1{k<K}} ∏_{m<k} h(−η_vm, ϕ_m^(v)) ]^{Z_ik^(v)}.

The VI algorithm iterates until convergence to find the optimal variational distribution q that maximizes ℒ*(q). Because ℒ*(q) ≤ log pr(Y), it can be viewed as an approximation to the log marginal likelihood. We maximize over ψ and ϕ to obtain the best approximation. In addition, we adopt an approximate empirical Bayes approach by optimizing the VI objective function ℒ*(q) over the hyperparameters τ_1 and τ_2. Relative to specifying weakly informative but often nonconjugate hyperpriors for the variance parameters, optimizing the hyperparameters is more convenient in practice (e.g., Thomas et al., 2019). Because updating the hyperparameters changes the prior, we then need to update q, ψ, and ϕ again. This leads to an algorithm that alternates between maximizing ℒ*(q) in (q, ψ, ϕ) and in (τ_1, τ_2) until convergence; we update the hyperparameters every d complete VI iterations. Pseudocode in Algorithm 1 outlines the VI updates; Section A1 in the Supporting Information details the exact updating formulas.

4.1 |. Posterior summaries

Two sets of point and interval estimates for {π_v : v ∈ 𝒱_L} are available from the VI algorithm: (1) data-driven grouped ("fused") estimates π̂_v^dgrp, formed by setting a subset of s to one and the rest to zero, and (2) leaf-specific estimates π̂_v^leaf. For (1), we select the posterior median model by setting s_u = 1 for nodes in 𝒰̂ = {u : E_q[s_u] > 0.5} (see Step 1b, Section A1, Supporting Information). For leaves v and v′, with probability one, π̂_v^dgrp = π̂_{v′}^dgrp if and only if a(v) ∩ 𝒰̂ = a(v′) ∩ 𝒰̂. Because no closed-form posterior distributions for π_v are readily available under the logistic stick-breaking parameterization, we compute the approximate posterior means and approximate 95% credible intervals (CrIs) by a Monte Carlo procedure after convergence of Algorithm 1. For u ∈ 𝒰̂, we first draw B = 10^5 independent samples of α_uk from N(E_q[α_uk | s_u = 1], V_q[α_uk | s_u = 1]), for k ∈ [K−1]. We then compute the B corresponding π_v vectors based on Equations (3)–(5) with s_u = 1{u ∈ 𝒰̂} in (5). Finally, we compute the empirical means and 95% CrIs marginally for π_vk, k ∈ [K]. This Monte Carlo procedure is extremely fast because only independent Gaussian samples are drawn. As a comparison, for (2), we define leaf-specific estimates π̂_v^leaf by the mean of (3) where η_vk ~ N(Σ_{u∈a(v)} E_q[s_u α_uk], Σ_{u∈a(v)} V_q[s_u α_uk]), for k ∈ [K−1]. We again use Monte Carlo simulation to approximate the posterior means and 95% CrIs. In general, the π̂_v^leaf differ across the leaves. In contrast, the data-driven grouped estimates π̂_v^dgrp induce dimension reduction.
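A sketch of this post-convergence Monte Carlo step for K = 2 (a single stick, so π_v = (σ(η_v1), 1 − σ(η_v1))); the selected nodes and variational means/variances below are placeholders standing in for converged algorithm output, not values from the lotR package:

    set.seed(3)
    B <- 1e5
    U_hat   <- c(1, 2)            # hypothetical nodes with E_q[s_u] > 0.5
    m_alpha <- c(0.4, -0.8)       # placeholder E_q[alpha_uk | s_u = 1], u in U_hat
    v_alpha <- c(0.05, 0.08)      # placeholder V_q[alpha_uk | s_u = 1]
    alpha <- sapply(1:2, function(j) rnorm(B, m_alpha[j], sqrt(v_alpha[j])))
    sigmoid <- function(x) 1 / (1 + exp(-x))
    pi1 <- sigmoid(rowSums(alpha))   # pi_v1 for a leaf with a(v) ∩ U_hat = {1, 2}
    c(mean = mean(pi1), quantile(pi1, c(0.025, 0.975)))  # posterior mean and 95% CrI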

Prediction

The out-of-sample predictive probability of class k for a new observation i′ nested in leaf v is r_{i′k}^(v) ≐ pr(I_{i′}^(v) = k | Y_{i′}^(v), 𝒟), where 𝒟 = {Y, 𝒯_w, 𝒞, a, b, τ_1, τ_2}. We have

r_{i′k}^(v) = ∫∫ pr(I_{i′}^(v) = k | θ_k, π_v, Y_{i′}^(v), 𝒟) [term (i)] × pr(θ_k, π_v | Y_{i′}^(v), 𝒟) [term (ii)] dθ_k dπ_v.  (17)

We approximate (17) by plug-in estimators: r̂_{i′k}^(v) ∝ pr(Y_{i′}^(v) | I_{i′}^(v) = k, θ̂_k) π̂_vk, k ∈ [K]. This can be seen by noting that term (i) ∝ pr(Y_{i′}^(v) | I_{i′}^(v) = k, θ_k) π_vk, and that term (ii), pr(θ_k, π_v | 𝒟), is approximated by a Dirac measure at (θ̂_k, π̂_v). Here π̂_v = π̂_v^dgrp.
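A minimal sketch of the plug-in rule, with hypothetical fitted values θ̂ and π̂_v:

    predict_class <- function(y, theta_hat, pi_hat) {
      # P(Y = y | class k) under local independence, times pi_hat_vk, normalized
      like_k <- apply(theta_hat, 2, function(th) prod(th^y * (1 - th)^(1 - y)))
      un <- pi_hat * like_k
      un / sum(un)
    }
    theta_hat <- cbind(c(.9, .8, .2), c(.1, .2, .8))  # hypothetical profiles, J = 3
    predict_class(y = c(1, 1, 0), theta_hat, pi_hat = c(.5, .5))  # r-hat for both classes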

Choice of K

In applications where data-driven selection of K is desirable, we may follow Bishop (2006) and use the criterion ℒ*_K(q) + log(K!), where ℒ*_K(q) is the lower bound of the log marginal data likelihood for a K-class model and the correction term makes models with different K comparable (e.g., Grimmer, 2011, section 5.2).

5 |. SIMULATION

5.1 |. Design and performance metrics

We conducted a simulation study to evaluate the performance of the proposed tree-integrative LCM. We compare our model to a few alternatives with ad hoc groupings of observations in terms of accuracy in estimating π_v, v ∈ 𝒱_L. Data were generated under two scenarios with different class-specific response profiles Θ. Section A2 in the Supporting Information details the true parameter settings of the simulations. Figure 3(a) visualizes the tree 𝒯_w with equal edge weights and the true leaf groups used in the simulation, with p_L = 11 leaves and G = 3 groups.

We simulated R = 200 independent replicate data sets for different total sample sizes (N = 1000, 4000). For each N, we set n_v ≈ N/p_L for v ∈ 𝒱_L (with rounding where needed) to investigate balanced leaves, and set n_v to be approximately (1/5)·N/p_L or (4/5)·N/p_L with equal chance to mimic unbalanced observations across leaves. For observations in a leaf v, we simulate Y_i^(v) according to an LCM with class probabilities π_v and class-specific response probabilities Θ. We simulated data for different dimensions J = 21, 84, with K = 3 classes.

For each simulated data set, we fitted the proposed model, based on which we computed π̂_v^dgrp and π̂_v^leaf (see Section 4.1). Our primary interest is in π̂_v^dgrp; the π̂_v^leaf are for comparison. In addition, we also tested a few approaches based on ad hoc leaf groupings: (1) true grouping analysis (fit separate LCMs to obtain estimates in each of the true groups); (2) single-group LCM analysis (omit the sample-to-leaf indicators 𝒞, hence the tree information); (3) ad hoc grouping 1 (a manual grouping coarser than the true grouping); (4) ad hoc grouping 2 (classical LCMs fitted to the data in each leaf). All analyses assume Θ does not vary by leaf.


We used three model performance metrics. First, we computed the root mean squared error (RMSE) for an estimate π̂_v: RMSE(π̂_v) = [ (K p_L)^{-1} Σ_{k=1}^K Σ_{v∈𝒱_L} (π̂_vk − π_vk)² ]^{1/2}. Second, we compared the true and the estimated leaf groupings via the adjusted Rand index (ARI, Hubert and Arabie, 1985). ARI is a chance-corrected index that takes values between −1 and 1, with values closer to 1 indicating better agreement. Finally, we estimated the coverage probability of the approximate 95% CrIs: for each true group g, we compute the frequency with which the approximate 95% CrI (computed along with π̂_v^dgrp) contains the truth, conditional on the event that an estimated partition of the leaf nodes includes g.
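For transparency, both metrics are simple to compute; a sketch with a hand-rolled ARI (so no package dependency is assumed):

    rmse_pi <- function(pi_hat, pi_true) {   # pi_hat, pi_true: p_L x K matrices
      sqrt(mean((pi_hat - pi_true)^2))       # equals the RMSE formula above
    }
    adj_rand <- function(g1, g2) {           # ARI between two leaf partitions
      tab <- table(g1, g2)
      a  <- sum(choose(tab, 2))
      b  <- sum(choose(rowSums(tab), 2)); c2 <- sum(choose(colSums(tab), 2))
      e  <- b * c2 / choose(sum(tab), 2)     # expected index under chance
      (a - e) / ((b + c2) / 2 - e)
    }
    adj_rand(c(1, 1, 2, 2, 3), c(1, 1, 2, 2, 2))  # ~0.55: partial agreement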

5.2 |. Simulation results

Figure 3 shows comparisons among the RMSEs for the different models under different scenarios. For sample sizes N = 1000 and N = 4000, the proposed method with data-driven grouping (π̂_v^dgrp) produced similar or better RMSEs than analyses based on ad hoc leaf groupings, which restrict leaves into incorrect groupings that are coarser (single LCM and ad hoc grouping 1) or finer (ad hoc grouping 2) than the truth. The proposed π̂_v^dgrp achieved RMSEs similar to π̂_v^leaf, indicating little accuracy was lost in exchange for dimension reduction. The RMSEs of π̂_v^dgrp were also similar to those of estimates of π_v, v ∈ 𝒱_L, obtained from analyses based on the true leaf grouping. Indeed, the accuracy of group discovery increased with sample size, other settings being fixed. Average ARIs across replications for each scenario were high (0.94 to 0.99), indicating good recovery of the true leaf groups. Although the discovered groups were not perfect, the comparable RMSEs suggest desirable adaptivity of the proposed approach in effectively collapsing the leaves. The RMSE of π̂_v^dgrp was smaller than that of analyses based on the refined leaf-level grouping: smaller sample sizes in the leaves resulted in a loss of efficiency when π_v was estimated separately across leaves. RMSEs were further reduced under a larger J or balanced sample sizes in the leaves; we again observed a similar relative advantage of the proposed π̂_v^dgrp. The relative comparisons of RMSEs under less discrepant true class-specific response profiles remained similar (see Figure S2, Supporting Information).

The observed coverage rates of the approximate 95% CrIs achieved the nominal level satisfactorily (see Figure S1, Supporting Information). Slight under-coverage occurred under smaller N, unbalanced sample sizes, smaller J and leaf groups with smaller number of observations. This is partially a consequence of VI as an inner approximation to the posterior distribution that may underestimate the posterior uncertainty (e.g., Bishop, 2006, chapter 10).

Finally, we also considered scenarios where only a single group of leaves is present in truth for which the classical LCM is perfectly appropriate. Figure S3 in the Supporting Information shows, by learning the posterior node-specific slab-versus-spike selection probabilities, the proposed model produces similar RMSEs as the classical LCM.

6 |. E. COLI DATA APPLICATION

6.1 |. Background and data

E. coli infections cause millions of urinary tract infections (UTIs) in the United States each year (e.g., Johnson and Russo, 2002). Many studies have shown that extraintestinal pathogenic E. coli (ExPEC) strains routinely colonize food animals and contaminate the food supply chain serving as a likely link between food-animal E. coli and human UTIs (e.g., Johnson et al., 2005). The scientific team adopted a novel strategy of augmenting fine-scale core-genome phylogenetics with interrogation of accessory host-adaptive MGEs (see Section 1.1). The scientific goal is to accurately estimate the probabilities of E. coli isolates with human and nonhuman host-origins across genetically diverse but related E. coli sequence types (STs).

We restrict our analysis to N = 2663 E. coli isolates in a well-defined collection from humans and retail meat obtained over a 12-month period in Flagstaff, Arizona, United States. Each isolate belongs to one of p_L = 133 different STs (leaves in the phylogenetic tree) that are identified via a multilocus sequence typing scheme based on short-read DNA sequencing. A total of J = 17 MGEs were curated and associated with functional annotations. Each ST was represented by at least four isolates. We constructed rooted, maximum-likelihood phylogenies using core-genome SNP data for the 133 STs. Figure 4 shows the estimated phylogenetic tree for the STs, where the edge lengths represent the substitution rate in the conserved core genome. Every ST is overlaid in the same row with the empirical frequencies of (1) the J = 17 MGEs and (2) the observed sources (human clinical or meat samples), which may differ from the true host origins. The observed frequencies of the MGEs vary greatly across lineages. We apply the proposed tree-integrative LCM (1) to estimate the probabilities of unobserved human and nonhuman host origins for all E. coli STs, with data-driven groupings of the STs for dimension reduction; and (2) to produce isolate-level probabilistic host-origin assignments. The context of the study restricts us to assume the host origin of each isolate is one of two unobserved classes: human versus food animal. A subset of preliminary data is analyzed in this paper to illustrate the proposed method; inclusion of additional samples and/or MGEs may change the findings. The final results and the detailed workflow of MGE discovery will be reported elsewhere.

FIGURE 4

The empirical frequencies of the J = 17 MGEs within each ST mapped onto the core-genome phylogenetic tree. The red scale bar represents the substitution rate in the conserved core genome. The bars on the right indicate the total number of isolates in each ST; the gray and blue bars represent the numbers of isolates obtained from apparent nonhuman and human sources, respectively. The core-genome phylogenetic tree on the left margin maps N = 2663 E. coli isolates into p_L = 133 STs (leaves). This figure appears in color in the electronic version of this article, and any mention of color refers to that version

6.2 |. Data results

The proposed approach produces estimated class-specific response profiles θ̂_{·k}, k = 1, 2, that exhibit differential enrichment of MGEs (Figure 5b). For example, MGEs 3 and 10 to 17 are estimated to be present in class 1 with probabilities between 0.15 and 0.71, with log odds ratios (LORs; class 1 vs class 2, LOR(θ̂_j1, θ̂_j2)) greater than 1. The functional annotations of these MGEs suggest that class 1 is likely associated with food-animal hosts. In contrast, MGEs 4–9 are estimated to be present in class 2 with probabilities between 0.35 and 0.82, with LORs greater than 1 relative to the corresponding estimated response probabilities in class 1. The results suggest the MGEs are highly associated with different types of host origins.

FIGURE 5

(a) Data results with estimated leaf groups and latent class probabilities by group. ST names (ST_#_isolates) are aligned to the tips of the circular tree, which are colored by discovered leaf groups. The scale bar represents the substitution rate in the conserved core genome. The circular heatmap shows the estimated latent class probabilities (π̂_v^dgrp, v ∈ 𝒱_L); (b) and (c): see the captions of the subfigures. This figure appears in color in the electronic version of this article, and any mention of color refers to that version

The proposed approach discovered 21 ST leaf groups, whose distinct estimated vectors of latent class probabilities π̂_v^dgrp are shown in Figure 5(a). For many estimated ST groups, the class probabilities are almost entirely dominated by one type of host origin. For example, the estimated ST Group 1 (38 leaves; 649 samples; class 1 probability 0.98, 95% CrI: (0.97, 0.99)) and Group 3 (31 leaves; 422 samples; class 1 probability 0.97, 95% CrI: (0.96, 0.98)) showed high probabilities of nonhuman (class 1) host origin of E. coli. The results suggest recent cross-species transmissions were rare among multiple nearby lineages.

We also compared against results based on two fixed and more restrictive leaf groupings: (a) a classical LCM (one leaf group) and (b) four leaf groups selected by the scientific team (Figure S4, Supporting Information). The single LCM (a) estimated the probability of class 1 to be 0.60 (95% CrI: 0.58, 0.62). The ad hoc leaf grouping (b) produced coarser estimates relative to the proposed π̂_v^dgrp, which identified four local leaves (ST1141, ST10, ST744 and ST5996), comprising 116 samples, with an estimated probability of class 1 of 0.74 (95% CrI: 0.66, 0.82). This highlights the inability of potentially misspecified leaf groups to uncover subtle local variations in the latent class probabilities. We compared these models via 10-fold cross-validation based on the mean predictive log-likelihood (MPL) of the test data, computed by plugging in the estimated latent class probabilities and response probability profiles. Of note, because of small sample sizes in some leaves, a naive cross-validation may by chance result in a training set without any observation in some leaves. We therefore randomly keep two observations per leaf and use one random fold of the remaining samples as test data, as sketched below. The proposed approach (with posterior median node selection) achieves the highest MPL (−2015.48) compared to (a) (−2030.15) and (b) (−2162.45) (Figure S6, Supporting Information). The estimates of the response probability profiles are similar.
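A sketch of the leaf-stratified split described above (toy sample-to-leaf map; the real analysis uses 133 STs):

    set.seed(4)
    leaf <- rep(1:5, times = c(4, 6, 9, 12, 5))   # hypothetical leaf memberships
    keep <- unlist(lapply(split(seq_along(leaf), leaf),
                          function(idx) sample(idx, 2)))  # 2 per leaf, always train
    rest <- setdiff(seq_along(leaf), keep)
    fold <- sample(rep(1:10, length.out = length(rest)))  # assign 10 folds at random
    test_fold1 <- rest[fold == 1]   # test set; train on keep + the other nine folds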

On the individual isolate level, the proposed model can estimate the probability that an isolate was derived from a particular host. For example, by incorporating the additional observed sample source information, we can compute a "posterior concordance probability (PCP)" for each observation. In particular, the PCP, r_{i,S_i^(v)}^(v), is defined as the approximate posterior probability of the true host origin agreeing with the observed sample source category S_i^(v) of the same E. coli isolate (e.g., S_i^(v) = 1 for meat and 2 for human clinical samples). Figure 5(c) shows the histogram of PCPs for all the isolates. Small PCPs, for example, below a user-specified threshold of 0.5, indicate likely recent host jumps that may be subject to further examination to estimate the timing of host transmissions based on in vitro stability data of each MGE.
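A sketch of the PCP computation given a matrix of approximate posterior class probabilities r (here a made-up 3 x 2 example):

    r <- rbind(c(.95, .05), c(.30, .70), c(.55, .45))  # r[i, k], hypothetical
    S <- c(1, 2, 2)                                    # observed source categories
    pcp <- r[cbind(seq_len(nrow(r)), S)]               # PCP_i = r_{i, S_i}
    which(pcp < 0.5)                                   # flags isolate 3 as a possible host jump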

7 |. DISCUSSION

In this paper, we proposed a tree-integrative LCM for analyzing multivariate binary data. We formulated the motivating scientific question in terms of inferring latent class probabilities that may vary in different parts of a tree. We proposed a Gaussian diffusion prior for logistic stick-breaking parameterized latent class probabilities and designed a scalable approximate algorithm for posterior inference. Our E. coli data analysis revealed that multiple MGEs are disproportionately associated with specific host origins. Combined with external sample source information, the model can help identify isolates that underwent recent host jump, paving the way for further isolate-level host origin validation.

Our study has some limitations. First, the MGE data we analyzed may represent only a fraction of the host-associated accessory elements. By design, additional accessory elements identified in future studies can be readily integrated and evaluated in the proposed framework. Second, host-associated accessory elements are lost and gained over time as E. coli strains transition across hosts. For infections that were zoonotic in nature, we did not observe how much time had elapsed between the cross-species host jump and the actual infection. Our model partly accounts for these uncertainties through the imperfect positive response probabilities. However, the timings may drive the presence or absence of multiple MGEs, resulting in potential statistical dependence given the true class of host origin. Deviations from the local independence assumption may impact model-based inference (e.g., Albert and Dodd, 2004; Pepe and Janes, 2006). In practice, a subset of samples with ascertained host origins may provide critical information for estimating the conditional dependence structure.

Further model extensions may improve applicability. First, when a subset of observations is not mapped to the tree at random, the algorithm can add unobserved leaf indicators to be inferred along with the other parameters. Second, the tree integrated into the LCM is, in general, estimated with uncertainty in its topology. Methods that place an additional layer of prior over the tree space, centered around the estimated tree, may account for this upstream uncertainty (e.g., Willis and Bell, 2018). Third, E. coli isolates may vary in additional factors such as the hosts' clinical characteristics. Regression extensions may refine the understanding of variation in latent class probabilities and positive response probabilities driven by covariates (e.g., Huang and Bandeen-Roche, 2004). Fourth, because the LCM is an example of probability tensor decomposition methods (e.g., Johndrow et al., 2017), the tree-integrative LCM motivates extensions to general graph-guided probability tensor decomposition methods. Finally, the truncated stick-breaking formulation in Equation (3) suggests connections to a broader class of covariate-indexed dependent process priors as K approaches infinity (e.g., Ren et al., 2011; Rodriguez and Dunson, 2011). Extensions along this line may also relax the present assumption of an identical number of realized classes, at additional computational cost.

On computation, neuronized priors, which do not rely on prior-likelihood conjugacy, have been proposed for Bayesian sparse linear regression (Shin and Liu, 2021); comparative studies against spike-and-slab priors are warranted. In addition, one known drawback of mean-field VI is that it tends to underestimate the marginal posterior variances of parameters. In our simulations, we observed near-nominal coverage of the true π, with slight undercoverage occurring mostly for leaf groups with very small sample sizes. It would be an interesting line of work to incorporate the methods of Giordano et al. (2015) to correct the variance-covariance matrices used in the component variational distributions. We leave these topics for future work.

Supplementary Material

supplementary material 2
supplementary material 1

ACKNOWLEDGMENTS

The research is supported in part by a Precision Health Investigator Award from University of Michigan, Ann Arbor (ML, ZW); an award from Wellcome Trust (LBP, MA and CML; award number 201866); and National Institutes of Health (NIH) grants R01AR073208 (ZW), P30CA04659 (ZW), and 1R01AI130066-01A1 (LBP). We also thank the co-editor, associate editor, and two referees whose comments greatly improved the presentation of our work.

Funding information

University of Michigan, Precision Health Initiative; Wellcome Trust, Grant/Award Number: 201866; National Institutes of Health, Grant/Award Numbers: 1R01AI130066-01A1, P30CA04659, R01AR073208

Footnotes

OPEN RESEARCH BADGES

This article has earned an Open Materials badge for making publicly available the components of the research methodology needed to reproduce the reported procedure and analysis. All materials are available at https://github.com/zhenkewu/lotR.

SUPPORTING INFORMATION

Web Appendices and Figures referenced in Sections 4–6, and R programs, are available with this paper at the Biometrics website on Wiley Online Library.


DATA AVAILABILITY STATEMENT

An R package “lotR” is freely available at https://github.com/zhenkewu/lotR. The data that support the findings in this paper are available from the corresponding author on reasonable request.

REFERENCES

1. Airoldi EM & Bischof JM (2016) Improving and evaluating topic models and other models of text. Journal of the American Statistical Association, 111, 1381–1403.
2. Albert PS & Dodd LE (2004) A cautionary note on the robustness of latent class models for estimating diagnostic error without a gold standard. Biometrics, 60, 427–435.
3. Avila D, Keiser O, Egger M, Kouyos R, Böni J, Yerly S et al. (2014) Social meets molecular: combining phylogenetic and latent class analyses to understand HIV-1 transmission in Switzerland. American Journal of Epidemiology, 179, 1514–1525.
4. Bandeen-Roche K, Miglioretti DL, Zeger SL & Rathouz PJ (1997) Latent variable regression for multiple discrete outcomes. Journal of the American Statistical Association, 92, 1375–1386.
5. Bishop CM (2006) Pattern recognition and machine learning. Berlin: Springer.
6. Blei DM, Kucukelbir A & McAuliffe JD (2017) Variational inference: a review for statisticians. Journal of the American Statistical Association, 112, 859–877.
7. Carbonetto P & Stephens M (2012) Scalable variational inference for Bayesian variable selection in regression, and its accuracy in genetic association studies. Bayesian Analysis, 7, 73–108.
8. Dunson D & Xing C (2009) Nonparametric Bayes modeling of multivariate categorical data. Journal of the American Statistical Association, 104, 1042–1051.
9. Durante D & Rigon T (2019) Conditionally conjugate mean-field variational Bayes for logistic models. Statistical Science, 34, 472–485.
10. Felsenstein J (1985) Phylogenies and the comparative method. The American Naturalist, 125, 1–15.
11. Formann AK (1992) Linear logistic latent class analysis for polytomous data. Journal of the American Statistical Association, 87, 476–486.
12. Ghahramani Z, Jordan MI & Adams RP (2010) Tree-structured stick breaking for hierarchical data. Advances in Neural Information Processing Systems, 23, 19–27.
13. Giordano R, Broderick T & Jordan M (2015) Linear response methods for accurate covariance estimates from mean field variational Bayes. Proceedings of the 28th International Conference on Neural Information Processing Systems, 1, 1441–1449.
14. Goodman L (1974) Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika, 61, 215–231.
15. Grimmer J (2011) An introduction to Bayesian inference via variational approximations. Political Analysis, 19, 32–47.
16. Huang G-H & Bandeen-Roche K (2004) Building an identifiable latent class model with covariate effects on underlying and measured variables. Psychometrika, 69, 5–32.
17. Hubert L & Arabie P (1985) Comparing partitions. Journal of Classification, 2, 193–218.
18. Jaakkola TS & Jordan MI (2000) Bayesian parameter estimation via variational methods. Statistics and Computing, 10, 25–37.
19. Johndrow JE, Bhattacharya A & Dunson DB (2017) Tensor decompositions and sparse log-linear models. Annals of Statistics, 45, 1.
20. Johnson JR, Delavari P, O'Bryan TT, Smith KE & Tatini S (2005) Contamination of retail foods, particularly turkey, from community markets (Minnesota, 1999–2000) with antimicrobial-resistant and extraintestinal pathogenic Escherichia coli. Foodborne Pathogens and Disease, 2, 38–49.
21. Johnson JR & Russo TA (2002) Extraintestinal pathogenic Escherichia coli: "the other bad E. coli". Journal of Laboratory and Clinical Medicine, 139, 155–162.
22. Lazarsfeld PF (1950) The logical and mathematical foundations of latent structure analysis. In: Stouffer S (Ed.), The American soldier: studies in social psychology in World War II, vol. IV, pp. 362–412. Princeton, NJ: Princeton University Press.
23. Lindsay JA & Holden MT (2004) Staphylococcus aureus: superbug, super genome? Trends in Microbiology, 12, 378–385.
24. Liu CM, Stegger M, Aziz M, Johnson TJ, Waits K, Nordstrom L et al. (2018) Escherichia coli ST131-H22 as a foodborne uropathogen. mBio, 9, e00470-18.
25. Maiden MC, Bygraves JA, Feil E, Morelli G, Russell JE, Urwin R et al. (1998) Multilocus sequence typing: a portable approach to the identification of clones within populations of pathogenic microorganisms. Proceedings of the National Academy of Sciences, 95, 3140–3145.
26. Ormerod JT & Wand MP (2010) Explaining variational approximations. The American Statistician, 64, 140–153.
27. Pepe MS & Janes H (2006) Insights into latent class analysis of diagnostic test performance. Biostatistics, 8, 474–484.
28. Price LB, Hungate BA, Koch BJ, Davis GS & Liu CM (2017) Colonizing opportunistic pathogens (COPs): the beasts in all of us. PLoS Pathogens, 13, e1006369.
29. Ranganath R, Tang L, Charlin L & Blei D (2015) Deep exponential families. Proceedings of Machine Learning Research, 38, 762–771.
30. Ren L, Du L, Carin L & Dunson DB (2011) Logistic stick-breaking process. Journal of Machine Learning Research, 12.
31. Rodriguez A & Dunson DB (2011) Nonparametric Bayesian models through probit stick-breaking processes. Bayesian Analysis, 6.
32. Roy DM, Kemp C, Mansinghka VK & Tenenbaum JB (2006) Learning annotated hierarchies from relational data. Advances in Neural Information Processing Systems, 19, 475–482.
33. Scornavacca C, Delsuc F & Galtier N (2020) Phylogenetics in the genomic era. Open access book (no commercial publisher).
34. Shin M & Liu JS (2021) Neuronized priors for Bayesian sparse linear regression. Journal of the American Statistical Association, 1–16, to appear.
35. Thomas EG, Trippa L, Parmigiani G & Dominici F (2019) Estimating the effects of fine particulate matter on 432 cardiovascular diseases using multi-outcome regression with tree-structured shrinkage. Journal of the American Statistical Association, 115, 1–11.
36. Titsias M & Lázaro-Gredilla M (2011) Spike and slab variational inference for multi-task and multiple kernel learning. Advances in Neural Information Processing Systems, 24, 2339–2347.
37. Tüchler R (2008) Bayesian variable selection for logistic models using auxiliary mixture sampling. Journal of Computational and Graphical Statistics, 17, 76–94.
38. Willis A & Bell R (2018) Uncertainty in phylogenetic tree estimates. Journal of Computational and Graphical Statistics, 27, 542–552.
