Abstract
B cells are a critical component of the adaptive immune system, responsible for producing antibodies that help protect the body from infections and foreign substances. Single cell RNA-sequencing (scRNA-seq) has allowed for both profiling of B cell receptor (BCR) sequences and gene expression. However, understanding the adaptive and evolutionary mechanisms of B cells in response to specific stimuli remains a significant challenge in the field of immunology. We introduce a new method, TRIBAL, which aims to infer the evolutionary history of clonally related B cells from scRNA-seq data. The key insight of TRIBAL is that inclusion of isotype data into the B cell lineage inference problem is valuable for reducing phylogenetic uncertainty that arises when only considering the receptor sequences. Consequently, the TRIBAL inferred B cell lineage trees jointly capture the somatic mutations introduced to the B cell receptor during affinity maturation and isotype transitions during class switch recombination. In addition, TRIBAL infers isotype transition probabilities that are valuable for gaining insight into the dynamics of class switching.
Via in silico experiments, we demonstrate that TRIBAL infers isotype transition probabilities accurately enough to distinguish between direct and sequential switching in a B cell population. This results in more accurate B cell lineage trees and corresponding ancestral sequence and class switch reconstruction compared to competing methods. Using real-world scRNA-seq datasets, we show that TRIBAL recapitulates expected biological trends in a model affinity maturation system. Furthermore, the B cell lineage trees inferred by TRIBAL were equally plausible for the BCR sequences as those inferred by competing methods but yielded lower entropic partitions for the isotypes of the sequenced B cells. Thus, our method holds the potential to further advance our understanding of vaccine responses, disease progression, and the identification of therapeutic antibodies.
1. Introduction
B cells play a pivotal role in the adaptive immune response, producing antibodies that neutralize foreign substances and infections [1, 2]. These antibodies, consisting of a heavy chain and a light chain, are initially formed as sequence-specific B cell receptors (BCRs) (Fig. 1a). The generation of specific BCR genes stems from a process known as V(D)J recombination, where DNA segments are rearranged to ensure a wide spectrum of antibodies to counter various pathogens [3]. To enhance their effectiveness, B cells undergo affinity maturation (Fig. 1b) [4], a micro-evolutionary process involving repeated cycles of somatic hypermutation (SHM) and cellular divisions. SHM introduces mutations in the BCR genes; B cells expressing high-affinity BCRs are then selected, while those with low affinity are eliminated. Concurrently, B cells can undergo class switch recombination (CSR) (Fig. 1b) [5], which diversifies their response by altering the antibody’s functional class or isotype.
Understanding the evolutionary history of B cells during these adaptive processes is integral to better understanding B cell response to infection and vaccination. However, the selection pressures applied to B cells during the affinity maturation process necessitate more specialized analytical approaches than those utilized for species phylogeny inference [6–9]. Specifically, Hoehn et al. developed HLP17 [10] and HLP19 [11], which are specialized codon substitution models for use with maximum likelihood inference via IgPhyML. Another important difference between B cell and species evolution is the lower mutation rate and the relatively short length of the BCR sequence (≈ 600 bp). These properties imply that maximum parsimony inference methods are viable [12–14] but typically result in a large solution space of many plausible phylogenies for the same data. In addition, these solutions exhibit additional tree uncertainty in the form of polytomies, i.e., multifurcating nodes with more than two children. This is counter to the underlying cell lineage tree, which is bifurcating as cell division results in exactly two daughter cells.
One approach to resolve phylogenetic uncertainty is the inclusion of additional data. This approach has proven effective in other areas, such as the use of physical location for studying cancer migration and metastasis [15], or geographical location for the inference of gene flow [16]. More related to B cell lineage inference, sequence abundance has been utilized by both GCTree [12] and ClonalTree [17] to discriminate between candidate solutions. However, both methods were originally developed for bulk RNA sequencing data and consequently were restricted to analysis of only the heavy chain sequences. With single cell RNA-sequencing (scRNA-seq), it is now possible to efficiently assemble BCR sequences that include both the heavy and light chains from a population of B cells [18] (Fig. 1c). As a result, the evolution of BCRs during affinity maturation can now be tracked with scRNA-seq with higher fidelity [19].
In addition, scRNA-seq yields another valuable data source in the form of the expressed isotype of the constant region of the heavy chain (Fig. 1c) [19]. Isotype data extracted from scRNA-seq has been helpful in related immunological domains, such as inferring dynamic cellular trajectories [20]. However, isotype expression is an especially useful marker of B cell evolution. When a B cell undergoes class switching from its current isotype to a new isotype, any heavy chain constant region locus between the current isotype and the new isotype in the genome is cut out or removed via a recombination process (Fig. 1b). Consequently, CSR is an irreversible process and the isotype state of a B cell offers a distinct milestone in its evolutionary history. Therefore, the inclusion of isotype information into the problem of B cell lineage inference has the potential to help minimize phylogenetic uncertainty by both reducing the size of the solution space and yielding more refined B cell lineage trees with fewer polytomies.
In this work, we present TRIBAL, which stands for Tree Inference of B cell Clonal Lineages (Fig. 1d). TRIBAL utilizes both the BCR sequence and isotype information from sequenced cells to infer a B cell lineage tree that jointly models the evolutionary processes of SHM and CSR. Additionally, TRIBAL infers the underlying isotype transition probabilities providing valuable insight into the dynamics of CSR (Fig. 1d). We demonstrate the accuracy of TRIBAL on simulated data and show that it is effective on experimental single cell data generated from the 5’ 10x Genomics platform. TRIBAL is available open source and has the potential to improve understanding of vaccine responses, track disease progression, and identify therapeutic antibodies.
2. Problem Statement
Suppose that we have sequenced and subsequently aligned the variable regions of the heavy and light chain of the B cell receptor (BCR) of $n$ B cells that descend from the same naive B cell post V(D)J recombination, resulting in a multiple sequence alignment (MSA) $\mathbf{A} = [\mathbf{a}_1, \ldots, \mathbf{a}_n]^\top$ composed of $m$ columns; the typical alignment length is $m \approx 600$ (Fig. 1c). That is, for each B cell $i \in [n] = \{1, \ldots, n\}$, we are given the concatenated sequence $\mathbf{a}_i \in \Sigma^m$ where $\Sigma = \{\mathrm{A}, \mathrm{C}, \mathrm{G}, \mathrm{T}, -\}$. In addition, we are given the isotype $b_i \in [r]$, determined using tools such as Cell Ranger [21]. For humans, there are $r = 8$ isotypes ordered as IgM/D, IgG3, IgG1, IgA1, IgG2, IgG4, IgE and IgA2, whereas for mice there are $r = 7$ isotypes ordered as IgM/D, IgG3, IgG1, IgG2b, IgG2c (2a), IgE, IgA. Finally, we are given sequence $\mathbf{a}_0$ and isotype $b_0$ of the ancestral naive B cell of the $n$ B cells. Note that $\mathbf{a}_0$ and $b_0$ are post V(D)J recombination but prior to somatic hypermutation (SHM) and class switch recombination (CSR), and thus $b_0 = 1$ (Fig. 1a).
To study the evolutionary history of these $n$ cells, we aim to construct a B cell lineage tree $T$ that jointly describes the evolution of the DNA sequences $\mathbf{A}$ and their isotypes $\mathbf{b} = [b_1, \ldots, b_n]^\top$. As such, each node $v$ of $T$ will be labeled by a sequence $\alpha_v \in \Sigma^m$ and isotype $\beta_v \in [r]$. In particular, the root $v_0$ will be labeled by $\alpha_{v_0} = \mathbf{a}_0$ and $\beta_{v_0} = b_0 = 1$, while the leaves $v_1, \ldots, v_n$ of $T$ will be labeled by sequence $\alpha_{v_i} = \mathbf{a}_i$ and isotype $\beta_{v_i} = b_i$ for each $i \in [n]$. A key property of isotype evolution is that it is irreversible. As such, the isotype $\beta_u$ of an ancestral cell $u$ must be less than or equal to the isotype $\beta_v$ of each of its descendants $v$. More formally, we have the following definition of a B cell lineage tree.
Definition 1. A rooted tree $T$ whose nodes are labeled by sequences $\alpha$ and isotypes $\beta$ is a B cell lineage tree for MSA $\mathbf{A}$ and isotypes $\mathbf{b}$ provided (i) $T$ has $n$ leaves $v_1, \ldots, v_n$ such that each leaf $v_i$ is labeled by sequence $\alpha_{v_i} = \mathbf{a}_i$ and isotype $\beta_{v_i} = b_i$, (ii) the root node $v_0$ of $T$ is labeled by sequence $\alpha_{v_0} = \mathbf{a}_0$ and isotype $\beta_{v_0} = b_0 = 1$, and (iii) for all nodes $u$, $v$ such that $u$ is ancestral to $v$ it holds that $\beta_u \leq \beta_v$.
In the following, we will refer to B cell lineage trees as lineage trees. Lineage trees typically have shallow depth due to the limited number of mutations introduced during SHM, making parsimony a reasonable optimization criterion [12–14]. Given a lineage tree $T$ with edge set $E(T)$ and sequence labels $\alpha$, the SHM parsimony score is computed as
$$\mathrm{SHM}(T, \alpha) = \sum_{(u,v) \in E(T)} D(\alpha_u, \alpha_v), \tag{1}$$
where $D(\alpha_u, \alpha_v)$ is the Hamming distance [22] between sequences $\alpha_u$ and $\alpha_v$. However, one common challenge of using parsimony to model SHM is that it often results in a large number of candidate lineage trees with equal optimal parsimony score. In addition, many inferred lineage trees contain polytomies, or internal nodes with out-degree greater than 2. To overcome these two challenges, we propose to infer lineage trees that optimize both sequence evolution (SHM) and isotype evolution (CSR).
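As an illustration of the SHM parsimony score, the following minimal sketch sums Hamming distances over the edges of a small tree. The tree, node names and sequences are hypothetical toy values, and the edge-list representation is an assumption made for this example rather than TRIBAL's internal data structure.

```python
from typing import Dict, List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def shm_score(edges: List[Tuple[str, str]], seqs: Dict[str, str]) -> int:
    """Sum of Hamming distances over all (parent, child) edges of a tree."""
    return sum(hamming(seqs[u], seqs[v]) for u, v in edges)


# Hypothetical toy tree: root -> x -> {c1, c2} and root -> c3.
edges = [("root", "x"), ("x", "c1"), ("x", "c2"), ("root", "c3")]
seqs = {"root": "ACGT", "x": "ACGA", "c1": "ACGA", "c2": "TCGA", "c3": "ACTT"}
score = shm_score(edges, seqs)
print(score)
```

Each edge contributes the number of mismatched columns between parent and child, so edges that connect identical sequences add nothing to the score.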
Similarly to SHM, one could model the evolution of CSR using unweighted parsimony. That is, one would prefer lineage trees $T$ with isotypes $\beta$ that minimize the number of isotype changes, i.e., $\sum_{(u,v) \in E(T)} \mathbb{1}[\beta_u \neq \beta_v]$. However, there are two issues with this approach. First, it does not appropriately penalize lineage trees that violate the irreversible property of isotype evolution [13]. Second, it does not account for the fact that, given an isotype starting state, the probability of transitioning to each of the possible isotype states is not necessarily equal. In fact, knowing these probability distributions is useful for researchers looking to gain basic insight into the patterns and causal factors of class switch recombination [23]. Therefore, we seek to develop an appropriate evolutionary model for CSR that captures the irreversible property of class switching and models preferential isotype class transitions.
We propose a state tree, or tree dependence, model [24, 25] as the evolutionary model for CSR, which models the joint probability distribution of a random variable vector under Markov-like assumptions on a given tree (Appendix A). Here, the random variables of interest in this state tree model are the isotypes $\beta_v$ of each node $v$ in lineage tree $T$. This model is parameterized by a probability distribution over the isotype of the root and isotype transition probabilities. As the root $v_0$ of a lineage tree is a naive B cell post V(D)J recombination, the isotype $\beta_{v_0}$ is always 1 (IgM) and the probability distribution of $\beta_{v_0}$ is defined as $\Pr(\beta_{v_0} = s) = 1$ if $s = 1$ and 0 otherwise. Intuitively, isotype transition probabilities capture the conditional probability of a descendant isotype given the isotype of its parent subject to irreversible isotype evolution. Next, we give a formal definition of isotype transition probabilities.
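Definition 2. An $r \times r$ matrix $\mathbf{P} = [p_{s,t}]$ is an isotype transition probability matrix provided for all isotypes $s, t \in [r]$ it holds that (i) $p_{s,t} \geq 0$, (ii) $p_{s,t} = 0$ if $s > t$, and (iii) $\sum_{t=1}^{r} p_{s,t} = 1$ for all isotypes $s \in [r]$.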
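The three conditions of Definition 2 (non-negativity, no backward transitions, rows summing to one) can be checked mechanically. The sketch below is a hedged illustration using a plain nested-list matrix; the function name and tolerance are assumptions introduced here, not part of TRIBAL.

```python
from typing import List


def is_isotype_transition_matrix(P: List[List[float]], tol: float = 1e-9) -> bool:
    """Check the conditions of an isotype transition probability matrix."""
    r = len(P)
    if any(len(row) != r for row in P):
        return False  # must be square
    for s in range(r):
        if any(p < 0 for p in P[s]):
            return False  # (i) probabilities are non-negative
        if any(abs(P[s][t]) > tol for t in range(s)):
            return False  # (ii) no transition to an earlier isotype
        if abs(sum(P[s]) - 1.0) > tol:
            return False  # (iii) each row is a probability distribution
    return True


# Hypothetical example with r = 3 isotypes.
valid = [[0.8, 0.1, 0.1], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]]
reversible = [[0.8, 0.1, 0.1], [0.2, 0.5, 0.3], [0.0, 0.0, 1.0]]
print(is_isotype_transition_matrix(valid), is_isotype_transition_matrix(reversible))
```

Note that conditions (ii) and (iii) together force the last isotype to be absorbing, i.e., its self-transition probability equals 1.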
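We define the joint likelihood of the observed isotypes $\mathbf{b}$ for isotype transition probabilities $\mathbf{P}$ and any lineage tree $T$ with isotype labels $\beta$ whose leaves have isotypes $\mathbf{b}$ as
$$\mathrm{CSR}(T, \beta, \mathbf{P}) = \prod_{(u,v) \in E(T)} p_{\beta_u, \beta_v}. \tag{2}$$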
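To make the CSR likelihood concrete, the sketch below evaluates it in log space as a sum of log transition probabilities over tree edges, assuming the root isotype has probability 1 as stated above. The tree, isotype labels and matrix are hypothetical toy values; isotypes are 1-indexed to match the text.

```python
import math
from typing import Dict, List, Tuple


def csr_log_likelihood(
    edges: List[Tuple[str, str]],
    isotype: Dict[str, int],
    P: List[List[float]],
) -> float:
    """Sum of log p[parent isotype][child isotype] over all edges.

    Returns -inf if any edge uses a transition with probability zero,
    e.g. a reversal of class switching.
    """
    total = 0.0
    for u, v in edges:
        p = P[isotype[u] - 1][isotype[v] - 1]
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total


P = [[0.8, 0.1, 0.1], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]]
edges = [("root", "x"), ("x", "c1"), ("x", "c2"), ("root", "c3")]
isotype = {"root": 1, "x": 2, "c1": 2, "c2": 3, "c3": 1}
ll = csr_log_likelihood(edges, isotype, P)
print(math.exp(ll))
```

Working in log space avoids numerical underflow on larger trees and turns likelihood maximization into a sum of edge weights.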
We consider the problem of simultaneously inferring a lineage tree $T$ with nodes labeled by sequences $\alpha$ and isotypes $\beta$, given MSA $\mathbf{A}$ and isotypes $\mathbf{b}$, that optimizes both $\mathrm{SHM}(T, \alpha)$ and $\mathrm{CSR}(T, \beta, \mathbf{P})$. However, a significant barrier to solving this problem is that isotype transition probabilities $\mathbf{P}$ are unknown and need to be inferred. While there have been experimental studies that estimate these quantities under specific biological conditions [23], there currently exist no computational methods to infer these probabilities directly from a sequencing experiment. Here, we will leverage that typical single-cell sequencing experiments yield data from a set of diverse B cell lineages, where each lineage is commonly referred to as a clonotype. A common pre-processing step is to first cluster the sequenced cells into $k$ clonotypes, where each clonotype $j \in [k]$ contains $n_j$ cells (Fig. 1c). We follow convention and group B cells based on shared V(D)J alleles in heavy and light chains into a clonotype. Moreover, we reason that under many experimental conditions, the isotype transition probabilities will be similar across clonotypes. Thus, inferring these isotype transition probabilities jointly for $k$ clonotypes will yield higher accuracy than inferring isotype transition probabilities for a single lineage in isolation. This leads to the following problem statement.
Problem 1 (B cell Lineage Forest Inference (BLFI)). Given MSAs $\mathbf{A}_1, \ldots, \mathbf{A}_k$ and isotypes $\mathbf{b}_1, \ldots, \mathbf{b}_k$ for $k$ clonotypes, find isotype transition probabilities $\mathbf{P}^*$ for $r$ isotypes and lineage trees $T^*_1, \ldots, T^*_k$ for $(\mathbf{A}_1, \mathbf{b}_1), \ldots, (\mathbf{A}_k, \mathbf{b}_k)$ whose nodes are labeled by sequences $\alpha^*_1, \ldots, \alpha^*_k$ and isotypes $\beta^*_1, \ldots, \beta^*_k$, respectively, such that $\sum_{j=1}^{k} \mathrm{SHM}(T^*_j, \alpha^*_j)$ is minimum and then $\prod_{j=1}^{k} \mathrm{CSR}(T^*_j, \beta^*_j, \mathbf{P}^*)$ is maximum.
In other words, we prioritize the SHM objective and then among all solutions with minimum SHM score we prefer those that additionally maximize the CSR objective. It is easy to see that the BLFI problem is NP-hard via a simple reduction from the Large Parsimony problem — see Appendix B.1 for more details.
3. Methods
We introduce Tree Inference of B cell Clonal Lineages, or TRIBAL, a method to solve the BLFI problem. This method is based on two key ideas that allow us to effectively solve the BLFI problem.
First, the lexicographical ordering of our two objectives (optimizing for SHM followed by CSR) enables one to use the following two-stage approach (Fig. 1c). In the first stage, we use existing maximum parsimony methods to generate a set $\mathcal{T}_j$ of input trees, also called a maximum parsimony forest, for each clonotype $j$ such that each tree $T \in \mathcal{T}_j$ minimizes the objective $\mathrm{SHM}(T, \alpha)$. To do so, we provide these methods only the sequence information $\mathbf{A}_j$ to enumerate a solution space of trees whose nodes are labeled by sequences $\alpha$. In the second stage, we incorporate isotype information $\mathbf{b}_j$ to further operate on the set $\mathcal{T}_j$ and additionally optimize $\mathrm{CSR}(T, \beta, \mathbf{P})$ in such a manner that maintains optimality of the SHM objective. We note that a lexicographically optimal lineage tree $T^*_j$ does not necessarily need to be an element of $\mathcal{T}_j$, but instead it suffices that the evolutionary relationships in tree $T^*_j$ are a refinement of the evolutionary relationships described by some tree $T$ among the set $\mathcal{T}_j$ of input trees. More specifically, a refinement $T'$ of tree $T$ is obtained by a series of expand operations such that an expand operation of node $v$ consists of splitting node $v$ into $v$ and $v'$, joining them with an edge $(v, v')$ and then redistributing the children of $v$ to be children of either $v$ or $v'$. We have the following key proposition and corollary.
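Proposition 1. For any tree $T$ labeled by sequences $\alpha$ and refinement $T'$ of $T$, there exists a sequence labeling $\alpha'$ for $T'$ such that $\mathrm{SHM}(T', \alpha') = \mathrm{SHM}(T, \alpha)$.
Proof. The sequence labeling $\alpha'$ is found by setting $\alpha'_{v} = \alpha'_{v'} = \alpha_v$ during each expand operation of a node $v$ into $v$ and $v'$. By construction, the new edge $(v, v')$ has $D(\alpha'_{v}, \alpha'_{v'}) = 0$ and every original edge maintains its original Hamming distance in $T'$. Therefore, $\mathrm{SHM}(T', \alpha') = \mathrm{SHM}(T, \alpha)$. ☐
Corollary 1. Any lineage tree $T^*$ that lexicographically optimizes $\mathrm{SHM}$ and then $\mathrm{CSR}$ must be a refinement of some tree $T$ optimizing only $\mathrm{SHM}$.
Therefore, our sought lineage tree $T^*_j$ that lexicographically optimizes both objectives must be a refinement of some tree in the set $\mathcal{T}_j$.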
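The argument behind Proposition 1 can be checked on a toy example. The sketch below applies one expand operation to a polytomy, copies the parent sequence to the new node, and confirms that the parsimony score is unchanged. The child-list tree representation, the helper names and the sequences are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


def shm_score(children: Dict[str, List[str]], seqs: Dict[str, str]) -> int:
    """Sum of Hamming distances over all parent-child edges."""
    return sum(hamming(seqs[u], seqs[v]) for u, kids in children.items() for v in kids)


def expand(
    children: Dict[str, List[str]],
    seqs: Dict[str, str],
    v: str,
    moved: List[str],
    new_name: str,
) -> Tuple[Dict[str, List[str]], Dict[str, str]]:
    """Split node v into v and new_name, moving `moved` children under new_name."""
    if not set(moved) <= set(children[v]):
        raise ValueError("moved nodes must be children of v")
    new_children = {u: list(kids) for u, kids in children.items()}
    new_children[v] = [c for c in children[v] if c not in moved] + [new_name]
    new_children[new_name] = list(moved)
    new_seqs = dict(seqs)
    new_seqs[new_name] = seqs[v]  # copied label, so the new edge has distance 0
    return new_children, new_seqs


# Hypothetical polytomy: the root has three children.
children = {"root": ["c1", "c2", "c3"], "c1": [], "c2": [], "c3": []}
seqs = {"root": "ACGT", "c1": "ACGA", "c2": "TCGT", "c3": "TCGA"}
refined, refined_seqs = expand(children, seqs, "root", ["c2", "c3"], "root2")
print(shm_score(children, seqs), shm_score(refined, refined_seqs))
```

Because the sequence labels are untouched, any difference between a tree and its refinements must come from the isotype labels, which is what the CSR objective evaluates.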
The second key idea is that the inference of optimal lineage trees is conditionally independent when given isotype transition probabilities $\mathbf{P}$. This motivates the use of a coordinate ascent algorithm where we randomly initialize isotype transition probabilities $\mathbf{P}^{(1)}$ (Sec. 3.1 and Fig. 2a). Then, at each iteration $\ell$, we use isotype transition probabilities $\mathbf{P}^{(\ell)}$ and the input set $\mathcal{T}_j$ of trees to independently infer an optimal lineage tree $T^{(\ell)}_j$ for each clonotype $j$ (Sec. 3.2 and Fig. 2b). This is then followed by estimating updated isotype transition probabilities $\mathbf{P}^{(\ell+1)}$ given trees $T^{(\ell)}_1, \ldots, T^{(\ell)}_k$ (Sec. 3.3 and Fig. 2c). We terminate upon convergence of our CSR objective or when exceeding a specified number of maximum iterations.
TRIBAL is implemented in Python 3, is open source (BSD-3-Clause license), and is available at https://github.com/elkebir-group/tribal.
3.1. Input
The input to TRIBAL is a set of $k$ clonotypes with corresponding maximum parsimony forest $\mathcal{T}_j$ for each clonotype $j \in [k]$. In addition, we are given isotypes $\mathbf{b}_j$ labeling the leaves of trees $\mathcal{T}_j$ for each clonotype $j$ (Fig. 1c, Fig. 2a). Obtaining this input requires a number of preprocessing steps of a scRNA-seq dataset (Fig. 1), including (i) BCR assembly and isotype calling of each sequenced cell, (ii) clonotyping or clustering the cells based on shared germline alleles for both the heavy and light chains, (iii) obtaining an MSA for sequences within a clonotype and (iv) finding a parsimony forest for each MSA of a clonotype. These preprocessing steps are not part of TRIBAL.
For our coordinate ascent approach, we also require an initialization $\mathbf{P}^{(1)} = [p_{s,t}]$ for the isotype transition probabilities (Fig. 2a). We set the initial transition probabilities to reflect the observation that, under baseline conditions, the probability of a B cell undergoing class switching is lower than the probability of it maintaining its original antibody class [4, 23]. Thus, we initialize $\mathbf{P}^{(1)}$ such that $p_{s,s} \geq p_{s,t}$ for all isotypes $s$ and $t$. Let $\theta$ be the probability that a B cell does not class switch, i.e., $p_{s,s} = \theta$ for each isotype $s < r$ and $p_{s,s} = 1$ if $s = r$. We enforce irreversibility such that $p_{s,t} = 0$ if $s > t$. We then initialize the remaining parameters uniformly, i.e., $p_{s,t} = (1 - \theta)/(r - s)$ for $s < t$, where $r$ is the total number of isotypes. We conduct multiple restarts, varying $\theta$ in each restart.
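The initialization described above can be sketched as follows, assuming that the probability mass left after staying in the current isotype is split evenly among all later isotypes and that the final isotype is absorbing. The function name and the example value of the stay probability are hypothetical.

```python
from typing import List


def init_transition_matrix(r: int, theta: float) -> List[List[float]]:
    """Initial r x r isotype transition matrix with stay probability theta."""
    if not 0.0 < theta < 1.0:
        raise ValueError("theta must lie strictly between 0 and 1")
    P = [[0.0] * r for _ in range(r)]
    for s in range(r):  # 0-indexed isotype s
        later = r - 1 - s  # number of isotypes reachable from s
        if later == 0:
            P[s][s] = 1.0  # last isotype cannot switch further
            continue
        P[s][s] = theta
        for t in range(s + 1, r):
            P[s][t] = (1.0 - theta) / later  # uniform over later isotypes
    return P


# Hypothetical setting: r = 7 isotypes (as in mice), stay probability 0.7.
P0 = init_transition_matrix(7, 0.7)
print([round(p, 3) for p in P0[0]])
```

Entries below the diagonal remain zero, so every restart begins from a matrix that already respects irreversibility; only the stay probability changes between restarts.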
3.2. B cell lineage tree inference via refinement
As described above, the inference of optimal lineage trees is conditionally independent given isotype transition probabilities $\mathbf{P}$. We therefore focus our discussion on how TRIBAL infers a B cell lineage tree for a single clonotype $j$ during iteration $\ell$ given isotype transition probabilities $\mathbf{P}^{(\ell)}$. By Corollary 1, we solve this problem by finding an optimal refinement $T'$ and corresponding isotype labeling $\beta'$ for each tree $T$ in the input set $\mathcal{T}_j$ and select the one that maximizes our CSR objective (Fig. 2b). Maximizing the log-likelihood $\log \mathrm{CSR}(T', \beta', \mathbf{P}^{(\ell)})$ is equivalent to optimizing a weighted parsimony criterion. This leads to the following problem statement.
Problem 2 (Most Parsimonious Tree Refinement (MPTR)). Given a tree $T$ on $n$ leaves, isotypes $\mathbf{b}$ and isotype transition probabilities $\mathbf{P}$, find a tree $T'$ with root $v_0$ and isotype labels $\beta'$ such that (i) $T'$ is a refinement of $T$, (ii) $\beta'_{v_0} = 1$, (iii) $\beta'_{v_i} = b_i$ for each leaf $v_i$ and (iv) $\mathrm{CSR}(T', \beta', \mathbf{P})$ is maximum.
We prove in Appendix B.1 that the MPTR problem is NP-hard. We therefore convert this problem to a graph problem (Fig. S2). For each node $v$ in tree $T$, multiple copies are added to the expansion graph $G$, one for each possible isotype subject to additional constraints depending on the observed isotypes of the leaves descendant from $v$. An edge in this graph from a copy with isotype $s$ to a copy with isotype $t$ has weight $\log p_{s,t}$. We then solve the MPTR problem by finding a valid, maximum weight subtree $T'$ in the expansion graph $G$ such that $T'$ with isotype labels $\beta'$ is a most parsimonious refinement of $T$. This is achieved via a mixed-integer linear programming (MILP) formulation, similar to that used to solve the Steiner minimal tree problem [26]. See Appendix C.1 for more details on the construction of the expansion graph $G$ from tree $T$ and leaf isotypes $\mathbf{b}$, proof of correctness for this algorithm and the MILP formulation.
To obtain outputs $(T^{(\ell)}_j, \beta^{(\ell)}_j)$, we set $(T^{(\ell)}_j, \beta^{(\ell)}_j)$ to the refined tree and corresponding isotype labeling that maximize $\mathrm{CSR}(T', \beta', \mathbf{P}^{(\ell)})$ among all input trees $T \in \mathcal{T}_j$ for each clonotype $j$.
3.3. Parameter estimation
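In the next step, given tuples $(T^{(\ell)}_1, \beta^{(\ell)}_1), \ldots, (T^{(\ell)}_k, \beta^{(\ell)}_k)$ of lineage trees and isotype labels for iteration $\ell$, we seek updated isotype transition probabilities $\mathbf{P}^{(\ell+1)}$ for iteration $\ell + 1$ such that $\prod_{j=1}^{k} \mathrm{CSR}(T^{(\ell)}_j, \beta^{(\ell)}_j, \mathbf{P}^{(\ell+1)})$ is maximum (Fig. 2c). This is achieved via maximum likelihood estimation. For fixed lineage trees $T^{(\ell)}_j$ and isotypes $\beta^{(\ell)}_j$, the logarithm of our CSR objective can be rewritten as
$$\sum_{j=1}^{k} \log \mathrm{CSR}(T^{(\ell)}_j, \beta^{(\ell)}_j, \mathbf{P}) = \sum_{s=1}^{r} \sum_{t=s}^{r} N_{s,t} \log p_{s,t}, \tag{3}$$
where $N_{s,t}$ is the number of transitions from isotype $s$ to isotype $t$ across all selected trees and clonotypes. Thus, we seek isotype transition probabilities $\mathbf{P}$ that maximize (3) subject to the constraints that $\sum_{t=s}^{r} p_{s,t} = 1$ for every isotype $s$. This constrained optimization problem is easily solved with the aid of Lagrange multipliers (see Appendix C.2 for details). Additionally, we add pseudocounts of 1 to avoid overfitting the data in the event of unobserved transitions, yielding the following updated probabilities for iteration $\ell + 1$:
$$p^{(\ell+1)}_{s,t} = \begin{cases} \dfrac{N_{s,t} + 1}{\sum_{t'=s}^{r} (N_{s,t'} + 1)}, & \text{if } s \leq t, \\ 0, & \text{otherwise.} \end{cases} \tag{4}$$
After estimating $\mathbf{P}^{(\ell+1)}$, we then infer $(T^{(\ell+1)}_j, \beta^{(\ell+1)}_j)$ for each clonotype $j$, as discussed in Section 3.2.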
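The parameter update can be sketched as a counting procedure: tally isotype transitions along the edges of the selected trees, add a pseudocount of 1 to each allowed transition, and normalize each row. The data layout (edge lists with 1-indexed isotype labels) and the function name are assumptions of this example, not TRIBAL's actual interface.

```python
from typing import Dict, List, Tuple

Tree = Tuple[List[Tuple[str, str]], Dict[str, int]]  # (edges, isotype labels)


def update_transition_matrix(
    trees: List[Tree], r: int, pseudocount: float = 1.0
) -> List[List[float]]:
    """Maximum likelihood estimate of transition probabilities with pseudocounts."""
    N = [[0.0] * r for _ in range(r)]
    for edges, isotype in trees:
        for u, v in edges:
            N[isotype[u] - 1][isotype[v] - 1] += 1.0  # count transition s -> t
    P = [[0.0] * r for _ in range(r)]
    for s in range(r):
        # Only transitions to the same or a later isotype are allowed.
        denom = sum(N[s][t] + pseudocount for t in range(s, r))
        for t in range(s, r):
            P[s][t] = (N[s][t] + pseudocount) / denom
    return P


# Hypothetical single clonotype with r = 3 isotypes.
edges = [("root", "x"), ("x", "c1"), ("x", "c2"), ("root", "c3")]
isotype = {"root": 1, "x": 2, "c1": 2, "c2": 3, "c3": 1}
P_new = update_transition_matrix([(edges, isotype)], r=3)
print(P_new)
```

The pseudocount keeps every allowed transition at a non-zero probability, so a transition that happens to be unobserved in one iteration can still be selected in the next.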
4. Results
We evaluated TRIBAL on both in silico experiments (Sec. 4.1) and two sets of experimental scRNA-seq data. The first assessed performance on a model immunological system for affinity maturation (Sec. 4.2) and the second on a study of the relationship between age-associated B cells and autoimmune disorders (Sec. 4.3).
4.1. In silico experiments
We designed in silico experiments to evaluate TRIBAL with known ground-truth isotype transition probabilities $\mathbf{P}$ and lineage trees $T$ labeled by sequences $\alpha$ and isotypes $\beta$. Specifically, we used an existing BCR phylogenetic simulator [13] that models SHM but not CSR. We generated isotype transition probabilities $\mathbf{P}$ with $r = 7$ isotypes (as in mice) under two different models of CSR. Briefly, both CSR models assume the probability of not transitioning is higher than the probability of transitioning, but in the sequential model there is a clear preference for transitions to the next contiguous isotype, while in the direct model the probabilities of contiguous and non-contiguous class transitions are similar. Given $\mathbf{P}$, we evolved isotype characters down each ground truth lineage tree $T$. We generated 5 replications of each CSR model for $k = 75$ clonotypes and two settings of the number $n$ of cells per clonotype, resulting in 20 in silico experiments, yielding a total of 1500 ground truth lineage trees. In addition to comparing TRIBAL to existing methods including dnapars [7], dnaml [7] and IgPhyML [10], we also compared to a version of TRIBAL without tree refinement, denoted as TRIBAL-No Refinement (TRIBAL-NR). To obtain the input set $\mathcal{T}_j$ of trees with maximum parsimony for each clonotype $j$, we utilized dnapars [7]. We refer to Appendix D for additional details on the simulations. In the following we focus our discussion on the in silico experiments with one of the two settings of $n$ (see Fig. S5 for the other).
To evaluate accuracy of isotype transition probability inference, we used Kullback–Leibler (KL) divergence [27] to compare the inferred transition probability distribution $\hat{\mathbf{p}}_s$ of each isotype $s$ to the simulated ground truth distribution $\mathbf{p}_s$. For two discrete distributions $\mathbf{p}$ and $\mathbf{q}$, KL divergence is defined as $D_{\mathrm{KL}}(\mathbf{p} \,\|\, \mathbf{q}) = \sum_{t} p_t \log (p_t / q_t)$; the lower the KL divergence, the more similar the two distributions. Since no existing methods infer isotype transition probabilities, we restricted this analysis to TRIBAL and TRIBAL-NR. Overall, we observed good concordance between simulated and TRIBAL inferred isotype transition probabilities (Fig. 3a). Specifically, TRIBAL had lower median KL divergence than TRIBAL-NR for all isotype starting states, except IgA, which is trivially 0, under both direct and sequential CSR models (direct: median of 0.15 vs. 0.73; sequential: median of 0.099 vs. 0.55). We observed improved performance of TRIBAL (but not of TRIBAL-NR) in the experiments with the other number of cells per clonotype (Fig. S5a and Fig. S6). To assess the sensitivity of TRIBAL to infer isotype transition probabilities with fewer than $k = 75$ clonotypes, we downsampled the 75 clonotypes to 25 and 50 clonotypes per experiment. We observed similar trends for experiments with $k \in \{25, 50\}$ clonotypes, with TRIBAL continuing to outperform TRIBAL-NR while still achieving small KL divergences even as $k$ decreases (Fig. S7). These findings demonstrate that tree refinement is key to accurately estimating isotype transition probabilities.
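For reference, a minimal sketch of the KL divergence between two rows of a transition matrix is given below. The natural logarithm and the order of the arguments (ground truth first, estimate second) are assumptions made here; the example rows are hypothetical.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL divergence of q from p, using the convention 0 * log(0 / q) = 0."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi == 0.0:
                return math.inf  # p puts mass where q has none
            total += pi * math.log(pi / qi)
    return total


# Hypothetical ground-truth and inferred rows for one starting isotype.
truth = [0.5, 0.5]
inferred = [0.25, 0.75]
print(kl_divergence(truth, inferred))

# The last isotype can only transition to itself, so its divergence is 0.
print(kl_divergence([0.0, 0.0, 1.0], [0.0, 0.0, 1.0]))
```

The second call illustrates why the terminal isotype is excluded from the comparison: both the true and the inferred row are fixed by irreversibility.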
Next, we assessed the accuracy of lineage tree inference using the Robinson-Foulds (RF) distance [28] normalized by the total number of bipartitions in the ground truth and inferred lineage tree (Eq. 25). Since TRIBAL, TRIBAL-NR and dnapars return multiple optimal solutions, we report the mean of the lineage tree inference metrics over all optimal solutions. We found TRIBAL had the lowest mean normalized RF distance for both direct and sequential CSR models (Fig. 3b). Overall, dnaml (median: 0.5) and IgPhyML (median: 0.49) had the worst performance on normalized RF. Interestingly, even though the starting trees of dnapars are used by TRIBAL, both TRIBAL (median: 0.36) and TRIBAL-NR (median: 0.38) outperformed dnapars (median: 0.39), showing the importance of using isotype information to resolve phylogenetic uncertainty.
While normalized RF distance only assesses the accuracy of the tree topology, it is important to also assess the accuracy of the ancestral sequence reconstruction. To that end, we used a metric called Most Recent Common Ancestor (MRCA) distance (Eq. 26) introduced by Davidsen and Matsen [13]. For any two simulated B cells (leaves), the MRCA distance is the Hamming distance between the sequence of their MRCA in the ground truth lineage tree and the sequence of their MRCA in the inferred lineage tree. This distance is then averaged over all pairs of simulated B cells (see Appendix D.4 and Fig. S4a for additional details). Again, we report the mean MRCA distance over all optimal solutions for TRIBAL, TRIBAL-NR and dnapars. We found TRIBAL outperformed all other methods (Fig. 3c), achieving the lowest overall median MRCA distance ($3.46 \times 10^{-5}$), followed by TRIBAL-NR ($4.63 \times 10^{-5}$). IgPhyML had the worst performance with a median of $8.78 \times 10^{-5}$. Performance trends were consistent between methods across both CSR models.
Lastly, we assessed the accuracy of isotype inference by a new metric called CSR error, which is computed for each B cell $i$ and clonotype $j$ and is the absolute difference between the number of ground-truth class switches and inferred number of class switches that occurred along its evolutionary path from the root (see Appendix D and Fig. S4b for additional details). Since dnaml, dnapars and IgPhyML do not infer isotypes for internal nodes, we pair these methods with the Sankoff algorithm [29] using a cost $w_{s,t}$ that equals 1 if $s < t$, 0 if $s = t$ and $\infty$ otherwise. We account for the presence of multiple solutions by taking the mean across solutions. All methods had a median CSR error of 0 for both the direct and sequential models (Fig. 3d). Therefore, we utilized the third quartile for a more robust comparison. We found that under the direct model TRIBAL-NR (third quartile: 0) was the best performing method, while dnapars was second best (third quartile: 0.8) and all other methods, including TRIBAL, had a third quartile of 1. We observed a slight tendency of TRIBAL to overestimate the number of transitions due to the tree refinement step, while other methods tended to underestimate the number of transitions. However, under a sequential model, where refinement is helpful in accurately capturing sequential state transitions, we found that TRIBAL was tied with TRIBAL-NR for the best performance (third quartile: 0). All other methods had similar performance between both CSR models for this metric.
We observed similar trends on these metrics for the in silico experiments with the other number of cells per clonotype (Fig. S5). In summary, these results suggest that the incorporation of isotype data and the use of tree refinement are beneficial for both lineage tree inference and ancestral sequence and class switch reconstruction.
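The pairing of a fixed tree with the Sankoff algorithm can be sketched as below. This is an illustrative dynamic program under an assumed cost function (0 for no change, 1 for a forward switch, infinite for a reversal) with the root fixed to isotype 1; the tree encoding and names are hypothetical and not taken from any of the compared tools.

```python
import math
from typing import Dict, List, Tuple


def cost(s: int, t: int) -> float:
    """Assumed switch cost: free to stay, 1 to move forward, forbidden backward."""
    if s == t:
        return 0.0
    return 1.0 if s < t else math.inf


def sankoff(
    children: Dict[str, List[str]], root: str, leaf_isotype: Dict[str, int], r: int
) -> Tuple[Dict[str, int], float]:
    """Label internal nodes with isotypes minimizing the total switch cost."""
    states = range(1, r + 1)
    dp: Dict[str, List[float]] = {}  # dp[v][s-1]: best cost below v if v has isotype s

    def up(v: str) -> None:
        kids = children.get(v, [])
        if not kids:
            dp[v] = [0.0 if s == leaf_isotype[v] else math.inf for s in states]
            return
        for c in kids:
            up(c)
        dp[v] = [
            sum(min(cost(s, t) + dp[c][t - 1] for t in states) for c in kids)
            for s in states
        ]

    labels = {root: 1}  # the naive root is always isotype 1

    def down(v: str) -> None:
        for c in children.get(v, []):
            s = labels[v]
            labels[c] = min(states, key=lambda t: cost(s, t) + dp[c][t - 1])
            down(c)

    up(root)
    down(root)
    return labels, dp[root][0]


# Hypothetical tree with r = 3 isotypes.
children = {"root": ["x", "c3"], "x": ["c1", "c2"]}
leaf_isotype = {"c1": 2, "c2": 3, "c3": 1}
labels, total = sankoff(children, "root", leaf_isotype, r=3)
print(labels, total)
```

With this cost function every switch counts equally, so several labelings of the internal node are equally good here; a weighted model that uses transition probabilities can break such ties.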
4.2. Keyhole limpet haemocyanin (NP-KLH) antigen immune response studies
We applied TRIBAL as well as IgPhyML to 10x 5′ scRNA-seq data of B cells extracted from mice immunized with (4-hydroxy-3-nitrophenyl)acetyl conjugated to keyhole limpet haemocyanin (NP-KLH), a commonly used antigen in the study of antibody affinity maturation [30]. Our goal was to determine whether these methods recapitulate known patterns of B cell lineage evolution for this well-studied antigen using data from two studies and to compare the lineage trees inferred by each method. The first dataset (NP-KLH-1) was generated from C57BL/6 mice that were immunized with NP-KLH, with total germinal center B cells extracted 14 days after immunization [20]. The other two datasets came from a single study in which C57BL/6 mice were immunized with NP-KLH (NP-KLH-2a and NP-KLH-2b) and NP-specific germinal center B cells were extracted 13 days after immunization [31]. We utilized the standard 10x Cell Ranger [21] single-cell bioinformatics pipeline to generate sequence $\mathbf{a}_i$ and isotype $b_i$ for each cell $i$. We used Dandelion [32] to remove doublets, reassign alleles, and cluster the cells into clonotypes. We identified clonotype MSAs based on shared V(D)J alleles for the heavy chain using the dowser package [14]. Finally, we excluded clonotypes with fewer than 5 cells. This yielded a total of sequenced B cells clustered into clonotypes. We excluded methods that rely on sequence abundance as a key signal, such as GCTree [12] and ClonalTree [17], as we observed very few duplicated sequences within each clonotype. Fig. 4a shows the distribution of isotypes by dataset and Table S1 includes a more detailed summary of each dataset.
We used dnapars [7] to infer TRIBAL’s input set $\mathcal{T}_j$ for each clonotype $j$. We found that TRIBAL’s use of isotype information significantly reduced the number of optimal solutions identified by dnapars (mean: 31.5 vs. 1.3, max: 4310 vs. 8; see Fig. 4b). While IgPhyML, a maximum likelihood method using the HLP19 codon-substitution model [10, 11], infers only a single tree per clonotype, it is important to note that there might be multiple trees with maximum likelihood in the solution space. Indeed, we found high concordance of HLP19 likelihoods between the TRIBAL and IgPhyML inferred lineage trees, with a small overall mean absolute deviation of 0.97 (Fig. 4c, Fig. S8). We even observed that TRIBAL had a greater likelihood than IgPhyML in 59.3% of the clonotypes. Thus, TRIBAL resulted in a significant reduction in the size of the solution space compared to the maximum parsimony method dnapars with similar (and sometimes better) HLP19 likelihood as IgPhyML, illustrating how isotype information can be used to effectively reduce phylogenetic uncertainty.
Next, we assessed whether the lineage trees inferred by TRIBAL recapitulated expected biological trends for the NP-KLH model system. Previous research has indicated that the IGHV1–72*01 (VH186.2) variable gene in the heavy chain is preferentially used in the anti-NP response in C57BL/6 mice [33] via two distinct mutation paths: one in which a W33L mutation greatly increases affinity to NP, and another in which high affinity is achieved through K59R and the accumulation of several other mutations, such as S66N [34, 35]. Consequently, we expect that for clonotypes containing both W33L and K59R, these mutations would tend towards occurring in distinct lineages of the tree. We analyzed the inferred pairwise relationships between W33L & K59R in the 26 lineage trees that contained both mutations. Specifically, we categorized the relationship as K59R → W33L if K59R was ancestral to W33L, W33L → K59R if W33L was ancestral to K59R, incomparable if W33L and K59R occurred on distinct lineages of the tree and same if they were introduced on the same edge of the lineage tree. Indeed, we confirmed the tendency for mutual exclusivity of W33L and K59R by finding that the proportion of pairwise introductions categorized as incomparable was 0.69 and 0.67 for TRIBAL and IgPhyML, respectively (Fig. 4d). Additionally, it has been suggested that W33L mutations appear relatively early during the anti-NP response, whereas the K59R and S66N mutations typically appear later in the evolutionary history [36]. Defining level as the length of the shortest path from the MRCA of all B cells, we observed that W33L occurred at a median level of 1 for both TRIBAL and IgPhyML while both the K59R and S66N mutation occurred at a median level 2 for both methods (Fig. S9). This indicated that W33L was typically introduced earlier in the evolutionary history of a clonotype than K59R and S66N. Thus, both TRIBAL and IgPhyML trees recapitulate expected mutation patterns for this model system.
We next assessed the extent of agreement with isotype information. While TRIBAL infers isotype labels of ancestral nodes, IgPhyML does not have this capability. Therefore, we developed a new metric called average isotype clade entropy, which is computed with respect to the isotype labeling of the leaf set. For this metric, we compute the entropy of each clade in a tree with respect to the isotype labels of all leaves that descend from the node defining that clade, and then take the average entropy over all non-trivial clades, i.e., excluding the root and the leaves (Appendix E.1). As IgPhyML returns bifurcating trees, we collapse edges with zero branch length for a fairer comparison of this metric. We observed lower average isotype clade entropy for TRIBAL (median: 0.82) than for IgPhyML (median: 0.91; Fig. 4f). Fig. 4e depicts the lineage trees inferred by TRIBAL and IgPhyML for the NP-KLH-2a dataset (clonotype B_34_1_5_41_1_5). The TRIBAL inferred tree for this clonotype had lower isotype clade entropy than the IgPhyML tree (TRIBAL: 0.51 vs. IgPhyML: 0.86) while also resulting in a greater HLP19 likelihood (TRIBAL: −366.5 vs. IgPhyML: −369.5). Thus, we find that the trees identified by TRIBAL are in better agreement with the leaf isotypes than those identified by IgPhyML.
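As an illustration of the metric described above, the following Python sketch computes an average isotype clade entropy on a toy tree. The exact definition is given in Appendix E.1; here we assume Shannon entropy with base-2 logarithms, and the `children` map, the function names, and the convention of returning 0 when a tree has no non-trivial clade are our own assumptions.

```python
import math
from collections import Counter


def leaf_labels(children, isotype, node):
    """Collect the isotype labels of all leaves in the clade rooted at `node`."""
    kids = children.get(node, [])
    if not kids:
        return [isotype[node]]
    labels = []
    for kid in kids:
        labels.extend(leaf_labels(children, isotype, kid))
    return labels


def entropy(labels):
    """Shannon entropy (base 2, an assumption) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def average_clade_entropy(children, isotype, root):
    """Average entropy over internal nodes other than the root."""
    clades = [u for u in children if children[u] and u != root]
    if not clades:
        return 0.0  # assumed convention for trees without non-trivial clades
    return sum(entropy(leaf_labels(children, isotype, u)) for u in clades) / len(clades)


toy_isotype = {"l1": "IgM", "l2": "IgM", "l3": "IgG1", "l4": "IgA"}
# Tree 1 groups the two IgM leaves together; tree 2 splits them.
tree_1 = {"root": ["u", "v"], "u": ["l1", "l2"], "v": ["l3", "l4"]}
tree_2 = {"root": ["u", "v"], "u": ["l1", "l3"], "v": ["l2", "l4"]}

print(average_clade_entropy(tree_1, toy_isotype, "root"))  # 0.5
print(average_clade_entropy(tree_2, toy_isotype, "root"))  # 1.0
```

In this toy example, the tree that keeps leaves of the same isotype in one clade scores lower, which is the sense in which lower values indicate better agreement with the leaf isotypes.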
In addition to the inferred B cell lineage trees, TRIBAL also inferred isotype transition probabilities for each dataset (Fig. 4g, Fig. S10). All three inferred isotype transition probability matrices more closely matched a CSR model of direct switching than a strictly sequential model. To compare the consistency of these estimates across datasets, we computed the Jensen-Shannon divergence (JSD) between the distributions of isotype transition probabilities for each isotype starting state IgM through IgG2c for each dataset pair. We observed low JSD (median: 0.029) across a total of 15 pairwise comparisons, suggesting that the estimated isotype transition probabilities are consistent across datasets.
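The row-wise comparison can be sketched in Python as follows. This is a minimal illustration rather than the analysis code: the base-2 logarithm, the function names, and the toy matrices (which are invented and are not the inferred probabilities) are all assumptions.

```python
import math
from itertools import combinations
from statistics import median


def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def pairwise_row_jsd(matrices):
    """JSD for every dataset pair and every starting state (matrix row)."""
    result = {}
    for (name_a, mat_a), (name_b, mat_b) in combinations(matrices.items(), 2):
        for state, (row_a, row_b) in enumerate(zip(mat_a, mat_b)):
            result[(name_a, name_b, state)] = jsd(row_a, row_b)
    return result


# Invented toy matrices with two starting states and three target isotypes.
toy_matrices = {
    "dataset_1": [[0.7, 0.2, 0.1], [0.0, 0.9, 0.1]],
    "dataset_2": [[0.6, 0.3, 0.1], [0.0, 0.8, 0.2]],
    "dataset_3": [[0.7, 0.1, 0.2], [0.0, 0.9, 0.1]],
}
toy_jsd = pairwise_row_jsd(toy_matrices)
print(len(toy_jsd), median(toy_jsd.values()))
```

With base-2 logarithms the JSD lies between 0 (identical distributions) and 1 (disjoint support), so values close to 0 indicate that two datasets assign similar transition probabilities out of a given starting isotype.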
In summary, these analyses show that the inclusion of isotype information and tree refinement has the potential to yield high-quality lineage tree inference, even under a simpler model of SHM, i.e., parsimony. Moreover, the TRIBAL inferred lineage trees additionally optimize an evolutionary model of CSR, yielding lower isotype entropy partitions of the leaf set than IgPhyML. Finally, the additional inference of isotype transition probabilities has the potential to distinguish between direct and sequential switching events.
4.3. Age-associated B cell (ABC) datasets
We evaluated TRIBAL on three scRNA-seq datasets with V region sequencing that investigated the relationship between age-associated B cells (ABCs) and autoimmune disorders [37]. For each dataset, B cells were extracted from the spleen of an MRL/lpr female mouse and sequenced using 10× 5′ scRNA-seq. The data were processed by the 10× Cell Ranger [21] single-cell bioinformatics pipeline to generate the sequence and isotype for each cell. Nickerson et al. [37] identified clonotype MSAs based on shared V(D)J alleles for the heavy chain using the dowser package [14] and inferred B cell lineage trees using IgPhyML for each clonotype. After filtering out clonotypes with fewer than 5 sequences, we retained 599 B cells and 54 clonotypes across the three datasets (Table S2). Fig. 5a shows the proportion of isotypes and annotations by mouse for the retained B cells. Of these 54 clonotypes, 35 had more than one distinct isotype across the sequenced B cells, with a median of 3 distinct isotypes per clonotype.
We ran TRIBAL separately on each of the three mouse datasets, obtaining a maximum parsimony forest for each clonotype via dnapars. Similar to our NP-KLH analysis, we found that TRIBAL effectively utilized the additional isotype data to reduce the number of optimal solutions relative to those identified by dnapars (mean: 8.1 vs. 1.3; max: 165 vs. 4; Fig. 5b). The HLP19 likelihoods of the TRIBAL inferred lineage trees had high concordance with those of the IgPhyML inferred trees (mean absolute deviation: 0.97), with TRIBAL yielding a higher likelihood for 53% of the clonotypes (Fig. 5c, Fig. S11). The average isotype clade entropy for the 35 clonotypes with more than one distinct isotype was significantly lower for TRIBAL than for IgPhyML (median: 0.49 vs. 0.77; Fig. 5d). An example comparison is shown in Fig. 5e,f for clonotype Mouse-1 775. The tree refinement step of TRIBAL yielded a tree with a significantly lower average isotype clade entropy than the IgPhyML tree (0.65 vs. 1.2), while both trees had identical HLP19 likelihoods (−41.4). Finally, we observed that the isotype transition probabilities reveal evidence for both direct and sequential switching of isotypes (Fig. S12).
In summary, both TRIBAL and IgPhyML yield lineage trees with very similar HLP19 likelihoods, giving support for the validity of the TRIBAL inferred lineage trees in terms of sequence evolution. However, TRIBAL jointly optimizes evolutionary models for both SHM and CSR, yielding trees with lower average isotype clade entropy.
5. Conclusion
The development and application of methods for inferring B cell lineage trees and isotype transition probabilities from scRNA-seq data is crucial for advancing our understanding of the immune system. In this work, we introduced TRIBAL, a method to infer B cell lineage trees and isotype transition probabilities from scRNA-seq data. TRIBAL makes use of existing maximum parsimony methods to optimize an evolutionary model for SHM, then incorporates isotype data to find the most parsimonious refinement, i.e., the refinement maximizing the CSR likelihood, among the input set of trees. We proved that the subproblem of finding a refinement maximizing the CSR likelihood is NP-hard. We demonstrated the effectiveness of TRIBAL via in silico experiments and on experimental data. In the in silico experiments, we showed the importance of tree refinement for both accurately estimating isotype transition probabilities and lineage tree inference. Furthermore, we demonstrated on experimental data that TRIBAL returns lineage trees with HLP19 likelihoods similar to those of IgPhyML, despite utilizing a less complex model for sequence evolution, while yielding a reduction in the entropy of the isotype leaf labelings.
There are several directions for future research. First, integration of germline “sterile” transcripts may offer a way to initialize the TRIBAL inferred isotype transition probabilities [38]. Second, many existing B cell lineage inference methods, such as IgPhyML, yield multifurcating trees when zero-length branches are collapsed. There exists an opportunity to combine likelihood- or distance-based inference methods with the tree refinement step of TRIBAL. Third, the MPTR problem has a more general formulation with the potential for wider applications beyond the problem of B cell lineage inference. For example, sample location is useful in refining tumor phylogenies with polytomies [15]. On a related note, we hypothesize that there are special cases of the MPTR problem and its more general formulation that are in P. Such special cases may include a weight matrix with unit costs and an upper triangular weight matrix that adheres to the triangle inequality. Fourth, the assumption that a single isotype transition probability matrix is shared by all clonotypes could be relaxed to allow the inference of multiple matrices per experiment and an assignment of clonotypes to an inferred matrix. Fifth, TRIBAL could be extended to allow for the correction of inaccurately clonotyped B cells. Finally, more robust evolutionary models for SHM are needed to capture the presence of complex mutations, such as indels, introduced during affinity maturation [39, 40].
Acknowledgments.
The authors thank Harinder Singh, Ken Hoehn, Mark Shlomchik and Margie Ackerman for insightful discussions. This work was partially supported by NIH grant DP2AI177884 (A.A.K.) and by the National Science Foundation grant CCF-2046488 (M.E-K.). This work used resources, services, and support provided via the Greg Gulick Honorary Research Award Opportunity supported by a gift from Amazon Web Services.
Availability:
TRIBAL is available at https://github.com/elkebir-group/tribal
References
- [1] Murphy K. & Weaver C. Janeway’s immunobiology (Garland Science, 2016).
- [2] Meffre E., Casellas R. & Nussenzweig M. C. Antibody regulation of B cell development. Nature Immunology 1, 379–385 (2000).
- [3] Tonegawa S. Somatic generation of antibody diversity. Nature 302, 575–581 (1983).
- [4] Victora G. D. & Nussenzweig M. C. Germinal centers. Annual Review of Immunology 30, 429–457 (2012).
- [5] Stavnezer J., Guikema J. E. & Schrader C. E. Mechanism and regulation of class switch recombination. Annual Review of Immunology 26, 261 (2008).
- [6] Drummond A. J. & Rambaut A. BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evolutionary Biology 7, 1–8 (2007).
- [7] Felsenstein J. PHYLIP (phylogeny inference package) version 3.6. Distributed by the author. http://www.evolution.gs.washington.edu/phylip.html (2004).
- [8] Nguyen L.-T., Schmidt H. A., Von Haeseler A. & Minh B. Q. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Molecular Biology and Evolution 32, 268–274 (2015).
- [9] Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30, 1312–1313 (2014).
- [10] Hoehn K. B., Lunter G. & Pybus O. G. A phylogenetic codon substitution model for antibody lineages. Genetics 206, 417–427 (2017).
- [11] Hoehn K. B. et al. Repertoire-wide phylogenetic models of B cell molecular evolution reveal evolutionary signatures of aging and vaccination. Proceedings of the National Academy of Sciences 116, 22664–22672 (2019).
- [12] DeWitt W. S. III, Mesin L., Victora G. D., Minin V. N. & Matsen F. A. IV Using genotype abundance to improve phylogenetic inference. Molecular Biology and Evolution 35, 1253–1265 (2018).
- [13] Davidsen K. & Matsen F. A. IV Benchmarking tree and ancestral sequence inference for B cell receptor sequences. Frontiers in Immunology 9, 2451 (2018).
- [14] Hoehn K. B., Pybus O. G. & Kleinstein S. H. Phylogenetic analysis of migration, differentiation, and class switching in B cells. PLoS Computational Biology 18, e1009885 (2022).
- [15] El-Kebir M., Satas G. & Raphael B. J. Inferring parsimonious migration histories for metastatic cancers. Nature Genetics 50, 718–726 (2018).
- [16] Slatkin M. & Maddison W. P. A cladistic measure of gene flow inferred from the phylogenies of alleles. Genetics 123, 603–613 (1989).
- [17] Abdollahi N., Jeusset L., de Septenville A., Davi F. & Bernardes J. S. Reconstructing B cell lineage trees with minimum spanning tree and genotype abundances. BMC Bioinformatics 24, 70 (2023).
- [18] Canzar S., Neu K. E., Tang Q., Wilson P. C. & Khan A. A. BASIC: BCR assembly from single cells. Bioinformatics 33, 425–427 (2017).
- [19] Neu K. E., Tang Q., Wilson P. C. & Khan A. A. Single-cell genomics: approaches and utility in immunology. Trends in Immunology 38, 140–149 (2017).
- [20] Reiman D. et al. Pseudocell tracer—a method for inferring dynamic trajectories using scRNAseq and its application to B cells undergoing immunoglobulin class switch recombination. PLoS Computational Biology 17, e1008094 (2021).
- [21] Zheng G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nature Communications 8, 14049 (2017). doi: 10.1038/ncomms14049.
- [22] Hamming R. W. Error detecting and error correcting codes. The Bell System Technical Journal 29, 147–160 (1950).
- [23] Horns F. et al. Lineage tracing of human B cells reveals the in vivo landscape of human antibody class switching. eLife 5, e16578 (2016).
- [24] Ronen O., Rohlicek J. R. & Ostendorf M. Parameter estimation of dependence tree models using the EM algorithm. IEEE Signal Processing Letters 2, 157–159 (1995).
- [25] Chow C. & Liu C. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory 14, 462–467 (1968).
- [26] Sridhar S., Lam F., Blelloch G. E., Ravi R. & Schwartz R. Mixed integer linear programming for maximum-parsimony phylogeny inference. IEEE/ACM Transactions on Computational Biology and Bioinformatics 5, 323–331 (2008).
- [27] Kullback S. & Leibler R. A. On information and sufficiency. The Annals of Mathematical Statistics 22, 79–86 (1951).
- [28] Robinson D. F. & Foulds L. R. Comparison of phylogenetic trees. Mathematical Biosciences 53, 131–147 (1981).
- [29] Sankoff D. Minimal mutation trees of sequences. SIAM Journal on Applied Mathematics 28, 35–42 (1975).
- [30] Cumano A. & Rajewsky K. Clonal recruitment and somatic mutation in the generation of immunological memory to the hapten NP. The EMBO Journal 5, 2459–2468 (1986).
- [31] Chen D. et al. Coupled analysis of transcriptome and BCR mutations reveals role of OXPHOS in affinity maturation. Nature Immunology 22, 904–913 (2021). doi: 10.1038/s41590-021-00936-y.
- [32] Stephenson E. et al. Single-cell multi-omics analysis of the immune response in COVID-19. Nature Medicine 27, 904–916 (2021). doi: 10.1038/s41591-021-01329-2.
- [33] Allen D., Simon T., Sablitzky F., Rajewsky K. & Cumano A. Antibody engineering for the analysis of affinity maturation of an anti-hapten response. The EMBO Journal 7, 1995–2001 (1988).
- [34] Weiser A. A. et al. Affinity maturation of B cells involves not only a few but a whole spectrum of relevant mutations. International Immunology 23, 345–356 (2011).
- [35] Furukawa K., Akasako-Furukawa A., Shirai H., Nakamura H. & Azuma T. Junctional amino acids determine the maturation pathway of an antibody. Immunity 11, 329–338 (1999).
- [36] Jacob J., Przylepa J., Miller C. & Kelsoe G. In situ studies of the primary immune response to (4-hydroxy-3-nitrophenyl) acetyl. III. The kinetics of V region mutation and selection in germinal center B cells. The Journal of Experimental Medicine 178, 1293–1307 (1993).
- [37] Nickerson K. M. et al. Age-associated B cells are heterogeneous and dynamic drivers of autoimmunity in mice. Journal of Experimental Medicine 220, e20221346 (2023).
- [38] Ng J. C. et al. sciCSR infers B cell state transition and predicts class-switch recombination dynamics using single-cell transcriptomic data. bioRxiv 2023–02 (2023). doi: 10.1038/s41592-023-02060-1.
- [39] Tan J. et al. A LAIR1 insertion generates broadly reactive antibodies against malaria variant antigens. Nature 529, 105–109 (2016).
- [40] Lupo C., Spisak N., Walczak A. M. & Mora T. Learning the statistics and landscape of somatic mutation-induced insertions and deletions in antibodies. PLOS Computational Biology 18, e1010167 (2022).
- [41] Foulds L. R. & Graham R. L. The Steiner problem in phylogeny is NP-complete. Advances in Applied Mathematics 3, 43–49 (1982).
- [42] Karp R. M. Reducibility among combinatorial problems (Springer, 2010).
- [43] Warnow T. Computational phylogenetics: an introduction to designing methods for phylogeny estimation (Cambridge University Press, 2017).