SURROGATE SELECTION OVERSAMPLES EXPANDED T CELL CLONOTYPES

Peng Yu; Yumin Lian; Cindy L Zuleger; Richard J Albertini; Mark R Albertini; Michael A Newton

doi:10.1101/2023.07.13.548950

This is a preprint.

It has not yet been peer reviewed by a journal.


[Preprint]. 2023 Jul 15:2023.07.13.548950. [Version 1] doi: 10.1101/2023.07.13.548950


PMCID: PMC10369934 PMID: 37503118

Abstract

Inference from immunological data on cells in the adaptive immune system may benefit from modeling specifications that describe variation in the sizes of various clonal sub-populations. We develop one such specification in order to quantify the effects of surrogate selection assays, which we confirm may lead to an enrichment for amplified, potentially disease-relevant $T$ cell clones. Our specification couples within-clonotype birth-death processes with an exchangeable model across clonotypes. Beyond enrichment questions about the surrogate selection design, our framework enables a study of sampling properties of elementary sample diversity statistics; it also points to new statistics that may usefully measure the burden of somatic genomic alterations associated with clonal expansion. We examine statistical properties of immunological samples governed by the coupled model specification, and we illustrate calculations in surrogate selection studies of melanoma and in single-cell genomic studies of $T$ cell repertoires.

Keywords: Bayes’s rule, clonal expansion, diversity statistic, enrichment, exchangeable birth-death processes, experimental design, single cell sequencing, size bias, somatic mutation, Yule-Simon law

1. Introduction.

1.1. Overview.

With thymic-derived lymphocytes (i.e., T cells) sampled from peripheral blood or some other tissue compartment (e.g., tumor-infiltrating lymphocytes), any technique that would enrich the sample for disease-relevant cells could be useful, considering the complexity of a typical $T$ cell population and the potential for an improved understanding of the immune response to disease. For example, at the time of writing we have no effective biomarkers to predict how a melanoma patient will respond to immune checkpoint inhibition therapy, though responses among similar patients may vary from morbid toxicity to full recovery (e.g., Ganesan and Mehnert, 2020; Shum, Larkin and Turajlic, 2022).

Surrogate selection restricts a lymphocyte sample in vitro to cells whose somatic ancestors had acquired and thus transmitted to them specific, selectable mutations. Selection assays based on mutations of the hypoxanthine-guanine phosphoribosyltransferase (HPRT) gene are most well studied, though the approach applies to any mutations that are neutral with respect to the immune response (Kaitz et al., 2022). As an immune-system probe, HPRT surrogate selection has been used to study a variety of environmental effects and disease processes (Albertini, Castle and Borcherding, 1982; Albertini, 2001; Kaitz et al., 2022). With continued focus on disease studies, we examine the sampling effects of surrogate selection; selected cells may represent in vivo amplified clones that are more likely to be disease relevant than clones of randomly sampled cells, and we seek a more thorough understanding of this enrichment phenomenon for the sake of improved experimental design and data analysis.

The idea that surrogate selection can enrich for clonally amplified $T$ cells has provided a rationale in many studies, though quantitative treatments of this experimental-design strategy remain very limited. Statistical procedures have been deployed to test from sequence data the null hypothesis that enrichment is absent, and the mounting evidence supports the alternative (e.g., Pei et al., 2014; Zuleger et al., 2020). Considering cell growth dynamics, one would predict an increased prevalence of various somatic mutations in cells within an actively proliferating clone compared to a relatively quiescent one. Then, conditioning on the presence of some such mutation in a sampled cell, Bayes’s rule would imply that the cell is more likely to be from the proliferating than the quiescent clone. Surrogate selection thus relies on the biological consequences of in vivo clonal proliferation to enrich for activated T cells in individuals with ongoing immunological response to disease. Understanding this enrichment effect is complicated by the enormous complexity of the $T$ cell population and by properties of the distribution of clone sizes, but resolving these complications will inform investigations of surrogate selection as a mechanistic probe for fundamental biological/immunological processes. The main contribution of the present work is to quantify the enrichment effect of surrogate selection in an idealized but structurally relevant setting, and to leverage basic stochastic-process theory to confirm and characterize the enrichment phenomenon in this model. Our formulation also enables a study of distributional properties of elementary diversity statistics, of the type often used in experimental studies. We show that samples identified using surrogate selection have lower expected sample diversity, in agreement with empirical studies.

Our theoretical analysis exposes an interesting statistical prediction concerning somatic mutations that are unrelated to any selection assay. From contemporary single-cell genomic studies, we associate $T$ cell clone sizes with estimates of somatic mutation burden, and thereby provide a new measure of somatic burden of a $T$ cell receptor.

1.2. Immunological setting.

Consider a person’s $T$ cell repertoire, comprised of perhaps 10¹¹ or more CD4+ and CD8+ naive, effector, and memory $T$ cells, and partitioned into clonotypes within each of which the $T$ cell receptor (TCR) sequence of the cells is constant (e.g., Nikolich-Žugich, Slifka and Messaoudi, 2004; Pennock et al., 2013; van den Broek, Borghans and van Wijk, 2018). The number of $T$ cells in each clonotype fluctuates over time and usefully may be viewed as a stochastic process (Currie et al., 2012; Hodgkin, Dowling and Duffy, 2014; Desponds, Mora and Walczak, 2016; Gaimann et al., 2020; Smith et al., 2020; Molina-París and Lythe, 2021). Notably, a $T$ cell receptor’s cognate antigen may induce cell division and expansion of the associated clonotype when appropriate costimulatory molecules are present. Complexity of the adaptive immune response warrants highly detailed stochastic-model dynamics, perhaps accounting for clonal competition or adaptation (e.g., Stirk, Molina-París and van den Berg, 2008; Lythe and Molina-París, 2018; Rane et al., 2018; Duque et al., 2020). However, even structurally simple models can support certain lines of investigation and can guide statistical analysis in the growing number of empirical studies. T cell receptor repertoire analysis has been critical in studies investigating antitumor responses as well as immune-related toxicity following treatment with immune-checkpoint blockade (e.g., Fairfax et al., 2020; Valpione et al., 2020; Lozano et al., 2022; Valpione et al., 2021).

1.3. Surrogate selection.

In the absence of an assay to measure the proliferation history of a sampled $T$ cell, surrogate selection provides an indirect measurement through the lens of neutral somatic mutation. The most well-studied case leverages an assay to score somatic mutations of hypoxanthine-guanine phosphoribosyltransferase (HPRT) (Albertini et al., 1990; Albertini, 2001). Other assays rely on an efficient approach to screen mutations in phosphoinositolglycan class A (PIG-A) genes (Peruzzi et al., 2010; Dobrovolsky et al., 2017). Coding an enzyme within the purine salvage pathway, HPRT normally helps to recycle nucleotide bases from degraded DNA. Its post-translational modifications also confer cytotoxicity to purine analogs, including 6-thioguanine (6TG). Cultured lymphocytes are thus unable to grow in the presence of 6TG unless they have incurred an inactivating HPRT mutation. Each surviving $T$ cell in an HPRT assay reports that an HPRT mutation occurred in that $T$ cell or in one of its somatic ancestors. The assay has been used to monitor somatic mutations in many settings, including, for example, in Chernobyl liquidators (Jones et al., 2002), in Iraq war veterans (Nicklas et al., 2015), and in studies of environmental exposures. Kaitz et al. (2022) reviews the implicit model for surrogate selection and the literature using HPRT surrogate selection in autoimmune diseases, cardiac transplantation, infectious diseases, a hematological disease, and cancer.

1.4. Summary of findings.

The rationale for surrogate selection in disease studies is that it provides an enrichment for relevant $T$ cell clonotypes. Some care is required in this argument, since while a large, expanded clonotype has higher sampling probability than any smaller clonotype, the vast diversity within a typical $T$-cell repertoire means that even large clonotypes remain a small fraction of the total population; indeed, most sampled cells come from small clonotypes. Basic stochastic process theory guides our effort to balance these factors. We find that if at any time point the vector of clonotype sizes in a repertoire is exchangeable, and if the temporal development of any one clonotype follows a sufficiently regular birth-death process, then surrogate selection via neutral somatic mutation enriches the sampled cells for those of larger clonotypes. We examine the impact of surrogate selection on the expected value of sample diversity statistics. In empirical validations, we re-examine single-cell data from publicly available $T$ cell repertoire samples that were obtained via 10x Genomics sequencing; in doing so we compute cell-level somatic burden statistics and associate this burden with clonotype size. We also review sample diversity statistics from available surrogate-selection studies.

2. One developing clonotype.

2.1. Model set up.

Our calculations begin by considering one clonotype of the many within an individual subject’s T cell repertoire. For definiteness, we label this clonotype $σ$ , recognizing that $σ$ resides in a large finite label set $𝒮$ , which we associate with the set of possible $T$ cell receptor sequences. At time $t \geq 0$ relative to some reference time point $t = 0$ (e.g., birth), clonotype $σ$ consists of $N_{σ} (t)$ cells. If clonotype $σ$ is ever non-empty, then there is some origin time, say $τ_{σ}$ , such that $N_{σ} (t) = 0$ for $t < τ_{σ}$ and $N_{σ} (t) > 0$ only at times $t \geq τ_{σ}$ . We suppose that $N_{σ} (τ_{σ}) = 1$ ; that is, the clonotype originates upon successful completion of receptor-forming recombination events (Elhanati et al., 2018). After positive and negative selection induce thymocyte maturation, clonotype cells egress from the thymus and distribute themselves throughout the body; we expect this all occurs on a short time scale compared to the timing of typical observations, which might be from a mature subject’s peripheral blood or tumor-infiltrating lymphocytes, for example.

The stochastic process $\{N_{σ} (t) : t \geq 0\}$ fluctuates in response to all sorts of cell-biological factors affecting cells in the clonotype, and must reflect a complex birth-death process (e.g., den Braber et al., 2012; Desponds, Mora and Walczak, 2016; Zhan et al., 2017). For example, in the presence of appropriate cytokines, $T$ cell receptor interaction with cognate antigen triggers cell proliferation, while apoptotic signals can induce cell death. Our understanding of repertoire maintenance further supports the notion that if $N_{σ} (s) = 0$ at time $s > τ_{σ}$, then $N_{σ} (t) = 0$ for all $t \geq s$. This is analogous to the infinite-alleles assumption in population genetics; here it means that a clonotype can only emerge once.

2.2. The branching tree.

Following clonotype $σ$ over time from $τ_{σ}$, there is a series of event times at which cells in the clonotype either divide or die. Were we able to trace $σ$'s complete history, we would record a binary tree, such as in Figure 1. At some observation time $t_{obs}$, each leaf of the tree is an extant cell that has experienced a number of cell divisions since $τ_{σ}$. This division number is also called the depth of the leaf node. For a cell randomly sampled from the clonotype, let $D_{σ}$ denote this division number; it has a probability distribution induced both by the stochastic development of $σ$ and by the random selection of the extant cell. Fortunately, this distribution has been the subject of extensive study in the context of random binary trees (e.g., Lynch, 1965; Mahmoud, 1992; Aldous, 1996; Steel and McKenzie, 2001; Mahmoud and Neininger, 2003).

Fig 1. Binary tree formed by a developing clonotype, showing examples of cell division, cell death and mutation, and noting the number $d$ of cell divisions experienced by each extant cell at time $t_{obs}$. Green circles (extant cells 5 and 6) denote mutant $T$ cells. Empty circles (1, 2, 3, 4 and 7) denote wild type $T$ cells. Green lines denote evolution of mutant cells. Short vertical lines denote cell death.

In the Yule model for trees, each cell division acts on a random cell, as if by a pure-birth process without cell death. This symmetry over cell identity allows various explicit computations. In fact, the probability generating function (p.g.f.) of $D_{σ}$ is

G_{n} (z) = E \{z^{D_{σ}} ∣ N_{σ} (t_{obs}) = n\} = \frac{⟨ 2 z ⟩_{n - 1}}{n!},

(1)

which is the formulation presented in (Mahmoud, 1992, Page 71–74), Eq. (2.4). ¹ Here $⟨ x ⟩_{n} = x (x + 1) (x + 2) \dots (x + n - 1)$ is the rising factorial, which is conveniently expressed in terms of Gamma and Beta functions $Γ$ and $B$ as:

\frac{⟨ x ⟩_{n - 1}}{n!} = \frac{Γ (x + n - 1)}{Γ (x) Γ (n + 1)} = \frac{1}{(x + n) (x + n - 1)} \cdot \frac{1}{B (x, n + 1)} .

The p.g.f. $G_{n}$ helps us connect the $T$ cell repertoire with surrogate-selection dynamics. Before pursuing that calculation, we note that the expectation and variance of $D_{σ}$ are also available, with both well approximated by twice the natural logarithm of $n$, and that as $n$ increases, $\{D_{σ} - 2 \log (n)\} / \sqrt{2 \log (n)}$ converges in distribution to a standard normal variate (Brown and Shubert, 1984; Mahmoud and Neininger, 2003). Roughly, a randomly sampled cell from a randomly proliferating clonotype of current size $n$ (and ignoring cell death) has experienced about $2 \log (n)$ cell divisions since receptor formation in the thymus. Sampling from the conditional distribution of $D_{σ} ∣ N_{σ} (t_{obs}) = n$ is reported in Figure 2, revealing this proliferation effect for a handful of clonotype sizes. For completeness, we note the p.m.f. of $D_{σ}$ is, as derived in Lynch (1965),

P \{D_{σ} = d ∣ N_{σ} (t_{obs}) = n\} = \frac{2^{d}}{n!} S (n - 1, d), d = 0, 1, \dots, n - 1,

(2)

where $S (n - 1, d)$ is the unsigned Stirling number of the first kind.
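As a numerical companion to (1) and (2), the following Python sketch builds the conditional p.m.f. of $D_{σ}$ by expanding the p.g.f. one factor at a time, using $G_{m} (z) = G_{m - 1} (z) (2 z + m - 2) / m$, which follows from (1). The function names are ours and purely illustrative. Differentiating (1) at $z = 1$ gives the exact mean $2 (H_{n} - 1)$, with $H_{n}$ the $n$-th harmonic number, which agrees with $2 \log (n)$ to leading order.

```python
import math


def division_pmf(n):
    """Return [P(D = d | N = n) for d = 0..n-1] under the Yule model.

    Expands G_n(z) = <2z>_{n-1} / n! by multiplying in one factor at a
    time: G_m(z) = G_{m-1}(z) * (2z + m - 2) / m, starting from G_1(z) = 1.
    """
    coef = [1.0]  # G_1(z) = 1: a single cell has divided zero times
    for m in range(2, n + 1):
        new = [0.0] * (len(coef) + 1)
        for d, c in enumerate(coef):
            new[d] += c * (m - 2) / m   # constant part of the factor
            new[d + 1] += c * 2.0 / m   # the 2z part raises the degree
        coef = new
    return coef


def mean_divisions(n):
    """Mean of D given N = n, computed from the p.m.f."""
    return sum(d * p for d, p in enumerate(division_pmf(n)))


if __name__ == "__main__":
    for n in (10, 100, 1000):
        # exact mean next to the leading-order approximation 2 log(n)
        print(n, mean_divisions(n), 2 * math.log(n))
```

For $n = 3$ the sketch returns masses $1/3$ and $2/3$ on $d = 1$ and $d = 2$, matching the three-leaf tree whose leaves have depths 1, 2 and 2.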

Fig 2. Proliferation effect: Shown are violin plots of the division number $D_{σ}$ for cells in randomly developed binary trees, having various sizes, $n$, at observation time. We used $R$ packages **ape**, to simulate Yule trees, and **adephylo**, to count divisions (Paradis and Schliep, 2019; Jombart, Balloux and Dray, 2010). Each plot summarizes 100,000 simulated $D_{σ}$ values. Empirical medians (white) and asymptotic means $2 \log (n)$ (grey) are shown.

2.3. Neutral mutations.

Surrogate selection aims to use neutral genomic mutations – mutations that do not affect clonotype growth dynamics – as probes to report on these very same dynamics. Uncorrected mitotic errors or other mutagenic effects are expected to occur at some rate throughout the developing repertoire. We focus on mitotic mutations that affect a single daughter cell, that are irreversible, and that occur independently across cell divisions. Less prevalent mechanisms may induce mutations in both daughter cells (e.g., double-stranded breaks) or separately from mitosis (e.g., ionizing radiation), and statistical formulations may be adapted to these cases (e.g., Kendall, 1960; Roshan, Jones and Greenman, 2014). We use $θ \in (0, 1 / 2)$ to denote the relative frequency of mutations at a given locus (e.g., HPRT) per daughter cell; i.e., $2 θ$ is the mutation frequency per cell division.

Consider the thought experiment to sample a single cell uniformly at random from the extant clonotype $σ$ at time $t_{obs}$, and let $M_{σ}$ be the binary (0/1) indicator that the sampled cell harbors a mutation at the locus in question. We recognize that $M_{σ}$ really indicates that a mutation event occurred somewhere in the ancestral lineage of the cell, and thus

P \{M_{σ} = 1 ∣ D_{σ} = d, N_{σ} (t_{obs}) = n\} = 1 - (1 - θ)^{d}

(3)

where $D_{σ}$ is the division number for this random cell. (The cell is not mutant if none of the $d$ opportunities for mutation yield such.) Incidentally, (3) implies that $M_{σ}$ and $N_{σ} (t_{obs})$ are conditionally independent given $D_{σ}$ . Our first finding concerns the rate of mutant genotype in clonotypes of a given size, and is obtained by marginalizing the distribution of $D_{σ}$ . With neutral mutations in a Yule tree model, define $ψ_{n} : = P \{M_{σ} = 1 ∣ N_{σ} (t_{obs}) = n\}$ , and note,

\begin{array}{l} ψ_{n} = \sum_{d = 0}^{\infty} P (M_{σ} = 1 ∣ D_{σ} = d) P \{D_{σ} = d ∣ N_{σ} (t_{obs}) = n\} \\ = \sum_{d = 0}^{\infty} \{1 - (1 - θ)^{d}\} P \{D_{σ} = d ∣ N_{σ} (t_{obs}) = n\} \\ = 1 - G_{n} (1 - θ) \\ = 1 - \frac{Γ (n + 1 - 2 θ)}{Γ (n + 1) Γ (2 - 2 θ)} \approx 1 - \frac{1}{n^{2 θ} Γ (2 - 2 θ)}, \end{array}

(4)

with the approximation on the last line improving for increasing $n$. Result (4) quantifies the intuition that proliferating clonotypes provide a greater number of chances for mutation. With $θ > 0$, $\lim_{n \to \infty} ψ_{n} = 1$, and so an ever-proliferating clonotype is eventually dominated by mutant cells. This matches limit theory for birth-death processes in which the growth rate of mutant cells is no less than that of wild-type cells (e.g., Cheek and Antal, 2018).
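To make (4) concrete, here is a minimal Python sketch that evaluates $ψ_{n}$ three ways: through the product identity $Γ (n + 1 - 2 θ) / \{Γ (n + 1) Γ (2 - 2 θ)\} = \prod_{k = 2}^{n} (1 - 2 θ / k)$, directly through log-Gamma functions, and through the large-$n$ approximation. The function names are illustrative assumptions; the product form is used because it is numerically stable when $θ$ is very small.

```python
import math


def psi_exact(n, theta):
    """Mutant probability psi_n from (4), via prod_{k=2}^{n} (1 - 2*theta/k)."""
    log_wild = sum(math.log1p(-2.0 * theta / k) for k in range(2, n + 1))
    return -math.expm1(log_wild)  # 1 - exp(log_wild), accurate near zero


def psi_gamma(n, theta):
    """Same quantity, written directly with log-Gamma functions."""
    log_wild = (math.lgamma(n + 1 - 2 * theta)
                - math.lgamma(n + 1)
                - math.lgamma(2 - 2 * theta))
    return -math.expm1(log_wild)


def psi_approx(n, theta):
    """Large-n approximation on the last line of (4)."""
    return 1.0 - 1.0 / (n ** (2 * theta) * math.gamma(2 - 2 * theta))


if __name__ == "__main__":
    theta = 0.05
    for n in (2, 10, 100, 10_000):
        print(n, psi_exact(n, theta), psi_gamma(n, theta), psi_approx(n, theta))
```

Two sanity checks follow from the definitions: $ψ_{1} = 0$, since a founding cell has not divided, and $ψ_{2} = θ$, since each cell in a two-cell clonotype is a daughter of exactly one division.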

We are not too concerned with the total number of mutant cells in the clonotype, whose expected value is $n$ times the per-cell rate in (4), though our diversity calculations in Section 3.5 rely on this distribution. That total mutant count is interesting in other settings, and is governed by the Luria-Delbrück distribution; see Angerer (2001) or Roshan, Jones and Greenman (2014) for the exact, non-asymptotic formulation. The reader may check that our formula (4) matches the first-moment formula from Roshan, Jones and Greenman (2014), Theorem 3.3, taking $n = k$ and $μ_{1} = 1 - μ_{0} = 2 θ$; interestingly, a quite different approach is taken in that paper.

2.4. Enrichment and Bayes rule.

The development so far has emphasized probabilities that condition in some way on clonotype size. Next we layer in a distribution on that size itself; the stochastic evolution of a specific clonotype $σ$ induces a distribution on the size $N_{σ} (t_{obs})$ at observation time. For example, the linear pure-birth model leads to the Geometric $\{\exp (- λ_{σ} t_{obs})\}$ distribution,

P \{N_{σ} (t_{obs}) = n\} = e^{- λ_{σ} t_{obs}} {(1 - e^{- λ_{σ} t_{obs}})}^{n - 1}, n \geq 1

(5)

where $λ_{σ}$ is the birth rate (rate of cell division). Further, compounding over $λ_{σ}$ gives the Yule-Simon law, with parameter $ρ > 0$ ,

P \{N_{σ} (t_{obs}) = n\} = ρ B (n, ρ + 1) = \frac{ρ Γ (ρ + 1) Γ (n)}{Γ (n + ρ + 1)} \approx \frac{ρ Γ (ρ + 1)}{n^{ρ + 1}},

(6)

where the approximation improves with increasing $n$. This is approximately a power-law, or Zipf distribution, which has been found to fit many T-cell repertoires (e.g., Bolkhovskaya, Zorin and Ivanchenko, 2014; Desponds, Mora and Walczak, 2016; Koch et al., 2018; Gaimann et al., 2020; de Greef et al., 2020), with exponents $ρ$ in the range 0.05 to 0.2. Other marginal distributions on $N_{σ} (t_{obs})$ may be induced by more complex stochastic dynamics, such as those modeling competition and thymic pressure (Lythe and Molina-París, 2018).
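The step from (5) to (6) can be checked numerically. The sketch below assumes one specific mixing law, namely that the Geometric parameter $e^{- λ_{σ} t_{obs}}$ is distributed as Beta$(ρ, 1)$ (equivalently, $λ_{σ} t_{obs}$ is exponential with rate $ρ$); this is our assumption for illustration, chosen because it reproduces (6). Function names are hypothetical, and the integral is approximated by a plain midpoint rule.

```python
import math


def yule_simon_pmf(n, rho):
    """P(N = n) = rho * B(n, rho + 1), the first expression in (6)."""
    log_beta = math.lgamma(n) + math.lgamma(rho + 1) - math.lgamma(n + rho + 1)
    return rho * math.exp(log_beta)


def compounded_geometric(n, rho, grid=20000):
    """Average the Geometric(w) mass w (1 - w)^(n-1) over w ~ Beta(rho, 1).

    Uses w = u**(1/rho) with u uniform on (0, 1), and a midpoint rule in u.
    """
    h = 1.0 / grid
    total = 0.0
    for j in range(grid):
        w = ((j + 0.5) * h) ** (1.0 / rho)
        total += w * (1.0 - w) ** (n - 1)
    return total * h


def power_law_approx(n, rho):
    """Tail approximation on the right-hand side of (6)."""
    return rho * math.gamma(rho + 1) / n ** (rho + 1)


if __name__ == "__main__":
    rho = 0.1
    for n in (1, 5, 50):
        print(n, yule_simon_pmf(n, rho), compounded_geometric(n, rho))
    print(yule_simon_pmf(10**5, rho) / power_law_approx(10**5, rho))
```

Under this assumed mixing law the compounded Geometric masses agree with $ρ B (n, ρ + 1)$ up to quadrature error, and the ratio of the exact mass to the power-law approximation approaches one for large $n$.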

Combining the forward, mutant-genotype model (4) with a size model $P \{N_{σ} (t_{obs}) = n\}$ , we have by conditioning:

\begin{array}{l} P \{N_{σ} (t_{obs}) = n ∣ M_{σ} = 1\} = \frac{P \{M_{σ} = 1 ∣ N_{σ} (t_{obs}) = n\} P \{N_{σ} (t_{obs}) = n\}}{P (M_{σ} = 1)} \\ = \frac{P \{N_{σ} (t_{obs}) = n\}}{P (M_{σ} = 1)} \{1 - \frac{Γ (n + 1 - 2 θ)}{Γ (n + 1) Γ (2 - 2 θ)}\} . \end{array}

(7)

This Bayesian inversion of (4) quantifies surrogate selection’s enrichment effect in the pure-birth case. One setting is shown in Figure 3, which illustrates the suppression of probability on small clonotypes and inflation for larger ones. In that example, the median of the unconditional Geometric distribution is 6931 cells, while after conditioning on $M_{σ} = 1$, the median clonotype size shifts up to 8139 cells. This effect is not limited to the marginal Geometric law. Figure 4 shows the results for a Logarithmic distribution (p.m.f. proportional to $p^{n} / n$) and a Yule-Simon law (6), respectively. Summarizing the findings for a single, developing clonotype, we have:

Fig 3. $P \{N_{σ} (t_{obs}) = n ∣ M_{σ} = 1\}$ (red) when the marginal distribution (blue) is a Geometric distribution with parameter $e^{- λ t_{obs}} = 10^{- 4}$ and the mutation frequency $θ = 10^{- 6}$. The crossover point $n_{cross}$ is 5624 cells.

Fig 4. $P \{N_{σ} (t_{obs}) = n ∣ M_{σ} = 1\}$ (red) when the marginal clonotype size distribution (blue) is a Logarithmic distribution (left) or a Yule-Simon distribution (right), with parameters $p = 1 - 10^{- 5}$ for the Logarithmic distribution and $ρ = 0.1$ for the Yule-Simon distribution. Mutation frequency $θ = 10^{- 6}$ in both cases. The crossover point $n_{cross}$ equals 326 cells under the Logarithmic distribution, and $n_{cross} = 14270$ under the Yule-Simon distribution.

Proposition 1.

Suppose that, regardless of the marginal distribution of $N_{σ} (t_{obs})$, each cell division in the developing clonotype $σ$ increases the clonotype size by 1 and occurs on a random extant cell, that a non-mutant dividing cell produces one mutant descendant (w.p. $2 θ$) or no mutant descendants (w.p. $1 - 2 θ$), that descendants of a mutant dividing cell are both mutants, that there are no cell deaths, and that $σ$ began with a single non-mutant cell. If $M_{σ}$ indicates that a randomly sampled cell from $σ$ at time $t_{obs}$ is mutant, then the enrichment ratio $ϕ_{n} : = P \{N_{σ} (t_{obs}) = n ∣ M_{σ} = 1\} / P \{N_{σ} (t_{obs}) = n\}$ is:

ϕ_{n} = \frac{1}{P (M_{σ} = 1)} \{1 - \frac{Γ (n + 1 - 2 θ)}{Γ (n + 1) Γ (2 - 2 θ)}\} .

Further, $ϕ_{n}$ is strictly increasing and approaches $1 / P (M_{σ} = 1) > 1$ as $n ⟶ \infty$ .

Two immediate corollaries assure that: (1) there exists a crossover point $n_{cross}$ with $ϕ_{n} < 1$ when $n < n_{cross}$ and $ϕ_{n} > 1$ when $n > n_{cross}$, and (2) the conditional distribution is stochastically larger than the marginal distribution, which is another perspective on the notion that mass is pushed towards larger clonotypes. In fact, monotonicity of $ϕ_{n}$ amounts to saying that the marginal and conditional distributions satisfy the monotone likelihood ratio ordering, which is stronger than stochastic ordering of c.d.f.’s: $P \{N_{σ} (t_{obs}) \geq n ∣ M_{σ} = 1\} \geq P \{N_{σ} (t_{obs}) \geq n\}$ (see Pfanzagl, 1964). Among other things, it also follows that the conditional distribution of $N_{σ} (t_{obs})$ given $M_{σ} = 1$ has larger expected value than the marginal distribution. Conceptually, learning that the sampled cell is mutant tells us that the clonotype is probably larger than we would have guessed otherwise.
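The Bayes inversion (7), the enrichment ratio $ϕ_{n}$ of Proposition 1, and the crossover point can be explored numerically. The sketch below does so for a Geometric size law; it truncates the support at a user-chosen `n_max` and renormalizes, which is our simplification rather than part of the model, and all function names are illustrative. The median is taken as the smallest $n$ whose cumulative probability reaches one half, a convention that may differ by a cell or so from other definitions.

```python
import math


def enrichment_geometric(p, theta, n_max):
    """Enrichment ratio phi_n of Proposition 1 for a Geometric(p) size law.

    Returns (phi, prior, posterior), each a list indexed by n - 1 for
    n = 1..n_max. The size law is truncated at n_max and renormalized.
    """
    prior, psi = [], []
    log_wild = 0.0  # log of 1 - psi_n, updated as n grows
    for n in range(1, n_max + 1):
        if n >= 2:
            log_wild += math.log1p(-2.0 * theta / n)
        psi.append(-math.expm1(log_wild))
        prior.append(p * (1.0 - p) ** (n - 1))
    norm = sum(prior)
    prior = [q / norm for q in prior]
    p_mut = sum(s * q for s, q in zip(psi, prior))    # P(M = 1)
    phi = [s / p_mut for s in psi]
    posterior = [f * q for f, q in zip(phi, prior)]   # Bayes rule (7)
    return phi, prior, posterior


def crossover(phi):
    """Smallest n with phi_n > 1."""
    return next(n for n, f in enumerate(phi, start=1) if f > 1.0)


def median(pmf):
    """Smallest n whose cumulative probability reaches 1/2."""
    acc = 0.0
    for n, q in enumerate(pmf, start=1):
        acc += q
        if acc >= 0.5:
            return n
    return len(pmf)


if __name__ == "__main__":
    # Parameters of Fig 3; n_max = 300000 leaves negligible truncated mass.
    phi, prior, post = enrichment_geometric(1e-4, 1e-6, 300_000)
    print(crossover(phi), median(prior), median(post))
```

Because $ϕ_{n}$ is increasing, the posterior median and mean can only move up relative to the marginal ones, which is the stochastic-ordering statement above in computational form.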

2.5. Beyond pure birth.

Relaxing the no-cell-death assumption makes quantifying enrichment more difficult. Explicit calculations in one example (Appendix A) show that conditioning on $M_{σ} = 1$ does not necessarily enrich for larger clonotypes. That highly stylized example captures features of clonal expansion followed by rapid clonal decline. The intuition is that having sampled a mutant cell, we may only know that its containing clonotype is relatively old, rather than knowing this clonotype is relatively large. These two features are equivalent in the pure-birth model. To develop this intuition further, we pursue calculations in a well-behaved but general class of birth-death processes, and we find conditions within this class which assure the enrichment-for-larger-clonotypes phenomenon.

At times $τ_{1} < τ_{2} < \dots$ after $τ_{σ}$, changes $A_{1}, A_{2}, \dots$ occur that either increase the clonotype size $(A_{i} = 1)$ or decrease the clonotype size $(A_{i} = - 1)$, in the first case by division of a random cell, and in the latter by death of a random cell. Then at time $t$, the clonotype size $N_{σ} (t) = 1 + \sum_{i = 1}^{I (t)} A_{i}$ where $τ_{I (t)} \leq t < τ_{I (t) + 1}$. We suppose this size process $N_{σ} (t)$ is not explosive, and thus only a finite number of $τ_{j}$'s can occur in any finite time interval. We ask that $\{A_{i}\}$ be independent of event times $τ_{1} < τ_{2} < \dots$ so that the discrete clonal history may be treated separately from questions of temporal rates of change. Further, we do not require a Markov condition, though we are mindful that having $A_{i}$ conditionally independent of past changes given $ν_{i - 1} = 1 + \sum_{j = 1}^{i - 1} A_{j}$ provides for a Markovian jump chain $ν_{1}, ν_{2}, \dots$, with $N_{σ} (t) = ν_{I (t)}$ (e.g., Grimmett and Stirzaker, 2001, pg 265). Considering mutation status along the jump chain, we introduce

Ψ (a_{1}, a_{2}, \dots, a_{i}) : = P [M_{σ} = 1 ∣ 𝒜_{i}, I (t_{obs}) = i]

where $𝒜_{i} = \cap_{j = 1}^{i} (A_{j} = a_{j})$ tracks the specific birth-death steps; thus $Ψ$ is the conditional mutant frequency of a cell sampled from $σ$ just after the $i$ birth-death steps indicated by $𝒜_{i}$ . Obviously we cannot sample a cell from an empty clonotype, so we furthermore condition on non-extinction, i.e. $ν_{i} \geq 1$ for all $i$ . The $Ψ$ function generalizes the pure-birth $ψ_{n}$ sequence (4), which we recover with $i = (n - 1)$ and all $a_{j} = 1$ , for example.

Proposition 2.

In a birth-death process as defined above, $Z_{i} : = Ψ (A_{1}, A_{2}, \dots, A_{i})$ is non-decreasing in $i$. If with probability one $\sum_{j = 1}^{i} 1 [A_{j} = 1] / (j + 1)$ diverges as $i \to \infty$, then $Z_{i}$ converges almost surely to the limit 1, and also $E (Z_{i}) = P [M_{σ} = 1 ∣ I (t_{obs}) = i]$ converges to 1. Additionally, if $ξ_{n, i} : = E (Z_{i} ∣ ν_{i} = n)$ is non-decreasing in $i \in \{n - 1, n + 1, n + 3, \dots\}$ for each $n$, then $P [M_{σ} = 1 ∣ N_{σ} (t_{obs}) = n] \geq ψ_{n}$.

In a linear birth-death process for example, and ignoring extinction for the moment, the $A_{i}$'s are i.i.d., with $P (A_{i} = 1) = λ / (λ + μ)$ for birth rate $λ > 0$ and death rate $μ \geq 0$. It is well known that extinction is almost sure when $λ \leq μ$, but also that extinction occurs with probability $μ / λ$ as long as $λ > μ$ (e.g., Grimmett and Stirzaker, 2001, pg 272). We would meet the requirements of Proposition 2 in this case; conditioning on non-extinction conditions on an event of positive probability. Note too that the divergence requirement follows immediately from the three-series theorem (e.g., Billingsley, 1995, pg 290). We have a recursive formula for $ξ_{n, i} = E (Z_{i} ∣ ν_{i} = n)$; namely under the Markov condition for $ν_{1}, ν_{2}, \dots$,

ξ_{n, i} = P \{M_{σ} = 1 ∣ N_{σ} (t_{obs}) = n, I (t_{obs}) = i\} = w_{n, i} ξ_{n + 1, i - 1} + (1 - w_{n, i}) \{ξ_{n - 1, i - 1} (1 - \frac{2 θ}{n}) + \frac{2 θ}{n}\}

where $w_{n, i} = P (A_{i} = - 1 ∣ ν_{i} = n)$. We have not identified conditions assuring this $ξ_{n, i}$ sequence is non-decreasing in $i$ for each $n$ (a requirement for Proposition 2); but numerical experiments in the linear birth-death model (Figure S1) give us confidence that this condition holds in relevant settings. The final lower-bound result in Proposition 2 means that conditioning on mutant status does enrich for larger clonotypes, thus extending Proposition 1. In any case, the monotonicity of $E (Z_{i})$ indicates that such conditioning enriches for older clonotypes regardless of properties of $ξ_{n, i}$.
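A small Python sketch may help in reading the recursion for $ξ_{n, i}$. It follows the expected mutant frequency along one fixed sequence of births and deaths, using an update rule that we read off from the two terms of the recursion (a birth to size $m$ maps $z$ to $z (1 - 2 θ / m) + 2 θ / m$; a death of a uniformly chosen cell leaves the expected frequency unchanged). This path-wise rule is our interpretation, stated here as an assumption, and the function names are hypothetical.

```python
import math


def psi_pure_birth(n, theta):
    """psi_n from (4), in product form."""
    log_wild = sum(math.log1p(-2.0 * theta / k) for k in range(2, n + 1))
    return -math.expm1(log_wild)


def mutant_frequency_along_path(steps, theta):
    """Expected mutant frequency after each birth (+1) or death (-1) step.

    Assumed update rule, read off from the recursion for xi_{n,i}:
      birth to size m:  z -> z * (1 - 2*theta/m) + 2*theta/m
      death:            z unchanged (a uniformly chosen cell is removed)
    Returns the list Z_0, Z_1, ..., Z_i, starting from one wild-type cell.
    """
    size, z = 1, 0.0
    history = [z]
    for a in steps:
        if a == 1:
            size += 1
            z = z * (1.0 - 2.0 * theta / size) + 2.0 * theta / size
        elif a == -1:
            if size == 1:
                raise ValueError("path goes extinct; condition on non-extinction")
            size -= 1
        else:
            raise ValueError("steps must be +1 or -1")
        history.append(z)
    return history


if __name__ == "__main__":
    theta = 0.05
    print(mutant_frequency_along_path([1] * 9, theta)[-1], psi_pure_birth(10, theta))
    print(mutant_frequency_along_path([1, 1, -1, 1, 1, -1, 1], theta))
```

With all steps equal to $+1$ the sketch reproduces the pure-birth $ψ_{n}$, and along a path with deaths the sequence is non-decreasing and ends above $ψ_{n}$ for the final size $n$, in line with the statements of Proposition 2.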

3. Sampling from the repertoire.

3.1. Model set up and size bias.

Calculations so far refer to the random development of a single clonotype and its internal mutation rate. More relevant to experimental data are calculations that allow for sampling from the full repertoire, and thus the simultaneous development of many clonotypes. We eschew detailed, cell-biological considerations, though we do provide necessary structural elements to allow for a distributional comparison of diversity statistics computed either from wild type or mutant $T$ cell fractions. First we address a curious size-biased sampling effect that emerges in considering the full repertoire, in contrast to the single clonotype from Sections 2.4 and 2.5.

We focus on a single observation time $t_{obs}$, at which point the repertoire $𝒮$ is comprised of non-empty clonotypes $σ_{1}, σ_{2}, \dots, σ_{ℵ_{clo}}$, of sizes $𝒩 = (N_{σ_{1}}, N_{σ_{2}}, \dots, N_{σ_{ℵ_{clo}}})$, with $ℵ_{cel} = \sum_{j = 1}^{ℵ_{clo}} N_{σ_{j}}$ equal to the overall number of cells in the repertoire. We treat $ℵ_{clo}$ and $ℵ_{cel}$ as large constants, and, considering this snapshot of the repertoire, here we appreciate but do not emphasize with notation anything about the temporal, stochastic development of the clonotypes; for instance we ignore the multitude of receptors that are not extant at $t_{obs}$, and we therefore have $N_{σ_{j}} > 0$ for all $j$. We allow that some more primitive generative stochastic process may underlie the clonotype counts, but we focus on their conditional joint distribution given the total number of cells $ℵ_{cel}$ and the total number of extant clonotypes $ℵ_{clo}$, which in adult humans may be on the order of 10¹¹ and 10⁸, respectively. The same technical device was used by Rothman and Templeton (1980) in studying statistical properties of other assemblages, where additionally the assumption of finite exchangeability is helpful in revealing interesting system properties. We also adopt the finite exchangeability assumption for the joint mass function,

f_{joint} (n_{1}, n_{2}, \dots, n_{ℵ_{clo}}) = P (N_{σ_{1}} = n_{1}, N_{σ_{2}} = n_{2}, \dots, N_{σ_{ℵ_{clo}}} = n_{ℵ_{clo}})

(8)

for counts $n_{j} \geq 1$ , which not only simplifies the specification, but also means that joint probability masses depend on the frequency spectrum holding the counts-of-counts: $C (k) = \sum_{σ} 1 [N_{σ} = k]$ . Figure 5 realizes a small synthetic example.

Fig 5. Simulated repertoire of $ℵ_{cel} = 1000$ cells comprising $ℵ_{clo} = 100$ non-empty clonotypes (encasing circles). The 287 mutant cells are orange/rust, and the remaining 713 wild-type cells are grey, giving a realized mutant frequency 0.287. As predicted mathematically, the larger clonotypes have an over-representation of mutant cells. Sampling uniformly among clonotypes, the average extant clonotype size is 10.0 cells; given the sampled clonotype contains a mutant cell, the average clonotype size is 16.0 cells. On the other hand, sampling uniformly among cells, the average clonotype size of the sampled cell (i.e., with size bias) is 23.0 cells. The average clonotype size when sampling mutant cells, however, is even larger, at 27.7 cells. This synthetic data was simulated from a Bose-Einstein clone-size model and a Luria-Delbrück mutation model, with mutation frequency $θ = 0.05$.

To appreciate the size-bias issue, consider sampling a single cell uniformly from the repertoire, and let $S \in 𝒮$ denote its clonotype identifier. We recognize that $N_{S}$ , the size of the clonotype holding the sampled cell, is random owing to both the random development of the repertoire, as governed at least at the observation time by (8), and owing to the sampling of a cell from the repertoire. Under exchangeability, for $n \geq 1$ :

\begin{array}{l} P (N_{S} = n) = \sum_{σ \in 𝒮} P (N_{S} = n, S = σ) = \sum_{σ \in 𝒮} P (N_{σ} = n, S = σ) \\ = \sum_{σ \in 𝒮} P (S = σ ∣ N_{σ} = n) P (N_{σ} = n) = \sum_{σ \in 𝒮} (\frac{n}{ℵ_{cel}}) P (N_{σ} = n) \\ = n P (N_{σ_{1}} = n) (\frac{ℵ_{clo}}{ℵ_{cel}}) . \end{array}

(9)

Size bias is reflected in the multiplication by $n$ in (9). It conveys the fact that sampling a cell uniformly at random from a randomly developing repertoire is different (i.e., is biased towards larger clonotypes) than sampling a cell uniformly at random from a randomly developing clonotype. In any case, surrogate selection aims to further bias distributions towards larger clonotypes than would be obtained marginally. Before studying this enrichment, it is helpful to investigate a few exchangeable models and their relationship to well-known marginal distributions.
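The multiplication by $n$ in (9) can be seen in a toy calculation. The sketch below uses a small, hypothetical list of clonotype sizes (not data from this paper) and applies the empirical analogue of (9), conditional on the realized sizes, to compare the mean clonotype size under uniform sampling of clonotypes with the mean size of the clonotype holding a uniformly sampled cell.

```python
from collections import Counter

# Hypothetical clonotype sizes for illustration only (not data from the paper).
sizes = [1, 1, 1, 2, 5, 10]

n_clo = len(sizes)   # number of clonotypes
n_cel = sum(sizes)   # number of cells

# Frequency spectrum (counts-of-counts): C(k) = number of clonotypes of size k.
spectrum = Counter(sizes)

# Size distribution when a clonotype is drawn uniformly.
pmf_clonotype = {n: c / n_clo for n, c in spectrum.items()}

# Size distribution of the clonotype holding a uniformly drawn cell:
# n * P(N_sigma = n) * n_clo / n_cel, the empirical analogue of (9).
pmf_cell = {n: n * p * n_clo / n_cel for n, p in pmf_clonotype.items()}

mean_clonotype = sum(n * p for n, p in pmf_clonotype.items())
mean_cell = sum(n * p for n, p in pmf_cell.items())

print(f"mean size, uniform over clonotypes: {mean_clonotype:.2f}")
print(f"mean size, uniform over cells:      {mean_cell:.2f}")
```

In this toy example the two means are 20/6 and 132/20, so the cell-sampled clonotype is about twice as large on average.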

3.2. Joint assemblages and limiting margins: examples.

By various compounding and conditioning operations applied to a collection of independent Poisson variates, Rothman and Templeton (1980) obtained an interesting exchangeable specification that we reconsider for (8):

f_{joint} (n_{1}, n_{2}, \dots, n_{ℵ_{clo}}) \propto \prod_{j = 1}^{ℵ_{clo}} \frac{Γ (n_{j} + α)}{Γ (n_{j} + 1)},

(10)

where the system-defining parameter $α > 0$ reflects dynamics of the assemblage. By modifying limiting regimes for $ℵ_{cel}, ℵ_{clo}$, and $α$, Rothman and Templeton (1980), inter alia, recovered reference marginal distributions distinguished especially by tail behavior. For example, setting $α = 1$ is the Bose-Einstein case. Sending $ℵ_{clo} / ℵ_{cel} \to γ_{0} \in (0, 1)$ as both the numerator and denominator diverge in this case, the marginal limiting distribution of any one clonotype size is Geometric $(γ_{0})$, as in (5), which matches the pure-birth Yule tree model, with $γ_{0} = e^{- λ_{σ} t_{obs}}$. Similarly, if $α \to 0$, the limiting margin is the Logarithmic distribution, with p.m.f. proportional to $γ_{0}^{n} / n$; and if the limit of $ℵ_{clo} / ℵ_{cel}$ itself has a $\mathrm{Beta}(ρ, 1)$ distribution, then the limiting margin is the Yule-Simon power law (6). Empirical size distributions from the Bose-Einstein simulation conform nicely to these theoretical predictions (Figure S3). These intriguing relationships provide a modeling framework allowing us to elaborate single-clonotype calculations (Section 2) into the context of full-repertoire sampling. In particular, where various conditions on the joint assemblage give rise to different limiting marginal distributions for a given clonotype’s $N_{σ}$, we can similarly deduce the size-biased distribution of $N_{S}$. Details are provided in Appendix B; summarizing here, the size-biased version of the Geometric (5) has p.m.f. $n γ_{0}^{2} {(1 - γ_{0})}^{n - 1}$, and the size-biased version of the Yule-Simon (6) has the p.m.f. $ρ n B (n, ρ + 2)$; see also Fig S2. We are not using these distributions for any sort of model-based inference from data; rather, we are exercising them primarily to explore implications of single versus multi-clonal analysis.
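The Bose-Einstein case can be explored numerically. The following is a minimal sketch, assuming that (10) with $α = 1$ is supported on positive counts summing to $ℵ_{cel}$, so that every admissible size vector has equal mass and sizes can be drawn as a uniformly random composition. The function name `bose_einstein_sizes`, the seed, and the repertoire dimensions are our own illustrative choices and are far smaller than a real repertoire.

```python
import random
from collections import Counter

def bose_einstein_sizes(n_cel, n_clo, rng):
    """Draw sizes uniformly over compositions of n_cel into n_clo positive parts."""
    # Choose n_clo - 1 distinct cut points among the n_cel - 1 gaps between cells.
    cuts = sorted(rng.sample(range(1, n_cel), n_clo - 1))
    edges = [0] + cuts + [n_cel]
    return [b - a for a, b in zip(edges, edges[1:])]

rng = random.Random(1)           # arbitrary seed
n_cel, n_clo = 100_000, 10_000   # illustrative dimensions
sizes = bose_einstein_sizes(n_cel, n_clo, rng)

gamma0 = n_clo / n_cel           # limiting Geometric parameter
spectrum = Counter(sizes)

# Empirical clonotype-size frequencies next to the Geometric(gamma0) p.m.f.
for n in range(1, 6):
    empirical = spectrum[n] / n_clo
    geometric = gamma0 * (1 - gamma0) ** (n - 1)
    print(n, round(empirical, 4), round(geometric, 4))

# Size-biased mean: empirical value next to the Geometric value (2 - gamma0) / gamma0.
size_biased_mean = sum(s * s for s in sizes) / n_cel
print(size_biased_mean, (2 - gamma0) / gamma0)
```

The last line compares the mean clonotype size of a uniformly sampled cell with the mean of the size-biased Geometric p.m.f. quoted above.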

3.3. Enrichment.

Size bias attributable to repertoire versus single-clonotype sampling does not alter the basic enrichment properties revealed in Propositions 1 and 2, except for a slight change in constants. For example, with the mutation model as in Section 2.4, and such that within each clonotype the stochastic process meets the conditions of Proposition 1, we have:

\frac{P (N_{S} = n ∣ M_{S} = 1)}{P (N_{S} = n)} = \frac{1}{P (M_{S} = 1)} \{1 - \frac{Γ (n + 1 - 2 θ)}{Γ (n + 1) Γ (2 - 2 θ)}\}

which is also a strictly increasing function of $n$ that approaches limit $1 / P (M_{S} = 1)$ . The result follows from the single-clonotype sampling result (4), Bayes’s rule, and the equality:

\begin{array}{l} P (M_{S} = 1 ∣ N_{S} = n) = \sum_{σ \in 𝒮} P (M_{S} = 1, S = σ ∣ N_{S} = n) \\ = \sum_{σ \in 𝒮} P (M_{σ} = 1 ∣ N_{σ} = n, S = σ) P (S = σ ∣ N_{S} = n) \\ = P (M_{σ} = 1 ∣ N_{σ} = n) \text{ for any } σ \in 𝒮 . \end{array}

(11)

By analogy, Proposition 2 may also be extended to sampling from the full repertoire. In summary,

Proposition 3.

If clonotype sizes at observation time $t_{obs}$ are exchangeable, as in (8), and if each individual clonotype evolves to its size at $t_{obs}$ according to the dynamics in Proposition 1 or Proposition 2, then conditional on mutation $M_{S} = 1$ of a cell randomly drawn from the full repertoire, the enrichment ratio $P (N_{S} = n ∣ M_{S} = 1) / P (N_{S} = n)$ eventually exceeds 1 for sufficiently large $n$.

The enrichment phenomenon is illustrated in the synthetic repertoire in Figure 5, which shows mutant and wild-type subclones of various clonotypes, and highlights how sampling the mutant fraction would bias towards larger clonotypes.
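The enrichment ratio can be evaluated numerically in a reference case. The sketch below assumes that the braced factor in the ratio above is the conditional mutation probability $P (M_{S} = 1 ∣ N_{S} = n)$, as implied by Bayes's rule and (11), and that $N_{S}$ follows the size-biased Geometric law of Section 3.2. The parameter values and the truncation point are hypothetical and chosen only for illustration.

```python
import math

def psi(n, theta):
    """P(M = 1 | N = n): the braced factor in the enrichment ratio (our reading)."""
    log_ratio = (math.lgamma(n + 1 - 2 * theta)
                 - math.lgamma(n + 1)
                 - math.lgamma(2 - 2 * theta))
    return 1.0 - math.exp(log_ratio)

def size_biased_geometric(n, gamma0):
    """Size-biased Geometric p.m.f. quoted in Section 3.2."""
    return n * gamma0 ** 2 * (1 - gamma0) ** (n - 1)

theta, gamma0 = 1e-4, 0.01   # hypothetical parameter values
n_max = 5000                 # truncation point; the neglected tail mass is negligible here

# Mutant frequency P(M_S = 1): average psi over the size-biased distribution.
p_mut = sum(psi(n, theta) * size_biased_geometric(n, gamma0)
            for n in range(1, n_max + 1))

# Enrichment ratio P(N_S = n | M_S = 1) / P(N_S = n) at a few clonotype sizes.
ratios = {n: psi(n, theta) / p_mut for n in (1, 10, 100, 1000)}
for n, r in ratios.items():
    print(n, r)
```

Under these settings the ratio is zero for singleton clonotypes, stays below 1 for small clonotypes, and exceeds 1 for the largest ones, in line with Proposition 3.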

3.4. Mutant Frequency.

A random cell from the repertoire is more likely to be mutant than a random cell from any specific, randomly developing clonotype: $P (M_{S} = 1) > P (M_{σ} = 1)$, which we confirm in Appendix C by a calculation similar to (9). This mutant frequency $P (M_{S} = 1)$ is of independent interest, and can be estimated by various dilution assays. As reviewed in Kaitz et al. (2022), the mutant frequency is different from the mutation frequency $θ$. The former considers the rate at which mutant cells are found in a sample from the repertoire; the latter is the rate that mutations emerge among cell divisions in a developing clonotype. Table S2 offers some numerical results for the Bose-Einstein assemblage.

3.5. Diversity statistics.

An important motivation for the preceding theoretical calculations is to understand the impact of surrogate selection on statistics from a random sample from a repertoire. Suppose the amount of sampled material from one subject is a fraction $ϵ = n_{samp} / ℵ_{cel}$ of the entire repertoire, and let $X_{σ}$ record the number of cells within the sample of $n_{samp}$ cells that have receptor $σ$ . Conditional upon the clonotype sizes, we treat this empirical frequency as Poisson distributed, considering typical experimental settings and the relative rarity of individual clonotypes (e.g., Sepúlveda, Paulino and Carneiro, 2010). Thus,

X_{σ} ∣ 𝒩 ~ Poisson \{ϵ N_{σ}\} .

(12)

The number of clonotypes represented by $k$ cells in the sample is $Y_{k} = \sum_{σ} 1 [X_{σ} = k]$; most diversity statistics are computed from these occupancy counts, $\{Y_{k}\}$ (e.g., Lande, 1996; Zhang and Zhou, 2010; Chiffelle et al., 2020). The simplest one is $𝒟 = \sum_{k = 1}^{n_{samp}} Y_{k}$, which is the number of distinct clonotypes observed in the sample. Note also $n_{samp} = \sum_{k} k Y_{k}$. Recognizing $𝒟 = \sum_{σ} 1 [X_{σ} > 0]$, it is immediate from exchangeability that:

E (𝒟) = ℵ_{clo} \{1 - \sum_{n \geq 1} e^{- n ϵ} P (N_{σ} = n)\}, \text{ for any one } σ .

(13)

Using characteristic functions, we may compute expected diversity directly for the reference marginals. For example, taking the limiting Geometric margin for $P (N_{σ} = n)$ noted in Section 3.2,

E (𝒟) = ℵ_{clo} \{1 - \frac{γ_{0}}{e^{ϵ} - (1 - γ_{0})}\} .

(14)

If $N_{σ} ~ \mathrm{Logarithmic}(p)$, then,

E (𝒟) = ℵ_{clo} \{1 - \frac{\log (1 - p e^{- ϵ})}{\log (1 - p)}\} .

(15)

For Yule-Simon marginal distribution with parameter $ρ$ , we get,

E (𝒟) = ℵ_{clo} \{1 - \frac{ρ e^{- ϵ}}{ρ + 1} \, {}_{2}F_{1} (1, 1; ρ + 2; e^{- ϵ})\}

(16)

where ${}_{2}F_{1} (a, b; c; z)$ is the Gaussian hypergeometric function. In typical repertoires, we expect parameter settings assuring high diversity, such that $E (𝒟)$ is relatively close to $n_{samp}$.
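As a numerical check on these expressions, the sketch below evaluates (13) by a truncated sum and compares it with the closed forms (14) and (15). The function names, parameter values, and truncation point are our own illustrative choices; the Logarithmic p.m.f. is taken in its standard form, $- p^{n} / \{n \log (1 - p)\}$.

```python
import math

def expected_diversity_direct(pmf, eps, n_clo, n_max):
    """Evaluate (13) by truncating the sum over clonotype sizes at n_max."""
    transform = sum(math.exp(-n * eps) * pmf(n) for n in range(1, n_max + 1))
    return n_clo * (1 - transform)

def expected_diversity_geometric(gamma0, eps, n_clo):
    """Closed form (14) for the Geometric(gamma0) margin."""
    return n_clo * (1 - gamma0 / (math.exp(eps) - (1 - gamma0)))

def expected_diversity_logarithmic(p, eps, n_clo):
    """Closed form (15) for the Logarithmic(p) margin."""
    return n_clo * (1 - math.log(1 - p * math.exp(-eps)) / math.log(1 - p))

# Hypothetical settings chosen so that the truncated sums converge quickly.
eps, n_clo, n_max = 1e-3, 10_000, 5000
gamma0, p = 0.01, 0.99

def geometric_pmf(n):
    return gamma0 * (1 - gamma0) ** (n - 1)

def logarithmic_pmf(n):
    return -(p ** n) / (n * math.log(1 - p))

geom_direct = expected_diversity_direct(geometric_pmf, eps, n_clo, n_max)
geom_closed = expected_diversity_geometric(gamma0, eps, n_clo)
log_direct = expected_diversity_direct(logarithmic_pmf, eps, n_clo, n_max)
log_closed = expected_diversity_logarithmic(p, eps, n_clo)

print(geom_direct, geom_closed)
print(log_direct, log_closed)
```

Each pair of printed values should agree up to truncation and rounding error.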

Surrogate selection enables direct sampling from the mutant fraction, and our formalism allows a quantitative assessment of the selection effect on expected sample properties. By enriching for larger clonotypes, surrogate selection would seem to lead to fewer cells from very small clonotypes, and thus less diverse samples. Here we confirm that property. Set $\tilde{ϵ} = n_{samp} / [ℵ_{cel} P (M_{S} = 1)]$, which is an amount larger than $ϵ$ that is sufficient to produce, in expectation, $n_{samp}$ mutant cells from the repertoire. These cells arise from the clonotypes according to sample counts ${\tilde{X}}_{σ}$, which, given the total numbers of mutant counts across the repertoire, $\tilde{𝒩} = \{{\tilde{N}}_{σ}\}$, then satisfy

{\tilde{X}}_{σ} ∣ \tilde{𝒩} ~ Poisson \{\tilde{ϵ} {\tilde{N}}_{σ}\} .

(17)

The mutant sample, which in expectation has the same number of mutant cells as the total number of cells in the full-repertoire sample, has its own diversity, $\tilde{𝒟} = \sum_{σ} 1 [{\tilde{X}}_{σ} > 0]$ . By manipulating the probability generating function of the Luria-Delbrück distribution, and also leveraging results in Roshan, Jones and Greenman (2014), we find explicit formulas for the expected diversity among mutant-sampled cells.

Proposition 4.

In the pure-birth, Yule tree model for clonotype development, with a Geometric $(γ_{0})$ distribution for each clonotype size at observation time, and with mutation frequency $θ$ as in Proposition 1, the mutant sample has expected diversity:

E (\tilde{𝒟}) = ℵ_{clo} \{1 - \frac{γ_{0}}{(1 - e^{\tilde{ϵ}}) {\{1 - e^{- \tilde{ϵ}} (1 - γ_{0})\}}^{2 θ} + e^{\tilde{ϵ}} \{1 - e^{- \tilde{ϵ}} (1 - γ_{0})\}}\} .

Alternatively, in case the clonotype-size distribution is Logarithmic $(p)$ , then the expected diversity is:

E (\tilde{𝒟}) = ℵ_{clo} \{1 - \frac{2 θ \log (1 - p e^{- \tilde{ϵ}}) - \log [(1 - e^{\tilde{ϵ}}) {(1 - p e^{- \tilde{ϵ}})}^{2 θ} + e^{\tilde{ϵ}} - p]}{- (1 - 2 θ) \log (1 - p)}\} .

In either case, $E (\tilde{𝒟}) < E (𝒟)$ as long as $θ \in (0, ϵ / 2)$ .

Thus in two reference models, Proposition 4 expresses the precise effect of surrogate selection on repertoire sample diversity; Figure 6 provides a numerical illustration. The result extends to more general distributions by mixing. For example, if conditional upon $γ_{0}$ the clonotype sizes are $\mathrm{Geometric}(γ_{0})$, and if $γ_{0} = \exp (- W)$ for $W ~ \mathrm{Exp}(ρ)$, then marginally the clonotype size is Yule-Simon distributed with parameter $ρ$, and the expected diversity bound carries through the expectation: $E \{E (𝒟 - \tilde{𝒟} ∣ γ_{0})\} > 0$.

Fig 6. — Comparison of expected diversity scores between sampling from whole repertoire or just the mutant fraction, under various Geometric (left) and Logarithmic (right) distributions. The range of Geometric parameter $γ_{0}$ and logarithmic parameter $p$ is determined to match a clonotype of approximately 10² to 10⁵ cells, in expectation. Other parameters are fixed as sampling fraction $ϵ = 10^{- 4}$, overall number of clonotypes $ℵ_{clo} = 10^{7}$ and mutation probability in each division $θ = 10^{- 6}$. Expected diversity is always lower in the mutant fraction, in line with Proposition 4.
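The Geometric case of Proposition 4 can be evaluated with a few lines of code. This is a minimal sketch under stated assumptions: the mutant frequency $P (M_{S} = 1)$ is obtained by averaging the conditional mutation probability of Section 3.3 over the size-biased Geometric law of Section 3.2, truncated at a large size, and $\tilde{ϵ}$ is then set to $ϵ / P (M_{S} = 1)$. The parameter values are hypothetical and differ from those used for Figure 6.

```python
import math

def psi(n, theta):
    """P(M = 1 | N = n), read off the braced factor in Section 3.3 (our reading)."""
    log_ratio = (math.lgamma(n + 1 - 2 * theta)
                 - math.lgamma(n + 1)
                 - math.lgamma(2 - 2 * theta))
    return 1.0 - math.exp(log_ratio)

def diversity_full(gamma0, eps):
    """Per-clonotype probability of appearing in the full-repertoire sample, from (14)."""
    return 1 - gamma0 / (math.exp(eps) - (1 - gamma0))

def diversity_mutant(gamma0, eps_tilde, theta):
    """Per-clonotype probability of appearing in the mutant sample (Proposition 4, Geometric)."""
    a = 1 - math.exp(-eps_tilde) * (1 - gamma0)
    denom = (1 - math.exp(eps_tilde)) * a ** (2 * theta) + math.exp(eps_tilde) * a
    return 1 - gamma0 / denom

# Hypothetical parameter values with theta < eps / 2.
gamma0, eps, theta, n_clo = 0.01, 1e-3, 1e-4, 10_000

# Mutant frequency P(M_S = 1) under the size-biased Geometric law, truncated at n = 5000.
p_mut = sum(psi(n, theta) * n * gamma0 ** 2 * (1 - gamma0) ** (n - 1)
            for n in range(1, 5001))
eps_tilde = eps / p_mut   # equals n_samp / (n_cel * P(M_S = 1))

e_full = n_clo * diversity_full(gamma0, eps)
e_mut = n_clo * diversity_mutant(gamma0, eps_tilde, theta)
print(f"P(M_S = 1) = {p_mut:.3e}, eps_tilde = {eps_tilde:.3f}")
print(f"E(D) = {e_full:.1f}, E(D~) = {e_mut:.1f}")
```

With these settings the expected diversity of the mutant sample falls below that of the full-repertoire sample, as the proposition states.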

3.6. Somatic burden.

Our calculations emphasize mutation status at some special locus (like HPRT) for which experimental assays provide for ready sampling of cells within that mutant fraction of the repertoire. Yet the calculations also inform an analysis of more general mutational signatures carried by sampled $T$ cells. Intuitively, there may be a lot of information, for example about prior antigen exposure, that is recorded in the present genomic state of sampled $T$ cells, whether or not we consider mutations for an in vitro selection assay.

A $T$ cell sampled randomly from the repertoire resides in a random clonotype $S$ of size $N_{S}$ . At any genomic locus $g$ within a host of measurable sites $𝒢$ , this cell has mutation status $M_{S, g}$ relative to its prethymic state. We are thinking

M_{S, g} = 1 [\text{locus } g \text{ in sampled cell has incurred a somatic mutation}],

which opens us up to a genome-wide spectrum of mutations, rather than changes at a single, surrogate-selection-driving locus. To this end, we define a sampled cell’s somatic burden $L$ to be the summation of $M_{S, g}$ over all $g \in 𝒢$. We find it convenient to consider a sequence of collections $𝒢^{1}, 𝒢^{2}, \dots$, approaching $𝒢$, with $𝒢^{m}$ containing $m$ loci, and for which at step $m$, $P (M_{S, g}^{m} = 1 ∣ N_{S} = n) = ψ_{n} (θ_{g}^{m})$ for locus-specific mutation frequency $θ_{g}^{m}$, and with $ψ_{n}$ as in (3) but now highlighting its dependence on mutation frequency. This formula works in the pure-birth model structure thanks to Proposition 1 and the exchangeability in (8). Within this framework, we have the step-$m$ burden $L^{m} = \sum_{g \in 𝒢^{m}} M_{S, g}^{m}$.

Proposition 5.

If clonotypes satisfy the regularity conditions in Proposition 1, if clonotype sizes are exchangeable as in (8), and if $λ^{m} = \sum_{g \in 𝒢^{m}} θ_{g}^{m} ⟶ λ$ as $m ⟶ \infty$ for some $λ > 0$ , then

\lim_{m \to \infty} E (L^{m} ∣ N_{S} = n) = 2 λ (H_{n} - 1) = λ ψ_{n}^{'} (0)

(18)

where $H_{n}$ is the $n$th harmonic number and $ψ_{n}^{'} (θ) = d ψ_{n} (θ) / d θ$.

Put another way, the expected number of post-thymic somatic mutations in a $T$ cell is approximately proportional to the logarithm of that cell’s clonotype size, at least under the stated regularity conditions. Single-cell sequencing studies provide a means to measure $L$ on sampled cells, and also to associate that somatic burden with clonotype size, as we investigate next.
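The identity $ψ_{n}^{'} (0) = 2 (H_{n} - 1)$ in (18) can be checked numerically. The sketch below assumes that $ψ_{n} (θ)$ is the braced factor appearing in the enrichment ratio of Section 3.3, and approximates the derivative at zero by a one-sided finite difference; the step size is an arbitrary illustrative choice.

```python
import math

def psi(n, theta):
    """psi_n(theta) = 1 - Gamma(n + 1 - 2 theta) / {Gamma(n + 1) Gamma(2 - 2 theta)}."""
    log_ratio = (math.lgamma(n + 1 - 2 * theta)
                 - math.lgamma(n + 1)
                 - math.lgamma(2 - 2 * theta))
    return 1.0 - math.exp(log_ratio)

def harmonic(n):
    """n-th harmonic number H_n."""
    return sum(1 / k for k in range(1, n + 1))

h = 1e-6  # small step for a one-sided finite difference at theta = 0
for n in (1, 2, 10, 100, 1000):
    slope = psi(n, h) / h   # psi_n(0) = 0, so this approximates psi_n'(0)
    print(n, slope, 2 * (harmonic(n) - 1), 2 * (math.log(n) + 0.5772 - 1))
```

The last column uses the approximation $H_{n} \approx \log n + 0.5772$, which underlies the logarithmic growth of expected burden with clonotype size; it is accurate only for larger $n$.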

4. Empirical studies.

4.1. Somatic burden.

Single-cell sequencing technologies provide an exciting window into the dynamics of the $T$ cell repertoire. Here we reanalyze publicly available data reported by 10x Genomics on samples from 7 different $T$ cell repertoires, including 5 peripheral blood mononuclear cell (PBMC) samples from healthy human donors, a melanoma patient and a lung cancer patient. Supplementary Material, Appendix F, summarizes the data resources and provides additional details on our analysis pipeline. In every case, the repertoire sampling and prior analysis provided both the T cell receptor (TCR) sequence and single cell whole-transcriptome RNA-seq on thousands of cells. The TCR sequence information allows us to cluster cells into clonotypes. Our interest in somatic burden puts quite different demands on the RNA-seq data than the original studies. Rather than derive transcript abundance, we repurpose the RNA-seq reads to report on underlying somatic mutations that must have emerged in the genomic DNA. Following the workflow in Edwards et al. (2022), and using the GATK pipeline for genomic-variant calling (McKenna et al., 2010; Auwera and O’Connor, 2020), we computed single-cell-expressed single-nucleotide-variant calls (sce-SNVs) from the aligned read data using Mutect2 (Cibulskis et al., 2013; DePristo et al., 2011), applied consistently across the different repertoires. Details for SNV calling are in Appendix F, but we note here that to focus better on post-thymic somatic variants, we filtered any calls that would have appeared in more than one clonotype. In total over the 7 repertoires, we measured 30257 cells that resided in 27758 clonotypes, and which altogether presented 1609 post-thymic sce-SNVs.

Figure 7 summarizes average somatic burden as a function of clonotype size for one repertoire. Though not statistically significant, it shows an intriguing increase in estimated mean burden with increasing clonotype size, just as predicted by Proposition 5. Not all data sets show as clear a trend (Table 1), though in a meta-analysis which combines the 7 repertoires, we see stronger evidence of an increase in expected burden with clonotype size. We applied a linear model to cell-level data, with response the measured burden, and with an adjusted clonotype size predictor, where the adjustment accounts for the different sampling rates across the repertoires. We estimate $\hat{β} = 0.6$ SNVs per unit increase in logarithm of clonotype size. A stratified permutation, which shuffles cells between clonotypes within repertoires, gives a modest p-value of 0.02 on this clonotype-size effect. Further details are in Figure S5.

Fig 7. — Association of average somatic burden with clonotype size, in the PBMC3 repertoire. There are 5659 singleton clonotypes, 278 duplexes, and a total of 22 clonotypes with sizes greater than 2. The largest clonotype contains 41 cells. Clonotypes of size 3 to 20 cells are combined together as a single class considering the small sample size. Pointwise 95% confidence intervals are computed from a quasi-Poisson generalized linear model.
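The stratified permutation idea used in the meta-analysis above can be sketched as follows. This is an illustration under our own simplifying assumptions, not the analysis pipeline of this paper: the predictor is the raw logarithm of clonotype size rather than the adjusted predictor, the test is one-sided, and the cell-level data are synthetic. The function names `slope` and `stratified_permutation_pvalue` are hypothetical.

```python
import math
import random

def slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def stratified_permutation_pvalue(log_size, burden, repertoire, n_perm, rng):
    """One-sided p-value for a positive slope, shuffling burdens within each repertoire."""
    observed = slope(log_size, burden)
    strata = {}
    for i, r in enumerate(repertoire):
        strata.setdefault(r, []).append(i)
    exceed = 0
    for _ in range(n_perm):
        permuted = list(burden)
        for idx in strata.values():
            values = [burden[i] for i in idx]
            rng.shuffle(values)          # reassign burdens among cells of one repertoire
            for i, v in zip(idx, values):
                permuted[i] = v
        if slope(log_size, permuted) >= observed:
            exceed += 1
    return (1 + exceed) / (1 + n_perm)

# Synthetic cell-level data for two hypothetical repertoires (not data from the paper).
rng = random.Random(7)
log_size, burden, repertoire = [], [], []
for rep in ("A", "B"):
    for _ in range(300):
        size = rng.choice([1, 1, 1, 2, 5, 20])
        log_size.append(math.log(size))
        burden.append(1 if rng.random() < 0.05 + 0.1 * math.log(size) else 0)
        repertoire.append(rep)

p_value = stratified_permutation_pvalue(log_size, burden, repertoire, 500, rng)
print(slope(log_size, burden), p_value)
```

Shuffling within repertoires preserves any repertoire-level differences in burden, so the p-value reflects only the within-repertoire association between burden and clonotype size.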

Table 1. Somatic burden of cells by clonotype size (rows), derived from seven T cell repertoire samples (columns) made publicly available by 10x Genomics.

Details of the data resources are in Supplementary Table S3. We repurposed the single-cell RNA-seq reads to infer somatic variants and compute somatic burden counts per cell (average burden in upper table, SNVs/cell); and we used the reported TCR sequences to partition cells into clonotypes (numbers of clonotypes in bottom table).

Clonotype size	20K	10K	SC5K	PBMC3	Controller	Melanoma	Lung

1	0.018	0.017	0.076	0.042	0.019	0.057	0.390
2	0.002	0.005	0.103	0.043	0.029	0.121	0.245
3	0	0	0	0	0	0.035	0.407
4	0	0	-	0.042	0	0.278	0.667
5	0	0	0	0	0	0	0.400
6	0	0	0	2.167	0	-	0.292
7	0	-	0	0	0	-	0.429
8	0	0	3.000	0	-	-	0.875
9	0	-	0	-	-	0	0.444
10	0	-	-	0	0	-	0.200
11	0	-	0	-	0	0	0.455
12	-	0	-	-	0	0	0.292
13	-	-	-	-	-	0	-
14	0	-	0.429	-	-	0	-
17	-	-	-	0	-	-	0.588
19	0	-	-	-	-	0	-
[20, 40]	0.100	0	-	-	0	-	1.283
> 40	0	-	0.170	0.171	-	-	-
Clonotype size	20K	10K	SC5K	PBMC3	Controller	Melanoma	Lung

1	8395	4211	1643	5659	4118	1097	1315
2	239	111	39	278	123	66	108
3	39	35	8	33	23	19	27
4	13	6	-	6	6	9	12
5	15	5	1	4	5	3	3
6	7	2	2	1	1	-	4
7	5	-	2	2	2	-	2
8	6	1	1	4	-	-	1
9	2	-	1	-	-	1	2
10	1	-	-	2	1	-	2
11	2	-	1	-	1	1	1
12	-	1	-	-	1	1	2
13	-	-	-	-	-	1	-
14	1	-	1	-	-	1	-
17	-	-	-	2	-	-	1
19	1	-	-	-	-	2	-
[20, 40]	1	1	-	-	1	-	2
> 40	1	-	1	1	-	-	-


4.2. Melanoma case studies.

We reconsider surrogate selection data presented in Zuleger et al. (2020), and we focus here (Table 2) on a metastatic melanoma patient for whom repertoire sampling was performed repeatedly over the course of what turned out to be a successful immunotherapy treatment. As the table shows, the HPRT wild-type (WT) samples have greater sample diversity than the HPRT mutant (MT) samples, which have passed in vitro selection.

Table 2. Empirical repertoire diversity in wild-type and HPRT mutant fractions, derived from sequencing TCR cDNAs from mass cultures obtained at 5 time-points on one melanoma patient.

Time point	Total reads 2	WT unique / reads	MT unique / reads

1	108722	2840 / 58896	158 / 49826
2	111652	4587 / 53435	182 / 58217
3	98834	2709 / 49799	156 / 49035
4	87804	2091 / 52277	84 / 35527
5	98286	2209 / 51711	133 / 46575


The mass culture conditions and cDNA sequencing approach used by Zuleger et al. (2020) affect the distribution of counts in Table 2, making them over-dispersed compared to ideal cell counts. Assays based upon single-cell-derived isolates precisely count wild-type and HPRT mutant cells, rather than cDNAs, and are not subject to additional variance caused by in vitro growth effects. However, they are more labor intensive than mass cultures and provide less overall sequencing data. Table 3 summarizes such data from the peripheral blood of 11 subjects studied in Zuleger et al. (2011). In all cases the HPRT surrogate selected samples are less diverse than the wild-type cells, as predicted by the enrichment calculations in Section 3.5.

Table 3. Empirical repertoire diversity in wild-type and HPRT mutant fractions, derived from single-cell isolate data on seven melanoma patients and four healthy donors.

Subjects 1, 2, 3, 5, 6, 9, 13 are melanoma patients; Subjects 26, 29, 30, 32 are healthy donors. Subjects are sorted by the number of sequenced T cell receptors.

Subject	# T cells	WT unique / cells	MT unique / cells

5	122	19 / 19	102 / 103
2	114	49 / 49	61 / 65
1	101	31 / 32	45 / 69
32	95	54 / 54	30 / 41
26	81	36 / 36	44 / 45
3	79	17 / 17	55 / 62
30	69	39 / 39	29 / 30
13	69	23 / 23	43 / 46
29	56	36 / 36	19 / 20
9	50	11 / 11	23 / 39
6	26	18 / 18	8 / 8


5. Concluding Remarks.

Gaining a better understanding of the adaptive immune system is a central focus of contemporary biomedical research, considering that system’s role in health and disease. We seek clinically useful methods to identify $T$ cells that may be responding to antigens presented by melanoma, but it is challenging to recognize a patient’s disease-specific antigens, and it is also difficult to predict the antigens to which a given $T$ cell receptor will bind. Research on both these frontiers is important and will capitalize on advances in the data sciences (e.g., Lu et al., 2021; Li et al., 2021). In any case, techniques that could readily enrich a lymphocyte sample for $T$ cells responsive to disease-relevant antigens would have a variety of practical applications. The present work provides a statistical basis for the use of surrogate selection, which aims to enrich lymphocyte samples for disease-relevant cells by recognizing that prior clonal expansions may be associated with the accumulation of neutral somatic alterations. Relatively straightforward assays, like HPRT and PIG-A, are available to filter cells having incurred some convenient somatic alteration. Earlier studies have compared selected and unselected cell populations, using both standard and novel statistical tools to account for sources of variation affecting cell phenotypes (e.g., Pei et al., 2014; Zuleger et al., 2020). No prior studies have considered the stochastic basis of surrogate selection itself, and this problem has been the central focus of the present paper.

We treat the stochastic development of a single clonotype and demonstrate that conditioning on a mutant sampled cell enriches for larger clonotypes in a class of birth-death processes (Propositions 1 and 2). We extend the development to exchangeable collections of clonotypes (Proposition 3), accounting for the size bias and complexity of real repertoires. We study the effects of selection on the sampling distribution of a commonly computed diversity statistic (Proposition 4). Looking beyond selection, we investigate the accumulation of neutral somatic mutations across the genome, and show how the same modeling calculations demonstrate that cells in older, expanded clonotypes are expected to carry a greater mutation burden. All these theoretical predictions are accompanied by empirical results both from surrogate selection studies and recent single-cell sequencing projects. If there is a single take-home message, it is that we have resolved the sampling phenomenon exemplified in the simulated data of Figure 5. Interestingly, cells sampled from this synthetic repertoire are associated with larger clonotypes when we condition on them being mutant, even though mutation events are completely neutral. Moreover, we hope that the quantitative characterizations developed here will provide a basis for more informed statistical analysis of $T$ cell data sets and the planning of immunological experiments.

Supplementary Material

Supplement 1

NIHPP2023.07.13.548950v1-supplement-1.pdf (1.2MB, pdf)

Funding.

This research was supported in part by the National Science Foundation (grant 2023239-DMS), and by grants from the National Institutes of Health: R01 GM102756, P01 CA022443, P01 CA250972, P50 CA278595, UL1 TR002373, P50 CA269011, and P30 CA014520. This work was also supported by resources at the William S. Middleton Memorial Veterans Hospital, Madison, WI, USA, and the UW Carbone Comprehensive Cancer Center. Additional support was provided by Ann’s Hope Foundation, Taking on Melanoma, the Tim Eagle Memorial, and the Jay Van Sloan Memorial from the Steve Leuthold Family Foundation, philanthropic support in the USA. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH or the views of the Dept. of Veterans Affairs or the United States Government.

Footnotes

In Mahmoud (1992), a binary tree is assumed to contain $n$ internal nodes, and Eq. (2.4) cares about the $n + 1$ external nodes (leaves) of the corresponding extended binary tree. In Steel and McKenzie (2001), following Mahmoud (1992), the Yule tree is said to contain $n + 1$ leaves. Our notation is slightly different as we use $n$ to denote leaf numbers. In our setting $n \geq 1$ and $D_{σ} \geq 0$ .

REFERENCES

Albertini R. J. (2001). HPRT mutations in humans: biomarkers for mechanistic studies. Mutation Research/Reviews in Mutation Research 489 1–16. [DOI] [PubMed] [Google Scholar]
Albertini R. J., Castle K. L. and Borcherding W. R. (1982). T-cell cloning to detect the mutant 6-thioguanine-resistant lymphocytes present in human peripheral blood. Proceedings of the National Academy of Sciences 79 6617–6621. [DOI] [PMC free article] [PubMed] [Google Scholar]
Albertini R. J., Nicklas J. A., O’Neill J. P. and Robison S. H. (1990). In vivo somatic mutations in humans: measurement and analysis. Annual review of genetics 24 305–326. [DOI] [PubMed] [Google Scholar]
Aldous D. (1996). Probability Distributions on Cladograms. In Random Discrete Structures 1–18. Springer. [Google Scholar]
Angerer W. P. (2001). An explicit representation of the Luria-Delbrück distribution. Journal of mathematical biology 42 145–174. [DOI] [PubMed] [Google Scholar]
Auwera G. V. D. and O’connor B. D. (2020). Genomics in the cloud: using Docker, GATK, and WDL in Terra, 1st ed. O’Reilly Media, Sebastopol, CA. [Google Scholar]
Billingsley P. (1995). Probability and Measure, 3rd ed. John Wiley & Sons. [Google Scholar]
Bolkhovskaya O. V., Zorin D. Y. and Ivanchenko M. V. (2014). Assessing T cell clonal size distribution: a non-parametric approach. PLoS One 9 e 108658. [DOI] [PMC free article] [PubMed] [Google Scholar]
Brown G. G. and Shubert B. O. (1984). On random binary trees. Mathematics of Operations Research 9 43–65. [Google Scholar]
Cheek D. and Antal T. (2018). Mutation frequencies in a birth-death branching process. The Annals of Applied Probability 28 3922–3947. [Google Scholar]
Chiffelle J., Genolet R., Perez M. A., Coukos G., Zoete V. and Harari A. (2020). T-cell repertoire analysis and metrics of diversity and clonality. Current Opinion in Biotechnology 65 284–295. [DOI] [PubMed] [Google Scholar]
Cibulskis K., Lawrence M. S., Carter S. L., Sivachenko A., Jaffe D., Sougnez C., Gabriel S., Meyerson M., Lander E. S. and Getz G. (2013). Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nature biotechnology 31 213–219. [DOI] [PMC free article] [PubMed] [Google Scholar]
Currie J., Castro M., Lythe G., Palmer E. and Molina-París C. (2012). A stochastic T cell response criterion. Journal of The Royal Society Interface 9 2856–2870. [DOI] [PMC free article] [PubMed] [Google Scholar]
Danecek P., Bonfield J. K., Liddle J., Marshall J., Ohan V., Pollard M. O., Whitwham A., Keane T., Mccarthy S. A., Davies R. M. and Li H. (2021). Twelve years of SAMtools and BCFtools. GigaScience 10 giab008. [DOI] [PMC free article] [PubMed] [Google Scholar]
de Greef P. C., Oakes T., Gerritsen B., Ismail M., Heather J. M., Hermsen R., Chain B. and de Boer R. J. (2020). The naive T-cell receptor repertoire has an extremely broad distribution of clone sizes. Elife 9 e49900. [DOI] [PMC free article] [PubMed] [Google Scholar]
den Braber I., Mugwagwa T., Vrisekoop N., Westera L., Mögling R., De Boer A. B., Willems N., Schrijver E. H., Spierenburg G., Gaiser K. et al. (2012). Maintenance of peripheral naive T cells is sustained by thymus output in mice but not humans. Immunity 36 288–297. [DOI] [PubMed] [Google Scholar]
Depristo M. A., Banks E., Poplin R., Garimella K. V., Maguire J. R., Hartl C., Philippakis A. A., Del Angel G., Rivas M. A., Hanna M. et al. (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature genetics 43 491–498. [DOI] [PMC free article] [PubMed] [Google Scholar]
Desponds J., Mora T. and Walczak A. M. (2016). Fluctuating fitness shapes the clone-size distribution of immune repertoires. Proceedings of the National Academy of Sciences 113 274–279. [DOI] [PMC free article] [PubMed] [Google Scholar]
Dobrovolsky V. N., Revollo J., Petibone D. M. and Heflich R. H. (2017). In vivo rat T-lymphocyte Pig-a assay: detection and expansion of cells deficient in the GPI-anchored CD48 surface marker for analysis of mutation in the endogenous Pig-a gene. In Drug Safety Evaluation 143–160. Springer. [DOI] [PubMed] [Google Scholar]
Duque D. F. L., Molina-Paris C., Lythe G., Garcia M. L., Thomas P. G. and Gaevert J. (2020). Stochastic modelling of the $T$ cell repertoire with epitope affinity.
Edwards N., Dillard C., Prashant N. M., Hongyu L., Yang M., Ulianova E. and Horvath A. (2022). SCExecute: custom cell barcode-stratified analyses of scRNA-seq data. Bioinformatics 39. btac768. [DOI] [PMC free article] [PubMed] [Google Scholar]
Elhanati Y., Sethna Z., Callan C. G. Jr, Mora T. and Walczak A. M. (2018). Predicting the spectrum of TCR repertoire sharing with a data-driven model of recombination. Immunological reviews 284 167–179. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fairfax B. P., Taylor C. A., Watson R. A., Nassiri I., Danielli S., Fang H., Mahé E. A., Cooper R., Woodcock V., Traill Z., Al-Mossawi M. H., Knight J. C., Klenerman P., Payne M. and Middleton M. R. (2020). Peripheral CD8+ T cell characteristics associated with durable responses to immune checkpoint blockade in patients with metastatic melanoma. Nature Medicine 26 193–199. [DOI] [PMC free article] [PubMed] [Google Scholar]
Gaimann M. U., Nguyen M., Desponds J. and Mayer A. (2020). Early life imprints the hierarchy of T cell clone sizes. Elife 9 e61639.
Ganesan S. and Mehnert J. (2020). Biomarkers for Response to Immune Checkpoint Blockade. Annual Review of Cancer Biology 4 331–351.
Grimmett G. and Stirzaker D. (2001). Probability and Random Processes, 3rd ed. Oxford University Press.
Hill B. M. (1970). Zipf’s Law and Prior Distributions for the Composition of a Population. Journal of the American Statistical Association 65 1220–1232.
Hodgkin P. D., Dowling M. R. and Duffy K. R. (2014). Why the immune system takes its chances with randomness. Nature Reviews Immunology 14 711–711.
Jombart T., Balloux F. and Dray S. (2010). Adephylo: new tools for investigating the phylogenetic signal in biological traits. Bioinformatics 26 1907–1909.
Jones I. M., Galick H., Kato P., Langlois R. G., Mendelsohn M. L., Murphy G. A., Pleshanov P., Ramsey M. J., Thomas C. B., Tucker J. D. et al. (2002). Three somatic genetic biomarkers and covariates in radiation-exposed Russian cleanup workers of the Chernobyl nuclear reactor 6–13 years after exposure. Radiation research 158 424–442.
Kaitz N. A., Zuleger C. L., Yu P., Newton M. A., Albertini R. J. and Albertini M. R. (2022). Molecular Characterization of Hypoxanthine Guanine Phosphoribosyltransferase Mutant T cells in Human Blood: The Concept of Surrogate Selection for Immunologically Relevant Cells. Mutation Research/Reviews in Mutation Research 789 108414.
Kendall D. G. (1960). Birth-and-death processes, and the theory of carcinogenesis. Biometrika 47 13–21.
Kendall M. G. and Stuart A. (1977). The Advanced Theory of Statistics: Distribution theory, 4th ed. The Advanced Theory of Statistics. Macmillan.
Koch H., Starenki D., Cooper S. J., Myers R. M. and Li Q. (2018). powerTCR: A model-based approach to comparative analysis of the clone size distribution of the T cell receptor repertoire. PLoS computational biology 14 e1006571.
Lande R. (1996). Statistics and partitioning of species diversity, and similarity among multiple communities. Oikos 5–13.
Li G., Iyer B., Prasath V. B. S., Ni Y. and Salomonis N. (2021). DeepImmuno: deep learning-empowered prediction and generation of immunogenic peptides for T-cell immunity. Briefings in Bioinformatics 22 bbab160.
Lozano A. X., Chaudhuri A. A., Nene A., Bacchiocchi A., Earland N., Vesely M. D., Usmani A., Turner B. E., Steen C. B., Luca B. A., Badri T., Gulati G. S., Vahid M. R., Khameneh F., Harris P. K., Chen D. Y., Dhodapkar K., Sznol M., Halaban R. and Newman A. M. (2022). T cell characteristics associated with toxicity to immune checkpoint blockade in patients with melanoma. Nature Medicine 28 353–362.
Lu T., Zhang Z., Zhu J., Wang Y., Jiang P., Xiao X., Bernatchez C., Heymach J. V., Gibbons D. L., Wang J. et al. (2021). Deep learning-based prediction of the T cell receptor-antigen binding specificity. Nature machine intelligence 3 864–875.
Lynch W. C. (1965). More combinatorial properties of certain trees. The Computer Journal 7 299–302.
Lythe G. and Molina-París C. (2018). Some deterministic and stochastic mathematical models of naïve T-cell homeostasis. Immunological reviews 285 206–217.
Mahmoud H. M. (1992). Evolution of random search trees. Wiley-Interscience series in discrete mathematics and optimization. Wiley, New York.
Mahmoud H. M. and Neininger R. (2003). Distribution of distances in random binary search trees. The Annals of Applied Probability 13 253–276.
McKenna A., Hanna M., Banks E., Sivachenko A., Cibulskis K., Kernytsky A., Garimella K., Altshuler D., Gabriel S., Daly M. et al. (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome research 20 1297–1303.
Molina-París C. and Lythe G. (2021). Mathematical, Computational and Experimental T Cell Immunology. Springer.
Nicklas J. A., Albertini R. J., Vacek P. M., Ardell S. K., Carter E. W., McDiarmid M. A., Engelhardt S. M., Gucer P. W. and Squibb K. S. (2015). Mutagenicity monitoring following battlefield exposures: Molecular analysis of HPRT mutations in Gulf War I veterans exposed to depleted uranium. Environmental and molecular mutagenesis 56 594–608.
Nikolich-Žugich J., Slifka M. K. and Messaoudi I. (2004). The many important facets of T-cell repertoire diversity. Nature Reviews Immunology 4 123–132.
Paradis E. and Schliep K. (2019). ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics 35 526–528.
Pei Q., Zuleger C. L., Macklin M. D., Albertini M. R. and Newton M. A. (2014). A conditional predictive p-value to compare a multinomial with an overdispersed multinomial in the analysis of T-cell populations. Biostatistics 15 129–139.
Pennock N. D., White J. T., Cross E. W., Cheney E. E., Tamburini B. A. and Kedl R. M. (2013). T cell responses: naïve to memory and everything in between. Advances in Physiology Education 37 273–283.
Peruzzi B., Araten D. J., Notaro R. and Luzzatto L. (2010). The use of PIG-A as a sentinel gene for the study of the somatic mutation rate and of mutagenic agents in vivo. Mutation Research/Reviews in Mutation Research 705 3–10.
Pfanzagl J. (1964). On the topological structure of some ordered families of distributions. The Annals of Mathematical Statistics 35 1216–1228.
Rane S., Hogan T., Seddon B. and Yates A. J. (2018). Age is not just a number: Naive T cells increase their ability to persist in the circulation over time. PLoS biology 16 e2003949.
Roshan A., Jones P. and Greenman C. (2014). Exact, time-independent estimation of clone size distributions in normal and mutated cells. Journal of The Royal Society Interface 11 20140654.
Rothman E. D. and Templeton A. R. (1980). A class of models of selectively neutral alleles. Theoretical Population Biology 18 135–150.
Sepúlveda N., Paulino C. D. and Carneiro J. (2010). Estimation of T-cell repertoire diversity and clonal size distribution by Poisson abundance models. Journal of immunological methods 353 124–137.
Shum B., Larkin J. and Turajlic S. (2022). Predictive biomarkers for response to immune checkpoint inhibition. In Seminars in cancer biology 79 4–17. Elsevier.
Smith C. J., Venturi V., Quigley M. F., Turula H., Gostick E., Ladell K., Hill B. J., Himelfarb D., Quinn K. M., Greenaway H. Y. et al. (2020). Stochastic Expansions Maintain the Clonal Stability of CD8+ T Cell Populations Undergoing Memory Inflation Driven by Murine Cytomegalovirus. The Journal of Immunology 204 112–21.
Steel M. and McKenzie A. (2001). Properties of phylogenetic trees generated by Yule-type speciation models. Mathematical Biosciences 170 91–112.
Stirk E. R., Molina-París C. and van den Berg H. A. (2008). Stochastic niche structure and diversity maintenance in the T cell repertoire. Journal of theoretical biology 255 237–249.
Tavaré S. (2021). The magical Ewens sampling formula. Bulletin of the London Mathematical Society 53 1563–1582.
Valpione S., Galvani E., Tweedy J., Mundra P. A., Banyard A., Middlehurst P., Barry J., Mills S., Salih Z., Weightman J., Gupta A., Gremel G., Baenke F., Dhomen N., Lorigan P. C. and Marais R. (2020). Immune awakening revealed by peripheral T cell dynamics after one cycle of immunotherapy. Nature Cancer 1 210–221.
Valpione S., Mundra P. A., Galvani E., Campana L. G., Lorigan P., de Rosa F., Gupta A., Weightman J., Mills S., Dhomen N. and Marais R. (2021). The T cell receptor repertoire of tumor infiltrating T cells is predictive and prognostic for cancer survival. Nature Communications 12 4098.
van den Broek T., Borghans J. A. and van Wijk F. (2018). The full spectrum of human naive T cells. Nature Reviews Immunology 18 363–373.
Zhan Y., Carrington E. M., Zhang Y., Heinzel S. and Lew A. M. (2017). Life and Death of Activated T Cells: How Are They Different from Naïve T Cells? Frontiers in Immunology 8 1809.
Zhang Z. and Zhou J. (2010). Re-parameterization of multinomial distributions and diversity indices. Journal of Statistical Planning and Inference 140 1731–1738.
Zuleger C. L., Macklin M. D., Bostwick B. L., Pei Q., Newton M. A. and Albertini M. R. (2011). In vivo 6-thioguanine-resistant T cells from melanoma patients have public TCR and share TCR beta amino acid sequences with melanoma-reactive T cells. Journal of Immunological Methods 365 76–86.
Zuleger C. L., Newton M. A., Ma X., Ong I. M., Pei Q. and Albertini M. R. (2020). Enrichment of melanoma-associated T cells in 6-thioguanine-resistant T cells from metastatic melanoma patients. Melanoma research 30 52.

Associated Data

Supplementary Materials

Supplement 1

NIHPP2023.07.13.548950v1-supplement-1.pdf (1.2 MB, pdf)
