J Appl Stat. Published online 30 April 2025 (before final editing). doi: 10.1080/02664763.2025.2496724

An empirical Bayes approach for constructing confidence intervals for clonality and entropy

Zhongren Chen, Lu Tian, and Richard A. Olshen

Abstract

This paper is motivated by the need to quantify human immune responses to environmental challenges. Specifically, the genome of the selected cell population from a blood sample is amplified by the PCR process, producing a large number of reads. Each read corresponds to a particular rearrangement of so-called V(D)J sequences. The observed data consist of a set of integers, representing numbers of reads corresponding to different V(D)J sequences. The underlying relative frequencies of distinct V(D)J sequences can be summarized by a probability vector, with the cardinality being the number of distinct V(D)J rearrangements. The statistical question is to make inferences on a summary parameter of this probability vector based on a multinomial-type observation of a large dimension. Popular summaries of the diversity include clonality and entropy. A point estimator of the clonality based on multiple replicates from the same blood sample has been proposed previously. Therefore, the remaining challenge is to construct confidence intervals of the parameters to reflect their uncertainty. In this paper, we propose to couple the Empirical Bayes method with a resampling-based calibration procedure to construct a robust confidence interval for different population diversity parameters. The method is illustrated via extensive numerical studies and real data examples.

Keywords: Empirical Bayes, clonality, entropy, confidence interval, resampling method

1. Introduction

This paper is motivated by the need to construct confidence intervals (CIs) for parameters summarizing the diversity of a cell population consisting of cells of different types. We first introduce the sources of data and give some details here. More biological and statistical background in the case of point estimation can be found in a previous paper [22]. The problem has its biomedical origin in attempting to quantify human immune responses to an environmental challenge; more specifically, to quantify the adaptive immunologic response to any antigen, e.g. vaccination against the COVID-19 virus. Briefly, blood is sampled from a patient. This blood sample may be divided, as equally as possible, into several parts, i.e. replicates. The genome of the selected cell subpopulation in each replicate is amplified by the well-known PCR process of successive heating and cooling. One resulting product from each replicate is then randomly chosen for sequencing, producing a large number of reads, roughly 30,000 to 300,000 per replicate. Each read corresponds to a particular rearrangement of the so-called V(D)J sequence.

In the end, the observation from a particular replicate consists of a set of numbers of reads for different V(D)J sequences. Mathematically, each observation can then be thought of as a finite-dimensional random vector $Z=(Z_1,Z_2,\ldots,Z_C)$ reflecting the underlying relative frequencies of different cell subpopulations with particular V(D)J rearrangements in the entire circulation system from which the blood was sampled originally. The underlying relative frequencies can be summarized by a probability vector $p=(p_1,p_2,\ldots,p_{C_0})$, where $C_0$, the cardinality of the vector $p$, represents the total number of different V(D)J rearrangements in the circulation. While $C_0$ is finite, its value is unknown, since some of the V(D)J rearrangements may not be observed in the replicate due to their rarity. In other words, $C$, the number of distinct observed V(D)J rearrangements, could be substantially smaller than the number of distinct V(D)J rearrangements in the blood. Therefore, the observed frequency vector $Z$ consists of only the positive components of the full frequency vector $\tilde Z=(\tilde Z_1,\ldots,\tilde Z_{C_0})$ corresponding to the underlying probability vector $p$. While the exact value of $C_0$ is unknown, we will show later that it can be estimated by an EM algorithm and that it has only a limited effect on the parameter of interest. It has been noted that the PCR process may favor some rearrangements over others; that distinction is ignored here.

The statistical task is to make inferences on a summary parameter based on this single multinomial-type observation, $Z$. One popular summary of the diversity of a cell population is its clonality, $G_C(p)=\|p\|_2^2$, the squared $\ell_2$ norm of the vector $p\in\mathbb{R}^{C_0}$, which varies between $C_0^{-1}$ and 1. A lower clonality value indicates a more uniform probability distribution of different V(D)J rearrangements in the blood, corresponding to higher diversity in the cell population. We report a study of point estimation for clonality based on multiple replicates for each blood sample in a previous paper [22]. After obtaining a point estimator of a particular function of $p$, such as the clonality characterizing the composition of the cell population, the remaining challenge is to construct a confidence interval for the relevant parameter to appropriately reflect the uncertainty in this point estimator. Consequently, in what follows, attention is devoted to interval estimation. Specifically, we propose to study point and interval estimation for clonality without requiring multiple replicates. Another extremely popular measure of population diversity is entropy, which is defined as

$$G_E(p) = -\sum_{j=1}^{C_0} p_j \log p_j$$

(see, for example, [14]). A higher entropy value also reflects a more uniform probability distribution of V(D)J rearrangements, indicating increased diversity. Many authors prefer to summarize the variability of $p$ by its entropy. Of course, the notion of entropy has a storied history in communications, indeed in many aspects of engineering [24]. If $C_0$ is small, e.g. 10, and all $p_i$'s are not too close to zero, then it is possible to estimate the entire probability vector $p$ accurately with a large number of reads. In this setting, a plug-in estimator of the clonality or entropy would serve as a reasonable approximation to the true parameter value. Then, in this case, one can readily apply the delta method or the parametric bootstrap to construct a 95% confidence interval for clonality or entropy. The difficulty arises when $C_0$ is very large. In such a case, the estimation of the entropy itself is very difficult [8]. In particular, we are unaware of careful attempts to form confidence intervals for entropy. Existing methods based on the nonparametric maximum likelihood estimator or the Horvitz-Thompson estimator [8,19] are not directly applicable in very high-dimensional settings. The method proposed in this paper covers point and interval estimation for parameters such as entropy or other functions of the probability vector $p$. Lastly, different summaries for clone distributions have been proposed. For example, [15] discussed multiple diversity indices, including species richness, entropy, the Simpson index, and the Berger-Parker index. In ecology, methods for estimating the number of distinct species have been well studied, as seen in the influential work of [6,7]. In a related work, Efron aimed to infer the number of words known by Shakespeare but not used in his work [11]. Estimating the proportions of unseen clones or species has been explored in [1]. In this paper, we primarily focus on making rigorous statistical inferences on clonality and entropy.
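
To fix ideas, here is a minimal Python sketch of the two diversity summaries applied to a known probability vector; the function names are ours, introduced purely for illustration.

```python
import numpy as np

def clonality(p):
    """Clonality G_C(p) = ||p||_2^2; ranges from 1/C0 (uniform) to 1 (one clone)."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p ** 2))

def entropy(p):
    """Entropy G_E(p) = -sum_j p_j log p_j; maximized by the uniform distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # convention: 0 * log(0) = 0
    return float(-np.sum(p * np.log(p)))

# Uniform vector over C0 = 10 clones: clonality = 1/10, entropy = log(10).
p = np.full(10, 0.1)
print(clonality(p), entropy(p))       # 0.1  2.302585...
```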

We begin with an explanation of matters that bear upon parametric approaches in Section 2. The key is that we assume the $C_0$ components of the probability vector $p$ are realizations from a parametric prior distribution [2], which can be estimated based on observed data, although there is often insufficient information to estimate the values of all individual components. With the estimated prior distribution generating the individual probability components, point and interval estimates for the function of interest can be obtained based on the posterior distribution derived from Bayes' theorem [2]. This is essentially an Empirical Bayes (EB) approach, in which we estimate the parameters of the parametric prior distribution based on observed data, which in turn generates the probability vector of interest. However, such a naive interval may fail to cover the true value of the function at the desired coverage level, for the simple reason that the uncertainty in estimating the prior distribution is not considered in this EB approach [4,5]. Therefore, we propose an additional calibration step to correct the under-coverage of this naive EB confidence interval.

A good part of understanding the performance of the constructed confidence interval comes from extensive computations on simulated, clinical, or other suitable data. Such computations are given in the Numerical Study Section (Section 3) and the Real Data Example Section (Section 4), which follow the Method Section (Section 2). Suggestions for further research are topics of our Discussion Section (Section 5).

2. Method

2.1. The general framework

The complete data consist of $n$ pairs of observations $(Z_i, Y_i), i=1,\ldots,n$. Let

$$(Z_i, Y_i) \overset{\text{i.i.d.}}{\sim} p(z, y \mid \theta_0), \quad i=1,\ldots,n,$$

where $p(z, y \mid \theta_0)$ is the density function for the joint distribution of $(Z_i, Y_i), i=1,\ldots,n$. Suppose that we only observe $Z=(Z_1,\ldots,Z_n)$, while $Y=(Y_1,\ldots,Y_n)$ is missing but correlated with $Z$, and $\theta_0$ is a fixed but unknown parameter. Our aim is to construct a confidence interval covering $G(Y)=G(Y_1,\ldots,Y_n)$ with a specified probability based on the observed data $Z$ only, where $G(\cdot)$ is a given function. Note that, unlike the conventional setting, where the parameter of interest is a deterministic population parameter, $G(Y)$ is a random variable that varies from dataset to dataset. We first present a simple confidence interval for $G(Y)$ in Section 2.2. We then discuss its limitations and propose a calibration method based on the parametric bootstrap in Section 2.3.

2.2. A naive approach

Assuming that a point estimator $\hat\theta(Z)$ of $\theta_0$ based on $Z$ is available, we can derive the conditional distribution of $Y_i \mid Z_i, \theta_0=\hat\theta(Z), i=1,\ldots,n$, and simulate multiple copies of $G^*=G(Y)$ directly from the conditional distribution of $G(Y) \mid Z, \theta_0=\hat\theta(Z)$. A confidence interval for $G(Y)$ can then be constructed based on the sampled $G^*$'s. See Algorithm 1 for this naive approach.

Algorithm 1.

A Naive Algorithm to Construct a 95% Confidence Interval for $G(Y_1, Y_2, \ldots, Y_n)$

1: Find a point estimator $\hat\theta(Z)$ of $\theta_0$.
2: Simulate $Y_i^*$ from the conditional distribution $Y_i \mid Z_i, \theta_0=\hat\theta(Z)$, for $i=1,\ldots,n$.
3: Compute $G^*=G(Y_1^*,\ldots,Y_n^*)$.
4: Repeat steps 2 and 3 a large number of times, obtain $G_1^*,\ldots,G_B^*$, and return the 2.5th and 97.5th percentiles of the empirical distribution of $G_1^*,\ldots,G_B^*$ as a 95% confidence interval for $G(Y)$. Here, $B$ is a positive integer large enough that the 2.5th and 97.5th percentiles of the empirical distribution of $G_1^*,\ldots,G_B^*$ accurately approximate the corresponding quantiles of $G^*$.
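
A schematic Python sketch of Algorithm 1 follows. The helpers `estimate_theta`, `sample_y_given_z`, and `G` are hypothetical placeholders for the model-specific components; only the percentile logic is prescribed by the algorithm.

```python
import numpy as np

def naive_eb_ci(Z, estimate_theta, sample_y_given_z, G, B=500, level=0.95):
    """Sketch of Algorithm 1: a percentile interval for G(Y) drawn from the
    plug-in posterior. estimate_theta(Z) -> theta_hat and
    sample_y_given_z(Z, theta) -> Y* are hypothetical, model-specific callables.
    """
    theta_hat = estimate_theta(Z)                        # step 1
    draws = np.array([G(sample_y_given_z(Z, theta_hat))  # steps 2-3
                      for _ in range(B)])                # step 4: repeat B times
    alpha = 1.0 - level
    return np.percentile(draws, [100 * alpha / 2, 100 * (1 - alpha / 2)])
```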

Algorithm 1 was first proposed by [18] in a different setting and is well known. However, this simple confidence interval does not account for the uncertainty in estimating $\theta_0$. As a result, it often fails to cover the parameter of interest at the nominal level [16]. To make the constructed intervals robust to the variability of $\hat\theta(Z)$, we propose to improve this naive confidence interval by coupling a method that explicitly incorporates the variance of $\hat\theta(Z)$ in deriving the posterior distribution with a calibration step using the parametric bootstrap [12].

2.3. Calibration

To improve the naive confidence interval, we first obtain an estimator $\hat\theta(Z)$ of $\theta_0$ and a variance estimator $\hat J(Z)$ of $\hat\theta(Z)$. For example, when the point estimator $\hat\theta(Z)$ is the MLE, its variance can be estimated by the inverse of the observed information matrix [13], which is consistent and widely used in practice [10]. Expecting that $\hat\theta(Z)$ behaves as an M-estimator [23], and thus approximately follows a multivariate Gaussian distribution, we construct the confidence interval based on an adjusted posterior distribution:

$$G(Y) \mid Z, \theta_0=\theta^*; \qquad \theta^* \sim N(\hat\theta(Z), \hat J(Z)), \qquad (1)$$

which, in general, has a larger variability than the naive posterior distribution

$$G(Y) \mid Z, \theta_0=\hat\theta(Z),$$

and results in a wider confidence interval.

In addition, we also consider a calibration step based on the parametric bootstrap [12], similar to the correction introduced in [3]. To be specific, we simulate the 'observed' data from the assumed model with $\theta_0=\hat\theta(Z)$, simulate realizations of $G(Y)$ from the posterior distribution given above, and construct $100(1-\alpha)\%$ confidence intervals for different $\alpha$ based on the quantiles of samples drawn from the posterior distribution. After repeating this simulation a large number of times, we examine the empirical coverage level of the constructed confidence intervals with respect to the true $G(Y)$ in the simulated data, anticipating that the empirical coverage level of the $100(1-\alpha)\%$ confidence intervals may differ from $(1-\alpha)$. We find a value $\alpha_{0.95}$ such that the corresponding $100(1-\alpha_{0.95})\%$ confidence interval has an empirical coverage level of 95%. (Here, 95% may be replaced by any other nominal coverage level chosen by the researchers.) This $\alpha_{0.95}$ level will then be used to construct the 95% confidence interval based on the posterior distribution (1) from the original data. The mathematical rationale is that

$$P\{G(Y) < q_\alpha(Z) \mid \theta_0\} \approx P\{G(Y) < q_\alpha(Z) \mid \hat\theta(Z)\},$$

where $q_\alpha(Z)$ is the $\alpha$th quantile of the posterior distribution (1), and the parametric bootstrap is used to estimate $P\{G(Y) < q_\alpha(Z) \mid \hat\theta(Z)\}$. This step can be viewed as a calibration step based on a bootstrap method [12]. In the end, denoting the resulting confidence interval by $[\hat L(Z), \hat U(Z)]$, we expect that

$$P\{\hat L(Z) \le G(Y) \le \hat U(Z)\} = 0.95,$$

where the probability is with respect to the joint distribution of $(Y, Z)$. The detailed steps are outlined in Algorithm 2.

2.4. Interval estimates of clonality and entropy

We now apply the general algorithm introduced in Section 2.3 to construct confidence intervals for clonality and entropy. Consider a parametric model for the number of cells from a clone, i.e. cells with a particular V(D)J rearrangement:

$$\lambda_i \sim \text{Gamma}(a_0, b_0), \quad i=1,\ldots,C_0,$$
$$Z_i \sim \text{Poisson}(\lambda_i), \quad i=1,\ldots,C_0,$$

where $C_0$ is the total number of clones. $\{\lambda_i\}_{i=1}^{C_0}$ and $(a_0, b_0)$ are unknown to us, but we observe $Z_i$ if $Z_i>0$, which is the number of reads corresponding to the $i$th V(D)J rearrangement. Therefore, the observed data consist of the truncated independent Poisson variables with parameters $\lambda_1,\ldots,\lambda_{C_0}$:

$$\{Z_i \mid Z_i>0,\ 1\le i\le C_0\}.$$

Note that the set of clones without any read is unobserved. To maintain consistency with the notation used in Section 2.1, we will later show that $C_0$ can be estimated by $\hat C$, and the vector $(Z_1,\ldots,Z_{\hat C})$ can then be treated as the 'observed data' $Z$ defined in Section 2.1. Now, we define the clonality as $G_C$:

$$G_C := \sum_{i=1}^{C_0}\left(\frac{\lambda_i}{\sum_{j=1}^{C_0}\lambda_j}\right)^2,$$

and entropy as GE:

$$G_E := -\sum_{i=1}^{C_0}\frac{\lambda_i}{\sum_{j=1}^{C_0}\lambda_j}\log\left(\frac{\lambda_i}{\sum_{j=1}^{C_0}\lambda_j}\right).$$

Both are functions of $\lambda_1,\ldots,\lambda_{C_0}$. Our aim is to obtain point estimates as well as confidence intervals for $G_C$ and $G_E$.
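
The following sketch simulates data from this Poisson-Gamma model and evaluates $G_C$ and $G_E$ at the underlying rates; the parameter values are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
a0, b0, C0 = 0.5, 0.8, 10_000         # illustrative values, not from the paper

# lambda_i ~ Gamma(a0, rate b0): numpy parameterizes by scale = 1/rate.
lam = rng.gamma(shape=a0, scale=1.0 / b0, size=C0)
Z = rng.poisson(lam)                  # Z_i ~ Poisson(lambda_i)
Z_obs = Z[Z > 0]                      # only clones with positive reads are seen

p = lam / lam.sum()
G_C = np.sum(p ** 2)                  # clonality of the underlying rates
G_E = -np.sum(p * np.log(p))          # entropy of the underlying rates
print(len(Z_obs), G_C, G_E)
```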

Algorithm 2.

A General Algorithm to Construct a 95% Confidence Interval for $G(Y)$

1: Obtain a point estimator $\hat\theta(Z)$ of $\theta_0$.
2: Compute a consistent variance estimator $\hat J(Z)$ of $\hat\theta(Z)$.
3: for $b=1,2,\ldots,B$ do
4:  Simulate $\tilde\theta_1^{(b)},\ldots,\tilde\theta_n^{(b)} \overset{\text{i.i.d.}}{\sim} N(\hat\theta(Z), \hat J(Z))$.
5:  Simulate $Y_1^{(b)*} \sim p_Y(y \mid Z_1, \tilde\theta_1^{(b)}), \ldots, Y_n^{(b)*} \sim p_Y(y \mid Z_n, \tilde\theta_n^{(b)})$, where $p_Y(y \mid z, \theta)$ is the density function of $Y$ conditional on $Z=z$ and $\theta_0=\theta$.
6:  Compute $G_b^*(Z) = G(Y_1^{(b)*}, \ldots, Y_n^{(b)*})$.
7: end for
8: for $i=1,2,\ldots,R$ do
9:  Simulate a new dataset $\{(Y_{ij}^*, Z_{ij}^*)\}_{j=1}^n \overset{\text{i.i.d.}}{\sim} p(y, z \mid \hat\theta(Z))$. Let $Z_i^* = \{Z_{ij}^*\}_{j=1}^n$ and $Y_i^* = \{Y_{ij}^*\}_{j=1}^n$.
10:  Calculate the random variable to be estimated: $G_i^* = G(Y_{i1}^*, \ldots, Y_{in}^*) = G(Y_i^*)$.
11:  Obtain the point estimator $\hat\theta(Z_i^*)$ of $\theta_0$ based on the generated data $Z_i^*$.
12:  Obtain a variance estimator $\hat J(Z_i^*)$ of $\hat\theta(Z_i^*)$.
13:  Use Steps 3–6 to obtain $G_1(Z_i^*), \ldots, G_B(Z_i^*)$.
14:  Construct the $100(1-\alpha)\%$ confidence interval as the interval between the $100\alpha/2$th and $100(1-\alpha/2)$th percentiles of $G_1(Z_i^*), \ldots, G_B(Z_i^*)$, denoted by $\widehat{CI}_i(1-\alpha)$.
15: end for
16: Calculate the empirical coverage level of $\widehat{CI}_i(1-\alpha)$ as
$$\frac{1}{R}\sum_{i=1}^R I\{G_i^* \in \widehat{CI}_i(1-\alpha)\},$$
where $I(\cdot)$ is the indicator function.
17: Determine the value $\alpha_{0.95}$ such that the empirical coverage level of $\widehat{CI}_i(1-\alpha_{0.95})$ is 95%, i.e.
$$\frac{1}{R}\sum_{i=1}^R I\{G_i^* \in \widehat{CI}_i(1-\alpha_{0.95})\} = 0.95.$$
18: Return the 95% confidence interval for $G(Y)$ based on the observed data as the interval between the $100\alpha_{0.95}/2$th and $100(1-\alpha_{0.95}/2)$th quantiles of $G_1^*(Z), \ldots, G_B^*(Z)$.
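
Steps 16-17 of Algorithm 2, the search for $\alpha_{0.95}$, can be sketched as follows; the grid search over $\alpha$ is one convenient implementation choice among several.

```python
import numpy as np

def calibrate_alpha(G_true, G_draws, target=0.95, grid=None):
    """Sketch of steps 16-17 of Algorithm 2: choose alpha so that the
    100(1-alpha)% percentile intervals empirically cover the bootstrap
    targets at the nominal level.

    G_true : (R,) array of G(Y_i*) from the parametric-bootstrap datasets
    G_draws: (R, B) array of posterior draws G_1(Z_i*), ..., G_B(Z_i*)
    """
    if grid is None:
        grid = np.linspace(0.001, 0.5, 200)   # candidate alpha values
    coverage = np.empty(len(grid))
    for k, alpha in enumerate(grid):
        lo = np.percentile(G_draws, 100 * alpha / 2, axis=1)
        hi = np.percentile(G_draws, 100 * (1 - alpha / 2), axis=1)
        coverage[k] = np.mean((lo <= G_true) & (G_true <= hi))
    return grid[np.argmin(np.abs(coverage - target))]
```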

To fit into the general framework described in Section 2.1, $Z$ corresponds to $(Z_1,\ldots,Z_{\hat C})$ and $Y$ corresponds to $(\lambda_1,\ldots,\lambda_{\hat C})$; $\theta_0$ corresponds to $(a_0, b_0)$, the shape and rate parameters of the gamma distribution. Below, we outline the algorithm step by step.

2.4.1. Estimation of $(a_0, b_0)$ and $C_0$

Marginally, $Z_i$ follows a negative binomial distribution, i.e.

$$P(Z_i=z) = p_Z(z \mid a_0, b_0) = \int_0^\infty \frac{e^{-\lambda}\lambda^z}{z!}\cdot\frac{\lambda^{a_0-1}e^{-b_0\lambda}b_0^{a_0}}{\Gamma(a_0)}\,d\lambda = \binom{z+a_0-1}{z}\left(\frac{b_0}{b_0+1}\right)^{a_0}\left(\frac{1}{b_0+1}\right)^z,$$

where $p_Z(\cdot \mid a, b)$ represents the probability mass function of $Z_i$. We aim to first estimate the parameters $a_0$ and $b_0$ of the gamma distribution based on the observed data consisting of positive read counts

$$Z_O = \{Z_i \mid Z_i>0,\ i=1,\ldots,C_0\}.$$

The resulting log-likelihood function in terms of $(a_0, b_0)$ is

$$l(a,b) = \sum_{z_i \in Z_O} \log\left\{\frac{p_Z(z_i \mid a, b)}{1-p_Z(0 \mid a, b)}\right\}, \qquad (2)$$

which can be maximized via an EM algorithm [9,17].
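
Since the marginal of $Z_i$ is negative binomial with size $a$ and success probability $b/(b+1)$, the truncated log-likelihood (2) can also be coded directly; the sketch below maximizes it numerically rather than by EM, and can serve as a cross-check for the EM fit described next.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import nbinom

def trunc_nb_loglik(log_ab, z_obs):
    """Log-likelihood (2): the marginal of Z_i is negative binomial with
    size a and success probability b/(b+1), conditioned on Z_i > 0."""
    a, b = np.exp(log_ab)             # optimize on the log scale for positivity
    p = b / (b + 1.0)
    logp0 = nbinom.logpmf(0, a, p)
    return np.sum(nbinom.logpmf(z_obs, a, p) - np.log1p(-np.exp(logp0)))

def fit_trunc_nb(z_obs):
    """Direct numerical maximization of (2); a cross-check for the EM fit."""
    res = minimize(lambda t: -trunc_nb_loglik(t, z_obs),
                   x0=np.zeros(2), method="Nelder-Mead")
    return np.exp(res.x)              # (a_hat, b_hat)
```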

If $C_0$ were known, i.e. if we also observed the clones with $Z_i=0$, the full log-likelihood function would become

$$\sum_{z_i \in Z_O} \log p_Z(z_i \mid a_0, b_0) + \sum_{z_i=0} \log p_Z(0 \mid a_0, b_0). \qquad (3)$$

Therefore, the EM algorithm consists of the following E and M steps.

  • E-step: For given estimators $\hat a$ and $\hat b$, we have
    $$\hat n_0 = E(C_0 - C \mid \hat a, \hat b) = \frac{p_Z(0 \mid \hat a, \hat b)}{1-p_Z(0 \mid \hat a, \hat b)}\,C$$
    based on the relationship
    $$p_Z(0 \mid \hat a, \hat b)\{E(C_0 - C \mid \hat a, \hat b) + C\} = E(C_0 - C \mid \hat a, \hat b),$$
    where $C$ is the number of observed nonzero $Z_i$'s. Since
    $$p_Z(0 \mid a_0, b_0) = \left(\frac{b_0}{b_0+1}\right)^{a_0},$$
    the expected difference between $C_0$ and $C$ can be expressed as
    $$\hat n_0 = E(C_0 - C \mid \hat a, \hat b) = \frac{\hat b^{\hat a} C}{(\hat b+1)^{\hat a} - \hat b^{\hat a}}.$$
  • M-step: Update the MLE of $(a_0, b_0)$ by maximizing
    $$\sum_{z_i \in Z_O} \log p_Z(z_i \mid a_0, b_0) + \hat n_0 \log p_Z(0 \mid a_0, b_0).$$
    This optimization can be achieved via an inner EM algorithm treating the $\lambda_i$ as missing variables:
    1. E-step: For given $(\hat a, \hat b)$,
      $$E(\lambda_i \mid \hat a, \hat b) = \frac{Z_i + \hat a}{1+\hat b}, \quad E(\log\lambda_i \mid \hat a, \hat b) = \Psi(Z_i + \hat a) - \log(1+\hat b), \quad i=1,\ldots,C,$$
      $$E(\lambda_0 \mid \hat a, \hat b) = \frac{\hat a}{1+\hat b}, \quad E(\log\lambda_0 \mid \hat a, \hat b) = \Psi(\hat a) - \log(1+\hat b),$$
      where $\lambda_0$ corresponds to the Poisson rate of the unobserved clones with $Z_i=0$ and $\Psi(\cdot)$ is the digamma function, the derivative of the log-transformed gamma function, i.e. $\Psi(x) = \frac{d}{dx}\log\int_0^\infty t^{x-1}e^{-t}\,dt$.
    2. M-step: Maximize the log-likelihood function
      $$\begin{aligned} l(a_0, b_0) &= (a_0-1)\left[\sum_{i=1}^C E(\log\lambda_i \mid \hat a, \hat b) + \hat n_0 E(\log\lambda_0 \mid \hat a, \hat b)\right] - (C+\hat n_0)\{\log\Gamma(a_0) - a_0\log b_0\} - b_0\left[\sum_{i=1}^C E(\lambda_i \mid \hat a, \hat b) + \hat n_0 E(\lambda_0 \mid \hat a, \hat b)\right] \\ &= (a_0-1)\left[\sum_{i=1}^C \{\Psi(z_i+\hat a) - \log(1+\hat b)\} + \hat n_0\{\Psi(\hat a) - \log(1+\hat b)\}\right] - (C+\hat n_0)\{\log\Gamma(a_0) - a_0\log b_0\} - b_0\,\frac{\sum_{i=1}^C (z_i+\hat a) + \hat n_0\hat a}{1+\hat b}. \end{aligned}$$
    In practice, we iterate the inner EM algorithm to maximize
    $$\sum_{z_i \in Z_O} \log p_Z(z_i \mid a_0, b_0) + \hat n_0 \log p_Z(0 \mid a_0, b_0)$$
    for a given $\hat n_0$ and iterate the outer EM algorithm until $\hat n_0$ converges. The final convergence of $(\hat a, \hat b, \hat n_0)$ can be assessed by the relative change of the log-likelihood
    $$\sum_{z_i \in Z_O} \log\left\{\frac{p_Z(z_i \mid \hat a, \hat b)}{1-p_Z(0 \mid \hat a, \hat b)}\right\},$$
    whose value should increase with each iteration of the outer EM algorithm.
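
A condensed sketch of the outer EM loop is given below, assuming the closed-form E-step above; for brevity, the inner EM of the M-step is replaced here by a generic numerical maximization of the zero-augmented log-likelihood, which targets the same maximizer.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import nbinom

def em_trunc_nb(z_obs, a=1.0, b=1.0, tol=1e-8, max_iter=500):
    """Condensed sketch of the outer EM. The E-step uses the closed form
    n0_hat = b^a C / ((b+1)^a - b^a); the M-step maximizes the
    zero-augmented log-likelihood numerically, standing in for the inner
    EM described in the text (both target the same maximizer)."""
    C = len(z_obs)
    for _ in range(max_iter):
        # E-step: expected number of unobserved (zero-read) clones
        n0 = b**a * C / ((b + 1.0)**a - b**a)

        # M-step: maximize sum_i log p_Z(z_i|a,b) + n0 * log p_Z(0|a,b)
        def negloglik(t):
            aa, bb = np.exp(t)
            p = bb / (bb + 1.0)
            return -(np.sum(nbinom.logpmf(z_obs, aa, p))
                     + n0 * nbinom.logpmf(0, aa, p))
        a_new, b_new = np.exp(minimize(negloglik, np.log([a, b]),
                                       method="Nelder-Mead").x)
        converged = abs(a_new - a) + abs(b_new - b) < tol
        a, b = a_new, b_new
        if converged:
            break
    n0 = b**a * C / ((b + 1.0)**a - b**a)
    return a, b, C + int(round(n0))   # (a_hat, b_hat, C_hat)
```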

2.4.2. Estimation of the variance of $(\hat a, \hat b)$

After obtaining the MLE $(\hat a, \hat b)$ of $(a_0, b_0)$ using the EM algorithm, we can estimate the variance of $(\hat a, \hat b)$ by the inverse of the observed information matrix, which can be calculated from the second derivatives of the log-likelihood function with respect to $a$ and $b$, i.e.

$$\hat J(Z_O) = \left\{-\begin{pmatrix} \frac{\partial^2 l(a,b)}{\partial a^2} & \frac{\partial^2 l(a,b)}{\partial a \partial b} \\ \frac{\partial^2 l(a,b)}{\partial b \partial a} & \frac{\partial^2 l(a,b)}{\partial b^2} \end{pmatrix}\right\}^{-1}\Bigg|_{(a,b)=(\hat a, \hat b)}.$$
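
If closed-form second derivatives are inconvenient, the observed information can be approximated numerically; here is a finite-difference sketch, with the step size `eps` chosen ad hoc.

```python
import numpy as np

def observed_info_inverse(loglik, theta_hat, eps=1e-5):
    """Variance estimate J_hat = {-Hessian of l}^{-1} at the MLE, with the
    Hessian approximated by central finite differences; `loglik` maps a
    parameter vector (a, b) to l(a, b)."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    k = len(theta_hat)
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            def f(di, dj):
                t = theta_hat.copy()
                t[i] += di
                t[j] += dj
                return loglik(t)
            H[i, j] = (f(eps, eps) - f(eps, -eps)
                       - f(-eps, eps) + f(-eps, -eps)) / (4.0 * eps**2)
    return np.linalg.inv(-H)
```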

2.4.3. Sampling from the posterior distribution

In constructing the 95% confidence interval, we apply the log transformation to $(\hat a, \hat b)$ to ensure positivity, and generate

$$\begin{pmatrix} a^* \\ b^* \end{pmatrix} \sim \exp\left\{N\left(\begin{pmatrix} \log\hat a \\ \log\hat b \end{pmatrix},\ \hat A(Z_O)\hat J(Z_O)\hat A(Z_O)\right)\right\},$$

where we apply the delta method and $\hat A(Z_O) = \text{diag}(\hat a^{-1}, \hat b^{-1})$ is the Jacobian matrix. The posterior distribution of $\lambda_i^*$ is then

$$\lambda_i^* \sim \lambda_i \mid Z_i, a^*, b^*,$$

which is a gamma distribution with shape and rate parameters $a^* + Z_i$ and $b^* + 1$, respectively. This approach ensures that the sampled $a^*$ and $b^*$ are always positive. Operationally, we pretend that $C_0 = \hat C$ and let

$$Z = (Z_1, Z_2, \ldots, Z_C, Z_{C+1}=0, \ldots, Z_{\hat C}=0) = Z_O \cup \{0, \ldots, 0\}.$$

We then define

$$Y = (\lambda_1, \lambda_2, \ldots, \lambda_{\hat C}).$$

The confidence intervals $\widehat{CI}_C(Z_O)$ and $\widehat{CI}_E(Z_O)$ are calibrated such that, for clonality,

$$P\left\{\sum_{i=1}^{\hat C}\left(\frac{\lambda_i}{\sum_{j=1}^{\hat C}\lambda_j}\right)^2 \in \widehat{CI}_C(Z_O)\ \Bigg|\ \hat a, \hat b\right\} = 0.95,$$

and, for entropy,

$$P\left\{-\sum_{i=1}^{\hat C}\frac{\lambda_i}{\sum_{j=1}^{\hat C}\lambda_j}\log\left(\frac{\lambda_i}{\sum_{j=1}^{\hat C}\lambda_j}\right) \in \widehat{CI}_E(Z_O)\ \Bigg|\ \hat a, \hat b\right\} = 0.95.$$

The complete algorithm is provided in Algorithm 3. Note that the algorithm for constructing confidence intervals for entropy is the same as that for clonality, except that every $G_C(Y) = \sum_{i=1}^{C_0}\left(\lambda_i/\sum_{j=1}^{C_0}\lambda_j\right)^2$ is replaced by $G_E(Y) = -\sum_{i=1}^{C_0}\left(\lambda_i/\sum_{j=1}^{C_0}\lambda_j\right)\log\left(\lambda_i/\sum_{j=1}^{C_0}\lambda_j\right)$.
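
Steps 4-7 of Algorithm 3, the posterior sampling for the observed data, might look as follows in Python; the per-clone draws of $(a^*, b^*)$ mirror step 5, and swapping the clonality line for an entropy computation yields the $G_E$ version.

```python
import numpy as np

def posterior_clonality_draws(z, a_hat, b_hat, Sigma_log, B=500, rng=None):
    """Sketch of steps 4-7 of Algorithm 3. z is the read vector padded with
    zeros up to length C_hat; Sigma_log = A_hat J_hat A_hat is the
    delta-method covariance of (log a, log b)."""
    if rng is None:
        rng = np.random.default_rng()
    z = np.asarray(z, dtype=float)
    C_hat = len(z)
    mean_log = np.log([a_hat, b_hat])
    draws = np.empty(B)
    for b in range(B):
        # step 5: one (a*, b*) per clone, log-normal around the MLE
        ab = np.exp(rng.multivariate_normal(mean_log, Sigma_log, size=C_hat))
        a_star, b_star = ab[:, 0], ab[:, 1]
        # step 6: lambda_i* ~ Gamma(a* + z_i, rate b* + 1)
        lam = rng.gamma(shape=a_star + z, scale=1.0 / (b_star + 1.0))
        p = lam / lam.sum()
        draws[b] = np.sum(p ** 2)     # step 7: clonality of this draw
    return draws                      # percentiles of these give the interval
```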

Algorithm 3.

The Algorithm to Construct a 95% Confidence Interval for the Clonality

1: Use the proposed EM algorithm to obtain the MLEs of $a_0$, $b_0$, and $C_0$, denoted by $\hat a$, $\hat b$, and $\hat C$, respectively.
2: Compute $\hat J(Z_O)$ as the inverse of the observed information matrix for $(a_0, b_0)$, as well as $\hat A(Z_O)$.
3: Let $Z = (Z_1, \ldots, Z_C, Z_{C+1}=0, \ldots, Z_{\hat C}=0)$.
4: for $b=1,\ldots,B$ do
5:  Simulate $\left(a_i^{(b)*}, b_i^{(b)*}\right) \overset{\text{i.i.d.}}{\sim} \exp\left\{N\left(\left(\log\hat a, \log\hat b\right)^\top,\ \hat A(Z_O)\hat J(Z_O)\hat A(Z_O)\right)\right\}$ for $i=1,\ldots,\hat C$.
6:  Simulate $\lambda_i^{(b)*} \sim \text{Gamma}\left(a_i^{(b)*}+z_i,\ b_i^{(b)*}+1\right)$ for $i=1,\ldots,\hat C$.
7:  Compute $G_C^{(b)*}(Z_O) = \sum_{i=1}^{\hat C}\left\{\lambda_i^{(b)*}\big/\sum_{j=1}^{\hat C}\lambda_j^{(b)*}\right\}^2$.
8: end for
9: for $i=1,\ldots,R$ do
10:  Simulate $\lambda_{i1}^*, \ldots, \lambda_{i\hat C}^* \sim \text{Gamma}(\hat a, \hat b)$ and let $\Lambda_i^* := \left(\lambda_{ij}^*\right)_{j=1,\ldots,\hat C}$.
11:  Compute $G_C(\Lambda_i^*) = \sum_{j=1}^{\hat C}\left\{\lambda_{ij}^*\big/\sum_{k=1}^{\hat C}\lambda_{ik}^*\right\}^2$.
12:  Simulate a new set of data $Z_i^* = \left(Z_{ij}^*\right)_{j=1,\ldots,\hat C}$, where $Z_{ij}^* \sim \text{Poisson}(\lambda_{ij}^*)$.
13:  Form the observed data $Z_{iO}^* = \left\{Z_{ij}^* \mid Z_{ij}^* > 0,\ j=1,\ldots,\hat C\right\}$.
14:  Repeat steps 4–8 for $Z_{iO}^*$ and obtain $G_C^{(1)*}(Z_{iO}^*), \ldots, G_C^{(B)*}(Z_{iO}^*)$.
15:  Let $\widehat{CI}_\alpha(Z_{iO}^*)$ be the interval between the $100\alpha/2$th and $100(1-\alpha/2)$th percentiles of $G_C^{(1)*}(Z_{iO}^*), \ldots, G_C^{(B)*}(Z_{iO}^*)$.
16: end for
17: Determine the value $\hat\alpha_0$ such that the proportion of $G_C(\Lambda_i^*)$ falling in $\widehat{CI}_{\hat\alpha_0}(Z_{iO}^*)$ is closest to 95%, i.e.
$$\frac{1}{R}\sum_{i=1}^R I\left\{G_C(\Lambda_i^*) \in \widehat{CI}_{\hat\alpha_0}(Z_{iO}^*)\right\} \approx 0.95.$$
18: Return the $100\hat\alpha_0/2$th and $100(1-\hat\alpha_0/2)$th percentiles of $G_C^{(b)*}(Z_O)$, $b=1,\ldots,B$, as the 95% confidence interval for the clonality.

Remark 1: This proposal is closely related to [3,16]. Specifically, in order to construct valid confidence intervals for clonality and entropy, we combine the 'Type III bootstrap' in [16] with the calibration method in [3]. In other words, the calibration is applied to a confidence interval that already partially accounts for the randomness in the estimated hyperparameters of the prior distribution (instead of to the naive confidence interval from Algorithm 1). In addition, while [3,16] offer very general frameworks for EB inference, all of their examples are relatively simple. In the current application, the quantities of interest are complex functions of a very large number of parameters. We also need to deal with the challenge of missing data by deploying an appropriate EM algorithm, since only a truncated version of the reads of all V(D)J rearrangements is observable.

3. Simulation study

We conduct a comprehensive simulation study to examine the empirical performance of the constructed confidence intervals for both clonality and entropy. For each given set of $(a_0, b_0, C_0)$, we repeat the experiment 500 times to compute the empirical coverage level of the constructed confidence intervals. In constructing the confidence interval, we set the number of resampling replications for the calibration to $R=200$ and the number of posterior samples to $B=500$. $C_0$, the number of clones, is set at 10,000. Here, $C_0$ partially governs the number of observed clones in the observed dataset. More specifically, a higher $C_0$ is expected to reduce the variability of clonality or entropy and renders the effect of the calibration less sensitive to the accuracy of the estimated prior distribution. In addition, a higher $C_0$ also helps in estimating $(a_0, b_0)$, the parameters of the prior distribution. Thus, we expect the performance of the proposed confidence interval to improve for larger $C_0$ in general. The values of $a_0$ and $b_0$ are fixed at their maximum likelihood estimates based on the data examples in Section 4 to mimic real practice. For comparison purposes, we also construct the confidence intervals based on the naive Bayesian procedure given in Algorithm 1 and the confidence intervals based on the posterior distribution (1) without the calibration step.

The simulation results are summarized in Table 1. The empirical coverage level of our proposed confidence intervals is close to the nominal level. On the other hand, the confidence intervals based on the naive EB approach undercover the true parameter, which is expected given that the variability of $(\hat a, \hat b)$ is ignored when constructing the confidence interval. The coverage level of the confidence intervals without calibration is higher than the nominal level, suggesting that these intervals are too conservative. This may be because the method ignores the fact that $(\hat a, \hat b)$ also depends on $Z_O$ and is thus correlated with the quantiles of the posterior distribution. This observation confirms the essential role played by the calibration step, which is also the most computationally intensive part of the proposed algorithm. To illustrate this visually, we plot the confidence intervals for clonality constructed by the three methods in six trials of the simulation study with $(a_0, b_0, C_0) = (0.732, 0.882, 10{,}000)$ in Figure 1. From the plot, the confidence intervals constructed using the naive EB method (blue) are the narrowest. In contrast, the confidence intervals constructed using EB without calibration (green) are too wide, underscoring the importance of the calibration step, which produces the most appropriate confidence intervals (red). Based on this result and our limited experience, confidence intervals constructed via the EB method without calibration perform better than naive confidence intervals but may still fail to achieve the desired coverage level. Therefore, the calibration step is recommended as long as the computational cost of the parametric bootstrap remains manageable.

Table 1.

Simulation results on the empirical coverage level of the constructed confidence intervals.

(a0, b0)          Method              Clonality Coverage   Entropy Coverage
(0.732, 0.882)    EB w/ Calibration   94.0%                92.2%
(0.732, 0.882)    EB w/o Calibration  100%                 100%
(0.732, 0.882)    Naive EB            62.8%                45.6%
(0.414, 0.335)    EB w/ Calibration   91.2%                95.8%
(0.414, 0.335)    EB w/o Calibration  100%                 100%
(0.414, 0.335)    Naive EB            78.0%                69.4%
(0.596, 0.960)    EB w/ Calibration   98.4%                96.6%
(0.596, 0.960)    EB w/o Calibration  100%                 100%
(0.596, 0.960)    Naive EB            67.2%                40.6%
(0.551, 0.775)    EB w/ Calibration   97.0%                96.0%
(0.551, 0.775)    EB w/o Calibration  100%                 100%
(0.551, 0.775)    Naive EB            68.4%                44.8%
(0.171, 0.301)    EB w/ Calibration   98.4%                99.2%
(0.171, 0.301)    EB w/o Calibration  100%                 100%
(0.171, 0.301)    Naive EB            88.6%                70.6%
(0.126, 0.132)    EB w/ Calibration   95.8%                98.2%
(0.126, 0.132)    EB w/o Calibration  100%                 100%
(0.126, 0.132)    Naive EB            94.0%                83.8%
(0.0860, 0.111)   EB w/ Calibration   94.4%                99.6%
(0.0860, 0.111)   EB w/o Calibration  100%                 100%
(0.0860, 0.111)   Naive EB            94.2%                84.8%
(0.113, 0.142)    EB w/ Calibration   95.6%                98.6%
(0.113, 0.142)    EB w/o Calibration  100%                 100%
(0.113, 0.142)    Naive EB            91.2%                83.2%

Note: EB w/ Calibration refers to the proposed method. EB w/o Calibration is based on the posterior distribution in Equation (1) without the calibration step. Naive EB denotes the confidence interval constructed using the naive EB procedure described in Algorithm 1.

Figure 1. 95% confidence intervals constructed by the naive EB, EB with calibration, and EB without calibration methods for clonality. From the 500 experimental trials, we randomly selected six trials to generate the plot.

In addition to the results in Table 1, we plot the confidence intervals for the clonality and the corresponding true clonalities from 500 datasets simulated with $(a_0, b_0, C_0) = (0.086, 0.111, 10{,}000)$ in Figure 2. The confidence intervals are sorted according to the size of the true clonalities. Notably, the true clonalities have quite substantial variation relative to the width of the 95% confidence intervals, highlighting the importance of treating $G(Y)$ as a random quantity.

Figure 2. 95% confidence intervals for the clonality based on 500 simulated datasets; the intervals are sorted by the size of their true clonalities (from smallest to largest).

Next, we examine the empirical performance of the proposed confidence intervals when the underlying rates $\lambda_i, i=1,\ldots,C_0$, do not follow a gamma distribution, i.e. when the assumed Poisson-Gamma model is misspecified in characterizing the data generation process. Specifically, we simulate $\lambda_i$ from a log-normal distribution $\exp\{N(\mu_0, \sigma_0^2)\}$ with chosen $\mu_0$ and $\sigma_0$ and generate the observed reads $Z_O$. Using the same steps described above, we construct confidence intervals based on 500 simulated datasets and calculate their empirical coverage level. The results are summarized in Table 2. The empirical coverage level for entropy is quite close to 95% even though the Poisson-Gamma parametric model is misspecified, suggesting robustness of the proposed confidence intervals. On the other hand, the constructed confidence interval for clonality severely undercovers the true parameter, underscoring the importance of assuming an appropriate distribution for the Poisson rates.

Table 2.

Simulation results when the underlying model for the rate is log-normal rather than gamma.

$(\mu_0, \sigma_0^2)$    Coverage for Clonality   Coverage for Entropy
(−1.38, 1.64²)           80.4%                    98.2%
(−1.27, 1.72²)           86.8%                    97.8%
(−1.22, 1.50²)           80.2%                    98.4%
(−1.02, 1.62²)           82.7%                    95.8%

4. Real data studies

In this section, we illustrate our approach by applying it to a recent study conducted by [20]. The objective is to investigate human T cell receptor (TCR) diversity. Specifically, we are interested in measuring TCR diversity by clonality and entropy. In the study conducted by [20], five replicate TCR libraries of CD4 naive T cells and CD4 memory T cells were sequenced for each of the seven participants. The total number of reads varied from $8.9 \times 10^4$ to $7.4 \times 10^5$. First, we count the total number of reads in each clone across the five replicates; Figure 3 shows the observed cumulative proportions of clones sorted from the largest to the smallest for both CD4 naive and memory T cells. From the figure, it is clear that large clones contain a higher proportion of CD4 memory T cells than of naive CD4 T cells, reflecting the relative evenness of the distribution of clone sizes of naive T cells.

Figure 3. Cumulative proportion of cells from clones, sorted by clone size from largest (left) to smallest (right).

Next, we apply the proposed method to construct confidence intervals of entropy and clonality for the naive cells and the memory cells based on data from five replicates per patient; these are plotted in Figures 4 and 5. The confidence intervals based on different replicates of the same participant are fairly consistent in general, suggesting a low within-person variation relative to the between-person variation and supporting the validity of the experimental results. In addition, the clonality of CD4 memory T cells is substantially greater than that of CD4 naive T cells, again confirming the observations on the evenness of the distribution of the naive T cell clones in Figure 3.

Figure 4. 95% confidence intervals for clonality based on five replicates per patient for naive cells and memory cells (x-axis displayed on a logarithmic scale for better visualization). A lower clonality value (further to the left on the x-axis) indicates greater diversity in the cell population.

Figure 5. 95% confidence intervals for entropy based on five replicates per patient for naive cells and memory cells. A higher entropy value (further to the right on the x-axis) indicates greater diversity in the cell population.

Lastly, we estimate the 'average' (log-transformed) clonality and entropy of naive CD4 T cells for the three participants younger than 40 and the four participants older than 70, separately, based on a random effects model well known from meta-analysis. We then compare the clonality and entropy between younger and older participants. The average log-transformed clonality for naive T cells is −9.45 [−9.66 to −9.24] for young participants and −7.79 [−9.09 to −6.50] for old participants, with a two-sided p-value of 0.014 for testing the null hypothesis that old participants have the same average clonality as young participants. The average entropy is 10.97 [10.76 to 11.17] for young participants and 10.37 [9.88 to 10.87] for old participants, with a two-sided p-value of 0.029 for testing the null hypothesis that old participants have the same average entropy as young participants. These results suggest that the immune diversity in old participants is lower than that in young participants, as anticipated, reflecting the effect of aging on the human immune system.
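
The paper does not spell out which random effects estimator is used; a standard choice from meta-analysis is the DerSimonian-Laird estimator, sketched below for pooling the per-person estimates and their standard errors.

```python
import numpy as np

def random_effects_mean(y, v):
    """DerSimonian-Laird random-effects pooling: y are per-person estimates
    (e.g. log clonality), v their squared standard errors. Returns the
    pooled mean and its standard error."""
    y, v = np.asarray(y, float), np.asarray(v, float)
    w = 1.0 / v                                    # fixed-effect weights
    ybar = np.sum(w * y) / np.sum(w)
    Q = np.sum(w * (y - ybar) ** 2)                # heterogeneity statistic
    tau2 = max(0.0, (Q - (len(y) - 1)) /
               (np.sum(w) - np.sum(w ** 2) / np.sum(w)))
    w_re = 1.0 / (v + tau2)                        # random-effects weights
    mu = np.sum(w_re * y) / np.sum(w_re)
    return mu, np.sqrt(1.0 / np.sum(w_re))
```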

5. Discussion

In this paper, we discuss a method for constructing confidence intervals for entropy and clonality, both functions of a high dimensional probability vector. The primary challenges stem from the curse of dimensionality, where traditional estimators often struggle to maintain accuracy due to sparse observations in high-dimensional spaces. In particular, the probability vector cannot be estimated well, and the corresponding plug-in estimators of entropy and clonality are very poor in the high dimensional setting. EB is a natural approach because the components of the high dimensional probability vector can be viewed as random samples from a prior distribution, and such a distribution is estimable under appropriate parametric assumptions. When the dimension of the probability vector is high, i.e. the number of random samples is large, one may expect that the value of entropy or clonality is driven by the simple prior distribution generating the probability vector, and that inference on entropy or clonality can be made accordingly. Therefore, our method is developed within a general EB framework, coupling an adjustment for the uncertainty in estimating the prior distribution with a parametric bootstrap-based calibration step. Both components are important, since the former determines the location of the confidence interval and the latter ensures the correct coverage level. Based on our numerical study, the proposed confidence interval can achieve reasonable performance when the parametric model for the prior distribution is correctly specified.

While the proposed method demonstrates some robustness against model misspecification (an average coverage of 82.5% for clonality and 97.6% for entropy across the four experimental settings for model misspecification), its validity is not entirely immune to such issues. In particular, the proposed calibration is not a substitute for imposing a good parametric model for the prior distribution in the first place. As illustrated by the results in Table 2, the performance of the confidence intervals can be poor if the prior distribution is severely misspecified. Replacing the gamma distribution by a more flexible model could be a promising direction for future research. In particular, it is appealing to consider distributions from a nonparametric exponential family, $p(\lambda \mid \eta) \propto p_0(\lambda)\exp\{B(\lambda)^\top\eta\}$, where $p(\lambda \mid \eta)$ is the density function of the intensity rate and $B(\lambda)$ is a set of flexible basis functions given a priori, such as $B(\lambda) = (\lambda, \lambda^2, \lambda^3)$ [21]. Lastly, in the proposed approach, the actual number of distinct clones is replaced by its estimator, which may affect the performance of the subsequent point and interval estimation. It is conceivable that the impact is greater for some functions, such as entropy, which is more sensitive to small clones, than for other functions, such as clonality, which is robust to small clones. However, estimating the number of distinct clones is analogous to estimating the number of unseen species, which is a difficult problem and in the current case depends on the parametric assumption for the intensity rate [11]. Therefore, it is important to study the impact of this estimator on the construction of the confidence interval for different diversity parameters.

Acknowledgments

This work was completed when Zhongren Chen was at Stanford University. Dr. Lu Tian and Dr. Richard Olshen are supported by the National Institutes of Health. Dr. Richard A. Olshen passed away on November 8, 2023. Author's Original Manuscript: https://arxiv.org/abs/2211.14755. While preparing this manuscript, we utilized ChatGPT 4.0 and Claude 3.5 Sonnet for sentence-level edits, such as fixing grammar and rewording sentences.

Footnotes

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • [1] Aldrich RJ, GCHQ: The Uncensored Story of Britain's Most Secret Intelligence Agency, HarperPress, London, 2010.
  • [2] Box GE and Tiao GC, Bayesian Inference in Statistical Analysis, John Wiley & Sons, Hoboken, NJ, 2011.
  • [3] Carlin BP and Gelfand AE, Approaches for empirical Bayes confidence intervals, J. Am. Stat. Assoc. 85 (1990), pp. 105–114.
  • [4] Carlin BP and Louis TA, Empirical Bayes: past, present and future, J. Am. Stat. Assoc. 95 (2000), pp. 1286–1289.
  • [5] Casella G, An introduction to empirical Bayes data analysis, Am. Stat. 39 (1985), pp. 83–87.
  • [6] Chao A, Estimating the population size for capture-recapture data with unequal catchability, Biometrics 43 (1987), pp. 783–791.
  • [7] Chao A, Estimating population size for sparse data in capture-recapture experiments, Biometrics 45 (1989), pp. 427–438.
  • [8] Chao A and Shen TJ, Nonparametric estimation of Shannon's index of diversity when there are unseen species in sample, Environ. Ecol. Stat. 10 (2003), pp. 429–443.
  • [9] Dempster AP, Laird NM, and Rubin DB, Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc. Ser. B Methodol. 39 (1977), pp. 1–22.
  • [10] Efron B and Hinkley DV, Assessing the accuracy of the maximum likelihood estimator: observed versus expected Fisher information, Biometrika 65 (1978), pp. 457–483.
  • [11] Efron B and Thisted R, Estimating the number of unseen species: how many words did Shakespeare know?, Biometrika 63 (1976), pp. 435–447.
  • [12] Efron B and Tibshirani RJ, An Introduction to the Bootstrap, CRC Press, New York, 1994.
  • [13] Fisher RA, Theory of statistical estimation, Mathematical Proceedings of the Cambridge Philosophical Society, Vol. 22, Cambridge University Press, Cambridge, 1925, pp. 700–725.
  • [14] Glanville J, Huang H, Nau A, Hatton O, Wagar LE, Rubelt F, Ji X, Han A, Krams SM, Pettus C, Haas N, Arlehamn CSL, Sette A, Boyd SD, Scriba TJ, Martinez OM, and Davis MM, Identifying specificity groups in the T cell receptor repertoire, Nature 547 (2017), pp. 94–98.
  • [15] Kaplinsky J and Arnaout R, Robust estimates of overall immune-repertoire diversity from high-throughput measurements on samples, Nat. Commun. 7 (2016), p. 11881.
  • [16] Laird NM and Louis TA, Empirical Bayes confidence intervals based on bootstrap samples, J. Am. Stat. Assoc. 82 (1987), pp. 739–750.
  • [17] McLachlan G and Jones P, Fitting mixture models to grouped and truncated data via the EM algorithm, Biometrics 44 (1988), pp. 571–578.
  • [18] Morris CN, Parametric empirical Bayes confidence intervals, in Scientific Inference, Data Analysis, and Robustness, Elsevier, New York, 1983, pp. 25–50.
  • [19] Norris JL and Pollock KH, Non-parametric MLE for Poisson species abundance models allowing for heterogeneity between species, Environ. Ecol. Stat. 5 (1998), pp. 391–402.
  • [20] Qi Q, Liu Y, Cheng Y, Glanville J, Zhang D, Lee JY, Olshen RA, Weyand CM, Boyd SD, and Goronzy JJ, Diversity and clonal selection in the human T-cell repertoire, Proc. Natl. Acad. Sci. 111 (2014), pp. 13139–13144.
  • [21] Schwartzman A, Empirical null and false discovery rate inference for exponential families, Ann. Appl. Stat. 2 (2008), pp. 1332–1359.
  • [22] Tian L, Liu Y, Fire AZ, Boyd SD, and Olshen RA, Clonality: point estimation, Ann. Appl. Stat. 13 (2019), pp. 113–131.
  • [23] Van der Vaart AW, Asymptotic Statistics, Vol. 3, Cambridge University Press, Cambridge, 2000.
  • [24] Zurek WH, Complexity, Entropy and the Physics of Information, CRC Press, Boca Raton, 2018.
