J Appl Stat. Published online 30 April 2025 (before final editing). doi: 10.1080/02664763.2025.2496724

An empirical Bayes approach for constructing confidence intervals for clonality and entropy

Zhongren Chen, Lu Tian, and Richard A. Olshen

Abstract

This paper is motivated by the need to quantify human immune responses to environmental challenges. Specifically, the genome of the selected cell population from a blood sample is amplified by the PCR process, producing a large number of reads. Each read corresponds to a particular rearrangement of so-called V(D)J sequences. The observed data consist of a set of integers, representing numbers of reads corresponding to different V(D)J sequences. The underlying relative frequencies of distinct V(D)J sequences can be summarized by a probability vector, with the cardinality being the number of distinct V(D)J rearrangements. The statistical question is to make inferences on a summary parameter of this probability vector based on a multinomial-type observation of a large dimension. Popular summaries of the diversity include clonality and entropy. A point estimator of the clonality based on multiple replicates from the same blood sample has been proposed previously. Therefore, the remaining challenge is to construct confidence intervals of the parameters to reflect their uncertainty. In this paper, we propose to couple the Empirical Bayes method with a resampling-based calibration procedure to construct a robust confidence interval for different population diversity parameters. The method is illustrated via extensive numerical studies and real data examples.

Keywords: Empirical Bayes, clonality, entropy, confidence interval, resampling method

1. Introduction

This paper is motivated by the need to construct confidence intervals (CIs) for parameters summarizing the diversity of a cell population consisting of cells of different types. We first introduce the sources of data and give some details here. More biological and statistical background in the case of point estimation can be found in a previous paper [22]. The problem has its biomedical origin in attempting to quantify human immune responses to an environmental challenge; more specifically, to quantify the adaptive immunologic response to any antigen, e.g. vaccination against the COVID-19 virus. Briefly, blood is sampled from a patient. This blood sample may be divided, as equally as possible, into several parts, i.e. replicates. The genome of the selected cell subpopulation in each replicate is amplified by the well-known PCR process of successive heating and cooling. One resulting product from each replicate is then randomly chosen for sequencing, producing a large number of reads, roughly 30,000 to 300,000 per replicate. Each read corresponds to a particular rearrangement of the so-called V(D)J sequence.

In the end, the observation from a particular replicate consists of a set of numbers of reads for different V(D)J sequences. Mathematically, each observation can then be thought of as a finite-dimensional random vector $Z=(Z_1,Z_2,\ldots,Z_C)$ reflecting the underlying relative frequencies of different cell subpopulations with particular V(D)J rearrangements in the entire circulation system from which the blood was sampled originally. The underlying relative frequencies can be summarized by a probability vector $p=(p_1,p_2,\ldots,p_{C_0})$, where $C_0$, the cardinality of the vector $p$, represents the total number of different V(D)J rearrangements in the circulation. While $C_0$ is finite, its value is unknown, since some of the V(D)J rearrangements may not be observed in the replicate due to their rarity. In other words, $C$, the number of distinct observed V(D)J rearrangements, could be substantially smaller than the number of distinct V(D)J rearrangements in the blood. Therefore, the observed frequency vector $Z$ consists of only the positive components of the full frequency vector $\tilde Z=(\tilde Z_1,\ldots,\tilde Z_{C_0})$ corresponding to the underlying probability vector $p$. While the exact value of $C_0$ is unknown, we will show later that it can be estimated by an EM algorithm and that it has only a limited effect on the parameter of interest. It has been noted that the PCR process may favor some rearrangements over others; that distinction is ignored here.

The statistical task is to make inferences on a summary parameter based on this single multinomial-type observation, $Z$. One popular summary of the diversity of a cell population is its clonality, $G_C(p)=\|p\|_2^2$, the squared $\ell_2$ norm of the vector $p\in\mathbb{R}^{C_0}$, which varies between $C_0^{-1}$ and 1. A lower clonality value indicates a more uniform probability distribution of different V(D)J rearrangements in the blood, corresponding to higher diversity in the cell population. We report a study of point estimation for clonality based on multiple replicates for each blood sample in a previous paper [22]. After obtaining a point estimator of a particular function of $p$, such as the clonality characterizing the composition of the cell population, the remaining challenge is to construct a confidence interval for the relevant parameter to appropriately reflect the uncertainty in this point estimator. Consequently, in what follows, attention is devoted to interval estimation. Specifically, we propose to study point and interval estimation for clonality without requiring multiple replicates. Another extremely popular measure of population diversity is entropy, which is defined as

$$G_E(p) = -\sum_{j=1}^{C_0} p_j \log p_j$$

(see, for example, [14]). A higher entropy value also reflects a more uniform probability distribution of V(D)J rearrangements, indicating increased diversity. Many authors prefer to summarize the variability of $p$ by its entropy. Of course, the notion of entropy has a storied history in communications, indeed in many aspects of engineering [24]. If $C_0$ is small, e.g. 10, and all $p_i$'s are not too close to zero, then it is possible to estimate the entire probability vector $p$ accurately with a large number of reads. In this setting, a plug-in estimator of the clonality or entropy would serve as a reasonable approximation to the true parameter value. Then, in this case, one can readily apply the delta method or the parametric bootstrap to construct a 95% confidence interval for clonality or entropy. The difficulty arises when $C_0$ is very large. In such a case, the estimation of the entropy itself is very difficult [8]. In particular, we are unaware of careful attempts to form confidence intervals for entropy. Existing methods based on the nonparametric maximum likelihood estimator or the Horvitz-Thompson estimator [8,19] are not directly applicable in very high-dimensional settings. The method proposed in this paper covers point and interval estimation for parameters such as entropy or other functions of the probability vector $p$. Lastly, different summaries for clone distributions have been proposed. For example, [15] discussed multiple diversity indices, including species richness, entropy, the Simpson index, and the Berger-Parker index. In ecology, methods for estimating the number of distinct species have been well studied, as seen in the influential work of [6,7]. In a related work, Efron aimed to infer the number of words known by Shakespeare but not used in his work [11]. Estimating the proportions of unseen clones or species has been explored in [1]. In this paper, we primarily focus on making rigorous statistical inferences on clonality and entropy.
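
To fix ideas, here is a minimal Python sketch of the two diversity summaries applied to a known probability vector; the function names are ours, introduced purely for illustration.

```python
import numpy as np

def clonality(p):
    """Clonality G_C(p) = ||p||_2^2; ranges from 1/C0 (uniform) to 1 (one clone)."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p ** 2))

def entropy(p):
    """Entropy G_E(p) = -sum_j p_j log p_j; maximized by the uniform distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # convention: 0 * log(0) = 0
    return float(-np.sum(p * np.log(p)))

# Uniform vector over C0 = 10 clones: clonality = 1/10, entropy = log(10).
p = np.full(10, 0.1)
print(clonality(p), entropy(p))       # 0.1  2.302585...
```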

We begin with an explanation of matters that bear upon parametric approaches in Section 2. The key is that we assume the $C_0$ components of the probability vector $p$ are realizations from a parametric prior distribution [2], which can be estimated based on observed data, although there is often insufficient information to estimate the values of all individual components. With the estimated prior distribution generating the individual probability components, point and interval estimates for the function of interest can be obtained based on the posterior distribution derived from Bayes' theorem [2]. This is essentially an Empirical Bayes (EB) approach, in which we estimate the parameters of the parametric prior distribution based on observed data, which in turn generates the probability vector of interest. However, such a naive interval may fail to cover the true value of the function at the desired coverage level, for the simple reason that the uncertainty in estimating the prior distribution is not considered in this EB approach [4,5]. Therefore, we propose an additional calibration step to correct the under-coverage of this naive EB confidence interval.

A good part of understanding the performance of the constructed confidence interval comes from extensive computations on simulated, clinical, or other suitable data. Such computations are given in the Numerical Study Section (Section 3) and the Real Data Example Section (Section 4), which follow the Method Section (Section 2). Suggestions for further research are topics of our Discussion Section (Section 5).

2. Method

2.1. The general framework

The complete data consist of $n$ pairs of observations $(Z_i, Y_i), i=1,\ldots,n$. Let

$$(Z_i, Y_i) \overset{\text{i.i.d.}}{\sim} p(z, y \mid \theta_0), \quad i=1,\ldots,n,$$

where $p(z, y \mid \theta_0)$ is the density function for the joint distribution of $(Z_i, Y_i), i=1,\ldots,n$. Suppose that we only observe $Z=(Z_1,\ldots,Z_n)$, while $Y=(Y_1,\ldots,Y_n)$ is missing but correlated with $Z$, and $\theta_0$ is a fixed but unknown parameter. Our aim is to construct a confidence interval covering $G(Y)=G(Y_1,\ldots,Y_n)$ with a specified probability based on the observed data $Z$ only, where $G(\cdot)$ is a given function. Note that, unlike the conventional setting, where the parameter of interest is a deterministic population parameter, $G(Y)$ is a random variable that varies from dataset to dataset. We first present a simple confidence interval for $G(Y)$ in Section 2.2. We then discuss its limitations and propose a calibration method based on the parametric bootstrap in Section 2.3.

2.2. A naive approach

Assuming that a point estimator $\hat\theta(Z)$ of $\theta_0$ based on $Z$ is available, we can derive the conditional distribution of $Y_i \mid Z_i, \theta_0=\hat\theta(Z), i=1,\ldots,n$, and simulate multiple copies of $G^*=G(Y)$ directly from the conditional distribution of $G(Y) \mid Z, \theta_0=\hat\theta(Z)$. A confidence interval for $G(Y)$ can then be constructed based on the sampled $G^*$'s. See Algorithm 1 for this naive approach.

Algorithm 1.

A Naive Algorithm to Construct a 95% Confidence Interval for $G(Y_1, Y_2, \ldots, Y_n)$

1: Find a point estimator $\hat\theta(Z)$ of $\theta_0$.
2: Simulate $Y_i^*$ from the conditional distribution $Y_i \mid Z_i, \theta_0=\hat\theta(Z)$, for $i=1,\ldots,n$.
3: Compute $G^*=G(Y_1^*,\ldots,Y_n^*)$.
4: Repeat steps 2 and 3 a large number of times, obtain $G_1^*,\ldots,G_B^*$, and return the 2.5th and 97.5th percentiles of the empirical distribution of $G_1^*,\ldots,G_B^*$ as a 95% confidence interval for $G(Y)$. Here, $B$ is a positive integer large enough that the 2.5th and 97.5th percentiles of the empirical distribution of $G_1^*,\ldots,G_B^*$ accurately approximate the corresponding quantiles of $G^*$.
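
A schematic Python sketch of Algorithm 1 follows. The helpers `estimate_theta`, `sample_y_given_z`, and `G` are hypothetical placeholders for the model-specific components; only the percentile logic is prescribed by the algorithm.

```python
import numpy as np

def naive_eb_ci(Z, estimate_theta, sample_y_given_z, G, B=500, level=0.95):
    """Sketch of Algorithm 1: a percentile interval for G(Y) drawn from the
    plug-in posterior. estimate_theta(Z) -> theta_hat and
    sample_y_given_z(Z, theta) -> Y* are hypothetical, model-specific callables.
    """
    theta_hat = estimate_theta(Z)                        # step 1
    draws = np.array([G(sample_y_given_z(Z, theta_hat))  # steps 2-3
                      for _ in range(B)])                # step 4: repeat B times
    alpha = 1.0 - level
    return np.percentile(draws, [100 * alpha / 2, 100 * (1 - alpha / 2)])
```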

Algorithm 1 was first proposed by [18] in a different setting and is well known. However, this simple confidence interval does not account for the uncertainty in estimating $\theta_0$. As a result, it often fails to cover the parameter of interest at the nominal level [16]. To make the constructed intervals robust to the variability of $\hat\theta(Z)$, we propose to improve this naive confidence interval by coupling a method that explicitly incorporates the variance of $\hat\theta(Z)$ in deriving the posterior distribution with a calibration step using the parametric bootstrap [12].

2.3. Calibration

To improve the naive confidence interval, we first obtain an estimator $\hat\theta(Z)$ of $\theta_0$ and a variance estimator $\hat J(Z)$ of $\hat\theta(Z)$. For example, when the point estimator $\hat\theta(Z)$ is the MLE, its variance can be estimated by the inverse of the observed information matrix [13], which is consistent and widely used in practice [10]. Expecting that $\hat\theta(Z)$ behaves as an M-estimator [23], and thus approximately follows a multivariate Gaussian distribution, we construct the confidence interval based on an adjusted posterior distribution:

$$G(Y) \mid Z, \theta_0=\theta^*; \qquad \theta^* \sim N(\hat\theta(Z), \hat J(Z)), \qquad (1)$$

which, in general, has a larger variability than the naive posterior distribution

$$G(Y) \mid Z, \theta_0=\hat\theta(Z),$$

and results in a wider confidence interval.

In addition, we also consider a calibration step based on the parametric bootstrap [12], similar to the correction introduced in [3]. To be specific, we simulate the 'observed' data from the assumed model with $\theta_0=\hat\theta(Z)$, simulate realizations of $G(Y)$ from the posterior distribution given above, and construct $100(1-\alpha)\%$ confidence intervals for different $\alpha$ based on the quantiles of samples drawn from the posterior distribution. After repeating this simulation a large number of times, we examine the empirical coverage level of the constructed confidence intervals with respect to the true $G(Y)$ in the simulated data, anticipating that the empirical coverage level of the $100(1-\alpha)\%$ confidence intervals may differ from $(1-\alpha)$. We find a value $\alpha_{0.95}$ such that the corresponding $100(1-\alpha_{0.95})\%$ confidence interval has an empirical coverage level of 95%. (Here, 95% may be replaced by any other nominal coverage level chosen by the researchers.) This $\alpha_{0.95}$ level will then be used to construct the 95% confidence interval based on the posterior distribution (1) from the original data. The mathematical rationale is that

$$P\{G(Y) < q_\alpha(Z) \mid \theta_0\} \approx P\{G(Y) < q_\alpha(Z) \mid \hat\theta(Z)\},$$

where $q_\alpha(Z)$ is the $\alpha$th quantile of the posterior distribution (1), and the parametric bootstrap is used to estimate $P\{G(Y) < q_\alpha(Z) \mid \hat\theta(Z)\}$. This step can be viewed as a calibration step based on a bootstrap method [12]. In the end, denoting the resulting confidence interval by $[\hat L(Z), \hat U(Z)]$, we expect that

$$P\{\hat L(Z) \le G(Y) \le \hat U(Z)\} = 0.95,$$

where the probability is with respect to the joint distribution of $(Y, Z)$. The detailed steps are outlined in Algorithm 2.

2.4. Interval estimates of clonality and entropy

We now apply the general algorithm introduced in Section 2.3 to construct confidence intervals for clonality and entropy. Consider a parametric model for the number of cells from a clone, i.e. cells with a particular V(D)J rearrangement:

$$\lambda_i \sim \text{Gamma}(a_0, b_0), \quad i=1,\ldots,C_0,$$
$$Z_i \sim \text{Poisson}(\lambda_i), \quad i=1,\ldots,C_0,$$

where $C_0$ is the total number of clones. $\{\lambda_i\}_{i=1}^{C_0}$ and $(a_0, b_0)$ are unknown to us, but we observe $Z_i$ if $Z_i>0$, which is the number of reads corresponding to the $i$th V(D)J rearrangement. Therefore, the observed data consist of the truncated independent Poisson variables with parameters $\lambda_1,\ldots,\lambda_{C_0}$:

$$\{Z_i \mid Z_i>0,\ 1\le i\le C_0\}.$$

Note that the set of clones without any read is unobserved. To maintain consistency with the notation used in Section 2.1, we will later show that $C_0$ can be estimated by $\hat C$, and the vector $(Z_1,\ldots,Z_{\hat C})$ can then be treated as the 'observed data' $Z$ defined in Section 2.1. Now, we define the clonality as $G_C$:

$$G_C := \sum_{i=1}^{C_0}\left(\frac{\lambda_i}{\sum_{j=1}^{C_0}\lambda_j}\right)^2,$$

and entropy as GE:

$$G_E := -\sum_{i=1}^{C_0}\frac{\lambda_i}{\sum_{j=1}^{C_0}\lambda_j}\log\left(\frac{\lambda_i}{\sum_{j=1}^{C_0}\lambda_j}\right).$$

Both are functions of $\lambda_1,\ldots,\lambda_{C_0}$. Our aim is to obtain point estimates as well as confidence intervals for $G_C$ and $G_E$.
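
The following sketch simulates data from this Poisson-Gamma model and evaluates $G_C$ and $G_E$ at the underlying rates; the parameter values are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
a0, b0, C0 = 0.5, 0.8, 10_000         # illustrative values, not from the paper

# lambda_i ~ Gamma(a0, rate b0): numpy parameterizes by scale = 1/rate.
lam = rng.gamma(shape=a0, scale=1.0 / b0, size=C0)
Z = rng.poisson(lam)                  # Z_i ~ Poisson(lambda_i)
Z_obs = Z[Z > 0]                      # only clones with positive reads are seen

p = lam / lam.sum()
G_C = np.sum(p ** 2)                  # clonality of the underlying rates
G_E = -np.sum(p * np.log(p))          # entropy of the underlying rates
print(len(Z_obs), G_C, G_E)
```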

Algorithm 2.

A General Algorithm to Construct a 95% Confidence Interval for $G(Y)$

1: Obtain a point estimator $\hat\theta(Z)$ of $\theta_0$.
2: Compute a consistent variance estimator $\hat J(Z)$ of $\hat\theta(Z)$.
3: for $b=1,2,\ldots,B$ do
4:  Simulate $\tilde\theta_1^{(b)},\ldots,\tilde\theta_n^{(b)} \overset{\text{i.i.d.}}{\sim} N(\hat\theta(Z), \hat J(Z))$.
5:  Simulate $Y_1^{(b)*} \sim p_Y(y \mid Z_1, \tilde\theta_1^{(b)}), \ldots, Y_n^{(b)*} \sim p_Y(y \mid Z_n, \tilde\theta_n^{(b)})$, where $p_Y(y \mid z, \theta)$ is the density function of $Y$ conditional on $Z=z$ and $\theta_0=\theta$.
6:  Compute $G_b^*(Z) = G(Y_1^{(b)*}, \ldots, Y_n^{(b)*})$.
7: end for
8: for $i=1,2,\ldots,R$ do
9:  Simulate a new dataset $\{(Y_{ij}^*, Z_{ij}^*)\}_{j=1}^n \overset{\text{i.i.d.}}{\sim} p(y, z \mid \hat\theta(Z))$. Let $Z_i^* = \{Z_{ij}^*\}_{j=1}^n$ and $Y_i^* = \{Y_{ij}^*\}_{j=1}^n$.
10:  Calculate the random variable to be estimated: $G_i^* = G(Y_{i1}^*, \ldots, Y_{in}^*) = G(Y_i^*)$.
11:  Obtain the point estimator $\hat\theta(Z_i^*)$ of $\theta_0$ based on the generated data $Z_i^*$.
12:  Obtain a variance estimator $\hat J(Z_i^*)$ of $\hat\theta(Z_i^*)$.
13:  Use Steps 3–6 to obtain $G_1(Z_i^*), \ldots, G_B(Z_i^*)$.
14:  Construct the $100(1-\alpha)\%$ confidence interval as the interval between the $100\alpha/2$th and $100(1-\alpha/2)$th percentiles of $G_1(Z_i^*), \ldots, G_B(Z_i^*)$, denoted by $\widehat{CI}_i(1-\alpha)$.
15: end for
16: Calculate the empirical coverage level of $\widehat{CI}_i(1-\alpha)$ as
$$\frac{1}{R}\sum_{i=1}^R I\{G_i^* \in \widehat{CI}_i(1-\alpha)\},$$
where $I(\cdot)$ is the indicator function.
17: Determine the value $\alpha_{0.95}$ such that the empirical coverage level of $\widehat{CI}_i(1-\alpha_{0.95})$ is 95%, i.e.
$$\frac{1}{R}\sum_{i=1}^R I\{G_i^* \in \widehat{CI}_i(1-\alpha_{0.95})\} = 0.95.$$
18: Return the 95% confidence interval for $G(Y)$ based on the observed data as the interval between the $100\alpha_{0.95}/2$th and $100(1-\alpha_{0.95}/2)$th quantiles of $G_1^*(Z), \ldots, G_B^*(Z)$.
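
Steps 16-17 of Algorithm 2, the search for $\alpha_{0.95}$, can be sketched as follows; the grid search over $\alpha$ is one convenient implementation choice among several.

```python
import numpy as np

def calibrate_alpha(G_true, G_draws, target=0.95, grid=None):
    """Sketch of steps 16-17 of Algorithm 2: choose alpha so that the
    100(1-alpha)% percentile intervals empirically cover the bootstrap
    targets at the nominal level.

    G_true : (R,) array of G(Y_i*) from the parametric-bootstrap datasets
    G_draws: (R, B) array of posterior draws G_1(Z_i*), ..., G_B(Z_i*)
    """
    if grid is None:
        grid = np.linspace(0.001, 0.5, 200)   # candidate alpha values
    coverage = np.empty(len(grid))
    for k, alpha in enumerate(grid):
        lo = np.percentile(G_draws, 100 * alpha / 2, axis=1)
        hi = np.percentile(G_draws, 100 * (1 - alpha / 2), axis=1)
        coverage[k] = np.mean((lo <= G_true) & (G_true <= hi))
    return grid[np.argmin(np.abs(coverage - target))]
```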

To fit into the general framework described in Section 2.1, $Z$ corresponds to $(Z_1,\ldots,Z_{\hat C})$ and $Y$ corresponds to $(\lambda_1,\ldots,\lambda_{\hat C})$; $\theta_0$ corresponds to $(a_0, b_0)$, the shape and rate parameters of the gamma distribution. Below, we outline the algorithm step by step.

2.4.1. Estimation of $(a_0, b_0)$ and $C_0$

Marginally, $Z_i$ follows a negative binomial distribution, i.e.

$$P(Z_i=z) = p_Z(z \mid a_0, b_0) = \int_0^\infty \frac{e^{-\lambda}\lambda^z}{z!}\cdot\frac{\lambda^{a_0-1}e^{-b_0\lambda}b_0^{a_0}}{\Gamma(a_0)}\,d\lambda = \binom{z+a_0-1}{z}\left(\frac{b_0}{b_0+1}\right)^{a_0}\left(\frac{1}{b_0+1}\right)^z,$$

where $p_Z(\cdot \mid a, b)$ represents the probability mass function of $Z_i$. We aim to first estimate the parameters $a_0$ and $b_0$ of the gamma distribution based on the observed data consisting of positive read counts

$$Z_O = \{Z_i \mid Z_i>0,\ i=1,\ldots,C_0\}.$$

The resulting log-likelihood function in terms of $(a_0, b_0)$ is

$$l(a,b) = \sum_{z_i \in Z_O} \log\left\{\frac{p_Z(z_i \mid a, b)}{1-p_Z(0 \mid a, b)}\right\}, \qquad (2)$$

which can be maximized via an EM algorithm [9,17].
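
Since the marginal of $Z_i$ is negative binomial with size $a$ and success probability $b/(b+1)$, the truncated log-likelihood (2) can also be coded directly; the sketch below maximizes it numerically rather than by EM, and can serve as a cross-check for the EM fit described next.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import nbinom

def trunc_nb_loglik(log_ab, z_obs):
    """Log-likelihood (2): the marginal of Z_i is negative binomial with
    size a and success probability b/(b+1), conditioned on Z_i > 0."""
    a, b = np.exp(log_ab)             # optimize on the log scale for positivity
    p = b / (b + 1.0)
    logp0 = nbinom.logpmf(0, a, p)
    return np.sum(nbinom.logpmf(z_obs, a, p) - np.log1p(-np.exp(logp0)))

def fit_trunc_nb(z_obs):
    """Direct numerical maximization of (2); a cross-check for the EM fit."""
    res = minimize(lambda t: -trunc_nb_loglik(t, z_obs),
                   x0=np.zeros(2), method="Nelder-Mead")
    return np.exp(res.x)              # (a_hat, b_hat)
```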

If $C_0$ were known, i.e. if we also observed the clones with $Z_i=0$, the full log-likelihood function would become

$$\sum_{z_i \in Z_O} \log p_Z(z_i \mid a_0, b_0) + \sum_{z_i=0} \log p_Z(0 \mid a_0, b_0). \qquad (3)$$

Therefore, the EM algorithm consists of the following E and M steps.

  • E-step: For given estimators $\hat a$ and $\hat b$, we have
    $$\hat n_0 = E(C_0 - C \mid \hat a, \hat b) = \frac{p_Z(0 \mid \hat a, \hat b)}{1-p_Z(0 \mid \hat a, \hat b)}\,C$$
    based on the relationship
    $$p_Z(0 \mid \hat a, \hat b)\{E(C_0 - C \mid \hat a, \hat b) + C\} = E(C_0 - C \mid \hat a, \hat b),$$
    where $C$ is the number of observed nonzero $Z_i$'s. Since
    $$p_Z(0 \mid a_0, b_0) = \left(\frac{b_0}{b_0+1}\right)^{a_0},$$
    the expected difference between $C_0$ and $C$ can be expressed as
    $$\hat n_0 = E(C_0 - C \mid \hat a, \hat b) = \frac{\hat b^{\hat a} C}{(\hat b+1)^{\hat a} - \hat b^{\hat a}}.$$
  • M-step: Update the MLE of $(a_0, b_0)$ by maximizing
    $$\sum_{z_i \in Z_O} \log p_Z(z_i \mid a_0, b_0) + \hat n_0 \log p_Z(0 \mid a_0, b_0).$$
    This optimization can be achieved via an inner EM algorithm treating the $\lambda_i$ as missing variables:
    1. E-step: For given $(\hat a, \hat b)$,
      $$E(\lambda_i \mid \hat a, \hat b) = \frac{Z_i + \hat a}{1+\hat b}, \quad E(\log\lambda_i \mid \hat a, \hat b) = \Psi(Z_i + \hat a) - \log(1+\hat b), \quad i=1,\ldots,C,$$
      $$E(\lambda_0 \mid \hat a, \hat b) = \frac{\hat a}{1+\hat b}, \quad E(\log\lambda_0 \mid \hat a, \hat b) = \Psi(\hat a) - \log(1+\hat b),$$
      where $\lambda_0$ corresponds to the Poisson rate of the unobserved clones with $Z_i=0$ and $\Psi(\cdot)$ is the digamma function, the derivative of the log-transformed gamma function, i.e. $\Psi(x) = \frac{d}{dx}\log\int_0^\infty t^{x-1}e^{-t}\,dt$.
    2. M-step: Maximize the log-likelihood function
      $$\begin{aligned} l(a_0, b_0) &= (a_0-1)\left[\sum_{i=1}^C E(\log\lambda_i \mid \hat a, \hat b) + \hat n_0 E(\log\lambda_0 \mid \hat a, \hat b)\right] - (C+\hat n_0)\{\log\Gamma(a_0) - a_0\log b_0\} - b_0\left[\sum_{i=1}^C E(\lambda_i \mid \hat a, \hat b) + \hat n_0 E(\lambda_0 \mid \hat a, \hat b)\right] \\ &= (a_0-1)\left[\sum_{i=1}^C \{\Psi(z_i+\hat a) - \log(1+\hat b)\} + \hat n_0\{\Psi(\hat a) - \log(1+\hat b)\}\right] - (C+\hat n_0)\{\log\Gamma(a_0) - a_0\log b_0\} - b_0\,\frac{\sum_{i=1}^C (z_i+\hat a) + \hat n_0\hat a}{1+\hat b}. \end{aligned}$$
    In practice, we iterate the inner EM algorithm to maximize
    $$\sum_{z_i \in Z_O} \log p_Z(z_i \mid a_0, b_0) + \hat n_0 \log p_Z(0 \mid a_0, b_0)$$
    for a given $\hat n_0$ and iterate the outer EM algorithm until $\hat n_0$ converges. The final convergence of $(\hat a, \hat b, \hat n_0)$ can be assessed by the relative change of the log-likelihood
    $$\sum_{z_i \in Z_O} \log\left\{\frac{p_Z(z_i \mid \hat a, \hat b)}{1-p_Z(0 \mid \hat a, \hat b)}\right\},$$
    whose value should increase with each iteration of the outer EM algorithm.
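
A condensed sketch of the outer EM loop is given below, assuming the closed-form E-step above; for brevity, the inner EM of the M-step is replaced here by a generic numerical maximization of the zero-augmented log-likelihood, which targets the same maximizer.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import nbinom

def em_trunc_nb(z_obs, a=1.0, b=1.0, tol=1e-8, max_iter=500):
    """Condensed sketch of the outer EM. The E-step uses the closed form
    n0_hat = b^a C / ((b+1)^a - b^a); the M-step maximizes the
    zero-augmented log-likelihood numerically, standing in for the inner
    EM described in the text (both target the same maximizer)."""
    C = len(z_obs)
    for _ in range(max_iter):
        # E-step: expected number of unobserved (zero-read) clones
        n0 = b**a * C / ((b + 1.0)**a - b**a)

        # M-step: maximize sum_i log p_Z(z_i|a,b) + n0 * log p_Z(0|a,b)
        def negloglik(t):
            aa, bb = np.exp(t)
            p = bb / (bb + 1.0)
            return -(np.sum(nbinom.logpmf(z_obs, aa, p))
                     + n0 * nbinom.logpmf(0, aa, p))
        a_new, b_new = np.exp(minimize(negloglik, np.log([a, b]),
                                       method="Nelder-Mead").x)
        converged = abs(a_new - a) + abs(b_new - b) < tol
        a, b = a_new, b_new
        if converged:
            break
    n0 = b**a * C / ((b + 1.0)**a - b**a)
    return a, b, C + int(round(n0))   # (a_hat, b_hat, C_hat)
```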

2.4.2. Estimation of the variance of $(\hat a, \hat b)$

After obtaining the MLE $(\hat a, \hat b)$ of $(a_0, b_0)$ using the EM algorithm, we can estimate the variance of $(\hat a, \hat b)$ by the inverse of the observed information matrix, which can be calculated from the second derivatives of the log-likelihood function with respect to $a$ and $b$, i.e.

$$\hat J(Z_O) = \left\{-\begin{pmatrix} \frac{\partial^2 l(a,b)}{\partial a^2} & \frac{\partial^2 l(a,b)}{\partial a \partial b} \\ \frac{\partial^2 l(a,b)}{\partial b \partial a} & \frac{\partial^2 l(a,b)}{\partial b^2} \end{pmatrix}\right\}^{-1}\Bigg|_{(a,b)=(\hat a, \hat b)}.$$
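
If closed-form second derivatives are inconvenient, the observed information can be approximated numerically; here is a finite-difference sketch, with the step size `eps` chosen ad hoc.

```python
import numpy as np

def observed_info_inverse(loglik, theta_hat, eps=1e-5):
    """Variance estimate J_hat = {-Hessian of l}^{-1} at the MLE, with the
    Hessian approximated by central finite differences; `loglik` maps a
    parameter vector (a, b) to l(a, b)."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    k = len(theta_hat)
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            def f(di, dj):
                t = theta_hat.copy()
                t[i] += di
                t[j] += dj
                return loglik(t)
            H[i, j] = (f(eps, eps) - f(eps, -eps)
                       - f(-eps, eps) + f(-eps, -eps)) / (4.0 * eps**2)
    return np.linalg.inv(-H)
```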

2.4.3. Sampling from the posterior distribution

In constructing the 95% confidence interval, we apply the log transformation to $(\hat a, \hat b)$ to ensure positivity, and generate

$$\begin{pmatrix} a^* \\ b^* \end{pmatrix} \sim \exp\left\{N\left(\begin{pmatrix} \log\hat a \\ \log\hat b \end{pmatrix},\ \hat A(Z_O)\hat J(Z_O)\hat A(Z_O)\right)\right\},$$

where we apply the delta method and $\hat A(Z_O) = \text{diag}(\hat a^{-1}, \hat b^{-1})$ is the Jacobian matrix. The posterior distribution of $\lambda_i^*$ is then

$$\lambda_i^* \sim \lambda_i \mid Z_i, a^*, b^*,$$

which is a gamma distribution with shape and rate parameters $a^* + Z_i$ and $b^* + 1$, respectively. This approach ensures that the sampled $a^*$ and $b^*$ are always positive. Operationally, we pretend that $C_0 = \hat C$ and let

$$Z = (Z_1, Z_2, \ldots, Z_C, Z_{C+1}=0, \ldots, Z_{\hat C}=0) = Z_O \cup \{0, \ldots, 0\}.$$

We then define

$$Y = (\lambda_1, \lambda_2, \ldots, \lambda_{\hat C}).$$

The confidence intervals $\widehat{CI}_C(Z_O)$ and $\widehat{CI}_E(Z_O)$ are calibrated such that, for clonality,

$$P\left\{\sum_{i=1}^{\hat C}\left(\frac{\lambda_i}{\sum_{j=1}^{\hat C}\lambda_j}\right)^2 \in \widehat{CI}_C(Z_O)\ \Bigg|\ \hat a, \hat b\right\} = 0.95,$$

and, for entropy,

$$P\left\{-\sum_{i=1}^{\hat C}\frac{\lambda_i}{\sum_{j=1}^{\hat C}\lambda_j}\log\left(\frac{\lambda_i}{\sum_{j=1}^{\hat C}\lambda_j}\right) \in \widehat{CI}_E(Z_O)\ \Bigg|\ \hat a, \hat b\right\} = 0.95.$$

The complete algorithm is provided in Algorithm 3. Note that the algorithm for constructing confidence intervals for entropy is the same as that for clonality, except that every $G_C(Y) = \sum_{i=1}^{C_0}\left(\lambda_i/\sum_{j=1}^{C_0}\lambda_j\right)^2$ is replaced by $G_E(Y) = -\sum_{i=1}^{C_0}\left(\lambda_i/\sum_{j=1}^{C_0}\lambda_j\right)\log\left(\lambda_i/\sum_{j=1}^{C_0}\lambda_j\right)$.
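
Steps 4-7 of Algorithm 3, the posterior sampling for the observed data, might look as follows in Python; the per-clone draws of $(a^*, b^*)$ mirror step 5, and swapping the clonality line for an entropy computation yields the $G_E$ version.

```python
import numpy as np

def posterior_clonality_draws(z, a_hat, b_hat, Sigma_log, B=500, rng=None):
    """Sketch of steps 4-7 of Algorithm 3. z is the read vector padded with
    zeros up to length C_hat; Sigma_log = A_hat J_hat A_hat is the
    delta-method covariance of (log a, log b)."""
    if rng is None:
        rng = np.random.default_rng()
    z = np.asarray(z, dtype=float)
    C_hat = len(z)
    mean_log = np.log([a_hat, b_hat])
    draws = np.empty(B)
    for b in range(B):
        # step 5: one (a*, b*) per clone, log-normal around the MLE
        ab = np.exp(rng.multivariate_normal(mean_log, Sigma_log, size=C_hat))
        a_star, b_star = ab[:, 0], ab[:, 1]
        # step 6: lambda_i* ~ Gamma(a* + z_i, rate b* + 1)
        lam = rng.gamma(shape=a_star + z, scale=1.0 / (b_star + 1.0))
        p = lam / lam.sum()
        draws[b] = np.sum(p ** 2)     # step 7: clonality of this draw
    return draws                      # percentiles of these give the interval
```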

Algorithm 3.

The Algorithm to Construct a 95% Confidence Interval for the Clonality

1: Use the proposed EM algorithm to obtain the MLEs of $a_0$, $b_0$, and $C_0$, denoted by $\hat a$, $\hat b$, and $\hat C$, respectively.
2: Compute $\hat J(Z_O)$ as the inverse of the observed information matrix for $(a_0, b_0)$, as well as $\hat A(Z_O)$.
3: Let $Z = (Z_1, \ldots, Z_C, Z_{C+1}=0, \ldots, Z_{\hat C}=0)$.
4: for $b=1,\ldots,B$ do
5:  Simulate $\left(a_i^{(b)*}, b_i^{(b)*}\right) \overset{\text{i.i.d.}}{\sim} \exp\left\{N\left(\left(\log\hat a, \log\hat b\right)^\top,\ \hat A(Z_O)\hat J(Z_O)\hat A(Z_O)\right)\right\}$ for $i=1,\ldots,\hat C$.
6:  Simulate $\lambda_i^{(b)*} \sim \text{Gamma}\left(a_i^{(b)*}+z_i,\ b_i^{(b)*}+1\right)$ for $i=1,\ldots,\hat C$.
7:  Compute $G_C^{(b)*}(Z_O) = \sum_{i=1}^{\hat C}\left\{\lambda_i^{(b)*}\big/\sum_{j=1}^{\hat C}\lambda_j^{(b)*}\right\}^2$.
8: end for
9: for $i=1,\ldots,R$ do
10:  Simulate $\lambda_{i1}^*, \ldots, \lambda_{i\hat C}^* \sim \text{Gamma}(\hat a, \hat b)$ and let $\Lambda_i^* := \left(\lambda_{ij}^*\right)_{j=1,\ldots,\hat C}$.
11:  Compute $G_C(\Lambda_i^*) = \sum_{j=1}^{\hat C}\left\{\lambda_{ij}^*\big/\sum_{k=1}^{\hat C}\lambda_{ik}^*\right\}^2$.
12:  Simulate a new set of data $Z_i^* = \left(Z_{ij}^*\right)_{j=1,\ldots,\hat C}$, where $Z_{ij}^* \sim \text{Poisson}(\lambda_{ij}^*)$.
13:  Form the observed data $Z_{iO}^* = \left\{Z_{ij}^* \mid Z_{ij}^* > 0,\ j=1,\ldots,\hat C\right\}$.
14:  Repeat steps 4–8 for $Z_{iO}^*$ and obtain $G_C^{(1)*}(Z_{iO}^*), \ldots, G_C^{(B)*}(Z_{iO}^*)$.
15:  Let $\widehat{CI}_\alpha(Z_{iO}^*)$ be the interval between the $100\alpha/2$th and $100(1-\alpha/2)$th percentiles of $G_C^{(1)*}(Z_{iO}^*), \ldots, G_C^{(B)*}(Z_{iO}^*)$.
16: end for
17: Determine the value $\hat\alpha_0$ such that the proportion of $G_C(\Lambda_i^*)$ falling in $\widehat{CI}_{\hat\alpha_0}(Z_{iO}^*)$ is closest to 95%, i.e.
$$\frac{1}{R}\sum_{i=1}^R I\left\{G_C(\Lambda_i^*) \in \widehat{CI}_{\hat\alpha_0}(Z_{iO}^*)\right\} \approx 0.95.$$
18: Return the $100\hat\alpha_0/2$th and $100(1-\hat\alpha_0/2)$th percentiles of $G_C^{(b)*}(Z_O)$, $b=1,\ldots,B$, as the 95% confidence interval for the clonality.

Remark 1: This proposal is closely related to [3,16]. Specifically, in order to construct valid confidence intervals for clonality and entropy, we combine the 'Type III bootstrap' in [16] with the calibration method in [3]. In other words, the calibration is applied to a confidence interval that already partially accounts for the randomness in the estimated hyperparameters of the prior distribution (instead of to the naive confidence interval from Algorithm 1). In addition, while [3,16] offer very general frameworks for EB inference, all of their examples are relatively simple. In the current application, the quantities of interest are complex functions of a very large number of parameters. We also need to deal with the challenge of missing data by deploying an appropriate EM algorithm, since only a truncated version of the reads of all V(D)J rearrangements is observable.

3. Simulation study

We conduct a comprehensive simulation study to examine the empirical performance of the constructed confidence intervals for both clonality and entropy. For each given set of $(a_0, b_0, C_0)$, we repeat the experiment 500 times to compute the empirical coverage level of the constructed confidence intervals. In constructing the confidence interval, we set the number of resampling replications for the calibration to $R=200$ and the number of posterior samples to $B=500$. $C_0$, the number of clones, is set at 10,000. Here, $C_0$ partially governs the number of observed clones in the observed dataset. More specifically, a higher $C_0$ is expected to reduce the variability of clonality or entropy and renders the effect of the calibration less sensitive to the accuracy of the estimated prior distribution. In addition, a higher $C_0$ also helps in estimating $(a_0, b_0)$, the parameters of the prior distribution. Thus, we expect the performance of the proposed confidence interval to improve for larger $C_0$ in general. The values of $a_0$ and $b_0$ are fixed at their maximum likelihood estimates based on the data examples in Section 4 to mimic real practice. For comparison purposes, we also construct the confidence intervals based on the naive Bayesian procedure given in Algorithm 1 and the confidence intervals based on the posterior distribution (1) without the calibration step.

The simulation results are summarized in Table 1. The empirical coverage level of our proposed confidence intervals is close to the nominal level. On the other hand, the confidence intervals based on the naive EB approach undercover the true parameter, which is expected given that the variability of $(\hat a, \hat b)$ is ignored when constructing the confidence interval. The coverage level of the confidence intervals without calibration is higher than the nominal level, suggesting that these intervals are too conservative. This may be because the method ignores the fact that $(\hat a, \hat b)$ also depends on $Z_O$ and is thus correlated with the quantiles of the posterior distribution. This observation confirms the essential role played by the calibration step, which is also the most computationally intensive part of the proposed algorithm. To illustrate this visually, we plot the confidence intervals for clonality constructed by the three methods in six trials of the simulation study with $(a_0, b_0, C_0) = (0.732, 0.882, 10{,}000)$ in Figure 1. From the plot, the confidence intervals constructed using the naive EB method (blue) are the narrowest. In contrast, the confidence intervals constructed using EB without calibration (green) are too wide, underscoring the importance of the calibration step, which produces the most appropriate confidence intervals (red). Based on this result and our limited experience, confidence intervals constructed via the EB method without calibration perform better than naive confidence intervals but may still fail to achieve the desired coverage level. Therefore, the calibration step is recommended as long as the computational cost of the parametric bootstrap remains manageable.

Table 1.

Simulation results on the empirical coverage level of the constructed confidence intervals.

(a0, b0)          Method              Clonality Coverage   Entropy Coverage
(0.732, 0.882)    EB w/ Calibration   94.0%                92.2%
(0.732, 0.882)    EB w/o Calibration  100%                 100%
(0.732, 0.882)    Naive EB            62.8%                45.6%
(0.414, 0.335)    EB w/ Calibration   91.2%                95.8%
(0.414, 0.335)    EB w/o Calibration  100%                 100%
(0.414, 0.335)    Naive EB            78.0%                69.4%
(0.596, 0.960)    EB w/ Calibration   98.4%                96.6%
(0.596, 0.960)    EB w/o Calibration  100%                 100%
(0.596, 0.960)    Naive EB            67.2%                40.6%
(0.551, 0.775)    EB w/ Calibration   97.0%                96.0%
(0.551, 0.775)    EB w/o Calibration  100%                 100%
(0.551, 0.775)    Naive EB            68.4%                44.8%
(0.171, 0.301)    EB w/ Calibration   98.4%                99.2%
(0.171, 0.301)    EB w/o Calibration  100%                 100%
(0.171, 0.301)    Naive EB            88.6%                70.6%
(0.126, 0.132)    EB w/ Calibration   95.8%                98.2%
(0.126, 0.132)    EB w/o Calibration  100%                 100%
(0.126, 0.132)    Naive EB            94.0%                83.8%
(0.0860, 0.111)   EB w/ Calibration   94.4%                99.6%
(0.0860, 0.111)   EB w/o Calibration  100%                 100%
(0.0860, 0.111)   Naive EB            94.2%                84.8%
(0.113, 0.142)    EB w/ Calibration   95.6%                98.6%
(0.113, 0.142)    EB w/o Calibration  100%                 100%
(0.113, 0.142)    Naive EB            91.2%                83.2%

Note: EB w/ Calibration refers to the proposed method. EB w/o Calibration is based on the posterior distribution in Equation (1) without the calibration step. Naive EB denotes the confidence interval constructed using the naive EB procedure described in Algorithm 1.

Figure 1. 95% confidence intervals constructed by the naive EB, EB with calibration, and EB without calibration methods for clonality. From the 500 experimental trials, we randomly selected six trials to generate the plot.

In addition to the results in Table 1, we plot the confidence intervals for the clonality and the corresponding true clonalities from 500 datasets simulated with $(a_0, b_0, C_0) = (0.086, 0.111, 10{,}000)$ in Figure 2. The confidence intervals are sorted according to the size of the true clonalities. Notably, the true clonalities have quite substantial variation relative to the width of the 95% confidence intervals, highlighting the importance of treating $G(Y)$ as a random quantity.

Figure 2. 95% confidence intervals for the clonality based on 500 simulated datasets; the intervals are sorted by the size of their true clonalities (from smallest to largest).

Next, we examine the empirical performance of the proposed confidence intervals when the underlying rates $\lambda_i, i=1,\ldots,C_0$, do not follow a gamma distribution, i.e. when the assumed Poisson-Gamma model is misspecified in characterizing the data generation process. Specifically, we simulate $\lambda_i$ from a log-normal distribution $\exp\{N(\mu_0, \sigma_0^2)\}$ with chosen $\mu_0$ and $\sigma_0$ and generate the observed reads $Z_O$. Using the same steps described above, we construct confidence intervals based on 500 simulated datasets and calculate their empirical coverage level. The results are summarized in Table 2. The empirical coverage level for entropy is quite close to 95% even though the Poisson-Gamma parametric model is misspecified, suggesting robustness of the proposed confidence intervals. On the other hand, the constructed confidence interval for clonality severely undercovers the true parameter, underscoring the importance of assuming an appropriate distribution for the Poisson rates.

Table 2.

Simulation results when the underlying model for the rate is log-normal rather than gamma.

$(\mu_0, \sigma_0^2)$    Coverage for Clonality   Coverage for Entropy
(−1.38, 1.64²)           80.4%                    98.2%
(−1.27, 1.72²)           86.8%                    97.8%
(−1.22, 1.50²)           80.2%                    98.4%
(−1.02, 1.62²)           82.7%                    95.8%

4. Real data studies

In this section, we illustrate our approach by applying it to a recent study conducted by [20]. The objective is to investigate human T cell receptor (TCR) diversity. Specifically, we are interested in measuring TCR diversity by clonality and entropy. In the study conducted by [20], five replicate TCR libraries of CD4 naive T cells and CD4 memory T cells were sequenced for each of the seven participants. The total number of reads varied from $8.9 \times 10^4$ to $7.4 \times 10^5$. First, we count the total number of reads in each clone across the five replicates; Figure 3 shows the observed cumulative proportions of clones sorted from the largest to the smallest for both CD4 naive and memory T cells. From the figure, it is clear that large clones contain a higher proportion of CD4 memory T cells than of naive CD4 T cells, reflecting the relative evenness of the distribution of clone sizes of naive T cells.

Figure 3. Cumulative proportion of cells from clones, sorted by clone size from largest (left) to smallest (right).

Next, we apply the proposed method to construct confidence intervals of entropy and clonality for the naive cells and the memory cells based on data from five replicates per patient; these are plotted in Figures 4 and 5. The confidence intervals based on different replicates of the same participant are fairly consistent in general, suggesting a low within-person variation relative to the between-person variation and supporting the validity of the experimental results. In addition, the clonality of CD4 memory T cells is substantially greater than that of CD4 naive T cells, again confirming the observations on the evenness of the distribution of the naive T cell clones in Figure 3.

Figure 4. 95% confidence intervals for clonality based on five replicates per patient for naive cells and memory cells (x-axis displayed on a logarithmic scale for better visualization). A lower clonality value (further to the left on the x-axis) indicates greater diversity in the cell population.

Figure 5. 95% confidence intervals for entropy based on five replicates per patient for naive cells and memory cells. A higher entropy value (further to the right on the x-axis) indicates greater diversity in the cell population.

Lastly, we estimate the 'average' (log-transformed) clonality and entropy of naive CD4 T cells for the three participants younger than 40 and the four participants older than 70, separately, based on a random effects model well known from meta-analysis. We then compare the clonality and entropy between younger and older participants. The average log-transformed clonality for naive T cells is −9.45 [−9.66 to −9.24] for young participants and −7.79 [−9.09 to −6.50] for old participants, with a two-sided p-value of 0.014 for testing the null hypothesis that old participants have the same average clonality as young participants. The average entropy is 10.97 [10.76 to 11.17] for young participants and 10.37 [9.88 to 10.87] for old participants, with a two-sided p-value of 0.029 for testing the null hypothesis that old participants have the same average entropy as young participants. These results suggest that the immune diversity in old participants is lower than that in young participants, as anticipated, reflecting the effect of aging on the human immune system.
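
The paper does not spell out which random effects estimator is used; a standard choice from meta-analysis is the DerSimonian-Laird estimator, sketched below for pooling the per-person estimates and their standard errors.

```python
import numpy as np

def random_effects_mean(y, v):
    """DerSimonian-Laird random-effects pooling: y are per-person estimates
    (e.g. log clonality), v their squared standard errors. Returns the
    pooled mean and its standard error."""
    y, v = np.asarray(y, float), np.asarray(v, float)
    w = 1.0 / v                                    # fixed-effect weights
    ybar = np.sum(w * y) / np.sum(w)
    Q = np.sum(w * (y - ybar) ** 2)                # heterogeneity statistic
    tau2 = max(0.0, (Q - (len(y) - 1)) /
               (np.sum(w) - np.sum(w ** 2) / np.sum(w)))
    w_re = 1.0 / (v + tau2)                        # random-effects weights
    mu = np.sum(w_re * y) / np.sum(w_re)
    return mu, np.sqrt(1.0 / np.sum(w_re))
```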

5. Discussion

In this paper, we discuss a method for constructing confidence intervals for entropy and clonality, both functions of a high dimensional probability vector. The primary challenges stem from the curse of dimensionality, where traditional estimators often struggle to maintain accuracy due to sparse observations in high-dimensional spaces. In particular, the probability vector cannot be estimated well, and the corresponding plug-in estimators of entropy and clonality are very poor in the high dimensional setting. EB is a natural approach because the components of the high dimensional probability vector can be viewed as random samples from a prior distribution, and such a distribution is estimable under appropriate parametric assumptions. When the dimension of the probability vector is high, i.e. the number of random samples is large, one may expect that the value of entropy or clonality is driven by the simple prior distribution generating the probability vector, and that inference on entropy or clonality can be made accordingly. Therefore, our method is developed within a general EB framework, coupling an adjustment for the uncertainty in estimating the prior distribution with a parametric bootstrap-based calibration step. Both components are important, since the former determines the location of the confidence interval and the latter ensures the correct coverage level. Based on our numerical study, the proposed confidence interval can achieve reasonable performance when the parametric model for the prior distribution is correctly specified.

While the proposed method demonstrates some robustness against model misspecification (an average coverage of 82.5% for clonality and 97.6% for entropy across the four experimental settings for model misspecification), its validity is not entirely immune to such issues. In particular, the proposed calibration is not a substitute for imposing a good parametric model for the prior distribution in the first place. As illustrated by the results in Table 2, the performance of the confidence intervals can be poor if the prior distribution is severely misspecified. Replacing the gamma distribution by a more flexible model could be a promising direction for future research. In particular, it is appealing to consider distributions from a nonparametric exponential family, $p(\lambda \mid \eta) \propto p_0(\lambda)\exp\{B(\lambda)^\top\eta\}$, where $p(\lambda \mid \eta)$ is the density function of the intensity rate and $B(\lambda)$ is a set of flexible basis functions given a priori, such as $B(\lambda) = (\lambda, \lambda^2, \lambda^3)$ [21]. Lastly, in the proposed approach, the actual number of distinct clones is replaced by its estimator, which may affect the performance of the subsequent point and interval estimation. It is conceivable that the impact is greater for some functions, such as entropy, which is more sensitive to small clones, than for other functions, such as clonality, which is robust to small clones. However, estimating the number of distinct clones is analogous to estimating the number of unseen species, which is a difficult problem and in the current case depends on the parametric assumption for the intensity rate [11]. Therefore, it is important to study the impact of this estimator on the construction of the confidence interval for different diversity parameters.

Acknowledgments

This work was completed when Zhongren Chen was at Stanford University. Dr. Lu Tian and Dr. Richard Olshen are supported by the National Institutes of Health. Dr. Richard A. Olshen passed away on November 8, 2023. Author's Original Manuscript: https://arxiv.org/abs/2211.14755. While preparing this manuscript, we utilized ChatGPT 4.0 and Claude 3.5 Sonnet for sentence-level edits, such as fixing grammar and rewording sentences.

Footnotes

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • [1] Aldrich RJ, GCHQ: The Uncensored Story of Britain's Most Secret Intelligence Agency, HarperPress, London, 2010.
  • [2] Box GE and Tiao GC, Bayesian Inference in Statistical Analysis, John Wiley & Sons, Hoboken, NJ, 2011.
  • [3] Carlin BP and Gelfand AE, Approaches for empirical Bayes confidence intervals, J. Am. Stat. Assoc. 85 (1990), pp. 105–114.
  • [4] Carlin BP and Louis TA, Empirical Bayes: past, present and future, J. Am. Stat. Assoc. 95 (2000), pp. 1286–1289.
  • [5] Casella G, An introduction to empirical Bayes data analysis, Am. Stat. 39 (1985), pp. 83–87.
  • [6] Chao A, Estimating the population size for capture-recapture data with unequal catchability, Biometrics 43 (1987), pp. 783–791.
  • [7] Chao A, Estimating population size for sparse data in capture-recapture experiments, Biometrics 45 (1989), pp. 427–438.
  • [8] Chao A and Shen TJ, Nonparametric estimation of Shannon's index of diversity when there are unseen species in sample, Environ. Ecol. Stat. 10 (2003), pp. 429–443.
  • [9] Dempster AP, Laird NM, and Rubin DB, Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc. Ser. B Methodol. 39 (1977), pp. 1–22.
  • [10] Efron B and Hinkley DV, Assessing the accuracy of the maximum likelihood estimator: observed versus expected Fisher information, Biometrika 65 (1978), pp. 457–483.
  • [11] Efron B and Thisted R, Estimating the number of unseen species: how many words did Shakespeare know?, Biometrika 63 (1976), pp. 435–447.
  • [12] Efron B and Tibshirani RJ, An Introduction to the Bootstrap, CRC Press, New York, 1994.
  • [13] Fisher RA, Theory of statistical estimation, Mathematical Proceedings of the Cambridge Philosophical Society, Vol. 22, Cambridge University Press, Cambridge, 1925, pp. 700–725.
  • [14] Glanville J, Huang H, Nau A, Hatton O, Wagar LE, Rubelt F, Ji X, Han A, Krams SM, Pettus C, Haas N, Arlehamn CSL, Sette A, Boyd SD, Scriba TJ, Martinez OM, and Davis MM, Identifying specificity groups in the T cell receptor repertoire, Nature 547 (2017), pp. 94–98.
  • [15] Kaplinsky J and Arnaout R, Robust estimates of overall immune-repertoire diversity from high-throughput measurements on samples, Nat. Commun. 7 (2016), p. 11881.
  • [16] Laird NM and Louis TA, Empirical Bayes confidence intervals based on bootstrap samples, J. Am. Stat. Assoc. 82 (1987), pp. 739–750.
  • [17] McLachlan G and Jones P, Fitting mixture models to grouped and truncated data via the EM algorithm, Biometrics 44 (1988), pp. 571–578.
  • [18] Morris CN, Parametric empirical Bayes confidence intervals, in Scientific Inference, Data Analysis, and Robustness, Elsevier, New York, 1983, pp. 25–50.
  • [19] Norris JL and Pollock KH, Non-parametric MLE for Poisson species abundance models allowing for heterogeneity between species, Environ. Ecol. Stat. 5 (1998), pp. 391–402.
  • [20] Qi Q, Liu Y, Cheng Y, Glanville J, Zhang D, Lee JY, Olshen RA, Weyand CM, Boyd SD, and Goronzy JJ, Diversity and clonal selection in the human T-cell repertoire, Proc. Natl. Acad. Sci. 111 (2014), pp. 13139–13144.
  • [21] Schwartzman A, Empirical null and false discovery rate inference for exponential families, Ann. Appl. Stat. 2 (2008), pp. 1332–1359.
  • [22] Tian L, Liu Y, Fire AZ, Boyd SD, and Olshen RA, Clonality: point estimation, Ann. Appl. Stat. 13 (2019), pp. 113–131.
  • [23] Van der Vaart AW, Asymptotic Statistics, Vol. 3, Cambridge University Press, Cambridge, 2000.
  • [24] Zurek WH, Complexity, Entropy and the Physics of Information, CRC Press, Boca Raton, 2018.
