Published in final edited form as: Stat Methods Med Res. 2012 Apr 18;23(1):11–41. doi: 10.1177/0962280212445801

On Random Sample Size, Ignorability, Ancillarity, Completeness, Separability, and Degeneracy: Sequential Trials, Random Sample Sizes, and Missing Data

Geert Molenberghs 1,2, Michael G Kenward 3, Marc Aerts 1, Geert Verbeke 2,1, Anastasios A Tsiatis 4, Marie Davidian 4, Dimitris Rizopoulos 5
PMCID: PMC3404233  NIHMSID: NIHMS335858  PMID: 22514029

Abstract

The vast majority of settings for which frequentist statistical properties are derived assume a fixed, a priori known sample size. Familiar properties then follow, such as, for example, the consistency, asymptotic normality, and efficiency of the sample average for the mean parameter, under a wide range of conditions. We are concerned here with the alternative situation in which the sample size is itself a random variable which may depend on the data being collected. Further, the rule governing this may be deterministic or probabilistic. There are many important practical examples of such settings, including missing data, sequential trials, and informative cluster size. It is well known that special issues can arise when evaluating the properties of statistical procedures under such sampling schemes, and much has been written about specific areas [3, 4]. Our aim is to place these various related examples into a single framework derived from the joint modeling of the outcomes and sampling process, and so derive generic results that in turn provide insight, and in some cases practical consequences, for different settings. It is shown that, even in the simplest case of estimating a mean, some of the results appear counter-intuitive. In many examples the sample average may exhibit small sample bias and, even when it is unbiased, may not be optimal. Indeed there may be no minimum variance unbiased estimator for the mean. Such results follow directly from key attributes such as non-ancillarity of the sample size, and incompleteness of the minimal sufficient statistic made up of the sample size and sample sum. Although our results have direct and obvious implications for estimation following group sequential trials, there are also ramifications for a range of other settings, such as random cluster sizes, censored time-to-event data, and the joint modeling of longitudinal and time-to-event data. Here we use the simplest group sequential setting to develop and explicate the main results. Some implications for random sample sizes and missing data are also considered. Consequences for other related settings will be considered elsewhere.

Keywords: Frequentist Inference, Generalized Sample Average, Informative Cluster Size, Joint Modeling, Likelihood Inference, Missing at Random, Random Cluster Size

1 Introduction

In much conventional statistical methodology, it is assumed that the sample has a fixed and a priori known size. There exist many practical settings, however, in which sample sizes are, by contrast, random, and the derivation of the frequentist properties of statistical procedures in such cases raises fundamental issues that require careful attention. There are several distinct classes of problem of this type. First, data may be missing in the sense that fewer observations are collected than planned: the sample of units may be smaller than envisaged, or a set of observations from a unit, multivariate or longitudinal, may be incomplete [1]. Second, in (group) sequential trials, the final sample size depends on the point at which accumulating evidence crosses a pre-specified boundary [2]. Third, the sample size can be completely random (CRSS), in the sense that it does not depend on the observations, either collected already or still to be collected [3, 4]. Here we incorporate such settings into a general framework for problems with non-fixed sampling schemes. To allow us to develop the main ideas in a simple way, but which still allows explication of the key problematic issues, we focus on the generic case where a sample size N can take only one of two values: n or 2n. Having developed a full theory for this setting we then consider in turn the three particular settings mentioned above. A large literature already exists on various classes of the current problem. One of the most familiar, and well worked out, concerns group sequential trials. Several authors have pointed to fundamental problems with the frequentist perspective of conventional estimators, when applied after conducting a sequential trial [5–8], whether they take the form of simple sample averages or those obtained through maximum likelihood. By incorporating a probit model for the stopping rule into the simple two-sample-size generic case described above, combined with normally distributed outcomes, we are able to develop a sufficiently rich setting for development that includes the simple sequential sampling scheme as a limiting special case. By applying the work of Liu and Hall [8] in this setup, we are able to uncover the non-trivial implications of a random sample size, essentially making use of two important properties: (a) excluding the CRSS case, the sample size is non-ancillary given the sum of the observations made, regardless of whether the stopping rule is probabilistic (as in many missing data applications) or deterministic (as in sequential trials); (b) the pair made up of the accumulating sample sum and the sample size is a minimal sufficient statistic that is not complete.

Using the above we show that the classical sample average is generally biased, but asymptotically unbiased. An unbiased estimator can be obtained from the conditional likelihood, where the conditioning is on the (non-ancillary) sample size. One consequence of this lack of ancillarity is that, unexpectedly, the conditional estimator has larger mean squared error than the simple sample average, the latter being the estimator that follows from the joint likelihood. This underscores the point that, in contrast to some claims made in the literature, such an average is a perfectly legitimate estimator. The finite-sample bias in the sample average is zero only in the CRSS case. Even then, it is not unique in that a whole class of so-called generalized sample average estimators can be defined, all of which are unbiased. This enables us to show that the ordinary sample average is only asymptotically optimal. Indeed there is no uniformly optimal unbiased estimator in finite samples.

In addition to the three settings which we study in detail in this paper, there are several additional classes of problem that fit within our generic framework. We now list four of these, continuing the list of the second paragraph in this section. Fourth, so-called ‘informative cluster sizes’, in which the cluster sizes may themselves depend on the outcomes of interest. Fifth, censored time-to-event data have much in common with missing data. Sixth, the joint modeling of longitudinal outcomes and time-to-event data is an extension of various earlier settings, including incomplete longitudinal data, censored time-to-event data, and informative cluster sizes. Seventh, a longitudinal outcome may be paired with a random measurement occasion schedule. The general results derived in this paper also apply to these settings. The main results developed here will be applied to these four settings elsewhere.

This paper is organized as follows. In Section 2, a general framework is sketched, encompassing the various settings mentioned earlier. Key technical and modeling concepts are introduced. The main ideas are developed in the context of a sequential trial with a probabilistic stopping rule; this is the focus of Section 3. Ramifications for random sample sizes are discussed in Section 4, while in Section 5 we focus on missing data. Some technical details are deferred to appendices.

2 Generic Setting as ‘Joint Modeling’

Consider the generic model for a pair of random variables Y i and Ci:

f(y_i, c_i \mid \theta, \psi) = f(y_i \mid \theta)\, f(c_i \mid y_i, \psi)   (1)
 = f(y_i \mid c_i, \theta, \psi)\, f(c_i \mid \theta, \psi),   (2)

where the Y_i are outcomes of interest and C_i is a context-specific augmentation, such as sample size or missing-data pattern. Here θ and ψ are unknown fixed parameter vectors. In line with work by Rubin and Little [1, 9–11], (1) is the selection model factorization and (2) is the pattern-mixture factorization. Also, following common practice in the missing-data literature, the mechanism f(c_i | y_i, ψ) can be classified according to the following taxonomy. Missing data are missing completely at random (MCAR) if f(c_i | y_i, ψ) = f(c_i | ψ), and missing at random (MAR) if f(c_i | y_i, ψ) = f(c_i | y_i^o, ψ), where Y_i^o is the observed portion of the sequence. Otherwise, the mechanism is missing not at random (MNAR). We can make a distinction between Rubin's original definition of these and a frequentist definition according to whether these particular forms are assumed to hold just for the data observed, or for all possible data sets, respectively. Note that these concepts are not restricted to the missing data case, but rather apply to all of the above settings.
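To make the taxonomy concrete, the following minimal Python sketch (all parameter values and the logistic form of the mechanism are purely illustrative assumptions, not taken from the settings above) generates an outcome and an observation indicator under MCAR, MAR, and MNAR versions of f(c_i | y_i, ψ):

```python
import numpy as np

rng = np.random.default_rng(2012)

def simulate(mechanism, n=100_000, psi0=-0.5, psi1=1.5):
    """Simulate (Y1, Y2) and an observation indicator C for Y2 under one mechanism.

    MCAR: P(C = 1) depends on neither Y1 nor Y2.
    MAR : P(C = 1) depends on the always-observed Y1 only.
    MNAR: P(C = 1) depends on the possibly unobserved Y2 itself.
    """
    y1 = rng.normal(size=n)
    y2 = 0.5 * y1 + rng.normal(size=n)          # outcome of interest, mean 0
    if mechanism == "MCAR":
        lin = np.full(n, psi0)
    elif mechanism == "MAR":
        lin = psi0 + psi1 * y1
    else:                                       # MNAR
        lin = psi0 + psi1 * y2
    observed = rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-lin))
    return y1, y2, observed

for mech in ("MCAR", "MAR", "MNAR"):
    _, y2, obs = simulate(mech)
    print(mech, "complete-case mean of Y2:", round(float(y2[obs].mean()), 3))
```

Under MCAR the complete-case mean of Y_2 is unbiased; under MAR it is biased, but a likelihood analysis of the joint distribution of (Y_1, Y_2) recovers the mean without modeling the mechanism; under MNAR the mechanism itself must be modeled.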

The seven instances of the previous section can be formalized as follows.

1. Sequential trials

Consider a simple sequential trial, where n measurements Y_i are observed, after which a stopping rule is applied and, depending on the outcome, another set of n measurements is or is not taken. Then, Y is the (2n × 1) vector of outcomes that could be collected and C = N is the realized sample size, that is, N = n or N = 2n. This setting was considered by Kenward and Molenberghs [12]. More generally, a sequential trial would allow for sample sizes n_1 < ⋯ < n_L, for pre-specified L.

2. Missing longitudinal data

Here Y_i is a sequence of planned measurements of length n_i for subject i in a trial of size N, and C_i is a vector of indicators, with C_ij = 1 if measurement Y_ij is taken and 0 otherwise. It is customary to decompose Y_i = (Y_i^o, Y_i^m) into the observed and missing components, respectively.

3. Completely random sample size

Here Y is an outcome vector, as above, but with C = N a CRSS, taking values N = 0, …, m, with probabilities that are independent of Y .

4. Clusters of random size

Assume that data are sampled in the form of clusters. Then Y_i is the set of outcomes for cluster i = 1, …, N and C_i = t_i is the observed cluster size. Two cases can be distinguished. First, some members of the cluster may fail to provide a measurement, implying that t_i ≤ n_i, with n_i the number of cluster members. Second, in the case that there are no missing data, that is, t_i ≡ n_i, it is still possible that the cluster size contains relevant information about the parameter of interest.

5. Censored time-to-event data

Now, Y_i = T_i, a time-to-event outcome for subject i = 1, …, N, and the augmentation C_i is a censoring time.

6. Joint models for longitudinal and time-to-event data

In this setting, Y_i is a longitudinal outcome, as in the second setting, and the augmentation C_i is the pair of event and censoring times (T_i, C_i), as in the previous setting.

7. Random measurement times

A situation closely related to the previous ones is that of random measurement times. By jointly considering the outcomes Yi and the measurement times Ti, this relates directly to the previous setting, as well as to the missing data and random cluster size settings.

In some of the settings above, covariates, Xi say, may be present. Without loss of generality these will be suppressed from notation, throughout.

A fundamental concept, also due to Rubin [9], is that of ignorability. For pure likelihood or Bayesian inferences, and assuming MAR, it is well known that inferences about θ can proceed by merely using f(y_i^o | θ), i.e., without explicitly modeling the missing-data mechanism. This is provided that the regularity condition of separability holds. (Note that this concept can be transported to generic setting (1), by letting Y_i^o be the observed portion in the first factor on the right hand side.) Formally, separability means that the parameter space of (θ′, ψ′)′ is equal to the Cartesian product of the individual parameter spaces of θ and ψ. Informally stated, this means that the missing-data model does not contain information about the outcome model parameter. Even when separability is not satisfied, the consequence is typically a loss of efficiency only; the validity of the estimation method is not affected. In other words, C_i can be considered ancillary in the sense of Cox and Hinkley [13] (pp. 32–35). We will see that this is not true for all situations. These considerations imply that ignorability may fail to hold for four reasons. First and most directly, we may be in the likelihood or Bayesian framework but with an MNAR mechanism operating. Second, ignorability does not hold in the likelihood and Bayesian framework under MAR but in a non-separable situation; as stated above, this is not an important issue in practice. Third, under frequentist inference, both MAR and MNAR are generally non-ignorable. Fourth, for pure likelihood and Bayesian inferences, with MAR and separability holding, ignorability in the selection model decomposition (1) does not translate to the pattern-mixture model (2), as is clear from the presence of both θ and ψ in both factors of (2). Of course, one can consider ignorability in the pattern-mixture context, by directly parameterizing the pattern-mixture model; in this case, ignorability does not carry over to the selection model setting. We learn from this that ignorability is parameterization-specific. When ranging over the seven settings described above, all of these situations may occur.

There may be instances where it is natural to model f(y_i | c_i, θ*) on the one hand, while scientific interest lies with f(y_i | θ*, ψ*) on the other. The asterisks refer to the fact that these are pattern-mixture parameters from (2), which are then transported to (1). This could be the case, for example, in the clustered-data setting. Then, the connection between both decompositions implies a natural weight:

f(y_i \mid \theta^*, \psi^*) = f(y_i \mid c_i, \theta^*)\,\frac{f(c_i \mid \psi^*)}{f(c_i \mid y_i, \theta^*, \psi^*)} \equiv f(y_i \mid c_i, \theta^*)\, w_i(y_i, c_i, \theta^*, \psi^*).   (3)

Here, ‘natural weight’ is used in the sense of being implied by a joint model, rather than following from ad hoc specification. Note that the above fourth case for non-ignorability now reverses: while the pattern-mixture formulation may satisfy ignorability, the derived selection model is unlikely to. The above is not specific to the missing-data context but applies to all cases considered.

It is useful to point out the connection between ignorability and ancillarity in the sense of Cox and Hinkley [13]. These authors define an ancillary statistic T as one that complements a minimally sufficient statistic S such that, given S, T does not contain information about the parameter of interest. For example, when estimating the mean of a normally distributed population with mean μ and variance 1, the sample size T = n is ancillary given the sample sum or sample average. This is certainly true for a sample size fixed by design and also for a random sample size, provided the law governing the sample size does not depend on μ. The general sequential trial case considered in this paper is a clear counterexample to this.

A further property that has an important role to play in the present context is that of completeness [14] (pp. 285–286). A statistic s(Y) of a random variable Y, with Y belonging to a family P_θ, is complete if, for every measurable function g(·), E[g{s(Y)}] = 0 for all θ implies that P_θ[g{s(Y)} = 0] = 1 for all θ. The relevance of completeness in the present setting arises in two ways. First, from the Lehmann–Scheffé theorem [14], if a statistic is unbiased, complete, and sufficient for some parameter θ, then it is the best mean-unbiased estimator for θ. This result is important in framing the problems that occur with estimation following group sequential trials, a point to which we return in Section 3. Second, the connection with ancillarity follows from Basu's theorem [14, 15] (p. 287): a statistic that is both complete and sufficient is independent of any ancillary statistic.

The above indicates that lack of completeness might correspond to violation of ignorability. Noting that these theorems are implications rather than equivalences, we will see that there are plenty of counterexamples in the sequential-trial setting. This means that a conventional statistical analysis, in line with what would be undertaken with complete data and based only upon f(yi|θ) in (1), may lead to bias. The lack of completeness may imply that there is no uniformly best way to overcome this problem, a point that is taken up in the next section.

We now consider the ‘weights’ in (3). These provide a natural connection with inverse probability weights; i.e., they provide a way to re-compose the population of interest from the sample data, although we do need to make the formal distinction between this type of weighting, which applies to the likelihood itself, and the more familiar inverse probability or Horvitz–Thompson type [16], which essentially applies to the scores or estimating equations, and strictly lies outside the likelihood framework. By definition, the ‘weights’ can be removed in the ignorable cases, but are needed in all others. We will consider various cases in sections to follow. Evidently, the connection between both frameworks is problematic if the denominator f(c_i | y_i, θ*, ψ*) of the ‘weights’ w_i ≡ w_i(y_i, c_i, θ*, ψ*) is zero over a subset of the sample space or, more precisely, is not bounded away from zero. We will refer to these problematic situations as degenerate or deterministic, each of the terms stressing different aspects of the problem. This will occur for some incomplete-data and sequential-trial settings. In contrast to the semi-parametric case, this issue does not occur in the parametric situation. Nevertheless, caution is needed: the problem has been transferred to one of extrapolation, that is, unverifiable distributional assumptions have to be made about the part of the sample space that cannot be reached.

A final remark is that in some cases there may fail to be replication to ensure estimability of all parameters, as we will see in the sequential-trial case. This is why the parameters governing the stopping rule, whether deterministic or probabilistic, will be assumed known. This does not imply a limitation.

3 Sampling With a Deterministic or Probabilistic Stopping Rule

Sequential trials have a long history [17], and throughout this, estimation following (group) sequential trials has been both of interest and controversial. This is true for point estimation [5–8] as well as estimation of precision [12, 18–20]. Much of the relatively early work has been summarized by Whitehead and colleagues [20, 21], who point out that although maximum likelihood estimators are not affected by stopping rules, their sampling distribution is. This follows from ignorability: while the likelihood itself is unaffected by the stopping rule, frequentist properties based on sampling distributions are. It was precisely this issue that was studied in Kenward and Molenberghs [12]. The same issue emerges in a different form in Hughes and Pocock [6], who observe that there is usually bias towards overestimation of the treatment effect; they aim at reducing this bias using prior information in a Bayesian analysis. Todd, Whitehead, and Facey [20] provide methods for bias correction for the estimator. As will be seen below, one has to be extremely careful with the concept of bias. The issue of bias will be revisited and it will be shown that many estimators, including the sample average, are asymptotically unbiased. Tsiatis, Rosner, and colleagues [18, 19] study precision estimation after group sequential trials in terms of the joint sampling distribution of a test statistic K, obtained after T looks, on the one hand, and T, the number of looks, on the other. Thus, these authors also considered joint modeling in the broad sense of Section 2. In particular, our simple sequential trial, introduced in Section 2 and further studied below, is a special case of this situation where T can take values 1 or 2, corresponding to N = n and N = 2n, respectively. Emerson and Fleming [7] propose estimators within an ordering paradigm. Liu and Hall [8] cast the sequential trial in the framework of a drift parameter θ of a Brownian motion Y(t) stopped at time T, and show that the sufficient statistic [T, Y(T)] is not complete for θ. Following our discussion of completeness in Section 2, it is not surprising that this means that there exist infinitely many unbiased estimators of θ, none of which has uniformly minimum variance. Liu and Hall [8] then focus on the class of so-called truncation-adaptable unbiased estimators and find the minimum-variance member among them; thus, the problem is alleviated by restricting attention to a smaller class of estimators than is usually considered. They also show that this estimator is identical to the one proposed by Emerson and Fleming [7].

The archetypical setting to be considered further was originally studied by Kenward and Molenberghs [12]. In particular, they suppose that n i.i.d. N(μ, 1) observations Y_1, …, Y_n are collected and, if the sample fails to satisfy a given stopping rule, a further n observations Y_{n+1}, …, Y_{2n} are collected, which have the same distribution. While n is not a random variable, the final sample size N is, taking possible values n or 2n. Note that the setting is sufficiently general to be a paradigm for any (group) sequential trial, as will be made clear in Section 3.6. The inferential goal is to estimate μ, with an accompanying measure of precision. An obvious estimator is \hat\mu = N^{-1}\sum_{i=1}^{N} y_i. The aim of Kenward and Molenberghs [12] was to study the problems arising from such an approach, with emphasis on precision estimation. They argued that it is necessary to use the observed rather than the expected information matrix, because the setting is non-ignorable, in the sense that the corresponding likelihood is non-separable. Indeed, even though the stopping rule is conventionally defined in terms of the first n observations, and so is independent of possibly missing outcomes, N does contain information about μ, given the observed outcomes. Equivalently, this means that N is not ancillary for μ given the observed data or, more precisely, given a minimally sufficient statistic such as the average or sum of the observations. As is clear from the above literature review, the simple sample average is often surrounded with doubts about its sampling behavior. We will see below that it is a legitimate estimator, in a precisely defined sense. Somewhat surprisingly, we will see that this estimator is not constructed through conditioning on N, but rather is derived from the joint likelihood for the data and the sample size. We will show that this is in complete agreement with the use of likelihood in an ignorable setting [9]. We will see, by contrast, that the conditional estimator takes a different form.

Kenward and Molenberghs [12] proceeded by considering the full likelihood, using the pattern-mixture decomposition,

f(y_1, \dots, y_N \mid N)\, f(N).   (4)

They focused on the limiting case where the trial is stopped if \sum_{i=1}^{n} y_i \ge 0, and continued otherwise. This rule is deterministic, in line with common practice. They showed that the marginal probability of stopping is \pi = \Phi(\sqrt{n}\,\mu), with Φ(·) the standard normal cumulative distribution function, and proceeded with the derivation of the corresponding log-likelihood and observed information. We will now frame the deterministic stopping rule as the limiting case of a probabilistic law, by considering the conditional probability of stopping:

\pi(N = n \mid \boldsymbol{y}_1) = \pi(\boldsymbol{y}_1) = F\!\left(\alpha + \frac{\beta}{n}\,k\right),   (5)

where

\boldsymbol{Y}_1 = (Y_1, \dots, Y_n)'   (6)

and K = \sum_{i=1}^{n} Y_i. Here, K is the sum over the observations available at the time of evaluation: at the time of the interim look, K is the sum of the first n observations; for an experiment that continues after the interim look, K can also be evaluated over all 2n observations.

Even though the sample size N itself indicates whether or not stopping occurs, it is convenient to introduce a random variable Z that takes the value 1 if stopping occurs and 0 if not. Alternatively, the indicators I(N = n) and I(N = 2n) may be used. The full likelihood can be written:

L = \prod_{i=1}^{2n}\phi(y_i;\mu)\; F\!\left(\alpha+\frac{\beta}{n}k\right)^{z}\left\{1-F\!\left(\alpha+\frac{\beta}{n}k\right)\right\}^{1-z}.   (7)

Here, ϕ(·) is the standard normal density. Because Z is not replicated, α and β cannot be estimated from the data, so these parameters will be assumed fixed by design. This has no implications for the conceptual issues developed with this example. The observed data likelihood then takes the form:

L = \prod_{i=1}^{N}\phi(y_i;\mu)\; F\!\left(\alpha+\frac{\beta}{n}k\right)^{z}\left\{1-F\!\left(\alpha+\frac{\beta}{n}k\right)\right\}^{1-z}.   (8)

Expressions (7) and (8) are identical, except for the sample size, which is 2n in the first case and N in the second. Because (7) starts from a selection model decomposition, it contains the marginal data model, rather than the one conditional on sample size. For the sample size, the reverse is true, in the sense that a conditional expression is given, rather than the marginal one. To reverse the factorization, as in (4), we proceed by writing

\boldsymbol{Y}_1 \sim N(\boldsymbol{\mu}, I_n),   (9)

where \boldsymbol{\mu} is a vector of n copies of μ and I_n is the n-dimensional identity matrix. By taking F(·) in (5) to be the probit function Φ(·), we can consider a latent stopping variable

T \mid \boldsymbol{y}_1 \sim N\!\left(\alpha + \frac{\beta}{n}\,k,\; 1\right),   (10)

underlying Z. From this, we can derive the joint distribution of Y1 and T, using standard multivariate normal manipulations:

\begin{pmatrix}\boldsymbol{Y}_1\\ T\end{pmatrix} \sim N\!\left[\begin{pmatrix}\boldsymbol{\mu}\\ \alpha+\beta\mu\end{pmatrix},\; \begin{pmatrix} I_n & \frac{\beta}{n}\mathbf{1}_n\\ \frac{\beta}{n}\mathbf{1}_n' & 1+\frac{\beta^2}{n}\end{pmatrix}\right],   (11)

where \mathbf{1}_n is an n-vector of ones. The marginal probability of stopping is therefore:

P(N=n) = P(Z=1) = \Phi\!\left(\frac{\alpha+\beta\mu}{\sqrt{1+\beta^2/n}}\right).   (12)

Note that (12) depends on the parameter μ, implying that this pattern-mixture formulation is non-separable. By contrast, although the observed data are present in the conditional stopping probability, μ is not, implying separability in the selection model formulation. In the next section, we will encounter an example of the reverse.
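To make the passage from (10) and (11) to (12) explicit, the following short derivation (using only standard normal marginalization of the latent variable) may be helpful:

E(T) = \alpha + \frac{\beta}{n}\,E(K) = \alpha + \beta\mu, \qquad \mathrm{var}(T) = 1 + \frac{\beta^2}{n^2}\,\mathrm{var}(K) = 1 + \frac{\beta^2}{n},

so that, marginally, T \sim N(\alpha+\beta\mu,\, 1+\beta^2/n) and hence P(Z=1) = P(T > 0) = \Phi\!\left(\frac{\alpha+\beta\mu}{\sqrt{1+\beta^2/n}}\right), which is expression (12).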

Continuing, the conditional probability for the outcomes is:

f(y_1,\dots,y_N \mid Z=z) = \frac{\prod_{i=1}^{N}\phi(y_i;\mu)\;\Phi\!\left(\alpha+\frac{\beta}{n}k\right)^{z}\left[1-\Phi\!\left(\alpha+\frac{\beta}{n}k\right)\right]^{1-z}}{\Phi\!\left(\frac{\alpha+\beta\mu}{\sqrt{1+\beta^2/n}}\right)^{z}\left[1-\Phi\!\left(\frac{\alpha+\beta\mu}{\sqrt{1+\beta^2/n}}\right)\right]^{1-z}}.   (13)

The connection between the marginal outcome model (9) and its conditional counterpart (13) again draws attention to a natural ‘weight’ or connecting factor:

f(y_1,\dots,y_N) = f(y_1,\dots,y_N \mid Z=1)\; w(y_1,\dots,y_N, Z=1),

with

w(y_1,\dots,y_N, Z=1) = \frac{\Phi\!\left(\frac{\alpha+\beta\mu}{\sqrt{1+\beta^2/n}}\right)}{\Phi\!\left(\alpha+\frac{\beta}{n}k\right)}.   (14)

A similar ‘weight’ applies to the case where Z = 0.

We now derive the limiting case in which the stopping rule is a deterministic function of the first n observations, i.e., we let β → ±∞. Focusing on the positive limit, we obtain

\frac{\alpha+\beta\mu}{\sqrt{1+\beta^2/n}} \;\longrightarrow\; \sqrt{n}\,\mu \qquad (\beta\to+\infty).

This leads to the probability derived directly by Kenward and Molenberghs [12], up to an immaterial sign reversal: we will choose, in what follows, to avoid the minus sign. In this limiting case, the marginal outcome model retains the form (9), but the three other expressions change. First, the conditional outcome model (13) becomes

f(y_1,\dots,y_N \mid Z=z) = \frac{\prod_{i=1}^{N}\phi(y_i;\mu)}{\Phi(\sqrt{n}\,\mu)^{z}\left[1-\Phi(\sqrt{n}\,\mu)\right]^{1-z}}.   (15)

Second, (5) is given by P(N = n \mid \boldsymbol{y}_1) = 1 if K > 0 and 0 otherwise. Third, (12) takes the limiting form P(N=n) = \Phi(\sqrt{n}\,\mu).

These results, in both probabilistic and limiting deterministic forms, have important implications which we now explore. First, when a joint modeling approach is used, beginning from (5) and (9), the kernel of the likelihood is just

L(\mu) \propto \prod_{i=1}^{N}\phi(y_i;\mu).   (16)

By contrast, for the approach that is conditional on Z = z, the more elaborate expression (13) applies. Hence each approach leads to different inferences. In Cox and Hinkley [13], in particular pages 31–35, it is argued that conditioning on ancillary statistics is generally sensible, but these authors also draw attention to the point that there is no universal method for constructing such statistics. They also discuss a random sample size in this context. In their example, notwithstanding this randomness, ancillarity is maintained, in contrast to our example. We will see below that this lack of ancillarity leads to non-standard results.

Second, while the limiting deterministic case does not differ fundamentally from the general probabilistic one in a likelihood setting, we do see that the natural ‘weight’ (14) becomes

w(y_1,\dots,y_N, Z=1) = \frac{\Phi(\sqrt{n}\,\mu)}{I(K > 0)},   (17)

with degeneracy in the denominator.

We now study the two likelihood estimators in more detail.

3.1 The Marginal and Conditional Likelihood Estimators

Beginning with the conditional model (13) we write, for convenience, \tilde\alpha = \alpha/\sqrt{1+\beta^2/n} and \tilde\beta = \beta/\sqrt{1+\beta^2/n}. Further, let \nu = \tilde\alpha + \tilde\beta\mu. Evidently, ν = ν(μ) is a (simple linear) function of μ. Consider first the case where Z = 1. The kernel of the log-likelihood ℓ(μ), the score function S(μ), and the Hessian H(μ) are:

\ell(\mu) = -\frac{1}{2}\sum_{i=1}^{N}(y_i-\mu)^2 - \ln\Phi(\nu),   (18)
S(\mu) = \sum_{i=1}^{N}(y_i-\mu) - \tilde\beta\,\frac{\phi(\nu)}{\Phi(\nu)},   (19)
H(\mu) = -N + \tilde\beta^2\,\frac{[\nu\,\Phi(\nu)+\phi(\nu)]\,\phi(\nu)}{\Phi(\nu)^2}.   (20)

When Z = 0, the corresponding expressions are:

\ell(\mu) = -\frac{1}{2}\sum_{i=1}^{N}(y_i-\mu)^2 - \ln[1-\Phi(\nu)],   (21)
S(\mu) = \sum_{i=1}^{N}(y_i-\mu) + \tilde\beta\,\frac{\phi(\nu)}{1-\Phi(\nu)},   (22)
H(\mu) = -N - \tilde\beta^2\,\frac{\{\nu\,[1-\Phi(\nu)]-\phi(\nu)\}\,\phi(\nu)}{[1-\Phi(\nu)]^2}.   (23)

Evidently, for the joint approach, the estimator agrees with that for a simple sample from a normal distribution and we obtain the following familiar expressions:

\ell(\mu) = -\frac{1}{2}\sum_{i=1}^{N}(y_i-\mu)^2,   (24)
S(\mu) = \sum_{i=1}^{N}(y_i-\mu),   (25)
H(\mu) = -N.   (26)

The simplicity of this estimator is a direct consequence of ignorability, given that the stopping rule satisfies MAR.

To illustrate the difference between the joint and conditional estimators, a brief simulation study has been conducted. A sample of size n was drawn from a N(μ, 1) distribution. For given values of α and β, stopping rule (10) was applied. Whenever needed, a second sample, also of size n, was generated. For all cases, a simulation run consisted of 1000 samples. Results are summarized in Table 1; a small code sketch reproducing one such setting is given after the table. As indicated above, joint estimation is based on the simple sample average.

Table 1.

Simulation study for the sequential trial case. n: sample size; μ: true mean of the N(μ, 1) distribution from which the sample is taken; α, β: parameters governing the stopping rule; #stop: number of simulations out of 1000 for which stopping is applied; \hat\mu: average estimate of the mean for the joint approach; \hat s: average standard error for the joint approach; \hat\mu_c: average estimate of the mean for the conditional approach; \hat s_c: average standard error for the conditional approach.

n μ α β #stop μ^ s^ μ^c s^c
10 0 0 0 486 0.018488 0.268621 0.018488 0.268621
100 0 0 0 516 −0.004514 0.085824 −0.004514 0.085824
1000 0 0 0 484 −0.000460 0.026844 −0.000460 0.026844
10 0 0 1 488 0.019749 0.268806 0.001946 0.275121
100 0 0 1 477 0.002681 0.084682 0.000972 0.084891
1000 0 0 1 488 0.000375 0.026881 0.000190 0.026887
10 0 1 0 849 0.007495 0.302242 0.007495 0.302242
100 0 1 0 846 −0.000064 0.095489 −0.000064 0.095489
1000 0 1 0 840 0.001791 0.030141 0.001791 0.030141
10 1 0 0 525 0.989761 0.272233 0.989761 0.272233
100 1 0 0 519 0.998925 0.085912 0.998925 0.085912
1000 1 0 0 515 0.999771 0.027131 0.999771 0.027131
10 1 0 1 809 1.014183 0.298537 1.003231 0.303925
100 1 0 1 844 0.996496 0.095431 0.995216 0.095609
1000 1 0 1 854 1.000929 0.030271 1.000794 0.030276
10 1 1 0 832 1.014265 0.300667 1.014265 0.300667
100 1 1 0 845 0.996084 0.095460 0.996084 0.095460
1000 1 1 0 842 0.998763 0.030159 0.998763 0.030159
10 −1 0 0 515 −1.006709 0.271307 −1.006709 0.271307
100 −1 0 0 493 −1.001493 0.085150 −1.001493 0.085150
1000 −1 0 0 501 −1.000682 0.027001 −1.000682 0.027001
10 −1 0 1 171 −1.001404 0.239445 −1.014000 0.243141
100 −1 0 1 151 −0.995329 0.075133 −0.996375 0.075249
1000 −1 0 1 140 −0.999724 0.023657 −0.999813 0.023661
10 −1 1 0 840 −1.005741 0.301408 −1.005741 0.301408
100 −1 1 0 866 −0.997533 0.096075 −0.997533 0.096075
1000 −1 1 0 842 −0.998102 0.030159 −0.998102 0.030159
10 0 0 10 479 0.050863 0.267972 −0.078807 0.400683
100 0 0 10 511 0.014664 0.085678 −0.004165 0.099474
1000 0 0 10 499 0.001780 0.026982 −0.000150 0.027620
10 1 0 10 1000 1.003133 0.316228 0.988938 0.328269
100 1 0 10 1000 1.000227 0.100000 1.000227 0.100000
1000 1 0 10 1000 1.000494 0.031623 1.000494 0.031623
10 −1 0 10 1 −1.004547 0.223699 −1.002906 0.225618
100 −1 0 10 0 −0.998418 0.070711 −0.998418 0.070711
1000 −1 0 10 0 −0.999392 0.022361 −0.999392 0.022361
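As an indication of how a single cell of Table 1 can be reproduced, the following Python sketch (the use of scipy, the chosen seed, and the root-finding interval are our own illustrative choices) draws sequential samples under the probabilistic stopping rule, computes the joint estimator (the sample average), and solves the conditional score (19)/(22) numerically:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

rng = np.random.default_rng(1)

def one_trial(mu, n, alpha, beta):
    """One sequential 'trial': stop after n obs with probability Phi(alpha + beta*k/n)."""
    y1 = rng.normal(mu, 1.0, n)
    k = y1.sum()
    stop = rng.uniform() < norm.cdf(alpha + beta * k / n)
    y = y1 if stop else np.concatenate([y1, rng.normal(mu, 1.0, n)])
    return y, stop

def conditional_mle(y, stop, n, alpha, beta):
    """Solve the conditional score (19) / (22) for mu by root finding."""
    bt = beta / np.sqrt(1.0 + beta**2 / n)      # beta-tilde of Section 3.1
    at = alpha / np.sqrt(1.0 + beta**2 / n)     # alpha-tilde of Section 3.1

    def score(mu):
        nu = at + bt * mu
        if stop:
            corr = -bt * norm.pdf(nu) / norm.cdf(nu)
        else:
            corr = bt * norm.pdf(nu) / (1.0 - norm.cdf(nu))
        return (y - mu).sum() + corr

    return brentq(score, -10.0, 10.0)

mu_true, n, alpha, beta = 0.0, 100, 0.0, 1.0
joint, cond = [], []
for _ in range(1000):
    y, stop = one_trial(mu_true, n, alpha, beta)
    joint.append(y.mean())                      # joint-likelihood estimator
    cond.append(conditional_mle(y, stop, n, alpha, beta))
print("joint mean:", np.mean(joint), " conditional mean:", np.mean(cond))
```

For the setting n = 100, μ = 0, α = 0, β = 1, both averages should be close to zero, in line with the corresponding row of Table 1.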

It is clear from Table 1 that both estimators are asymptotically unbiased, owing to their likelihood basis. This will be studied further using the likelihood equations, in Section 3.4. Indeed, (25) is the marginal score, and its expectation should be taken over the entire sample space, where each E(Y_i) = μ. For the other two, (19) and (22), the expectation should be taken over the conditional space, given Z = z. Fortunately, the score itself is derived for this space, hence a consistent estimator follows based on standard likelihood theory. More formal arguments will be given in Section 3.4.

The results in Table 1 suggest that the precision is slightly higher for the joint approach than for the conditional one. This may appear counterintuitive at first sight, because in many practical applications a conditional approach is known to be more precise. However, this intuition draws from our common experience of conditioning on an ancillary statistic, which is not the case here. The superior precision for the joint estimator that is seen in Table 1 can in fact be predicted from the likelihood equations. The expected information in the joint approach is

I(\mu) = E(N) = n\,\Phi(\nu) + 2n\,[1-\Phi(\nu)] = n\,[2-\Phi(\nu)],   (27)

where \tilde\alpha, \tilde\beta, and \nu = \tilde\alpha+\tilde\beta\mu are as above. In the conditional case, this is

I_c(\mu) = \left\{n - \tilde\beta^2\,\frac{\phi(\nu)\,[\nu\,\Phi(\nu)+\phi(\nu)]}{\Phi(\nu)^2}\right\}\Phi(\nu) + \left\{2n + \tilde\beta^2\,\frac{\phi(\nu)\,\{\nu\,[1-\Phi(\nu)]-\phi(\nu)\}}{[1-\Phi(\nu)]^2}\right\}[1-\Phi(\nu)] = n\,[2-\Phi(\nu)] - \tilde\beta^2\,\frac{\phi(\nu)^2}{\Phi(\nu)\,[1-\Phi(\nu)]}.   (28)

When n → ∞, the information approaches

I_c(\mu) \;\approx\; n\left\{[2-\Phi(\alpha+\beta\mu)] - \frac{1}{n}\,\frac{\beta^2\,\phi(\alpha+\beta\mu)^2}{\Phi(\alpha+\beta\mu)\,[1-\Phi(\alpha+\beta\mu)]}\right\}.

It is then clear that the ‘information deficit’ will tend to zero when n tends to infinity. In the limiting case, when β → +∞, the second term in (28) approaches

\frac{n\,\phi(\sqrt{n}\,\mu)^2}{\Phi(\sqrt{n}\,\mu)\,[1-\Phi(\sqrt{n}\,\mu)]}.

This term is non-zero for finite n but can be shown to approach 0 if n → ∞.
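The size of this information deficit is easily evaluated numerically; the sketch below (parameter values are illustrative) computes (27) and (28) for increasing n:

```python
import numpy as np
from scipy.stats import norm

def info_joint(mu, n, alpha, beta):
    nu = (alpha + beta * mu) / np.sqrt(1.0 + beta**2 / n)
    return n * (2.0 - norm.cdf(nu))                      # expression (27)

def info_conditional(mu, n, alpha, beta):
    s = np.sqrt(1.0 + beta**2 / n)
    nu, bt = (alpha + beta * mu) / s, beta / s
    deficit = bt**2 * norm.pdf(nu)**2 / (norm.cdf(nu) * (1.0 - norm.cdf(nu)))
    return info_joint(mu, n, alpha, beta) - deficit      # expression (28)

for n in (10, 100, 1000):
    Ij = info_joint(0.0, n, 0.0, 1.0)
    Ic = info_conditional(0.0, n, 0.0, 1.0)
    print(n, round(Ij, 3), round(Ic, 3), "relative deficit:", round(1 - Ic / Ij, 4))
```

The relative deficit decreases with n, in agreement with the limiting argument above.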

We conclude that the conditional estimator is less precise than the joint one, in contrast to many familiar settings such as contingency table analyses. The important feature here is that the conditioning is on a non-ancillary statistic. We have also seen that the joint approach leads to the ordinary sample average, an estimator that has met with considerable concern in the past in the sequential setting. These and related issues are studied in Section 3.2.

As an aside, the problem that exists with use of the expected, as opposed to the observed, information matrix, as elucidated in Kenward and Molenberghs [12], remains. In summary, likelihood inference is valid, provided that the observed information matrix is used [3]. We return to the maximum likelihood estimators in Section 3.4.

3.2 Complete and Incomplete sufficient Statistics

We now consider the role of completeness in this setting. For this, our argument owes much to that in Liu and Hall [8]. A sufficient statistic for a group sequential trial, of the type we are considering and in general, is (K, N) as defined above. Liu and Hall [8] show that (K, N) is incomplete in general group sequential trials. This means that the Lehmann–Scheffé theorem cannot be invoked (see Section 2); that is, if a statistic is unbiased, complete, and sufficient for a parameter μ, then it is the best mean-unbiased estimator for μ. Of course, this in itself does not mean that there is no such optimal estimator, but Liu and Hall [8] show directly that an optimal estimator does not exist in the group sequential case. While our simple sequential trial has only two possible sample sizes n and 2n, we have introduced a more general, probabilistic, stopping rule (5) than is usually considered in the group sequential setting, and so we need to extend appropriately the results of Liu and Hall [8]. We demonstrate in this subsection that, given the lack of completeness of the sufficient statistic, there exist classes of unbiased estimators in this problem for which no member is uniformly optimal.

Continuing with the same example in which the outcomes are assumed to be normally distributed with mean μ and variance 1, and applying the results of Liu and Hall8 to our sequential-trial case for either stopping at N = n or continuing to N = 2n, we find that the joint density for K and N can be written

p_\mu(N,k) = p_0(N,k)\,\exp\!\left(k\mu - \tfrac{1}{2}N\mu^2\right),

with p_0(N, k) = f_0(N, k) for k in the region consistent with the sample size N (for N = n this is the stopping region k > 0), and 0 otherwise, where

f_0(n,k) = \phi_n(k)\,I(k>0),   (29)
f_0(2n,k) = \int_{-\infty}^{0} \phi_n(z)\,\phi_n(k-z)\,dz,   (30)

with \phi_s(k) the normal density with mean 0 and variance s.

Now, when we apply the more general probabilistic stopping rule (5) with the probit form F(·) = Φ(·), the deterministic case is obtained by letting β → +∞. Hence, for general β, we obtain the corresponding joint densities by replacing the integration boundaries present in (29) and (30) by the corresponding probabilities:

p_0(n,k) = \phi_n(k)\,\Phi\!\left(\alpha+\frac{\beta}{n}k\right),   (31)
p_0(2n,k) = \int_{-\infty}^{+\infty} \phi_n(z)\left[1-\Phi\!\left(\alpha+\frac{\beta}{n}z\right)\right]\phi_n(k-z)\,dz = \phi_{2n}(k)\left[1-\Phi\!\left(\frac{\alpha+\beta k/(2n)}{\sqrt{1+\beta^2/(2n)}}\right)\right],   (32)

the derivation of which is straightforward but tedious; a sketch of the proof is given in Appendix A. When β → +∞, (32) reduces to (30). On the other hand, when β = 0, we recover the CRSS case, for which (31)–(32) reduce to:

p_0(n,k) = \phi_n(k)\,\Phi,   (33)
p_0(2n,k) = \phi_{2n}(k)\,(1-\Phi),   (34)

where Φ ≡ Φ(α) is the random stopping probability. Note that this is indeed a constant, because β = 0 and hence the dependence on the random variable K vanishes.

We now assume that a function g(k, N) exists such that its expectation is zero. Such a function must satisfy, for all values of μ:

g(k,2n)\,p_0(2n,k) = -\int \phi_n(k-z)\,g(z,n)\,\phi_n(z)\,\Phi\!\left(\alpha+\frac{\beta}{n}z\right)dz.   (35)

Consider the following two examples.

Example 1

Choose

g(k,n) = \tilde\lambda,   (36)

an arbitrary constant. Then it immediately follows from (35) that

g(k,2n) = -\tilde\lambda\,\frac{\Phi\!\left(\frac{\alpha+\beta k/(2n)}{\sqrt{1+\beta^2/(2n)}}\right)}{1-\Phi\!\left(\frac{\alpha+\beta k/(2n)}{\sqrt{1+\beta^2/(2n)}}\right)}.   (37)

When β = 0, the right hand side of (37) is constant and we can set \tilde\lambda = \lambda(1-\Phi), leading to g(k, n) = λ(1 − Φ) and g(k, 2n) = −λΦ.
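A quick Monte Carlo sanity check on this construction is sketched below (sample size, stopping-rule parameters, and λ̃ are arbitrary illustrative choices); it verifies numerically that the pair (36)–(37) has expectation zero under the probabilistic stopping rule, whatever the value of μ:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
n, alpha, beta, lam = 20, 0.3, 1.5, 2.0

def g(k, N):
    """Zero-expectation function of Example 1: (36) for N = n, (37) for N = 2n."""
    if N == n:
        return lam
    v = (alpha + beta * k / (2 * n)) / np.sqrt(1 + beta**2 / (2 * n))
    return -lam * norm.cdf(v) / (1 - norm.cdf(v))

for mu in (-0.5, 0.0, 0.5):
    vals = []
    for _ in range(50_000):
        y1 = rng.normal(mu, 1.0, n)
        if rng.uniform() < norm.cdf(alpha + beta * y1.sum() / n):   # stop at N = n
            vals.append(g(y1.sum(), n))
        else:                                                       # continue to N = 2n
            k = y1.sum() + rng.normal(mu, 1.0, n).sum()
            vals.append(g(k, 2 * n))
    print("mu =", mu, " mean of g(K, N):", round(float(np.mean(vals)), 3))
```

Up to Monte Carlo error, the reported averages are zero for every μ, as the construction requires.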

Example 2

Choose

g(k,n) = \frac{\lambda}{\Phi\!\left(\alpha+\frac{\beta}{n}k\right)},   (38)

with λ a given constant, then

g(k,2n) = -\frac{\lambda}{1-\Phi\!\left(\frac{\alpha+\beta k/(2n)}{\sqrt{1+\beta^2/(2n)}}\right)}.   (39)

Such g(k, N) functions lead to entire classes of estimators. To see this, assume that an estimator for μ is available, \hat\mu say. For example, \hat\mu could be the sample average

\hat\mu = \frac{1}{N}\,K,   (40)

or a generalization of this, the generalized sample average

\breve\mu \equiv \frac{K}{N}\left[c\,I(N=n) + d\,I(N=2n)\right] = K\left[\frac{c\,I(N=n)}{n} + \frac{d\,I(N=2n)}{2n}\right],   (41)

for constants c and d. We will return to (40) and (41), but for now we concentrate on the construction of classes of functions.

Example 1 (continued)

Choosing (36) and (37) for the special case of β = 0 leads to the following class of estimators:

\hat\mu_\lambda = \hat\mu + \lambda\left[(1-\Phi)\,I(N=n) - \Phi\,I(N=2n)\right].   (42)

It follows directly from the construction of g(k, N) that E(\hat\mu) = E(\hat\mu_\lambda) and hence, if \hat\mu is unbiased, then so is \hat\mu_\lambda.

For the variance of (42), we obtain \mathrm{var}(\hat\mu_\lambda) = \mathrm{var}(\hat\mu) + \lambda^2\,\Phi(1-\Phi), which, within this class, is minimal for λ = 0. Hence, for β = 0, i.e., the CRSS case, the original estimator is more efficient than any member of the new class. This will change when β ≠ 0. We also need to consider the basic estimator itself, e.g., either (40) or (41). Before moving on to this, we first complete the second example.

Example 2 (continued)

Choosing (38) and (39) produces the estimator

\tilde\mu_\lambda = \hat\mu + \lambda\left[\frac{I(N=n)}{\Phi\!\left(\alpha+\frac{\beta}{n}k\right)} - \frac{I(N=2n)}{1-\Phi\!\left(\frac{\alpha+\beta k/(2n)}{\sqrt{1+\beta^2/(2n)}}\right)}\right].   (43)

3.3 The Generalized Sample Average

The two examples above both take the form of augmented sample averages. It is, however, of interest to consider the sample average, or rather its generalization (41), in its own right. This is relevant here because several authors in the group sequential literature have reported that the sample average is biased [6, 20]. However, as shown at the beginning of Section 3.1, the sample average is the maximum likelihood estimator from the joint model.

Consider now the generalized sample average estimator (41). Clearly, the arithmetic sample average is the special case of this for which c = d = 1. Straightforward but tedious algebra leads to the expectation (Appendix A):

E(\breve\mu) = d\mu + (c-d)\,\mu\,\Phi(\nu) + \frac{2c-d}{2n}\,\frac{\beta}{\sqrt{1+\beta^2/n}}\,\phi(\nu),   (44)

with \nu = \frac{\alpha+\beta\mu}{\sqrt{1+\beta^2/n}}.

Consider first β = 0; then we have again that Φ ≡ Φ(α), and hence the estimator (41) is unbiased if and only if

d = \frac{1-c\,\Phi}{1-\Phi}.   (45)

Clearly, c = d = 1 is a solution, so the ordinary sample average is an unbiased estimator. An obvious question then concerns the optimal choices of c and d. From first principles (Appendix A), the variance can be derived as

\mathrm{var}(\breve\mu) = \mu^2\,(c-1)^2\,\frac{\Phi}{1-\Phi} + \frac{1}{n}\,\Phi\,c^2 + \frac{(1-c\,\Phi)^2}{2n\,(1-\Phi)},   (46)

which is minimal for

c_{\mathrm{opt}} = 1 - \frac{1}{2n}\,\frac{1-\Phi}{\mu^2+\frac{2-\Phi}{2n}}, \qquad d_{\mathrm{opt}} = 1 + \frac{1}{2n}\,\frac{\Phi}{\mu^2+\frac{2-\Phi}{2n}}.   (47)

These coefficients are different from 1 and are not uniform across μ. So, for finite samples the ordinary sample average is not optimal, although it is so asymptotically.
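The magnitude of the efficiency loss of the ordinary sample average in the CRSS case can be gauged with a few lines of code (the values of μ, n, and Φ below are arbitrary illustrative choices):

```python
import numpy as np

def var_gsa(c, mu, n, Phi):
    """Variance (46) of the unbiased generalized sample average (beta = 0, sigma = 1);
    the constraint (45) on d is already imposed in this expression."""
    return mu**2 * (c - 1)**2 * Phi / (1 - Phi) + Phi * c**2 / n \
           + (1 - c * Phi)**2 / (2 * n * (1 - Phi))

def c_opt(mu, n, Phi):
    """Optimal coefficient c from expression (47)."""
    return 1 - (1 - Phi) / (2 * n * (mu**2 + (2 - Phi) / (2 * n)))

mu, n, Phi = 0.3, 10, 0.5
print("var at c = d = 1:", round(var_gsa(1.0, mu, n, Phi), 5))
print("var at c_opt    :", round(var_gsa(c_opt(mu, n, Phi), mu, n, Phi), 5))
```

The gain from the optimal coefficients is real but modest, and it shrinks as n grows, in line with the asymptotic optimality of c = d = 1.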

When β ≠ 0, the expression (44) does not in general simplify. Suppose that we assume existence of a uniformly unbiased estimator, i.e., that there exist c and d such that (44) reduces to μ, for all μ, and in particular for μ = 0. For this special case

0 = \frac{2c-d}{2n}\,\frac{\beta}{\sqrt{1+\beta^2/n}}\,\phi(\nu).

Given that β ≠ 0, this expression leads to the condition 2c = d. Substituting this into (44) produces

E(\breve\mu) = c\,\mu\,[2-\Phi(\nu)],

which equals μ only if c = [2 − Φ(ν)]^{-1}. But given that Φ(ν) is not constant but rather depends on μ, unless β = 0, we see that there can be no uniformly unbiased estimator of the generalized sample average type. In other words, a simple average-type estimator, which merely uses the observed measurements in a least-squares fashion, can never be unbiased unless β = 0.

It is insightful to study the generalized sample average’s asymptotic bias. This has been done for β = 0, with all unbiased solutions given by (45). All other choices for c and d will lead to bias that does not disappear asymptotically.

Turning to β ≠ 0, we begin with the ordinary sample average, i.e. c = d = 1, which has expectation equal to

E(\hat\mu) = \mu + \frac{1}{2n}\,\frac{\beta}{\sqrt{1+\beta^2/n}}\,\phi(\nu) \;\approx\; \mu + \frac{1}{2n}\,\beta\,\phi(\alpha+\beta\mu) \;\longrightarrow\; \mu \qquad (n\to\infty).   (48)

In particular, when β → +∞, we see that

E(\hat\mu) = \mu + \frac{1}{2\sqrt{n}}\,\phi(\sqrt{n}\,\mu) \;\longrightarrow\; \mu \qquad (n\to\infty).   (49)

There exist other choices that also lead to asymptotically unbiased generalized sample averages. For β ≠ 0 but finite, the expectation becomes

E(\breve\mu) \;\longrightarrow\; d\mu + (c-d)\,\mu\,\Phi(\alpha+\beta\mu) \qquad (n\to\infty),   (50)

which equals μ if and only if:

d = \frac{1-c\,\Phi(\alpha+\beta\mu)}{1-\Phi(\alpha+\beta\mu)}.   (51)

While (51) and (45) are similar, there is a crucial difference between these: the latter is independent of μ, while the former is not, except when c = d = 1. In other words, there is no uniformly asymptotically unbiased generalized sample average for finite, non-zero β, except for the ordinary sample average itself.

The situation is different for the deterministic β → ∞, because then (50) becomes

E(\breve\mu) = d\mu + (c-d)\,\mu\,\Phi(\sqrt{n}\,\mu) + \frac{2c-d}{2\sqrt{n}}\,\phi(\sqrt{n}\,\mu) \;\longrightarrow\; \begin{cases} c\mu & \text{if } \mu>0,\\ d\mu & \text{if } \mu<0,\\ 0 & \text{if } \mu=0,\end{cases} \qquad (n\to\infty).   (52)

This provides us with the interesting situation that, for positive μ, c = 1 yields an asymptotically unbiased estimator, regardless of d, with the reverse holding for negative μ. In the special case that μ = 0, both coefficients are immaterial. In addition, we see here as well that the only uniform solution is obtained by requiring that the bias asymptotically vanishes for all values of μ, that is c = d = 1.

We have seen above that, even for β = 0, the sample average is not optimal, and that there is no uniformly optimal solution, even though the sample average is asymptotically optimal. However, the sample average is optimal in the restricted class of estimators that are invariant to future decisions. Indeed, if stopping occurs, then any choice of the coefficient c leads to an unbiased estimator, provided the appropriate d is chosen; but this d pertains to ‘future’ observations that will never be collected. Such dependence can be avoided only by setting both coefficients equal, from which the conventional sample average emerges.

3.4 The Likelihood Estimators Revisited

We can also obtain the asymptotic unbiasedness of the sample average from the fact that it is the maximum likelihood estimator from the joint likelihood (16). In terms of the conditional likelihood, the estimator is obtained from the solution to the score equations (19) and (22). These can be reformulated as:

\tilde S(\mu) = \frac{1}{N}\sum_{i=1}^{N}y_i - \mu - \frac{\beta}{\sqrt{1+\beta^2/n}}\,\phi(\nu)\left\{\frac{I(N=n)}{n\,\Phi(\nu)} - \frac{I(N=2n)}{2n\,[1-\Phi(\nu)]}\right\}.   (53)

The expectation of (53) results from (48), combined with the observation that the probability of stopping is Φ(ν):

E[\tilde S(\mu)] = \mu + \frac{1}{2n}\,\frac{\beta}{\sqrt{1+\beta^2/n}}\,\phi(\nu) - \mu - \frac{\beta}{\sqrt{1+\beta^2/n}}\,\phi(\nu)\left\{\frac{1}{n\,\Phi(\nu)}\,\Phi(\nu) - \frac{1}{2n\,[1-\Phi(\nu)]}\,[1-\Phi(\nu)]\right\} = 0.

Unbiasedness follows directly from the linearity of the score in the data. It follows that, while the conditional estimator is slightly less precise than the joint estimator, as follows from comparing (27) and (28), the former is nevertheless unbiased whereas the latter is only asymptotically so.

In other words, the difference between both score equations is bias-correcting. The correction is a non-linear function of μ and has no closed-form solution, underscoring the point that no simple algebraic function of K and N will lead to the same estimator.

Combining (27) with the bias leads to the mean squared error for the joint estimator

\mathrm{MSE}(\hat\mu) = \frac{1}{n\,[2-\Phi(\nu)]} + \frac{1}{4n^2}\,\tilde\beta^2\,\phi(\nu)^2,   (54)

whereas (28) produces:

\mathrm{MSE}(\hat\mu_c) \approx \frac{1}{n\,[2-\Phi(\nu)]} + \frac{1}{[2-\Phi(\nu)]^2\,\Phi(\nu)\,[1-\Phi(\nu)]\,n^2}\,\tilde\beta^2\,\phi(\nu)^2.   (55)

Comparing both expressions, we see that G(ν) = [2 − Φ(ν)]² Φ(ν)[1 − Φ(ν)] < 4, the inequality being strict. In fact, the maximal value of G(ν) equals 0.619. Hence, the joint estimator has the smaller MSE of the two, even though the difference will be very small for moderate to large sample sizes. This holds regardless of the choice of α, β, and n, and of the true value of μ. For β finite and n → ∞, ν approaches α + βμ and \tilde\beta approaches β. Then Φ(α + βμ) and ϕ(α + βμ) become constant and the difference between the two expressions disappears, because the second terms on the right hand sides of (54) and (55) are of the order of 1/n².
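The bound on G(ν) quoted above is easily confirmed numerically, for example with the following snippet (the grid choice is arbitrary):

```python
import numpy as np
from scipy.stats import norm

nu = np.linspace(-5, 5, 200_001)
P = norm.cdf(nu)
G = (2 - P)**2 * P * (1 - P)
print(round(G.max(), 3))   # approximately 0.619, well below 4
```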

A different argument applies in the deterministic case, when β → +∞, because then \nu = \sqrt{n}\,\mu and \tilde\beta = \sqrt{n}, producing

\mathrm{MSE}(\hat\mu \mid \beta\to\infty) = \frac{1}{n\,[2-\Phi(\sqrt{n}\,\mu)]} + \frac{1}{4n}\,\phi(\sqrt{n}\,\mu)^2, \qquad \mathrm{MSE}(\hat\mu_c \mid \beta\to\infty) \approx \frac{1}{n\,[2-\Phi(\sqrt{n}\,\mu)]} + \frac{\phi(\sqrt{n}\,\mu)^2}{[2-\Phi(\sqrt{n}\,\mu)]^2\,\Phi(\sqrt{n}\,\mu)\,[1-\Phi(\sqrt{n}\,\mu)]\,n}.

When n → ∞, and using the same arguments as in Section 3.1, both expressions converge to 1/(2n) if the trial continues and 1/n if the trial stops, and the difference between them disappears. In Section 3.1, we saw that the joint estimator appears to take a simpler form than the conditional one, but this is deceptive, because the latter in fact removes information, as is clear from (27) and (28). Here, we have refined this result by showing that, while the joint estimator exhibits small-sample bias and the conditional one does not, it retains its superiority in terms of mean squared error. Asymptotically, the difference between them vanishes, in both the probabilistic and deterministic cases.

Further, when β = 0, the CRSS case, the expressions for the information and for the mean squared error coincide, given that both estimators reduce to the ordinary sample average. Recall that, for this setting, the estimator is not optimal, and no uniformly optimal estimator exists, a consequence of non-completeness.

Finally, we note that the conditional likelihood estimator is also conditionally unbiased, i.e., it is unbiased for both situations N = n and N = 2n separately. To see this, it is convenient to rewrite the expectation of the generalized sample average (44) as

E(\breve\mu) = c\left\{\mu + \frac{\beta}{\sqrt{1+\beta^2/n}}\,\frac{\phi(\nu)}{n\,\Phi(\nu)}\right\}\Phi(\nu) + d\left\{\mu - \frac{\beta}{\sqrt{1+\beta^2/n}}\,\frac{\phi(\nu)}{2n\,[1-\Phi(\nu)]}\right\}[1-\Phi(\nu)],

in which the first term corresponds to N = n and the second to N = 2n, and from which, for the ordinary sample average (c = d = 1), both conditional expectations E(\hat\mu \mid N) follow:

E(\hat\mu \mid N=n) = \mu + \frac{\beta}{\sqrt{1+\beta^2/n}}\,\frac{\phi(\nu)}{n\,\Phi(\nu)},   (56)
E(\hat\mu \mid N=2n) = \mu - \frac{\beta}{\sqrt{1+\beta^2/n}}\,\frac{\phi(\nu)}{2n\,[1-\Phi(\nu)]}.   (57)

For the specific case β → ∞, these become, either by taking limits or by using expressions for truncated normals [22]:

E(\hat\mu \mid N=n) = \mu + \frac{\phi(\sqrt{n}\,\mu)}{\sqrt{n}\,\Phi(\sqrt{n}\,\mu)},   (58)
E(\hat\mu \mid N=2n) = \mu - \frac{\phi(\sqrt{n}\,\mu)}{2\sqrt{n}\,[1-\Phi(\sqrt{n}\,\mu)]}.   (59)

Using these, it follows that E[\tilde S(\mu) \mid N] = 0.

For the sample average a little more caution is required. From (56) and (57), it follows that E(\hat\mu \mid N) converges to μ at the rate 1/n, because ν → α + βμ. The situation is more subtle when β → ∞. To see this, we take the limit of (58) and (59) as n → ∞. When μ < 0, applying de l'Hôpital's rule whenever needed, the limits are E(\hat\mu \mid N=n) → 0 and E(\hat\mu \mid N=2n) → μ. Similarly, when μ > 0, the corresponding limits are E(\hat\mu \mid N=n) → μ and E(\hat\mu \mid N=2n) → μ/2. When μ = 0, both limits equal 0. Strictly speaking, this shows that there is bias in the conditional means. However, this needs careful qualification: when n grows large, the probability that N = n (N = 2n) for negative (positive) μ itself goes to zero. This implies that, overall, conditional inference based on the ordinary sample average is still acceptable.

In conclusion, the sample average is conditionally and marginally biased, with the bias vanishing as n goes to infinity, except in the situations that correspond to vanishing probabilities. In contrast, the conditional estimator is unbiased, whether considered conditionally on the observed sample size or marginalized over it.

3.5 Ramifications

By starting from a simple sequential trial setting with two possible sample sizes, N = n and N = 2n and normally distributed outcomes, but allowing for a probabilistic stopping rule, expressed as a probit of the form Φ(α + β/n · k), we simultaneously considered three important cases: (a) when β = 0, the CRSS setting; (b) for β ≠ 0 and finite, probabilistic stopping based on prior outcomes is considered, which is an example of MAR; and (c) for β → +∞, a conventional sequential trial.

For all three cases, the sufficient statistic (K, N) is incomplete. It follows that broad classes of estimators with the same expectation can be constructed. This is particularly noteworthy for the completely random stopping case (a), because it implies, in particular, that apart from the sample average, there is an infinite class of so-called generalized sample average estimators, without there being a uniformly optimal one. In particular, the sample average is not optimal, although it does have this property asymptotically. This is very interesting, because the sample size is an ancillary statistic in this case, so the comfort of the ancillarity property, combined with the lack of completeness, creates results that at first sight are counterintuitive.

Cases (b) and (c) share a large number of properties, in the sense that there is little or no difference between the deterministic and probabilistic cases. Rather, the defining feature is the fact that “missingness” is of the MAR rather than the MCAR type. In particular, all generalized sample average estimators are biased, although a sub-class of them, including the ordinary sample average, is asymptotically unbiased.

The sample average also follows as the maximum likelihood estimator from the joint likelihood, underscoring the fact that the asymptotic bias should be zero. For the conditional likelihood, a correction term emerges that reduces the precision but removes the bias. In terms of mean squared error, the ordinary sample average is superior.

While the setting considered here is very specific in the sense that (1) there is only one intermediate look, rather than an arbitrary but fixed number, and (2) the outcomes are assumed normally distributed, the results are sufficiently generic to make the conclusions more broadly valid. Because all of the settings considered contain this probabilistic stopping rule, with the sequential trial as a special case, it follows that in all cases the simple average, combined with the (random) sample or cluster size, is an incomplete statistic. We do however need to make some further points about the more general case and we sketch this in the next section. Following that, in Sections 4 and 5 we deal in more detail with cases (a), random sample size, and (b), missing data, respectively.

3.6 Stopping Rules with Arbitrary Numbers of Looks

We now generalize the two-sample-size case, with N = n or N = 2n, to the general case n_1 < n_2 < ⋯ < n_L, with L a pre-specified maximum number of looks.

In line with (5), we assume that the stopping probability at look j takes the form F(\alpha_j + \frac{\beta_j}{n_j}k), for j = 1, …, L. As before, if all β_j = 0, then the completely random sample size case follows, and if all β_j → +∞, then a sequential-trial setting follows. Based upon earlier derivations and the work of Liu and Hall [8], it follows that

p_0(n_j,k) = f_0(n_j,k)\,F\!\left(\alpha_j+\frac{\beta_j}{n_j}k\right), \qquad f_0(n_1,k) = \phi_{n_1}(k), \qquad f_0(n_j,k) = \int f_0(n_{j-1},z)\left[1-F\!\left(\alpha_{j-1}+\frac{\beta_{j-1}}{n_{j-1}}z\right)\right]\phi_{n_j-n_{j-1}}(k-z)\,dz,

for j = 2, …, L. A zero-expectation function g(K, N) needs to satisfy

\sum_{j=1}^{L} e^{-\frac{1}{2}n_j\mu^2}\int g(k,n_j)\,p_0(n_j,k)\,e^{k\mu}\,dk = 0.

Using the same approach as in Section 3.2, such a function is obtained by choosing g(K, N) for N = n1, …, nL−1, and then computing, for N = nL:

g(k,n_L) = -\frac{1}{p_0(n_L,k)}\sum_{j=1}^{L-1}\int g(z,n_j)\,p_0(n_j,z)\,\phi_{n_L-n_j}(k-z)\,dz.

One obvious choice for N = n1, …, nL−1 is a constant g(k, nj) = aj. This establishes non-completeness in the general case.

The implications will be, as before, that the generalized sample average is biased, and this includes the ordinary sample average, except when βj = 0 for all j, the CRSS case. The latter is important in its own right, and will be studied in the next section.

For the general sequential situation, with either a probabilistic or deterministic stopping rule, the maximum likelihood estimators can be considered. The likelihood takes the form

L(\mu \mid k, n_\ell) = \phi_{n_\ell}(k - n_\ell\mu)\,\prod_{j=1}^{\ell-1}\left[1-F\!\left(\alpha_j+\frac{\beta_j}{n_j}k_j\right)\right] F\!\left(\alpha_\ell+\frac{\beta_\ell}{n_\ell}k_\ell\right)^{I(\ell<L)},   (60)

where k_j denotes the cumulative sum of the outcomes at look j, and k = k_\ell.

Let the latent variable underlying I(N = n_\ell) be T_\ell, for \ell = 1, …, L − 1, defined by

T_\ell \mid y_1,\dots,y_{n_\ell} \sim N\!\left(\alpha_\ell + \frac{\beta_\ell}{n_\ell}\sum_{j=1}^{n_\ell} y_j,\; 1\right), \quad\text{or, equivalently,}\quad T_\ell \mid y_1,\dots,y_{n_L} \sim N\!\left[\alpha_\ell + \frac{\beta_\ell}{n_\ell}\,(1,\dots,1,0,\dots,0)\,(y_1,\dots,y_{n_L})',\; 1\right].

This implies that, for the entire vector T = (T_1, …, T_{L−1})′, the conditional distribution is T | y ~ N(A + Cy, I_{L−1}), with A = (α_1, …, α_{L−1})′ and C the (L − 1) × n_L matrix whose \ellth row equals \beta_\ell/n_\ell in its first n_\ell entries and 0 in its remaining n_L − n_\ell entries. Given that Y ~ N(μ1_{n_L}, I_{n_L}), it follows directly that

\begin{pmatrix} Y \\ T \end{pmatrix} \sim N\!\left[\begin{pmatrix} \mu\,\mathbf{1}_{n_L} \\ A + \beta\mu \end{pmatrix},\; \begin{pmatrix} I_{n_L} & C' \\ C & I_{L-1} + D \end{pmatrix}\right], \qquad D \equiv CC',

where \beta = (\beta_1,\dots,\beta_{L-1})'.

Hence, the set of probabilities of observing a particular sample size takes a multivariate probit form:

P(N = n_\ell) = \Phi_\ell\!\left[\tilde I_\ell\,(\alpha^{(\ell)} + \beta^{(\ell)}\mu);\; I_\ell + D^{(\ell)}\right], \quad \ell = 1,\dots,L-1, \qquad P(N = n_L) = \Phi_{L-1}\!\left[-(\alpha + \beta\mu);\; I_{L-1} + D\right],

where a superscript “(ℓ)” indicates the sub-vector or sub-matrix corresponding to the first ℓ components and \tilde I_\ell is a diagonal matrix with −1 along the main diagonal, except in the (ℓ, ℓ) entry, which is set to +1.

As a result, assuming a probit stopping rule, the conditional likelihood takes the form,

L_c(\mu \mid k, n_\ell) = \frac{\phi_{n_\ell}(k - n_\ell\mu)\,\prod_{j=1}^{\ell-1}\left[1-\Phi\!\left(\alpha_j+\frac{\beta_j}{n_j}k_j\right)\right]\Phi\!\left(\alpha_\ell+\frac{\beta_\ell}{n_\ell}k_\ell\right)^{I(\ell<L)}}{\Phi_\ell\!\left[\tilde I_\ell\,(\alpha^{(\ell)}+\beta^{(\ell)}\mu);\; I_\ell+D^{(\ell)}\right]^{I(\ell<L)}\,\Phi_{L-1}\!\left[-(\alpha+\beta\mu);\; I_{L-1}+D\right]^{I(\ell=L)}}.

The score equations and information matrices are not especially difficult to derive, but require some tedious algebra. Fitting can be done by resorting to numerical methods, possibly replacing the multivariate probit by an appropriate product of univariate probit expressions.
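As an illustration of how these sample-size probabilities can be evaluated in practice, the sketch below (the look schedule and stopping-rule parameters are invented for illustration) simulates the latent vector T from its implied multivariate normal distribution and estimates P(N = n_ℓ) by Monte Carlo; a multivariate probit evaluation by numerical integration could be used instead:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative look schedule and probit stopping-rule parameters (not from the paper).
n_looks = np.array([50, 100, 150])          # n_1 < n_2 < n_3 = n_L, so L = 3
alpha = np.array([0.0, 0.2])                # alpha_1, ..., alpha_{L-1}
beta = np.array([1.0, 1.0])                 # beta_1, ..., beta_{L-1}
mu = 0.1

# C has ell-th row equal to beta_ell/n_ell on the first n_ell positions, 0 elsewhere.
L, nL = len(n_looks), n_looks[-1]
C = np.zeros((L - 1, nL))
for ell in range(L - 1):
    C[ell, :n_looks[ell]] = beta[ell] / n_looks[ell]

mean_T = alpha + C @ (mu * np.ones(nL))     # = alpha + beta * mu
cov_T = np.eye(L - 1) + C @ C.T             # I_{L-1} + D with D = C C'

# P(N = n_ell): continue (T_j <= 0) at looks j < ell and stop (T_ell > 0) at look ell.
draws = multivariate_normal(mean_T, cov_T).rvs(size=200_000, random_state=1)
stop1 = draws[:, 0] > 0
stop2 = (~stop1) & (draws[:, 1] > 0)
probs = [stop1.mean(), stop2.mean(), 1 - stop1.mean() - stop2.mean()]
print("P(N = n_ell), ell = 1, 2, 3:", [round(p, 3) for p in probs])
```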

The arguments used earlier for the N = n, 2n case to obtain the properties of the estimators based on the joint likelihood, the conditional likelihood, and the generalized sample average can be extended in a straightforward way to the current, more general case. We omit details, but indicate now why such a generalization follows. There are two steps. First, earlier results derived for N = n, 2n carry over directly to the N = n_1, n_2 case. Second, when L rather than 2 sample sizes are allowed, the results can be applied sequentially to establish the key results: (a) any generalized sample average is biased, except when β = 0, although a subclass, including the ordinary sample average, will be asymptotically unbiased; (b) the joint likelihood estimator reduces to the ordinary sample average and is asymptotically unbiased; (c) the maximum conditional likelihood estimator is unbiased, yet less efficient than the joint one; (d) as we found in Section 3.4, the joint-likelihood-based ordinary sample average has the smallest mean squared error.

4 Completely Random Sample Size

So far the CRSS case has been dealt with by setting β = 0 in the simple two-sample-size setup. We now extend this to the setting of a sample size N with associated probability function π_n, for n = 0, …, m, and m some upper bound on the sample size. Evidently, \sum_{n=0}^{m}\pi_n = 1. When only two of these are non-zero, for n and 2n, the earlier results are recovered.

4.1 Incomplete Sufficient Statistics

In this case, p0(n, k) = ϕn(k)πn. The condition that needs to be satisfied for a function g(k, n) to have zero expectation is

\sum_{n=0}^{m} e^{-\frac{1}{2}n\mu^2}\int g(k,n)\,p_0(n,k)\,e^{\mu k}\,dk = 0.

The above expression can be rewritten as

\sum_{n=0}^{m}\pi_n\int \phi_{m-n}(k)\,e^{\mu k}\,dk\;\int g(k,n)\,\phi_n(k)\,e^{\mu k}\,dk = 0.

Using the uniqueness of the Laplace transform, we find

\sum_{n=0}^{m}\pi_n\int g(z,n)\,\phi_n(z)\,\phi_{m-n}(k-z)\,dz = 0.   (61)

Choosing g(z, n) = bn, i.e., a constant not depending on k, (61) implies

\phi_m(k)\,\sum_{n=0}^{m}\pi_n\,b_n = 0.

This is possible only when

\sum_{n=0}^{m}\pi_n\,b_n = 0.   (62)

In other words, (62) states that the vector b is orthogonal to the probability vector π.

4.2 Generalized Sample Average

Consider for this case a generalized sample average estimator of the form

\breve\mu = \sum_{n=0}^{m}\frac{K}{n}\,a_n\,I(N=n).   (63)

The expectation of (63) equals μ if and only if the vector a can be written as a = 1 + b, with b satisfying (62) and 1 a (m + 1)-vector of ones. In other words, the existence of an entire class of generalized sample average estimators is directly linked to the incompleteness of the sufficient statistic (K, N). One possible solution is given by an = 1, which leads to the simple sample average. This provides an additional reason to derive the variance, and explore the existence, of an optimal choice for the coefficients.

We now assume, in a slight generalization of the previous setting, that Y_i ~ N(μ, σ²). The variance follows from the usual conditional variance decomposition and takes the form

\mathrm{var}(\widehat{\mu}) = \mu^{2}\sum_{n=0}^{m}(a_n-1)^{2}\pi_n + \sigma^{2}\sum_{n=0}^{m}\frac{a_n^{2}}{n}\,\pi_n. (64)

If a_n = 1, the variance reduces to

\mathrm{var}(\widehat{\mu}) = \sigma^{2}\sum_{n=0}^{m}\frac{\pi_n}{n}, (65)

as expected.

To derive the optimal estimator, i.e., the optimal choice for the vector a, we begin by writing down the objective function

F(a,\lambda) = \mu^{2}\sum_{n=0}^{m}(a_n-1)^{2}\pi_n + \sigma^{2}\sum_{n=0}^{m}\frac{a_n^{2}}{n}\,\pi_n - 2\lambda\left(\sum_{n=0}^{m}a_n\pi_n - 1\right), (66)

where λ is a Lagrange multiplier. The optimum follows from setting the derivatives of F(a, λ) with respect to the components a_n and λ equal to zero. First, we find that

a_n = \frac{1}{\mu^{2}+\frac{\sigma^{2}}{n}}\,(\lambda + \mu^{2}). (67)

Multiplying (67) by π_n and summing leads to

\lambda = \left(\sum_{k=0}^{m}\frac{\pi_k}{\mu^{2}+\frac{\sigma^{2}}{k}}\right)^{-1} - \mu^{2},

which in turn leads to

a_n = \frac{1}{\left(\mu^{2}+\frac{\sigma^{2}}{n}\right)\displaystyle\sum_{k=0}^{m}\frac{\pi_k}{\mu^{2}+\frac{\sigma^{2}}{k}}}. (68)

We find that the optimal generalized sample average estimator not only differs from the ordinary sample average, but also has weights that depend on the unknown parameters, so that no uniformly optimal choice exists. Again, this is a consequence of the incompleteness of the sufficient statistic.
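For completeness, the following sketch evaluates the optimal weights (68) and compares the resulting variance (64) with that of the ordinary sample average; the probability vector π, μ, and σ are illustrative assumptions. It makes explicit that the optimal weights depend on the unknown μ and σ and hence cannot be used directly in practice.

```python
# Minimal sketch: optimal weights (68) and variance (64) versus the ordinary
# sample average (a_n = 1, variance (65)), under an assumed distribution for N.
import numpy as np

mu, sigma = 2.0, 1.0
sizes = np.array([5, 10, 20])
pi = np.array([0.2, 0.5, 0.3])

denom = (pi / (mu**2 + sigma**2 / sizes)).sum()
a_opt = 1.0 / ((mu**2 + sigma**2 / sizes) * denom)        # equation (68)

def var_gsa(a):                                           # equation (64)
    return mu**2 * ((a - 1.0)**2 * pi).sum() + sigma**2 * (a**2 / sizes * pi).sum()

print(a_opt)                                              # depends on mu and sigma
print(var_gsa(a_opt), var_gsa(np.ones_like(a_opt)))       # optimal <= ordinary
```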

In the special case where only n and 2n are possible sample sizes, with probabilities of occurrence Φ and 1 − Φ, respectively, the earlier result (47) is recovered.

Throughout this section, it has been assumed that m is a finite upper limit to the sample size. It is relatively straightforward to generalize our findings to the case where N ranges over all non-negative integers. For example, N could be assumed to follow a Poisson distribution.

5 Missing Data

The stopping rule with β = 0 that was used throughout Section 3 is a special case of MCAR, whereas, if β ≠ 0, the mechanism can be interpreted as MAR, in the sense that stopping, like dropout, depends on the observed outcomes but not on the subsequent, and potentially unobserved, ones. Note as well that β → ∞ corresponds to the same missing-data setting, but now with a deterministic rather than stochastic rule.

The difference from general missing-data settings, such as those arising in longitudinal studies, is that, up to now, our outcomes have been univariate rather than hierarchical. This implied, as stated in Section 2, that the parameters α and β governing the stopping rule have to be assumed known, because they cannot be estimated from the data. This is why we now sketch a more general missing-data framework and indicate which results apply in this broader setting.

The missing data case can be expressed as follows through the generic form (3):

f(y_i^o, y_i^m \mid \theta, \psi) = f(y_i^o, y_i^m \mid t_i, \theta)\,\frac{f(t_i \mid \psi)}{f(t_i \mid y_i^o, y_i^m, \theta, \psi)} \equiv f(y_i^o, y_i^m \mid t_i, \theta)\,w_i(y_i^o, y_i^m, t_i, \theta, \psi). (69)

Here, t_i is a vector of indicator variables describing the missingness status (observed: 1) of the elements of Y_i. In the case of dropout in a longitudinal study, t_i can be replaced by an integer indicating the occasion at which dropout occurs.

When MAR holds, (69) can be replaced by an expression operating at the observed-data level only:

f(y_i^o \mid \theta, \psi) = f(y_i^o \mid t_i, \theta)\,\frac{f(t_i \mid \psi)}{f(t_i \mid y_i^o, \theta, \psi)} = f(y_i^o \mid t_i, \theta)\,w_i(y_i^o, t_i, \theta, \psi). (70)

In accord with the discussion in Section 2, we need to be careful when considering the extent of ignorability. When the model is formulated in a pattern-mixture way, the marginal observed-data distribution is governed by θ* and ψ*, and separability does not apply. In other words, ancillarity does not hold and the missing-data model needs to be taken into account for a valid analysis. This is no longer necessary when the model is formulated in selection-model terms and under MAR, when pure likelihood or Bayesian inferences are drawn. For semi-parametric methods, inverse probability weighting can be used, noting that the missing-data model is required for this; degeneracy, however, poses severe problems in this case, in the sense that some weights will not be defined and the marginal model cannot be fully recovered. For this reason the weights need to be bounded away from zero23-24.
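A minimal sketch of such a weighted analysis, with the selection probabilities bounded away from zero so that the weights remain finite, is given below; the dropout model, the truncation level, and the data-generating choices are illustrative assumptions and not part of the formal development.

```python
# Minimal sketch of inverse probability weighting under MAR, with selection
# probabilities truncated away from zero (equivalently, weights kept finite).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n_sub = 5_000
y1 = rng.normal(0.0, 1.0, n_sub)
y2 = rng.normal(1.0 + 0.8 * y1, 1.0)            # second measurement, true mean 1.0
p_obs = norm.cdf(0.2 + 1.0 * y1)                # assumed MAR model: P(T=1 | y1)
t = rng.uniform(size=n_sub) < p_obs             # T = 1 means y2 is observed

w = 1.0 / np.clip(p_obs, 0.05, None)            # probabilities bounded below by 0.05
mu2_ipw = np.sum(w[t] * y2[t]) / np.sum(w[t])   # IPW estimate of E(Y2)
mu2_cc = y2[t].mean()                           # complete-case average, biased
print(mu2_ipw, mu2_cc)                          # IPW close to 1.0
```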

Example: Normally Distributed Outcomes and Probit Dropout Model

As with the sequential trial case, it is instructive to consider an example. Assume that Y_i ~ N(μ_i, Σ_i) and that the dropout model takes the form:

f(t_i \mid y_i^o) = \prod_{j=2}^{t_i-1}\left[1-\Phi(\alpha_j + \beta_j y_{i,j-1})\right]\,\Phi(\alpha_{t_i} + \beta_{t_i} y_{i,t_i-1})^{I(t_i \le n_i)}. (71)

For simplicity, and to emphasize the connections with our initial sequential trial setting, we partition Y_i = (Y_{i1}', Y_{i2}')', assuming that the sub-vector Y_{i1} is always observed and Y_{i2} is potentially missing. We partition μ_i and Σ_i accordingly. We now let t_i take two values only: t_i = 1 if the full vector is observed for subject i and t_i = 0 if only the first sub-vector is observed. The observed-data likelihood is

L = \prod_{i=1}^{N}\phi(y_i^o;\mu_i,\Sigma_i)\,\Phi(\alpha + y_{i1}'\beta)^{t_i}\left[1-\Phi(\alpha + y_{i1}'\beta)\right]^{1-t_i}. (72)

It is convenient to introduce the latent variable Zi that determines Ti:

Z_i \mid y_i \sim N\!\left[\alpha + (y_{i1}', y_{i2}')\begin{pmatrix}\beta\\ 0\end{pmatrix};\; 1\right].

Given that Yi and Zi are then jointly normally distributed, standard manipulations lead to

\begin{pmatrix} Y_{i1}\\ Y_{i2}\\ Z_i \end{pmatrix} \sim N\!\left[\begin{pmatrix}\mu_{i1}\\ \mu_{i2}\\ \alpha+\beta'\mu_{i1}\end{pmatrix},\;\begin{pmatrix}\Sigma_{i11} & \Sigma_{i12} & \Sigma_{i11}\beta\\ \Sigma_{i21} & \Sigma_{i22} & \Sigma_{i21}\beta\\ \beta'\Sigma_{i11} & \beta'\Sigma_{i12} & 1+\beta'\Sigma_{i11}\beta\end{pmatrix}\right], (73)

from which the marginal probability of dropping out can be obtained

P(T_i = 1) = \Phi\!\left(\frac{\alpha + \beta'\mu_{i1}}{\sqrt{1+\beta'\Sigma_{i11}\beta}}\right). (74)

As in the sequential trial setting, it is interesting to consider the deterministic case, where β approaches infinity, in the sense that, for example, β = λβ0 and λ → +∞. Then, the limiting probability is

P(T_i = 1) = \Phi\!\left(\frac{\beta_0'\mu_{i1}}{\sqrt{\beta_0'\Sigma_{i11}\beta_0}}\right). (75)

In a similar way to expression (14), the ‘weight’ connecting the marginal (selection model) to the pattern-mixture formulation takes the form:

w(y_i^o) = \frac{\Phi\!\left(\dfrac{\alpha + \beta'\mu_{i1}}{\sqrt{1+\beta'\Sigma_{i11}\beta}}\right)}{\Phi(\alpha + y_{i1}'\beta)}. (76)

This ‘weight’ shows that, even though the selection model may be separable, the corresponding pattern-mixture model is not. Again, the degeneracy of the denominator in (76) when λ → +∞ underscores the parallels with the degeneracy of inverse probability weighting methods under such deterministic mechanisms.
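The following minimal sketch evaluates (74) and (76) for a scalar observed component and shows how the denominator of (76) degenerates as the mechanism is made deterministic; all numerical values are illustrative assumptions.

```python
# Minimal sketch: marginal probability (74) and weight (76) in the scalar case,
# with beta = lambda * beta0 and lambda increasing (deterministic limit).
import numpy as np
from scipy.stats import norm

alpha, beta0, mu1, sig11 = -0.5, 1.0, 0.2, 1.0       # scalar case: Sigma_i11 = sig11

def prob_T1(beta):                                    # equation (74)
    return norm.cdf((alpha + beta * mu1) / np.sqrt(1.0 + beta**2 * sig11))

def weight(y1, beta):                                 # equation (76)
    return prob_T1(beta) / norm.cdf(alpha + beta * y1)

for lam in (1.0, 5.0, 50.0):
    b = lam * beta0
    print(lam, prob_T1(b), weight(-2.0, b))           # denominator -> 0, weight blows up
```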

From the above expressions, it can be seen that similar algebra establishes the properties of the maximum likelihood estimators. First, consider the joint likelihood (72) and assume the mean vector and covariance matrix are constant across units. The more general case readily follows as well. Using a partition as in (73), the maximum likelihood estimators are1:

\widehat{\mu}_1 = \frac{1}{N}\sum_{i=1}^{N} y_{i1}, (77)

\widehat{\mu}_2 = \frac{1}{n}\sum_{i=1}^{n} y_{i2} - \Sigma_{21}\Sigma_{11}^{-1}\left(\frac{1}{n}\sum_{i=1}^{n} y_{i1} - \frac{1}{N}\sum_{i=1}^{N} y_{i1}\right), (78)

where there are N subjects altogether, of which n are fully observed and N − n are observed only in the sub-vector Y_{i1}.
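To illustrate (77)–(78), the sketch below computes both estimators from simulated incomplete bivariate normal data, with Σ treated as known for simplicity, and contrasts them with the naive average of the observed y_{i2}; the data-generating choices and the MAR dropout rule are illustrative assumptions.

```python
# Minimal sketch: the estimators (77)-(78) for a bivariate normal with the second
# component missing at random, versus the naive average of the observed y2's.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
N, mu, Sigma = 2_000, np.array([0.0, 1.0]), np.array([[1.0, 0.6], [0.6, 1.0]])
y = rng.multivariate_normal(mu, Sigma, size=N)
observed = rng.uniform(size=N) < norm.cdf(0.5 + 1.0 * y[:, 0])   # MAR: depends on y1 only

y1, y2_obs = y[:, 0], y[observed, 1]
mu1_hat = y1.mean()                                               # equation (77)
mu2_hat = y2_obs.mean() - (Sigma[1, 0] / Sigma[0, 0]) * (y[observed, 0].mean() - mu1_hat)  # (78)
print(mu1_hat, mu2_hat, y2_obs.mean())    # (78) is close to 1.0; the naive average is not
```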

Note that there is a fundamental difference between this and the sequential-trial case. Because of the correlation between the repeated measurements, (78) is not a simple sample average, though it could be termed a prediction-corrected sample average. It only takes the form of a simple sample average when the first set of measurements is independent of the second, i.e., when Σ_{12} = 0. As in the sequential-trial case, we see that the joint estimator does not involve the missing-data model. This is again the familiar property of ignorability, applied in the selection-model framework. However, the (generalized) sample average is biased, because of its failure to take f(y_i^m | y_i^o, θ) into account. Again, we consider the conditional likelihood. The counterpart to (13) is obtained by dividing (72) by the appropriate function of (74):

L_c = \prod_{i=1}^{N}\frac{\phi(y_i^o;\mu_i,\Sigma_i)\,\Phi(\alpha + y_{i1}'\beta)^{t_i}\left[1-\Phi(\alpha + y_{i1}'\beta)\right]^{1-t_i}}{\Phi\!\left(\frac{\alpha + \beta'\mu_{i1}}{\sqrt{1+\beta'\Sigma_{i11}\beta}}\right)^{t_i}\left[1-\Phi\!\left(\frac{\alpha + \beta'\mu_{i1}}{\sqrt{1+\beta'\Sigma_{i11}\beta}}\right)\right]^{1-t_i}}. (79)

Although we may be primarily interested in the parameters governing μ and/or the covariance parameters, the missing-data mechanism will enter the likelihood equations through contributions based on the denominator of (79). This likelihood is of the pattern-mixture form: as seen earlier, ignorability is model-framework specific, and the fact that we are estimating selection-model parameters through a pattern-mixture likelihood makes the missing-data mechanism non-ignorable, even under MAR. It can be shown for this setting as well (details not given) that, while the conditional estimator is unbiased and the marginal one is only asymptotically unbiased, the latter has the smaller mean squared error.

6 Concluding Remarks

The developments in this paper show that, when the sample size is random, an estimator as familiar as the sample average has frequentist properties that are not altogether straightforward. This arises from the interplay between the properties of (non-)ancillarity and incompleteness.

In the case of a completely random sample size, the sample average is unbiased but not optimal. It is one member of a class of unbiased generalized sample averages, and this class has no uniformly optimal member. However, the sample average does converge to the optimum with increasing sample size. The ordinary average is unique in having weights that do not change regardless of when exactly the sequence is stopped; this is a definite advantage. In the light of these results, the sample average remains the most sensible practical choice.

In the cases of univariate incomplete data and sequential trials, the sample average is biased in small samples but asymptotically unbiased, both conditionally and marginally. This follows from direct frequentist calculations, as well as from invoking likelihood theory. The exact calculations presented here allow us to gauge the extent of the bias. The sample average follows as the maximum likelihood estimator from the joint likelihood for outcomes and sample size. In spite of the confusion in the literature about estimation following a random sample size, this result is not totally surprising, given that it follows from invoking ignorability in the likelihood context. By contrast, the maximum conditional likelihood estimator is unbiased, but the ordinary sample average still has the smaller mean squared error. This estimator can be interpreted as the one based on the pattern-mixture decomposition of the likelihood. Taken together, these results indicate that the use of the ordinary sample average in a sequential trial is a sensible choice, provided an estimator of precision based on the observed information is used8,12,19.

The majority of our derivations have been in terms of specific assumptions, namely a probabilistic stopping rule of a probit form, standard normally distributed outcomes, and two possible sample sizes, n and 2n. Extensions to (1) arbitrary sample sizes in the completely random sample size setting, (2) general sequential trials with L > 1 looks, (3) normal outcomes with other than unit variance, and (4) more general missing-data problems have also been considered. For (4) it has been shown that the ordinary sample average is no longer (asymptotically) unbiased, because the joint-likelihood estimator, again following from invoking ignorability under MAR, involves the predictive mean of the unobserved outcomes given the observed ones. This well-known correction1 is a direct consequence of the correlation between observed and unobserved measurements.

Although the probit form for the stopping rule/missing data mechanism and the normality of the outcomes may be seen as limitations, the developments are generic. The advantage of our choices is the existence of explicit expressions that permit insightful interpretation. These choices are also practically relevant for many settings.

The results developed here have important practical implications for a wide range of settings. For sequential trials, there has been long-standing confusion and controversy regarding the (in)appropriateness of the sample average when estimating a parameter after such a trial. Our results show that the ordinary sample average, while small-sample biased and not uniform minimum variance unbiased, is perfectly acceptable. This should be seen against the background of the conditional likelihood estimator: even though the latter is small-sample unbiased, it suffers from a slightly increased mean squared error. Thus, while some familiar properties no longer hold, estimation after sequential trials is more straightforward than commonly considered, and there is less need for complicated, modified estimators than perhaps generally thought, given that the ordinary sample average can be used without trouble. While our results were, for clarity, presented for a fairly simple setting, they carry over to general studies, with multiple looks, different types of stopping rules, and so on.

In the same vein, many clinical studies are prone to incompleteness. Our results provide a bridge between the sequential-trial and incomplete-data settings; hence, very similar considerations apply. The main difference between the two is that incomplete data frequently occur in a setting where several measurements are collected on the same subject, and that the missing-data mechanism is generally, but not always, stochastic rather than deterministic. Our derivations show that these differences are not fundamental and nicely connect likelihood-based ignorability theory to the validity of the simple sample average in the sequential-trial setting.

Further work will draw out similar results for studies with random cluster sizes, and for studies where longitudinal and time-to-event data are collected simultaneously. Also the missing-data case will be scrutinized further.

Our results have implications for several broad settings not considered in detail here. Examples are clusters with random size, censored time-to-event outcomes, joint modeling of longitudinal and survival outcomes, and random observation times. These will be the subject of further research.

Acknowledgments

Geert Molenberghs, Mike Kenward, Marc Aerts, and Geert Verbeke gratefully acknowledge support from IAP research Network P6/03 of the Belgian Government (Belgian Science Policy). The work of Anastasios Tsiatis and Marie Davidian was supported in part by NIH grants P01 CA142538, R37 AI031789, R01 CA051962, and R01 CA085848.

Appendix

A Completeness, Unbiasedness, and Minimum Variance Calculations for the Stopping Rule Cases

Details behind key derivations in Section 3 are given here.

A.1 Derivation of p0(2n, k)

The function takes the form:

p_0(2n,k) = \phi_{2n}(k) - \int \phi_n(z)\,\phi_n(k-z)\,\Phi\!\left(\alpha + \frac{\beta}{n}z\right)dz = \phi_{2n}(k) - \tilde{p}_0(2n,k).

Write

\tilde{p}_0(2n,k) = \frac{1}{(2\pi)^{3/2}\,n}\int_{z=-\infty}^{z=+\infty}\int_{t=-\infty}^{t=\alpha+\frac{\beta}{n}z}\exp\!\left[-\frac{1}{2}\frac{z^{2}}{n}-\frac{1}{2}\frac{(k-z)^{2}}{n}-\frac{1}{2}t^{2}\right]dt\,dz

and apply the change of variables t = \frac{\beta}{n}z + s. The integral becomes

\tilde{p}_0(2n,k) = \frac{1}{(2\pi)^{3/2}\,n}\int_{s=-\infty}^{s=\alpha} e^{-\frac{p}{2n}}\left\{\int_{z=-\infty}^{z=+\infty}\exp\!\left[-\frac{1}{2n}\left(\sqrt{\frac{2n+\beta^{2}}{n}}\,z + m\right)^{2}\right]dz\right\}ds,

with

m = \sqrt{\frac{n}{2n+\beta^{2}}}\,(s\beta - k), \qquad p = \frac{2n^{2}s^{2} + 2n\beta k s + k^{2}(n+\beta^{2})}{2n+\beta^{2}}.

Upon rearranging terms, we obtain:

\tilde{p}_0(2n,k) = \frac{1}{2\pi\sqrt{2n+\beta^{2}}}\;e^{-\frac{1}{2n}\frac{k^{2}\left(n+\frac{1}{2}\beta^{2}\right)}{2n+\beta^{2}}}\int_{s=-\infty}^{s=\alpha}\exp\!\left[-\frac{1}{2n}\frac{\left(\sqrt{2}\,n\,s + \frac{\sqrt{2}}{2}\beta k\right)^{2}}{2n+\beta^{2}}\right]ds.

Applying a further change of variables

q = \left(s + \frac{\beta k}{2n}\right)\sqrt{\frac{2n}{2n+\beta^{2}}},

the integral reduces to

\tilde{p}_0(2n,k) = \phi_{2n}(k)\,\Phi\!\left[\left(\alpha + \frac{\beta k}{2n}\right)\sqrt{\frac{2n}{2n+\beta^{2}}}\right].

A.2 Expectation of the Generalized Sample Average

To show that the expectation of (41) is equal to (44), it is useful to first consider the auxiliary quantity:

I = \int k\,f_N(k)\,\Phi\!\left(A + \frac{B}{N}k\right)dk
= \frac{1}{2\pi\sqrt{N}}\int_{k=-\infty}^{k=+\infty}\int_{t=-\infty}^{t=A+\frac{B}{N}k} k\,\exp\!\left[-\frac{1}{2N}(k-N\mu)^{2}-\frac{1}{2}t^{2}\right]dt\,dk
= \frac{1}{\sqrt{2\pi}}\left(1+\frac{B^{2}}{N}\right)^{-3/2}\int_{s=-\infty}^{s=A}(N\mu - Bs)\exp\!\left[-\frac{(s+\mu B)^{2}}{2+\frac{2B^{2}}{N}}\right]ds, (A.1)

where the last expression follows upon applying the change of variables t = s + (B/N)k, integrating over k, and completing the square. Splitting the integral into two terms and standardizing, we obtain:

I = \frac{1}{\sqrt{2\pi}}\,\frac{1}{\left(1+\frac{B^{2}}{N}\right)^{3/2}}\left\{N\mu\int_{s=-\infty}^{s=A}\exp\!\left[-\frac{1}{2}\left(\frac{s+\mu B}{\sqrt{1+\frac{B^{2}}{N}}}\right)^{2}\right]ds - B\int_{s=-\infty}^{s=A} s\,\exp\!\left[-\frac{1}{2}\left(\frac{s+\mu B}{\sqrt{1+\frac{B^{2}}{N}}}\right)^{2}\right]ds\right\}
= N\mu\,\Phi\!\left(\frac{A+B\mu}{\sqrt{1+\frac{B^{2}}{N}}}\right) + \frac{B}{\sqrt{2\pi}\sqrt{1+\frac{B^{2}}{N}}}\exp\!\left[-\frac{1}{2}\left(\frac{A+B\mu}{\sqrt{1+\frac{B^{2}}{N}}}\right)^{2}\right].

Returning to (41), it follows that

E(\widehat{\mu}) = \frac{c}{n}\int k\,f_n(k)\,\Phi\!\left(\alpha + \frac{\beta}{n}k\right)dk + \frac{d}{2n}\int k\,f_{2n}(k)\left[1-\Phi\!\left(\left(\alpha + \frac{\beta}{2n}k\right)\sqrt{\frac{2n}{2n+\beta^{2}}}\right)\right]dk = d\mu + \frac{1}{2n}\left(2c\,I_1 - d\,I_2\right). (A.2)

Now both I_1 and I_2 are of the form (A.1) with, for I_1: N = n, A = α, and B = β; and, for I_2: N = 2n, A = \alpha\sqrt{\frac{2n}{2n+\beta^{2}}}, and B = \beta\sqrt{\frac{2n}{2n+\beta^{2}}}. As a consequence, (A.2) can be written as:

E(\widehat{\mu}) = d\mu + (c-d)\,\mu\,\Phi(\nu) + \frac{2c-d}{2n\sqrt{2\pi}}\,\frac{\beta}{\sqrt{1+\frac{\beta^{2}}{n}}}\,e^{-\frac{1}{2}\nu^{2}},

with \nu = (\alpha+\beta\mu)\big/\sqrt{1+\frac{\beta^{2}}{n}}. This result is based upon the fact that, for both I_1 and I_2, the identities A\big/\sqrt{1+\frac{B^{2}}{N}} = \alpha\big/\sqrt{1+\frac{\beta^{2}}{n}} and B\big/\sqrt{1+\frac{B^{2}}{N}} = \beta\big/\sqrt{1+\frac{\beta^{2}}{n}} hold. Therefore, (44) follows.
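The closed form just used for I can be checked numerically against direct quadrature, as in the following minimal sketch; the values of N, A, B, and μ are arbitrary illustrative choices.

```python
# Minimal sketch: numerical check of the closed form of the auxiliary integral (A.1).
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

Nn, A, B, mu = 12.0, 0.4, 0.7, 0.3   # illustrative values

num, _ = quad(lambda k: k * norm.pdf(k, Nn * mu, np.sqrt(Nn)) * norm.cdf(A + B * k / Nn),
              -np.inf, np.inf)
s = np.sqrt(1.0 + B**2 / Nn)
closed = (Nn * mu * norm.cdf((A + B * mu) / s)
          + B / (np.sqrt(2 * np.pi) * s) * np.exp(-0.5 * ((A + B * mu) / s)**2))
print(num, closed)   # the two values agree
```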

A.3 Variance for Generalized Sample Average When β = 0

To derive (46), observe:

E(\widehat{\mu}\mid N=n) = c\,\mu, \qquad E(\widehat{\mu}\mid N=2n) = \frac{1-c\Phi}{1-\Phi}\,\mu,

from which it follows that:

\mathrm{var}\!\left[E(\widehat{\mu}\mid N)\right] = \mu^{2}(1-c)^{2}\,\frac{\Phi}{1-\Phi}. (A.3)

Further,

\mathrm{var}(\widehat{\mu}\mid N=n) = \frac{c^{2}}{n}, \qquad \mathrm{var}(\widehat{\mu}\mid N=2n) = \frac{(1-c\Phi)^{2}}{2n(1-\Phi)^{2}},

producing

E\!\left[\mathrm{var}(\widehat{\mu}\mid N)\right] = \frac{\Phi c^{2}}{n} + \frac{(1-c\Phi)^{2}}{2n(1-\Phi)}. (A.4)

Adding (A.3) and (A.4) produces (46).

Calculating the derivative of (46) with respect to c and equating it to zero yields (47). It is easy to verify that this optimum corresponds to a minimum.
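A quick numerical confirmation of (A.3) and (A.4) is sketched below; the stopping probability Φ, n, μ, and the chosen c are illustrative assumptions, and the weight d is taken as (1 − cΦ)/(1 − Φ) so that the estimator is unbiased.

```python
# Minimal sketch: Monte Carlo check of (A.3) + (A.4) when beta = 0.
import numpy as np

rng = np.random.default_rng(4)
n, mu, Phi = 10, 1.5, 0.4

def var_formula(c):                                  # (A.3) + (A.4) added together
    return (mu**2 * (1 - c)**2 * Phi / (1 - Phi)
            + Phi * c**2 / n + (1 - c * Phi)**2 / (2 * n * (1 - Phi)))

def var_mc(c, reps=200_000):
    stop = rng.uniform(size=reps) < Phi              # beta = 0: stopping ignores the data
    N = np.where(stop, n, 2 * n)
    k = rng.normal(mu * N, np.sqrt(N))               # sample sum given N
    d = (1 - c * Phi) / (1 - Phi)                    # unbiasedness constraint
    est = np.where(stop, c * k / n, d * k / (2 * n))
    return est.var()

c = 0.8
print(var_formula(c), var_mc(c))                     # the two values agree
```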

References

1. Little RJA, Rubin DB. Statistical Analysis with Missing Data. John Wiley & Sons; 2002.
2. Wald A. Sequential tests of statistical hypotheses. The Annals of Mathematical Statistics. 1945;16:117–86.
3. Grambsch P. Sequential sampling based on the observed Fisher information to guarantee the accuracy of the maximum likelihood estimator. Annals of Statistics. 1983;11:68–77.
4. Barndorff-Nielsen O, Cox DR. The effect of sampling rules on likelihood statistics. International Statistical Review. 1984;52:309–26.
5. Siegmund D. Estimation following sequential tests. Biometrika. 1978;64:191–99.
6. Hughes MD, Pocock SJ. Stopping rules and estimation problems in clinical trials. Statistics in Medicine. 1988;7:1231–42. doi: 10.1002/sim.4780071204.
7. Emerson SS, Fleming TR. Parameter estimation following group sequential hypothesis testing. Biometrika. 1990;77:875–92.
8. Liu A, Hall WJ. Unbiased estimation following a group sequential test. Biometrika. 1999;86:71–8.
9. Rubin DB. Inference and missing data. Biometrika. 1976;63:581–92.
10. Little RJA. Pattern-mixture models for multivariate incomplete data. Journal of the American Statistical Association. 1993;88:125–34.
11. Little RJA. A class of pattern-mixture models for normal incomplete data. Biometrika. 1994;81:471–83.
12. Kenward MG, Molenberghs G. Likelihood based frequentist inference when data are missing at random. Statistical Science. 1998;13:236–247.
13. Cox DR, Hinkley DV. Theoretical Statistics. Chapman & Hall; 1974.
14. Casella G, Berger RL. Statistical Inference. Duxbury Press; 2001.
15. Basu D. On statistics independent of a complete sufficient statistic. Sankhya. 1955;15:377–80.
16. Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association. 1952;47:663–85.
17. Armitage P. Sequential Medical Trials. Blackwell; 1975.
18. Tsiatis AA, Rosner GL, Mehta CR. Exact confidence intervals following a group sequential test. Biometrics. 1984;40:797–803.
19. Rosner GL, Tsiatis AA. Exact confidence intervals following a group sequential trial: A comparison of methods. Biometrika. 1988;75:723–29.
20. Todd S, Whitehead J, Facey KM. Point and interval estimation following a sequential clinical trial. Biometrika. 1996;83:453–61.
21. Whitehead J. A unified theory for sequential clinical trials. Statistics in Medicine. 1999;18:2271–86. doi: 10.1002/(sici)1097-0258(19990915/30)18:17/18<2271::aid-sim254>3.0.co;2-z.
22. Patel JK, Read CB. Handbook of the Normal Distribution. Marcel Dekker; 1996.
23. Rotnitzky A. Inverse probability weighted methods. In: Fitzmaurice G, Davidian M, Verbeke G, Molenberghs G, editors. Longitudinal Data Analysis. CRC/Chapman & Hall; 2009. pp. 453–476.
24. Vansteelandt S, Carpenter JR, Kenward MG. Analysis of incomplete data using inverse probability weighting and doubly robust estimators. Methodology. 2010;6:37–48.
