Biometrika. 2019 Jul 23;106(4):989–996. doi: 10.1093/biomet/asz038

On nonparametric maximum likelihood estimation with double truncation

J Xiao, M G Hudgens
PMCID: PMC6845852  PMID: 31754284

Summary

Doubly truncated survival data arise if failure times are observed only within certain time intervals. The nonparametric maximum likelihood estimator is widely used to estimate the underlying failure time distribution. Using a directed graph representation of the data, we give a graphical condition, based on Vardi (1985), that holds if and only if the nonparametric maximum likelihood estimate exists and is unique. If this condition does not hold, then such an estimate may exist but need not be unique, so another graphical condition is proposed to check whether such an estimate exists. The conditions are simple to check using existing graphical software. Reanalysis of an AIDS incubation time dataset shows that a nonparametric maximum likelihood estimate does not exist for these data.

Keywords: Graph, Nonparametric estimator, Truncation

1. Introduction

Doubly truncated survival data arise when a failure time Inline graphic is observed only if it lies within a time interval Inline graphic. In this article we will consider an example where the age Inline graphic at onset of AIDS is observed only if it lies between the age at contaminated blood transfusion on the left, Inline graphic, and the age at a particular calendar date on the right, Inline graphic. Doubly truncated data also arise in astronomy; for example, Efron & Petrosian (1999) described a dataset where, due to experimental conditions, quasar luminosity Inline graphic is observed only if it lies in a certain interval Inline graphic.

Suppose that we observe Inline graphic independent copies of three random variables Inline graphic (Inline graphic) where Inline graphic is the left truncation time, Inline graphic is the failure time and Inline graphic is the right truncation time, with Inline graphic. Our objective is to compute the nonparametric maximum likelihood estimate of Inline graphic. If Inline graphic is independent of Inline graphic, then the likelihood is proportional to Inline graphic, where Inline graphic is the density of Inline graphic. Let Inline graphic and Inline graphic. Any two cumulative distribution functions Inline graphic and Inline graphic that are proportional between Inline graphic and Inline graphic, i.e., Inline graphic for some constant Inline graphic and Inline graphic, will give rise to the same likelihood value, regardless of the values of Inline graphic and Inline graphic for Inline graphic (Efron & Petrosian, 1999). We therefore limit our search for the nonparametric maximum likelihood estimate to the class of distribution functions such that Inline graphic, since any such estimator of Inline graphic corresponds to a unique Inline graphic for fixed Inline graphic that maximizes the nonparametric likelihood subject to Inline graphic and Inline graphic.

The nonparametric maximum likelihood estimator for doubly truncated data was introduced by Turnbull (1976) and is widely used in areas such as astronomy (e.g., Lloyd-Ronning et al., 2002). To compute the nonparametric maximum likelihood estimate of Inline graphic, iterative algorithms (Turnbull, 1976; Efron & Petrosian, 1999) have been proposed. Software packages such as the R package DTDA (Moreira et al., 2010; R Development Core Team, 2019) are available to implement these algorithms. However, convergence of these algorithms for a particular dataset does not imply that a nonparametric maximum likelihood estimate actually exists. In fact, such an estimate may not exist, in which case estimates computed using an iterative algorithm may be misleading. Thus, for any dataset it is important to determine whether a nonparametric maximum likelihood estimate exists. In the following we give a necessary and sufficient graphical condition, based on Vardi (1985), for the existence of a unique such estimate. If Vardi’s condition does not hold, then a nonparametric maximum likelihood estimate may exist but not be unique; therefore, we also propose a new condition for checking whether such an estimate exists. Both conditions are simple to verify using existing software.

2. Existence and uniqueness

A nonparametric maximum likelihood estimator, if it exists, will place mass only on the distinct ordered failure times Inline graphic, where Inline graphic (Vardi, 1985). Consider the space of discrete distribution functions with support on the observed Inline graphic distinct ordered failure times, and let Inline graphic, where Inline graphic for Inline graphic. Let Inline graphic and Inline graphic for Inline graphic, where Inline graphic denotes the indicator function. Then the conditional likelihood for Inline graphic given Inline graphic and Inline graphic is

graphic file with name M53.gif (1)

Define Inline graphic to be a nonparametric maximum likelihood estimator of Inline graphic if Inline graphic subject to the constraints

graphic file with name M57.gif (2)

The nonparametric maximum likelihood estimator of Inline graphic is then Inline graphic.

Depending on the particular dataset Inline graphic observed, a nonparametric maximum likelihood estimate may or may not exist or be unique. A necessary and sufficient condition for its existence and uniqueness is given in Proposition 1. Define a directed graph Inline graphic with Inline graphic vertices, each representing an observation triplet Inline graphic, such that there is a directed edge from vertex Inline graphic to vertex Inline graphic if and only if Inline graphic. A directed path is a sequence of edges that connects a sequence of vertices with all the edges in the same direction. A graph Inline graphic is strongly connected if, for any two vertices Inline graphic and Inline graphic, there exists a directed path from Inline graphic to Inline graphic and a directed path from Inline graphic to Inline graphic. From Vardi (1985), we have the following proposition.

Proposition 1.

There exists a unique nonparametric maximum likelihood estimate if and only if Inline graphic is strongly connected.

To illustrate Proposition 1, consider the two examples given in Table 1. Example (a) is from Efron & Petrosian (1999), and example (b) is a modification where Inline graphic instead of Inline graphic. The observations are ordered by failure times and there are no ties, so Inline graphic. The directed graph Inline graphic of the original data, shown in Fig. 1(a), is strongly connected; therefore by Proposition 1 a nonparametric maximum likelihood estimate exists and is unique. This is not the case for the modified data: as shown in Fig. 1(b), there is no directed path from vertex 7 to any of the other vertices, since no failure time other than Inline graphic is contained in Inline graphic. Therefore, the directed graph Inline graphic of the modified data is not strongly connected.
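The check in Proposition 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes, consistently with the discussion of vertex 7 above, that there is a directed edge from vertex i to vertex j when the failure time of observation j lies in the truncation interval of observation i. The three-observation datasets and the function names are hypothetical and are not taken from Table 1.

```python
def build_graph(x, intervals):
    """Adjacency lists: edge i -> j if x[j] lies in the closed interval of i.

    Assumed edge rule (the formula is not reproduced in this copy of the text).
    """
    n = len(x)
    return {
        i: [j for j in range(n)
            if j != i and intervals[i][0] <= x[j] <= intervals[i][1]]
        for i in range(n)
    }


def reachable(adj, start):
    """Set of vertices reachable from start along directed edges."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def reverse(adj):
    """Graph with every edge reversed."""
    rev = {i: [] for i in adj}
    for i, nbrs in adj.items():
        for j in nbrs:
            rev[j].append(i)
    return rev


def is_strongly_connected(adj):
    """True if every vertex reaches, and is reached from, vertex 0."""
    n = len(adj)
    return (len(reachable(adj, 0)) == n
            and len(reachable(reverse(adj), 0)) == n)


# Hypothetical toy data: three failure times, no ties.
x = [1.0, 2.0, 3.0]
intervals_a = [(0.0, 2.5), (0.5, 3.5), (1.5, 4.0)]
intervals_b = [(0.0, 2.5), (0.5, 3.5), (2.5, 4.0)]  # last interval holds only x[2]

print(is_strongly_connected(build_graph(x, intervals_a)))  # True
print(is_strongly_connected(build_graph(x, intervals_b)))  # False
```

In the second toy dataset the last interval contains no failure time other than its own, so the last vertex has no outgoing edges, mirroring the role of vertex 7 in example (b).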

Table 1.

(a) Example from Efron & Petrosian (1999); (b) modified example

(a)
Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic
1 0.75 [0.4, 2] 5 2.25 Inline graphic
2 1.05 [0.3, 1.4] 6 2.4 Inline graphic
3 1.25 [0.8, 1.8] 7 2.5 Inline graphic
4 1.5 [0, 2.3]      
(b)
Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic
1 0.75 Inline graphic 5 2.25 Inline graphic
2 1.05 Inline graphic 6 2.4 Inline graphic
3 1.25 Inline graphic 7 2.5 Inline graphic
4 1.5 Inline graphic      

Fig. 1.

Directed graphs for the data in examples (a) and (b) in Table 1.

Furthermore, for the example in Table 1(b) we can show that a nonparametric maximum likelihood estimate does not exist. To see this, note that the likelihood is

graphic file with name M104.gif

Suppose, by way of contradiction, that there exists Inline graphic which maximizes Inline graphic subject to the constraints (2). Consider another estimate Inline graphic defined by

graphic file with name M108.gif

for some Inline graphic. It follows that Inline graphic and Inline graphic for Inline graphic. Furthermore,

graphic file with name M113.gif

where the inequality holds because Inline graphic. So Inline graphic, which contradicts the supposition that Inline graphic maximizes Inline graphic. Thus a nonparametric maximum likelihood estimate does not exist.
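The same phenomenon can be seen numerically on a small hypothetical dataset. The sketch below assumes the standard conditional likelihood for doubly truncated data with no ties, in which observation i contributes its own mass divided by the total mass falling in its truncation interval; the displayed likelihood (1) is not reproduced in this copy, so that form is an assumption here. The data and function names are illustrative only.

```python
import math


def log_likelihood(p, x, intervals):
    """Log of prod_i p[i] / sum_{j: x[j] in interval i} p[j] (assumed form, no ties)."""
    total = 0.0
    for i, (lo, hi) in enumerate(intervals):
        denom = sum(p[j] for j in range(len(x)) if lo <= x[j] <= hi)
        total += math.log(p[i]) - math.log(denom)
    return total


# Hypothetical toy data: the last interval contains only its own failure time,
# so the graph is connected but not strongly connected.
x = [1.0, 2.0, 3.0]
intervals = [(0.0, 2.5), (0.5, 3.5), (2.5, 4.0)]


def shifted(eps):
    """Put mass eps on the last failure time and split the rest equally."""
    return [(1.0 - eps) / 2.0, (1.0 - eps) / 2.0, eps]


values = [log_likelihood(shifted(eps), x, intervals) for eps in (0.1, 0.01, 0.001)]
print(values)
```

For these toy data the likelihood at `shifted(eps)` equals (1 - eps)/4, so it increases towards 1/4 as eps decreases but never attains that value for any strictly positive eps, which is the sense in which no maximizer exists.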

The example in Table 1(b) is suggestive of another graphical condition which implies that a nonparametric maximum likelihood estimate does not exist. Define an undirected path in a graph to be a sequence of edges, taken without regard to direction, connecting an ordered list of distinct vertices, and define a graph Inline graphic to be connected if there exists an undirected path between each pair of vertices. The graph in Fig. 1(b) is connected but not strongly connected, and for these data a nonparametric maximum likelihood estimate does not exist. In fact, this relationship between connectedness and existence of a nonparametric maximum likelihood estimate is not peculiar to this example. The following proposition states that this graphical condition is in general sufficient for the nonexistence of a nonparametric maximum likelihood estimate. A proof is given in the Appendix.

Proposition 2.

If Inline graphic is connected but not strongly connected, then a nonparametric maximum likelihood estimate does not exist.

Proposition 1 concerns the scenario where Inline graphic is strongly connected. Proposition 2 addresses the case where Inline graphic is connected but not strongly connected. The following corollary covers the remaining possible situation, where Inline graphic is not connected. The first part follows from Proposition 1, upon noting that mass can be redistributed between the connected subgraphs without affecting the value of the likelihood. The second part follows from Proposition 2.

Corollary 1.

Suppose that Inline graphic is not connected and can be partitioned into two or more connected subgraphs. If each of these subgraphs is strongly connected, then a nonparametric maximum likelihood estimate exists but is not unique. Otherwise, if at least one connected subgraph is not strongly connected, then a nonparametric maximum likelihood estimate does not exist.
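Taken together, Propositions 1 and 2 and Corollary 1 amount to a simple decision rule on the graph. The following Python sketch is one way to express it, assuming the graph is supplied as a dictionary of adjacency lists; the function names and the returned strings are hypothetical choices made for illustration.

```python
def reachable(adj, start):
    """Set of vertices reachable from start by following edges in adj."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def classify_npmle(adj):
    """Apply Propositions 1 and 2 and Corollary 1 to a directed graph."""
    rev = {i: [] for i in adj}
    for i, nbrs in adj.items():
        for j in nbrs:
            rev[j].append(i)
    # Undirected version of the graph, used to find connected subgraphs.
    und = {i: set(adj[i]) | set(rev[i]) for i in adj}

    components = []
    unseen = set(adj)
    while unseen:
        comp = reachable(und, min(unseen))
        components.append(comp)
        unseen -= comp

    # A connected subgraph is strongly connected if one of its vertices
    # reaches, and is reached from, every vertex in it.
    all_strong = all(
        reachable(adj, min(c)) == c and reachable(rev, min(c)) == c
        for c in components
    )
    if not all_strong:
        return "does not exist"
    if len(components) == 1:
        return "exists and is unique"
    return "exists but is not unique"


print(classify_npmle({0: [1], 1: [0, 2], 2: [1]}))          # exists and is unique
print(classify_npmle({0: [1], 1: [0, 2], 2: []}))           # does not exist
print(classify_npmle({0: [1], 1: [0], 2: [3], 3: [2]}))     # exists but is not unique
```

Because no edge leaves a connected subgraph, a search started inside one never escapes it, so comparing the reachable sets with the subgraph's vertex set is enough to test strong connectivity of each piece.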

3. Application to AIDS study

A study was conducted by the U.S. Centers for Disease Control and Prevention of individuals infected with HIV through blood transfusion and diagnosed with AIDS before 1 July 1986. Several datasets from this study have been analysed previously. Here we consider the dataset from Wang (1989), which contains data on Inline graphic children who were between zero and four years of age at the time of blood transfusion. Austin et al. (2014) recently analysed these data to estimate the distribution of age at onset of AIDS, Inline graphic, which is assumed to be doubly truncated by age at contaminated blood transfusion on the left, Inline graphic, and age as of 1 July 1986 on the right, Inline graphic. Assume that Inline graphic. The solid line in Fig. 2, which is very similar to the Turnbull estimate from Fig. 1 in Austin et al. (2014), represents the Efron–Petrosian estimate Inline graphic for these data obtained using the R package DTDA (Moreira et al., 2010). The convergence criterion for the Efron–Petrosian iterative algorithm is that the absolute value of the change in Inline graphic between successive iterations should be less than 0.001 for Inline graphic, where Inline graphic for these data. At convergence, the loglikelihood was equal to Inline graphic.

Fig. 2.

AIDS study: Efron–Petrosian estimate Inline graphic (solid line) and the modified estimate Inline graphic (dashed line) described in § 3.

Fig. 3.

Directed graph Inline graphic of the AIDS data.

The directed graph Inline graphic for these data, given in Fig. 3, was constructed using the R package igraph (Csardi & Nepusz, 2006). The is.connected function from igraph can be used to determine that the graph is connected but not strongly connected. Specifically, let Inline graphic and Inline graphic; then Inline graphic for all Inline graphic and Inline graphic, implying that there is no directed path from any vertices in Inline graphic to any vertices in Inline graphic. Thus, according to Proposition 2, a nonparametric maximum likelihood estimate does not exist for these data.

Moreover, by the proof of Proposition 2 and similarly to the example in Table 1(b), we can construct an estimate Inline graphic which increases the loglikelihood relative to the Efron–Petrosian estimate Inline graphic. In particular, let Inline graphic if Inline graphic and Inline graphic otherwise, where Inline graphic, Inline graphic and Inline graphic. It is straightforward to show that Inline graphic. For example, when Inline graphic, Inline graphic. The corresponding estimate Inline graphic is shown by the dashed line in Fig. 2. While the values of the loglikelihood evaluated at the two estimates are close, differences in the estimates of Inline graphic are not trivial at certain time-points. For example, at Inline graphic the Efron–Petrosian estimate is approximately 0.86, whereas the modified estimate is approximately 1.00.

In either case, because the nonparametric maximum likelihood estimate does not exist for these data, both the Efron–Petrosian estimate and the modified estimate are potentially misleading. Both estimates place almost all of the mass on Inline graphic, corresponding to Inline graphic, Inline graphic, Inline graphic and Inline graphic. The accumulation of mass at these early time-points need not reflect a high true underlying risk of AIDS before age two; in fact, approximately two-thirds of the observed Inline graphic exceed two years. Rather, because Inline graphic is connected but not strongly connected, shifting mass from Inline graphic to Inline graphic will always increase the likelihood.

4. Remarks

Constructing the graph Inline graphic and assessing whether it is strongly connected, connected but not strongly connected, or not connected is important for avoiding misleading estimates when computing the nonparametric maximum likelihood estimator in the presence of double truncation. The proposition at the end of this section indicates that, under certain assumptions, Inline graphic will be strongly connected almost surely as Inline graphic, suggesting that such misleading estimates will tend to occur only in smaller samples.

In the setting of left-truncated data, Lawless (2003, § 3.5.1) suggested avoiding this ‘pathological behaviour’ by computing the nonparametric maximum likelihood estimate conditional on survival beyond a certain early time-point; but how to select such a time-point is unclear. Motivation for this approach arises from the need to satisfy the condition of Proposition 1. For example, consider the AIDS study data, for which the condition of Proposition 1 does not hold. If we condition on survival beyond the failure times of individuals Inline graphic, the graph Inline graphic constructed using only data from individuals Inline graphic is strongly connected and so the condition of Proposition 1 is satisfied. Therefore the conditional nonparametric maximum likelihood estimate exists and is unique, although the targeted estimand has now been redefined and data from individuals Inline graphic are inefficiently discarded.

To avoid nonexistence that arises when Inline graphic is connected but not strongly connected, one might consider instead replacing the strict inequality in (2) with Inline graphic Inline graphic, so that the optimization problem then entails maximizing Inline graphic over a closed, bounded set. However, the likelihood may not be well-defined for points on the boundary of this larger parameter space. For example, suppose we observe Inline graphic doubly truncated survival times with Inline graphic, Inline graphic and Inline graphic. Then the likelihood equals Inline graphic, which is undefined for Inline graphic and Inline graphic. Redefining the likelihood at the boundary by limits is also not possible in general. For example, for Inline graphic, the limit of Inline graphic as Inline graphic is Inline graphic, whereas the limit of Inline graphic as Inline graphic is Inline graphic. Thus the limit of Inline graphic as Inline graphic does not exist.

Another approach is to reparameterize the likelihood. For instance, continuing with the Inline graphic example, let Inline graphic where Inline graphic, Inline graphic and Inline graphic, so that the likelihood can be re-expressed as Inline graphic. This modified parameter likelihood is well-defined over the parameter space Inline graphic. The maximum of Inline graphic occurs at the boundary value Inline graphic, which corresponds to Inline graphic in the original parameterization, demonstrating that the potential for misleading estimates remains. Nonetheless, modifying the parameterization provides additional insight when the nonparametric maximum likelihood estimate does not exist. In particular, suppose that Inline graphic is connected but not strongly connected, and let Inline graphic denote a minimal strongly connected component, as defined in the Appendix. Consider replacing the parameters Inline graphic with the parameters Inline graphic and Inline graphic, assuming there exist at least two nodes not in Inline graphic. Then it follows from the proof of Proposition 2 that the modified parameter likelihood is maximized when Inline graphic, and the parameters Inline graphic are equal to the nonparametric maximum likelihood estimate for the dataset consisting only of data points Inline graphic for Inline graphic, which by Proposition 1 exists and is unique.

For the biased sampling model, Vardi (1985) and Gill et al. (1988) derived asymptotic properties of the nonparametric maximum likelihood estimator under an assumption which implies its existence and uniqueness with probability 1. Likewise, here derivation of large-sample properties such as consistency would seem to require assumptions which imply that Inline graphic is strongly connected almost surely as Inline graphic. A natural starting point is to consider conditions under which Inline graphic is identifiable. Shen (2010) states that Inline graphic is identifiable if Inline graphic and Inline graphic, where Inline graphic and Inline graphic for any distribution function Inline graphic, Inline graphic, Inline graphic and Inline graphic. However, without further assumptions, it is straightforward to construct examples where Inline graphic is not identifiable. For example, if Inline graphic is uniform on the interval Inline graphic, Inline graphic is uniform on the interval Inline graphic, Inline graphic is Bernoulli with mean Inline graphic, Inline graphic, Inline graphic, and the density Inline graphic of Inline graphic is positive on Inline graphic such that Inline graphic and Inline graphic, then Inline graphic is identified up to proportionality within Inline graphic and within Inline graphic but Inline graphic is not identified. Moreover, almost surely as Inline graphic, any graph Inline graphic corresponding to this data-generating mechanism can be partitioned into two subgraphs corresponding to observations with Inline graphic and observations with Inline graphic. The next proposition gives a set of sufficient conditions for the existence and uniqueness of the nonparametric maximum likelihood estimator with probability 1. A proof is given in the Appendix.

Proposition 3.

Assume that Inline graphic, Inline graphic and Inline graphic are continuous with positive densities on Inline graphic, Inline graphic and Inline graphic, respectively, Inline graphic, and there exists Inline graphic such that Inline graphic. Then Inline graphic is strongly connected almost surely as Inline graphic.

Acknowledgement

This work was partially supported by the U.S. National Institutes of Health. The authors thank the associate editor and a reviewer for helpful comments.

Appendix

Proof of Proposition 2.

The proof relies on results from graph theory (West, 1996). Define a strongly connected component of a directed graph Inline graphic to be a subgraph which is strongly connected and such that no additional adjacent edges or vertices in Inline graphic can be included in the subgraph, with it remaining strongly connected. The collection of strongly connected components Inline graphic partitions the set of vertices of Inline graphic. Let Inline graphic if there exists a directed path from a vertex in Inline graphic to a vertex in Inline graphic. The binary relation Inline graphic is a partial order because it satisfies reflexivity, antisymmetry and transitivity. Therefore the set Inline graphic is partially ordered. Since Inline graphic is finite and nonempty, it has at least one minimal element, say Inline graphic, such that no other element Inline graphic satisfies Inline graphic. Hence, for any Inline graphic where Inline graphic, there is no directed path from a vertex in Inline graphic to a vertex in Inline graphic. The vertices of Inline graphic must form a proper subset of the vertices in Inline graphic, since Inline graphic is not strongly connected by assumption. Consequently, there must exist at least one directed path from a vertex in Inline graphic to a vertex not in Inline graphic, because Inline graphic is connected by assumption.
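The objects used in this argument can be computed directly. The sketch below, which is an illustration and not part of the proof, finds the strongly connected components with a standard two-pass (Kosaraju-style) search and then returns those components that receive no directed edge from any other component, which is how a minimal element of the partial order is characterized above. The adjacency-list input format and the function names are assumptions.

```python
def strongly_connected_components(adj):
    """Return (list of vertex sets, vertex -> component index)."""
    # First pass: record vertices in order of depth-first completion.
    order, seen = [], set()
    for root in adj:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(adj[root]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(adj[nxt])))
                    break
            else:
                order.append(node)
                stack.pop()

    # Reverse every edge.
    rev = {i: [] for i in adj}
    for i in adj:
        for j in adj[i]:
            rev[j].append(i)

    # Second pass: search the reversed graph in decreasing completion order.
    label, components = {}, []
    for root in reversed(order):
        if root in label:
            continue
        index = len(components)
        label[root] = index
        comp, stack = {root}, [root]
        while stack:
            node = stack.pop()
            for nxt in rev[node]:
                if nxt not in label:
                    label[nxt] = index
                    comp.add(nxt)
                    stack.append(nxt)
        components.append(comp)
    return components, label


def minimal_components(adj):
    """Components with no incoming edge from a different component."""
    components, label = strongly_connected_components(adj)
    has_incoming = {
        label[j] for i in adj for j in adj[i] if label[i] != label[j]
    }
    return [c for k, c in enumerate(components) if k not in has_incoming]


# Hypothetical graph: vertices 0 and 1 point to each other, 1 points to 2.
print(minimal_components({0: [1], 1: [0, 2], 2: []}))  # [{0, 1}]
```

In this toy graph the component {0, 1} is minimal, and vertex 1 plays the role of a vertex with a directed edge leaving the minimal component.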

Let Inline graphic be the subset of vertices in Inline graphic with directed edges to vertices not in Inline graphic. Let Inline graphic denote the set of vertices that are in Inline graphic but not in Inline graphic. Define Inline graphic analogously. Then the likelihood (1) can be expressed as Inline graphic where

Proof of Proposition 2.

Note that Inline graphic may be empty, in which case we let Inline graphic. On the other hand, Inline graphic and Inline graphic must be nonempty by the arguments above.

The proof of the proposition proceeds by showing that for any estimate Inline graphic satisfying (2), another estimate Inline graphic can be constructed which also satisfies (2) and is such that Inline graphic. In particular, let Inline graphic with

Proof of Proposition 2.

where Inline graphic, Inline graphic and Inline graphic. By the construction of Inline graphic it follows that

Proof of Proposition 2.

Next, note that if Inline graphic, then Inline graphic; otherwise, because there are no directed edges between vertices in Inline graphic and vertices in Inline graphic,

Proof of Proposition 2.

Finally,

Proof of Proposition 2.

where Inline graphic and the inequality holds because Inline graphic for Inline graphic and Inline graphic. Thus Inline graphic, Inline graphic and Inline graphic, implying Inline graphic. □

Proof of Proposition 3.

Let Inline graphic be a partition of Inline graphic where Inline graphic Inline graphic. Consider two arbitrary vertices Inline graphic and Inline graphic of Inline graphic and assume without loss of generality that Inline graphic. Under the proposition’s assumptions, as Inline graphic there exist almost surely Inline graphic such that Inline graphic and Inline graphic for Inline graphic, so there will be a directed path of the form Inline graphic for some Inline graphic. Likewise, it can be shown that a directed path will almost surely exist from Inline graphic to Inline graphic. □

References

  1. Austin, M. D., Simon, D. K. & Betensky, R. A. (2014). Computationally simple estimation and improved efficiency for special cases of double truncation. Lifetime Data Anal. 20, 335–54.
  2. Csardi, G. & Nepusz, T. (2006). The igraph software package for complex network research. Inter J. Complex Syst., article no. 1695.
  3. Efron, B. & Petrosian, V. (1999). Nonparametric methods for doubly truncated data. J. Am. Statist. Assoc. 94, 824–34.
  4. Gill, R. D., Vardi, Y. & Wellner, J. A. (1988). Large sample theory of empirical distributions in biased sampling models. Ann. Statist. 16, 1069–112.
  5. Lawless, J. F. (2003). Statistical Models and Methods for Lifetime Data. Hoboken, New Jersey: John Wiley & Sons, 2nd ed.
  6. Lloyd-Ronning, N. M., Fryer, C. L. & Ramirez-Ruiz, E. (2002). Cosmological aspects of gamma-ray bursts: Luminosity evolution and an estimate of the star formation rate at high redshifts. Astrophys. J. 574, 554–65.
  7. Moreira, C., de Uña-Álvarez, J. & Crujeiras, R. (2010). DTDA: An R package to analyze randomly truncated data. J. Statist. Software 37, 1–20.
  8. R Development Core Team (2019). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0, http://www.R-project.org.
  9. Shen, P. (2010). Nonparametric analysis of doubly truncated data. Ann. Inst. Statist. Math. 62, 835–53.
  10. Turnbull, B. W. (1976). The empirical distribution function with arbitrarily grouped, censored and truncated data. J. R. Statist. Soc. B 38, 290–5.
  11. Vardi, Y. (1985). Empirical distributions in selection bias models. Ann. Statist. 13, 178–203.
  12. Wang, M. C. (1989). A semiparametric model for randomly truncated data. J. Am. Statist. Assoc. 84, 742–8.
  13. West, D. B. (1996). Introduction to Graph Theory. Upper Saddle River, New Jersey: Prentice-Hall.

Articles from Biometrika are provided here courtesy of Oxford University Press
