Summary
Doubly truncated survival data arise if failure times are observed only within certain time intervals. The nonparametric maximum likelihood estimator is widely used to estimate the underlying failure time distribution. Under a directed graph representation of the data suggested by Vardi (1985), the nonparametric maximum likelihood estimate exists and is unique if and only if a certain graphical condition holds. If this condition does not hold, then such an estimate may exist without being unique, or may not exist at all, so a second graphical condition is proposed to check whether such an estimate exists. Both conditions are simple to check using existing graphical software. Reanalysis of an AIDS incubation time dataset shows that a nonparametric maximum likelihood estimate does not exist for these data.
Keywords: Graph, Nonparametric estimator, Truncation
1. Introduction
Doubly truncated survival data arise when a failure time $T$ is observed only if it lies within a time interval $[L, R]$. In this article we will consider an example where the age $T$ at onset of AIDS is observed only if it lies between the age at a contaminated blood transfusion on the left, $L$, and the age at a particular calendar date on the right, $R$. Doubly truncated data also arise in astronomy; for example, Efron & Petrosian (1999) described a dataset where, due to experimental conditions, quasar luminosity $T$ is observed only if it lies in a certain interval $[L, R]$.
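Suppose that we observe $n$ independent copies $(L_i, T_i, R_i)$, $i = 1, \ldots, n$, of three random variables $(L, T, R)$, where $L_i$ is the left truncation time, $T_i$ is the failure time and $R_i$ is the right truncation time, with $L_i \le T_i \le R_i$. Our objective is to compute the nonparametric maximum likelihood estimate of $F(t) = \mathrm{pr}(T \le t)$. If $T$ is independent of $(L, R)$, then the likelihood is proportional to $\prod_{i=1}^{n} f(T_i) / \{F(R_i) - F(L_i-)\}$, where $f$ is the density of $T$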
. Let
and
. Any two cumulative distribution functions
and
that are proportional between
and
, i.e.,
for some constant
and
, will give rise to the same likelihood value, regardless of the values of
and
for
(Efron & Petrosian, 1999). We therefore limit our search for the nonparametric maximum likelihood estimate to the class of distribution functions such that
, since any such estimator of
corresponds to a unique
for fixed
that maximizes the nonparametric likelihood subject to
and
.
The nonparametric maximum likelihood estimator for doubly truncated data was introduced by Turnbull (1976) and is widely used in areas such as astronomy (e.g., Lloyd-Ronning et al., 2002). To compute the nonparametric maximum likelihood estimate of
, iterative algorithms (Turnbull, 1976; Efron & Petrosian, 1999) have been proposed. Software packages such as the R package DTDA (Moreira et al., 2010; R Development Core Team, 2019) are available to implement these algorithms. However, convergence of these algorithms for a particular dataset does not imply that a nonparametric maximum likelihood estimate actually exists. In fact, such an estimate may not exist, in which case estimates computed using an iterative algorithm may be misleading. Thus, for any dataset it is important to determine whether a nonparametric maximum likelihood estimate exists. In the following we give a necessary and sufficient graphical condition, based on Vardi (1985), for the existence of a unique such estimate. If Vardi’s condition does not hold, then a nonparametric maximum likelihood estimate may exist but not be unique; therefore, we also propose a new condition for checking whether such an estimate exists. Both conditions are simple to verify using existing software.
2. Existence and uniqueness
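A nonparametric maximum likelihood estimator, if it exists, will place mass only on the distinct ordered failure times $t_1 < \cdots < t_m$, where $m \le n$ (Vardi, 1985). Consider the space of discrete distribution functions with support on the observed $m$ distinct ordered failure times, and let $f = (f_1, \ldots, f_m)$, where $f_j = \mathrm{pr}(T = t_j)$ for $j = 1, \ldots, m$. Let $d_j = \sum_{i=1}^{n} I(T_i = t_j)$ and $F_i = \sum_{j=1}^{m} f_j I(L_i \le t_j \le R_i)$ for $i = 1, \ldots, n$ and $j = 1, \ldots, m$, where $I(\cdot)$ denotes the indicator function. Then the conditional likelihood for $(T_1, \ldots, T_n)$ given $(L_1, \ldots, L_n)$ and $(R_1, \ldots, R_n)$ is

$$L(f) = \prod_{j=1}^{m} f_j^{d_j} \Big/ \prod_{i=1}^{n} F_i. \tag{1}$$

Define $\hat{f}$ to be a nonparametric maximum likelihood estimator of $f$ if $\hat{f}$ maximizes $L(f)$ subject to the constraints

$$\sum_{j=1}^{m} f_j = 1, \qquad f_j > 0 \quad (j = 1, \ldots, m). \tag{2}$$

The nonparametric maximum likelihood estimator of $F$ is then $\hat{F}(t) = \sum_{j=1}^{m} \hat{f}_j I(t_j \le t)$.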
Depending on the particular dataset $\{(L_i, T_i, R_i) : i = 1, \ldots, n\}$ observed, a nonparametric maximum likelihood estimate may fail to exist, or may exist without being unique. A necessary and sufficient condition for its existence and uniqueness is given in Proposition 1. Define a directed graph $G$ with $n$ vertices, each representing an observation triplet $(L_i, T_i, R_i)$, such that there is a directed edge from vertex $i$ to vertex $j$ $(j \ne i)$ if and only if $L_i \le T_j \le R_i$. A directed path is a sequence of edges that connects a sequence of vertices with all the edges in the same direction. A graph $G$ is strongly connected if, for any two vertices $i$ and $j$, there exists a directed path from $i$ to $j$ and a directed path from $j$ to $i$. From Vardi (1985), we have the following proposition.
Proposition 1.
There exists a unique nonparametric maximum likelihood estimate if and only if $G$ is strongly connected.
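The graphical check in Proposition 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' software. It assumes that an edge runs from vertex i to vertex j exactly when the failure time of observation j lies in the closed truncation interval of observation i, which is the reading suggested by the discussion of Table 1. The function names and the toy triplets are hypothetical and are not taken from the paper.

```python
from collections import deque

def build_graph(obs):
    """Adjacency lists for triplets (l, t, r): edge i -> j iff l_i <= t_j <= r_i, j != i.

    The edge rule is an assumption inferred from the text, not quoted from it.
    """
    n = len(obs)
    return [[j for j in range(n)
             if j != i and obs[i][0] <= obs[j][1] <= obs[i][2]]
            for i in range(n)]

def reachable(adj, start):
    """Set of vertices reachable from `start` by directed paths (breadth-first search)."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def is_strongly_connected(adj):
    """True iff every vertex reaches, and is reached from, vertex 0."""
    n = len(adj)
    if n == 0:
        return True
    reverse = [[] for _ in range(n)]
    for u, neighbours in enumerate(adj):
        for v in neighbours:
            reverse[v].append(u)
    return len(reachable(adj, 0)) == n and len(reachable(reverse, 0)) == n

# Hypothetical toy data (l, t, r); these are NOT the values of Table 1.
toy = [(0.0, 1.0, 3.0), (0.5, 2.0, 4.0), (1.5, 3.0, 5.0)]
adj = build_graph(toy)
print(adj, is_strongly_connected(adj))
```

For the toy triplets every vertex can reach every other vertex, so under Proposition 1 a unique estimate would exist for such data.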
To illustrate Proposition 1, consider the two examples given in Table 1. Example (a) is from Efron & Petrosian (1999), and example (b) is a modification where
instead of
. The observations are ordered by failure times and there are no ties, so
. The directed graph $G$ of the original data, shown in Fig. 1(a), is strongly connected; therefore by Proposition 1 a nonparametric maximum likelihood estimate exists and is unique. This is not the case for the modified data: as shown in Fig. 1(b), there is no directed path from vertex 7 to any of the other vertices, since no failure time other than $T_7$ is contained in $[L_7, R_7]$. Therefore, the directed graph $G$ of the modified data is not strongly connected.
Table 1.
(a) Example from Efron & Petrosian (1999); (b) modified example
| (a) | | | | | |
|---|---|---|---|---|---|
| $i$ | $T_i$ | $[L_i, R_i]$ | $i$ | $T_i$ | $[L_i, R_i]$ |
| 1 | 0.75 | [0.4, 2] | 5 | 2.25 | |
| 2 | 1.05 | [0.3, 1.4] | 6 | 2.4 | |
| 3 | 1.25 | [0.8, 1.8] | 7 | 2.5 | |
| 4 | 1.5 | [0, 2.3] | | | |
| (b) | | | | | |
| $i$ | $T_i$ | $[L_i, R_i]$ | $i$ | $T_i$ | $[L_i, R_i]$ |
| 1 | 0.75 | | 5 | 2.25 | |
| 2 | 1.05 | | 6 | 2.4 | |
| 3 | 1.25 | | 7 | 2.5 | |
| 4 | 1.5 | | | | |
Fig. 1.
Directed graphs for the data in examples (a) and (b) in Table 1.
Furthermore, for the example in Table 1(b) we can show that a nonparametric maximum likelihood estimate does not exist. To see this, note that the likelihood is
![]() |
Suppose, by way of contradiction, that there exists
which maximizes
subject to the constraints (2). Consider another estimate
defined by
![]() |
for some
. It follows that
and
for
. Furthermore,
![]() |
where the inequality holds because
. So
, which contradicts the supposition that
maximizes
. Thus a nonparametric maximum likelihood estimate does not exist.
The example in Table 1(b) is suggestive of another graphical condition which implies that a nonparametric maximum likelihood estimate does not exist. Define an undirected path in a graph to be a sequence of edges, traversed without regard to their direction, connecting an ordered list of distinct vertices, and define a graph $G$ to be connected if there exists an undirected path between each pair of vertices. The graph in Fig. 1(b) is connected but not strongly connected, and for these data a nonparametric maximum likelihood estimate does not exist. This relationship between connectedness and nonexistence of a nonparametric maximum likelihood estimate is not specific to this example: the following proposition states that the graphical condition is in general sufficient for nonexistence. A proof is given in the Appendix.
Proposition 2.
If $G$ is connected but not strongly connected, then a nonparametric maximum likelihood estimate does not exist.
Proposition 1 concerns the scenario where $G$ is strongly connected. Proposition 2 addresses the case where $G$ is connected but not strongly connected. The following corollary covers the remaining possible situation, where $G$ is not connected. The first part follows from Proposition 1, upon noting that mass can be redistributed between the connected subgraphs without affecting the value of the likelihood. The second part follows from Proposition 2.
Corollary 1.
Suppose that $G$ is not connected and can be partitioned into two or more connected subgraphs. If each of these subgraphs is strongly connected, then a nonparametric maximum likelihood estimate exists but is not unique. Otherwise, if at least one connected subgraph is not strongly connected, then a nonparametric maximum likelihood estimate does not exist.
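Propositions 1 and 2 and Corollary 1 together give a decision rule that can be sketched as follows. The code is an illustrative sketch under the same assumed edge rule (an edge from i to j when the failure time of j lies in the closed truncation interval of i). The function name `classify`, the returned strings and the example triplets are hypothetical.

```python
def _reach(neigh, start):
    """Vertices reachable from `start` using the neighbour lists `neigh`."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in neigh[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def classify(obs):
    """Apply the graphical conditions to triplets (l, t, r)."""
    n = len(obs)
    if n == 0:
        raise ValueError("need at least one observation")
    # Outgoing edges: i -> j iff l_i <= t_j <= r_i (assumed edge rule).
    out = [[j for j in range(n)
            if j != i and obs[i][0] <= obs[j][1] <= obs[i][2]]
           for i in range(n)]
    inc = [[] for _ in range(n)]
    for u in range(n):
        for v in out[u]:
            inc[v].append(u)
    both = [out[u] + inc[u] for u in range(n)]  # edges with direction ignored

    # Connected subgraphs (direction ignored).
    components, unseen = [], set(range(n))
    while unseen:
        comp = _reach(both, min(unseen))
        components.append(comp)
        unseen -= comp

    # A connected subgraph is strongly connected iff one of its vertices
    # reaches, and is reached from, every vertex of that subgraph.
    all_strong = all(_reach(out, min(c)) == c and _reach(inc, min(c)) == c
                     for c in components)
    if not all_strong:
        return "does not exist"
    return "exists and is unique" if len(components) == 1 else "exists but is not unique"

# Hypothetical examples, one per case.
print(classify([(0.0, 1.0, 3.0), (0.5, 2.0, 4.0), (1.5, 3.0, 5.0)]))
print(classify([(0.0, 1.0, 3.0), (1.5, 2.0, 3.0)]))
print(classify([(0.0, 1.0, 2.5), (0.5, 2.0, 3.0), (10.0, 11.0, 12.5), (10.5, 12.0, 13.0)]))
```

The third example consists of two separate strongly connected pairs, so mass can be moved freely between the pairs without changing the likelihood, which is the non-uniqueness described in Corollary 1.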
3. Application to AIDS study
A study was conducted by the U.S. Centers for Disease Control and Prevention of individuals infected with HIV through blood transfusion and diagnosed with AIDS before 1 July 1986. Several datasets from this study have been analysed previously. Here we consider the dataset from Wang (1989), which contains data on
children who were between zero and four years of age at the time of blood transfusion. Austin et al. (2014) recently analysed these data to estimate the distribution of age at onset of AIDS,
, which is assumed to be doubly truncated by age at contaminated blood transfusion on the left,
, and age as of 1 July 1986 on the right,
. Assume that
. The solid line in Fig. 2, which is very similar to the Turnbull estimate from Fig. 1 in Austin et al. (2014), represents the Efron–Petrosian estimate
for these data obtained using the R package DTDA (Moreira et al., 2010). The convergence criterion for the Efron–Petrosian iterative algorithm is that the absolute value of the change in
between successive iterations should be less than 0.001 for
, where
for these data. At convergence, the loglikelihood was equal to
.
Fig. 2.
AIDS study: Efron–Petrosian estimate (solid line) and the modified estimate (dashed line) described in § 3.
Fig. 3.
Directed graph $G$ of the AIDS data.
The directed graph $G$ for these data, given in Fig. 3, was constructed using the R package igraph (Csardi & Nepusz, 2006). The is.connected function from igraph, called with mode = "weak" and with mode = "strong", can be used to determine that the graph is connected but not strongly connected. Specifically, let
and
; then
for all
and
, implying that there is no directed path from any vertices in
to any vertices in
. Thus, according to Proposition 2, a nonparametric maximum likelihood estimate does not exist for these data.
Moreover, by the proof of Proposition 2 and similarly to the example in Table 1(b), we can construct an estimate
which increases the loglikelihood relative to the Efron–Petrosian estimate
. In particular, let
if
and
otherwise, where
,
and
. It is straightforward to show that
. For example, when
,
. The corresponding estimate
is shown by the dashed line in Fig. 2. While the values of the loglikelihood evaluated at the two estimates are close, differences in the estimates of
are not trivial at certain time-points. For example, at
the Efron–Petrosian estimate is approximately 0.86, whereas the modified estimate is approximately 1.00.
In either case, because the nonparametric maximum likelihood estimate does not exist for these data, both the Efron–Petrosian estimate and the modified estimate are potentially misleading. Both estimates place almost all of the mass on
, corresponding to
,
,
and
. The accumulation of mass at these early time-points need not reflect a high true underlying risk of AIDS before age two; in fact, approximately two-thirds of the observed
exceed two years. Rather, because
is connected but not strongly connected, shifting mass from
to
will always increase the likelihood.
4. Remarks
Constructing the graph $G$ and assessing whether it is strongly connected, connected but not strongly connected, or not connected is important to avoid misleading estimates when computing the nonparametric maximum likelihood estimator in the presence of double truncation. The proposition at the end of this section indicates that, under certain assumptions, $G$ will be strongly connected almost surely as $n \to \infty$, suggesting that such misleading estimates will tend to occur only in smaller samples.
In the setting of left-truncated data, Lawless (2003, § 3.5.1) suggested avoiding this ‘pathological behaviour’ by computing the nonparametric maximum likelihood estimate conditional on survival beyond a certain early time-point, but how to select such a time-point is unclear. Motivation for this approach arises from the need to satisfy the condition of Proposition 1. For example, consider the AIDS study data, for which the condition of Proposition 1 does not hold. If we condition on survival beyond the failure times of individuals
, the graph
constructed using only data from individuals
is strongly connected and so Proposition 1 is satisfied. Therefore the conditional nonparametric maximum likelihood estimate exists and is unique, although the targeted estimand has now been redefined and data from individuals
are inefficiently discarded.
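One hypothetical way to explore the conditioning idea is to drop the observations with the earliest failure times one at a time until the graph built from the remaining observations is strongly connected. The sketch below is only an illustration of that search, under the assumed edge rule (edge from i to j when the failure time of j lies in the closed truncation interval of i). It is not a procedure proposed in this article or by Lawless (2003), and the function names and toy triplets are invented for the example.

```python
def is_strongly_connected(obs):
    """Strong connectivity of the graph built from triplets (l, t, r)."""
    n = len(obs)
    if n == 0:
        return False
    out = [[j for j in range(n)
            if j != i and obs[i][0] <= obs[j][1] <= obs[i][2]]
           for i in range(n)]
    inc = [[] for _ in range(n)]
    for u in range(n):
        for v in out[u]:
            inc[v].append(u)

    def reaches_all(neigh):
        # Depth-first search from vertex 0 over the given neighbour lists.
        seen, stack = {0}, [0]
        while stack:
            u = stack.pop()
            for v in neigh[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return len(seen) == n

    return reaches_all(out) and reaches_all(inc)

def drop_early(obs):
    """Smallest number k of earliest failures to drop so that the rest is strongly connected."""
    ordered = sorted(obs, key=lambda o: o[1])  # order by failure time
    for k in range(len(ordered)):
        kept = ordered[k:]
        if is_strongly_connected(kept):
            return k, kept
    return len(ordered), []

# Hypothetical data: no failure time other than its own falls in a way that
# lets the later observations point back to the earliest one.
toy = [(0.0, 1.0, 3.0), (1.5, 2.0, 4.0), (1.8, 3.0, 5.0)]
k, kept = drop_early(toy)
print(k, kept)
```

As noted above, any estimate computed from the retained observations targets a conditional distribution, and the dropped observations no longer contribute.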
To avoid nonexistence that arises when
is connected but not strongly connected, one might consider instead replacing the strict inequality in (2) with 
such that the optimization problem then entails maximizing
over a closed, bounded set. However, the likelihood may not be well-defined for points on the boundary of this larger parameter space. For example, suppose we observe
doubly truncated survival times with
,
and
. Then the likelihood equals
, which is undefined for
and
. Redefining the likelihood at the boundary by limits is also not possible in general. For example, for
, the limit of
as
is
, whereas the limit of
as
is
. Thus the limit of
as
does not exist.
Another approach is to reparameterize the likelihood. For instance, continuing with the
example, let
where
,
and
, so that the likelihood can be re-expressed as
. This modified parameter likelihood is well-defined over the parameter space
. The maximum of
occurs at the boundary value
, which corresponds to
in the original parameterization, demonstrating that the potential for misleading estimates remains. Nonetheless, modifying the parameterization provides additional insight when the nonparametric maximum likelihood estimate does not exist. In particular, suppose that
is connected but not strongly connected, and let
denote a minimal strongly connected component, as defined in the Appendix. Consider replacing the parameters
with the parameters
and
, assuming there exist at least two nodes not in
. Then it follows from the proof of Proposition 2 that the modified parameter likelihood is maximized when
, and the parameters
are equal to the nonparametric maximum likelihood estimate for the dataset consisting only of data points
for
, which by Proposition 1 exists and is unique.
For the biased sampling model, Vardi (1985) and Gill et al. (1988) derived asymptotic properties of the nonparametric maximum likelihood estimator under an assumption which implies its existence and uniqueness with probability 1. Likewise, here derivation of large-sample properties such as consistency would seem to require assumptions which imply that $G$ is strongly connected almost surely as $n \to \infty$. A natural starting point is to consider conditions under which
is identifiable. Shen (2010) states that
is identifiable if
and
, where
and
for any distribution function
,
,
and
. However, without further assumptions, it is straightforward to construct examples where
is not identifiable. For example, if
is uniform on the interval
,
is uniform on the interval
,
is Bernoulli with mean
,
,
, and the density
of
is positive on
such that
and
, then
is identified up to proportionality within
and within
but
is not identified. Moreover, almost surely as
, any graph
corresponding to this data-generating mechanism can be partitioned into two subgraphs corresponding to observations with
and observations with
. The next proposition gives a set of sufficient conditions for the existence and uniqueness of the nonparametric maximum likelihood estimator with probability 1. A proof is given in the Appendix.
Proposition 3.
Assume that
,
and
are continuous with positive densities on
,
and
, respectively,
, and there exists
such that
. Then $G$ is strongly connected almost surely as $n \to \infty$.
Acknowledgement
This work was partially supported by the U.S. National Institutes of Health. The authors thank the associate editor and a reviewer for helpful comments.
Appendix
Proof of Proposition 2.
The proof relies on results from graph theory (West, 1996). Define a strongly connected component of a directed graph $G$ to be a subgraph which is strongly connected and which is maximal, in the sense that no additional adjacent edges or vertices of $G$ can be included in the subgraph while it remains strongly connected. The collection of strongly connected components $\mathcal{C} = \{C_1, \ldots, C_K\}$ partitions the set of vertices of $G$. Let $C_j \preceq C_k$ if there exists a directed path from a vertex in $C_j$ to a vertex in $C_k$. The binary relation $\preceq$ is a partial order because it satisfies reflexivity, antisymmetry and transitivity. Therefore the set $\mathcal{C}$ is partially ordered. Since $\mathcal{C}$ is finite and nonempty, it has at least one minimal element, say $C^{*}$, such that no other element $C \in \mathcal{C}$ satisfies $C \preceq C^{*}$. Hence, for any $C \in \mathcal{C}$ where $C \ne C^{*}$, there is no directed path from a vertex in $C$ to a vertex in $C^{*}$. The vertices of $C^{*}$ must form a proper subset of the vertices in $G$, since $G$ is not strongly connected by assumption. Consequently, there must exist at least one directed path from a vertex in $C^{*}$ to a vertex not in $C^{*}$, because $G$ is connected by assumption.
Let
be the subset of vertices in
with directed edges to vertices not in
. Let
denote the set of vertices that are in
but not in
. Define
analogously. Then the likelihood (1) can be expressed as
where
Note that
may be empty, in which case we let
. On the other hand,
and
must be nonempty by the arguments above.
The proof of the proposition proceeds by showing that for any estimate
satisfying (2), another estimate
can be constructed which also satisfies (2) and is such that
. In particular, let
with
where
,
and
. By the construction of
it follows that
Next, note that if
, then
; otherwise, because there are no directed edges between vertices in
and vertices in
,
Finally,
where
and the inequality holds because
for
and
. Thus
,
and
, implying
. □
Proof of Proposition 3.
Let
be a partition of
where
. Consider two arbitrary vertices
and
of
and assume without loss of generality that
. Under the proposition’s assumptions, as
there exist almost surely
such that
and
for
, so there will be a directed path of the form
for some
. Likewise, it can be shown that a directed path will almost surely exist from
to
. □
References
- Austin, M. D., Simon, D. K. & Betensky, R. A. (2014). Computationally simple estimation and improved efficiency for special cases of double truncation. Lifetime Data Anal. 20, 335–54.
- Csardi, G. & Nepusz, T. (2006). The igraph software package for complex network research. Inter J. Complex Syst., article no. 1695.
- Efron, B. & Petrosian, V. (1999). Nonparametric methods for doubly truncated data. J. Am. Statist. Assoc. 94, 824–34.
- Gill, R. D., Vardi, Y. & Wellner, J. A. (1988). Large sample theory of empirical distributions in biased sampling models. Ann. Statist. 16, 1069–112.
- Lawless, J. F. (2003). Statistical Models and Methods for Lifetime Data. Hoboken, New Jersey: John Wiley & Sons, 2nd ed.
- Lloyd-Ronning, N. M., Fryer, C. L. & Ramirez-Ruiz, E. (2002). Cosmological aspects of gamma-ray bursts: Luminosity evolution and an estimate of the star formation rate at high redshifts. Astrophys. J. 574, 554–65.
- Moreira, C., de Uña-Álvarez, J. & Crujeiras, R. (2010). DTDA: An R package to analyze randomly truncated data. J. Statist. Software 37, 1–20.
- R Development Core Team (2019). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria: ISBN 3-900051-07-0 http://www.R-project.org.
- Shen, P. (2010). Nonparametric analysis of doubly truncated data. Ann. Inst. Statist. Math. 62, 835–53.
- Turnbull, B. W. (1976). The empirical distribution function with arbitrarily grouped, censored and truncated data. J. R. Statist. Soc. B 38, 290–5.
- Vardi, Y. (1985). Empirical distributions in selection bias models. Ann. Statist. 13, 178–203.
- Wang, M. C. (1989). A semiparametric model for randomly truncated data. J. Am. Statist. Assoc. 84, 742–8.
- West, D. B. (1996). Introduction to Graph Theory. Upper Saddle River, New Jersey: Prentice-Hall.