On nonparametric maximum likelihood estimation with double truncation

J Xiao; M G Hudgens

Biometrika. 2019 Jul 23;106(4):989–996. doi: 10.1093/biomet/asz038

PMCID: PMC6845852 PMID: 31754284

Summary

Doubly truncated survival data arise if failure times are observed only within certain time intervals. The nonparametric maximum likelihood estimator is widely used to estimate the underlying failure time distribution. Using a directed graph representation of the data suggested by Vardi (1985), a certain graphical condition holds if and only if the nonparametric maximum likelihood estimate exists and is unique. If this condition does not hold, then such an estimate may exist but need not be unique, so another graphical condition is proposed to check whether such an estimate exists. The conditions are simple to check using existing graphical software. Reanalysis of an AIDS incubation time dataset shows that a nonparametric maximum likelihood estimate does not exist for these data.

Keywords: Graph, Nonparametric estimator, Truncation

1. Introduction

Doubly truncated survival data arise when a failure time $T$ is observed only if it lies within a time interval $[L, R]$. In this article we will consider an example where the age at onset of AIDS is observed only if disease onset occurs between the age at a contaminated blood transfusion on the left, $L$, and the age at a particular calendar date on the right, $R$. Doubly truncated data also arise in astronomy; for example, Efron & Petrosian (1999) described a dataset where, due to experimental conditions, quasar luminosity is observed only if it lies in a certain interval $[L, R]$.

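Suppose that we observe $n$ independent copies of three random variables $(L_i, T_i, R_i)$ $(i = 1, \ldots, n)$, where $L_i$ is the left truncation time, $T_i$ is the failure time and $R_i$ is the right truncation time, with $L_i \le T_i \le R_i$. Our objective is to compute the nonparametric maximum likelihood estimate of $F(t) = \mathrm{pr}(T \le t)$. If $T$ is independent of $(L, R)$, then the likelihood is proportional to $\prod_{i=1}^{n} f(T_i)/\{F(R_i) - F(L_i-)\}$, where $f$ is the density of $T$. Let $L_{\min} = \min_i L_i$ and $R_{\max} = \max_i R_i$. Any two cumulative distribution functions $F_1$ and $F_2$ that are proportional between $L_{\min}$ and $R_{\max}$, i.e., $F_1(t) - F_1(L_{\min}-) = c\{F_2(t) - F_2(L_{\min}-)\}$ for some constant $c > 0$ and all $t \in [L_{\min}, R_{\max}]$, will give rise to the same likelihood value, regardless of the values of $F_1(t)$ and $F_2(t)$ for $t \notin [L_{\min}, R_{\max}]$ (Efron & Petrosian, 1999). We therefore limit our search for the nonparametric maximum likelihood estimate to the class of distribution functions such that $F(L_{\min}-) = 0$ and $F(R_{\max}) = 1$, since any such estimator of $F$ corresponds to a unique $F^*$ for fixed $a$ and $b$ that maximizes the nonparametric likelihood subject to $F^*(L_{\min}-) = a$ and $F^*(R_{\max}) = b$.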

The nonparametric maximum likelihood estimator for doubly truncated data was introduced by Turnbull (1976) and is widely used in areas such as astronomy (e.g., Lloyd-Ronning et al., 2002). To compute the nonparametric maximum likelihood estimate of $F$, iterative algorithms (Turnbull, 1976; Efron & Petrosian, 1999) have been proposed. Software packages such as the R package DTDA (Moreira et al., 2010; R Development Core Team, 2019) are available to implement these algorithms. However, convergence of these algorithms for a particular dataset does not imply that a nonparametric maximum likelihood estimate actually exists. In fact, such an estimate may not exist, in which case estimates computed using an iterative algorithm may be misleading. Thus, for any dataset it is important to determine whether a nonparametric maximum likelihood estimate exists. In the following we give a necessary and sufficient graphical condition, based on Vardi (1985), for the existence of a unique such estimate. If Vardi’s condition does not hold, then a nonparametric maximum likelihood estimate may exist but not be unique; therefore, we also propose a new condition for checking whether such an estimate exists. Both conditions are simple to verify using existing software.

2. Existence and uniqueness

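A nonparametric maximum likelihood estimator, if it exists, will place mass only on the distinct ordered failure times $t_1 < \cdots < t_K$, where $K \le n$ (Vardi, 1985). Consider the space of discrete distribution functions with support on the observed distinct ordered failure times, and let $f = (f_1, \ldots, f_K)$, where $f_j = F(t_j) - F(t_j-)$ for $j = 1, \ldots, K$. Let $d_{ij} = I(T_i = t_j)$ and $J_{ij} = I(L_i \le t_j \le R_i)$ for $i = 1, \ldots, n$ and $j = 1, \ldots, K$, where $I(\cdot)$ denotes the indicator function. Then the conditional likelihood for $f$ given $L_i$ and $R_i$ $(i = 1, \ldots, n)$ is

$$L(f) = \prod_{i=1}^{n} \frac{\sum_{j=1}^{K} d_{ij} f_j}{\sum_{j=1}^{K} J_{ij} f_j}. \qquad (1)$$

Define $\hat f = (\hat f_1, \ldots, \hat f_K)$ to be a nonparametric maximum likelihood estimator of $f$ if $L(\hat f) \ge L(f)$ for all $f$, subject to the constraints

$$\sum_{j=1}^{K} f_j = 1, \qquad f_j > 0 \quad (j = 1, \ldots, K). \qquad (2)$$

The nonparametric maximum likelihood estimator of $F$ is then $\hat F(t) = \sum_{j=1}^{K} \hat f_j I(t_j \le t)$.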
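As a minimal sketch of how the conditional likelihood (1) can be evaluated, the following Python function computes its logarithm for a given vector of masses on the distinct failure times. Each observation is assumed to be stored as a triplet (left truncation time, failure time, right truncation time); the function name, argument layout and the small dataset are illustrative assumptions and are not taken from the article.

```python
import math

def log_likelihood(f, support, data):
    """Log conditional likelihood for masses f placed on the sorted distinct failure times.

    data: list of (left, failure_time, right) triplets (hypothetical layout).
    support: sorted list of distinct failure times.
    f: list of positive masses, one per support point, summing to one.
    """
    index = {t: j for j, t in enumerate(support)}
    total = 0.0
    for left, t, right in data:
        # Numerator: mass at the observed failure time.
        numer = f[index[t]]
        # Denominator: total mass inside the truncation interval [left, right].
        denom = sum(fj for tj, fj in zip(support, f) if left <= tj <= right)
        total += math.log(numer) - math.log(denom)
    return total

# Hypothetical data, not taken from the article.
data = [(0.0, 1.0, 2.5), (0.5, 2.0, 3.5), (1.5, 3.0, 4.0)]
support = sorted({t for _, t, _ in data})
f_uniform = [1.0 / len(support)] * len(support)
print(log_likelihood(f_uniform, support, data))
```

With equal masses, the three observations contribute the ratios 1/2, 1/3 and 1/2, so the printed value is the logarithm of 1/12.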

Depending on the particular dataset $\{(L_i, T_i, R_i): i = 1, \ldots, n\}$ observed, a nonparametric maximum likelihood estimate may or may not exist or be unique. A necessary and sufficient condition for its existence and uniqueness is given in Proposition 1. Define a directed graph $G$ with $n$ vertices, each representing an observation triplet $(L_i, T_i, R_i)$, such that there is a directed edge from vertex $i$ to vertex $j$ if and only if $T_j \in [L_i, R_i]$. A directed path is a sequence of edges that connects a sequence of vertices with all the edges in the same direction. A graph is strongly connected if, for any two vertices $i$ and $j$, there exists a directed path from $i$ to $j$ and a directed path from $j$ to $i$. From Vardi (1985), we have the following proposition.

Proposition 1.

There exists a unique nonparametric maximum likelihood estimate if and only if $G$ is strongly connected.

To illustrate Proposition 1, consider the two examples given in Table 1. Example (a) is from Efron & Petrosian (1999), and example (b) is a modification of the truncation interval $[L_7, R_7]$. The observations are ordered by failure times and there are no ties, so $K = n$ and $t_i = T_i$. The directed graph of the original data, shown in Fig. 1(a), is strongly connected; therefore by Proposition 1 a nonparametric maximum likelihood estimate exists and is unique. This is not the case for the modified data: as shown in Fig. 1(b), there is no directed path from vertex 7 to any of the other vertices, since no failure time other than $T_7$ is contained in $[L_7, R_7]$. Therefore, the directed graph of the modified data is not strongly connected.
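The graphical check in Proposition 1 can be sketched in a few lines of Python. The code below builds the directed graph described above (an edge from one observation to another whenever the second failure time lies in the first truncation interval) and tests strong connectivity by checking that one vertex reaches all others in both the graph and its reverse. The function names and the two small datasets are hypothetical; they mimic the contrast between examples (a) and (b) but are not the data of Table 1, and vertices are indexed from zero.

```python
def build_graph(data):
    """Adjacency lists: edge i -> j iff failure time j lies in the interval of i (i != j)."""
    n = len(data)
    return [
        [j for j in range(n) if j != i and data[i][0] <= data[j][1] <= data[i][2]]
        for i in range(n)
    ]

def reachable(adj, start):
    """Set of vertices reachable from start along directed paths (iterative search)."""
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def is_strongly_connected(adj):
    """True iff vertex 0 reaches every vertex in the graph and in its reverse."""
    n = len(adj)
    if n == 0:
        return True
    reverse = [[] for _ in range(n)]
    for u in range(n):
        for v in adj[u]:
            reverse[v].append(u)
    return len(reachable(adj, 0)) == n and len(reachable(reverse, 0)) == n

# Hypothetical triplets (left, failure_time, right), not the data of Table 1.
data_ok = [(0.0, 1.0, 2.5), (0.5, 2.0, 3.5), (1.5, 3.0, 4.0)]
# Same data, but the last interval no longer contains any other failure time.
data_bad = [(0.0, 1.0, 2.5), (0.5, 2.0, 3.5), (2.5, 3.0, 4.0)]

print(is_strongly_connected(build_graph(data_ok)))   # True
print(is_strongly_connected(build_graph(data_bad)))  # False
```

In the second dataset the last vertex has no outgoing edge, which is the same mechanism that breaks strong connectivity in example (b).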

Table 1. (a) Example from Efron & Petrosian (1999); (b) modified example

(a)

$i$	$T_i$	$[L_i, R_i]$	$i$	$T_i$	$[L_i, R_i]$
1	0.75	[0.4, 2]	5	2.25
2	1.05	[0.3, 1.4]	6	2.4
3	1.25	[0.8, 1.8]	7	2.5
4	1.5	[0, 2.3]

(b)

$i$	$T_i$	$[L_i, R_i]$	$i$	$T_i$	$[L_i, R_i]$
1	0.75		5	2.25
2	1.05		6	2.4
3	1.25		7	2.5
4	1.5

Fig. 1. — Directed graphs for the data in examples (a) and (b) in Table 1.

Furthermore, for the example in Table 1(b) we can show that a nonparametric maximum likelihood estimate does not exist. To see this, note that the likelihood is

Suppose, by way of contradiction, that there exists $\hat f$ which maximizes $L(f)$ subject to the constraints (2). Consider another estimate $\tilde f$ defined by

for some Inline graphic . It follows that and for . Furthermore,

where the inequality holds because Inline graphic . So $L(\tilde f) > L(\hat f)$, which contradicts the supposition that $\hat f$ maximizes $L(f)$. Thus a nonparametric maximum likelihood estimate does not exist.

The example in Table 1(b) is suggestive of another graphical condition which implies that a nonparametric maximum likelihood estimate does not exist. Define an undirected path in a graph to be a sequence of edges, taken irrespective of their direction, connecting an ordered list of distinct vertices, and define a graph $G$ to be connected if there exists an undirected path between each pair of vertices. The graph in Fig. 1(b) is connected but not strongly connected, and for these data a nonparametric maximum likelihood estimate does not exist. In fact, this relationship between connectedness and existence of a nonparametric maximum likelihood estimate is not a special case. The following proposition states that this graphical condition is in general sufficient for the nonexistence of a nonparametric maximum likelihood estimate. A proof is given in the Appendix.

Proposition 2.

If $G$ is connected but not strongly connected, then a nonparametric maximum likelihood estimate does not exist.

Proposition 1 concerns the scenario where $G$ is strongly connected. Proposition 2 addresses the case where $G$ is connected but not strongly connected. The following corollary covers the remaining possible situation, where $G$ is not connected. The first part follows from Proposition 1, upon noting that mass can be redistributed between the connected subgraphs without affecting the value of the likelihood. The second part follows from Proposition 2.

Corollary 1.

Suppose that $G$ is not connected and can be partitioned into two or more connected subgraphs. If each of these subgraphs is strongly connected, then a nonparametric maximum likelihood estimate exists but is not unique. Otherwise, if at least one connected subgraph is not strongly connected, then a nonparametric maximum likelihood estimate does not exist.
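Propositions 1 and 2 and Corollary 1 together cover every possible graph, so they can be combined into a single decision rule. The following Python sketch does this under the assumption that each observation is stored as a (left truncation time, failure time, right truncation time) triplet; the function name, return strings and example datasets are hypothetical and serve only as an illustration.

```python
def classify_npmle(data):
    """Classify existence and uniqueness of the estimate from the truncation graph."""
    n = len(data)
    if n == 0:
        raise ValueError("at least one observation is required")
    # Edge i -> j iff failure time j lies in the truncation interval of i.
    adj = [[j for j in range(n)
            if j != i and data[i][0] <= data[j][1] <= data[i][2]]
           for i in range(n)]
    rev = [[] for _ in range(n)]
    for u in range(n):
        for v in adj[u]:
            rev[v].append(u)
    # Undirected version of the graph, used to find connected subgraphs.
    und = [adj[u] + rev[u] for u in range(n)]

    def reach(graph, start):
        seen, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for v in graph[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    components, assigned = [], set()
    for v in range(n):
        if v not in assigned:
            comp = reach(und, v)
            assigned |= comp
            components.append(comp)

    for comp in components:
        root = min(comp)
        # A connected subgraph is strongly connected iff its root reaches all of
        # it both along the edges and against them.
        if reach(adj, root) != comp or reach(rev, root) != comp:
            return "does not exist"
    if len(components) == 1:
        return "exists and is unique"
    return "exists but is not unique"

# Hypothetical datasets, not taken from the article.
unique = [(0.0, 1.0, 2.5), (0.5, 2.0, 3.5), (1.5, 3.0, 4.0)]
missing = [(0.0, 1.0, 2.5), (0.5, 2.0, 3.5), (2.5, 3.0, 4.0)]
split = [(0.0, 1.0, 2.0), (0.5, 1.5, 2.0), (10.0, 11.0, 12.0), (10.5, 11.5, 12.0)]
for d in (unique, missing, split):
    print(classify_npmle(d))
```

The third dataset consists of two clusters whose intervals never contain each other's failure times, so the graph splits into two strongly connected subgraphs.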

3. Application to AIDS study

A study was conducted by the U.S. Centers for Disease Control and Prevention of individuals infected with HIV through blood transfusion and diagnosed with AIDS before 1 July 1986. Several datasets from this study have been analysed previously. Here we consider the dataset from Wang (1989), which contains data on Inline graphic children who were between zero and four years of age at the time of blood transfusion. Austin et al. (2014) recently analysed these data to estimate the distribution of age at onset of AIDS, $T$, which is assumed to be doubly truncated by age at contaminated blood transfusion on the left, $L$, and age as of 1 July 1986 on the right, $R$. Assume that . The solid line in Fig. 2, which is very similar to the Turnbull estimate from Fig. 1 in Austin et al. (2014), represents the Efron–Petrosian estimate for these data obtained using the R package DTDA (Moreira et al., 2010). The convergence criterion for the Efron–Petrosian iterative algorithm is that the absolute value in the change of Inline graphic between successive iterations should be less than 0.001 for , where for these data. At convergence, the loglikelihood was equal to .

Fig. 3. — Directed graph of the AIDS data.

Fig. 2. — AIDS study: Efron–Petrosian estimate (solid line) and the modified estimate (dashed line) described in § 3.

The directed graph $G$ for these data, given in Fig. 3, was constructed using the R package igraph (Csardi & Nepusz, 2006). The is.connected function from igraph, called once with mode "weak" and once with mode "strong", can be used to determine that the graph is connected but not strongly connected. Specifically, let and ; then for all and , implying that there is no directed path from any vertices in Inline graphic to any vertices in . Thus, according to Proposition 2, a nonparametric maximum likelihood estimate does not exist for these data.

Moreover, by the proof of Proposition 2 and similarly to the example in Table 1(b), we can construct an estimate $\tilde f$ which increases the loglikelihood relative to the Efron–Petrosian estimate $\hat f$. In particular, let if and otherwise, where , and . It is straightforward to show that . For example, when , . The corresponding estimate is shown by the dashed line in Fig. 2. While the values of the loglikelihood evaluated at the two estimates are close, differences in the estimates of $F$ are not trivial at certain time-points. For example, at the Efron–Petrosian estimate is approximately 0.86, whereas the modified estimate is approximately 1.00.

In either case, because the nonparametric maximum likelihood estimate does not exist for these data, both the Efron–Petrosian estimate and the modified estimate are potentially misleading. Both estimates place almost all of the mass on Inline graphic , corresponding to , , and . The accumulation of mass at these early time-points need not reflect a high true underlying risk of AIDS before age two; in fact, approximately two-thirds of the observed $T_i$ exceed two years. Rather, because $G$ is connected but not strongly connected, shifting mass from Inline graphic to will always increase the likelihood.

4. Remarks

Constructing the graph $G$ and assessing whether it is strongly connected, connected, or not connected is important to avoid misleading estimates when computing the nonparametric maximum likelihood estimator in the presence of double truncation. The proposition at the end of this section indicates that, under certain assumptions, $G$ will be strongly connected almost surely as $n \to \infty$, suggesting that such misleading estimates will only tend to occur in smaller samples.

In the setting of left-truncated data, Lawless (2003, § 3.5.1) suggested avoiding this ‘pathological behaviour’ by computing the nonparametric maximum likelihood estimate conditional on survival beyond a certain early time-point; but how to select such a time-point is unclear. Motivation for this approach arises from the necessity of satisfying Proposition 1. For example, consider the AIDS study data, for which Proposition 1 does not hold. If we condition on survival beyond the failure times of individuals Inline graphic , the graph constructed using only data from individuals is strongly connected and so Proposition 1 is satisfied. Therefore the conditional nonparametric maximum likelihood estimate exists and is unique, although the targeted estimand has now been redefined and data from individuals Inline graphic are inefficiently discarded.

To avoid nonexistence that arises when $G$ is connected but not strongly connected, one might consider instead replacing the strict inequality $f_j > 0$ in (2) with $f_j \ge 0$, so that the optimization problem then entails maximizing $L(f)$ over a closed, bounded set. However, the likelihood may not be well-defined for points on the boundary of this larger parameter space. For example, suppose we observe Inline graphic doubly truncated survival times with , and . Then the likelihood equals , which is undefined for and . Redefining the likelihood at the boundary by limits is also not possible in general. For example, for , the limit of as is , whereas the limit of as is . Thus the limit of $L(f)$ as Inline graphic does not exist.
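The two difficulties described above, a likelihood that keeps increasing inside the open parameter space and is undefined on its boundary, can be seen numerically in a very small case. The sketch below uses a hypothetical two-observation dataset (it is not the example from the text) in which the first truncation interval contains both failure times and the second contains only its own, so the graph is connected but not strongly connected. The function name and data are assumptions made for illustration.

```python
def likelihood(f, support, data):
    """Conditional likelihood for masses f on the sorted distinct failure times."""
    value = 1.0
    for left, t, right in data:
        numer = f[support.index(t)]
        denom = sum(fj for tj, fj in zip(support, f) if left <= tj <= right)
        value *= numer / denom
    return value

# Hypothetical triplets (left, failure_time, right).
data = [(0.0, 1.0, 3.0), (1.5, 2.0, 3.0)]
support = [1.0, 2.0]

# Move mass towards the first failure time while keeping both masses positive.
values = [likelihood([p, 1.0 - p], support, data) for p in (0.5, 0.9, 0.99, 0.999)]
print(values)

# On the boundary the second factor becomes 0/0, so the likelihood is undefined.
try:
    likelihood([1.0, 0.0], support, data)
except ZeroDivisionError:
    print("likelihood undefined at the boundary f = (1, 0)")
```

For these data the likelihood reduces to the mass on the first failure time, so it increases towards one without attaining a maximum under the strict positivity constraint.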

Another approach is to reparameterize the likelihood. For instance, continuing with the Inline graphic example, let where , and , so that the likelihood can be re-expressed as . This modified parameter likelihood is well-defined over the parameter space . The maximum of occurs at the boundary value , which corresponds to in the original parameterization, demonstrating that the potential for misleading estimates remains. Nonetheless, modifying the parameterization provides additional insight when the nonparametric maximum likelihood estimate does not exist. In particular, suppose that $G$ is connected but not strongly connected, and let denote a minimal strongly connected component, as defined in the Appendix. Consider replacing the parameters with the parameters and , assuming there exist at least two nodes not in . Then it follows from the proof of Proposition 2 that the modified parameter likelihood is maximized when Inline graphic , and the parameters are equal to the nonparametric maximum likelihood estimate for the dataset consisting only of data points for , which by Proposition 1 exists and is unique.

For the biased sampling model, Vardi (1985) and Gill et al. (1988) derived asymptotic properties of the nonparametric maximum likelihood estimator under an assumption which implies its existence and uniqueness with probability 1. Likewise, here derivation of large-sample properties such as consistency would seem to require assumptions which imply that $G$ is strongly connected almost surely as $n \to \infty$. A natural starting point is to consider conditions under which $F$ is identifiable. Shen (2010) states that $F$ is identifiable if and , where and for any distribution function , , and . However, without further assumptions, it is straightforward to construct examples where $F$ is not identifiable. For example, if is uniform on the interval , is uniform on the interval , is Bernoulli with mean , , , and the density of is positive on such that and , then $F$ is identified up to proportionality within and within but is not identified. Moreover, almost surely as $n \to \infty$, any graph $G$ corresponding to this data-generating mechanism can be partitioned into two subgraphs corresponding to observations with and observations with . The next proposition gives a set of sufficient conditions for the existence and uniqueness of the nonparametric maximum likelihood estimator with probability 1. A proof is given in the Appendix.

Proposition 3.

Assume that , and are continuous with positive densities on , and , respectively, , and there exists such that . Then $G$ is strongly connected almost surely as $n \to \infty$.

Acknowledgement

This work was partially supported by the U.S. National Institutes of Health. The authors thank the associate editor and a reviewer for helpful comments.

Appendix

Proof of Proposition 2.

The proof relies on results from graph theory (West, 1996). Define a strongly connected component of a directed graph $G$ to be a subgraph which is strongly connected and such that no additional adjacent edges or vertices in $G$ can be included in the subgraph with it remaining strongly connected. The collection $\mathcal{C}$ of strongly connected components partitions the set of vertices of $G$. For $C, C' \in \mathcal{C}$, let $C \preceq C'$ if there exists a directed path from a vertex in $C$ to a vertex in $C'$. The binary relation $\preceq$ is a partial order because it satisfies reflexivity, antisymmetry and transitivity. Therefore the set $\mathcal{C}$ is partially ordered. Since $\mathcal{C}$ is finite and nonempty, it has at least one minimal element, say $C_0$, such that no other element $C \in \mathcal{C}$ satisfies $C \preceq C_0$. Hence, for any $C \in \mathcal{C}$ where $C \ne C_0$, there is no directed path from a vertex in $C$ to a vertex in $C_0$. The vertices of $C_0$ must form a proper subset of the vertices in $G$, since $G$ is not strongly connected by assumption. Consequently, there must exist at least one directed path from a vertex in $C_0$ to a vertex not in $C_0$, because $G$ is connected by assumption.
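The objects used in this argument can be computed directly. The sketch below finds the strongly connected components of a directed graph given as adjacency lists and then returns the minimal ones, taken here to be the components that no other component has an edge into, which is how the paragraph above characterizes the minimal element. The function names and the three-vertex graph are hypothetical, and the simple reachability-based method is chosen for clarity rather than speed.

```python
def reachable(adj, start):
    """Set of vertices reachable from start along directed paths."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def strongly_connected_components(adj):
    """Components as sorted vertex lists; two vertices share one iff each reaches the other."""
    n = len(adj)
    rev = [[] for _ in range(n)]
    for u in range(n):
        for v in adj[u]:
            rev[v].append(u)
    assigned = [False] * n
    comps = []
    for v in range(n):
        if not assigned[v]:
            # Vertices that v reaches and that also reach v.
            comp = reachable(adj, v) & reachable(rev, v)
            for u in comp:
                assigned[u] = True
            comps.append(sorted(comp))
    return comps

def minimal_components(adj):
    """Components with no incoming edge from any other component."""
    comps = strongly_connected_components(adj)
    label = {u: k for k, comp in enumerate(comps) for u in comp}
    has_incoming = [False] * len(comps)
    for u in range(len(adj)):
        for v in adj[u]:
            if label[u] != label[v]:
                has_incoming[label[v]] = True
    return [comp for k, comp in enumerate(comps) if not has_incoming[k]]

# Hypothetical graph: vertices 0 and 1 point at each other, and 1 points at 2.
adj = [[1], [0, 2], []]
print(strongly_connected_components(adj))  # [[0, 1], [2]]
print(minimal_components(adj))             # [[0, 1]]
```

In this small graph the minimal component has an edge leaving it but none entering it, matching the two properties established at the end of the paragraph.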

Let be the subset of vertices in with directed edges to vertices not in . Let denote the set of vertices that are in but not in . Define analogously. Then the likelihood (1) can be expressed as where

Note that may be empty, in which case we let . On the other hand, and must be nonempty by the arguments above.

The proof of the proposition proceeds by showing that for any estimate satisfying (2), another estimate can be constructed which also satisfies (2) and is such that . In particular, let with

where , and . By the construction of it follows that

Next, note that if , then ; otherwise, because there are no directed edges between vertices in and vertices in ,

Finally,

where and the inequality holds because for and . Thus , and , implying . □

Proof of Proposition 3.

Let be a partition of where . Consider two arbitrary vertices $i$ and $j$ of $G$ and assume without loss of generality that . Under the proposition’s assumptions, as $n \to \infty$ there exist almost surely such that and for , so there will be a directed path of the form for some . Likewise, it can be shown that a directed path will almost surely exist from $j$ to $i$. □

References

Austin, M. D., Simon, D. K. & Betensky, R. A. (2014). Computationally simple estimation and improved efficiency for special cases of double truncation. Lifetime Data Anal. 20, 335–54.
Csardi, G. & Nepusz, T. (2006). The igraph software package for complex network research. Inter J. Complex Syst., article no. 1695.
Efron, B. & Petrosian, V. (1999). Nonparametric methods for doubly truncated data. J. Am. Statist. Assoc. 94, 824–34.
Gill, R. D., Vardi, Y. & Wellner, J. A. (1988). Large sample theory of empirical distributions in biased sampling models. Ann. Statist. 16, 1069–112.
Lawless, J. F. (2003). Statistical Models and Methods for Lifetime Data. Hoboken, New Jersey: John Wiley & Sons, 2nd ed.
Lloyd-Ronning, N. M., Fryer, C. L. & Ramirez-Ruiz, E. (2002). Cosmological aspects of gamma-ray bursts: Luminosity evolution and an estimate of the star formation rate at high redshifts. Astrophys. J. 574, 554–65.
Moreira, C., de Uña-Álvarez, J. & Crujeiras, R. (2010). DTDA: An R package to analyze randomly truncated data. J. Statist. Software 37, 1–20.
R Development Core Team (2019). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria: ISBN 3-900051-07-0 http://www.R-project.org.
Shen, P. (2010). Nonparametric analysis of doubly truncated data. Ann. Inst. Statist. Math. 62, 835–53.
Turnbull, B. W. (1976). The empirical distribution function with arbitrarily grouped, censored and truncated data. J. R. Statist. Soc. B 38, 290–5.
Vardi, Y. (1985). Empirical distributions in selection bias models. Ann. Statist. 13, 178–203.
Wang, M. C. (1989). A semiparametric model for randomly truncated data. J. Am. Statist. Assoc. 84, 742–8.
West, D. B. (1996). Introduction to Graph Theory. Upper Saddle River, New Jersey: Prentice-Hall.
