Biometrika. 2019 May 13;106(3):724–731. doi: 10.1093/biomet/asz023

Nonidentifiability in the presence of factorization for truncated data

B Vakulenko-Lagun, J Qian, S H Chiou, R A Betensky
PMCID: PMC6690171  PMID: 31427826

Summary

A time to event, $X$, is left-truncated by $T$ if $X$ can be observed only if $T < X$. This often results in oversampling of large values of $X$, and necessitates adjustment of estimation procedures to avoid bias. Simple risk-set adjustments can be made to standard risk-set-based estimators to accommodate left truncation when $T$ and $X$ are quasi-independent. We derive a weaker factorization condition for the conditional distribution of $T$ given $X$ in the observable region that permits risk-set adjustment for estimation of the distribution of $X$, but not of the distribution of $T$. Quasi-independence results when the analogous factorization condition for $X$ given $T$ holds also, in which case the distributions of $X$ and $T$ are easily estimated. While we can test for factorization, if the test does not reject, we cannot identify which factorization condition holds, or whether quasi-independence holds. Hence we require an unverifiable assumption in order to estimate the distribution of $X$ or $T$ based on truncated data. This contrasts with the common understanding that truncation is different from censoring in requiring no unverifiable assumptions for estimation. We illustrate these concepts through a simulation of left-truncated and right-censored data.

Keywords: Constant-sum condition, Kendall’s tau, Left truncation, Right censoring, Survival data

1. Introduction

Truncated survival data arise when observation of the time to event, Inline graphic, occurs only when it falls within a subject-specific interval. Left truncation occurs when Inline graphic is observed only if Inline graphic, where Inline graphic is the time to sampling, i.e., the truncation variable. It often arises in longitudinal cohort studies in which a subcohort is sampled on the basis of having had a post-baseline assessment prior to the event of interest. Another example is when the time origin of interest, such as onset of cognitive impairment, may occur prior to entry into the cohort, and the endpoint of interest is time from onset of cognitive impairment to death. Estimation must account for the truncation to avoid bias due to the selection based on the magnitude of Inline graphic. A critical condition that enables simple risk-set adjustment to standard risk-set-based estimators (Lynden-Bell, 1971; Woodroofe, 1985; Wang et al., 1986) was described by Tsai (1990) as quasi-independence, or independence in the observed region, i.e., Inline graphic. In particular, this means that Inline graphic, where Inline graphic, Inline graphic, Inline graphic, Inline graphic, and Inline graphic if the event Inline graphic holds and 0 otherwise. For simplicity of presentation, we assume that Inline graphic and Inline graphic are continuous random variables and that Inline graphic, Inline graphic and Inline graphic have densities Inline graphic and Inline graphic with respect to Lebesgue measure. Nonetheless, all of our results apply also to discrete random variables, as they are based on nonparametric maximum likelihoods (Vardi, 1989). The quasi-independence assumption expressed in terms of densities is

graphic file with name M38.gif (1)

This does not imply that the sampled random variables are conditionally independent given Inline graphic; it is not equivalent to Inline graphic for Inline graphic, where we use the notation Inline graphic as a shorthand for Inline graphic.
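The two effects described above, oversampling of large event times and dependence among the sampled pairs even when the population variables are independent, can be seen in a small simulation. This is a minimal sketch under assumptions that are not taken from the paper: the time to event and the truncation time are written `x` and `t`, both are drawn independently from a unit exponential distribution, and a pair is retained only when `t < x`. In this toy population the sampled pair is the minimum and maximum of two independent exponentials, so the theoretical values quoted in the comments follow from that fact.

```python
import numpy as np

rng = np.random.default_rng(0)
n_population = 200_000

# Assumed toy population (not the paper's design): X and T independent Exp(1).
x = rng.exponential(1.0, n_population)   # time to event
t = rng.exponential(1.0, n_population)   # truncation time

sampled = t < x                          # a pair is observed only if T < X
x_obs, t_obs = x[sampled], t[sampled]

population_mean = x.mean()                       # theory: E(X) = 1
sampled_mean = x_obs.mean()                      # theory: E(X | T < X) = 1.5
sampled_corr = np.corrcoef(x_obs, t_obs)[0, 1]   # theory: 1 / sqrt(5)

print(f"mean of X in the population:    {population_mean:.3f}")
print(f"mean of X among sampled pairs:  {sampled_mean:.3f}")
print(f"correlation among sampled pairs: {sampled_corr:.3f}")
```

The sampled mean exceeds the population mean because large event times are more likely to satisfy the sampling condition, and the sampled pairs are positively correlated although the population variables are independent, which is why quasi-independence is a statement about factorization on the observable region rather than about conditional independence of the sampled variables.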

Examination of the likelihood based on left-truncated data elucidates the simplification in estimation that arises from quasi-independence, and reveals that weaker conditions also admit this simplification for estimation of the distribution of Inline graphic or Inline graphic, but not of both. The factorization condition that enables estimation of the distribution of Inline graphic is

graphic file with name M47.gif (2)

where Inline graphic need not equal Inline graphic and is defined on the support of Inline graphic in the observable region Inline graphic. In the unobservable region, we define Inline graphic for Inline graphic. For Inline graphic in the support of Inline graphic, Inline graphic and Inline graphic are constrained by

graphic file with name M58.gif

In the Supplementary Material we derive an explicit expression for Inline graphic as a function of Inline graphic, Inline graphic and Inline graphic, under the factorization condition (2). We also discuss the special cases of overall independence and quasi-independence. The factorization condition is similar to the condition of Keiding (1992) that Inline graphic, upon identifying Inline graphic as Inline graphic. The factorization condition is reminiscent of the constant-sum condition for right-censored data (Williams & Lagakos, 1977; Betensky, 2000), under which dependent censoring can be ignored and the Kaplan–Meier estimator is valid.

Proposition 4 shows that the distribution of Inline graphic is not identifiable under (2) alone; it also requires a complementary factorization condition. The two factorization conditions together constitute quasi-independence (1), under which both distributions can be estimated. We explain in § 3 that the observed data can be used to test whether neither factorization condition holds, but cannot be used to identify which condition holds if either does. Therefore, we require unverifiable assumptions in order to estimate the distribution of Inline graphic or Inline graphic based on truncated data. This contrasts with the common understanding that truncation is distinct from censoring and requires no unverifiable assumptions for estimation.

2. Nonparametric likelihood estimation

2.1. Estimation in the absence of censoring

First we consider estimation in the absence of right censoring. The likelihood of the observed data Inline graphic under left truncation and no censoring is

graphic file with name M70.gif (3)

where

graphic file with name M71.gif

Proposition 1.

Under the factorization condition (2), the nonparametric maximum likelihood estimator of Inline graphic is the risk-set-adjusted Kaplan–Meier estimator

Proposition 1. (4)

Proof.

Under (2), Inline graphic is equal to 1 since its denominator is equal to its numerator, Inline graphic:

Proof.

Thus, Inline graphic effectively constitutes the full likelihood, with unknown parameters Inline graphic and Inline graphic. The standard risk-set-adjusted Kaplan–Meier estimator given by (4) is the maximum likelihood estimator of Inline graphic based on Inline graphic, and also equals that based on Inline graphic (Wang, 1991). This is because, in the absence of parametric assumptions on Inline graphic, Inline graphic is a multinomial likelihood with maximum value Inline graphic if there are no ties in Inline graphic, which is attained when each factor in its product is set to the corresponding sample proportion. A similar argument holds in the presence of ties. □

Since (4) is the maximizer of Inline graphic, if it also maximizes the full likelihood (3), then Inline graphic must be constant with respect to Inline graphic. If this latter condition implies factorization (2), then it would follow that (2) is a necessary condition for (4) to be the nonparametric maximum likelihood estimator of Inline graphic. We conjecture that this is false.

Under complete independence between Inline graphic and Inline graphic, (4) was shown to be uniformly consistent by Woodroofe (1985). Since the likelihoods that contribute to estimation of Inline graphic are identical and equal to Inline graphic under any of the three conditions of complete independence between Inline graphic and Inline graphic, quasi-independence (1), or factorization (2), the uniform consistency of (4) under (1) or (2) can be proved in the same way as under complete independence between Inline graphic and Inline graphic.
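To make the risk-set adjustment concrete, the following sketch implements a standard form of the product-limit estimator for left-truncated data without censoring, in which a subject is at risk at an event time only if it has already entered, i.e., its truncation time does not exceed that event time. It is an illustration rather than a transcription of (4): the function name `truncation_adjusted_km`, the variable names, the assumption of no tied event times, and the toy data (independent unit exponentials, the simplest case in which the factorization condition holds) are all assumptions made here.

```python
import numpy as np

def truncation_adjusted_km(x, t, grid):
    """Risk-set-adjusted product-limit estimate of the survival function of X.

    x, t : observed event and truncation times with t[i] < x[i]; no censoring
           and no tied event times (assumed for simplicity).
    grid : array of time points at which the estimate is evaluated.
    """
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    xs = np.sort(x)
    entered = np.searchsorted(np.sort(t), xs, side="right")  # #{j: t_j <= x_i}
    exited = np.arange(len(xs))                               # #{j: x_j <  x_i}
    at_risk = entered - exited                                # #{j: t_j <= x_i <= x_j}
    surv = np.cumprod(1.0 - 1.0 / at_risk)                    # one event per event time
    n_events = np.searchsorted(xs, grid, side="right")        # #{i: x_i <= s}
    return np.concatenate(([1.0], surv))[n_events]

# Toy check under an assumed design: X and T independent Exp(1), keep T < X.
rng = np.random.default_rng(1)
x = rng.exponential(1.0, 20_000)
t = rng.exponential(1.0, 20_000)
keep = t < x

grid = np.array([0.5, 1.0, 2.0])
adjusted = truncation_adjusted_km(x[keep], t[keep], grid)
naive = np.array([(x[keep] > s).mean() for s in grid])   # ignores truncation
truth = np.exp(-grid)

print("truth   ", np.round(truth, 3))
print("adjusted", np.round(adjusted, 3))
print("naive   ", np.round(naive, 3))
```

The naive empirical survival function of the sampled event times overstates survival, whereas the adjusted estimate tracks the population curve. Because the smallest truncation times in this toy design are close to zero, the estimate can be compared with the marginal survival function directly; in general it targets the distribution conditional on exceeding the smallest truncation time.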

The likelihood (3) can also be expressed as Inline graphic where

graphic file with name M100.gif

A complementary factorization condition to (2) for Inline graphic given Inline graphic is

graphic file with name M103.gif (5)

where Inline graphic need not equal Inline graphic. Conditions (2) and (5) together are equivalent to quasi-independence, as stated in the following proposition.

Proposition 2.

Under conditions (2) and (5), Inline graphic and Inline graphic, which implies quasi-independence (1). Conversely, quasi-independence (1) implies both (2) and (5).

Proof.

Under (2) and (5), Inline graphic, implying Inline graphic and Inline graphic, i.e., quasi-independence. Under (1), Inline graphic, implying (2) and (5) with Inline graphic and Inline graphic. □

Under (5), the likelihood for estimation of Inline graphic effectively reduces to Inline graphic, and its estimation is dual to that of Inline graphic (Wang, 1991). This is summarized in the following proposition.

Proposition 3.

Under (5), the nonparametric maximum likelihood estimator of Inline graphic is

Proposition 3.

Proof.

This follows from Proposition 1 via reversal of time, by treating Inline graphic as left-truncated by Inline graphic. □
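The time-reversal argument can be sketched numerically. Reversing time turns the product over event times into a product over truncation times larger than the evaluation point, with the same risk sets. The code below is a hedged illustration of that standard dual estimator for uncensored data, not a transcription of the display in Proposition 3; the function name `truncation_time_cdf`, the no-ties assumption, and the toy data (independent unit exponentials) are assumptions introduced here.

```python
import numpy as np

def truncation_time_cdf(x, t, grid):
    """Product-limit estimate of the distribution function of T.

    x, t : observed event and truncation times with t[i] < x[i]; no censoring
           and no tied truncation times (assumed for simplicity).
    grid : array of time points at which the estimate is evaluated.
    """
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    ts = np.sort(t)
    entered = np.arange(1, len(ts) + 1)                     # #{i: t_i <= t_j}
    exited = np.searchsorted(np.sort(x), ts, side="left")   # #{i: x_i <  t_j}
    at_risk = entered - exited                              # #{i: t_i <= t_j <= x_i}
    factors = 1.0 - 1.0 / at_risk
    # tail[k] is the product of factors[j] over j >= k; the empty product is 1.
    tail = np.append(np.cumprod(factors[::-1])[::-1], 1.0)
    n_before = np.searchsorted(ts, grid, side="right")      # #{j: t_j <= s}
    return tail[n_before]                                   # product over t_j > s

# Toy check under an assumed design: X and T independent Exp(1), keep T < X.
rng = np.random.default_rng(2)
x = rng.exponential(1.0, 40_000)
t = rng.exponential(1.0, 40_000)
keep = t < x

grid = np.array([0.5, 1.0, 2.0])
estimate = truncation_time_cdf(x[keep], t[keep], grid)
naive = np.array([(t[keep] <= s).mean() for s in grid])   # ignores truncation
truth = 1.0 - np.exp(-grid)

print("truth   ", np.round(truth, 3))
print("estimate", np.round(estimate, 3))
print("naive   ", np.round(naive, 3))
```

In this toy design the variables are fully independent, so both factorization conditions hold and the estimate recovers the population distribution of the truncation time. The sketch does not by itself show what happens when only one of the two conditions holds.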

Propositions 1–3 lead to the following corollary.

Corollary 1.

Quasi-independence yields the standard risk-set-adjusted estimators of the distributions of both the time to event $X$ and the truncation time $T$ as the nonparametric maximum likelihood estimators.

Assumptions (1), (2) and (5) are indistinguishable given the observed data. We formalize this conclusion in Propositions 4 and 5 and Corollary 2. Proposition 4 shows that the likelihood under (2) is equivalent to that under (1), implying that these conditions cannot be distinguished. Proposition 5 shows that (5) and (1) cannot be distinguished.

Proposition 4.

Assuming factorization (2), quasi-independence (1) cannot be determined from the observed data. As a consequence, while Inline graphic is identifiable under (2), Inline graphic is not.

Proof.

This follows from the equivalence of the likelihood functions under quasi-independence (1) and factorization (2). Under the latter, the likelihood (3) is

Proof.

Under quasi-independence (1), the likelihood (3) is

Proof.

Since Inline graphic is defined only on the observable region, it is unique only up to a constant factor. Assuming that Inline graphic and Inline graphic have positive mass at the observed times Inline graphic only, but not assuming their functional forms, the contributions to the likelihood from Inline graphic are identical under (1) and (2). Hence Inline graphic is nonidentifiable from the data.

Proposition 5.

Assuming factorization (5), quasi-independence (1) cannot be determined from the observed data. Thus, while Inline graphic is identifiable under (5), Inline graphic is not.

Corollary 2.

Quasi-independence (1) cannot be distinguished from the factorization condition (2) only, or from the factorization condition (5) only, based on the observed data.

2.2. Estimation under right censoring

The nonidentifiability problem persists in the presence of right censoring. There are two practical models for right censoring in the presence of left truncation (Qian & Betensky, 2014): one is on the residual time scale, i.e., censoring of $X - T$, where $X$ is the time to event and $T$ is the truncation time, and the other is on the original time scale, i.e., censoring of $X$. We extend the likelihood decomposition (3) to accommodate these models.

We first consider the independent residual censoring assumption. Suppose that Inline graphic is a residual censoring time such that Inline graphic, where Inline graphic denotes independence, and that censoring of Inline graphic occurs at Inline graphic, the total censoring time starting from the time origin. The observed data then comprise Inline graphic, Inline graphic and Inline graphic, where Inline graphic if Inline graphic and Inline graphic if Inline graphic. This model is appropriate when censoring occurs only after entry into the study. The likelihood contribution for an uncensored observation is the same as that in (3):

graphic file with name M149.gif

where the final relation follows from Inline graphic and the noninformativeness of the distribution of Inline graphic for that of Inline graphic. The contribution for a censored observation is

graphic file with name M153.gif

The probability Inline graphic can be expressed as

graphic file with name M155.gif

where the first term is unity under the factorization condition (2).

We next derive the likelihood decomposition under the censoring scheme on the original time scale, where Inline graphic is measured from the time origin, with Inline graphic assumed, and Inline graphic as in Tsai (1990). The condition Inline graphic ensures that censoring can occur only for the sampled individuals. The overall likelihood under this censoring scheme equals that under the residual censoring model, given the assumed noninformativeness of Inline graphic given Inline graphic for Inline graphic and that of Inline graphic given Inline graphic. Thus, under both models for censoring, the overall likelihood for left-truncated and right-censored data is

graphic file with name M165.gif

where

graphic file with name M166.gif

Under (2), Inline graphic. As in the uncensored case, Inline graphic is the only component of the likelihood that contributes to estimation of Inline graphic by the risk-set-adjusted Kaplan–Meier estimator (Wang, 1991)

graphic file with name M170.gif (6)

As in Proposition 4, under (2) the data cannot inform whether Inline graphic equals Inline graphic or Inline graphic. Nonetheless, the nonparametric maximum likelihood estimator of Inline graphic, assuming that it is a distribution function, is

graphic file with name M175.gif (7)

In the setting of independent Inline graphic and Inline graphic with Inline graphic, (7) estimates Inline graphic (Wang, 1991). Under factorization (2) without quasi-independence, (7) maximizes the likelihood Inline graphic given Inline graphic and estimates a normalized version of Inline graphic and not Inline graphic. Under factorization (5) without quasi-independence, an alternative decomposition of the likelihood with similar arguments yields the analogous result for estimation of Inline graphic and Inline graphic, as shown in the Supplementary Material.

3. Testing for the factorization condition

A statistic commonly used to test for quasi-independence is the conditional Kendall’s tau (Tsai, 1990; Martin & Betensky, 2005). In the presence of censoring, this is defined as Inline graphic, where Inline graphic and Inline graphic denotes the event that the pair Inline graphic is comparable and orderable. A consistent estimator of Inline graphic is the basis of a test for the null hypothesis of (2) or (5), versus the alternative of neither (2) nor (5). This is justified by Inline graphic under (2) or (5). We derive this for (2); the calculations are similar under (5):

graphic file with name M192.gif (8)

Under the residual censoring model and factorization condition (2), and upon defining Inline graphic as the survival function of Inline graphic, we can express Inline graphic as

graphic file with name M196.gif

The second term of (8) can be expressed similarly, and the remaining two terms are trivially equivalent to the first two terms upon relabelling the indices. A similar result applies under the original-scale censoring model. This demonstrates that Inline graphic under either factorization condition, and the conditional Kendall’s tau provides a valid test for the null of either (2) or (5). If the test does not reject the null hypothesis, then under (2), Proposition 4 states that Inline graphic cannot be distinguished from Inline graphic in the observable region, and so quasi-independence cannot be distinguished from factorization. This holds for any test of factorization in the absence of external information.
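A sample version of the conditional Kendall's tau can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code or the exact statistic of Martin & Betensky (2005): a pair is taken to be comparable when the larger truncation time does not exceed the smaller observed time, and orderable when the smaller observed time is an event, so that the order of the two event times is known. The function name `conditional_kendall_tau`, the variable names, and the toy data (independent exponentials with independent residual censoring) are assumptions, and no variance estimate or p-value is computed.

```python
import numpy as np

def conditional_kendall_tau(t, y, delta=None):
    """Average concordance sign over comparable and orderable pairs.

    t     : truncation times
    y     : observed times, the minimum of the event and censoring times
    delta : event indicators (True if y is an event); defaults to all events
    Returns the estimate and the number of pairs that were used.
    """
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    delta = np.ones(len(y), dtype=bool) if delta is None else np.asarray(delta, dtype=bool)

    i, j = np.triu_indices(len(y), k=1)                 # all pairs with i < j
    comparable = np.maximum(t[i], t[j]) <= np.minimum(y[i], y[j])
    earlier_is_event = np.where(y[i] <= y[j], delta[i], delta[j])
    usable = comparable & earlier_is_event

    signs = np.sign((y[i] - y[j]) * (t[i] - t[j]))
    return signs[usable].mean(), int(usable.sum())

# Toy check under an assumed design: independent X and T, residual censoring.
rng = np.random.default_rng(3)
x = rng.exponential(1.0, 2_000)
t = rng.exponential(1.0, 2_000)
keep = t < x
x_obs, t_obs = x[keep], t[keep]

c = rng.exponential(2.0, keep.sum())        # residual censoring time
y_obs = np.minimum(x_obs, t_obs + c)
delta_obs = x_obs <= t_obs + c

tau_hat, n_pairs = conditional_kendall_tau(t_obs, y_obs, delta_obs)
print(f"conditional tau estimate: {tau_hat:.3f} from {n_pairs} pairs")
```

Under this toy design the estimate should be close to zero, in line with the argument above. An estimate near zero is compatible with quasi-independence and with either factorization condition, so it does not indicate which of them holds.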

4. Simulation

We conducted a simulation study to illustrate empirically that the factorization condition (2) alone, without quasi-independence (1), is sufficient for the validity of the risk-set-adjusted Kaplan–Meier estimator for the distribution of Inline graphic, as stated in Proposition 1. We also demonstrate that Kendall’s tau yields a valid test of the factorization condition even in the absence of quasi-independence. Finally, we illustrate Proposition 4, that under (2) without the complementary condition (5), we may not be able to estimate the truncation distribution Inline graphic. Let

graphic file with name M202.gif

where Inline graphic and we set Inline graphic. It follows that Inline graphic, Inline graphic, and Inline graphic if Inline graphic and Inline graphic if Inline graphic. Since Inline graphic, quasi-independence does not hold. We generated right censoring through an independent residual censoring time Inline graphic. Each sample consisted of Inline graphic triples Inline graphic. This yielded 88% truncation and 30% censoring based on 1000 replications.

Our first aim is to check the validity of the risk-set-adjusted Kaplan–Meier estimator of the conditional distribution, Inline graphic. The full marginal Inline graphic is not estimable because there is no information for Inline graphic. Figure 1(a) displays the averaged adjusted Kaplan–Meier estimate, which is indistinguishable from its target, confirming that the adjusted Kaplan–Meier estimator is valid under condition (2) and does not require the stronger condition (1). We also applied the conditional Kendall’s tau test of Martin & Betensky (2005) and obtained an estimated Type I error of 0.041, which supports the validity of the test for either factorization condition even in the absence of quasi-independence. Figure 1(b) shows that the estimator (7) estimates Inline graphic and not Inline graphic, as expected from Proposition 4.

Fig. 1.


Simulation results for the estimation of: (a) Inline graphic, with a grey solid line depicting the true curve and a black dashed line depicting the average of Kaplan–Meier estimates; (b) Inline graphic, with a grey solid line depicting the true Inline graphic, a black dotted line depicting the average of the estimates of Wang (1991), and a black dashed line depicting Inline graphic.

5. Discussion

We have shown that the commonly accepted requirement of quasi-independence of Inline graphic and Inline graphic is stronger than the factorization condition (2) that is actually needed for nonparametric estimation of the distribution of Inline graphic. While we can test for factorization, the observed data do not allow us to distinguish between quasi-independence (1) and the two factorization conditions (2) and (5). This highlights an identification problem that has not been recognized in the literature; an unverifiable assumption is therefore required in order to estimate the distribution of Inline graphic based on truncated data. In some observational studies, the origin may be observed for all subjects and the delayed study entry time may be externally determined, such as by calendar date. In this case, Inline graphic is known for the whole population and so Inline graphic is known. If factorization is not rejected via Kendall’s tau test, knowledge of Inline graphic enables the factorization condition (2) to be distinguished from quasi-independence (1) and the factorization condition (5). In particular, if factorization holds and Inline graphic does not estimate Inline graphic, it follows from Proposition 3 that condition (5) does not hold, which means that (2) must hold and, importantly, we can estimate Inline graphic.


Acknowledgement

The first two authors contributed equally. We thank Micha Mandel and Richard Cook for helpful comments. We acknowledge funding from the U.S. National Institutes of Health.

Supplementary material

Supplementary material available at Biometrika online includes the derivation of an explicit expression for Inline graphic given Inline graphic, Inline graphic and Inline graphic under factorization condition (2); it also contains a proof that under (5) and for both censoring models, although Inline graphic is nonidentifiable, Inline graphic is identifiable and its nonparametric maximum likelihood estimator is similar to the estimator (7).

References

  1. Betensky, R. A. (2000). On nonidentifiability and noninformative censoring for current status data. Biometrika 87, 218–21.
  2. Keiding, N. (1992). Independent delayed entry. In Survival Analysis: State of the Art, J. P. Klein & P. K. Goel, eds. Dordrecht: Springer, pp. 309–26.
  3. Lynden-Bell, D. (1971). A method of allowing for known observational selection in small samples applied to 3CR quasars. Mon. Not. R. Astron. Soc. 155, 95–118.
  4. Martin, E. C. & Betensky, R. A. (2005). Testing quasi-independence of failure and truncation times via conditional Kendall’s tau. J. Am. Statist. Assoc. 100, 484–92.
  5. Qian, J. & Betensky, R. A. (2014). Assumptions regarding right censoring in the presence of left truncation. Statist. Prob. Lett. 87, 12–7.
  6. Tsai, W.-Y. (1990). Testing the assumption of independence of truncation time and failure time. Biometrika 77, 169–77.
  7. Vardi, Y. (1989). Multiplicative censoring, renewal processes, deconvolution and decreasing density: Nonparametric estimation. Biometrika 76, 751–61.
  8. Wang, M.-C. (1991). Nonparametric estimation from cross-sectional data. J. Am. Statist. Assoc. 86, 130–43.
  9. Wang, M.-C., Jewell, N. P. & Tsai, W.-Y. (1986). Asymptotic properties of the product limit estimate under random truncation. Ann. Statist. 14, 1597–605.
  10. Williams, J. S. & Lagakos, S. W. (1977). Models for censored survival analysis: Constant-sum and variable-sum models. Biometrika 64, 215–24.
  11. Woodroofe, M. (1985). Estimating a distribution function with truncated data. Ann. Statist. 13, 163–77.


Articles from Biometrika are provided here courtesy of Oxford University Press
