Abstract
The problem of nonparametric inference on a monotone function has been extensively studied in many particular cases. Estimators considered have often been of so-called Grenander type, being representable as the left derivative of the greatest convex minorant or least concave majorant of an estimator of a primitive function. In this paper, we provide general conditions for consistency and pointwise convergence in distribution of a class of generalized Grenander-type estimators of a monotone function. This broad class allows the minorization or majorization operation to be performed on a data-dependent transformation of the domain, possibly yielding benefits in practice. Additionally, we provide simpler conditions and more concrete distributional theory in the important case that the primitive estimator and data-dependent transformation function are asymptotically linear. We use our general results in the context of various well-studied problems, and show that we readily recover classical results established separately in each case. More importantly, we show that our results allow us to tackle more challenging problems involving parameters for which the use of flexible learning strategies appears necessary. In particular, we study inference on monotone density and hazard functions using informatively right-censored data, extending the classical work on independent censoring, and on a covariate-marginalized conditional mean function, extending the classical work on monotone regression functions.
Keywords: Cube-root asymptotics, isotonic regression, dependent censoring, g-formula
1. Introduction
1.1. Background
In many scientific settings, investigators are interested in learning about a function known to be monotone, either due to probabilistic constraints or in view of existing scientific knowledge. The statistical treatment of nonparametric monotone function estimation has a long and rich history. Early on, Grenander (1956) derived the nonparametric maximum likelihood estimator (NPMLE) of a monotone density function, now commonly referred to as the Grenander estimator. Since then, monotone estimators of many other parameters, including hazard and regression functions, have been proposed and studied.
In the literature, most monotone function estimators have been constructed via empirical risk minimization. Specifically, these are obtained by minimizing the empirical risk over the space of nondecreasing or nonincreasing candidate functions based on an appropriate loss function. The theoretical study of these estimators has often hinged strongly on their characterization as empirical risk minimizers. This is the case, for example, for the asymptotic theory developed by Prakasa Rao (1969) and Prakasa Rao (1970) for the NPMLE of monotone density and hazard functions, respectively, and by Brunk (1970) for the least-squares estimator of a monotone regression function. Kim and Pollard (1990) unified the study of these various estimators by studying the argmin process typically driving the pointwise distributional theory of monotone empirical risk minimizers.
Many of the parameters treated in the literature on monotone function estimation can be viewed as an index of the statistical model, in the sense that the model space is in bijection with the product space corresponding to the parameter of interest and an additional variation-independent parameter. In such cases, identifying an appropriate loss function is often easy, and a risk minimization representation is therefore usually available. However, when the parameter of interest is a complex functional of the data-generating mechanism, an appropriate loss function may not be readily available. This occurs often, for example, when identification of the parameter of interest based on the observed data distribution requires adjustment for sampling complications (e.g., informative treatment attribution, missing data or loss to follow-up). It is thus imperative to develop and study estimation methods that do not rely upon risk minimization.
It is a simple fact that the primitive of a nondecreasing function is convex. This observation serves as motivation to consider as an estimator of the function of interest the derivative of the greatest convex minorant (GCM) of an estimator of its primitive function. In the literature on monotone function estimation, many estimators obtained as empirical risk minimizers can alternatively be represented as the left derivative of the GCM of some primitive estimator. This is because the definition of the GCM is intimately tied to the necessary and sufficient conditions for optimization of certain risk functionals over the convex cone of monotone functions (see, e.g., Chapter 2 of Groeneboom and Jongbloed (2014)). In particular, Grenander’s NPMLE of a monotone density equals the left derivative of the GCM of the empirical distribution function. In the recent literature, estimators obtained in this fashion have thus been referred to as being of Grenander-type. Leurgans (1982) is an early example of a general study of Grenander-type estimators for a class of regression problems.
In a seminal paper, Groeneboom (1985) introduced an approach to studying GCMs based on an inversion operation. This approach has facilitated the theoretical study of certain Grenander-type estimators without the need to utilize their representation as empirical risk minimizers. For example, under the assumption of independent right-censoring, Huang and Wellner (1995) used this approach to derive large-sample properties of a monotone hazard function estimator obtained by differentiating the GCM of the Nelson–Aalen estimator of the cumulative hazard function. This general strategy was also used by van der Vaart and van der Laan (2006), who derived and studied an estimator of a covariate-marginalized survival curve based on current-status data, including possibly high-dimensional and time-varying covariates. More recently, there has been interest in deriving general results for Grenander-type estimators applicable to a variety of cases. For instance, Anevski and Hössjer (2006) derived pointwise distributional limit results for Grenander-type estimators in a very general setting including, in particular, dependent data. Durot (2007), Durot, Kulikov and Lopuhaä (2012) and Lopuhaä and Musta (2018a) derived limit results for the estimation error of Grenander-type estimators under Lp, supremum and Hellinger norms, respectively. Durot, Groeneboom and Lopuhaä (2013) studied the problem of testing the equality of generic monotone functions with K independent samples. Durot and Lopuhaä (2014), Beare and Fang (2017) and Lopuhaä and Musta (2018b) studied properties of the least concave majorant of an arbitrary estimator of the primitive function of a monotone parameter. The monograph of Groeneboom and Jongbloed (2014) also summarizes certain large-sample properties for these estimators.
1.2. Contribution and organization of the article
In this paper, we wish to address the following three key objectives:
1. to provide a unified framework for studying a large class of nonparametric monotone function estimators that implies classical results but also applies in more complicated, modern applications;
2. to derive tractable sufficient conditions under which estimators in this class are known to be consistent and have a nondegenerate limit distribution upon proper centering and scaling;
3. to illustrate the use of this general framework to construct targeted estimators of monotone parameters that are possibly complex summaries of the observed data distribution, and whose estimation may require the use of data-adaptive estimators of nuisance functions.
Our first goal is to introduce a class of monotone estimators that allow the greatest convex minorization process to be performed on a possibly data-dependent transformation of the domain. For many monotone estimators in the literature, the greatest convex minorization is performed on a transformation of the domain. A strategic domain transformation can lead to significant benefits in practice, including in some cases the elimination of the need to estimate challenging nuisance parameters. Unfortunately, to our knowledge, existing results for general Grenander-type estimators do not apply in a straightforward manner in cases in which a data-dependent transformation of the domain has been used. We will define a class that permits such transformations, and demonstrate both how this class encompasses many existing estimators in the literature and how a transformation can be strategically selected in novel problems.
Our second goal is to derive sufficient conditions on the estimator of the primitive function and domain transformation that imply consistency and pointwise convergence in distribution of the monotone function estimator. As noted above, general results on pointwise convergence in distribution for the class of Grenander-type estimators, applicable in a wide variety of settings, were provided in Anevski and Hössjer (2006). Our work differs from that of Anevski and Hössjer (2006) in a few important ways. First, the role and implications of domain transformations, which, as we show, are often important in practice, were not explicitly considered in Anevski and Hössjer (2006). To our knowledge, the class of generalized Grenander-type estimators we consider in this paper, which allow for domain transformations, has not previously been studied in a unified manner, and hence, general results for this class do not currently exist. Second, in addition to pointwise distributional results, we study weak consistency. Third, in Sections 4 and 5, we pay special attention to parameters for which asymptotically linear estimators of the primitive and transformation functions can be constructed; in such cases, relatively straightforward sufficient conditions can be developed, and the limit distribution has a simpler form. While these results are weaker than those in Section 3 and in Anevski and Hössjer (2006) because they apply only to a special case, they are useful in many settings. We demonstrate the utility of these results for three groups of examples (estimation of monotone density, hazard and regression functions) and show that our results coincide with established results in these settings.
Our third goal is to discuss and illustrate Grenander-type estimation in cases in which nonparametric estimation of the primitive function requires estimation of challenging nuisance parameters. In this sense, our work follows the lead of van der Vaart and van der Laan (2006), whose setting is of this type. More generally, such primitive functions arise frequently, for example, when the observed data unit represents a coarsened version of an ideal data structure, and the coarsening occurs randomly conditional on observed covariates (Heitjan and Rubin (1991)). In our general results, we provide sufficient conditions that can be readily applied to such primitive estimators. To demonstrate the application of our theory in coarsened data structures, we consider extensions of the three classical monotone problems above to more complex settings in which covariates must be accounted for, because either the censoring process or the treatment allocation mechanism is informative, as is typical in observational studies. Specifically, we derive novel estimators of monotone density and hazard functions for use when the survival data are subject to right-censoring that may depend on covariates, and a novel estimator of a monotone dose-response curve for use when the relationship between the exposure and outcome is confounded by recorded covariates. Unlike for their classical analogues, in these more difficult problems, nonparametric estimation of the primitive function involves nuisance functions for which flexible estimation strategies must be employed. As van der Vaart and van der Laan (2006) were able to achieve in a particular problem, our general framework explicitly allows the integration of such strategies while still yielding estimators with a tractable limit theory.
Our paper is organized as follows. In Section 2, we define the class of estimators we consider and briefly introduce our three working examples. In Section 3, we present our most general results for the consistency and convergence in distribution of our class of estimators. We provide refined results, including simpler sufficient conditions and distributional results, for the special case in which the primitive and transformation estimators are asymptotically linear in Section 4. In Section 5, we apply our general theory in three examples, both for classical parameters and for the novel extensions we consider. We provide concluding remarks in Section 6. The proofs of all theorems, additional technical details and results from simulation studies that evaluate the validity of the theory in two examples are provided in the Supplementary Material (Westling and Carone (2020)).
2. Generalized Grenander-type estimators
2.1. Statistical setup and definitions
Throughout, we make use of the following definitions. For intervals I,J ⊆ R, define as the space of bounded, real-valued functions on I, as the subset of nondecreasing and càdlàg (right-continuous with left-hand limits) functions on I, and as the further subset of functions whose range is contained in J. The GCM operator is defined for any as the pointwise supremum over all convex functions H ≤ G on I. We note that GCMI(G) is necessarily convex. For , we denote by G− the generalized inverse mapping , and for a left-differentiable G, we denote by ∂−G the left derivative of G.
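For a finite diagram of points, the GCM and its left derivative can be computed with a single lower-convex-hull scan. The following is a minimal sketch under stated assumptions: the input is a finite set of points with strictly increasing abscissas (rather than a general bounded function on I), the function names `gcm_knots` and `gcm_left_derivative` are hypothetical, and the left derivative at the left endpoint, which is not defined, is set by convention to the first slope.

```python
import numpy as np

def gcm_knots(x, y):
    """Indices of the vertices of the greatest convex minorant of the
    points (x[i], y[i]); x is assumed strictly increasing."""
    hull = []
    for i in range(len(x)):
        # Drop the last vertex while it lies on or above the chord joining
        # its predecessor to the new point (slopes would not increase).
        while len(hull) >= 2:
            i0, i1 = hull[-2], hull[-1]
            if (y[i1] - y[i0]) * (x[i] - x[i1]) >= (y[i] - y[i1]) * (x[i1] - x[i0]):
                hull.pop()
            else:
                break
        hull.append(i)
    return np.array(hull)

def gcm_left_derivative(x, y, t):
    """Left derivative at t of the GCM of the linear interpolation of the
    points (x[i], y[i]). At t <= x[0] the first slope is returned."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    k = gcm_knots(x, y)
    xk, yk = x[k], y[k]
    slopes = np.diff(yk) / np.diff(xk)
    # Segment j covers (xk[j], xk[j+1]], so its slope is the left derivative there.
    j = np.searchsorted(xk, t, side="left") - 1
    return slopes[np.clip(j, 0, len(slopes) - 1)]

x = [0, 1, 2, 3]
y = [0, 2, 1, 3]
print(gcm_left_derivative(x, y, [1.0, 2.0, 2.5]))
```

In this toy diagram the point (1, 2) lies above the chord from (0, 0) to (2, 1) and is therefore discarded, so the left derivative is the same at t = 1 and t = 2.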
We are interested in making inference about an unknown function determined by the true data-generating mechanism P0 for an interval . We denote the endpoints of I by aI := inf I and bI := sup I. We define the primitive function Θ0 of θ0 pointwise for each x ∈ I as , where if aI = −∞ we assume the integral exists. The results we present in Section 3 apply in contexts with either independent or dependent data. Starting in Section 4, we focus on contexts in which the data consist of independent observations O1,…,On from an unknown distribution P0 in a nonparametric model . In such cases, we denote by O a prototypical data unit, the support of O under , and .
In its simplest formulation, a Grenander-type estimator of θ0 is given by ∂−GCMI(Θn) for some estimator Θn of Θ0. However, as a critical step in unifying classical estimators and constructing procedures with possibly improved properties, we wish to allow the GCM procedure to be performed on a possibly data-dependent transformation of the domain I. To do so, we first define for any interval the operator as for each and . We set J0 := [0, u0], with u0 ∈ (0, ∞) possibly depending on P0, and suppose that a domain transform is chosen. We may then consider the domain-transformed parameter , which has primitive Ψ0 defined pointwise as for t ∈ (0, u0]. As with θ0 and Θ0, ψ0 is nondecreasing and Ψ0 is convex. Thus, for each x ∈ I at which θ0 is left continuous and such that Φ0(u) < Φ0(x) for all u < x. This observation motivates us to consider estimators of θ0 of the form , where Ψn, Φn and un are estimators of Ψ0, Φ0 and u0, respectively, and we define Jn := [0, un]. We refer to any such estimator as being of the generalized Grenander-type. This class, of course, contains the standard Grenander-type estimators: setting Ψn = Θn and Φn = Id for Id the identity mapping yields . We note that, in this formulation, we require the domain J0 over which the GCM is performed to be bounded, but not so for the domain I of θ0. Additionally, we assume that the left endpoint of J0 is fixed at 0, while the upper endpoint u0 may depend on P0. However, this entails no loss in generality, since if the desired domain is instead [ℓ0, u0], where now ℓ0 also depends on P0, we can define and similarly shift Φ0 by ℓ0 to obtain the new domain .
Defining , we suppose that we have at our disposal estimators Φn and Γn of Φ0 and Γ0, respectively, as well as a weakly consistent estimator un of u0. In this work, we study the properties of a generic generalized Grenander-type estimator θn of θ0 of the form
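\[ \theta_n := \partial_- \mathrm{GCM}_{J_n}\left(\Gamma_n \circ \Phi_n^{-}\right) \circ \Phi_n . \tag{1} \]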
Our goal is to provide sufficient conditions on the triple (Γn, Φn, un) under which θn is consistent, and under which a suitable standardization of θn converges in distribution to a nondegenerate limit. As stated above, our only requirement for un is that it tends in probability to u0. Therefore, our focus will be on the pair (Γn, Φn).
We note that estimators taking form (1) constitute a more restrictive class than the set of all estimators of the form for arbitrary Ψn. Our focus on this slightly less general form is motivated by two reasons. First, as we will see in various examples, Γ0 often has a simpler form than Ψ0, and in such cases, it may be significantly easier to verify required regularity conditions for Γn and to derive limit distribution properties based on Γn rather than Ψn. Second, many celebrated monotone estimators in the literature follow this particular form. This can be seen by noting that, if Φn is a right-continuous step function with jumps at points x1, x2,…, xm, then for each x ∈ I the estimator θn(x) given in (1) equals the slope at Φn(x) of the greatest convex minorant of the diagram of points , where x0 = aI. We highlight well-known examples of estimators of this type below. In brief, we sacrifice a little generality for a substantial gain in the ease of application of our results, both for well known and novel monotone estimators. Nevertheless, conditions on the pair (Ψn, Φn) under which consistency and distributional results hold for θn can be derived similarly.
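The diagram-of-points characterization lends itself to a short numerical sketch. The code below assumes that the diagram consists of the points (Φn(xk), Γn(xk)) for k = 0, …, m, that the values Φn(xk) are strictly increasing in k, and that the user supplies these values directly as arrays; the function names are hypothetical. The slopes of the GCM are obtained by pooling adjacent violators on the increments, which is equivalent to the hull construction.

```python
import numpy as np

def pava_slopes(du, dv):
    """Left slopes of the GCM of a diagram with increments (du[k], dv[k]),
    du[k] > 0; one slope is returned per increment."""
    blocks = []  # each block holds [sum of du, sum of dv, number of increments]
    for a, b in zip(du, dv):
        blocks.append([a, b, 1])
        # Pool while the previous block's slope is not smaller than the last one's.
        while len(blocks) > 1 and blocks[-2][1] * blocks[-1][0] >= blocks[-1][1] * blocks[-2][0]:
            a2, b2, c2 = blocks.pop()
            blocks[-1][0] += a2
            blocks[-1][1] += b2
            blocks[-1][2] += c2
    return np.concatenate([np.full(c, b / a) for a, b, c in blocks])

def generalized_grenander(x, knots, phi_vals, gamma_vals):
    """Slope at Phi_n(x) of the GCM of the points (phi_vals[k], gamma_vals[k]).
    Phi_n is treated as a right-continuous step function jumping at knots[1:]."""
    knots = np.asarray(knots, dtype=float)
    slopes = pava_slopes(np.diff(phi_vals), np.diff(gamma_vals))
    # Phi_n(x) equals phi_vals[k] for the largest k with knots[k] <= x, and the
    # left slope of the GCM at phi_vals[k] is slopes[k - 1].
    k = np.searchsorted(knots, x, side="right") - 1
    k = np.clip(k, 1, len(knots) - 1)  # convention: first slope before knots[1]
    return slopes[k - 1]

knots = [0, 1, 2, 3, 4]
phi_vals = [0.0, 0.25, 0.5, 0.75, 1.0]    # illustrative values of Phi_n at the knots
gamma_vals = [0.0, 0.5, 0.6, 1.4, 1.5]    # illustrative values of Gamma_n at the knots
print(generalized_grenander([1.5, 3.0, 4.0], knots, phi_vals, gamma_vals))
```

Working with increments rather than hull vertices keeps the weights (the increments of Φn) explicit, which may make the link with weighted isotonic regression easier to see.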
2.2. Examples
Before proceeding to our main results, we briefly discuss the examples we will use to illustrate how our framework allows us to not only obtain results on classical estimators in the monotone estimation literature directly, but also tackle more complex problems for which no estimators are currently available. These examples will be studied further in Section 5.
EXAMPLE 1 (Monotone density function)
Suppose that T is a univariate positive random variable with nondecreasing density function f0, and that T is right-censored by an independent random censoring time C. The observed data unit is O := (Y, Δ), where Y := min(T, C) and Δ := I(T ≤ C), with distribution P0 implied by the marginal distributions of T and C. The parameter of interest is θ0 := f0, the density function of T with support I. Taking Φ0 to be the identity function, we get that ψ0 = θ0. Here, both Ψ0 and Θ0 represent the distribution function F0 of T, and Φ0 plays no role. A natural estimator θn of θ0 can be obtained by taking Ψn to be the Kaplan–Meier estimator of the distribution function Ψ0. With Φn := Id, Γn := Ψn and un := max_i Y_i, the estimator θn is precisely the estimator studied by Huang and Wellner (1995). When C = +∞ with probability one, Ψn is the empirical distribution function based on Y1, Y2,…, Yn, and θn is precisely the Grenander estimator.
In Section 5, we extend estimation of a monotone density function to the setting in which the data are subject to possibly informative right-censoring. Specifically, we only require T and C to be independent conditionally upon a vector W of baseline covariates. We will study the estimator defined by differentiating the GCM of a one-step estimator of Ψ0. In this context, estimation of Ψ0 requires estimation of nuisance functions. We will use our general results to provide conditions on the nuisance estimators that imply consistency and distributional results for θn.
EXAMPLE 2 (Monotone hazard function)
Suppose now that T is a univariate positive random variable with nondecreasing hazard function λ0. In this example, we are interested in θ0 := λ0. Setting S0 := 1 − F0 to be the survival function of T, we note that , and so, taking Φ0 to satisfy Φ0(dv) = S0(v)dv makes Γ0 = F0. The restricted mean lifetime function satisfies this condition. Using this transformation, the estimator of the monotone hazard function θ0 only requires estimation of F0.
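The claim that this choice of Φ0 yields Γ0 = F0 can be checked in one line. The derivation below is a sketch under two assumptions that are not spelled out in the text above: that Γ0 takes the form of the integral of θ0 against Φ0, and that S0(v) > 0 on the range of integration, so that the hazard λ0 = f0/S0 is well defined.

```latex
\begin{align*}
\Gamma_0(x) &= \int_0^x \theta_0(v)\,\Phi_0(dv)
             = \int_0^x \lambda_0(v)\,S_0(v)\,dv \\
            &= \int_0^x \frac{f_0(v)}{S_0(v)}\,S_0(v)\,dv
             = \int_0^x f_0(v)\,dv
             = F_0(x).
\end{align*}
```

The survival function cancels exactly, which is why no estimate of the hazard itself is needed to form the primitive.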
In Section 5, we again extend estimation of a monotone hazard function to allow the data to be subject to possibly informative right-censoring using the same one-step estimator Γn of Γ0 = F0 that will be introduced in Example 1 and the data-dependent transformation . We will show that, once the simpler details regarding the estimation of a monotone density are established, the asymptotic properties of this estimator of a monotone hazard are obtained essentially for free.
EXAMPLE 3 (Monotone regression function)
As our last example, we study estimation of a nondecreasing regression function. In the simplest setup, the data unit is O := (Y,A) and we are interested in . Assume without loss of generality that the data are sorted according to the observed values of A. Taking I to be the support of A and Φ0 to be the marginal distribution function of A, we have that for each u ∈ [0, 1], and for each x ∈ I. Thus and are natural nonparametric estimators of Γ0(x) and Φ0(x), respectively. Then is the classical monotone least-squares estimator of θ0.
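The equivalence between the GCM construction and monotone least squares can be checked numerically. The sketch below assumes distinct values of A, takes the diagram to be the cumulative sum diagram with points (k/n, (1/n) Σ_{i≤k} Y_i) after sorting by A, and compares its GCM slopes with the classical max-min formula for the nondecreasing least-squares fit; function names and the simulated data are illustrative only.

```python
import numpy as np

def isotonic_via_cusum(a, y):
    """Slopes of the GCM of the cumulative sum diagram of y sorted by a,
    computed by pooling adjacent violators. Assumes distinct values of a."""
    order = np.argsort(a, kind="stable")
    y_sorted = np.asarray(y, dtype=float)[order]
    sums, counts = [], []
    for val in y_sorted:
        sums.append(val)
        counts.append(1)
        # Pool while the previous block mean is not smaller than the last block mean.
        while len(sums) > 1 and sums[-2] * counts[-1] >= sums[-1] * counts[-2]:
            s, c = sums.pop(), counts.pop()
            sums[-1] += s
            counts[-1] += c
    fit = np.concatenate([np.full(c, s / c) for s, c in zip(sums, counts)])
    return np.asarray(a)[order], fit

def isotonic_maxmin(y_sorted):
    """Nondecreasing least-squares fit via the max-min formula:
    fit[k] = max over s <= k of min over t >= k of mean(y[s..t])."""
    n = len(y_sorted)
    cs = np.concatenate([[0.0], np.cumsum(y_sorted)])
    fit = np.empty(n)
    for k in range(n):
        fit[k] = max(
            min((cs[t + 1] - cs[s]) / (t + 1 - s) for t in range(k, n))
            for s in range(k + 1)
        )
    return fit

rng = np.random.default_rng(1)
a = rng.uniform(size=30)
y = a + rng.normal(scale=0.5, size=30)   # noisy nondecreasing signal
a_sorted, fit = isotonic_via_cusum(a, y)
print(np.allclose(fit, isotonic_maxmin(y[np.argsort(a, kind="stable")])))
```

The max-min formula is cubic in n as written and is included only as an independent check; the pooling pass is the practical route.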
In Section 5, we consider an extension of this example to estimation of a covariate-marginalized regression function, for use when the relationship between exposure and outcome of interest is confounded. Specifically, we will consider the data unit O := (Y,A,W), with W representing a vector of potential confounders, and focus on θ0 : x ↦ E0 [E0(Y | A = x,W)]. Under untestable causal identifiability conditions, θ0(x) is the mean of the counterfactual outcome Y(x) obtained by setting exposure at level A = x. This parameter plays a critical role in causal inference, particularly when the available data are obtained from an observational study and the exposure assignment process may be informative. As before, tackling this more complex parameter will require estimation of certain nuisance functions.
3. General results
We begin with our first set of results on the large-sample properties of θn. Our goal is to establish conditions under which consistency and pointwise convergence in distribution hold. First, we provide general results on the consistency of θn, both pointwise and uniformly. We note that the results of Anevski and Hössjer (2006), Durot (2007), Durot, Kulikov and Lopuhaä (2012) and Lopuhaä and Musta (2018a) imply conditions for consistency of Grenander-type estimators. However, because the objective of their work is to establish distributional theory for a global discrepancy between the estimated and true function, the conditions they require are stronger than needed for consistency alone. Also, their work is restricted to Grenander-type estimators, without data-dependent transformations of the domain.
Below, we refer to the sets and for β ≥ 0.
THEOREM 1 (Weak consistency)
Suppose θ0 is continuous at x ∈ I and, for some δ > 0 such that , Φ0 is strictly increasing and continuous on [x − δ, x + δ]. If , and tend to zero in probability, then θn(x) = θ0(x) + oP(1).
Suppose θ0 and Φ0 are uniformly continuous on I, and Φ0 is strictly increasing on I. If and ||Φn − Φ0||∞,I tend to zero in probability, then for each fixed β > 0.
We note that in part 1 of Theorem 1, we require uniform convergence of Γn and Φn to obtain a pointwise result for θn; this will also be the case for Theorem 2 below. This is because the GCM is a global procedure, and so, the value of θn(x1) depends on Γn(x2) even for x2 not near x1. Without uniform consistency of Γn, θn may indeed fail to be pointwise consistent. Also, we note that in part 1 of Theorem 1, we require that Γn − Γ0 and Φn − Φ0 tend to zero uniformly over the set In. This requirement stems from the fact that θn only depends on Γn through the composition , and so, values of Γn only matter at points in the range of . In part 1, we also require that Φn − Φ0 tend to zero uniformly in a neighborhood of x, while in part 2, we require that Φn − Φ0 tend to zero uniformly over I. These requirements allow us to obtain results for x values that are possibly outside In for all n. In many applications, it may be the case that Γn − Γ0 and Φn − Φ0 both tend to zero in probability uniformly over I, which implies convergence over In.
The weak conditions required for Theorem 1 are especially important for the extensions of the classical parameters that we consider in Section 5. The estimators we propose require estimating difficult nuisance parameters, such as conditional hazard, density and mean functions. While under mild conditions it is typically possible to construct uniformly consistent estimators of these nuisance parameters, ensuring a given local or uniform rate of convergence often requires additional knowledge about the true function. Thus, Theorem 1 is useful for guaranteeing consistency under weak conditions.
We now provide lower bounds on the convergence rate of θn, both pointwise and uniformly, depending on (a) the uniform rates of convergence of Γn and Φn, and (b) the moduli of continuity of θn and
THEOREM 2 (Rates of convergence)
Let x ∈ I be given. Suppose that, for some δ > 0, and Φ0 is strictly increasing and continuous on [x − δ, x + δ]. Let rn be a fixed sequence such that , and are bounded in probability.
1. If there exist K1(x), K2(x) ∈ [0, ∞) and α1, α2 ∈ (0,1] such that for all u ∈ I and for all u ∈ J0, then
2. If θ0 is constant on [x − δ, x + δ], then rn [θn(x) − θ0(x)] = OP(1).
Let rn be a fixed sequence such that and are bounded in probability, and suppose that Φ0 is strictly increasing on I.
3. If there exist K1, K2 ∈ [0, ∞) and α1, α2 ∈ (0,1] such that for all u, v ∈ I and for all u, v ∈ J0, then
for any random positive real sequence βn such that .
We note here that the uniform results only cover subintervals of the interval over which the GCM procedure is performed. This should not be surprising given the poor behavior of Grenander-type estimators at the boundary of the GCM interval, as discussed, for example, in Woodroofe and Sun (1993), Kulikov and Lopuhaä (2006) and Balabdaoui et al. (2011). Various boundary corrections have been proposed—applying these in our general framework is an interesting avenue for future work.
We also note that, in Theorem 2, when θ0 and Φ0 are locally or globally Lipschitz, then α1 = α2 = 1 and the resulting rate is , which yields when rn = n^{1/2}. This rate is slower than the rate n^{−1/3} that is often achievable for pointwise convergence when θ0 and Φ0 are differentiable at x and the primitive estimator converges at rate n^{−1/2}, as we discuss below. However, the assumptions in Theorem 2 are significantly weaker than typically required for the n^{−1/3} rate of convergence: they constrain the supremum norm of the estimation error rather than its modulus of continuity, and hold when the true function is Lipschitz but not differentiable. Our results also cover situations in which θ0 and Φ0 are in Hölder classes. The rates provided by Theorem 2 should thus be seen as lower bounds on the true rate, for use when less is known about the properties of the estimation error or of the true functions. The distributional results we provide below recover the usual rates under stronger conditions.
For a fixed sequence rn of positive real numbers, we now study the pointwise convergence in distribution of rn [θn(x) − θ0(x)] at an interior point x ∈ I at which Φ0 has a strictly positive derivative. The rate rn depends on two interdependent factors. First, we suppose that there exists some α > 0 such that |θ0(x + u) − θ0(x)| = π0(x)|u|^α + o(1) as u → 0 for some constant π0(x) > 0. Second, writing and , we suppose that there exists a sequence of positive real numbers cn → ∞ such that the appropriately localized process
converges weakly. We note that Wn,x depends on α. As we formalize below, if , then rn [θn(x) − θ0(x)] has a nondegenerate limit distribution under some conditions. We now introduce some of the conditions that we build upon:
(A1) for each M > 0, {Wn,x(u) : |u| ≤ M} converges weakly in to a tight limit process {Wx(u) : |u| ≤ M} with almost surely lower semi-continuous sample paths;
(A2) is bounded in probability for every ;
(A3) there exist β ∈ (1, 1 + α), δ* > 0 and a sequence such that is decreasing, fn(1) = O(1), and for all large n and δ ≤ δ*, .
In addition, we introduce conditions on the uniform convergence of estimators Φn and Γn:
(A4) for some δ > 0;
(A5) .
THEOREM 3 (Convergence in distribution)
If x is an interior point of I at which Φ0 is continuously differentiable with positive derivative and , conditions (A1)–(A5) imply that
with and
If also and Wx possesses stationary increments, then
Furthermore, if with W0 a standard two-sided Brownian motion process satisfying W0(0) = 0, then
with and .
The latter limit distribution is referred to as a scaled Chernoff distribution, since Z is said to follow the standard Chernoff distribution. This distribution appears prominently in classical results in nonparametric monotone function estimation and has been extensively studied (e.g., Groeneboom and Wellner (2001)). It can also be defined as the distribution of the slope at zero of .
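A small Monte Carlo sketch may help fix ideas about the standard Chernoff distribution. It assumes the common definition of Z as the maximizer of W0(u) − u², which by symmetry of W0 has the same law as the minimizer of W0(u) + u², and it approximates the two-sided Brownian motion on a grid truncated to [−T, T]. The function name, truncation level T, grid step h and number of replications are illustrative choices, and both the discretization and the truncation introduce approximation error.

```python
import numpy as np

def simulate_chernoff(n_rep=2000, T=3.0, h=0.01, seed=0):
    """Approximate draws of argmin_u {W0(u) + u^2} on a grid over [-T, T],
    with W0 a two-sided Brownian motion started at W0(0) = 0."""
    rng = np.random.default_rng(seed)
    m = int(round(T / h))
    t = np.arange(-m, m + 1) * h
    # Independent Brownian paths for the positive and negative half-lines.
    w_pos = np.cumsum(rng.normal(0.0, np.sqrt(h), size=(n_rep, m)), axis=1)
    w_neg = np.cumsum(rng.normal(0.0, np.sqrt(h), size=(n_rep, m)), axis=1)
    # Reverse the negative branch so that columns follow the ordering of t.
    w = np.hstack([w_neg[:, ::-1], np.zeros((n_rep, 1)), w_pos])
    return t[np.argmin(w + t**2, axis=1)]

z = simulate_chernoff()
print(z.mean(), z.std())
```

The empirical distribution of the draws should be roughly symmetric about zero; quantiles for inference are better taken from published tables such as those of Groeneboom and Wellner (2001) than from a coarse simulation of this kind.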
Theorem 3 applies in the common setting in which θ0 is differentiable at x with positive derivative, that is, when α = 1. However, as in Wright (1981) and Anevski and Hössjer (2006), Theorem 3 also applies in additional situations, including when θ0 has α ∈ {2,3,…} derivatives at x, with null derivatives of order j < α and positive derivative of order α. Nevertheless, Theorem 3 does not cover situations in which θ0 is flat in a neighborhood of x. The limit distribution of the Grenander estimator at flat points was studied in Carolan and Dykstra (1999), but it appears that similar results have not been derived for Grenander-type estimators.
We note the similarity of our Theorem 3 to Theorem 2 of Anevski and Hössjer (2006). For the special case in which Φ0 is the identity transform, the consequents of the two results coincide. Our result explicitly permits alternative transforms. Both results require weak convergence of a stochastic part of the primitive process, and also require the same local rate of growth of θ0. Additionally, condition (A2) is implied if for every ϵ and δ positive, there exists a finite m ∈ (0, +∞) such that , as in Assumption A5 of Anevski and Hössjer (2006). However, the remaining conditions and methods of proof differ. To prove our result, we first generalize the switch relation of Groeneboom (1985) and use it to convert P0(rn [θn(x) − θ0(x)] > η) into the probability that the minimizer of a process involving Wn,x falls below some value. After establishing weak convergence of this process, we then use conditions (A2) through (A5) to justify application of the argmin continuous mapping theorem. In contrast, Anevski and Hössjer (2006) establish their result using a direct appeal to convergence in distribution of ∂−GCMC(Yn)(0) to ∂−GCMC(Y0)(0), where Yn is a local limit process and Y0 its weak limit. They also provide lower-level sufficient conditions for this convergence. It may be possible to establish the consequent of Theorem 3, permitting in particular the use of a nontrivial transformation Φ0, using Theorem 2 of Anevski and Hössjer (2006) or a suitable generalization thereof. We have specified our sufficient conditions with applications to the setting α = 1 and in mind, as we discuss at length in the next section.
Suppose that is the limit process that arises when no domain transformation is used in the construction of a generalized Grenander-type estimator, that is, when both Φ0 and Φn are taken to be the identity map. In this case, under (A1)–(A5), Theorem 3 indicates that
It is natural to ask how this limit distribution compares to the one obtained using a nontrivial transformation Φ0. In particular, does using Φ0 change the pointwise distributional results for θn? The answer is of course negative whenever Wx and are equal in distribution, since is a homogeneous operator. A more detailed discussion of this question and lower-level conditions are provided in the next section.
4. Refined results for asymptotically linear primitive and transformation estimators
4.1. Distributional results
In applications of their main result, Anevski and Hössjer (2006) focus primarily on providing lower-level conditions to characterize the relationship between various dependence structures and asymptotic results for monotone regression and density function estimation. Anevski and Soulier (2011), Dedecker, Merlevède and Peligrad (2011) and Bagchi, Banerjee and Stoev (2016) provide additional applications of Anevski and Hössjer (2006) to monotone function estimation with dependent data. Our Theorem 3 could be used, for instance, to relax the common assumption of a uniform design in the analysis of monotone regression estimators. Here, we pursue an alternative direction, focusing instead on providing lower-level conditions for consistency of θn and convergence in distribution of rn [θn(x) − θ0(x)] for use in the important setting in which α = 1, rn = cn = n1/3, the data are independent and identically distributed, and Γn and Φn are asymptotically linear estimators. Such settings arise frequently, for instance, when the primitive and transformation parameters are smooth mappings of the data-generating mechanism.
Below, we write Pf to denote ∫ f dP for any probability measure P and P-integrable function f. We also use ℙn to denote the empirical distribution of independent observations O1, O2, …, On from P0, so that ℙn f = (1/n) Σ f(Oi), with the sum taken over i = 1, …, n, for any such f.
Suppose that there exist functions and depending on P0 such that, for each and both and are finite, and
(2)
where Hx,n and Rx,n are stochastic remainder terms. If and tend to zero in probability, we say that Γn and Φn are uniformly asymptotically linear over I as estimators of Γ0 and Φ0, respectively. The objects and are referred to as the influence functions of Γn(x) and Φn(x), respectively, under sampling from P0.
Assessing consistency and uniform consistency of θn is straightforward when display (2) holds. For example, if the classes and are P0-Donsker, and and are bounded in probability, then and are both bounded in probability. Thus, Theorems 1 and 2 can be directly applied with rn = n1/2 provided the required conditions on θ0 and Φ0 hold. As such, we focus here on deriving a refined version of Theorem 3 for use whenever display (2) holds.
It is reasonable to expect the linear terms and to drive the behavior of the standardized difference rn [θn(x) − θ0(x)] in Theorem 3. The natural rate here is cn = rn = n1/3, for which Kim and Pollard (1990) provide intuition. Our first goal in this section is to provide sufficient conditions for weak convergence of the process , where is the empirical process and we define the localized difference function . Kim and Pollard (1990) also provide detailed conditions for weak convergence of processes of this type. Building upon their results, we are able to provide simplified sufficient conditions for convergence in distribution of n1/3 [θn(x) − θ0(x)] when Γn and Φn are uniformly asymptotically linear estimators.
We begin by introducing some conditions. First, we define and suppose that has envelope function Gx,R. The first two conditions concern the size of for small R in terms of bracketing or uniform entropy numbers, which for completeness we define here; see van der Vaart and Wellner (1996) for a comprehensive treatment. Denote by the L2(P) norm of a given P-square-integrable function . The bracketing number of a class with respect to the L2(P) norm is the smallest number of ε-brackets needed to cover , where an ε-bracket is any set of functions {f : ℓ≤ f ≤ u} with ℓ and u such that ||ℓ − u||P,2 < ε. The covering number of with respect to the L2(Q) norm is the smallest number of ε-balls in L2(Q) required to cover . The uniform covering number is the supremum of over all discrete probability measures Q such that ||G||2,Q > 0, where G is an envelope function for . We consider conditions on the size of :
(B1) For some constants C > 0 and V ∈ [0,2), for all ε ∈ (0,1] and R small enough, either:
(B1a) or
(B1b).
(B2) , and as R → 0 for all η > 0.
Condition (B1) replaces the notion of uniform manageability of the class for small R as defined in Kim and Pollard (1990), and condition (B2) corresponds to their condition (vi). Since bounds on the bracketing and uniform entropy numbers have been derived for many common classes of functions, condition (B1) can be readily checked in practice. Together, conditions (B1) and (B2) ensure that is a relatively small class, and this helps to establish the weak convergence of the localized process {Wn,x(u) : |u| ≤ M }.
As in Kim and Pollard (1990), to guarantee that the covariance function of this localized process stabilizes, it suffices that be bounded for small enough δ > 0 and that, up to a scaling factor possibly depending on tend to the covariance function σ2(u, v) of a two-sided Brownian motion as α → 0. Below, we provide simple conditions that imply these two statements for a broad class of settings that includes our examples.
The covariance function of the Gaussian process to which converges weakly is defined pointwise as . The behavior of near (x, x) dictates the covariance of the local limit process Wx, and hence the scale parameter κ0(x). If is differentiable in (s, t) at (x, x), then κ0(x) = 0 and θn converges at a faster rate, though possibly with an asymptotic bias. When instead Chernoff asymptotics apply, the covariance function can typically be written as
(3)
for some functions , and depending on P0, where Q0 is a probability measure induced by P0 on some measurable space . In this representation, is taken to be the differentiable portion of the covariance function, which does not contribute to the scale parameter. The second summand is not differentiable at (x, x) and makes σx,α(u, v) tend to a nonzero limit. We consider cases in which and H0 satisfy the following conditions:
(B3) Representation (3) holds, and for some δ > 0, setting Bδ(x) := (x − δ, x + δ), it is also true that:
(B3a) is symmetric in its arguments and continuously differentiable on Bδ(x);
(B3b) A0 is symmetric in its first two arguments, and s ↦ A0(s, t, v, w) is differentiable for Q0-almost every w and each s, t, v ∈ Bδ(x), with derivative continuous in s, t, v each in Bδ(x) for Q0-almost every w and satisfying
(B3c) v ↦ A0(x, x, v, w) is continuous at v = x uniformly in w over the support of Q0;
(B3d) v ↦ H0(v, w) is nondecreasing for all w and differentiable at each v ∈ Bδ(x), with derivative continuous at v = x uniformly in w over the support of Q0.
Representation (3) is deliberately broad to encompass a wide variety of parameters, but in many settings, the covariance function can be considerably simplified, leading then to simpler conditions in (B3). For instance, when W is a vector of covariates over which marginalization is performed to compute the parameter, Q0 typically plays the role of the marginal distribution of W under P0. In classical problems in which there is no adjustment for covariates, this feature of representation (3) is not needed and indeed vanishes. In other settings, A0(s, t, v, w) depends on v and w but not on s and t.
Finally, we must ensure that the stochastic remainder terms Hx,n and Rx,n arising in (2) do not contribute to the limit distribution. Defining , and , we consider the following conditions for the asymptotic negligibility of these remainder terms:
(B4) Kn(δ) = oP(1) for each fixed δ > 0;
(B5) for some α ∈ (1,2), δ ↦ δ−αE0 [Kn(δ)] is decreasing for all δ small enough and n large enough.
Condition (B4) guarantees that the remainder terms do not contribute to the weak convergence of {Wn,x(u) : |u| ≤ M }, and condition (B5) guarantees that the remainder terms satisfy condition (A3).
Combining the conditions above, we can state the following master theorem for pointwise convergence in distribution when the monotone estimator is based upon asymptotically linear primitive and transformation estimators.
THEOREM 4
Suppose that, at an interior point x ∈ I, θ0 is differentiable and Φ0 is continuously differentiable with positive derivative. Suppose also that Γn and Φn satisfy display (2), and that conditions (B1)–(B5) and (A4)–(A5) hold (with cn = n1/3). Then it holds that
where is a scale factor involving and Z follows the Chernoff distribution.
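The Chernoff distribution appearing in this limit has no closed-form density, but it is easy to approximate by simulation. The sketch below is illustrative only: it takes Z to be the maximizer of W(t) − t² for a two-sided standard Brownian motion W, which is the usual definition, and approximates W on a finite grid. The function name `simulate_chernoff`, the truncation window `half_width`, and the grid spacing `step` are assumptions introduced here, not quantities from the paper.

```python
import numpy as np


def simulate_chernoff(n_draws=2000, half_width=3.0, step=1e-3, seed=0):
    """Approximate draws of Z = argmax_t {W(t) - t^2}, W two-sided Brownian motion.

    Hypothetical helper: the window [-half_width, half_width] and the grid
    spacing `step` are discretization choices, not part of the theory.
    """
    rng = np.random.default_rng(seed)
    m = int(round(half_width / step))
    t = np.arange(-m, m + 1) * step  # symmetric grid containing 0
    draws = np.empty(n_draws)
    for i in range(n_draws):
        # Independent Brownian paths to the right and to the left of the origin.
        right = np.cumsum(rng.normal(scale=np.sqrt(step), size=m))
        left = np.cumsum(rng.normal(scale=np.sqrt(step), size=m))
        # Reverse the left path so that values are ordered from -half_width up to 0.
        w = np.concatenate([left[::-1], [0.0], right])
        draws[i] = t[np.argmax(w - t**2)]
    return draws
```

Empirical quantiles of such draws, multiplied by an estimate of the scale factor in Theorem 4, could be used to form approximate pointwise confidence intervals. Because the quadratic drift dominates quickly, truncating to a modest window should have little effect on the draws.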
4.2. Effect of domain transform on limit distribution
As was done briefly after Theorem 3, it is natural to compare the limit distribution obtained by Theorem 4 when a transformation of the domain is used and when it is not. We will consider , the estimator obtained by directly isotonizing an estimator Θn of the primitive function Θ0 without use of a domain transformation. Denoting by Φ0 a candidate nondecreasing transformation function, and letting Γ0 := Ψ0◦Φ0 be as described in Section 2, we will also consider , where Γn and Φn are estimators of Γ0 and Φ0, respectively. Suppose Θn(x), Γn(x) and Φn(x) are each asymptotically linear estimators of their respective targets with influence functions , and , respectively, under sampling from P0.
We wish to compare the scale parameters κ0(x) and arising from the use of the distinct estimators θn(x) and . To do so, we can use expression (B3) to examine the covariance obtained in both cases. However, it appears difficult to say much without having more specific forms for the involved influence functions. Unfortunately, it also appears difficult to characterize these influence functions generally since they depend inherently on the parameter of interest θ0, and we wish to remain agnostic to the form of θ0. Nevertheless, in our next result, we describe a class of problems, characterized by the generated influence functions and regularity conditions on these, in which domain transformation has no effect on the limit distribution of the generalized Grenander-type estimator.
THEOREM 5
Suppose conditions (B1)–(B5) hold for (Θn, Id) and (Γn, Φn), and the observed data unit can be partitioned as O = (U, Z) with . Suppose that the influence functions can be expressed as
and satisfy the smoothness conditions stated in the Supplementary Material. Suppose that the density function h0 of the conditional distribution of U given Z exists and is continuous in a neighborhood of x uniformly over the support of the marginal distribution QZ,0 of Z. Then it follows that
Consequently, and have the same limit distribution.
The forms of and arise naturally in a wide variety of settings because the parameters considered involve a primitive function. The supposed form of may seem restrictive at first glance but is in fact expected given the forms of and . A heuristic justification based on the product rule for differentiation is provided in the Supplementary Material. In all of the examples we study in Section 5, the conditions of Theorem 5 apply. This provides justification for why, in each of these examples, the use of a domain transform has no impact on the limit distribution.
We remind the reader that, even if the domain transformation has no impact on the pointwise limit distribution, use of a domain transformation is still of great practical value in many circumstances. In complex problems, an estimator Θn may not be readily available for the primitive parameter Θ0 obtained without the use of a domain transformation. In some cases, Θ0 may not even be well defined, so that transformation of the domain is unavoidable. Even when Θ0 is well defined and an estimator Θn is available, with the use of a carefully chosen transformation, it may be possible to avoid the need to estimate certain nuisance parameters or to substantially simplify the verification of conditions (B1)–(B5). Examples of these phenomena are presented in Section 5.
4.3. Negligibility of remainder terms
In some applications, the estimators Γn and Φn may be linear rather than simply asymptotically linear. In such situations, the remainder terms Hx,n and Rx,n are identically zero, and conditions (B4) and (B5) are trivially satisfied. Otherwise, these conditions must be verified. While in general the exact form of these remainder terms depends upon the specific parameter under consideration and estimators used, it is frequently the case that part of the remainder is an empirical process term arising from the estimation of nuisance functions appearing in the influence functions and , as we illustrate below with one particular construction. To facilitate the verification of conditions (B4) and (B5) for these empirical process terms, we outline sufficient conditions in terms of uniform entropy and bracketing numbers.
In this subsection, we assume that Γ0(x) and Φ0(x) arise as the evaluation at P0 of maps from to , and denote by ΓP(x) and ΦP(x) the evaluation of these maps at an arbitrary . Let π = π(P) be a summary of P, and suppose that ΓP(x), ΦP(x) and the nonparametric efficient influence functions of P ↦ ΓP(x) and P ↦ ΦP(x) at P each only depend on P through π. Denote these efficient influence functions by and , respectively. Since is nonparametric, it must be that and for π0 := π(P0). To emphasize the fact that ΓP(x) and ΦP(x) depend on P only through π, we will use the symbols Γπ(x) and Φπ(x) to refer to ΓP(x) and ΦP(x), respectively.
Under regularity conditions, the so-called one-step estimators
(4)
are asymptotically linear and efficient estimators of Γ0(x) and Φ0(x), even when πn is a data-adaptive (e.g., machine learning) estimator of π0 (e.g., Pfanzagl (1982)). van der Vaart and van der Laan (2006) pioneered the use of such one-step estimators in the context of nonparametric monotone function estimation. When this one-step construction is used, it can be shown that the remainder terms have the form Hx,n = H1,x,n + H2,x,n and Rx,n = R1,x,n + R2,x,n, where and are empirical process terms, and H2,x,n and R2,x,n are so-called second-order remainder terms arising from linearization of the corresponding parameter. Similar representations exist when other constructive approaches, such as gradient-based estimating equations methodology (e.g., Tsiatis (2006), van der Laan and Robins (2003)) and targeted maximum likelihood estimation (e.g., van der Laan and Rose (2011)), are used. As we will see in the examples of Section 5, these second-order terms can usually be shown to be asymptotically negligible provided πn tends to π0 fast enough in some appropriate norm. Here, we provide conditions on πn that ensure that the contribution of H2,x,n − θ0(x)R2,x,n to Kn(δ) satisfies conditions (B4) and (B5).
A key benefit of decomposing the remainder terms as above is that the empirical process terms can be controlled using empirical process theory, a strategy also used in van der Vaart and van der Laan (2006). In particular, we can provide conditions under which H1,x,n and R1,x,n satisfy conditions (B4) and (B5). Defining , the relevant contribution of these empirical process terms to Kn(δ) is
Suppose that πn falls in a semimetric space , with probability tending to one, and that is an envelope function for . We consider the following conditions:
(C1) for some constants C > 0 and V ∈ [0,2), for all ε ∈ (0,1] and R small enough, either one of these conditions holds:
(C1a) ;
(C1b) ;
(C2) , and for all , as R → 0;
(C3) uniformly for and u ∈ I, and uniformly for ;
(C4) there exists some such that .
Our next result states that, under these conditions, the remainder term K1,n(δ) stated above is asymptotically negligible in the sense of conditions (B4) and (B5).
THEOREM 6
Suppose that, with probability tending to one, and conditions (C1)–(C4) hold. Then, K1,n(δ) satisfies conditions (B4)–(B5).
We note that conditions (C1) and (C2) together imply conditions (B1) and (B2). As such, if conditions (C1) and (C2) have been verified, there is no need to also verify conditions (B1) and (B2).
5. Applications of the general theory
In this section, we demonstrate the use of our general results for the three examples introduced in Section 2: estimation of monotone density, hazard and regression functions. For each of these functions, we consider various levels of complexity of the relationship between the ideal and observed data units. This allows us to illustrate that our general results (i) coincide with classical results in the simpler cases that have already been studied, and (ii) suggest novel estimation procedures with well-understood inferential properties, even in the context of complex problems that do not appear to have been previously studied. Below, we focus on distributional results for the various estimators considered. In each case, we state the main results in the text, and present additional technical details in the Supplementary Material.
5.1. Example 1: Monotone density function
Let θ0 := f0 be the density function of an event time T with support I := [0, u0 ], and suppose that f0 is known to be nondecreasing on I. We will not use any transformation in this example, so we take Φ0 and Φn to be the identity map. Thus, ψ0 = θ0 also corresponds to the density function of T, and Ψ0 = Θ0 =Γ0 to its distribution function. Below, we consider various data settings that increase in complexity. In the first setting, available observations are subject to independent right-censoring. In the second, the right-censoring mechanism is allowed to be informative—only conditional independence of the event and censoring times given a vector of observed covariates is assumed. The first case has been studied in the literature—for this, we wish to verify that our general results coincide with results already established. The second case is more difficult and does not seem to have been studied before. Our work in this setting not only highlights the generality of the theory in Sections 3 and 4, but also yields novel practical methodology.
5.1.1. Independent censoring
Suppose that C is a positive random variable independent of T, and that the observed data unit is O = (Y, Δ), where Y = min(T, C) and Δ = I(T ≤ C). The NPMLE of a monotone density function based on independently right-censored data was obtained in Laslett (1982) and McNichols and Padgett (1982), and distributional results were derived in Huang and Zhang (1994). Huang and Wellner (1995) considered an estimator θn obtained by differentiating the GCM of the Kaplan–Meier estimator of the distribution function. While this is not the NPMLE, Huang and Wellner (1995) showed that it is asymptotically equivalent to the NPMLE, and it is an attractive estimator because it is simple to construct and reduces to the Grenander estimator if T is fully observed, that is, if C ≥ T almost surely.
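To make the construction concrete, the following sketch computes a Grenander-type estimate of a nondecreasing density from right-censored data by taking slopes of the GCM of one minus the Kaplan–Meier curve. It is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the support endpoint `u0` is assumed known and strictly larger than every observed event time, event times are assumed strictly positive, and the GCM of the step function is computed from its left limits at the jump points, since a convex minorant must lie below those as well.

```python
import numpy as np


def gcm_left_slopes(x, y):
    """Slopes of the greatest convex minorant of the points (x[k], y[k]).

    Returns one slope per interval (x[k-1], x[k]]; x must be strictly increasing.
    Implemented as weighted pool-adjacent-violators on the raw slopes.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.diff(x)
    s = np.diff(y) / w
    blocks = []  # each block: [pooled slope, total width, number of intervals]
    for sk, wk in zip(s, w):
        blocks.append([sk, wk, 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2, c1 + c2])
    return np.concatenate([np.full(c, m) for m, _, c in blocks])


def kaplan_meier(y, delta):
    """Distinct event times and the Kaplan-Meier survival value at each of them."""
    y = np.asarray(y, dtype=float)
    delta = np.asarray(delta, dtype=int)
    times = np.unique(y[delta == 1])
    at_risk = np.array([(y >= t).sum() for t in times])
    events = np.array([((y == t) & (delta == 1)).sum() for t in times])
    return times, np.cumprod(1.0 - events / at_risk)


def monotone_density_km(y, delta, u0):
    """Knots and GCM slopes of 1 - (Kaplan-Meier) over [0, u0] (hypothetical helper)."""
    times, surv = kaplan_meier(y, delta)
    if times[0] <= 0 or times[-1] >= u0:
        raise ValueError("event times must lie strictly inside (0, u0)")
    cdf = 1.0 - surv
    cdf_left = np.concatenate([[0.0], cdf[:-1]])  # left limits at the jump points
    x_pts = np.concatenate([[0.0], times, [u0]])
    y_pts = np.concatenate([[0.0], cdf_left, [cdf[-1]]])
    return x_pts, gcm_left_slopes(x_pts, y_pts)


def evaluate_left_slope(x_pts, slopes, x):
    """Left derivative of the GCM at x, i.e. the slope on the interval containing x."""
    k = np.searchsorted(x_pts, x, side="left") - 1
    return slopes[np.clip(k, 0, len(slopes) - 1)]
```

With no censoring, the Kaplan–Meier curve is the empirical survival function, so this sketch reduces to a Grenander-type estimator built from the empirical distribution function.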
Since Ψ0 is the distribution function F0 = 1 − S0 with S0 denoting the survival function of T, it is natural to consider Ψn := 1 − Sn, where Sn is the Kaplan–Meier estimator of S0. It is well known that n1/2(Sn − S0) converges weakly in to a tight zero-mean Gaussian process as long as G0(τ) > 0 and S0(τ) < 1, where G0 denotes the survival function of C. Denoting by Λ0 the cumulative hazard function corresponding to S0, the influence function of the Kaplan–Meier estimator Sn(x) is known to be the nonparametric efficient influence function
and so, the local difference gx,u(y, δ) can be written as
In the Supplementary Material, we verify that condition (B2) is satisfied if S0 and G0 are positive in a neighborhood of x, and that condition (B3) is satisfied if θ0 is positive and continuous in a neighborhood of x. The covariance function is given by . We then get , so that the scale parameter is . This agrees with the results of Huang and Wellner (1995). In the Supplementary Material, we demonstrate that conditions (B4) and (B5) are also satisfied. In the case of no censoring, simplifies to , so that and κ0(x) = θ0(x). This agrees with the classical result of Prakasa Rao (1969) concerning pointwise convergence in distribution of the Grenander estimator.
5.1.2. Conditionally independent censoring
In many cases, the censoring mechanism may be informative but still independent of the event time process conditionally on a vector of recorded covariates. For simplicity, we only consider the case in which these covariates are defined at baseline, though the case of time-varying covariates can be tackled similarly. The observed data unit is now O = (Y, Δ, W), and we assume that T and C are independent given W. As long as P0(Δ = 1|W) is bounded away from zero almost surely, the survival function S0 of T can be identified pointwise in terms of the distribution P0 of O via the product-limit transform
where is the conditional subdistribution function of Y given W = w corresponding to is the conditional proportion-at-risk at time t given W = w, and Q0 is the marginal distribution of W under P0. This constitutes an example of coarsening at random, as described in Heitjan and Rubin (1991) and Gill, Van Der Laan and Robins (1997). Estimation of S0 in the context of conditionally independent censoring has been studied before by Hubbard, van der Laan and Robins (2000), Scharfstein and Robins (2002) and Zeng (2004), among others.
In this context, the nonparametric efficient influence function of S0(x) has the form D0,x − S0(x), where D0,x is given by
with S0(x | w) and G0(x | w) the conditional survival functions of T and C, respectively, at x given W = w, and Λ0 (x | w) is the conditional cumulative hazard function of T at x given W = w. A simple one-step estimator of Γ0(x) is given by , where Dn,x is obtained by substituting Sn and Gn for S0 and G0, respectively, in D0,x. Conditions (B1) and (B2) are satisfied under uniform Lipschitz conditions on S0 and G0. As we show in the Supplementary Material, condition (B3) holds, and we get , where f0(x | w) is the conditional density of T at x given W = w. It follows directly then that the Chernoff scale factor is
which reduces to the scale factor of Huang and Wellner (1995) when T and C are independent. In the Supplementary Material, we demonstrate that satisfaction of condition (B4) is highly dependent on the behavior of Sn and Gn. For instance, if Sn − S0 and Gn − G0 uniformly tend to zero in probability at rates faster than n−1/3, then conditions (B4) and (B5) are satisfied. This is not a restrictive requirement if W has only a few components; in such cases, many nonparametric smoothing-based estimators attain such rates. Otherwise, semiparametric estimators building upon additional structure (e.g., additivity on an appropriate scale) could be used. Alternatively, for higher-dimensional W, estimators of the form with λn an estimator of the conditional hazard λ0 may be worth considering. For such Sn, we require the product of the convergence rates of λn − λ0 and Gn − G0 to be faster than n−1/3. In practice, with a moderate- or high-dimensional covariate vector W, it seems desirable to leverage multiple candidate estimators using ensemble learning (e.g., van der Laan, Polley and Hubbard (2007), van der Laan and Rose (2011)). In the Supplementary Material (Westling and Carone (2020)), we conduct a simulation study validating these results using Cox’s proportional hazards model for Sn and Gn.
5.2. Example 2: Monotone hazard function
We now consider estimation of θ0 := λ0, the hazard function of T. The most obvious approach to tackle this problem would be to consider an identity transformation as in the previous example. The primitive function of interest is then the cumulative hazard function Λ0, which can be expressed as the negative logarithm of the survival function S0 and estimated naturally using any asymptotically linear estimator of S0, for example. The conditions of Theorems 3 and 4 can then be directly verified. An alternative, more expeditious approach consists of taking the domain transform Φ0 to be the restricted mean mapping . In such cases, Γ0 is simply the cumulative distribution function F0, and the mean of T. This particular choice of domain transformation for estimating a monotone hazard function therefore yields the same parameter Γ0 as for estimating a monotone density with the identity transform. Denoting by Sn the estimator of the survival function S0 based on the available data, the resulting generalized Grenander-type estimator θn is defined by taking Γn := 1 − Sn and setting Φn to be over Jn = [0, un], where . As the result below suggests, when this special domain transform is used, we can leverage some of the work performed above in analyzing the Grenander-type estimator of a monotone density function under the various right-censoring schemes considered. We recall that Id denotes the identity function.
THEOREM 7
Suppose that and set Γn := 1 − Sn. If the pair (Γn, Id) satisfies conditions (A1)–(A3), then the pair (Γn, Φn) with necessarily satisfies conditions (A1)–(A5). In particular, for , this implies that
If Wx = [ κ0(x)] 1/2W0 for W0 a two-sided standard Brownian motion, then
where Z follows the Chernoff distribution and .
Denote by T(j) the jth order statistic of {T1, T2, …,Tn } and define T(0) := 0. When there is no censoring, the choice (Γn, Φn) prescribed above indicates that Γn is the empirical distribution function based on Y1, Y2,…,Yn, and Φn is defined pointwise as , which is strictly increasing on [0, T(n)]. Therefore, θn(x) is the left derivative at Φn(x) of the GCM of the graph of . This is the NPMLE of a nondecreasing hazard function with uncensored data; see, for example, Chapter 2.6 of Groeneboom and Jongbloed (2014).
In the Supplementary Material, we verify conditions (A1)–(A3) for each of three right-censoring schemes when Θn := 1 − Sn, and Φ0 and Φn are both equal to the identity. Thus, to use Theorem 7, it would suffice to verify that tends to zero faster than n−1/3. This is straightforward given the weak convergence of n1/2 (Sn – S0). Thus, the above theorem provides distributional results for monotone hazard function estimators in each right-censoring scheme considered, as summarized below:
(i) no censoring: , which agrees with results from Prakasa Rao (1970);
(ii) independent right-censoring: , which agrees with results from Huang and Wellner (1995);
(iii) conditionally independent right-censoring, an important setting that does not seem to have been previously studied in the literature:
If either T or C is independent of W, the unadjusted Kaplan–Meier estimator is consistent for the true marginal survival function of T, and so, unadjusted estimators of the density and hazard functions are consistent. In these cases, we may then ask how the asymptotic distributions of the adjusted and unadjusted estimators compare. Since all limit distributions are of the scaled Chernoff type, it suffices to compare the scale factors arising from the different estimators. The second expression in (iii) is helpful to assess the impact of unnecessary covariate adjustment. If C and W are independent, then G0(x | w) = G0(x) for each w, and so, the scale factors in (ii) and (iii) are identical. If T and W are independent, so that f0(x | w) = f0(x) for each w, but C and W are not, then the scale factor in (iii) is generally larger than the scale factor in (ii). In summary, when using an adjusted rather than unadjusted estimator of the hazard function, there may only be a penalty in asymptotic efficiency when adjusting for covariates that C depends on but T does not. The relative loss of efficiency is given by . In the Supplementary Material, we conduct a simulation study validating these results.
5.3. Example 3: Monotone regression function
We finally consider estimation of a monotone regression function. We first focus on the simple case in which the association between the outcome and exposure of interest is not confounded. In such cases, the parameter of interest is the conditional mean of the outcome given exposure level, and the standard least-squares isotonic regression estimators can be used. We show that our general theory covers this classical case. We then consider the case in which the relationship between outcome and exposure is confounded but the confounders of this relationship have been recorded. In this more challenging case, we consider the marginalization (or standardization) of the conditional mean outcome given exposure level and confounders over the marginal confounder distribution. We study this problem using results from Section 4, which allow us to provide theory for a novel estimator proposed for this important case.
5.3.1. No confounding
In the standard least-squares isotonic regression problem, we observe independent replicates of O := (A, Y), where is an outcome and is the exposure of interest. We are interested in the conditional mean function θ0 := μ0, where μ0(x) := E0(Y | A = x) is the mean outcome at exposure level x. The primitive function of θ0 can be written as for each t, where f0 is the marginal density of A. The corresponding primitive parameter at x is pathwise differentiable with nonparametric efficient influence function . An obvious approach to estimation of θ0 consists of constructing an asymptotically linear estimator of Θ0—this involves nonparametric estimation of the nuisance density f0—and differentiating the GCM of the resulting curve—this involves selecting the interval over which the GCM is calculated.
By using a domain transformation, it is possible to avoid both the need for nonparametric density estimation and the choice of isotonization interval. Let Φ0 be the marginal distribution function of A. With this transformation, we note that and for each t. This suggests taking Φn to be the empirical distribution function based on A1, A2,...,An and . The resulting estimator θn(x) is precisely the well-known least-squares isotonic regression estimator of θ0(x). Since Φn is a step function with jumps at the observed values of A, θn(x) is equal to the left-hand slope of the GCM at Φn(x) of the so-called cusum diagram , where we let A0 = −∞, S0 = 0 and for k ≥ 1.
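The cusum-diagram characterization translates directly into a few lines of code. The sketch below is an illustration under stated assumptions rather than the authors' implementation: it assumes no ties among the exposure values, takes the diagram to consist of the points (k/n, Sk/n) with Sk the partial sum of outcomes ordered by exposure, and uses the hypothetical function names `gcm_left_slopes` and `isotonic_regression_via_cusum`.

```python
import numpy as np


def gcm_left_slopes(x, y):
    """Slopes of the greatest convex minorant of the points (x[k], y[k]).

    Returns one slope per interval (x[k-1], x[k]]; x must be strictly increasing.
    Implemented as weighted pool-adjacent-violators on the raw slopes.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.diff(x)
    s = np.diff(y) / w
    blocks = []  # each block: [pooled slope, total width, number of intervals]
    for sk, wk in zip(s, w):
        blocks.append([sk, wk, 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2, c1 + c2])
    return np.concatenate([np.full(c, m) for m, _, c in blocks])


def isotonic_regression_via_cusum(a, y):
    """Least-squares isotonic fit at the sorted exposures (assumes no ties in a)."""
    a = np.asarray(a, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(a, kind="stable")
    n = len(a)
    phi = np.arange(n + 1) / n  # empirical distribution function at the sorted exposures
    cusum = np.concatenate([[0.0], np.cumsum(y[order])]) / n
    # The left-hand slope of the GCM at phi[k] is the fitted value at the k-th smallest exposure.
    return a[order], gcm_left_slopes(phi, cusum)


# Outcomes ordered by exposure are (1, 3, 2, 4); the violating pair (3, 2) is pooled.
a_sorted, fitted = isotonic_regression_via_cusum([0.3, 0.1, 0.4, 0.2], [2.0, 1.0, 4.0, 3.0])
print(fitted)  # [1.  2.5 2.5 4. ]
```

Because the transformed domain is [0, 1] by construction, no isotonization interval has to be chosen, which is the practical advantage noted above.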
Because both Γn and Φn are linear estimators, these estimators do not generate second-order remainder terms to analyze. The influence functions of Γn and Φn are, respectively, and . In the Supplementary Material, we demonstrate that if in a neighborhood of x, the conditional variance function, defined pointwise as , is bounded and continuous, and Φ0 possesses a positive, continuous density, then Theorem 4 holds with
coinciding with the classical results of Brunk (1970).
5.3.2. Confounding by recorded covariates
We now consider a scenario in which the relationship between outcome Y and exposure A is confounded by a vector W of recorded covariates. The observed data unit is thus O := (W,A,Y). A more relevant estimand in this scenario might be the marginalized regression function θ0 := ν0 with ν0(x) defined as E0 [E0(Y | A = x,W) ]. We note that ν0(x) can be interpreted as a causal dose-response curve if (i) W includes all confounders of the relationship between A and Y, and (ii) the probability of observing an individual subject to exposure level x is positive in P0-almost every stratum defined by W. In many scientific settings, it may be known that the causal dose-response curve is monotone in exposure level.
We again consider transformation by the marginal distribution function of A. In other words, we set Φ0(x) := P0(A ≤ x) and take for each x. We then have that
where g0 is the density ratio (a,w) ↦ f0(a | w)/f0(a), with f0(a | w) denoting the conditional density function of A at a given W = w and f0(a) the marginal density function of A at a as before, and μ0 is the regression function (a, w) ↦ E0(Y | A = a, W = w). While in this case the domain transform does not eliminate the need to estimate nuisance functions, it nevertheless results in a procedure for which there is no need to choose the interval over which the GCM is calculated.
Setting for each x and w, the nonparametric efficient influence function of Γ0(x) is
Suppose that μn and gn are estimators of μ0 and g0, respectively. If the empirical distributions Φn and Qn based on A1, A2,…,An and W1, W2,…,Wn, respectively, are used as estimators of Φ0 and Q0, it can be shown that
is a one-step estimator of Γ0(x), and that it is asymptotically efficient under regularity conditions on the nuisance estimators μn and gn.
Conditions (B1)–(B5) can be verified with routine but tedious work. Here, we focus on condition (B3), which allows us to obtain the scale parameter of the limit distribution, and on condition (B4), which requires that the nuisance estimators converge sufficiently fast. We find that condition (B4) is satisfied if, for some ϵ > 0,
and additional empirical process conditions hold. Turning to condition (B3), under smoothness conditions, , where denotes the conditional variance function of Y given A and W. We then find that the scale parameter of the limit Chernoff distribution is
The marginalized and marginal regression functions coincide exactly, that is, ν0 = μ0, if, for example, (i) Y and W are conditionally independent given A, or (ii) A and W are independent. It is natural then to ask how the limit distributions of estimators of these two parameters compare under scenarios (i) and (ii), when the parameters in fact agree with each other. In scenario (i), the scale parameter obtained based on the estimator accounting for potential confounding reduces to
by Jensen’s inequality. Thus, if Y and W are conditionally independent given A, in which case there is no need to adjust for potential confounders, the marginal isotonic regression estimator has a more concentrated limit distribution than the marginalized isotonic regression estimator. In scenario (ii), the scale parameter of the estimator accounting for potential confounding is
given that by the law of total variance. Thus, if A and W are independent, the marginal isotonic regression estimator has a less concentrated limit distribution than the marginalized isotonic regression estimator. In both scenarios (i) and (ii), the difference in concentration between the limit distributions of the two estimators varies with the amount of dependence between A and W. We note that these observations are analogous to those obtained in linear regression.
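For reference, the law of total variance invoked in scenario (ii) can be written out as below. The display uses generic notation (Var0 and E0 taken under P0) rather than the paper's own symbols for the conditional variance functions, so it should be read as a reminder of the standard identity and not as a reconstruction of the elided display.

```latex
\mathrm{Var}_0(Y \mid A = a)
  = E_0\{\mathrm{Var}_0(Y \mid A, W) \mid A = a\}
  + \mathrm{Var}_0\{E_0(Y \mid A, W) \mid A = a\}
  \;\geq\; E_0\{\mathrm{Var}_0(Y \mid A, W) \mid A = a\}.
```

When A and W are independent, the conditional expectation on the right-hand side is an average over the marginal distribution of W, so the variance of Y given A alone is never smaller than the averaged variance of Y given A and W. This is consistent with the direction of the comparison stated for scenario (ii).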
6. Concluding remarks
We have studied a broad class of estimators of monotone functions based on differentiating the greatest convex minorant of a preliminary estimator of a primitive parameter. A novel aspect of the class we have considered is its allowance for the primitive parameter to involve a possibly data-dependent transformation of the domain. The class we have defined is useful because it generalizes classical approaches for simple monotone functions, including density, hazard and regression functions, facilitates the integration of flexible, data-adaptive learning techniques, and allows valid asymptotic statistical inference. We have provided general asymptotic results for estimators in this class and have also derived refined results for the important case wherein the primitive estimator is uniformly asymptotically linear. We have proposed novel estimators of extensions of classical monotone parameters that deal with common sampling complications, and described their large-sample properties using our general results.
Our primary goal in this paper has been to establish general theoretical results that can be applied to study many specific estimators, and as such, there are numerous potential applications of our results. There are also a multitude of useful properties and modifications of Grenander-type estimators that have been studied in the literature and whose extension to our class would be important. For instance, kernel smoothing of a Grenander-type estimator yields a monotone estimator that possesses many of the properties of usual kernel smoothing estimators, including possibly faster convergence to a normal distribution (e.g., Groeneboom, Jongbloed and Witte (2010), Mammen (1991), Mukerjee (1988)). The asymptotic distribution of the supremum norm error of Grenander-type estimators has also been derived (e.g., Durot, Kulikov and Lopuhaä (2012)), and extending this result to our class would further refine our pointwise results. Asymptotic results at the boundaries of the domain and corrections for poor behavior there have been developed and would further enhance the utility of these methods (e.g., Balabdaoui et al. (2011), Woodroofe and Sun (1993), Kulikov and Lopuhaä (2006)).
There have also been various proposals for constructing asymptotically valid pointwise confidence intervals for Grenander-type estimators without the need to compute the complicated scale parameters appearing in their limit distribution. In regular statistical problems, the bootstrap is one of the most widely used such methods; unfortunately, the nonparametric bootstrap is known to fail for Grenander-type estimators (e.g., Kosorok (2008), Sen, Banerjee and Woodroofe (2010)). However, these articles have demonstrated that the m-out-of-n bootstrap can be valid for Grenander-type estimators, and that bootstrapping smoothed versions of Grenander-type estimators can also be an effective strategy for performing inference. Asymptotically pivotal distributions based on likelihood ratios have also been used to avoid the need to estimate nuisance parameters in the limit distribution and to provide a basis for improved finite-sample inference (e.g., Banerjee and Wellner (2001), Banerjee (2005a, 2005b, 2007), Groeneboom and Jongbloed (2015)). Considering these strategies in our setting would be particularly interesting.
Acknowledgments
The authors thank the referees and associate editor for their constructive and insightful comments that helped improve this manuscript. They also thank Antoine Chambaz and Mark van der Laan for stimulating conversations that sparked their interest in this problem, Jon Wellner for sharing insight on the history of this problem and Alex Luedtke and Peter Gilbert for providing feedback early on in this work.
Both authors were supported by NIAID grant 5UM1AI058635.
The second author was supported by the Career Development Fund of the Department of Biostatistics at the University of Washington.
Footnotes
SUPPLEMENTARY MATERIAL
Supplement: Proofs and simulations (DOI: 10.1214/19-AOS1835SUPP; .pdf). The supplement includes proofs of Theorems 1–7, a heuristic justification for the influence function forms in Theorem 5, additional technical details for the examples of Section 5, and a simulation study that illustrates the large-sample results of Sections 3–4 on Examples 1 and 2 from Section 5.
REFERENCES
- ANEVSKI D and HÖSSJER O (2006). A general asymptotic scheme for inference under order restrictions. Ann. Statist. 34 1874–1930. MR2283721 doi:10.1214/009053606000000443
- ANEVSKI D and SOULIER P (2011). Monotone spectral density estimation. Ann. Statist. 39 418–438. MR2797852 doi:10.1214/10-AOS804
- BAGCHI P, BANERJEE M and STOEV SA (2016). Inference for monotone functions under short- and long-range dependence: Confidence intervals and new universal limits. J. Amer. Statist. Assoc. 111 1634–1647. MR3601723 doi:10.1080/01621459.2015.1100622
- BALABDAOUI F, JANKOWSKI H, PAVLIDES M, SEREGIN A and WELLNER J (2011). On the Grenander estimator at zero. Statist. Sinica 21 873–899. MR2829859 doi:10.5705/ss.2011.038a
- BANERJEE M (2005a). Likelihood ratio tests under local alternatives in regular semiparametric models. Statist. Sinica 15 635–644. MR2233903
- BANERJEE M (2005b). Likelihood ratio tests under local and fixed alternatives in monotone function problems. Scand. J. Stat. 32 507–525. MR2232340 doi:10.1111/j.1467-9469.2005.00458.x
- BANERJEE M (2007). Likelihood based inference for monotone response models. Ann. Statist. 35 931–956. MR2341693 doi:10.1214/009053606000001578
- BANERJEE M and WELLNER JA (2001). Likelihood ratio tests for monotone functions. Ann. Statist. 29 1699–1731. MR1891743 doi:10.1214/aos/1015345959
- BEARE BK and FANG Z (2017). Weak convergence of the least concave majorant of estimators for a concave distribution function. Electron. J. Stat. 11 3841–3870. MR3714300 doi:10.1214/17-EJS1349
- BRUNK HD (1970). Estimation of isotonic regression. In Nonparametric Techniques in Statistical Inference (Proc. Sympos., Indiana Univ., Bloomington, Ind., 1969) 177–197. Cambridge Univ. Press, London. MR0277070
- CAROLAN C and DYKSTRA R (1999). Asymptotic behavior of the Grenander estimator at density flat regions. Canad. J. Statist. 27 557–566. MR1745821 doi:10.2307/3316111
- DEDECKER J, MERLEVÈDE F and PELIGRAD M (2011). Invariance principles for linear processes with application to isotonic regression. Bernoulli 17 88–113. MR2797983 doi:10.3150/10-BEJ273
- DUROT C (2007). On the Lp-error of monotonicity constrained estimators. Ann. Statist. 35 1080–1104. MR2341699 doi:10.1214/009053606000001497
- DUROT C, GROENEBOOM P and LOPUHAÄ HP (2013). Testing equality of functions under monotonicity constraints. J. Nonparametr. Stat. 25 939–970. MR3174305 doi:10.1080/10485252.2013.826356
- DUROT C, KULIKOV VN and LOPUHAÄ HP (2012). The limit distribution of the L∞-error of Grenander-type estimators. Ann. Statist. 40 1578–1608. MR3015036 doi:10.1214/12-AOS1015
- DUROT C and LOPUHAÄ HP (2014). A Kiefer–Wolfowitz type of result in a general setting, with an application to smooth monotone estimation. Electron. J. Stat. 8 2479–2513. MR3285873 doi:10.1214/14-EJS958
- GILL RD, VAN DER LAAN MJ and ROBINS JM (1997). Coarsening at random: Characterizations, conjectures, counter-examples. In Proceedings of the First Seattle Symposium in Biostatistics (Lin DY, ed.) 255–294. Springer, New York.
- GRENANDER U (1956). On the theory of mortality measurement. II. Scand. Actuar. J. 39 125–153.
- GROENEBOOM P (1985). Estimating a monotone density. In Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer, Vol. II (Berkeley, Calif., 1983). Wadsworth Statist./Probab. Ser. 539–555. Wadsworth, Belmont, CA. MR0822052
- GROENEBOOM P and JONGBLOED G (2014). Nonparametric Estimation Under Shape Constraints: Estimators, Algorithms and Asymptotics. Cambridge Series in Statistical and Probabilistic Mathematics 38. Cambridge Univ. Press, New York. MR3445293 doi:10.1017/CBO9781139020893
- GROENEBOOM P and JONGBLOED G (2015). Nonparametric confidence intervals for monotone functions. Ann. Statist. 43 2019–2054. MR3375875 doi:10.1214/15-AOS1335
- GROENEBOOM P, JONGBLOED G and WITTE BI (2010). Maximum smoothed likelihood estimation and smoothed maximum likelihood estimation in the current status model. Ann. Statist. 38 352–387. MR2589325 doi:10.1214/09-AOS721
- GROENEBOOM P and WELLNER JA (2001). Computing Chernoff’s distribution. J. Comput. Graph. Statist. 10 388–400. MR1939706 doi:10.1198/10618600152627997
- HEITJAN DF and RUBIN DB (1991). Ignorability and coarse data. Ann. Statist. 19 2244–2253. MR1135174 doi:10.1214/aos/1176348396
- HUANG J and WELLNER JA (1995). Estimation of a monotone density or monotone hazard under random censoring. Scand. J. Stat. 22 3–33. MR1334065
- HUANG Y and ZHANG C-H (1994). Estimating a monotone density from censored observations. Ann. Statist. 22 1256–1274. MR1311975 doi:10.1214/aos/1176325628
- HUBBARD AE, VAN DER LAAN MJ and ROBINS JM (2000). Nonparametric locally efficient estimation of the treatment specific survival distribution with right censored data and covariates in observational studies. In Statistical Models in Epidemiology, the Environment, and Clinical Trials (Minneapolis, MN, 1997). IMA Vol. Math. Appl. 116 135–177. Springer, New York. MR1731683 doi:10.1007/978-1-4612-1284-3_3
- KIM J and POLLARD D (1990). Cube root asymptotics. Ann. Statist. 18 191–219. MR1041391 doi:10.1214/aos/1176347498
- KOSOROK MR (2008). Bootstrapping the Grenander estimator. In Beyond Parametrics in Interdisciplinary Research: Festschrift in Honor of Professor Pranab K. Sen (Balakrishnan N, Peña EA and Silvapulle MJ, eds.). Collections 1 282–292. Institute of Mathematical Statistics. doi:10.1214/193940307000000202
- KULIKOV VN and LOPUHAÄ HP (2006). The behavior of the NPMLE of a decreasing density near the boundaries of the support. Ann. Statist. 34 742–768. MR2283391 doi:10.1214/009053606000000100
- LASLETT GM (1982). The survival curve under monotone density constraints with applications to two-dimensional line segment processes. Biometrika 69 153–160. MR0655680 doi:10.1093/biomet/69.1.153
- LEURGANS S (1982). Asymptotic distributions of slope-of-greatest-convex-minorant estimators. Ann. Statist. 10 287–296. MR0642740
- LOPUHAÄ HP and MUSTA E (2018a). A central limit theorem for the Hellinger loss of Grenander-type estimators. Stat. Neerl. To appear. doi:10.1111/stan.12153
- LOPUHAÄ HP and MUSTA E (2018b). The distance between a naive cumulative estimator and its least concave majorant. Statist. Probab. Lett. 139 119–128. MR3802192 doi:10.1016/j.spl.2018.04.001
- MAMMEN E (1991). Estimating a smooth monotone regression function. Ann. Statist. 19 724–740. MR1105841 doi:10.1214/aos/1176348117
- MCNICHOLS DT and PADGETT WJ (1982). Maximum likelihood estimation of unimodal and decreasing densities based on arbitrarily right-censored data. Comm. Statist. Theory Methods 11 2259–2270. MR0678684 doi:10.1080/03610928208828387
- MUKERJEE H (1988). Monotone nonparametric regression. Ann. Statist. 16 741–750. MR0947574 doi:10.1214/aos/1176350832
- PFANZAGL J (1982). Contributions to a General Asymptotic Statistical Theory. Lecture Notes in Statistics 13. Springer, New York. MR0675954
- PRAKASA RAO BLS (1969). Estimation of a unimodal density. Sankhya A 31 23–36. MR0267677
- PRAKASA RAO BLS (1970). Estimation for distributions with monotone failure rate. Ann. Math. Stat. 41 507–519. MR0260133 doi:10.1214/aoms/1177697091
- SCHARFSTEIN DO and ROBINS JM (2002). Estimation of the failure time distribution in the presence of informative censoring. Biometrika 89 617–634. MR1929167 doi:10.1093/biomet/89.3.617
- SEN B, BANERJEE M and WOODROOFE M (2010). Inconsistency of bootstrap: The Grenander estimator. Ann. Statist. 38 1953–1977. MR2676880 doi:10.1214/09-AOS777
- TSIATIS AA (2006). Semiparametric Theory and Missing Data. Springer Series in Statistics. Springer, New York. MR2233926
- VAN DER LAAN MJ, POLLEY EC and HUBBARD AE (2007). Super learner. Stat. Appl. Genet. Mol. Biol. 6 Art. 25, 23. MR2349918 doi:10.2202/1544-6115.1309
- VAN DER LAAN MJ and ROBINS JM (2003). Unified Methods for Censored Longitudinal Data and Causality. Springer, New York.
- VAN DER LAAN MJ and ROSE S (2011). Targeted Learning: Causal Inference for Observational and Experimental Data. Springer Series in Statistics. Springer, New York. MR2867111 doi:10.1007/978-1-4419-9782-1
- VAN DER VAART A and VAN DER LAAN MJ (2006). Estimating a survival distribution with current status data and high-dimensional covariates. Int. J. Biostat. 2 Art. 9, 42. MR2306498 doi:10.2202/1557-4679.1014
- VAN DER VAART AW and WELLNER JA (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics. Springer, New York. MR1385671 doi:10.1007/978-1-4757-2545-2
- WESTLING T and CARONE M (2020). Supplement to “A unified study of nonparametric inference for monotone functions.” doi:10.1214/19-AOS1835SUPP
- WOODROOFE M and SUN J (1993). A penalized maximum likelihood estimate of f(0+) when f is nonincreasing. Statist. Sinica 3 501–515. MR1243398
- WRIGHT FT (1981). The asymptotic behavior of monotone regression estimates. Ann. Statist. 9 443–448. MR0606630
- ZENG D (2004). Estimating marginal survival function by adjusting for dependent censoring using many covariates. Ann. Statist. 32 1533–1555. MR2089133 doi:10.1214/009053604000000508