Biostatistics (Oxford, England). 2018 Apr 18;20(3):517–541. doi: 10.1093/biostatistics/kxy015

Semi-supervised neighborhoods and localized patient outcome prediction

Alison E Kosel 1, Patrick J Heagerty 1
PMCID: PMC6821385  PMID: 29912289

Summary

Robust statistical methods that can provide patients and their healthcare providers with individual predictions are needed to help guide informed medical decisions. Ideally an individual prediction would display the full range of possible outcomes (full predictive distribution), would be obtained with a user-specified level of precision, and would be minimally reliant on statistical model assumptions. We propose a novel method that satisfies each of these criteria via the semi-supervised creation of an axis-parallel covariate neighborhood constructed around a given point defining the patient of interest. We then provide non-parametric estimates of the outcome distribution for the subset of subjects in this neighborhood, which we refer to as a localized prediction. We implement local prediction methods using dynamic graphical methods which allow the user to vary key options such as the choice of the variables defining the neighborhood, and the size of the neighborhood.

Keywords: Local prediction, Non-parametric, Semi-supervised learning

1. Introduction

1.1. Scientific motivation

Patients who understand their individualized prognosis are more likely to work collaboratively with their healthcare provider to make treatment choices that are consistent with their goals and specific values. For example, the Seattle Heart Failure Model (SHFM) predicts survival for heart failure patients using traits such as patient sex, age, laboratory measures, and medications (Levy and others, 2006). The SHFM was developed because physicians need to counsel patients about their prognosis in order to guide decisions about medications, devices, or end-of-life care. Similarly, the Framingham Heart Study has developed a prognostic score to assess 10-year risk of cardiovascular disease based on patient sex, age, total and high-density lipoprotein cholesterol, systolic blood pressure, treatment for hypertension, smoking, and diabetes status. The Framingham score has been recommended for use in guiding preventive care decisions (D’Agostino and others, 2008).

Prognostic scores are also used in other medical settings such as organ transplantation. For example, the lung allocation score assigns priority to lung transplant recipients in the United States based on patient characteristics such as age, clinical status, and specific diagnostic categories which predict both the risk of death without intervention and survival if the patient is transplanted (Egan and others, 2006). In liver disease, the Mayo model for survival among primary biliary cirrhosis patients is based on measurements that are simple to obtain including patient age, total serum bilirubin and serum albumin concentrations, prothrombin time, and severity of edema (Dickson and others, 1989). More recently, the Model for end-stage liver disease assesses chronic liver disease severity to determine priority for liver transplant recipients (Kamath and others, 2001). These select examples show that prognostic models are used across a number of disease settings to both inform patient expectations and to guide clinical decision making.

Our goal is to develop a statistical framework for providing adaptive individual patient predictions that are easily interpretable by both clinicians and patients, and that do not depend on any model assumptions. Non-parametric estimation typically requires large sample sizes. However, we expect the feasibility of such direct estimation approaches will increase with the adoption of electronic health records (EHRs) prompted by recent government initiatives such as the Health Information Technology for Economic and Clinical Health (HITECH) Act, enacted in 2009, which allocates roughly $30 billion to promote the adoption of health information technology and incentivizes its use (Jha, 2010). Ultimately data quality and generalizability need to be considered with any use of contemporary EHR data (Keiding and Louis, 2016).

1.2. Statistical background and innovation

Understanding how individuals perceive information in order to make decisions is essential for developing effective and appropriate statistical summaries. Psychological research has shown that both the perceived personal relevance and the validity of information are important aspects that determine the likelihood that information will lead to individual action (Petty and Cacioppo, 1986). Furthermore, select surveys have shown that the likelihood of physicians changing their medical practice on the basis of new clinical evidence depends on the physician’s impression of both the relevance and the reliability of the research results. For example, one aspect that has been shown to impact uptake is the sample size used for a clinical study (Ibrahim and others, 2009), with a larger sample size leading to a greater probability of adopting new interventions. Therefore, we seek a non-parametric prediction method that allows the user to directly control the number of observations within a local neighborhood of subjects. We also believe that it is critically important to disclose the exact region/neighborhood that is used for any individual prediction so that the personal relevance of the support within the data used to generate the prediction can be subjectively judged by the patient and/or provider.

We introduce a simple distance-based method to perform local non-parametric estimation based on a neighborhood that is selected to have a fixed sample size (precision), and which is based on independent restrictions on each covariate of interest. Such a rectangular neighborhood is termed “axis-parallel” and yields a simple, interpretable description for patients and providers that communicates the data that was actually used to construct local distribution estimates. Our basic goal is to transfer the standard clinical question of making a prediction for an individual subject to the question of prediction for “subjects like the specific individual”, and we seek to return both the desired predictive distribution estimate and the neighborhood specification that was used to provide the estimate. In summary, our choice is to transfer from a specific covariate point to a select subgroup of patients, and to explicitly return the subgroup used.

In order to detail our approach, consider an outcome, $Y$, and a set of covariates, $X$. Typically predictive research goals focus on estimating the outcome distribution for either the entire population of patients (i.e. $Y$ marginally) or for a single patient characterized by exact covariate values (i.e. $Y \mid X = x$). Herein, we choose an intermediate target, where we let $N_d(x)$ be a neighborhood of distance $d$ about the covariate vector $x$, and we then define the parameter of interest, $F_{N_d(x)}(y)$, based on a localized neighborhood:

$$F_{N_d(x)}(y) = P\{Y \le y \mid X \in N_d(x)\} \qquad (1.1)$$

By specifying the distance $d$ we can either focus on a point ($d = 0$), the entire population ($d = \infty$), or a select subgroup using $0 < d < \infty$.

Ultimately, we focus on the novel idea of using a fixed precision neighborhood, defined as a region within the covariate space that has a fixed number of points contained within it. By choosing a neighborhood Inline graphic such that it contains a desired number of points, Inline graphic, we are making the decision to accept variable distances to define a neighborhood depending on the density of points around an index point Inline graphic of interest. Finally, by returning both an outcome prediction and the neighborhood used, we clearly inform the user of the data’s ability to answer the question of interest: if there are very few data points near a given patient, then the patient and/or clinician will be made aware of this by the associated size of the neighborhood defined by the region in the covariate space used to support and generate the prediction. Although a neighborhood Inline graphic could in principle be any shape, we chose to restrict our consideration to axis-parallel neighborhoods, or simple covariate rectangles. We do so for ease of interpretation since axis-parallel boxes can be described by a defined range used for restriction on each co-ordinate of Inline graphic, and are therefore easily presented and understood by medical professionals and patients.

In order to provide individualized predictions, we seek to estimate an outcome distribution for subgroups of patients who are characterized by their specific baseline clinical or demographic variables. We are interested in the full conditional distribution rather than a summary measure such as the mean, since the distribution can then provide multiple meaningful summaries such as the conditional median value, or the 25th and 75th percentiles. Our premise is that providing details such as the percentage of subjects with very good or very bad outcomes is important for decision makers. In addition, patients can easily understand statements such as: “25% of subjects like you will have an outcome less than or equal to 3 on a 10-point pain scale,” and therefore, we focus on determining all percentiles, or equivalently the full conditional distribution. We believe that simple estimates of local means are inadequate to fully inform patient expectations.

Conditional predictions are a focus of many contemporary statistical learning methods that allow great flexibility in the breadth of candidate input predictive information (Hastie and others, 2009). However, most regression and/or learning methods focus on generating a predictive conditional mean or risk prediction, while we seek a valid estimate of the full predictive distributions. Another common limitation to the use of standard predictive methods is that transparency in terms of how well the data support prediction for an individual is not generally an intentional product of the methods. Many clinical risk prediction calculators have been created in recent years, but use of these tools rarely, if ever, gives information back to the user about the underlying covariate data distribution or relevance for the target individual. Our approach is to explicitly produce information on the “support” within the data toward generating a conditional prediction, and we explicitly return information on the characteristics of the patient neighbors who are used to inform individual prediction. Finally, quantile regression methods described by Koenker and Bassett (1978) may be used to obtain any pre-selected percentile, but do not generally provide a full distribution estimate without imposing monotonicity constraints across separate quantile regressions.

Our proposed methods are a variation on non-parametric local kernel methods studied by Stüte (1986b). However, our proposal involves choice of a specific data adaptive distance function to construct meaningful and interpretable covariate neighborhoods that form the basis for estimation. We define a distance function that uses the strength of the covariate relationships with the outcome, and therefore, we are proposing a semi-supervised local predictive method. In addition, standard bandwidth selection methods for local estimation typically consider predictive criteria that balance bias and variance, while we see value in direct control of precision (or variance) at a pre-specified level in order to facilitate interpretation, and then we explicitly communicate the transfer of estimation from an index point to a specific patient neighborhood.

As a non-parametric method, we are practically restricted to a low-dimensional set of predictors. However, our estimation can be applied after meaningful covariate dimension reduction. For example, unsupervised methods such as principal components may be adopted to derive a summary score for a related group of clinical measures. Alternatively, supervised methods can generate predictive scores which can then be used to form one element of neighborhood construction. In our illustration in Section 4, we demonstrate use of one such dimension reduction approach.

2. Methods

In this section, we detail our choice of distance metrics, our approach to obtaining an adaptive neighborhood, and the statistical properties of our local distribution estimator. A summary of key notation can be found below in Table 1.

Table 1.

A summary of the notation used to characterize the proposed localized prediction

Description Notation
Dimension of points Inline graphic
Number of points Inline graphic
Number of points in neighborhood Inline graphic
Points Inline graphic
Outcomes Inline graphic
Point of interest Inline graphic
Neighborhood of distance Inline graphic about Inline graphic Inline graphic
Estimated neighborhood of distance Inline graphic about Inline graphic Inline graphic
Vertices defining Inline graphic Inline graphic
Distribution function for outcome in Inline graphic Inline graphic

2.1. Neighborhood definition

Our goal is to create an interpretable localized estimate of the distribution of patient outcomes. We create an estimate based on subjects similar to a given patient, and therefore, rely on covariate distance metrics to define neighbors. We allow the importance of covariates to be determined by adaptive neighborhood scaling, and the final neighborhood can then be used to estimate the full local distribution.

We consider a class of distance options based on the family of metrics defined by key parameters:

graphic file with name M35.gif (2.1)

One can in theory choose any metric in this class to create a neighborhood Inline graphic. Examples of commonly used metrics in this class are Mahalanobis distance, with Inline graphic, Inline graphic, and Inline graphic; L1 distance, with Inline graphic, Inline graphic, and Inline graphic; and nearest neighbors, with Inline graphic as the function that calculates the element-wise marginal distribution function for each covariate. In the subsections below, we comment on each element of the distance function with the goal of identifying a flexible yet interpretable neighborhood specification.

2.2. Choice of Inline graphic

The choice of the distance parameter Inline graphic ultimately determines the shape of any neighborhood defined by a restriction to those points within a fixed distance. For example, Inline graphic corresponds to the Inline graphic norm and results in a generalized circle. Using Inline graphic corresponds to Inline graphic and is well known to return a “diamond” shaped region. Each of these choices defines a region that is not simple to describe to patients or providers in terms of the restriction on values used for each covariate. However, to report back to the user the covariate neighborhood, we could in principle choose any metric and then take the rectangular enclosure of the points defined by that metric. Unfortunately, controlling the size of the local enclosure, or Inline graphic, in these cases would be difficult, since it is not directly determined by the defining distance, and would potentially vary across the covariate space. Therefore, we prefer to use Inline graphic, or a scaled version of the supremum norm, Inline graphic, dependent on a choice of Inline graphic, since use of this metric and a distance restriction on points around an index point will directly yield an axis-parallel neighborhood. Such neighborhoods are defined by independent interval restrictions, Inline graphic using lower (Inline graphic) and upper (Inline graphic) limits for each covariate. In this case, the description of a neighborhood is given by a simple table where each covariate is listed and the associated limit values presented. When using the scaled sup-norm metric to define a neighborhood, Inline graphic, we can control the number of points Inline graphic exactly by appropriate selection of a distance restriction, and the defining enclosure Inline graphic is simple to communicate.

Although the supremum norm creates a square by definition, we choose to shrink this square to a rectangle by taking the furthest actualized pair of distances along each axis instead of considering the neighborhood to extend equal distances in all directions from Inline graphic. The difference is likely to be small, but from an interpretation perspective, it is more sensible to consider the neighborhood to be only as big as the points it contains. Note that the resulting rectangle may not be symmetric about Inline graphic; it is unlikely to be highly asymmetric, but the square will be shrunk different distances on each side. Also note that, while the square created by the supremum norm is the smallest square centered at Inline graphic containing Inline graphic points, the resulting rectangle may not be the smallest rectangle containing Inline graphic and Inline graphic other points. If, for example, the data is particularly sparse on one side of Inline graphic, a smaller and highly unbalanced rectangle could be created.

2.3. Choice of Inline graphic

Although we have settled on a preferred choice of Inline graphic for our distance metric, we must still consider the choice of Inline graphic and Inline graphic. Note that the supremum norm treats a one unit change in the direction of any co-ordinate as equivalent. However, this is undesirable in many cases due to both different scales of measurement for each covariate, and differential variable importance toward predicting the outcome. For example, one might not wish to equate a 1 mm of mercury change in systolic blood pressure with a one unit change in body mass index. More generally, it would not be desirable to have neighborhoods depend on choice of measurement units for any variable. Therefore, adoption of any distance function that combines information across axes will require consideration of appropriate scaling represented by the matrix Inline graphic or the function Inline graphic.

There are two main categories of methods to rescale candidate covariates: outcome-independent methods (unsupervised) and outcome-dependent methods (supervised). In other words, do we choose to rescale based on the covariates alone or do we take their relationship to the outcome into consideration? In the next subsections, we consider using global or local regressions to determine element-wise covariate rescaling specified by the matrix Inline graphic, and then we consider potential use of transformation functions, Inline graphic.

2.3.1. Global linear supervised marginal scaling

To correct the potential problem of differing scales among covariates, we propose outcome-based marginal rescaling. To implement this, one simply regresses Inline graphic on Inline graphic for each covariate Inline graphic individually. This produces a Inline graphic for each predictor Inline graphic. To rescale, all Inline graphic and Inline graphic are scaled by their regression estimate, and this equates to adopting Inline graphic. In other words, one rescales each coordinate based on its marginal association with the outcome where all covariate scales are equivalent since distances Inline graphic correspond to a common (Inline graphic-unit) change in the expected outcome. Therefore, coordinates that are strongly associated with the outcome have their distances weighted more heavily, while coordinates that are weakly associated with the outcome have their distances weighted less heavily. Marginal regression coefficient rescaling puts all covariates on an outcome-standardized scale.

While there would potentially be benefits to a regression using all covariates in a multivariate model, there may also be drawbacks. Using marginal scaling allows the user to select different subsets of covariates to form neighborhoods without changing the choice of scaling for each variable that is included. Furthermore, adoption of a multivariate regression for scaling invites issues of interaction to be explicitly considered.
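A minimal Python sketch of the outcome-based marginal rescaling described in this subsection follows. It assumes ordinary least squares with an intercept for each single-covariate regression; the function names `marginal_slopes` and `scaled_sup_distance` and the simulated data are hypothetical and are not taken from the source.

```python
import numpy as np

def marginal_slopes(X, y):
    """Slopes from separate simple linear regressions of y on each column of X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)                      # center each covariate
    yc = y - y.mean()                            # center the outcome
    return Xc.T @ yc / np.sum(Xc ** 2, axis=0)   # one least-squares slope per column

def scaled_sup_distance(X, x0, beta):
    """Sup-norm distance after multiplying each coordinate by its marginal slope."""
    X = np.asarray(X, dtype=float)
    return np.max(np.abs((X - x0) * beta), axis=1)

# Simulated data (hypothetical): the covariates are on very different scales
# and covariate 0 is much more strongly associated with the outcome per unit.
rng = np.random.default_rng(1)
X = np.column_stack([rng.normal(0, 1, 1000), rng.normal(0, 10, 1000)])
y = 2.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0, 1, 1000)

beta = marginal_slopes(X, y)
x0 = np.array([0.0, 0.0])
dist = scaled_sup_distance(X, x0, beta)

print("marginal slopes:", beta)
```

After rescaling, a distance of one unit along either axis corresponds to the same change in the fitted marginal mean of the outcome, so the neighborhood no longer depends on the measurement units of the covariates.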

2.3.2. Local linear supervised marginal scaling

Marginal coefficient scaling can be easily relaxed to accommodate possible non-linear associations between a predictor and the outcome. Flexible parametric or smooth non-parametric methods can be used to estimate a general response surface: Inline graphic. In this scenario rescaling at a point of interest Inline graphic could then use the derivative at the target point for rescaling: Inline graphic. Such a local linear rescaling requires choice of methods to determine the functional form either through spline basis choice (parametric) or bandwidth specification (non-parametric). Inference with parametric spline methods is unchanged from that using linear rescaling, while more general non-parametric rescaling may result in altered asymptotic properties for the resulting localized outcome distribution estimate depending on assumptions regarding the choice of the neighborhood size as a function of sample size. We consider these issues in Section 3.
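The idea of rescaling by a derivative at the target point can be sketched as follows. The cubic polynomial fit used here is only a simple stand-in for the spline or smoothing methods mentioned above; it is an assumption of this sketch, as are the function name `local_slope` and the simulated data.

```python
import numpy as np

def local_slope(x, y, x0, degree=3):
    """Derivative at x0 of a polynomial fit of y on a single covariate x."""
    coef = np.polyfit(x, y, degree)             # stand-in for a spline or smoother
    return np.polyval(np.polyder(coef), x0)     # fitted derivative at the target

# Simulated U-shaped association (hypothetical data).
rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, 2000)
y = x ** 2 + rng.normal(0, 0.1, 2000)

for x0 in (-1.0, 0.0, 1.0):
    print(x0, local_slope(x, y, x0))
```

In this example the fitted slope changes sign and magnitude with the target point, whereas a single global linear slope would be close to zero everywhere. The scaling entry for the covariate would be taken from the local slope at the point of interest.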

2.4. Choice of Inline graphic

In this section, we consider outcome-independent methods that can be used to rescale the covariates into comparable units. One common choice would be to use either element-wise standardization or Mahalanobis distance to convert covariates to a standardized scale. A second option would be to transform covariates into percentiles (i.e. a choice of Inline graphic). The advantage of percentiles is that a one unit change in any direction results in inclusion of the same marginal fraction of the data, while the advantage of standard deviation units is that the actual distances within each coordinate are maintained. Finally, note that it is possible to combine outcome-based marginal rescaling with standardization, i.e. apply the outcome-based marginal rescaling to data already converted to the scale of percentiles or to standard deviations.
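The two outcome-independent options can be compared with a short Python sketch. The function names `standardize` and `to_percentiles` and the simulated covariates are assumptions for illustration; the percentile transform is implemented as the element-wise empirical distribution function of each covariate.

```python
import numpy as np

def standardize(X, x0):
    """Convert each covariate to standard deviation units."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)
    return (X - mu) / sd, (x0 - mu) / sd

def to_percentiles(X, x0):
    """Convert each covariate to its empirical percentile (marginal ECDF)."""
    n, p = X.shape
    Xs = np.sort(X, axis=0)

    def ecdf(j, v):
        # Fraction of observed values in column j that are <= v.
        return np.searchsorted(Xs[:, j], v, side="right") / n

    Xp = np.column_stack([ecdf(j, X[:, j]) for j in range(p)])
    x0p = np.array([ecdf(j, x0[j]) for j in range(p)])
    return Xp, x0p

# Simulated covariates on very different scales (hypothetical data).
rng = np.random.default_rng(3)
X = np.column_stack([rng.lognormal(0, 1, 400), rng.normal(120, 15, 400)])
x0 = np.array([1.0, 130.0])

Xz, x0z = standardize(X, x0)
Xp, x0p = to_percentiles(X, x0)

print("point of interest in SD units:   ", x0z)
print("point of interest as percentiles:", x0p)
```

Percentiles are unchanged by any increasing transformation of a covariate, while standard deviation units preserve relative distances within each coordinate. Either transformed data set can then be passed to a distance calculation in place of the raw covariates.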

2.5. Algorithm: Inline graphic and supervised scaling

In this subsection, we detail one attractive algorithm which involves a simple procedure to find the axis-parallel neighborhood of Inline graphic points about a target point Inline graphic. First, the user inputs the point of interest and the number of points desired in the neighborhood. Then the distance between the point of interest and all other points is calculated, and finally a box containing the desired number of closest points is created. Note that, as described in previous sections, the choices of Inline graphic and Inline graphic are left open, but below we detail use of supervised global scaling without use of any additional covariate transformation.

  1. Select Inline graphic and m; let Inline graphic.

  2. Let Inline graphic be either the identity matrix of size Inline graphic (no scaling) or, for Inline graphic, regress Inline graphic on Inline graphic, and let the slope / coefficient be Inline graphic, and Inline graphic be the diagonal matrix created by Inline graphic (outcome-based rescaling). Let Inline graphic be any desired function, such as identity or percentile. For Inline graphic, calculate the resulting distances, Inline graphic.

  3. Calculate the quantile of the empirical distribution of the distances, Inline graphic: Inline graphic. Discard all Inline graphic such that Inline graphic; call the remaining points Inline graphic where Inline graphic.

  4. Find the enclosure of Inline graphic. For Inline graphic, let Inline graphic and Inline graphic.

  5. Finally, given the identified axis-parallel neighborhood we compute the local empirical distribution function as our final localized outcome prediction:
    graphic file with name M118.gif
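The five steps above can be sketched in Python as follows. This is an illustrative reading of the algorithm, not the authors' code: the function name `localized_prediction`, the returned dictionary, and the simulated data are assumptions. The threshold in step 3 is taken to be the m-th smallest distance, and no additional covariate transformation is applied.

```python
import numpy as np

def localized_prediction(X, y, x0, m, supervised=True):
    """Axis-parallel neighborhood of m points about x0 and its local ECDF."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x0 = np.asarray(x0, dtype=float)
    n, p = X.shape

    # Step 2: scaling, either none or one marginal regression slope per covariate.
    if supervised:
        Xc = X - X.mean(axis=0)
        beta = Xc.T @ (y - y.mean()) / np.sum(Xc ** 2, axis=0)
    else:
        beta = np.ones(p)
    dist = np.max(np.abs((X - x0) * beta), axis=1)   # scaled sup-norm distances

    # Step 3: keep the points no further away than the m-th smallest distance.
    cutoff = np.sort(dist)[m - 1]
    keep = dist <= cutoff

    # Step 4: enclosure of the retained points, one interval per covariate.
    lower = X[keep].min(axis=0)
    upper = X[keep].max(axis=0)

    # Step 5: empirical distribution function of the outcomes in the neighborhood.
    y_sorted = np.sort(y[keep])

    def ecdf(t):
        return np.searchsorted(y_sorted, t, side="right") / y_sorted.size

    return {"lower": lower, "upper": upper, "keep": keep, "ecdf": ecdf}

# Simulated data (hypothetical).
rng = np.random.default_rng(3)
X = rng.normal(size=(800, 2))
y = 1.5 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(size=800)
x0 = np.array([0.3, -0.5])

res = localized_prediction(X, y, x0, m=100)
print("covariate limits:", list(zip(res["lower"], res["upper"])))
print("estimated P(Y <= 0) in the neighborhood:", res["ecdf"](0.0))
```

Because the scaling is supervised, the resulting box is narrower along the covariate that is more strongly associated with the outcome. The returned limits are what would be reported to the user alongside the distribution estimate.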

2.6. Functionals

In addition to estimating the full outcome distribution, Inline graphic, we can also estimate any functionals of this distribution such as the mean. Note that our method allows nearly any choice of functional to be estimated because we non-parametrically obtain an estimate of the entire conditional distribution, and therefore simultaneously obtain estimates for any quantiles.
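As a small illustration, several functionals can be read off the outcomes of the subjects in a neighborhood. The sketch below uses hypothetical 0 to 10 pain scores for 40 neighbors; the helper `ecdf_quantile` is an assumed name and implements the generalized inverse of the empirical distribution function.

```python
import numpy as np

def ecdf_quantile(y_nb, prob):
    """Smallest observed value t such that the local ECDF at t is >= prob."""
    ys = np.sort(np.asarray(y_nb, dtype=float))
    k = max(int(np.ceil(prob * ys.size)) - 1, 0)
    return ys[k]

# Hypothetical outcomes (0-10 pain scores) for the 40 subjects in a neighborhood.
rng = np.random.default_rng(4)
y_nb = rng.integers(0, 11, size=40)

summary = {
    "mean": float(np.mean(y_nb)),
    "q25": ecdf_quantile(y_nb, 0.25),
    "median": ecdf_quantile(y_nb, 0.50),
    "q75": ecdf_quantile(y_nb, 0.75),
    "P(Y <= 3)": float(np.mean(y_nb <= 3)),
}
for name, value in summary.items():
    print(name, value)
```

The last entry corresponds directly to a statement of the form "a given percentage of subjects like you will have an outcome less than or equal to 3 on a 10-point pain scale".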

2.7. Ties

With discrete covariate data there is the additional potential issue of ties when constructing fixed size neighborhoods since multiple data points may have the exact same value for one or more covariates at the boundaries of the neighborhood. Therefore, in some cases obtaining precisely the desired level of precision may not be possible unless additional restrictions are used. In particular, since the Inline graphic metric results in separate restrictions on each covariate axis any discreteness can produce substantially more points than desired (or fewer depending on use of Inline graphic or Inline graphic to restrict points). In the case of ties, we recommend applying a second metric, such as Inline graphic, in order to provide additional restriction and yield a neighborhood closer to the target size of Inline graphic. In particular, either an additional Inline graphic or Inline graphic constraint will result in trimming points that are in the furthest corners of the Inline graphic rectangle and will break ties using a secondary distance. However, in the extreme case of highly discrete coordinates, it may be difficult to achieve a neighborhood of Inline graphic points without arbitrary methods to break ties.
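The effect of a secondary metric can be seen in a toy Python example. The integer grid, the target size of 13 points, and the use of Euclidean distance as the tie-breaker are all assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical discrete covariates: all integer pairs on a 7 x 7 grid.
g = np.arange(-3, 4)
X = np.array([(a, b) for a in g for b in g], dtype=float)
x0 = np.array([0.0, 0.0])
m = 13

diff = np.abs(X - x0)
d_sup = diff.max(axis=1)                    # primary (sup-norm) distance
d_l2 = np.sqrt((diff ** 2).sum(axis=1))     # secondary (Euclidean) distance

# The sup-norm restriction alone: ties on the boundary give more than m points.
cutoff = np.sort(d_sup)[m - 1]
n_box = int(np.sum(d_sup <= cutoff))
print("points within the sup-norm cutoff:", n_box)    # 25

# Order by sup-norm first and break ties by Euclidean distance
# (np.lexsort treats the last key as the primary key).
order = np.lexsort((d_l2, d_sup))
chosen = order[:m]
print("points kept after tie-breaking:", len(chosen))  # 13
print("largest Euclidean distance kept:", d_l2[chosen].max())  # 2.0
```

Here the secondary distance trims the points nearest the corners of the box and reaches the target size exactly. With other target sizes on the same grid, ties would remain even in the secondary distance, which is the situation in which an arbitrary rule is needed.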

2.8. Computational complexity

2.8.1. Single Inline graphic.

In practice, we would typically perform calculations dynamically; that is, we would calculate our neighborhood estimates based upon the user’s specific choice. We perform Inline graphic linear regressions, computed as Inline graphic, with an Inline graphic design matrix. Each regression requires calculating Inline graphic, at a cost of Inline graphic; a matrix inversion, at a cost of Inline graphic; calculating Inline graphic, at a cost of Inline graphic; and, finally, multiplying Inline graphic by Inline graphic, at a cost of Inline graphic. The resulting cost for each regression is Inline graphic, which gives us an asymptotic cost of Inline graphic for Inline graphic regressions. Calculating the distance between Inline graphic and the Inline graphic other points has time complexity Inline graphic. We then must sort the distances. For example, “quicksort” is commonly used which averages Inline graphic and has a worst case scenario of Inline graphic (Skiena, 2009). Thus, the sorting dominates the cost; overall we average Inline graphic asymptotically, with a worst case scenario of Inline graphic.
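If only the distance threshold is needed and not the full ordering, a partial selection can replace the sort. This is a possible implementation shortcut that is not discussed in the source; the sketch below uses NumPy's `np.partition`, which selects the k-th smallest element without fully sorting, and checks on simulated distances that it gives the same threshold as sorting.

```python
import numpy as np

# Simulated distances from a point of interest to n other points (hypothetical).
rng = np.random.default_rng(5)
dist = rng.random(100_000)
m = 500

cutoff_sort = np.sort(dist)[m - 1]                  # full sort
cutoff_select = np.partition(dist, m - 1)[m - 1]    # partial selection only

keep = dist <= cutoff_select
print(cutoff_sort == cutoff_select, int(keep.sum()))
```

Selection of this kind runs in linear time on average, so under this shortcut the distance calculation rather than the sort would drive the cost for a single point of interest.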

2.8.2. All Inline graphic.

If individual predictions will be based on an external, static database, then we might wish to have all possible calculations already performed and simply awaiting retrieval. Our method asymptotically results in a cost of Inline graphic on average and Inline graphic in the worst case when all points are used in turn as Inline graphic. Segal and Kedem describe an algorithm to obtain the Inline graphic nearest rectilinear neighbors of a given point that has lower total cost; however, they require that Inline graphic (Segal and Kedem, 1998).

3. Inference

In this section, we consider the asymptotic distribution for the local empirical distribution estimator that uses global scaling to determine the neighborhood. When studying inference, we continue with our transfer from a specific point to a localized neighborhood, and we consider a procedure where a neighborhood of Inline graphic percent of the data about Inline graphic is used, or Inline graphic. We refer to this as the “fixed percentage” scenario. In this situation the target parameter is defined as:

graphic file with name M160.gif

Ultimately we consider four key sources of variability: variation in Inline graphic for observations in a neighborhood; variation in the points Inline graphic that are in the neighborhood; variation associated with the distance restriction, Inline graphic; and variation associated with the global scaling parameters, Inline graphic. Letting Inline graphic be the true conditional distribution function for the Inline graphicth Inline graphic in a sample of Inline graphic, we define Inline graphic. We focus on this local average of distribution functions since we do not necessarily have i.i.d. outcomes within a neighborhood Inline graphic.

Suppose we consider

graphic file with name M171.gif (3.1)
graphic file with name M172.gif (3.2)
graphic file with name M173.gif (3.3)
graphic file with name M174.gif (3.4)

where we decompose the scaled estimation error into three conditionally independent components: that due to uncertainty in outcome (Expression 3.2), that due to the specific points in the neighborhood (Expression 3.3), and that due to uncertainty in neighborhood location (Expression 3.4). We label the uncertainty in outcome, Inline graphic, as Inline graphic and the uncertainty due to specific points and distance, Inline graphic, and Inline graphic, as Inline graphic and then the orthogonality of these two components is easily shown using standard conditioning arguments:

graphic file with name M180.gif

where the second equality is due to the fact that Inline graphic is constant, when Inline graphic is fixed and the third due to the clear independence of Inline graphic and Inline graphic. We, therefore, may begin by analyzing the uncertainty in outcome (Expression 3.2), which, using results from Shorack and Wellner (2009), converges to Inline graphic as Inline graphic, where Inline graphic is a normal process with mean zero and a covariance function Inline graphic that is the limit of Inline graphic, where Inline graphic. Note that, where Inline graphic is a Brownian bridge process, Inline graphic for all Inline graphic, with equality when Inline graphic is iid (i.e. when Inline graphic is the identity function); in essence, Inline graphic is a Gaussian process that is less variable than a Brownian bridge.
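The statement that the limiting process is no more variable than a Brownian bridge can be checked at a single value of the outcome with a short calculation. The notation here is assumed for illustration, since the source's symbols are not recoverable from this text: $F_i(y)$ denotes the true conditional distribution function of the $i$th outcome among the $m$ points in the neighborhood, and $\bar{F}(y)$ denotes their average. With independent outcomes and the points held fixed, the variance of the scaled local empirical distribution function at $y$ is the average of the individual Bernoulli variances, and:

```latex
\begin{aligned}
\frac{1}{m}\sum_{i=1}^{m} F_i(y)\{1 - F_i(y)\}
  &= \bar{F}(y) - \frac{1}{m}\sum_{i=1}^{m} F_i(y)^2 \\
  &= \bar{F}(y)\{1 - \bar{F}(y)\}
     - \frac{1}{m}\sum_{i=1}^{m} \{F_i(y) - \bar{F}(y)\}^2 \\
  &\le \bar{F}(y)\{1 - \bar{F}(y)\},
\end{aligned}
\qquad \text{where } \bar{F}(y) = \frac{1}{m}\sum_{i=1}^{m} F_i(y).
```

The right-hand side is the variance of a Brownian bridge evaluated at $\bar{F}(y)$. Equality holds exactly when every $F_i(y)$ equals $\bar{F}(y)$, which is the identically distributed case.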

Expression 3.3 involves uncertainty associated with specific points, Inline graphic in the neighborhood defined by Inline graphic and therefore can be easily treated with empirical process results for weighted averages. Specifically, the limiting distribution is:

graphic file with name M199.gif

where Inline graphic and the scaling yields the limit of the mean variance of Inline graphic conditional on Inline graphic, which we label Inline graphic.

Finally, we may focus on the uncertainty in neighborhood (Expression 3.4), for which we turn to the Central Limit Theorem (see Appendix A.2 for further details) to obtain, where Inline graphic is the distribution of distances about Inline graphic,

graphic file with name M206.gif (3.5)
graphic file with name M207.gif (3.6)

where Inline graphic and Inline graphic.

Therefore,

graphic file with name M210.gif (3.7)
graphic file with name M211.gif (3.8)

In Appendix A.4 we also provide inference for those who wish to compare our estimate to the point-specific target parameter at the point Inline graphic and obtain an unbiased estimator thereof. In this case, our method can still be used and viewed as a data-adaptive smoothing method. We term this case the smoothing scenario, where Inline graphic, Inline graphic. It is important to note that Inline graphic varies in this scenario, as Inline graphic grows at a slower rate than Inline graphic; because Inline graphic, Inline graphic, and our target is the point rather than the neighborhood.

In the above inference, we assume no scaling, or the use of a known scaling parameter. The use of an estimated scaling parameter introduces additional variation to the estimator, and we, therefore, also consider

graphic file with name M220.gif (3.9)

This additional component can be evaluated using results from van der Vaart and Wellner (2007). Provided that the behavior of Inline graphic satisfies standard assumptions for regression estimators, we obtain

graphic file with name M222.gif (3.10)

where Inline graphic and Inline graphic can be obtained via the delta method (see Appendix).

Note that this regression term is not orthogonal to the uncertainty in outcome (Expression 3.2). We focus here only on the order of this term, as its exact value will vary depending on the method of the parameter’s estimation. We have four possible scenarios: local and global scaling parameter estimation in both the smoothing and the fixed percentage situations. In the smoothing situation under global scaling estimation, the additional uncertainty from estimating Inline graphic is asymptotically negligible; Inline graphic and the rest of the uncertainty term converges at a rate of Inline graphic. In the smoothing scenario under local scaling estimation and the fixed percentage scenario under global scaling estimation, the additional uncertainty from scaling estimation is of the same order as the Gaussian process. Local scaling would generally not be used in the fixed percentage scenario since the variation due to rescaling would dominate asymptotically.

3.1. Evaluation of asymptotic approximations

We have conducted extensive simulations to evaluate the coverage properties of confidence bands and intervals based on the asymptotic approximations to our localized patient outcome distribution estimators. Specifically, we have considered Inline graphic and Inline graphic, with both unscaled and scaled estimators. For distributional shapes we considered a simple normal model, and more complex mixture distributions. Coverage of the localized cumulative distribution function parameter, Inline graphic, was assessed for both simultaneous confidence bands across all Inline graphic, and pointwise for select values of Inline graphic. In addition, we considered prediction for Inline graphic at the center and at the edge of the distribution of Inline graphic. In summary, our estimation of Inline graphic proved to be unbiased and our coverage was almost always within 1-2% of nominal levels, regardless of the outcome distribution as would be expected for non-parametric estimators. Accounting for estimation of scaling parameters also yielded nominal coverage levels. Detailed simulation settings and associated results may be found in supplementary material available at Biostatistics online.
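As a rough illustration of how coverage of a pointwise interval can be checked by simulation, the sketch below uses a much simpler setting than the one described above: iid standard normal outcomes, a single evaluation point, and a normal-approximation (Wald) interval for the cdf. The function names, sample size, and number of replicates are assumptions chosen for illustration and are not the authors' simulation settings.

```python
import math
import random


def wald_interval(successes, n, z=1.96):
    """Pointwise normal-approximation interval for F(y) from an empirical cdf value."""
    p_hat = successes / n
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half, p_hat + half


def simulate_coverage(n=200, y=0.0, reps=2000, seed=1):
    """Fraction of replicates whose interval covers the true F(y) for N(0, 1) outcomes."""
    rng = random.Random(seed)
    true_cdf = 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))
    hits = 0
    for _ in range(reps):
        # Number of simulated outcomes at or below y in this replicate.
        count = sum(1 for _ in range(n) if rng.gauss(0.0, 1.0) <= y)
        lower, upper = wald_interval(count, n)
        hits += lower <= true_cdf <= upper
    return hits / reps


coverage = simulate_coverage()
print(f"empirical coverage: {coverage:.3f}")
```

In this simplified setting the empirical coverage should sit close to the nominal 95% level; the localized estimator adds the neighborhood and scaling components of uncertainty that this sketch ignores.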

4. Implementation and illustration

4.1. Implementation using dynamic graphics

We have used the Shiny package in R (R Core Team, 2012; Chang and others, 2017) to implement our method and to allow the user to dynamically select multiple key parameters. Specifically, we query the user regarding the covariate characteristics of the index subject for whom a prediction is desired, and we require specification of the desired sample size. We then return the covariate restrictions that define the neighborhood and produce an estimated cumulative distribution function or a simple histogram to estimate the localized outcome distribution. We also provide a graph showing the neighborhood that was used and the relative density of observations for the entire data set, in order to inform the user of the characteristics of the neighborhood that supports the local prediction (see Figure 1 for an example discussed below). In addition, we automatically provide summary statistics for the outcome, such as the quartiles and mean of the empirical distribution, displaying both the full sample summary (overall) and the summary for the neighborhood of interest. By providing dynamic computational tools, we allow the user to easily evaluate the impact of changing the target sample size, or of changing the scaling methods used to construct the neighborhood. In particular, we allow choice of global linear scaling of the covariate axes, or local linear scaling to allow the relative importance of covariates to potentially change depending on the index point selected for the local prediction. In summary, our dynamic graphical tool allows simple web-based exploration of predictive distributions in a user-selected dataset and focuses on returning both the characteristics of the neighborhood that was used and the associated non-parametric full predictive distribution for the outcome of interest.
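The workflow described above (index subject and target sample size in, covariate restrictions and an empirical distribution out) can be sketched in a few lines. This is a minimal illustration, not the authors' R implementation: it assumes the neighborhood is formed from the m subjects nearest to the index point under a scaled sup-norm distance, which yields axis-parallel restrictions. The function names, the toy grid of covariates, and the example scale values are all hypothetical.

```python
import bisect


def axis_parallel_neighborhood(rows, index_point, m, scale=None):
    """Return indices of the m nearest rows and the axis-parallel box they span.

    Distance is a scaled sup-norm (an assumption made for this sketch):
    max_j |scale[j] * (row[j] - index_point[j])|. Ties at the boundary are
    broken by input order, so the box may contain a few unselected rows.
    """
    p = len(index_point)
    scale = scale or [1.0] * p

    def dist(row):
        return max(abs(scale[j] * (row[j] - index_point[j])) for j in range(p))

    order = sorted(range(len(rows)), key=lambda i: dist(rows[i]))
    chosen = order[:m]
    box = [
        (min(rows[i][j] for i in chosen), max(rows[i][j] for i in chosen))
        for j in range(p)
    ]
    return chosen, box


def ecdf(values):
    """Empirical cdf: returns a function y -> proportion of values <= y."""
    data = sorted(values)
    n = len(data)

    def F(y):
        return bisect.bisect_right(data, y) / n

    return F


# Toy data (hypothetical): a full grid of ages 65-85 and disability scores 0-24.
rows = [(age, dis) for age in range(65, 86) for dis in range(25)]
index_point = (75, 10)

_, unscaled_box = axis_parallel_neighborhood(rows, index_point, m=25)
_, scaled_box = axis_parallel_neighborhood(rows, index_point, m=33, scale=(0.12, 0.62))
print(unscaled_box)  # [(73, 77), (8, 12)]
print(scaled_box)    # [(70, 80), (9, 11)]

F = ecdf([0, 0, 10, 12])  # hypothetical outcomes for a tiny neighborhood
print(F(0), F(11))  # 0.5 0.75
```

With a larger weight on disability than on age, the box becomes narrow in disability and wide in age, which is the qualitative behavior the tool reports back to the user as covariate restrictions.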

Fig. 1.

In this example of our dynamic graphical tool, we display the outcome for a patient aged 75 who has a baseline disability of 10.

Our current implementation in R has a number of key features that facilitate adoption by others, and there are some select limitations that we hope to address in future releases. First, the localized patient outcome prediction (LPOP) tool easily allows a user-loaded dataset, and automatically populates each of the variable choice drop-down menus with the available variables. This feature allows the user to easily change the predictors of interest and to explore predictive distributions as a function of alternative variable sets. Furthermore, the choice of neighborhood size is a simple slider, and different sizes can be explored with the neighborhood and predictive distribution plots updated immediately. Finally, the tool allows the user to choose details of neighborhood construction, such as using no covariate scaling or either global or local scaling options. Although our current dynamic graphical implementation presents the neighborhood for only two covariates, this tool could easily be extended to additional covariates. Rather than showing the bivariate or multivariate neighborhood, we would present the marginal density of each covariate, and we would show the index subject and the associated neighborhood boundaries for each covariate. Because our neighborhood is defined by axis-parallel restrictions, these univariate displays would provide a sufficient description of the neighborhood. The code to implement our method is available on GitHub at https://github.com/aekosel/lpop.

4.2. Illustration: longitudinal cohort study and prognosis

Low back pain is the leading cause of years lost to disability according to the 2013 Global Burden of Disease (Vos and others, 2013). Contemporary treatment strategies now use select clinical tools such as the STarT Back Screening Tool to classify patients with low back pain into three risk groups based on key prognostic indicators (Hill and others, 2008). Using data from a recent large-scale longitudinal cohort study, we aim to provide physicians and patients with a personalized estimate of the patient’s expected disability outcome over time. Specifically, we use the Back pain Outcomes using Longitudinal Data (BOLD) study (Jarvik and others, 2012), which consists of approximately 5000 patients aged 65 or older who present with a complaint of back pain. Standard of care is given to patients in this observational cohort, and longitudinal measures of pain and function are recorded for a 2-year follow-up period. In this article, we focus on using baseline data to predict patient disability at 1 year after their index primary care visit.

For our example, we first select a patient who is 75 years old (age variable) and has a baseline disability (roland_0 variable) of 10, on a scale from 0 to 24 points, with a higher score indicating greater severity. We consider the patient’s disability at 1 year (roland_3) as our outcome. We specify a neighborhood size of m = 446, which is 10% of the total sample size. Figure 1 shows results for this subject and shows that the data-adaptive covariate neighborhood construction suggests that baseline disability is much more strongly related than age to disability at 1 year: the selected neighborhood is much narrower on the disability axis than the age axis when dynamically scaled. In comparison, if the covariates were not scaled, we would obtain a neighborhood with an age range of 71–79 and a baseline disability range of 6–14, in place of our scaled neighborhood ranging from 67 to 83 in age and 9 to 11 in baseline disability. Furthermore, we can compare data-adaptive (supervised) scaling to more traditional unsupervised covariate scaling strategies such as use of Mahalanobis distance. In this example, we find that the correlation between age and baseline disability is only r = 0.067, suggesting that Mahalanobis-based covariate ellipses would have minimal rotation. In addition, the standard deviation in age is 6.74 and the standard deviation for baseline disability is 6.36. Therefore, with Mahalanobis scaling one age unit is nearly equal to one baseline disability unit. However, when we use the 1-year disability outcome to construct semi-supervised neighborhoods, we find that the linear regression coefficient of age is 0.12 while the linear regression coefficient for baseline disability is 0.62. Therefore, a one-unit change in baseline disability is associated with a roughly five-fold greater impact on the mean outcome than a one-unit change in age. Consequently, our semi-supervised neighborhood is much tighter in the tolerance for baseline disability as compared with age, and such a scaling trade-off is not captured via typical unsupervised scaling approaches such as Mahalanobis distance.
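The contrast between the two scaling strategies can be checked with the numbers quoted above. The short calculation below uses only the reported standard deviations and regression coefficients; the variable names are illustrative, and the near-zero correlation is ignored so that unsupervised scaling reduces to dividing each covariate by its standard deviation.

```python
sd_age, sd_disability = 6.74, 6.36        # standard deviations reported in the text
beta_age, beta_disability = 0.12, 0.62    # regression coefficients reported in the text

# Unsupervised (standard-deviation) scaling weights each covariate by 1 / SD.
sd_weight_ratio = (1 / sd_disability) / (1 / sd_age)

# Supervised scaling weights each covariate by its regression coefficient.
beta_weight_ratio = beta_disability / beta_age

print(round(sd_weight_ratio, 2))    # 1.06
print(round(beta_weight_ratio, 2))  # 5.17
```

Under standard-deviation scaling a disability unit counts about 1.06 times as much as an age unit, whereas under outcome-based scaling it counts about 5.17 times as much, which is why the supervised box is so much narrower on the disability axis.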

In the globally scaled neighborhood, which is what we would use in practice, we see clear value in providing the complete predictive distribution. Importantly, we observe a large number of patients who recover completely (i.e. return to a disability score of 0) after 1 year, and note that the remainder of the patients form a roughly normal distribution around the chosen patient’s original disability score. In cases such as this, with a bimodal outcome distribution, a single summary measure such as the mean or median cannot adequately inform a clinician or patient’s expectations. It is important to be able to clearly visualize the space of possible outcomes.

Our non-parametric methodology directly provides distributional estimates, Inline graphic, but can also provide measures of uncertainty for standard summaries of this predictive distribution. For example, with our index subject characterized by Inline graphic(age = 75, baseline disability = 10), the estimated predictive 25th percentile has a 95% confidence interval of (3.09, 5.67), while the 50th and 75th percentiles have 95% confidence intervals of (8.38, 10.14) and (12.37, 13.93), respectively. Herein, we use our asymptotic results establishing normality and then adopt bootstrap methods for computational convenience. Furthermore, if other predictive summaries are of interest, such as the predictive mean or prediction error, then these can also be directly derived from the localized predictive distribution estimate.
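One plausible reading of "asymptotic normality plus bootstrap" is a normal-approximation interval whose standard error comes from bootstrap resampling of the neighborhood outcomes. The sketch below follows that reading as an assumption; the function names, the choice to center the interval at the bootstrap mean, and the simulated bimodal outcomes are all hypothetical and are not taken from the authors' code or data.

```python
import math
import random
import statistics


def empirical_quantile(values, q):
    """Smallest observed value whose empirical cdf is at least q."""
    data = sorted(values)
    k = max(0, math.ceil(q * len(data)) - 1)
    return data[k]


def bootstrap_normal_ci(values, q, reps=1000, z=1.96, seed=0):
    """Bootstrap mean, bootstrap SE, and mean +/- z * SE for the q-quantile."""
    rng = random.Random(seed)
    n = len(values)
    # Resample the neighborhood outcomes with replacement and recompute the quantile.
    boot = [empirical_quantile(rng.choices(values, k=n), q) for _ in range(reps)]
    center = statistics.mean(boot)
    se = statistics.stdev(boot)
    return center, se, (center - z * se, center + z * se)


# Hypothetical neighborhood outcomes: a spike at 0 plus a roughly normal hump near 10.
rng = random.Random(7)
outcomes = [0] * 40 + [min(24, max(0, round(rng.gauss(10, 4)))) for _ in range(160)]

center, se, (lower, upper) = bootstrap_normal_ci(outcomes, 0.50)
print(f"median: bootstrap mean {center:.2f}, SE {se:.2f}, 95% CI ({lower:.2f}, {upper:.2f})")
```

Because the outcome is an integer score, the bootstrap quantiles take only a few distinct values; a percentile-based interval would be a reasonable alternative design choice.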

We can also examine slightly different patients, as seen in Figure 2. When we consider a patient with a higher baseline disability score (15 instead of 10), we see that the neighborhood remains approximately the same size (volume) and that the outcome maintains its bimodal shape, but the second peak is shifted to the right, indicating greater disability. We can also examine a patient on the edge of the data; for example, a patient who still has a baseline disability score of 10, but who is age 95 instead of 75. When keeping the sample size at 200, similar to earlier estimates, we now obtain a much larger volume neighborhood, though our outcome distribution still has the same bimodal shape. Instead of an age range of 16 years, we obtain a range of 19 years; in place of a disability range of 2 points, we obtain a range of 6 points. Note also the marked asymmetry of the neighborhood. There simply do not exist patients with a similar disability score who are older than the chosen patient, and so the box cannot extend in that direction. We feel that these neighborhood characteristics are not deficiencies of our approach; rather, they are features of our method, since the neighborhood information provides important disclosure regarding the inherent limitations of the data. While nearly all local statistical methods have difficulty on the boundary of the data, with our approach the user is clearly informed when the data about a given patient are sparse and when caution in interpretation is warranted.

Fig. 2.

In this example of our method, we display the change in neighborhood and the change in outcome distribution when a new patient is selected. We begin with a patient who has a baseline disability of 10 and an age of 75, then consider a patient of the same age with a baseline disability of 15, and then consider again a patient with a baseline disability of 10 but this time an age of 95. In all three cases we see a bimodal outcome distribution, though the neighborhood for the patient on the edge of the data is substantially larger than the other two neighborhoods.

Finally, with BOLD we consider application of localized estimation with multiple predictive covariates. One strategy is to derive a meaningful summary score and then to use that variable as a component of neighborhood construction. To illustrate the approach, we consider use of baseline disability as the primary predictor, and then combine age with both the baseline back pain numerical rating and the leg pain numerical rating into a derived variable. Using age and the pain scores, we can use regression to derive a linear combination that predicts 1-year disability. In Figure 3, we show results for a subject with a baseline disability equal to 15 and an age and pain derived predicted mean outcome of 12. Careful choice of derived dimension reduction can retain patient-centered meaning, and here we would present the estimate as representing the 1-year outcomes for “200 patients similar to you using your baseline disability score of 15, and your predicted mean outcome of 12 based on age and baseline back and leg pain ratings.”
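The derived-score idea can be sketched as an ordinary least-squares fit followed by a prediction for the index subject. The code below is a minimal illustration under stated assumptions: the data are simulated without noise from made-up coefficients, the function names are hypothetical, and the normal equations are solved directly only to keep the example free of external libraries.

```python
import random


def fit_ols(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + list(x) for x in X]
    k = len(rows[0])
    # Augmented system [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def derived_score(coefs, x):
    """Predicted mean outcome for covariates x: the one-dimensional summary score."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], x))


# Hypothetical data: (age, back pain rating, leg pain rating) and a noise-free outcome.
rng = random.Random(3)
X = [(rng.uniform(65, 95), rng.randint(0, 10), rng.randint(0, 10)) for _ in range(60)]
y = [2.0 + 0.1 * age + 0.5 * back + 0.3 * leg for age, back, leg in X]

coefs = fit_ols(X, y)
score = derived_score(coefs, (75, 6, 4))
print([round(c, 3) for c in coefs], round(score, 2))
```

The resulting score would then be paired with baseline disability as the second axis when the neighborhood is constructed, so the neighborhood stays two-dimensional even though four baseline variables inform it.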

Fig. 3.

In this example of our dynamic graphical tool, we display the estimated 1-year disability outcome distribution using two predictors: baseline disability; and a linear combination of other prognostic variables including age, baseline back pain numerical rating scale, and leg pain numerical rating. Herein, we address the potential for use of multiple predictors through supervised dimension reduction where an outcome mean regression is used to derive a predicted score based on a subset of covariates. The patient-centered interpretation is that we present estimates “for 200 patients similar to you using your baseline disability score of 15, and your predicted mean outcome of 12 based on age and baseline back and leg pain ratings.”

4.3. Illustration: comparison with regression quantile methods

Lastly, we use the BOLD data to compare our localized strategy to alternative parametric strategies. Specifically, in Table 2 we compare our LPOP quantile estimates using neighborhoods of m = 200 and m = 400 to quantile estimates based on flexible parametric regression using M-estimator methods implemented in the R package quantreg (Koenker, 2005). We separately estimate quantile regression functions for the 25th, 50th, and 75th percentiles. In order to use regression quantile methods with continuous predictors, we used linear splines with knots at the three quartiles for both age and baseline disability. An additive model therefore uses 9 total parameters, and a more flexible bivariate predictive surface approach includes the additional interactions between the two predictors for a total of 25 parameters. Table 2 captures key expected trade-offs between non-parametric and regression-based parametric approaches. First, the LPOP estimates with m = 200 generally have the largest standard error since they are non-parametric and local. Second, there is generally good agreement between the flexible parametric estimates and the LPOP quantile estimates. The key exception to these patterns is the set of estimates for an individual at the extreme of the covariate distribution, such as the subject with age = 95. For this index individual, we find large uncertainty associated with the regression estimates, and this is contrasted with the clear neighborhood transfer that the LPOP methods intentionally disclose (Figure 2). Finally, comparison with regression strategies highlights the intrinsic challenge of distributional estimation with moderate dimension predictors, since flexible multivariate surfaces may generate high-dimensional bases.
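The parameter counts quoted above (9 for the additive model, 25 with interactions) follow from the size of a linear spline basis with three knots per predictor. The sketch below builds such design rows and counts their columns; the knot values and function names are hypothetical placeholders, and the interaction model is assumed to be the full tensor product of the two marginal bases.

```python
def linear_spline_basis(x, knots):
    """Continuous piecewise-linear basis: x itself plus one hinge term per knot."""
    return [x] + [max(0.0, x - k) for k in knots]


def additive_row(age, disability, age_knots, dis_knots):
    """Intercept + age spline terms + disability spline terms."""
    return (
        [1.0]
        + linear_spline_basis(age, age_knots)
        + linear_spline_basis(disability, dis_knots)
    )


def interaction_row(age, disability, age_knots, dis_knots):
    """Tensor product of the two marginal bases (each including a constant)."""
    a = [1.0] + linear_spline_basis(age, age_knots)
    d = [1.0] + linear_spline_basis(disability, dis_knots)
    return [ai * dj for ai in a for dj in d]


# Hypothetical quartile knots; the actual BOLD quartiles are not given in the text.
age_knots = (70, 74, 79)
dis_knots = (5, 10, 15)

n_additive = len(additive_row(75, 10, age_knots, dis_knots))
n_interaction = len(interaction_row(75, 10, age_knots, dis_knots))
print(n_additive, n_interaction)  # 9 25
```

Each predictor contributes 4 spline terms, so the additive model has 1 + 4 + 4 = 9 columns and the tensor-product surface has 5 × 5 = 25, and this count grows multiplicatively with every additional predictor.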

Table 2.

Analysis results for n = 4459 subjects from the BOLD cohort study

| Percentile | Method | Subject (75, 10): Estimate (SE) | 95% CI lower | 95% CI upper | Subject (75, 15): Estimate (SE) | 95% CI lower | 95% CI upper | Subject (95, 10): Estimate (SE) | 95% CI lower | 95% CI upper |
|---|---|---|---|---|---|---|---|---|---|---|
| 25th | LPOP (m = 200) | 5 / 5.32 (1.05) | 3.25 | 7.40 | 7 / 7.06 (0.93) | 5.24 | 8.87 | 5 / 5.52 (0.78) | 3.99 | 7.06 |
| 25th | LPOP (m = 400) | 5 / 4.55 (0.72) | 3.13 | 5.97 | 6 / 6.44 (0.81) | 4.84 | 8.04 | 5 / 5.07 (0.49) | 4.12 | 6.02 |
| 25th | RQ additive | 4.40 (0.50) | 3.42 | 5.38 | 7.65 (0.64) | 6.39 | 8.91 | 4.40 (0.50) | 3.42 | 5.38 |
| 25th | RQ interaction | 5.52 (0.56) | 4.28 | 6.50 | 8.24 (0.76) | 6.79 | 9.75 | 3.33 (2.79) | −1.69 | 9.28 |
| 50th | LPOP (m = 200) | 10 / 9.53 (0.67) | 8.21 | 10.85 | 12 / 12.19 (0.66) | 10.89 | 13.49 | 10 / 9.94 (0.68) | 8.61 | 11.28 |
| 50th | LPOP (m = 400) | 9 / 9.40 (0.50) | 8.42 | 10.37 | 12 / 12.31 (0.52) | 11.28 | 13.34 | 10 / 9.77 (0.47) | 8.84 | 10.70 |
| 50th | RQ additive | 9.17 (0.28) | 8.61 | 9.72 | 12.50 (0.34) | 11.84 | 13.16 | 9.67 (0.61) | 8.46 | 10.87 |
| 50th | RQ interaction | 10.06 (0.38) | 9.30 | 10.81 | 12.82 (0.42) | 11.99 | 13.54 | 7.84 (1.83) | 4.25 | 11.45 |
| 75th | LPOP (m = 200) | 13 / 13.25 (0.61) | 12.05 | 14.45 | 16 / 16.18 (0.58) | 15.03 | 17.32 | 13 / 13.13 (0.49) | 12.18 | 14.09 |
| 75th | LPOP (m = 400) | 13 / 13.31 (0.48) | 12.38 | 14.24 | 16 / 16.35 (0.48) | 15.41 | 17.30 | 13 / 13.04 (0.27) | 12.51 | 13.57 |
| 75th | RQ additive | 12.81 (0.27) | 12.29 | 13.33 | 16.68 (0.25) | 16.19 | 17.18 | 13.40 (0.54) | 12.34 | 14.46 |
| 75th | RQ interaction | 13.35 (0.38) | 12.61 | 14.09 | 16.38 (0.36) | 15.67 | 17.10 | 12.70 (1.66) | 9.46 | 15.95 |

We show estimation results for select quantiles (25th, 50th, and 75th) and compare both our proposed method, labeled LPOP, and regression quantile methods, labeled RQ. We provide details for three subjects characterized by (age, baseline disability) as shown in Figure 2. For LPOP, we consider neighborhoods of size m = 200 (≈ 5%) and m = 400 (≈ 10%), while for RQ, we consider models that use linear splines for both age and baseline disability, either adopting an additive structure or including interactions between age and baseline disability to allow more parametric flexibility in order to compare to the non-parametric LPOP estimates. For each scenario, we present the point estimate, SE estimate, and 95% CI. For each method the quantile estimate is presented first; for the LPOP estimates it is followed, after the slash, by the mean estimate from 1000 bootstrap replicates.

5. Discussion

By focusing on semi-supervised axis-parallel neighborhoods, our proposed methods provide an easily calculated estimate of an individual patient’s prognosis that is based on a subgroup of subjects chosen to be interpretable by both clinicians and patients. The increasing amount of EHR data that is becoming accessible to inform clinical predictions will allow patients and providers to obtain desired levels of precision despite the non-parametric nature of our method. We recognize that, when analyzing biomedical data, ensuring adequate access control to protected health information is an important concern and therefore would propose that only those who already have access to the database be given access to the tool. The major limitation of our work is that it is suitable for low-dimensional problems only. Due to the sparsity of data as dimension increases, neighborhoods quickly become so large as to be useless. However, marginal scaling and clinician guidance allow us to narrow down the set of candidate predictors.

In the future, we hope to extend our work to comparative prediction, so that patients may obtain an estimate of their prognosis under a variety of alternative treatment options. In addition, we recognize that patients often have multiple longitudinal measurements rather than a single cross-sectional measurement. Therefore, creating methods that can include individual patient histories or trajectories as input predictive data will be an important extension to consider. The inclusion of patient trajectories would require appropriate low dimensional summary measures for data that may be irregularly measured over time. Finally, although we focus on quantitative outcome measures the core methods can be easily extended to survival endpoints.

Supplementary Material

kxy015_Supplementary_Data

Acknowledgments

Conflict of Interest: None declared.

Appendix

A. Mathematical details

A.1. Construction of empirical processes

Suppose Inline graphic. Then their empirical distribution function (edf) is

graphic file with name M242.gif (A.1)

where we note that Inline graphic and hence

graphic file with name M244.gif (A.2)
graphic file with name M245.gif (A.3)

The uniform empirical process is then

graphic file with name M246.gif (A.4)

We let Inline graphic be a Brownian Bridge and, by the Central Limit Theorem,

graphic file with name M248.gif (A.5)

Now suppose we have Inline graphic for some distribution function Inline graphic. Then their edf is

graphic file with name M251.gif (A.6)

and their empirical process is Inline graphic. If we define Inline graphic for Inline graphic, where Inline graphic is as above, then Inline graphic and we have

graphic file with name M257.gif (A.7)
graphic file with name M258.gif (A.8)
graphic file with name M259.gif (A.9)

and, noting that Inline graphic,

graphic file with name M261.gif (A.10)

Therefore,

graphic file with name M262.gif (A.11)

Because Inline graphic has a zero mean and known variance, we can use these results to determine how close the empirical cdf should be to the true cdf (i.e. the confidence interval associated with the cdf is based on the variance of Inline graphic).
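The "zero mean and known variance" property can be checked numerically in the simplest case. For iid Uniform(0, 1) data, the empirical process evaluated at a point t has mean zero and variance t(1 − t), matching the Brownian bridge limit. The simulation below is only a sanity check; the symbols t and G_n and the chosen sample sizes are labels and settings introduced for this sketch.

```python
import math
import random
import statistics


def empirical_process_at(t, n, rng):
    """sqrt(n) * (G_n(t) - t), where G_n is the edf of n iid Uniform(0, 1) draws."""
    count = sum(1 for _ in range(n) if rng.random() <= t)
    return math.sqrt(n) * (count / n - t)


rng = random.Random(42)
t, n, reps = 0.3, 100, 5000
draws = [empirical_process_at(t, n, rng) for _ in range(reps)]

sim_mean = statistics.mean(draws)
sim_var = statistics.pvariance(draws)
print(f"mean {sim_mean:.3f} (theory 0), variance {sim_var:.3f} (theory {t * (1 - t):.3f})")
```

A pointwise interval for the cdf at t then follows from this variance, while simultaneous bands rely on the distribution of the supremum of the limiting process.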

Suppose we have independent but not identically distributed random variables, i.e. Inline graphic where Inline graphic for Inline graphic. Then our definition of Inline graphic remains the same, but we introduce the average df

graphic file with name M269.gif (A.12)

and rewrite our empirical process as

graphic file with name M270.gif (A.13)

where, letting Inline graphic,

graphic file with name M272.gif (A.14)

Letting Inline graphic, we define Inline graphic. That Inline graphic for some Inline graphic is necessary and sufficient for Inline graphic where Inline graphic is a normal process with zero mean and covariance function Inline graphic. Note that, for all Inline graphic, Inline graphic (Shorack and Wellner, 2009).

A.2. Uncertainty in neighborhood location

For the uncertainty in neighborhood location, we turn to the Inline graphic-method (Casella and Berger, 2002). We assume that Inline graphic is differentiable with respect to Inline graphic and that its second derivative with respect to Inline graphic is bounded. We then use a Taylor expansion (Casella and Berger, 2002) to obtain

graphic file with name M286.gif (A.15)

where Inline graphic and hence determine that the asymptotic variance of our expression is

graphic file with name M288.gif (A.16)

where Inline graphic is the asymptotic variance of Inline graphic. Because Inline graphic is simply a quantile of the distance, we may use the formula for the asymptotic variance of a sample quantile (Shorack and Wellner, 2009) to obtain

graphic file with name M292.gif (A.17)
graphic file with name M293.gif (A.18)

where the second equality is due to the fact that Inline graphic. Combining the above results, our variance is

graphic file with name M295.gif (A.19)
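The sample-quantile variance result invoked above is the standard one: for a density f that is positive at the p-th quantile q, the sample quantile has asymptotic variance p(1 − p) / (n f(q)²). The sketch below checks this numerically for the median of standard normal data, where the formula reduces to π / (2n). The function name and simulation settings are assumptions made for illustration and do not involve the distance distribution used in the paper.

```python
import math
import random
import statistics


def asymptotic_quantile_variance(p, density_at_quantile, n):
    """Standard large-sample variance of the p-th sample quantile."""
    return p * (1.0 - p) / (n * density_at_quantile ** 2)


rng = random.Random(11)
n, reps = 201, 3000
medians = [
    statistics.median([rng.gauss(0.0, 1.0) for _ in range(n)]) for _ in range(reps)
]

simulated_var = statistics.pvariance(medians)
# Standard normal density at its median (0) is 1 / sqrt(2 * pi).
theory_var = asymptotic_quantile_variance(0.5, 1.0 / math.sqrt(2.0 * math.pi), n)
print(f"simulated {simulated_var:.5f}, asymptotic {theory_var:.5f}")
```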

Because the uncertainty due to location is orthogonal to the uncertainty in outcome, we may add the estimation error from each term to obtain the overall estimation error in the fixed percentage scenario. Note that this term is Inline graphic while the uncertainty in outcome is Inline graphic; taking advantage of the fact that Inline graphic, we multiply our variance by Inline graphic (i.e. multiply our entire expression by Inline graphic) to obtain a distribution of

graphic file with name M301.gif (A.20)

A.3. Uncertainty due to estimation of parameters

Scaling parameter estimation introduces an extra component to the estimation error. We turn to van der Vaart and Wellner (2007) to evaluate this component and we obtain, where Inline graphic is the covariance matrix of the scaling parameters,

graphic file with name M303.gif (A.21)

For the uncertainty due to estimation of functions of Inline graphic (i.e. Inline graphic), we turn to the same results to obtain, where Inline graphic is the covariance matrix of the functions,

graphic file with name M307.gif (A.22)

Note that these terms are correlated with the uncertainty in outcome and hence consideration of the covariance between the terms is also necessary.

A.4. Smoothing scenario

In the smoothing scenario, where we consider Inline graphic as Inline graphic, we turn to results from Stüte (1986a) to handle the uncertainty in outcome (Expression 3.2); see Appendix A.5 for a more detailed explanation of the mathematics. Assuming that Inline graphic, we obtain

graphic file with name M311.gif (A.23)

where Inline graphic is a centered Gaussian process with covariance

graphic file with name M313.gif (A.24)

The uncertainty in neighborhood is in this case of order Inline graphic and hence for this term we rewrite Inline graphic as Inline graphic. Because Inline graphic and

graphic file with name M318.gif (A.25)

converges to a finite distribution, the entire term is asymptotically negligible and we obtain

graphic file with name M319.gif (A.26)

which, because Inline graphic, is equivalent to

graphic file with name M321.gif (A.27)

A.5. Uncertainty in outcome: a detailed explanation

When evaluating the asymptotic distribution of our estimator, we need to consider the variability in outcome using results from Stüte (1986a), who shows that the result is a Gaussian process. He first proves the result pointwise and then extends his results to a curve. For ease of explanation, the proof is presented for univariate Inline graphic, though the results are easily extended to multivariate Inline graphic. We let Inline graphic be independent random observations with the same distribution function Inline graphic as Inline graphic. Suppose Inline graphic and let Inline graphic be a sequence of bandwidths converging to zero such that Inline graphic as Inline graphic. We define

graphic file with name M331.gif (A.28)

and

graphic file with name M332.gif (A.29)

where Inline graphic is a one-sided kernel. Our initial goal is to prove that, where

graphic file with name M334.gif (A.30)

we obtain

graphic file with name M335.gif (A.31)

for Inline graphic-almost all Inline graphic. In other words, we prove our desired result for any given point. We will subsequently expand our proof to the curve as a whole. Our first step is to obtain, via a Taylor expansion,

graphic file with name M338.gif (A.32)
graphic file with name M339.gif (A.33)
graphic file with name M340.gif (A.34)
graphic file with name M341.gif (A.35)

where Inline graphic is between Inline graphic and Inline graphic. We will prove that Inline graphic is negligible and rewrite Inline graphic in a different form that is asymptotically equivalent. These two changes will allow us to rewrite Inline graphic as a whole in such a way that we can prove our theorem. We begin by showing that

graphic file with name M348.gif (A.36)

as Inline graphic. Because Inline graphic vanishes outside of some finite interval, our expansion of Inline graphic holds true with integration restricted to those Inline graphic for which Inline graphic. We know that Inline graphic is bounded and that Inline graphic with probability one. By previous results from Stute, we also know that, over the values of Inline graphic in question, Inline graphic is stochastically bounded as Inline graphic. We combine these facts to obtain our desired result.

Our next (and more involved) step is to rewrite Inline graphic into a more tractable expression—its asymptotic equivalent,

graphic file with name M360.gif (A.37)

We will move from Inline graphic to Inline graphic, then from Inline graphic to Inline graphic, then from Inline graphic to Inline graphic, and finally move from the derivative of the kernel to the kernel itself. We begin by defining

graphic file with name M367.gif (A.38)

where Inline graphic denotes the empirical process pertaining to Inline graphic. We let Inline graphic be the Inline graphic-field generated by the Inline graphic-data. Upon proving that Inline graphic as Inline graphic, we obtain

graphic file with name M375.gif (A.39)
graphic file with name M376.gif (A.40)
graphic file with name M377.gif (A.41)
graphic file with name M378.gif (A.42)

Our next goal is to move from Inline graphic to Inline graphic. We define

graphic file with name M381.gif (A.43)

with corresponding von Mises statistic

graphic file with name M382.gif (A.44)
graphic file with name M383.gif (A.45)

By results from previous works, Inline graphic as Inline graphic and hence

graphic file with name M386.gif (A.46)
graphic file with name M387.gif (A.47)

because Inline graphic is bounded and Inline graphic. Thus,

graphic file with name M390.gif (A.48)
graphic file with name M391.gif (A.49)
graphic file with name M392.gif (A.50)

where we move from the second to the third equality by using the previous result. We then integrate the quantity in the last equality with respect to Inline graphic to obtain

graphic file with name M394.gif (A.51)

We will now move from Inline graphic to Inline graphic. By results from previous works,

graphic file with name M397.gif (A.52)

as Inline graphic and hence

graphic file with name M399.gif (A.53)
graphic file with name M400.gif (A.54)
graphic file with name M401.gif (A.55)
graphic file with name M402.gif (A.56)

We then move on to our final goal of converting the derivative of the kernel to the kernel itself. We first note that, because the kernel vanishes outside a bounded region, Inline graphic. Therefore, recalling that Inline graphic and Inline graphic are constant with respect to Inline graphic and can be extracted from the integral,

graphic file with name M407.gif (A.57)
graphic file with name M408.gif (A.58)

We then apply integration by parts, i.e. Inline graphic. We assign Inline graphic and Inline graphic. Therefore,

graphic file with name M412.gif (A.59)
graphic file with name M413.gif (A.60)

because Inline graphic when evaluated at the limits of integration (here, the lower and upper bounds of the region outside which the kernel vanishes), due to the fact that the empirical distribution of Inline graphic and Inline graphic itself are equivalent at Inline graphic and Inline graphic. Thus, only the Inline graphic term remains, and we obtain

graphic file with name M420.gif (A.61)

as desired. We are now in a position to substitute Inline graphic and our rewritten Inline graphic for Inline graphic in the left-hand side of our theorem to obtain Inline graphic, which we prove converges in distribution to the desired quantity. Observe that

graphic file with name M425.gif (A.62)
graphic file with name M426.gif (A.63)
graphic file with name M427.gif (A.64)
graphic file with name M428.gif (A.65)
graphic file with name M429.gif (A.66)
graphic file with name M430.gif (A.67)
graphic file with name M431.gif (A.68)

We then note that, for each Inline graphic, Inline graphic is a standardized sum of i.i.d. random variables with (using Inline graphic)

graphic file with name M435.gif (A.69)
graphic file with name M436.gif (A.70)
graphic file with name M437.gif (A.71)
graphic file with name M438.gif (A.72)

where Inline graphic. We then use results from previous works to show that, asymptotically, we can substitute Inline graphic for Inline graphic and that the second term is negligible, and hence Inline graphic for Inline graphic-almost all Inline graphic. Thus, it suffices to show that the array defining the Inline graphic’s satisfies the Lindeberg condition for Inline graphic-almost all Inline graphic; in other words, to prove that, as Inline graphic, and for all Inline graphic

graphic file with name M450.gif (A.73)

Because Inline graphic, the above will follow if, with Inline graphic and

graphic file with name M453.gif (A.74)

the following expression can be made arbitrarily small by choosing Inline graphic large enough:

graphic file with name M455.gif (A.75)

From standard results in differentiation theory, the above is equivalent to Inline graphic for Inline graphic-almost all Inline graphic, and hence may be made small by letting Inline graphic. In order to guarantee that Inline graphic is equicontinuous in a neighborhood about Inline graphic, and thus that the standardized process Inline graphic has continuous sample paths, we assume that

graphic file with name M463.gif (A.76)

To derive distributional results for Inline graphic, we must also assume that Inline graphic has uniform marginals. Then for Lebesgue-almost all Inline graphic

graphic file with name M467.gif (A.77)

Here Inline graphic is a centered Gaussian process on Inline graphic with continuous sample paths vanishing at the lower boundary of Inline graphic and covariance

graphic file with name M471.gif (A.78)
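As a concrete illustration of the Lindeberg condition invoked in (A.73), the sketch below evaluates the Lindeberg sum exactly for a much simpler triangular array than the kernel-weighted one in the proof. It assumes standardized Bernoulli summands; the function name `lindeberg_sum` and the parameters `p` and `eps` are hypothetical choices made for this example only.

```python
import math

def lindeberg_sum(n, p, eps):
    """Exact Lindeberg sum for a row of n i.i.d. summands
    xi = (Y - p) / sqrt(n * p * (1 - p)), with Y ~ Bernoulli(p).

    Returns n * E[xi^2 * 1{|xi| > eps}].
    """
    scale = math.sqrt(n * p * (1.0 - p))
    # xi takes two values: (1 - p)/scale with prob. p, and -p/scale with prob. 1 - p.
    outcomes = (((1.0 - p) / scale, p), (-p / scale, 1.0 - p))
    total = 0.0
    for value, prob in outcomes:
        if abs(value) > eps:        # only large summands contribute
            total += prob * value**2
    return n * total

# The sum shrinks to zero as the row length n grows (assumed p = 0.3, eps = 0.1).
for n in (10, 100, 1000, 10000):
    print(n, lindeberg_sum(n, 0.3, 0.1))
```

Because each summand is bounded by a constant divided by the square root of n, every summand eventually falls below any fixed threshold and the sum becomes exactly zero. The argument in the proof is more delicate, since the summands there depend on the kernel and the bandwidth.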

Funding

This research was partially supported by the National Institutes of Health (R01 HL072966 and UL1 TR002319).

References

  1. Casella G. and Berger R. L. (2002). Statistical Inference, Volume 2. Pacific Grove, CA: Duxbury.
  2. Chang W., Cheng J., Allaire J. J., Xie Y. and McPherson J. (2017). shiny: Web Application Framework for R. R package version 1.0.5. https://CRAN.R-project.org/package=shiny.
  3. D’Agostino R. B., Vasan R. S., Pencina M. J., Wolf P. A., Cobain M., Massaro J. M. and Kannel W. B. (2008). General cardiovascular risk profile for use in primary care: the Framingham Heart Study. Circulation 117, 743–753.
  4. Dickson E. R., Grambsch P. M., Fleming T. R., Fisher L. D. and Langworthy A. (1989). Prognosis in primary biliary cirrhosis: model for decision making. Hepatology 10, 1–7.
  5. Egan T. M., Murray S., Bustami R. T., Shearon T. H., McCullough K. P., Edwards L. B., Coke M. A., Garrity E. R., Sweet S. C., Heiney D. A. and others (2006). Development of the new lung allocation system in the United States. American Journal of Transplantation 6, 1212–1227.
  6. Hastie T., Tibshirani R. and Friedman J. H. (2009). The Elements of Statistical Learning. New York: Springer.
  7. Hill J. C., Dunn K. M., Lewis M., Mullis R., Main C. J., Foster N. E. and Hay E. M. (2008). A primary care back pain screening tool: identifying patient subgroups for initial treatment. Arthritis Care & Research 59, 632–641.
  8. Ibrahim K. A., Paneth N., LaGamma E. and Reed P. L. (2009). Clinician opinion to design clinical trials that change standards-of-cares. Pediatric Research.
  9. Jarvik J. G., Comstock B. A., Bresnahan B. W., Nedeljkovic S., Nerenz D. R., Bauer Z., Avins A. L., James K., Turner J. A., Heagerty P. J. and others (2012). Study protocol: The back pain outcomes using longitudinal data (BOLD) registry. BMC Musculoskeletal Disorders 13, 64.
  10. Jha A. K. (2010). Meaningful use of electronic health records: the road ahead. JAMA 304, 1709–1710.
  11. Kamath P. S., Wiesner R. H., Malinchoc M., Kremers W., Therneau T. M., Kosberg C. L., D’Amico G., Dickson E. R. and Kim W. R. (2001). A model to predict survival in patients with end-stage liver disease. Hepatology 33, 464–470.
  12. Keiding N. and Louis T. A. (2016). Perils and potentials of self-selected entry to epidemiological studies and surveys. Journal of the Royal Statistical Society: Series A (Statistics in Society) 179, 319–376.
  13. Koenker R. (2005). Quantile Regression, Number 38. Cambridge, England: Cambridge University Press.
  14. Koenker R. and Bassett G. (1978). Regression quantiles. Econometrica: Journal of the Econometric Society 46, 33–50.
  15. Levy W. C., Mozaffarian D., Linker D. T., Sutradhar S. C., Anker S. D., Cropp A. B., Anand I., Maggioni A., Burton P., Sullivan M. D. and others (2006). The Seattle Heart Failure Model: prediction of survival in heart failure. Circulation 113, 1424–1433.
  16. Petty R. E. and Cacioppo J. T. (1986). The Elaboration Likelihood Model of Persuasion. New York, NY: Springer.
  17. R Core Team. (2012). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.
  18. Segal M. and Kedem K. (1998). Geometric applications of posets. Computational Geometry 11, 143–156.
  19. Shorack G. R. and Wellner J. A. (2009). Empirical Processes with Applications to Statistics, Volume 59. Philadelphia, PA: SIAM.
  20. Skiena S. S. (2009). The Algorithm Design Manual. New York, NY: Springer.
  21. Stüte W. (1986a). Conditional empirical processes. The Annals of Statistics 14, 638–647.
  22. Stüte W. (1986b). On almost sure convergence of conditional empirical distribution functions. The Annals of Probability 14, 891–901.
  23. van der Vaart A. W. and Wellner J. A. (2007). Empirical processes indexed by estimated functions. Lecture Notes-Monograph Series, 234–252.
  24. Vos T., Flaxman A. D., Naghavi M., Lozano R., Michaud C., Ezzati M., Shibuya K., Salomon J. A., Abdalla S., Aboyans V. and others (2013). Years lived with disability (YLDs) for 1160 sequelae of 289 diseases and injuries 1990–2010: a systematic analysis for the Global Burden of Disease Study 2010. The Lancet 380, 2163–2196.

Associated Data

Supplementary Materials

kxy015_Supplementary_Data

Articles from Biostatistics (Oxford, England) are provided here courtesy of Oxford University Press