Abstract
Cook (1977) proposed a diagnostic to quantify the impact of deleting an observation on the estimated regression coefficients of a General Linear Univariate Model (GLUM). Simulations of models with Gaussian response and predictors demonstrate that his suggestion of comparing the diagnostic to the median of the F for overall regression captures an erratically varying proportion of the values.
We describe the exact distribution of Cook’s statistic for a GLUM with Gaussian predictors and response. We also present computational forms, simple approximations, and asymptotic results. A simulation supports the accuracy of the results. The methods allow accurate evaluation of a single value or the maximum value from a regression analysis. The approximations work well for a single value, but less well for the maximum. In contrast, the cut-point suggested by Cook provides widely varying tail probabilities. As with all diagnostics, the data analyst must use scientific judgment in deciding how to treat highlighted observations.
Keywords: regression diagnostics, influence, residual analysis
1. INTRODUCTION
1.1 Motivation
A wide variety of applications in the medical, social, and physical sciences use regression models with continuous predictors. Often the predictors may plausibly be assumed to follow a multivariate Gaussian distribution. For example, a paleontologist may wish to model total skeleton length of fossils of a particular species, as a function of sizes for a limited number of bones. Many diagnostics have been suggested to aid in evaluating the validity of such models.
Most research in regression diagnostics has centered on the impact of deleting a single observation, with many different measures suggested. Cook (1977) recommended evaluating the standardized shift in the vector of estimated regression coefficients. He suggested comparing the statistic to the median of the F statistic for the test of all coefficients equal to zero. Such highlighted observations merit further examination in terms of their credibility and also their implications for validity of the model assumptions.
Belsley, Kuh, and Welsch (1980, p28) and Cook and Weisberg (1982, p114) discussed two alternatives for judging diagnostic statistics. Internal scaling involves judging a value with respect to the distribution in the sample at hand. External scaling involves judging a value with respect to the distribution that might occur over repeated samples. Both principles have merit in data analysis.
A standard approach for a diagnostic with known sampling distribution, such as studentized residuals, involves three steps. First, highlight observations by reference to the sampling distribution. Second, investigate the highlighted observations' values and roles in the analysis. Third, decide on the disposition of the observation, in light of all knowledge about the data. Possible actions include doing nothing, correcting a discovered error, or deleting an impossible value.
Data analysts first encountering p-values for regression diagnostics may hope to use them for automatic elimination of observations. Sophisticated analysts use the reference distributions to provide a common metric for the three step process (highlight, investigate, decide). Kleinbaum, Kupper, and Muller (1988, p201), in their introductory regression book, summarized their discussion of diagnostics by stating: “One should be cautioned that deleting the most deviant observations will in all cases slightly improve, and sometimes substantially improve, the fit of the model. One must be careful not to data snoop simply in order to polish the fit of the model by discarding troublesome data points.”
Although conceptually attractive to some observers, Cook’s statistic has not elicited universal enthusiasm. For example, Obenchain (1977) suggested ignoring the statistic and concentrating on its two components, the residual and the leverage. The difficulty in using the statistic stems from uncertainty as to what cut-point to use for highlighting troublesome observations. Our experience led us to the belief that the statistic flags only values already highlighted by residual analysis. Unpublished simulations (Chen Mok, 1993) confirmed the impression.
The ability to compute quantiles for Cook’s statistic based on Gaussian predictors, described in §2, provides an accurate metric for the statistic and hence allows the diagnostic to consistently highlight values worthy of further examination. The new results in this paper also imply a framework and approach for describing distributions and other properties of other diagnostics.
1.2 Related Earlier Work
Nearly all current regression texts consider regression diagnostics in some detail. Excellent book-length treatments include, in chronological order, Belsley, Kuh and Welsch (1980), Cook and Weisberg (1982), Atkinson (1985), and Chatterjee and Hadi (1986).
We consider two versions of the General Linear Univariate Model (GLUM) with iid Gaussian errors. For each observational unit the predictors will be assumed to be either a set of fixed values or to follow a multivariate Gaussian distribution. Sampson (1974) described the setting with fixed predictors as the conditional model, and the setting with Gaussian predictors as the unconditional model. As detailed in §2, the distribution and interpretation of Cook’s statistic depend directly on the distribution of the predictors. See Jensen and Ramirez (1996, 1997) for the distribution of Cook’s statistic for fixed predictors.
2. DISTRIBUTION THEORY
2.1 Notation and Definitions
In this section we present many standard results for regression diagnostics. Rather than cite a single source for each result, we recommend that the reader consult any of the book-length treatments just cited. LaMotte (1994) provided a “Rosetta Stone” for translating among the many names used for residuals.
A number of standard distributions must be considered. In general, indicate the cumulative distribution function (CDF) of the random variable U, which depends on parameters α1 through αk, as FU(t; α1…αk), with density fU(t; α1…αk) and pth quantile F⁻¹U(p; α1…αk). For notational convenience write the CDF of U|V = v as FU|v(t; α1…αk). Resolution of conflict between random variable and matrix notation and the random or fixed nature of a variable will be specified when not obvious from context. Let N(μ, Σ) indicate a multivariate Gaussian vector, with mean μ, non-singular covariance Σ, and CDF Φ(t; μ, Σ). Most results in this paper involve χ², F, or β random variables (Johnson and Kotz, Chapter 17, 1970a; Chapters 24 and 26, 1970b). Let χ²(ν) indicate a central χ² random variable on ν degrees of freedom, and let F(ν1, ν2) indicate a central F random variable on ν1 and ν2 degrees of freedom. Similarly let β(κ1, κ2) indicate a β random variable, with support (0, 1).
Most results for regression diagnostics concern fixed predictors, and hence the conditional model described by Sampson (1974). In particular, consider
(2.1) y = Xβ + e,
with y an N × 1 response vector, X an N × q matrix of predictors, β a q × 1 vector of coefficients, and e an N × 1 error vector. Let yi indicate the ith row of y, Xi the ith row of X, and ei the ith row of e. Here X contains fixed values, known conditionally on having designated the sampling units, β contains fixed unknown values, and Fe|X(t) = Φ(t; 0, INσ²). Assume throughout that N ⪢ q and that X has full rank q. Let ν = (N − q) indicate the error degrees of freedom. Indicate the usual estimators as
(2.2) β̂ = (X′X)⁻¹X′y
(2.3) σ̂² = (y − Xβ̂)′(y − Xβ̂)/ν.
Define
(2.4) H = X(X′X)⁻¹X′,
the hat matrix, because ŷ = Hy (Hoaglin and Welsch, 1978). Let hi indicate the ith diagonal element of H, the leverage for the ith observation:
(2.5) hi = Xi(X′X)⁻¹Xi′.
Refer to
(2.6) ê = y − ŷ = (IN − H)y
as the vector of residuals. Note that
(2.7) V(ê|X) = (IN − H)σ².
In turn define the ith squared standardized residual as
(2.8) ri² = êi²/[σ̂²(1 − hi)].
Belsley, Kuh and Welsch (1980), Cook and Weisberg (1982) and Atkinson (1985) reviewed the algebra of deletion and properties of residuals. Let (−i) indicate deletion of the ith observation and index the N statistics generated by doing so. Let X(−i) indicate the (N − 1) × q matrix created by deleting the ith row, with corresponding leverage hi(−i) = Xi[X(−i)′X(−i)]⁻¹Xi′. The process creates sets of N estimates, β̂(−i), predicted values, ŷ(−i), residuals, ê(−i), and variance estimates, σ̂²(−i). The resulting squared and standardized residual, the studentized residual, equals
(2.9) Ri² = êi²/[σ̂²(−i)(1 − hi)] = ri²(ν − 1)/(ν − ri²),
with
(2.10) FRi²|X(t) = FF(t; 1, ν − 1).
Cook’s statistic measures the standardized shift in the predicted values, or equivalently the shift in β̂, due to deleting the ith observation:
(2.11) Di = (β̂ − β̂(−i))′(X′X)(β̂ − β̂(−i))/(qσ̂²) = (ŷ − ŷ(−i))′(ŷ − ŷ(−i))/(qσ̂²).
Furthermore
(2.12) Di = ri²·hi/[q(1 − hi)] = ri²Ci, with Ci = hi/[q(1 − hi)].
Finding d such that Pr{Di > d} = α would provide a metric for Cook’s statistic. This idea motivates the current work. The results also provide a test of whether a particular Di arose from the distribution of Di implied by the GLUM assumptions. As highlighted in §1.1 and §4.3, the latter interpretation has more risks than benefits in practical use for the diagnostic setting.
2.2 The Distribution of Cook’s Statistic for Fixed Predictors
For fixed predictors Ci does not vary randomly. Hence, conditional on X,
(2.13) FDi|X(d) = Fβ[d/(νCi); 1/2, (ν − 1)/2].
Usually if i ≠ i’ then Ci ≠ Ci’, so that each of the N statistics follows a different conditional distribution. The cut-point d satisfying Pr{Di > d|X} = α does not vary randomly with fixed predictors, but does vary with the ith leverage, hi, and hence typically varies across sampling units.
In order to provide a metric for judging Cook’s statistic it would seem natural to eliminate the heterogeneity between sampling units which occurs with fixed predictors. However, doing so eliminates the variability due to Ci and makes Di a simple multiple of ri², with no distinct information. At least with predictor values assigned by the experimenter, Obenchain’s (1977) preference for considering the leverages and residuals separately seems appealing. See Jensen and Ramirez (1996, 1997) for a thorough treatment of fixed predictors.
2.3 The Distribution of Cook’s Statistic for Gaussian Predictors
Theorem
Let a0 = [q(N − 1)]⁻¹, a1 = (q − 1)N[qν(N − 1)]⁻¹, and t0 = max(a0, d/ν). For d > 0 and Gaussian predictors
(2.14) Pr{Di > d} = ∫_{t0}^{∞} {1 − Fβ[d/(νt); 1/2, (ν − 1)/2]} fCi(t) dt,
with corresponding density
(2.15) fDi(d) = ∫_{t0}^{∞} fβ[d/(νt); 1/2, (ν − 1)/2] (νt)⁻¹ fCi(t) dt.
Here
(2.16) fCi(t) = fF[(t − a0)/a1; q − 1, ν]/a1.
Lemma 1
(Weisberg, 1985, p114) Conditional on knowing X (fixed X)
(2.17) Fri²|X(t) = Fβ(t/ν; 1/2, (ν − 1)/2).
Lemma 2
A leverage value from a model containing an intercept and (q – 1) multivariate Gaussian predictors, with each row iid, equals a one-to-one function of an F random variable.
Proof
Belsley, Kuh, and Welsch (1980, p66) proved that
(2.18) Fi = ν(hi − N⁻¹)/[(q − 1)(1 − hi)] ~ F(q − 1, ν).
Solving their result for hi yields
(2.19) hi = [ν/N + (q − 1)Fi]/[ν + (q − 1)Fi].
Lemma 3
With Gaussian predictors, Ci = a0 + a1Fi, with Fi ~ F(q − 1, ν), so that
(2.20) FCi(t) = FF[(t − a0)/a1; q − 1, ν]
and
(2.21) fCi(t) = fF[(t − a0)/a1; q − 1, ν]/a1.
Proof
For Gaussian predictors the expression in (2.19) for hi allows stating
(2.22) Ci = hi/[q(1 − hi)] = [ν/N + (q − 1)Fi]N/[qν(N − 1)] = a0 + a1Fi.
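Lemma 3 lends itself to direct computation. The following minimal sketch, in Python with SciPy, evaluates the CDF of Ci via (2.20) as reconstructed above; the helper name cdf_C and the use of scipy.stats.f are our choices, not part of the paper.

```python
# A minimal sketch of Lemma 3: Ci = a0 + a1*Fi with Fi ~ F(q-1, nu), so the
# CDF of Ci in (2.20) is an F CDF evaluated at (t - a0)/a1.  The helper name
# cdf_C is ours, not the authors'.
from scipy.stats import f

def cdf_C(t, N, q):
    """CDF of Ci = hi/[q(1 - hi)] under Gaussian predictors, per (2.20)."""
    nu = N - q                            # error degrees of freedom
    a0 = 1.0 / (q * (N - 1))              # support of Ci starts at a0 (hi >= 1/N)
    a1 = (q - 1) * N / (q * nu * (N - 1))
    if t <= a0:
        return 0.0
    return f.cdf((t - a0) / a1, q - 1, nu)
```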
Lemma 4
Let X* = XT, with T a full rank q × q matrix of constants. Note that T⁻ᵗ = (T′)⁻¹ = (T⁻¹)′. Then H does not vary due to this transformation of the predictors.
Proof
Observe that
(2.23) H* = X*(X*′X*)⁻¹X*′ = XT(T′X′XT)⁻¹T′X′ = XTT⁻¹(X′X)⁻¹T⁻ᵗT′X′ = X(X′X)⁻¹X′ = H.
Corollary 4.1
H does not vary due to the covariance matrix of iid random predictors.
Proof
Let Σx = F F’ indicate a factoring of the (q – 1) × (q – 1) covariance matrix of a row of random predictors, assumed full rank. Choosing
(2.24) T = Diag(1, F⁻ᵗ)
corresponds to considering a new model with predictors X* = XT. The model contains an intercept and q – 1 random predictors, with Σx* = I.
Corollary 4.2
hi, ri², σ̂²(−i), Ri, Ci and Di do not vary due to full rank transformation of the predictors or the covariance matrix of random predictors.
Proof
Each quantity depends on X only through elements of H.
Lemma 5
With Gaussian predictors FRi²|X = FRi²|hi.
Proof
Consider Ri² in terms of three pieces: (1 − hi), êi², and σ̂²(−i).
i) Obviously (1 − hi) depends on X only through hi.
ii) Conditional on X, êi ~ N[0, σ²(1 − hi)], and êi²/[σ²(1 − hi)] ~ χ²(1) does not depend on X.
iii) Conditional on X, σ̂²(−i)(ν − 1)/σ² ~ χ²(ν − 1), and therefore σ̂²(−i)/σ² does not depend on X.
iv) Conditional on X, by the nature of deletion êi and σ̂²(−i) are statistically independent (LaMotte, 1994, example 1), and Ri² = {êi²/[σ²(1 − hi)]}/[σ̂²(−i)/σ²].
Combining i) through iv) completes the proof.
Corollary 5.1
With Gaussian predictors Fri²|X = Fri²|hi.
Proof
Use the last line of (2.9) to write ri² = νRi²/(ν − 1 + Ri²). Hence ri² depends on X only through Ri², which depends on X only through hi.
Corollary 5.2
With Gaussian predictors FDi|X = FDi|Ci.
Proof
Ci = hi/[q(1 – hi)] and hence depends on X only through hi.
Proof of the Theorem
Use the law of total probability to state
(2.25) Pr{Di ≤ d} = ∫_{a0}^{∞} FDi|Ci=t(d) fCi(t) dt.
Equation (2.17) describes the distribution function of ri² conditional on X, which gives the distribution of Di = ri²Ci conditional on Ci, by Corollary 5.2. Combining the distribution in (2.17) with (2.25) allows concluding that
(2.26) Pr{Di ≤ d} = Pr{Ci ≤ t0} + ∫_{t0}^{∞} Fβ[d/(νt); 1/2, (ν − 1)/2] fCi(t) dt.
Note that t0 = max(a0, d/ν) and simplify. Finding the density requires differentiating each form in (2.26) separately, and recognizing that the lower limit depends on d. The two apparently distinct forms reduce to a single one upon noting that fβ[1; 1/2, (ν – 1)/2] = 0.
2.4 Computational Forms for Numerical Integration
Although tantalizing in form, the integral for the CDF of Di does not allow closed form integration. Numerical integration allows accurate and convenient computation of Pr{Di > d}. Both functions in the integral require careful consideration in order to produce a form amenable to computation. Among various forms considered, the ones used here provide the simplest proofs and least computational time for any level of accuracy, except perhaps for small values of Pr{Di ≤ d}. Interest usually centers on large values of Pr{Di ≤ d}.
Two distinct representations create a finite region of integration, which greatly simplifies numerical integration. First express the density of Ci in terms of an F random variable. If u = (t − a0)/a1, so that t = a1u + a0 and u0 = (t0 − a0)/a1, then
(2.27) Pr{Di > d} = ∫_{u0}^{∞} {1 − Fβ[d/{ν(a0 + a1u)}; 1/2, (ν − 1)/2]} fF(u; q − 1, ν) du,
or equivalently
(2.28) Pr{Di ≤ d} = FF(u0; q − 1, ν) + ∫_{u0}^{∞} Fβ[d/{ν(a0 + a1u)}; 1/2, (ν − 1)/2] fF(u; q − 1, ν) du.
The relationship of F and β random variables allows creating a finite region of integration. If z = (q − 1)u[ν + (q − 1)u]⁻¹ then u = ν(q − 1)⁻¹z(1 − z)⁻¹ and z0 = (q − 1)u0[ν + (q − 1)u0]⁻¹. Also let
(2.29) g1(z) = d/{ν[a0 + a1ν(q − 1)⁻¹z(1 − z)⁻¹]}.
With this transformation
(2.30) Pr{Di > d} = ∫_{z0}^{1} {1 − Fβ[g1(z); 1/2, (ν − 1)/2]} fβ[z; (q − 1)/2, ν/2] dz.
A second useful representation results from applying the transformation w = u/(1 + u) to the integral in (2.28). With w0 = u0/(1 + u0) and
(2.31) g2(w) = d/{ν[a0 + a1w(1 − w)⁻¹]},
it follows that
(2.32) Pr{Di > d} = ∫_{w0}^{1} {1 − Fβ[g2(w); 1/2, (ν − 1)/2]} fF[w(1 − w)⁻¹; q − 1, ν] (1 − w)⁻² dw.
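As a concrete illustration, the sketch below integrates the tail probability over the finite region created by w = u/(1 + u), following the forms reconstructed above. It substitutes scipy.integrate.quad for the Simpson's rule the authors used, and the routine name prob_D_exceeds is ours.

```python
# A sketch of Pr{Di > d} via the finite-region form (2.32), assuming the
# reconstructed integrand: conditional on Ci = t, Di/(nu*t) ~ beta(1/2, (nu-1)/2),
# and Ci = a0 + a1*U with U ~ F(q-1, nu).  scipy.integrate.quad stands in for
# the paper's Simpson's rule.
from scipy.stats import beta, f
from scipy.integrate import quad

def prob_D_exceeds(d, N, q):
    nu = N - q
    a0 = 1.0 / (q * (N - 1))
    a1 = (q - 1) * N / (q * nu * (N - 1))
    u0 = max(0.0, (d / nu - a0) / a1)   # u0 = (t0 - a0)/a1 with t0 = max(a0, d/nu)
    w0 = u0 / (1.0 + u0)                # w = u/(1 + u) maps (u0, inf) to (w0, 1)

    def integrand(w):
        u = w / (1.0 - w)
        t = a0 + a1 * u                 # a value of Ci
        tail = beta.sf(min(d / (nu * t), 1.0), 0.5, (nu - 1) / 2.0)
        return tail * f.pdf(u, q - 1, nu) / (1.0 - w) ** 2

    value, _ = quad(integrand, w0, 1.0, limit=200)
    return value
```

For q = 2 the F density is unbounded at u = 0, the singularity noted in §3.4; adaptive quadrature copes with it, but a Simpson's rule implementation requires the care the authors describe.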
2.5 Approximations
Equation (2.27) allows recognizing that Pr{Di > d} equals the expected value of a function of an F random variable whenever t0 = a0. For fixed q the quantiles of interest converge to zero as N increases (§2.6), so that eventually d/ν ≤ a0. Consequently the expected value interpretation holds, at least asymptotically, in all cases. The accuracy of a series based on treating the integral as an expected value depends both on the remainder term and on any discrepancy due to d/ν > a0.
Creating a two term Taylor’s series approximation for (2.30) involves noting that E{β[(q − 1)/2, ν/2]} = (q − 1)/(ν + q − 1). Ignoring any discrepancy due to d/ν > a0 yields
(2.33)
Applying a series expansion for an F random variable, using (2.27) or (2.28), requires ν > 2k to insure a finite kth moment. If ν > 2 then E{F(q − 1, ν)} = ν/(ν − 2) and, ignoring discrepancy due to d/ν > a0, a two term series equals
(2.34)
For ν ≤ 2, a one term F based expansion about the number 1 yields
(2.35) Pr{Di > d} ≈ 1 − Fβ{d/[ν(a0 + a1)]; 1/2, (ν − 1)/2},
which corresponds to the two term expansion for the β representation in (2.33). The approximate probability from (2.35) will never be greater than that from (2.34).
The probability approximations imply approximations for quantiles of Di:
(2.36) d̂(p) = ν(a0 + a1m)F⁻¹β(p; 1/2, (ν − 1)/2).
Here m = ν/(ν − 2) for (2.34), or m = 1 for (2.35). Assigning m the value of the median, F⁻¹F(.5; q − 1, ν), or mode, ν(q − 3)/[(q − 1)(ν + 2)], for q > 3, also provides a one term approximation.
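A sketch of the one-evaluation quantile approximation follows, coded from (2.36) as reconstructed here; the choice of m follows the footnote to Table III, and approx_quantile_D is our name.

```python
# A sketch of the quantile approximation (2.36) as reconstructed above:
# d_hat(p) = nu*(a0 + a1*m)*BetaInv(p; 1/2, (nu-1)/2), with m = nu/(nu-2)
# for nu > 2 (the F mean) and m = 1 otherwise, per the footnote to Table III.
from scipy.stats import beta

def approx_quantile_D(p, N, q):
    nu = N - q
    a0 = 1.0 / (q * (N - 1))
    a1 = (q - 1) * N / (q * nu * (N - 1))
    m = nu / (nu - 2.0) if nu > 2 else 1.0
    return nu * (a0 + a1 * m) * beta.ppf(p, 0.5, (nu - 1) / 2.0)
```

Feeding such a cut-point back into an exact integrator reproduces the behavior reported in Table III: actual test sizes near, but not equal to, the targets.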
One convenient form for creating a long series arises from (2.28):
(2.37)
In turn
(2.38)
2.6 Large Sample Properties
The behavior of Di in large samples merits separate consideration. The results have both analytic and computational value. Rather than study Di directly, consider Di* = ν·Di. Then
(2.39) Pr{Di* ≤ d*} = Pr{Di ≤ d*/ν},
with d = d*/ν. Using (2.28) the distribution function for Di* may be expressed as
(2.40) FDi*(d*) = FF(u0*; q − 1, ν) + ∫_{u0*}^{∞} Fβ[g3(u); 1/2, (ν − 1)/2] fF(u; q − 1, ν) du,
with
(2.41) g3(u) = d*/[ν²(a0 + a1u)],
u0* = [t0(d*/ν) − a0]/a1, and t0(d*/ν) = max(a0, d*/ν²).
Consider Di* as N → ∞. In that case
(2.42) limN→∞ νa0 = q⁻¹ and limN→∞ νa1 = (q − 1)/q.
That Fβ[t/ν; 1/2, (ν − 1)/2] converges to Fχ²(t; 1) and that (q − 1)Fi converges in law to χ²(q − 1) combine to imply that Di* has a non-degenerate limiting distribution. Therefore
(2.43) limN→∞ Pr{Di* > d*} = ∫_{0}^{∞} {1 − Fχ²[qd*/{1 + (q − 1)u}; 1]} fχ²[(q − 1)u; q − 1] (q − 1) du.
Let w = (q – 1)u, so that dw = (q – 1)du. Then
(2.44) limN→∞ Pr{Di* > d*} = ∫_{0}^{∞} {1 − Fχ²[qd*/(1 + w); 1]} fχ²(w; q − 1) dw.
A Taylor’s series about E(W) = (q − 1) yields the two term approximation
(2.45)
Also, with d = d*/ν, for large N
(2.46) Pr{Di > d} ≈ 1 − Fχ²(νd; 1),
with corresponding quantile approximation
(2.47) d̂(p) ≈ F⁻¹χ²(p; 1)/ν.
The F based approximation in (2.36) provides more accuracy, except in large samples. Additional terms are required for the approximation in (2.46) to vary with q.
Three conclusions follow. First, as N increases Di converges to a degenerate random variable with all mass at zero. Second, Di* converges to a non-degenerate random variable. Third, calculations of quantiles in terms of Di* can greatly reduce numerical difficulties with large samples.
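The limiting behavior can be computed directly, without any reference to N. The sketch below codes the limit in (2.44) as reconstructed here, together with the one-term form corresponding to (2.46), which does not vary with q; both function names are ours.

```python
# A sketch of the large-sample behavior as reconstructed in (2.44)-(2.46):
# the tail of D* = nu*Di approaches an integral of a chi-square(1) tail
# against a chi-square(q-1) density, and the one-term approximation
# Pr{Di > d} ~ Pr{chi-square(1) > nu*d} is free of q.
import numpy as np
from scipy.stats import chi2
from scipy.integrate import quad

def tail_Dstar_limit(d_star, q):
    """Limiting Pr{nu*Di > d_star} as N grows, per the reconstructed (2.44)."""
    g = lambda w: chi2.sf(q * d_star / (1.0 + w), 1) * chi2.pdf(w, q - 1)
    value, _ = quad(g, 0.0, np.inf, limit=200)
    return value

def tail_D_one_term(d, nu):
    """One-term approximation corresponding to (2.46); does not vary with q."""
    return chi2.sf(nu * d, 1)
```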
2.7 The Maximum of N Values of Cook’s Statistic
Fitting a linear model leads to considering N values of Di. The non-independence of the set of Di makes an analytic description of their joint distribution unclear, and computing associated probabilities rather onerous. Despite that, ignoring the multiple testing problem would lead to spuriously rejecting valid data. A Bonferroni correction provides the simplest strategy.
A multiple-testing correction for Di with fixed predictors reduces to consideration of the same issue for residuals. Cook and Prescott (1981) examined the accuracy of a Bonferroni correction in evaluating N residuals, for fixed X. They provided useful lower bounds, based on residual correlations, to complement the Bonferroni upper bound. The accuracy of the Bonferroni correction decreases as correlations among the residuals increase. Experimental designs with purposeful confounding can create extremely high correlations among some pairs of residuals. Recall that, given X, the residuals have covariance matrix (IN – H)σ2. The Bonferroni correction seems more likely to be universally applicable with Gaussian predictors. As described in §2.3, for the study of {Di} the covariance matrix for each row of Gaussian predictors may be assumed to be Iq–1. Hence the expected correlation for any pair of residuals should be modest and asymptotically zero. The excellent performance of the Bonferroni correction with independent events of small probability promises good accuracy here.
3. NUMERICAL EVALUATIONS
3.1 Exact Probability and Quantile Computations
All exact probabilities reported in this paper were computed by applying Simpson’s rule to equation (2.32). All calculations were expressed in terms of the variable Di* = ν·Di in order to provide better numerical accuracy for large sample cases. All exact quantiles were computed via a bisection algorithm (Thisted, 1988, p169) applied to equation (2.32), in terms of Di*, with an approximate quantile from equation (2.34) or (2.35) as the starting value. Properties of the function were exploited merely to speed convergence. See Kennedy and Gentle (1980) or Thisted (1988) for descriptions of Simpson’s rule, as well as general discussions of numerical integration, the use of transformations to finite regions such as the one used here, and function inversion algorithms.
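A sketch of the quantile computation just described appears below, assuming the prob_D_exceeds() and approx_quantile_D() helpers sketched in §2.4 and §2.5 are in scope. We use scipy.optimize.brentq where the authors cite simple bisection, and work on the d* = ν·d scale as they recommend.

```python
# A sketch of exact quantile computation by root finding on the d* = nu*d
# scale, assuming the prob_D_exceeds() and approx_quantile_D() sketches
# above.  brentq stands in for the simple bisection of Thisted (1988);
# the bracket around the starting value is a crude choice of ours.
from scipy.optimize import brentq

def exact_quantile_D(p, N, q):
    nu = N - q
    d_star0 = nu * approx_quantile_D(p, N, q)      # starting value via (2.36)
    g = lambda d_star: prob_D_exceeds(d_star / nu, N, q) - (1.0 - p)
    return brentq(g, d_star0 / 100.0, d_star0 * 100.0) / nu
```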
3.2 A Simulation
A small simulation study was conducted in order to verify the accuracy of the computational strategy detailed at the end of §2.4, and to assess the accuracy of a Bonferroni correction in evaluating N values of Di. Assumptions followed those in §2: y = Xβ + e holds, with {ei} iid Gaussian, β fixed and unknown, {Xi} iid multivariate Gaussian and independent of {ei}. In such cases, finding d such that Pr{Di > d} = α depends only on N, q, and α. The value of d provides a test of whether a particular Di arose from the hypothesized distribution.
All data were generated under the stated assumptions. Empirical size of the test of Di was tabulated for each replicate. Two factors were varied in a factorial design: N ∈ {25, 50, 100} and q ∈ {2, 4, 8}. For each replicate the first Di was tested at α ∈ {.01,.05} and the largest Di was tested at α ∈ {.01/N,.05/N}. The pseudo-random generation of data, under valid assumptions, insures that the first Di represents a pseudo-randomly selected value. In contrast, the distribution of the largest Di depends on the remaining N – 1 values.
Let Z = [e G] indicate an N × q matrix, with rowi(Z) ~ N(0, Iq). For each combination of N and q a total of 20,000 replicates of Z were created in SAS/IML, using the function NORMAL. Next X = [1 G], y = Xβ + e, and {Di} were computed for each replicate, with β = 0q (which implies y = e). The first Di, the largest Di, N, and q were stored for each replicate.
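A minimal re-creation of this design in NumPy rather than SAS/IML follows; cooks_D() computes the N values of Di from the definitions in §2.1, the exact p-values come from the prob_D_exceeds() sketch in §2.4, and all names are ours.

```python
# A minimal NumPy re-creation of the simulation design (the paper used
# SAS/IML): beta = 0 so y = e, Gaussian predictors, one replicate per call.
# cooks_D() follows Di = ri^2 * hi / [q(1 - hi)] from (2.12).
import numpy as np

rng = np.random.default_rng(1997)

def cooks_D(X, y):
    """All N values of Cook's statistic for a least-squares fit."""
    N, q = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)        # hat matrix, (2.4)
    h = np.diag(H)                               # leverages, (2.5)
    e_hat = y - H @ y                            # residuals, (2.6)
    s2 = e_hat @ e_hat / (N - q)                 # sigma-hat squared, (2.3)
    r2 = e_hat**2 / (s2 * (1.0 - h))             # squared standardized residuals, (2.8)
    return r2 * h / (q * (1.0 - h))              # (2.12)

def one_replicate(N, q, alpha=0.05):
    X = np.column_stack([np.ones(N), rng.standard_normal((N, q - 1))])
    y = rng.standard_normal(N)                   # beta = 0 implies y = e
    D = cooks_D(X, y)
    first = prob_D_exceeds(D[0], N, q) < alpha           # a single value
    largest = prob_D_exceeds(D.max(), N, q) < alpha / N  # Bonferroni, per section 2.7
    return first, largest
```

Averaging the two indicators over replicates gives empirical sizes comparable to Table I.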
Table I summarizes the empirical size for the tests of Di with Gaussian predictors, as a function of N, q, and α. The formulas derived in §2 provided accurate probabilities for the simulations of a single value. Furthermore the Bonferroni approximation was quite accurate.
TABLE I. Empirical test size for Di with Gaussian predictors (20,000 replicates per condition).

q | N | Single: α = .01 | Single: α = .05 | Largest: α = .01/N | Largest: α = .05/N
---|---|---|---|---|---
2 | 25 | .010 | .051 | .010 | .048
2 | 50 | .011 | .049 | .010 | .049
2 | 100 | .009 | .050 | .011 | .051
4 | 25 | .010 | .049 | .010 | .047
4 | 50 | .011 | .050 | .011 | .050
4 | 100 | .011 | .051 | .010 | .051
8 | 25 | .010 | .047 | .010 | .049
8 | 50 | .010 | .052 | .010 | .048
8 | 100 | .011 | .050 | .009 | .051
3.3 Comparisons of Approximations
Table II contains probabilities of Di exceeding F⁻¹F(.5; q − 1, ν), the cut-point suggested by Cook, and N times the probabilities. Test size systematically and rapidly decreases with N. Ideally a cut-point allows consistent interpretation across regression analyses. The median, or any other quantile, of F(q − 1, ν) does not allow such consistency.
TABLE II. Exact probability that Di exceeds the cut-point suggested by Cook, F⁻¹F(.5; q − 1, ν), and N times that probability.

N | q = 2 | q = 4 | q = 8 | q = 16
---|---|---|---|---
Pr{Di > F⁻¹F(.5; q − 1, ν)} | | | |
25 | 7.02 · 10⁻³ | 1.53 · 10⁻³ | 1.39 · 10⁻³ | 1.19 · 10⁻²
50 | 7.46 · 10⁻⁴ | 3.47 · 10⁻⁵ | 6.02 · 10⁻⁶ | 5.61 · 10⁻⁵
100 | 3.59 · 10⁻⁵ | 1.85 · 10⁻⁷ | 3.10 · 10⁻⁹ | 1.43 · 10⁻¹⁰
200 | 5.55 · 10⁻⁷ | 1.20 · 10⁻¹⁰ | <1 · 10⁻¹⁴ | <1 · 10⁻¹⁴
N · Pr{Di > F⁻¹F(.5; q − 1, ν)} | | | |
25 | 1.76 · 10⁻¹ | 3.82 · 10⁻² | 3.46 · 10⁻² | 2.98 · 10⁻¹
50 | 3.73 · 10⁻² | 1.73 · 10⁻³ | 3.01 · 10⁻⁴ | 2.81 · 10⁻⁴
100 | 3.59 · 10⁻³ | 1.85 · 10⁻⁵ | 3.10 · 10⁻⁷ | 1.43 · 10⁻⁸
200 | 1.11 · 10⁻⁴ | 2.41 · 10⁻⁸ | <2 · 10⁻¹² | <1 · 10⁻¹⁴
The approximate quantile in equation (2.36) also provides a cut-point, requiring only one evaluation of F⁻¹β. Such values were computed for N and q as in Table II, with target test sizes of .01 and .05. Equation (2.32) was integrated with Simpson’s rule to compute exact probabilities of exceeding the approximate quantiles. In order to approximately evaluate a Bonferroni correction, the same process was followed for target test sizes of .01/N and .05/N, with the additional step of multiplying the probabilities by N. As can be seen in Table III, the exact test size ranges from .052 to .058 for a target α of .05, and from .014 to .026 for a target α of .01. The results corresponding to a Bonferroni correction (in the right half of Table III) involve smaller tail probabilities and were much less accurate. An overall target α of .01 gave approximate test sizes ranging from .073 to .398, while a target α of .05 gave approximate test sizes ranging from .200 to .709 (for the conditions examined). Accuracy improves with increasing sample size and number of predictors.
TABLE III. Exact tail probabilities at the approximate quantiles d̂ from equation (2.36)¹.

N | q = 2 | q = 4 | q = 8 | q = 16 | q = 2 | q = 4 | q = 8 | q = 16
---|---|---|---|---|---|---|---|---
 | Pr{Di > d̂(.01)}¹ | | | | N · Pr{Di > d̂(.01/N)}¹ | | |
25 | .022 | .023 | .022 | .026 | .182 | .166 | .149 | .296
50 | .020 | .020 | .018 | .016 | .212 | .168 | .113 | .085
100 | .019 | .019 | .017 | .014 | .298 | .203 | .118 | .069
200 | .019 | .019 | .016 | .014 | .398 | .268 | .142 | .073
 | Pr{Di > d̂(.05)}¹ | | | | N · Pr{Di > d̂(.05/N)}¹ | | |
25 | .054 | .058 | .057 | .058 | .295 | .286 | .260 | .403
50 | .053 | .057 | .055 | .054 | .369 | .323 | .240 | .192
100 | .053 | .056 | .055 | .053 | .498 | .405 | .270 | .181
200 | .053 | .056 | .055 | .053 | .709 | .539 | .330 | .200

¹ d̂(α) indicates the approximate quantile from equation (2.36) at p = 1 − α, with m = ν/(ν − 2) if ν > 2 and m = 1 otherwise.
Table IV provides exact critical values of ν·Di for α ∈ {.01/N, .05/N} and a range of N and q. Quantiles in Table IV were computed by a simple bisection algorithm (Thisted, 1988, p169) applied to equation (2.32). An approximate quantile from equation (2.35) provided the starting value. Algorithmic stability across a large range of sample sizes required using d* = ν·d.
TABLE IV. Exact critical values of ν · Di for α = .05/N (upper panel) and α = .01/N (lower panel).

N | q − 1 = 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 15 | 20 | 40 | 80
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
α = .05/N | | | | | | | | | | | | | |
10 | 15.2 | 16.6 | 18.2 | 21.3 | 28.4 | 50.2 | 192.9 | . | . | . | . | . | . | . |
15 | 15.6 | 15.6 | 15.5 | 15.6 | 15.9 | 16.7 | 18.0 | 20.3 | 24.8 | 34.7 | . | . | . | . |
20 | 16.4 | 16.0 | 15.5 | 15.1 | 15.0 | 14.9 | 15.0 | 15.3 | 15.7 | 16.4 | 40.0 | . | . | . |
25 | 17.2 | 16.6 | 15.9 | 15.3 | 15.0 | 14.7 | 14.6 | 14.5 | 14.5 | 14.6 | 16.9 | 44.6 | . | . |
30 | 17.9 | 17.1 | 16.3 | 15.7 | 15.2 | 14.8 | 14.6 | 14.4 | 14.3 | 14.2 | 14.6 | 17.5 | . | . |
40 | 19.2 | 18.2 | 17.2 | 16.4 | 15.8 | 15.4 | 15.0 | 14.7 | 14.5 | 14.4 | 13.9 | 14.1 | . | . |
60 | 21.3 | 19.9 | 18.6 | 17.7 | 17.0 | 16.4 | 16.0 | 15.7 | 15.3 | 15.1 | 14.3 | 13.9 | 14.5 | . |
80 | 23.0 | 21.2 | 19.8 | 18.7 | 17.9 | 17.3 | 16.8 | 16.4 | 16.1 | 15.7 | 14.9 | 14.3 | 13.8 | . |
100 | 24.3 | 22.3 | 20.7 | 19.6 | 18.7 | 18.0 | 17.5 | 17.1 | 16.7 | 16.4 | 15.4 | 14.8 | 13.9 | 15.6 |
200 | 28.8 | 26.1 | 24.1 | 22.5 | 21.4 | 20.5 | 20.0 | 19.3 | 18.0 | 18.4 | 17.2 | 16.5 | 15.2 | 14.6 |
400 | 33.9 | 30.2 | 27.6 | 25.8 | 24.4 | 23.5 | 22.4 | 22.1 | 21.3 | 20.5 | 19.4 | 18.5 | 16.8 | 16.0 |
800 | 40.2 | 34.0 | 32.1 | 29.3 | 28.2 | 26.7 | 25.8 | 24.4 | 24.3 | 23.3 | 21.8 | 20.3 | 18.8 | 17.5 |
α = .01/N | ||||||||||||||
10 | 28.7 | 31.1 | 35.1 | 44.1 | 66.8 | 50.5 | 964.1 | . | . | . | . | . | . | . |
15 | 26.9 | 26.1 | 25.7 | 25.8 | 26.7 | 28.5 | 31.8 | 37.8 | 50.1 | 80.7 | . | . | . | . |
20 | 27.2 | 25.7 | 24.2 | 23.6 | 23.2 | 23.1 | 23.3 | 23.9 | 24.9 | 26.5 | 92.1 | . | . | . |
25 | 27.9 | 25.8 | 24.3 | 23.2 | 22.5 | 22.0 | 21.7 | 21.6 | 21.6 | 21.8 | 27.0 | 102.3 | . | . |
30 | 28.7 | 26.4 | 24.4 | 23.3 | 22.4 | 21.8 | 21.3 | 21.0 | 20.7 | 20.6 | 21.5 | 27.7 | . | . |
40 | 30.2 | 27.4 | 25.4 | 23.9 | 22.8 | 22.2 | 21.4 | 20.9 | 20.5 | 20.1 | 19.4 | 19.7 | . | . |
60 | 32.3 | 29.3 | 26.6 | 25.2 | 24.2 | 23.0 | 22.2 | 21.6 | 21.2 | 20.7 | 19.4 | 18.9 | 19.8 | . |
80 | 34.4 | 31.1 | 28.1 | 26.4 | 25.0 | 23.9 | 23.3 | 22.4 | 21.8 | 21.3 | 19.8 | 19.2 | 17.8 | . |
100 | 36.1 | 32.6 | 29.2 | 27.3 | 25.8 | 24.4 | 24.2 | 23.3 | 22.2 | 22.1 | 20.2 | 19.2 | 18.0 | 20.7 |
200 | 41.2 | 37.3 | 34.2 | 31.3 | 29.4 | 28.4 | 26.9 | 25.8 | 25.6 | 24.5 | 22.4 | 21.3 | 19.3 | 18.6 |
400 | 49.4 | 45.0 | 37.6 | 35.3 | 34.1 | 31.0 | 31.0 | 29.3 | 28.2 | 28.2 | 25.6 | 23.3 | 21.2 | 20.1 |
800 | 68.4 | 57.7 | 52.6 | 40.6 | 36.9 | 36.9 | 33.6 | 33.6 | 30.5 | 30.5 | 27.7 | 25.2 | 22.9 | 22.9 |
Different rows in Table IV have different patterns. The range reflects a varying distance from a boundary condition. The studentized residual embedded in Di requires N − q − 1 > 0; the critical value of Di or ν·Di may be taken to be infinite when N − q − 1 ≤ 0. The table covers q − 1 ≤ 80. Rows with N ≤ 80 include the boundary and show a marked upturn in the rightmost value. Rows with N ≥ 200 have no entries near the boundary, and hence display a monotone pattern.
3.4 Comments on Algorithms
Computing and verifying the results in this paper led us to program and evaluate the formulations described in §2.4. Equation (2.32) needed less computation for a fixed accuracy. The advantage of the transformation in (2.32) arises from the shape of the function as both N and q get large. The good performance of the transformation reflects the nature of the random variable W = U/(1 + U), with U following an F distribution. Even though both numerator and denominator degrees of freedom increase, the distribution function of W does not degenerate to a point mass. In contrast, the other two formulations involve convergence (as sample size increases) to degenerate random variables and hence degenerate functions. Two difficulties with (2.32) should be noted. First, q = 2 creates a singularity at zero, which often represents the end-point of the interval of integration. Second, extremely large values of d (corresponding to values far beyond those in Table IV) may increase the computational burden.
The care required to insure reasonable numerical performance across a wide range of conditions should not be surprising, given the random variables involved. Kennedy and Gentle (1980, §5.5 and §5.6) discussed the difficulties in computing F and β probabilities and quantiles. They concluded that no single approach works with all parameter combinations. Thisted (1988, §5.2.2) provided related material.
4. DISCUSSION
4.1 The Role of Sample Size in Regression Diagnostics
Considering Di* = ν·Di rather than Di creates a computational advantage. The two alternatives also reflect two mutually exclusive behaviors for regression diagnostics. For a fixed amount of deviance, distinctions among observations shrink as sample size increases for hi and Di (both converge to zero). In contrast, a fixed amount of deviance yields an interpretation essentially constant as sample size increases for ri² and Ri. In order to emphasize the distinction, compare hi to
(4.1) hi* = (N − 1)(hi − N⁻¹).
Obviously hi* exhibits the second type of behavior, across N and q. Note that hi* corresponds to the Mahalanobis distance from the origin.
Both types of behavior have merit. Statistics of the first type better reflect the impact of a single observation on the total analysis. With the first type of statistic, the misleading effect of a single observation eventually drowns in rising sample size. Statistics of the second type highlight a given deviant observation, no matter what the sample size. The second type’s consistent range of values across sample sizes simplifies interpretation. For example, no matter what the sample size, Ri = 7 would demand further attention to the observation.
Sample size also plays a familiar role in the interpretation of Di* and Di. As always, one must distinguish between statistical “significance” (a small p-value, reflecting rarity of the value) and scientific importance (a difference of consequence in practice). In the present context, some data analysts judge importance by the size of an estimated regression coefficient; others consider the standardized version, the corresponding semi-partial correlation coefficient, r(Y, Xj|{X1…Xj−1, Xj+1…Xq−1}), or the corresponding sums of squares. A regression coefficient may significantly differ from zero but have no practical importance. Recall that Di = (β̂ − β̂(−i))′(X′X)(β̂ − β̂(−i))/(qσ̂²), and hence Di captures the shift in (standardized) regression coefficients. Hence to judge the importance of an observation highlighted by Di one should examine the shift in β̂, sums of squares, or multiple correlation, due to deleting the observation.
4.2 Open Questions and Potential Applications
Di represents one example of a closely related set of diagnostic statistics, including DFFITSi and DFBETASj(i) (Cook and Weisberg, 1982), and a modification of Cook’s Di (Atkinson, 1985). The approach presented here for computing probabilities and quantiles appears to allow similar computations for at least some of the related statistics.
The distribution of the predictors and the purpose of the analysis strongly affect the interpretation of the diagnostics. The results cover only two situations: all fixed predictors, or all Gaussian predictors plus the special fixed predictor, the intercept. Random but non-Gaussian predictors were not considered here. Do the results provide an approximation whose quality improves as ν increases? Are the results robust with respect to the form of the distribution?
Another generalization involves models with both fixed and Gaussian predictors. Consider, for example, ANCOVA models, which contain one or more fixed effects, and one or more Gaussian predictors. The fixed parameters for a single fixed factor or for any factorial design can be expressed, without loss of generality, as a set of G cell means, with corresponding columns in X of full rank (G). The statistical independence between rows of data allows the likelihood to be separated into G components. The theory in §1-4 treats the special case of G = 1. Consequently the new results apply to each of the G sets of data, considered separately. However, such an analysis would only identify influence within a group, not with respect to all observations in the analysis. Considering all observations simultaneously appears to require additional theoretical results.
More general models may have fixed parameters not expressible as a full-rank cell mean coding (such as a fixed-block design), and/or contain interactions between fixed and Gaussian or Gaussian and Gaussian predictors. Again new theoretical results appear necessary.
Two or more observations may mask the influence of each other. Consequently some research on diagnostics has focused on the impact of deleting two or more observations. Although very appealing, such multiple-deletion diagnostics usually create substantially greater analytic and computational difficulty. Cook and Weisberg (1982) discussed generalizing Di in this fashion. The theory described here does not accommodate their generalization in any straightforward fashion. The more general result seems worth pursuing. See Jensen and Ramirez (1996, 1997) for the fixed predictor case. Furthermore generalizing the results stated here to multivariate regression (two or more responses) also has merit.
Computational algorithms for probabilities and quantiles of Di deserve more attention. A series representation would likely provide the best solution, although an even better behaved function for integration might suffice.
4.3 Abusing the Results for Data Analysis
The new results should never be used for automatically discarding an observation. A widely cited episode in meteorology illustrates the danger of automatic deletion. In order to help process the flood of data from U.S. weather satellites, automatic outlier detection and rejection was applied as part of data formatting and reduction (Kenward, 1988). British scientists (Farman, Gardiner, and Shanklin, 1985), using ground station data, reported a dramatic downward trend across time in ozone levels over the Antarctic. U.S. NASA scientists confirmed the infamous “hole” in the ozone by re-examining their accumulated satellite data, with automatic outlier detection disabled.
4.4 Using the Results for Data Analysis
As discussed in §1.1, using any diagnostic involves a three step process: 1) highlight bothersome values, 2) investigate the highlighted values, and 3) decide on a disposition, using scientific principles. Whenever the Gaussian predictors assumption seems reasonable, we recommend using the probability and quantile computations for ν·Di to highlight observations worthy of investigation. As indicated, explicitly compute and compare β̂ and β̂(−i). As discussed in §4.1, examining the shift in sums of squares or correlation also has appeal. We believe the results presented here provide a useful metric for Di and valuable insight into its nature and performance. A sketch of such a workflow follows.
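The sketch below covers step 1 only, assuming statsmodels for the fit (its OLSInfluence.cooks_distance attribute returns the N values of Di) together with the prob_D_exceeds() integrator sketched in §2.4; the Bonferroni level follows §2.7, and highlight_influential is our name.

```python
# A sketch of step 1 (highlight) of the three step process, assuming
# statsmodels (get_influence().cooks_distance[0] returns the N values of Di)
# and the prob_D_exceeds() sketch from earlier in the paper.
# Steps 2 and 3 (investigate, decide) remain matters of scientific judgment.
import numpy as np
import statsmodels.api as sm

def highlight_influential(X, y, alpha=0.05):
    N, q = X.shape
    fit = sm.OLS(y, X).fit()
    D = fit.get_influence().cooks_distance[0]
    pvals = np.array([prob_D_exceeds(d, N, q) for d in D])
    return np.flatnonzero(pvals < alpha / N)     # Bonferroni across N values
```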
ACKNOWLEDGMENTS
Muller’s work was supported in part by NCI grant P01 CA47 982-04, NIH grant M01 RR000-46-33, and NIEHS grant N01-ES-35356. The authors gratefully acknowledge comments on earlier drafts by anonymous reviewers.
Contributor Information
Keith E. Muller, Dept. of Biostatistics, CB#7400 University of North Carolina Chapel Hill, North Carolina, 27599
Mario Chen Mok, Dept. of Biostatistics, CB#7400 University of North Carolina Chapel Hill, North Carolina, 27599.
BIBLIOGRAPHY
- Atkinson AC. Plots, Transformations, and Regression. Clarendon Press; Oxford: 1985.
- Belsley DA, Kuh E, Welsch RE. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley; New York: 1980.
- Chatterjee S, Hadi AS. Influential Observations, High Leverage Points, and Outliers in Linear Regression. Statistical Science. 1986;1:379–416.
- Chen Mok M. Evaluating Cook’s D Statistic in Theory and Practice: A Simulation Study. Department of Biostatistics, University of North Carolina; Chapel Hill: 1993. Unpublished Master’s Paper.
- Cook RD. Detection of Influential Observations in Linear Regression. Technometrics. 1977;19:15–18.
- Cook RD, Prescott P. On the Accuracy of Bonferroni Significance Levels for Detecting Outliers in Linear Models. Technometrics. 1981;23:59–63.
- Cook RD, Weisberg S. Residuals and Influence in Regression. Chapman and Hall; New York: 1982.
- Farman JC, Gardiner BG, Shanklin JD. Large losses of total ozone in Antarctica reveal seasonal ClOx/NOx interaction. Nature. 1985;315:207–210.
- Hoaglin DC, Welsch RE. The Hat Matrix in Regression and ANOVA. American Statistician. 1978;32:17–22.
- Jensen DR, Ramirez DE. Computing the CDF of Cook’s DI Statistic. In: Prat A, Ripoll E, editors. Proceedings of the 12th Symposium in Computational Statistics; Barcelona, Spain. Instituto de Estadística de Catalunya; 1996. pp. 65–66.
- Jensen DR, Ramirez DE. Some exact properties of Cook’s DI. In: Rao CR, Balakrishnan N, editors. Handbook of Statistics-16: Order Statistics and Their Applications. North-Holland; Amsterdam: 1997. in press.
- Johnson NL, Kotz S. Continuous Univariate Distributions - 1. Wiley; New York: 1970a.
- Johnson NL, Kotz S. Continuous Univariate Distributions - 2. Houghton Mifflin; Boston: 1970b.
- Kennedy WJ, Jr., Gentle JE. Statistical Computing. Marcel Dekker; New York: 1980.
- Kenward M. Surprise, Surprise. New Scientist. 1988;117(1606):16.
- Kleinbaum DG, Kupper LL, Muller KE. Applied Regression Analysis and Other Multivariable Methods. Second Edition. Duxbury Press; Boston: 1988.
- LaMotte LR. A Note on the Role of Independence in t Statistics Constructed From Linear Statistics in Regression Models. American Statistician. 1994;48:238–240.
- Obenchain RL. Letter to the Editor. Technometrics. 1977;19:348–349.
- Sampson AR. A Tale of Two Regressions. Journal of the American Statistical Association. 1974;69:682–689.
- Thisted RA. Elements of Statistical Computing. Chapman and Hall; New York: 1988.
- Weisberg S. Applied Linear Regression. Wiley; New York: 1985.