Abstract
The density power divergence, indexed by a single tuning parameter α, has proved to be a very useful tool in minimum distance inference. The family of density power divergences provides a generalized estimation scheme which includes likelihood-based procedures (represented by the choice α = 0 for the tuning parameter) as a special case. However, under data contamination, this scheme provides several more stable choices for model fitting and analysis (provided by positive values of the tuning parameter α). As larger values of α necessarily lead to a drop in model efficiency, determining the optimal value of α to provide the best compromise between model efficiency and stability against data contamination in any real situation is a major challenge. In this paper, we provide a refinement of an existing technique with the aim of eliminating the dependence of the procedure on an initial pilot estimator. Numerical evidence is provided to demonstrate the very good performance of the method. Our technique has a general flavour, and we expect that similar tuning parameter selection algorithms will work well for other M-estimators, or any robust procedure that depends on the choice of a tuning parameter.
Keywords: Optimal tuning parameter, pilot estimator, summed mean square error, one-step Warwick–Jones algorithm, iterated Warwick–Jones algorithm
1. Introduction
In statistical inference, the two concepts of efficiency and robustness are often at odds, and it is a delicate task for the statistician to balance them both in a suitable manner. While the issue of robustness is a real concern in the present age of big data, this robustness should not come at the cost of a high efficiency loss at the model. Most robust procedures require the choice of a tuning parameter, which determines the trade-off between robustness and efficiency. Selecting this tuning parameter ‘optimally’ is a problem of great practical interest, which can protect the experimenter/statistician in both eventualities.
In this paper, we will concentrate on the density power divergence (DPD) family in [1], which has had a major impact in the area of minimum distance estimation over the last two decades. There have been a few attempts to select the robustness tuning parameter of this family; in particular, the method proposed by [15] has seen application in subsequent data analysis problems, as have others; see, e.g. [3,6,10]. The method described in [5] is also useful in this respect. We will refer to the method in [15] as the Warwick and Jones method, and the method in [5] as the Hong and Kim method. However, in some sense, these methods are still not completely satisfactory, as the first one depends, sometimes quite heavily, on the choice of a pilot estimator and the second one sometimes leads to very non-robust estimators.
In this paper, we make a proposal which, by refining the approach in [15], attempts to remove these deficiencies. This may eliminate the most important drawback in the application of the density power divergence estimator and make it more universally acceptable. However, while we illustrate it with the DPD, the applicability of our method is not limited to this divergence alone. It can, in fact, be applied to all M-estimator-based methods which depend upon the choice of a variable tuning parameter.
To motivate the proposal, we begin with the following example where the relevant problem will be posed and the difficulties with classical analysis will be pointed out. The more appropriate analysis based on our method will be taken up again in Section 5 once our proposal has been described.
Example 1.1 Life Expectancy Data —
This example deals with the relationship between life expectancy at birth (in years) and health spending per capita (in USD PPP, where costs in local currency units are converted to international dollars using purchasing power parity exchange rates) in 17 developed countries (14 European countries, USA, Australia and New Zealand). The data are obtained from the OECD (Organisation for Economic Co-operation and Development) Health Statistics 2017 publication (although the actual data refer to 2015). Normally, higher health spending per capita is expected to lead to higher life expectancy at birth. For these data, if we fit a linear relationship between the logarithm of health spending per capita and life expectancy at birth by the method of least squares, the fitted relationship, surprisingly, has a negative slope.
Scrutiny of the data indicates that the observation corresponding to the USA represents a strong outlier in relation to the rest of the observations. It is clear that a better understanding of the general relationship between health spending and life expectancy at birth would be provided by a robustly fitted regression line which respects the expected positive relationship between health spending per capita and life expectancy at birth in most developed countries. The DPD method, described and illustrated in detail in the subsequent sections, provides one principled method of doing so but, like most robust methods, depends on an unspecified tuning parameter. For an objective analysis of the data, we also require the tuning parameter to be automatically, and reliably, specified. This is what the work of this article provides. (The alternative method of simply removing the outlier is available in this case, but outliers are not always so obvious and cannot be automatically accommodated.) In Section 5, we will take up this example again to demonstrate the selection of the optimal value of the tuning parameter which gives robust estimates of the regression coefficients through refitting a linear regression model to these data based on the DPD using our proposed algorithm.
The rest of the paper is organized as follows. In Section 2, we give a brief description of the density power divergence and the corresponding estimator; we also introduce here the two methods mentioned above. In Section 3, we describe the numerical implementation of these methods and propose our suggested refinement. In Section 4, we compare the three algorithms and put forward detailed arguments concerning the practical behaviour of our refined estimator, while in Section 5 we illustrate our method with the data considered in Example 1.1 and a number of other real-life data examples with and without outliers. In Section 6, we compare the three methods described in Section 3 through simulation. In Section 7, we discuss the computational cost associated with our proposal, and finally some concluding remarks are presented in Section 8.
2. The density power divergence (DPD)
The density power divergence is a family of divergence measures between two probability density functions which are defined with respect to the same measure. This family was developed by [1], the density power divergence between two densities g and f being defined as
$$ d_\alpha(g,f) = \int \left\{ f^{1+\alpha}(x) - \left(1+\frac{1}{\alpha}\right) g(x)\, f^{\alpha}(x) + \frac{1}{\alpha}\, g^{1+\alpha}(x) \right\} dx, \qquad (1) $$
indexed by a non-negative tuning parameter α. In practice, however, the value of α is restricted to lie in $[0, 1]$, since even larger values of α lead to estimators with unacceptably low efficiency. For the limiting case $\alpha \rightarrow 0$, the divergence in Equation (1) converges to
$$ d_0(g,f) = \int g(x) \log\frac{g(x)}{f(x)}\, dx. \qquad (2) $$
Note that $d_0(g,f)$ is a version of the Kullback–Leibler divergence.
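To make the limit explicit (a standard calculation, using the form of the divergence in Equation (1)), one can rearrange the terms and let α tend to 0:

```latex
% Rearranging d_alpha(g,f) and letting alpha -> 0:
\begin{aligned}
d_\alpha(g,f)
  &= \int \Big( f^{1+\alpha} - g\, f^{\alpha} \Big)
     + \frac{1}{\alpha} \int g \left( g^{\alpha} - f^{\alpha} \right).
\end{aligned}
% As alpha -> 0, the first integral tends to \int f - \int g = 0, while
% (g^{\alpha} - f^{\alpha})/\alpha -> \log g - \log f pointwise, so
\lim_{\alpha \downarrow 0} d_\alpha(g,f)
  = \int g \log\frac{g}{f} = d_0(g,f).
```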
Suppose that we have an independently and identically distributed (i.i.d.) sample $X_1, \ldots, X_n$ obtained from the true unknown distribution G (having density g), while $\mathcal{F} = \{ f_\theta : \theta \in \Theta \}$ represents a parametric family of densities. Our aim is to model the unknown true density g by the parametric class of densities $\mathcal{F}$. When using the DPD to do this modelling, we view g in Equation (1) as the data density (to be estimated non-parametrically from the sample data), and f as the model density, to be replaced by the model element $f_\theta$. The minimum DPD estimator of the parameter θ can be derived by minimizing the empirical version of the objective function given in Equation (1) after the above replacements. The last term on the right-hand side of this equation does not involve the parameter θ; hence, we eliminate this term from the objective function. The first term on the right-hand side of Equation (1) is independent of g, whereas the second term is linear in g. Hence, for a given tuning parameter α, we will actually minimize the empirical objective function
$$ H_n(\theta) = \int f_\theta^{1+\alpha}(x)\,dx - \left(1+\frac{1}{\alpha}\right) \frac{1}{n}\sum_{i=1}^{n} f_\theta^{\alpha}(X_i) = \int f_\theta^{1+\alpha}(x)\,dx - \left(1+\frac{1}{\alpha}\right) \int f_\theta^{\alpha}(x)\, dG_n(x), \qquad (3) $$
where $G_n$ is the empirical distribution function, to obtain the minimum DPD estimator of θ.
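As a concrete illustration (ours, not part of the paper), the minimization of the objective in Equation (3) can be carried out numerically. The sketch below fits a two-parameter normal model $N(\mu, \sigma^2)$, using the closed form $\int f_\theta^{1+\alpha} = (2\pi\sigma^2)^{-\alpha/2}(1+\alpha)^{-1/2}$; the function names and starting values are our own choices.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def dpd_objective(par, x, alpha):
    """Empirical DPD objective of Equation (3) for the N(mu, sigma^2) model."""
    mu, sigma = par
    if sigma <= 0:
        return np.inf
    # closed form of the integral of f^(1+alpha) for a normal density
    int_term = (2.0 * np.pi * sigma**2) ** (-alpha / 2.0) / np.sqrt(1.0 + alpha)
    data_term = np.mean(norm.pdf(x, mu, sigma) ** alpha)
    return int_term - (1.0 + 1.0 / alpha) * data_term

def mdpde_normal(x, alpha):
    """Minimum DPD estimate of (mu, sigma), started from robust values."""
    mu0 = np.median(x)
    s0 = 1.4826 * np.median(np.abs(x - mu0))  # MAD-based scale start
    res = minimize(dpd_objective, x0=[mu0, s0], args=(x, alpha),
                   method="Nelder-Mead")
    return res.x
```

For α → 0 the objective is undefined as written (because of the 1/α factor); in that limiting case the maximum likelihood estimates can be used directly.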
Since the divergence attains its minimum value zero when g and f are identical, it is clear that the minimum DPD estimator is Fisher consistent. In the following, we will let $\theta_\alpha^g$ represent the 'best fitting parameter', defined by $\theta_\alpha^g = \arg\min_{\theta \in \Theta} d_\alpha(g, f_\theta)$.
2.1. The proposed method
As α tends to 0, the DPD converges to the Kullback–Leibler divergence $d_0(g,f)$, and given a sequence of i.i.d. observations $X_1, \ldots, X_n$, the corresponding empirical divergence measure equals the negative of the log-likelihood (plus a constant). Thus, the maximum likelihood estimator, asymptotically the most efficient estimator at the model under standard regularity conditions, belongs to the class of minimum DPD estimators. Larger values of α provide greater robustness and outlier stability, although the efficiency decreases with increasing α. Since robustness is a prime concern for us, we do not necessarily assume that the true distribution G belongs to the model; rather, we acknowledge that in reality, small deviations from the model are expected. At the same time, we hope to develop a procedure where these small deviations would not seriously degrade its statistical utility. Large values of α also protect the procedure against instability due to small deviations, but still at the cost of a drop in model efficiency. We, therefore, wish to choose a data-driven value of α in an 'optimal' way which balances the concerns of robustness and efficiency. We wish to choose a large value of α only when it is necessary.
Consider the set-up described in the previous subsection. Under differentiability of the model, the objective function in Equation (3) will lead to the estimating equation
$$ \frac{1}{n}\sum_{i=1}^{n} u_\theta(X_i)\, f_\theta^{\alpha}(X_i) - \int u_\theta(x)\, f_\theta^{1+\alpha}(x)\,dx = 0, \qquad (4) $$
where $u_\theta(x) = \partial \log f_\theta(x) / \partial \theta$ denotes the likelihood score function. The solution of this equation generates $\hat\theta_\alpha$, the minimum DPD estimator (MDPDE) for a fixed α. Now, under certain regularity conditions given in [1],
$$ \sqrt{n}\,\big(\hat\theta_\alpha - \theta_\alpha^g\big) \xrightarrow{\;d\;} N\big(0,\; J^{-1} K J^{-1}\big), \qquad (5) $$
where $\theta_\alpha^g$ denotes the best fitting parameter corresponding to a pre-fixed α, and,
$$ J = \int u_\theta(x)\, u_\theta^{T}(x)\, f_\theta^{1+\alpha}(x)\,dx + \int \big\{ i_\theta(x) - \alpha\, u_\theta(x)\, u_\theta^{T}(x) \big\} \big\{ g(x) - f_\theta(x) \big\} f_\theta^{\alpha}(x)\,dx, \qquad (6) $$
$$ K = \int u_\theta(x)\, u_\theta^{T}(x)\, f_\theta^{2\alpha}(x)\, g(x)\,dx - \xi\, \xi^{T}, \qquad (7) $$
where $i_\theta(x) = -\partial u_\theta(x)/\partial \theta$, $\xi = \int u_\theta(x)\, f_\theta^{\alpha}(x)\, g(x)\,dx$, and the superscript T represents 'transpose'.
We have already noted that the minimum DPD procedure is Fisher consistent, and when g belongs to the model, so that $g = f_\theta$ for some particular value of θ, simplified expressions for J and K may be obtained by replacing g with the model density $f_\theta$ in the above expressions. In this case, $\theta_\alpha^g = \theta$ for all α. As the robustness issue is a matter of concern for us, we will allow g to be the contaminated density $(1-\epsilon) f_\theta + \epsilon\, \delta_y$, where δ is the Dirac delta function, y the contamination point and ε the contamination proportion; this was also the approach taken by [15]. Here, θ is the true target parameter and estimators will be judged by their mean square error around θ. In general, of course, g may not involve $f_\theta$ per se but, to keep a clear focus in our presentations, we will present almost all of our results with this contamination formulation in mind. (In the simulations of Section 6, we replace the delta function contamination with alternative contamination.) As in [15], we will evaluate the performance of the estimator through its summed mean square error, which has the asymptotic formula
$$ \mathrm{MSE}(\alpha) = (\theta_\alpha^g - \theta)^{T}(\theta_\alpha^g - \theta) + \frac{1}{n}\,\mathrm{tr}\big( J^{-1} K J^{-1} \big), \qquad (8) $$
where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix.
In the [5] approach, the relevant objective function is the estimated asymptotic variance of the estimator. These authors, therefore, drop the (squared) bias component in the objective function considered in Equation (8). We will discuss the pros and cons of these methods in the following sections.
3. Search for the ‘optimal’ estimator
In the following, we discuss the Warwick and Jones method and present our algorithm as a refined version of it. The original [15] suggestion for the selection of the optimal α consists of the following steps.
- First, the asymptotic variance is estimated by substituting $\theta$ with $\hat\theta_\alpha$ and by replacing the true distribution G with the empirical distribution $G_n$ in the forms of J and K, to get the estimates $\hat{J}$ and $\hat{K}$, where
$$ \hat{J} = \int u_{\hat\theta_\alpha} u_{\hat\theta_\alpha}^{T} f_{\hat\theta_\alpha}^{1+\alpha}\,dx + \frac{1}{n}\sum_{i=1}^{n} \big\{ i_{\hat\theta_\alpha}(X_i) - \alpha\, u_{\hat\theta_\alpha}(X_i)\, u_{\hat\theta_\alpha}^{T}(X_i) \big\} f_{\hat\theta_\alpha}^{\alpha}(X_i) - \int \big\{ i_{\hat\theta_\alpha} - \alpha\, u_{\hat\theta_\alpha} u_{\hat\theta_\alpha}^{T} \big\} f_{\hat\theta_\alpha}^{1+\alpha}\,dx, $$
$$ \hat{K} = \frac{1}{n}\sum_{i=1}^{n} u_{\hat\theta_\alpha}(X_i)\, u_{\hat\theta_\alpha}^{T}(X_i)\, f_{\hat\theta_\alpha}^{2\alpha}(X_i) - \hat\xi\,\hat\xi^{T}, \qquad \hat\xi = \frac{1}{n}\sum_{i=1}^{n} u_{\hat\theta_\alpha}(X_i)\, f_{\hat\theta_\alpha}^{\alpha}(X_i). \qquad (9) $$
Thus, the contribution of the variances to the summed MSE is estimated as
$$ \frac{1}{n}\, \hat{J}^{-1} \hat{K} \hat{J}^{-1}, \qquad (10) $$
or, in the multi-parameter case,
$$ \frac{1}{n}\, \mathrm{tr}\big( \hat{J}^{-1} \hat{K} \hat{J}^{-1} \big). \qquad (11) $$
To estimate the asymptotic bias, $\theta_\alpha^g$ is substituted with $\hat\theta_\alpha$ in the bias part of Equation (8). However, for the unknown true parameter $\theta$, some suitable pilot estimator has to be used.
- The estimates of variance and (squared) bias are added to get the summed empirical MSE (as a function of the tuning parameter α and the pilot estimator).
- The summed empirical MSE is minimized over a fine grid of α values to obtain the optimal value of α as a function of the pilot estimator.
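To fix ideas, the steps above can be sketched for the simplest one-parameter case, a $N(\theta, 1)$ model with known scale, with the empirical forms of J, K and the bias following Equations (6)–(8) with G replaced by the empirical distribution. This is our own illustrative code (function names and interface are ours), not the authors' implementation.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def mdpde_mean(x, alpha):
    """MDPDE of theta in the N(theta, 1) model (sigma known)."""
    if alpha == 0:
        return float(np.mean(x))  # the MLE
    obj = lambda t: -np.mean(np.exp(-alpha * (x - t) ** 2 / 2.0))
    return minimize_scalar(obj, bounds=(x.min(), x.max()), method="bounded").x

def summed_mse(alpha, x, pilot):
    """One-step WJ criterion: squared bias (against the pilot) + est. variance."""
    n = len(x)
    theta = mdpde_mean(x, alpha)
    f = lambda t: norm.pdf(t, theta, 1.0)
    u = lambda t: t - theta  # score function of the normal mean
    # J-hat: model integral plus an empirical-minus-model correction term
    j_model = quad(lambda t: u(t) ** 2 * f(t) ** (1 + alpha), -np.inf, np.inf)[0]
    corr_emp = np.mean((1.0 - alpha * u(x) ** 2) * f(x) ** alpha)
    corr_mod = quad(lambda t: (1.0 - alpha * u(t) ** 2) * f(t) ** (1 + alpha),
                    -np.inf, np.inf)[0]
    J = j_model + corr_emp - corr_mod
    # K-hat: purely empirical moments
    xi = np.mean(u(x) * f(x) ** alpha)
    K = np.mean(u(x) ** 2 * f(x) ** (2 * alpha)) - xi ** 2
    return (theta - pilot) ** 2 + K / (n * J ** 2)

def owj_select(x, pilot, grid=np.arange(0.0, 1.01, 0.01)):
    """Minimize the empirical summed MSE over a fine grid of alpha values."""
    return grid[int(np.argmin([summed_mse(a, x, pilot) for a in grid]))]
```

With contaminated data and a robust pilot, the criterion is typically minimized at a strictly positive α; with clean data it is drawn towards small α.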
In the rest of the paper, we will refer to the above approach as the 'one-step Warwick–Jones algorithm' or, in short, the OWJ algorithm, as opposed to the approach (which we will shortly describe) that uses the parameter estimate at a given step as the pilot estimate in the next step. Note that in the OWJ algorithm the choice of the pilot can have a significant impact on the optimal tuning parameter, as the pilot invariably draws the final estimator towards itself. On the basis of repeated simulations, Warwick and Jones [15] suggested the minimum $L_2$ distance estimator (the MDPDE with α = 1) as the pilot estimator. Ghosh and Basu [4] preferred a one-step algorithm with a different fixed pilot of their choice. But the essential issue of pilot-dependence is not bypassed in either case.
We take the view that if we are ready to accept the estimate obtained after the one-step algorithm as the 'optimal' estimate for the true unknown $\theta$, we should also be prepared to view it as an updated pilot estimate for the continuation of the process. Thus, we propose to start the process with a suitable robust pilot estimator but, instead of terminating the algorithm after one step, the estimator obtained at the end of the step should be used as the updated pilot for the next step. The process should be continued until there is no further change in the estimate of $\theta$ (or, correspondingly, the estimate of the tuning parameter). If it can be demonstrated that the final converged estimate is independent of the initial choice of the pilot, it will provide us with an 'optimal', pilot-independent estimate. In the following, we will refer to our proposed algorithm as the 'iterated WJ algorithm' or, in short, the IWJ algorithm.
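Our reading of the iteration, in generic Python; the `owj_step` callable (any implementation of the one-step selection, returning the selected α and the corresponding MDPDE) is a hypothetical interface of our own:

```python
def iterated_wj(owj_step, pilot, max_iter=100):
    """Iterate the one-step WJ rule, feeding the optimal estimate at each
    step back in as the pilot, until the selected tuning parameter
    (equivalently, the estimate) shows no further change."""
    alpha_prev = None
    for _ in range(max_iter):
        alpha, estimate = owj_step(pilot)
        if alpha == alpha_prev:  # converged on the alpha-grid
            break
        alpha_prev, pilot = alpha, estimate
    return alpha, estimate
```

On a finite α-grid the selected value can only take finitely many values, so exact equality is a workable stopping rule; `max_iter` guards against cycling.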
4. The three algorithms: some comparisons
We now have three algorithms at our disposal to arrive at an estimate of the optimal tuning parameter. Through the present commentary, we want to set up the proper context where these methods may be compared in terms of their usefulness.
First, we consider the Hong and Kim (HK) algorithm. It is clear that this algorithm will perform well in case of ‘pure’, outlier-free data since the asymptotic variance is the only relevant quantity here. In case of contaminated data also, it often (but not always) works well. This may be explained as follows. A good robust estimator may often be expected to be closer to the true parameter under contamination compared to a non-robust estimator which is likely to show more variability. Thus, a robust estimator may be expected to have a smaller variance compared to the non-robust estimator, and, in many cases, the minimization of the asymptotic variance will recover a reasonably robust solution.
However, most robust estimators are devised primarily to control the bias under contamination; since the HK objective function has no bias component, there is no absolute guarantee that the criterion of low estimated asymptotic variance will necessarily lead to the desired optimal solution, or even a robust solution. So although it is not a frequent phenomenon, the HK algorithm will, occasionally, fail and produce a highly non-robust solution.
Lack of robustness, on the other hand, is not a problem for the OWJ algorithm. In this case, any robust initial pilot – such as an estimator within the DPD family which may work as a robust estimator of θ in its own right – will produce a robust solution. The disadvantage here is that different robust pilot estimators can lead us to distinct, sometimes fairly disparate, solutions. In addition, the OWJ solution is more conservative than the HK solution under pure data and frequently produces a larger value of α as the optimal solution compared to the HK method in order to give adequate importance to the robustness provision.
The IWJ method, as we will see in the following sections, appears to overcome this pilot-dependence. We will loosely consider all MDPDEs with moderate to large values of α as potential robust pilots. Our numerical illustrations, in each of the considered cases, will demonstrate that all robust pilots lead to the same IWJ optimal solution. Thus, the IWJ method eliminates pilot-dependence.
Yet, the IWJ algorithm produces the same solution as the HK algorithm in a large number of cases. We discuss this in detail later in this section, but it imposes on us the responsibility to justify why the iterated method would still give a superior solution. We will provide some glimpses with real examples to show how the IWJ method produces a good compromise over the different scenarios based on estimated asymptotic variances. All these real data examples will be analysed over different initial pilots in Section 5. In this section, we will look at the estimated asymptotic variance curve as a function of α; more generally, we will consider the trace of the asymptotic covariance matrix in multi-parameter situations, as given in Equation (11). We consider the following cases; the IWJ optimal values in the following correspond to robust pilots, over which they are invariant.
Case 1: The curve of the estimated asymptotic variance has a single, global minimum. In this case, the IWJ algorithm and the HK algorithm lead to identical solutions. This happens, for example, in the cases of the Drosophila data (Day 28) and the Peritonitis data (Examples 5.2 and 5.3); see Figure 1.
Case 2: The curve has more than one minimum, with the global minimum at α = 0 or at some α close to 0 and no other local minimum to the left of it. In such situations, the HK solution corresponding to the global minimum of the asymptotic variance generally provides a non-robust solution. Starting from a robust pilot, the IWJ method converges to a larger value of α corresponding to a local minimum with a robust solution. Here, the HK optimal value and the IWJ optimal value are distinct. This happens, for example, in the cases of the Star Cluster data and the Salinity data (Examples 5.8 and 5.10); see Figure 2.
Case 3: The curve has more than one minimum, with the global minimum at a non-zero value of α and no other local minimum to the right of it. Here, the IWJ as well as the HK algorithm correspond to the global minimum and thus the solutions are identical. This happens, for example, in the cases of Short's data and the Telephone-line Fault data (Examples 5.4 and 5.6); see Figure 3.
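The three cases can be mimicked with a toy variance curve (entirely synthetic, our own construction): the HK rule picks the global minimizer of the estimated variance, while Theorem 4.1 below suggests that, in idealized form, the IWJ iteration behaves like a local descent on the variance curve from the pilot's position.

```python
import numpy as np

def hk_choice(alpha_grid, var_hat):
    """HK rule: global minimizer of the estimated asymptotic variance."""
    return alpha_grid[int(np.argmin(var_hat))]

def iwj_limit_from(alpha_grid, var_hat, start_idx):
    """Idealized IWJ dynamics: from the current pilot's position, step one
    grid point towards lower variance until a local minimum of the
    variance curve is reached (toy illustration only)."""
    i = start_idx
    while True:
        if i > 0 and var_hat[i - 1] < var_hat[i]:
            i -= 1
        elif i < len(var_hat) - 1 and var_hat[i + 1] < var_hat[i]:
            i += 1
        else:
            return alpha_grid[i]
```

With a bimodal curve whose global minimum sits near α = 0 (Case 2), the HK rule lands at the non-robust minimum while the descent from a large-α pilot stops at the robust local minimum.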
Theorem 4.1
Suppose that the variance function has a unique minimum at $\alpha^{*}$. Then if the current pilot is the MDPDE at a tuning parameter $\alpha_0$ distinct from $\alpha^{*}$, the IWJ algorithm must take a step in the direction of $\alpha^{*}$ and not remain stuck at $\alpha_0$.
Proof.
Consider the graph in Figure 4, and let $\alpha^{*}$ denote the unique minimizer of the estimated variance function $\widehat{V}(\alpha)$. According to the IWJ algorithm, the pilot at the i-th step, namely $\hat\theta_{\alpha_{(i)}}$, will be the MDPDE corresponding to the argmin (over α) of
$$ \widehat{\mathrm{MSE}}_i(\alpha) = \big(\hat\theta_\alpha - \hat\theta_{\alpha_{(i-1)}}\big)^{T}\big(\hat\theta_\alpha - \hat\theta_{\alpha_{(i-1)}}\big) + \widehat{V}(\alpha). \qquad (12) $$
Whenever $\alpha_{(i)} = \alpha_{(i-1)}$, the corresponding tuning parameter will be the optimal solution. Suppose that $\hat\theta_{\alpha_0}$, with $\alpha_0 > \alpha^{*}$, is the current solution (and hence the pilot for the next step) of the IWJ process. At the next step, notice that, at any $\alpha > \alpha_0$, $\widehat{V}(\alpha)$ is greater than $\widehat{V}(\alpha_0)$ and, since $\hat\theta_{\alpha_0}$ is now the pilot, the bias at any such α is non-zero. Thus, at any such α, $\widehat{\mathrm{MSE}}(\alpha) > \widehat{\mathrm{MSE}}(\alpha_0)$ and hence, the algorithm cannot take a step to the right of $\alpha_0$.
We will show that the algorithm must take a (positive) step in the direction of $\alpha^{*}$ and will not stay put at $\alpha_0$. Let us choose some α close but to the left of $\alpha_0$ and between $\alpha^{*}$ and $\alpha_0$, such that
$$ \widehat{V}(\alpha) - \widehat{V}(\alpha_0) < 0. \qquad (13) $$
Here, $\epsilon = \alpha_0 - \alpha > 0$. Now, considering a Taylor series expansion (up to first order) of $\hat\theta_\alpha$ around $\alpha_0$, we get
$$ \widehat{\mathrm{MSE}}(\alpha) - \widehat{\mathrm{MSE}}(\alpha_0) = \epsilon^{2} \left\| \frac{\partial \hat\theta_\alpha}{\partial \alpha}\bigg|_{\alpha_0} \right\|^{2} + \big\{ \widehat{V}(\alpha) - \widehat{V}(\alpha_0) \big\} + o(\epsilon^{2}). \qquad (14) $$
Evidently, the term $\widehat{V}(\alpha) - \widehat{V}(\alpha_0)$ is negative, and it is of first order in ε since $\alpha_0 \neq \alpha^{*}$, whereas the squared-bias term is of second order. From expression (14), we can say that, for small enough ε,
$$ \widehat{\mathrm{MSE}}(\alpha) < \widehat{\mathrm{MSE}}(\alpha_0). \qquad (15) $$
Thus, the estimated MSE function is strictly smaller to the left of $\alpha_0$ for small enough ε. Hence, the algorithm must take a step to the left rather than staying put at $\alpha_0$. On the other hand, if the pilot is $\hat\theta_{\alpha^{*}}$, it is obvious that the algorithm cannot make any move in either direction any more.
One can similarly prove that if the current pilot is on the left of $\alpha^{*}$, the IWJ algorithm must take a step to the right at the next stage.
Figure 1. Asymptotic variance plots corresponding to Case 1.
Figure 2. Asymptotic variance plots corresponding to Case 2.
Figure 3. Asymptotic variance plots corresponding to Case 3.
Figure 4. Asymptotic variance of the MDPDEs against α.
5. Applications
To further study the IWJ algorithm, we provide an extensive numerical study involving real data that conform to many different models and generally contain one or more outliers. In each example, we use (at least) eleven initial pilot estimates, which will be the MDPDEs with α = 0.01, 0.1, 0.2, …, 1. As all the pilot estimates are from the minimum DPD class, we will let the expression 'pilot α' indicate that the pilot estimate is the MDPDE with that value of α. Here, for each pilot, we have considered a fine grid of 101 values of α – 0, 0.01, 0.02, …, 1.0 – over which we are to find the optimal α by minimizing the empirical version of the objective function in Equation (8). In the tables, we will present the sequence over which the estimates of the tuning parameter progress in the iterated algorithm for each initial choice of the pilot.
For brevity, the background of each of these datasets and the resulting parameter estimates are described in the Supplementary Material. The full data for all the examples are also presented in the Supplementary Material, except where given in the manuscript itself. Our consistent observation in these examples is that the IWJ algorithm provides the same optimal estimate over a large range of initial pilot values of α, a range that always contains the moderate to large pilot values.
In our real data examples, we will deal with both i.i.d. data models and linear regression models with normal errors. The approach to handling the i.i.d. data case has been described in the previous sections. In the linear regression model, our observations are conditionally independent and, given the covariate vector $x_i$, we have $y_i \sim N(x_i^{T}\beta, \sigma^2)$, $i = 1, \ldots, n$. Hence the observations are not identically distributed. Here, we are interested in the MDPDEs of $(\beta, \sigma^2)$. The corresponding estimating equations, as well as the asymptotic covariance matrices, are given in [3], based on which the criterion in Equation (8) can be constructed. See [3] for an extended discussion of the linear regression case.
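For the regression examples below, a minimal sketch of the minimum DPD fit (our own code, following the fixed-design objective of [3] specialized to normal errors; the interface and starting values are our choices, not the authors'):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def dpd_regression(X, y, alpha):
    """Minimum DPD fit of y = X beta + e, e ~ N(0, sigma^2).
    X should already contain a column of ones for the intercept."""
    n, p = X.shape

    def objective(par):
        beta, sigma = par[:p], par[p]
        if sigma <= 0:
            return np.inf
        resid = y - X @ beta
        # closed-form integral term, identical for every observation
        int_term = (2.0 * np.pi * sigma**2) ** (-alpha / 2.0) / np.sqrt(1.0 + alpha)
        data_term = np.mean(norm.pdf(resid, 0.0, sigma) ** alpha)
        return int_term - (1.0 + 1.0 / alpha) * data_term

    beta0, *_ = np.linalg.lstsq(X, y, rcond=None)        # least-squares start
    r0 = y - X @ beta0
    s0 = 1.4826 * np.median(np.abs(r0 - np.median(r0)))  # robust scale start
    res = minimize(objective, np.append(beta0, s0), method="Nelder-Mead",
                   options={"maxiter": 10000})
    # restart once from the first solution, a cheap safeguard for Nelder-Mead
    res = minimize(objective, res.x, method="Nelder-Mead",
                   options={"maxiter": 10000})
    return res.x[:p], res.x[p]
```

Under heavy contamination a robust pilot fit for `beta0` may be preferable to the least-squares start used here.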
We first complete our study of the example introduced for motivational purposes in Section 1.
Example 5.1 Life Expectancy Data (Continued) —
The Life Expectancy dataset has already been introduced and discussed in the Introduction. The application of the IWJ algorithm leads to an optimal tuning parameter that yields robust estimates of the coefficients of the linear regression model. The scatter plot and the fitted lines are given in Figure 5. As the figure shows, the robust linear regression fit based on the optimal DPD tuning parameter is quite different from the least-squares fit to the full dataset (and very similar to the least-squares fit to the data when the outlier is removed), displaying a much more reasonable and informative positive slope. The iterative steps of the optimal tuning parameter selection rule are sequenced in Table 1.
Figure 5. A few different linear regression fits for Life Expectancy Data.
Table 1. Tuning parameter sequence (Life Expectancy Data).
| Pilot α | Iteration 1 | Iteration 2 | Iteration 3 |
|---|---|---|---|
| 0.01 | | | |
We now further illustrate the performance of the tuning parameter selection method through a host of other real data examples.
5.1. I.I.D. data examples
Example 5.2 Drosophila Data —
We consider a segment of data on drosophila (a type of fruit fly). The experimental protocol is described in [17]. The data were previously analysed by [13], and are presented in Table 2. Here, the variable of interest is the number of daughters having a recessive lethal mutation on the X chromosome, which was noted for two experimental runs, one on day 28 and the other on day 177. We fitted a Poisson model to the frequency data for both experimental runs.
On day 28, there are two moderate outliers at x = 3 and x = 4. On day 177, there is a single but huge outlier at x = 91. For these data, the results for the optimal tuning parameter selection through the IWJ algorithm are given in Tables 3 and 4. All the values from the point where the tuning parameter shows no further change are given in bold font.
In Tables 3 and 4, we observe that the final converged value of the tuning parameter is the same for every initial pilot under the IWJ algorithm. The HK algorithm produces the same optimal solution in these two cases. The corresponding parameter estimates, presented in the Supplementary Material, are seen to be adequately robust.
Table 2. Recessive lethal count.
| No. of daughters | 0 | 1 | 2 | 3 | 4 | 91 |
|---|---|---|---|---|---|---|
| No. of males (Day 28) | 23 | 3 | 0 | 1 | 1 | 0 |
| No. of males (Day 177) | 23 | 7 | 3 | 0 | 0 | 1 |
Table 3. Tuning parameter sequence (Drosophila Data, day 28).
| Pilot α | Iteration 1 | Iteration 2 | Iteration 3 | Iteration 4 | Iteration 5 | Iteration 6 | Iteration 7 |
|---|---|---|---|---|---|---|---|
| 0.01 | 0.08 | 0.16 | 1 | 0.99 | 0.99 | ||
| 0.1 | 0.18 | 1 | 0.99 | 0.99 | 0.99 | ||
| 0.2−0.3 | 1 | ||||||
| 0.4 | 0.83 | 0.89 | 0.93 | 0.95 | 0.96 | ||
| 0.5 | 0.78 | 0.86 | 0.91 | 0.94 | 0.96 | ||
| 0.6 | 0.79 | 0.87 | 0.91 | 0.94 | 0.96 | ||
| 0.7 | 0.83 | 0.89 | 0.93 | 0.95 | 0.96 | ||
| 0.8 | 0.87 | 0.91 | 0.94 | 0.96 | |||
| 0.9 | 0.93 | 0.95 | 0.96 | ||||
| 1 | |||||||
Table 4. Tuning parameter sequence (Drosophila Data, day 177).
| Pilot α | Iteration 1 | Iteration 2 | Iteration 3 |
|---|---|---|---|
| 0.01 | 0.02 | ||
| 0.1−1 |
Example 5.3 Peritonitis Data —
This example involves the incidence of peritonitis in 390 kidney patients. The data are available in Table 2.4 in [2]. A geometric model with success probability θ has been fitted to these frequency data. Two observations of this dataset are mild to moderate outliers. In this case, the final optimal solutions for the tuning parameters under the IWJ algorithm are all the same except for the most non-robust initial pilot (Table 5). In this example, the HK solution is marginally different from the IWJ optimal value. This is a numerical difference caused by the discreteness of the α-grid. If we consider a much finer grid of 100,000 values over $[0, 1]$ instead, then the iterated WJ optimal coincides with the HK optimal value.
Table 5. Tuning parameter sequence (Peritonitis Data).
| Pilot α | Iteration 1 | Iteration 2 | Iteration 3 | Iteration 4 | Iteration 5 |
|---|---|---|---|---|---|
| 0.01 | 0.03 | 0.04 | |||
| 0.1 | 0.07 | 0.06 | |||
| 0.2 | 0.1 | 0.07 | |||
| 0.3 | 0.12 | 0.08 | |||
| 0.4 | 0.14 | 0.08 | |||
| 0.5−0.6 | 0.16 | 0.09 | 0.07 | ||
| 0.7 | 0.18 | 0.09 | 0.07 | ||
| 0.8−0.9 | 0.19 | 0.1 | 0.07 | ||
| 1 | 0.2 | 0.1 | 0.07 |
Example 5.4 Short's Data —
These data are presented in [14]. We have assumed that the raw observations follow a two-parameter normal model. In Table 6, the tuning parameter sequences for the IWJ algorithm for the simultaneous estimation of the two normal parameters are presented.
The estimated asymptotic summed variance curve is given in the left panel of Figure 4. Here, the HK, OWJ and IWJ algorithms all lead to the same optimal solution provided, in the case of IWJ, that the pilot value is 0.3 or larger (Table 6).
Table 6. Tuning parameter sequence (Short's Data).
| Pilot α | Iteration 1 | Iteration 2 | Iteration 3 | Iteration 4 | Iteration 5 | Iteration 6 | Iteration 7 |
|---|---|---|---|---|---|---|---|
| 0.01 | |||||||
| 0.1 | 0.04 | ||||||
| 0.2 | 0.18 | 0.16 | 0.13 | 0.09 | 0.04 | ||
| 0.3 | 0.92 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 |
| 0.4−1 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 |
Example 5.5 Newcomb's Data —
These light speed data were also analysed by [14]. We model the data with a two-parameter normal distribution. In this case, each pilot with α ≥ 0.1 leads to the same eventual optimal tuning parameter (Table 7) under the IWJ algorithm (which is also the HK optimal).
Table 7. Tuning parameter sequence (Newcomb's Data).
| Pilot α | Iteration 1 | Iteration 2 | Iteration 3 | Iteration 4 | Iteration 5 |
|---|---|---|---|---|---|
| 0.01 | 0.01 | 0.01 | 0 | ||
| 0.1 | 0.16 | 0.2 | 0.22 | ||
| 0.2 | 0.22 | 0.23 | |||
| 0.3 | 0.24 | 0.23 | |||
| 0.4 | 0.26 | 0.24 | |||
| 0.5 | 0.28 | 0.24 | |||
| 0.6 | 0.3 | 0.24 | |||
| 0.7 | 0.33 | 0.25 | |||
| 0.8 | 0.35 | 0.25 | |||
| 0.9 | 0.39 | 0.26 | 0.24 | ||
| 1 | 0.42 | 0.26 | 0.24 |
Example 5.6 Telephone-line Fault Data —
These data have been analysed by Welch [16], and here we fit a two-parameter normal model to them. They contain one enormous outlier. The optimal IWJ solution is the same for all robust pilots (Table 8); it equals the HK solution and differs slightly from the OWJ optimal.
Table 8. Tuning parameter sequence (Telephone-line Fault Data).
| Pilot α | Iteration 1 | Iteration 2 | Iteration 3 |
|---|---|---|---|
| 0.01 | |||
| 0.1 | 0.13 | ||
| 0.2−0.4 | 0.2 | ||
| 0.5−0.7 | 0.21 | ||
| 0.8−1 | 0.22 |
Example 5.7 Insulating Fluid Data —
Here, we provide a non-normal example among continuous models. The data contain times to breakdown of an insulating fluid between electrodes, recorded at seven different voltages. The data are presented in [9]. Here, we have taken the times associated with insulation corresponding to 34 kV, which are assumed to follow an exponential distribution. The data contain one extreme outlier and four moderately severe outliers. If the initial pilot is non-robust, the optimal MDPDE is the maximum likelihood estimate (MLE; corresponding to α = 0). On the other hand, initial pilots corresponding to moderate to large α lead us to the minimum distance estimate (Table 9) as the optimal one, which is also the HK optimal.
Table 9. Tuning parameter sequence (Insulating Fluid Data).
| Pilot α | Iteration 1 | Iteration 2 | Iteration 3 | Iteration 4 | Iteration 5 |
|---|---|---|---|---|---|
| 0.01 | |||||
| 0.1 | 0.04 | ||||
| 0.2 | 0.14 | 0.09 | 0.03 | ||
| 0.3−1 |
5.2. Linear regression models
Example 5.8 Hertzsprung–Russell Star Cluster Data —
The Hertzsprung–Russell diagram of the star cluster CYG OB1, containing 47 stars in the direction of Cygnus, has been given in [11]. The results are given in Table 10.
Here, the IWJ optimal value of α differs from the HK optimal value, which gives a distinct solution. The star cluster data asymptotic variance plot is given in the left panel of Figure 2.
Table 10. Tuning parameter sequence (Star Cluster Data).
| Pilot α | Iteration 1 | Iteration 2 | Iteration 3 | Iteration 4 |
|---|---|---|---|---|
| 0.01−0.2 | ||||
| 0.3 | 0.72 | 0.76 | ||
| 0.4 | 0.74 | |||
| 0.5−0.7 | 0.75 | |||
| 0.8−1 |
Example 5.9 Gesell Adaptive Score Data —
These data were analysed by [7]. The dataset consists of data corresponding to 21 children, where the explanatory variable is the age (in months) at which a child utters its first word and the response variable is its Gesell adaptive score (a measure of cognitive development). All IWJ optimal values are the same and equal to the HK optimal value; the OWJ optimal value is different (Table 11).
Table 11. Tuning parameter sequence (Gesell Adaptive Score Data).
| Pilot α | Iteration 1 | Iteration 2 | Iteration 3 | Iteration 4 | Iteration 5 | Iteration 6 |
|---|---|---|---|---|---|---|
| 0.01 | 0.26 | 0.32 | ||||
| 0.1 | 0.28 | 0.32 | ||||
| 0.2 | 0.3 | 0.32 | ||||
| 0.3 | 0.32 | |||||
| 0.4 | 0.35 | 0.34 | ||||
| 0.5 | 0.38 | 0.34 | ||||
| 0.6 | 0.41 | 0.35 | 0.34 | |||
| 0.7 | 0.46 | 0.36 | 0.34 | |||
| 0.8 | 0.53 | 0.39 | 0.35 | 0.34 | ||
| 0.9 | 0.62 | 0.42 | 0.35 | 0.34 | ||
| 1 | 0.73 | 0.48 | 0.37 | 0.34 |
Example 5.10 Salinity Data —
These salinity data represent an extension of earlier examples to multiple linear regression. The data were originally presented in [12]; [11] modelled these data using a multiple linear regression model. Here, the IWJ algorithm leads to a solution of α ≈ 0.3 for all robust pilots (Table 12), but the optimal HK solution is α = 0.
Table 12. Tuning parameter sequence (Salinity Data).
| Pilot α | Iteration 1 | Iteration 2 | Iteration 3 |
|---|---|---|---|
| 0.01–0.1 | 0 | ||
| 0.2 | 0.30 | ||
| 0.3–1 | 0.31 |
5.3. No outlier performance
All the datasets that we have analysed here can be strongly argued to contain one or more outliers. What would happen if these outliers were absent and the data exhibited much better model conformity? To what extent would the optimal α values be pushed closer to zero? Take Newcomb's data, for example. The removal of the two largest outliers produces a nice bell-shaped structure which is almost perfectly symmetric and exhibits no obvious aberrations from the assumed normal model. There is no apparent reason to use anything other than the maximum likelihood estimator in this case. But does our algorithm actually lead us to the maximum likelihood estimator? In the following, we investigate such issues further.
Table 13 provides a comparison of the three algorithms by listing the full data optimal values of α and the outlier deleted optimal values of α for the three methods. The numbers demonstrate that for the data involving outliers, the optimal tuning parameters obtained by the IWJ and OWJ algorithms are often close. However, for the outlier deleted data, the IWJ algorithm is more successful in pushing the optimal tuning parameter closer to zero. In fact, in all of our i.i.d. data examples, the deletion of outliers leads to α = 0 being the optimal tuning parameter for the IWJ algorithm. Even for the regression examples, the drop in the value of α due to outlier deletion is larger for the iterated algorithm. (The one-step method actually leads to an increase in the value of the optimal α in two cases.) The life expectancy dataset is the only exception. On the whole, it appears that for pure data, the iterated version provides a more suitable optimal value of α. HK provides even better choices of α for the pure data, but at the expense of failing to be sufficiently robust for some datasets.
Table 13. The three optimal values of α for the full data as well as the outlier deleted data corresponding to all datasets.
| Full data optimal αs | Outlier deleted optimal αs | ||||||
|---|---|---|---|---|---|---|---|
| Dataset | Outlier deletion | IWJ | OWJ | HK | IWJ | OWJ | HK |
| Life expectancy | One: index | 0.98 | 0.98 | 0.98 | 0.92 | 0.92 | 0.92 |
| Drosophila (first run) | Two: values 3, 4 | 0.99 | 0.99 | 0.99 | 0 | 0 | 0 |
| Drosophila (second run) | One: value 91 | 0.03 | 0.03 | 0.03 | 0 | 0 | 0 |
| Peritonitis | Two: values 10, 12 | 0.06 | 0.2 | 0.05 | 0 | 0 | 0 |
| Short | Five: values 5.76, 9.87, 9.71, 7.8, 7.71 | 0.98 | 0.98 | 0.98 | 0 | 0.17 | 0 |
| Newcomb | Two: values | 0.23 | 0.42 | 0.23 | 0 | 0.54 | 0 |
| Telephone-line fault | One: value | 0.2 | 0.22 | 0.2 | 0 | 0 | 0 |
| Insulating fluid | Five: indices 15, 16, 17, 18, 19 | 1 | 1 | 1 | 0 | 0.25 | 0 |
| Hertzsprung–Russell Star cluster | Four: indices 11, 20, 30, 34 | 0.76 | 0.76 | 0 | 0.68 | 0.70 | 0 |
| Gesell adaptive Score | One: index 19 | 0.33 | 0.73 | 0.33 | 0.03 | 0.77 | 0.03 |
| Salinity | One: index 16 | 0.3 | 0.31 | 0 | 0.09 | 0.49 | 0 |
In Table 13, the IWJ and OWJ optimal values correspond to a robust initial pilot. However, the IWJ optimal values are invariant over all robust pilots, unlike those of the OWJ algorithm.
6. Simulation study
In this section, we present the results of a small simulation study comparing the values of the tuning parameter provided by the HK, IWJ and OWJ methods in both pure and contaminated cases. In each case, the initial pilot value of α is taken to be 1. Our simulation scenarios are as follows:
Case 1: We draw independent samples of size 50 from the standard normal distribution N(0, 1). The process is replicated 1000 times. For each sample, the optimal values of α under each of the three algorithms are determined. The process is then repeated, with the same sample size and number of replications, for samples from the Poisson distribution with mean 2 under the Poisson model with parameter θ.
Case 2: Here, the set-up is exactly the same as that for Case 1, except that the normal data, in each sample, are contaminated with 10% of observations from an outlying normal distribution. Similarly, the Poisson data, in each sample, are contaminated with 10% of observations from an outlying Poisson distribution.
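A sketch of the contaminated sampling scheme is given below. The paper does not specify the contaminating distributions in the text available to us, so N(5, 1) and Poisson(10) are assumed placeholders chosen purely for illustration, as is the use of an independent Bernoulli contamination indicator for each observation.

```python
import numpy as np

def contaminated_normal(n, eps, rng):
    """Each observation is N(0,1) with probability 1-eps and, with
    probability eps, comes from an outlying component (assumed N(5,1))."""
    out = rng.random(n) < eps
    x = rng.normal(0.0, 1.0, n)
    x[out] = rng.normal(5.0, 1.0, out.sum())
    return x

def contaminated_poisson(n, eps, rng):
    """Each observation is Poisson(2) with probability 1-eps and, with
    probability eps, comes from an outlying component (assumed Poisson(10))."""
    out = rng.random(n) < eps
    x = rng.poisson(2.0, n)
    x[out] = rng.poisson(10.0, out.sum())
    return x
```

Under this scheme the sample mean is pulled away from the model value, which is what forces the optimal α upward in the contaminated cases.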
Our comparison comprises all pairwise scatterplots of HK, IWJ and OWJ optimal values. The results for the pure data cases are depicted in Figure 6 and for the contaminated data cases in Figure 7.
Figure 6.
Comparison among optimal tuning parameters of the three algorithms for samples from the pure N(0, 1) (top panel) and Poisson(2) (bottom panel) models.
Figure 7.
Comparison among optimal tuning parameters of the three algorithms for samples from the contaminated N(0, 1) (top panel) and contaminated Poisson(2) (bottom panel) models.
From the graphs, we make the following observations.
In the pure data case, the HK and IWJ algorithms match for the vast majority of cases (Figure 6, left panel, top and bottom). For the normal data, all estimated optimal values of α are below 0.25 for either algorithm, except for one sample where the two solutions differ substantially. For the Poisson case also, the match between the two algorithms is near total. However, in this case some larger optimal values of α under the IWJ algorithm are observed, and there are at least eight cases where the HK solution equals a small value but the IWJ solution equals 1.
Comparison with the OWJ algorithm demonstrates that the IWJ and HK algorithms lead to smaller optimal values in practically all cases involving pure data. In particular, the latter two algorithms often lead to α = 0 as the optimal value, while OWJ leads to positive, sometimes fairly large, optimal values.
An inspection of the graphs in Figure 7 shows that under contamination, the optimal values of α are, in general, higher for all three algorithms. There are extremely few samples with α = 0 as the optimal solution, even for the HK algorithm.
For both the contaminated normal and Poisson models, HK and IWJ again show a high degree of agreement. But occasional cases where the HK solution is a low value while the IWJ solution (or the OWJ solution) is substantially larger indicate the occasional failure of the HK scheme.
In the case of the normal model, the OWJ optimal value is, in general, higher than the IWJ optimal value or the HK optimal value under contamination. There are quite a few cases where the OWJ optimal corresponds to α = 1, the minimum L2 distance estimate, whereas the HK as well as the IWJ optimals are substantially smaller than the OWJ optimal.
For the contaminated Poisson model, certain very small OWJ optimal values are associated with comparatively larger HK (or IWJ) optimal values. But for larger OWJ optimal values, the corresponding HK or IWJ solutions are usually smaller.
A point to be noted here is that occasionally we come across cases where the HK optimal solution is α = 0 but the IWJ algorithm leads to an optimal value of α = 1. Consider the normal distribution part of our simulation set-up. For pure data, we observe that between 4% and 5% of the time we encounter the phenomenon that the IWJ optimal is 1 and the HK optimal is 0. On the other hand, this never happens in our simulations in the case of contaminated data. It may be worthwhile, though, to scrutinize one of the cases (under pure data) where the HK optimal is 0 and the IWJ optimal is 1. The histogram of the data for this particular sample and the corresponding asymptotic variance curve over α are given in Figure 8. Although the data are generated from the pure normal model, there is clearly a substantially longer tail on the left. The overlaid normal density curves show that the HK optimal value of α = 0 (corresponding to the minimum asymptotic variance) tries to accommodate the entire data, whereas the IWJ optimal at α = 1 (corresponding to a local minimum) robustly fits the majority of the data, ignoring the tail on the left.
Figure 8.
The histogram and the asymptotic variance curve of a particular dataset which leads to an optimal α = 0 for the HK algorithm and an optimal α = 1 for the IWJ algorithm.
7. Computational cost
We have already demonstrated the advantages of the IWJ proposal in the previous sections of this paper. However, as the process is an iterative one, the experimenter will want to know the computational cost involved. To study this, we consider the number of iterations needed for the procedure to converge. Clearly, a smaller number of iterations indicates a more time-efficient algorithm. However, if the number of iterations is n, this does not mean that the computational complexity of the IWJ algorithm is n times that of the OWJ algorithm, as all the groundwork is done in the first step of the iteration, including the calculation of the estimates over a fine grid of α-values and the evaluation of the asymptotic variance curve. The subsequent iterations therefore require only a very small fraction of the computational effort of the first iteration.
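This precompute-then-iterate structure can be sketched as follows. This is a schematic for a scalar parameter under stated assumptions: the grid of MDPDEs and their estimated asymptotic variances are assumed to have been computed in advance (the expensive first step), the WJ criterion is taken to be the estimated summed MSE with the pilot estimate standing in for the target value, and all function names are ours.

```python
import numpy as np

def wj_step(estimates, avars, n, pilot_idx):
    """One Warwick-Jones step: with the estimate at pilot_idx as the
    target, pick the grid index minimizing the estimated summed MSE,
    i.e. a squared-bias proxy plus the variance term avar / n."""
    mse = (estimates - estimates[pilot_idx]) ** 2 + avars / n
    return int(np.argmin(mse))

def iwj_optimal(alphas, estimates, avars, n, pilot_idx, max_iter=100):
    """Iterated WJ: feed each selected alpha back as the new pilot until
    the selection stabilises. Only cheap array arithmetic is repeated;
    the grids themselves are computed once, outside this loop."""
    for _ in range(max_iter):
        nxt = wj_step(estimates, avars, n, pilot_idx)
        if nxt == pilot_idx:      # fixed point reached
            return alphas[pilot_idx]
        pilot_idx = nxt
    return alphas[pilot_idx]
```

Since each iteration is just an argmin over precomputed arrays, even a slow convergence (many iterations over a flat variance curve) adds little to the total cost, consistent with the timing ratios reported below.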
If we consider our real-life data examples, then in most cases the IWJ algorithm converges in five iterations or fewer when starting from a robust pilot. For example, with a robust MDPDE pilot, the numbers of iterations in our real data examples (including the two cases of Example 2) are, in the order of the examples, 2, 2, 2, 5, 2, 5, 3, 2, 2, 6, 3. The worst case observed in these examples is the Gesell Adaptive Score Data, where the process takes six iterations to converge when starting from the robust pilot. However, the computational time needed in this case is only 1.02 times the computational time required for the OWJ algorithm starting with the same pilot.
For our simulated data also, the process converges most of the time (80% of the time or more) within 2–5 iterations, both for pure and contaminated data. In Figure 9, we present the frequency distributions of the number of iterations required for the IWJ algorithm to converge over 1000 replications, for both pure and contaminated normal data. For pure data, the IWJ algorithm converges in just two iterations more than 20% of the time, although the mode of this frequency distribution is at four. For contaminated data also the mode is at four, but now the algorithm rarely converges in just two steps. On the other hand, in some rare cases the algorithm may take a large number of iterations to converge for pure data. For example, in one stray pure-data sample the IWJ algorithm took 32 iterations to converge starting with the robust pilot at α = 1. Further scrutiny shows that in this case the asymptotic variance curve is very flat and the optimal solution is α = 0; the passage of the algorithm from α = 1 to α = 0 over a flat asymptotic variance curve takes a while. Such large values are absent in the frequency distribution for contaminated data because in this case the optimal solutions are generally substantially higher than zero, and they are reached in fewer steps starting from α = 1. On the whole, the mean number of iterations for convergence is approximately the same for pure and contaminated data, but the variance is larger for pure data.
Figure 9.
Number of iterations needed to obtain IWJ estimator under pure and contaminated normal model.
8. Concluding remarks
In this paper, we have proposed an iterated WJ algorithm for the selection of the optimal tuning parameter in the class of MDPDEs. Our findings show that when the pilot estimators are within the MDPDE class, all robust pilots lead to the same iterated optimal. In this sense, the iterated algorithm eliminates the dependence on the pilot estimator. The IWJ optimal value of α is frequently (but not always) equal to the HK optimal value. However, the advantage of the IWJ procedure is that it picks a robust solution when one is appropriate, whereas the HK algorithm sometimes fails to do so.
Our findings also indicate that for clean data, the IWJ algorithm provides more suitable optimal values than the OWJ algorithm. For contaminated data, the one-step and iterated algorithms give closer results.
On the whole, we feel that the IWJ optimal solution is successful in eliminating dependence on the pilot estimator and provides a good robust outcome where necessary. It also provides more efficient optimal solutions under pure data compared to the OWJ algorithm. It is, therefore, without doubt the best of the three algorithms for choosing the tuning parameter in minimum DPD estimation with which we have been concerned in this paper.
Disclosure statement
No potential conflict of interest was reported by the author(s).
References
- 1. Basu A., Harris I.R., Hjort N.L., and Jones M.C., Robust and efficient estimation by minimising a density power divergence, Biometrika 85 (1998), pp. 549–559. doi: 10.1093/biomet/85.3.549
- 2. Basu A., Shioya H., and Park C., Statistical Inference: The Minimum Distance Approach, CRC Press, Boca Raton, FL, 2011.
- 3. Ghosh A. and Basu A., Robust estimation for independent non-homogeneous observations using density power divergence with applications to linear regression, Electron. J. Stat. 7 (2013), pp. 2420–2456. doi: 10.1214/13-EJS847
- 4. Ghosh A. and Basu A., Robust estimation for non-homogeneous data and the selection of the optimal tuning parameter: The density power divergence approach, J. Appl. Stat. 42 (2015), pp. 2056–2072. doi: 10.1080/02664763.2015.1016901
- 5. Hong C. and Kim Y., Automatic selection of the tuning parameter in the minimum density power divergence estimation, J. Korean Stat. Soc. 30 (2001), pp. 453–465.
- 6. Kang J. and Lee S., Minimum density power divergence estimator for Poisson autoregressive models, Comput. Stat. Data Anal. 80 (2014), pp. 44–56. doi: 10.1016/j.csda.2014.06.009
- 7. Mickey M.R., Dunn O.J., and Clark V., Note on the use of stepwise regression in detecting outliers, Comput. Biomed. Res. 1 (1967), pp. 105–111. doi: 10.1016/0010-4809(67)90009-2
- 8. Nelson W., Applied Life Data Analysis, John Wiley & Sons, New York, 1982.
- 9. OECD Health Statistics, 2017. doi: 10.1787/888933602272
- 10. Park J.-H. and Sriram T.N., Robust estimation of conditional variance of time series using density power divergences, J. Forecast. 36 (2017), pp. 703–717. doi: 10.1002/for.2465
- 11. Rousseeuw P.J. and Leroy A.M., Robust Regression and Outlier Detection, John Wiley & Sons, New York, 1987.
- 12. Ruppert D. and Carroll R.J., Trimmed least squares estimation in the linear model, J. Am. Stat. Assoc. 75 (1980), pp. 828–838. doi: 10.1080/01621459.1980.10477560
- 13. Simpson D.G., Minimum Hellinger distance estimation for the analysis of count data, J. Am. Stat. Assoc. 82 (1987), pp. 802–807. doi: 10.1080/01621459.1987.10478501
- 14. Stigler S.M., Do robust estimators work with real data? Ann. Stat. 5 (1977), pp. 1055–1098. doi: 10.1214/aos/1176343997
- 15. Warwick J. and Jones M.C., Choosing a robustness tuning parameter, J. Stat. Comput. Simul. 75 (2005), pp. 581–588. doi: 10.1080/00949650412331299120
- 16. Welch W.J., Rerandomizing the median in matched-pairs designs, Biometrika 74 (1987), pp. 609–614. doi: 10.1093/biomet/74.3.609
- 17. Woodruff R.C., Mason J.M., Valencia R., and Zimmering S., Chemical mutagenesis testing in drosophila-I: Comparison of positive and negative control data for sex-linked recessive lethal mutations and reciprocal translocations in three laboratories, Environ. Mol. Mutagen. 6 (1984), pp. 189–202. doi: 10.1002/em.2860060207