Distance-Based Estimation Methods for Models for Discrete and Mixed-Scale Data

Elisavet M Sofikitou; Ray Liu; Huipei Wang; Marianthi Markatou

doi:10.3390/e23010107

2021 Jan 14;23(1):107. doi: 10.3390/e23010107


PMCID: PMC7829708 PMID: 33466744

Abstract

Pearson residuals aid the task of identifying model misspecification because they compare the model estimated from the data with the model assumed under the null hypothesis. We present different formulations of the Pearson residual system that account for the measurement scale of the data and study their properties. We further concentrate on the case of mixed-scale data, that is, data measured on both categorical and interval scales. We study the asymptotic properties and the robustness of minimum disparity estimators obtained in the case of mixed-scale data and exemplify the performance of the methods via simulation.

Keywords: contingency tables, disparity, mixed-scale data, Pearson residuals, residual adjustment function, robustness, statistical distances

1. Introduction

Minimum disparity estimation has been studied extensively in models where the scale of the data is either interval or ratio (Beran [1], Basu and Lindsay [2]). It has also been studied in the discrete outcomes case. Specifically, when the response variable is discrete and the explanatory variables are continuous, Pardo et al. [3] introduced a general class of distance estimators based on $ϕ$-divergence measures, the minimum $ϕ$-divergence estimators, and studied their asymptotic properties. The estimators can be viewed as a generalization of the Maximum Likelihood Estimator (MLE). Pardo et al. [4] used the minimum $ϕ$-divergence estimator in a $ϕ$-divergence statistic to perform goodness-of-fit tests in logistic regression models, while Pardo and Pardo [5] extended the previous works to address testing problems in generalized linear models with binary data.

The case where data are measured on discrete scale (either on ordinal or generally categorical scale) has also attracted the interest of other researchers. For instance, Simpson [6] demonstrated that minimum Hellinger distance estimators fulfill desirable robustness properties and for this reason can be effective in the analysis of count data prone to outliers. Simpson [7] also suggested tests based on the minimum Hellinger distance for parametric inference which are robust as the density of the (parametric) model can be nonparametrically estimated. In contrast, Markatou et al. [8] used weighted likelihood equations to obtain efficient and robust estimators in discrete probability models and applied their methods to logistic regression, whereas Basu and Basu [9] considered robust penalized minimum disparity estimators for multinomial models with good small sample efficiency.

Moreover, Gupta et al. [10], Martín and Pardo [11] and Castilla et al. [12] used the minimum $ϕ$-divergence estimator to provide solutions to testing problems in polytomous regression models. Working in a similar fashion, Martín and Pardo [13] studied the properties of the family of $ϕ$-divergence estimators for log-linear models with linear constraints under multinomial sampling in order to identify potential associations between various variables in multi-way contingency tables. Pardo and Martín [14] presented an overview of works associated with contingency tables of symmetric structure on the basis of minimum $ϕ$-divergence estimators and minimum $ϕ$-divergence test statistics. Additional works include Pardo and Pardo [15] and Pardo et al. [16]. Alternative power divergence measures have been introduced by Basu et al. [17].

The class of $f$- or $ϕ$-divergences was originally introduced by Csiszár [18]. The structural characteristics of this class and their relationship to the concepts of efficiency and robustness were studied, for the case of discrete probability models, by Lindsay [19]. Basu and Lindsay [2] studied the properties of estimators derived by minimizing $f$-divergences between continuous models and presented examples showing the robustness results of these estimates. We also note that Tamura and Boos [20] studied the minimum Hellinger distance estimation for multivariate location and covariance. Additionally, formal robustness results were presented in Markatou et al. [8,21] in connection with the introduction of weighted likelihood estimation.

If G is a real valued, convex function, defined on $[0, \infty)$ and such that $G (u)$ converges to 0 as $u \to \infty$ , $0 G (0 / 0) = 0$ , $0 G (u / 0) = u G_{\infty}$ , $G_{\infty} = \lim_{u \to \infty} (G (u) / u)$ , the class of $ϕ$-divergences is defined as

ρ (τ, m_{β_{0}}) = \sum G (\frac{τ (t)}{m_{β_{0}} (t)}) m_{β_{0}} (t),

where $τ (\cdot)$ and $m_{β_{0}} (\cdot)$ are two probability models. Notice that we define $ρ (τ, m_{β_{0}})$ on discrete probability models first, where $\mathcal{T} = \{0, 1, 2, \dots, T\}$ is a discrete sample space, with T possibly infinite, $m_{β_{0}} (t) \in M = \{m_{β} (t) : β \in B\}$ , and $B \subseteq R^{d}$ is the parameter space. Furthermore, different forms of the function $G (u)$ provide different statistical distances or divergences.

We can change the argument of the function G from $\frac{τ (t)}{m_{β_{0}} (t)}$ to $\frac{τ (t)}{m_{β_{0}} (t)} - 1$ . Then, G is a function of the Pearson residual which is defined as $δ (t) = \frac{τ (t)}{m_{β_{0}} (t)} - 1$ , and takes values in $[- 1, \infty)$ . If the measurement scale is interval/ratio, then the Pearson residuals are modified to reflect and adjust for the discrepancy of scale between data, that are always discrete, and the assumed continuous probability model (see Basu and Lindsay [2]).

The Pearson residual is used by Lindsay [19], Basu and Lindsay [2] and Markatou et al. [8,21] in investigating the robustness of the minimum disparity and weighted likelihood estimators, respectively. This residual system allows one to identify distributional errors. If, in the equation of the Pearson residual, we replace $τ (t)$ with its best nonparametric representative $d (t)$ , the proportion of observations in a sample with value t, then $δ (t) = \frac{d (t)}{m_{β_{0}} (t)} - 1$ . We note that the Pearson residuals are called so because $n \sum δ^{2} (t) m (t)$ is Pearson’s chi-squared distance. Furthermore, these residuals are not symmetric since they take values in $[- 1, \infty)$ and are not standardized to have identical variances.
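As a small illustration of the last two statements, the following sketch computes the residuals $δ (t)$ for a finite discrete model and checks that $n \sum δ^{2} (t) m (t)$ coincides with the classical chi-squared statistic. The counts and the Binomial(4, 0.5) model are made-up values chosen only for illustration; they do not come from the paper.

```python
from math import comb

# Hypothetical observed counts for t = 0, 1, 2, 3, 4 (illustrative only).
counts = [5, 27, 36, 25, 7]
n = sum(counts)

# Assumed model m(t): Binomial(4, 0.5) probabilities (an illustrative choice).
model = [comb(4, t) * 0.5 ** 4 for t in range(5)]

# Empirical proportions d(t) and Pearson residuals delta(t) = d(t)/m(t) - 1.
d = [c / n for c in counts]
delta = [d_t / m_t - 1.0 for d_t, m_t in zip(d, model)]

# n * sum delta^2(t) m(t) versus the classical sum (O - E)^2 / E.
chi2_from_residuals = n * sum(dl ** 2 * m_t for dl, m_t in zip(delta, model))
chi2_classical = sum((c - n * m_t) ** 2 / (n * m_t)
                     for c, m_t in zip(counts, model))
```

The residuals are bounded below by -1 (attained by an empty cell) but unbounded above, which is the asymmetry noted in the text.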

How does robustness fit into this picture? In the robustness literature, there is a denial of the model’s truth. Following this logic, the framework based on disparities starts with goodness-of-fit by identifying a measure that assesses whether the model fits the data adequately. Then, we examine whether this measure of adequacy is robust and in what sense. A fundamental tool that assists in measuring the degree of robustness is the Pearson residual, because it measures model misspecification. That is, Pearson residuals provide information about the degree to which the specified model $m_{β}$ fits the data. In this context, outliers are defined as those data points that have a low probability of occurrence under the hypothesized model. Such probabilistic outliers are called surprising observations (Lindsay [19]). Furthermore, the robustness of estimators obtained via minimization of the divergence measures we discuss here is indicated by the shape of the associated Residual Adjustment Function (RAF), a concept that is reviewed in Section 2. Of note is that in contingency table analysis, the generalized residual system is used for examination of sources of error in models for contingency tables, see, for example, Haberman [22], Haberman and Sinharay [23]. The concept of generalized residuals in the case of generalized linear models is discussed, for example, in Pierce and Schafer [24].

Data sets often comprise data measured on both categorical (ordinal or nominal) scale and interval/ratio scale. We can think of these data as realizations of discrete and continuous random variables, respectively. Examples of data sets that include mixed-scale data are electronic health records containing diagnostic codes (discrete) and laboratory measurements (e.g., blood pressure, alanine amino transferase (ALT) measurements on interval/ratio scale) and marketing data (customer records include income and gender information). Additional examples include data from developmental toxicology (Aerts et al. [25]), where fetal data from laboratory animals include binary, categorical and continuous outcomes. In this context, the joint density of the discrete and continuous random variables is given as $m_{β} (x, y) = f_{β_{1}} (y | x) g_{β_{2}} (x)$ , where $β^{T} = (β_{1}^{T}, β_{2}^{T})$ indexes the joint density, and $β_{1}$ , $β_{2}$ are the parameter vectors indexing the conditional density of y given x and the probability function of x, respectively.

Work on the analysis of mixed-scale data is complicated by the fact that it is difficult to identify suitable joint probability distributions to describe both measurement scales of the data, although a number of ad hoc methods for the analysis of mixed-scale data have been used in applications. Olkin and Tate [26] proposed multivariate correlation models for mixed-scale data. Copulas also provide an attractive approach to modeling the joint distribution of mixed-scale data, though copulas are less straightforward to implement, and there are subtle identifiability issues that complicate the specification of a model (Genest and Nešlehová [27]).

To formulate the joint distribution in the mixed-scale variables case, one can specify the marginal distribution of the discrete variables and the conditional distribution of the continuous variables given the discrete variables. Alternatively, one can specify the marginal distribution of the continuous variables and the conditional distribution of the discrete variables given the continuous variables. Of note here is that the direction of factorization generally yields distinct model interpretations and results. The first approach has received much attention in the literature, in the context of the analysis of data with mixtures of categorical and continuous variables. Here, the continuous variables follow different multivariate normal distributions for each possible setting of the categorical variable values; the categorical variables then follow an arbitrary marginal multinomial distribution. This model is known in the literature as the conditional Gaussian distribution model and is central in the discussion of graphical association models with mixed-scale variables (Lauritzen and Wermuth [28]). A very special case of this model is used in our simulations.

In this paper, we develop robust methods for mixed-scale data. Specifically, Section 2 reviews basic concepts in minimum disparity estimation, Section 3 defines Pearson residuals for data measured on discrete, interval/ratio and mixed scales, and studies their properties. Section 4 establishes the optimization problem for obtaining estimators of the model parameters, while Section 5 and Section 6 establish the robustness and asymptotic properties of these estimators. Finally, Section 7 presents simulations showing the performance of these methods and Section 8 offers a discussion. Appendix A includes proofs of the theoretical results.

2. Concepts in Minimum Disparity Estimation

Beran [1] introduced a robust method to estimate the parameters of a statistical model, called minimum Hellinger distance estimation. The parameter estimator is obtained by minimizing the Hellinger distance between a parametric model density and a nonparametric density estimator. Lindsay [19] extended the aforementioned method to incorporate many other distances, and introduced the concept of the residual adjustment function in the context of minimum disparity estimation. The Minimum Distance Estimators (MDE) of a parameter vector $β$ are obtained by minimizing over $β$ , the distance (or disparity)

ρ (d, m_{β}) = \sum_{x} G (δ (x)) m_{β} (x),

(1)

where the assumed model $m_{β}$ is a probability mass function. When the model $m_{β}$ is continuous, the MDE of the parameter vector $β$ is obtained by minimizing over $β$ the quantity

ρ (f^{*}, m_{β}^{*}) = \int G (δ (x)) m_{β}^{*} (x) d x,

(2)

where $f^{*} (x) = \int k (x; t, h) d \hat{F} (t)$ , $m_{β}^{*} (x) = \int k (x; t, h) m_{β} (t) d t$ , $\hat{F}$ is the empirical distribution function obtained from the data and k is a smooth family of kernel functions. One example is the normal density with mean t and standard deviation h. Furthermore, $δ (x)$ is the Pearson residual defined as $δ (x) = f^{*} (x) / m_{β}^{*} (x) - 1$ . Lindsay [19] and Basu and Lindsay [2] discuss the efficiency and robustness properties of these estimators.

If $G (δ) = \frac{1}{λ (1 + λ)} \{{(1 + δ)}^{(λ + 1)} - 1\}$ we obtain the class of power divergence measures. Notice that we have $G (0) = 0$ . Different values of $λ$ offer different measures; for example, when $λ = - 2$ we obtain Neyman’s chi-squared divided by 2 measure, while $λ = - 1, - 1 / 2$ return the Kullback-Leibler and Hellinger distances, respectively.
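The special cases listed above can be checked numerically. The sketch below evaluates the disparity $\sum G (δ (x)) m (x)$ with the power divergence G for $λ = 1$ , $λ = - 1 / 2$ and $λ = - 2$ , and compares each value with the corresponding closed form (Pearson's chi-squared divided by 2, twice the squared Hellinger distance, Neyman's chi-squared divided by 2). The two probability vectors are arbitrary illustrative values, and the function names are hypothetical. The limiting cases $λ = 0$ and $λ = - 1$ are not covered because the formula is then defined only as a limit.

```python
from math import sqrt

def G(delta, lam):
    """Power divergence G from the text; requires lam not in {0, -1}."""
    return ((1.0 + delta) ** (lam + 1.0) - 1.0) / (lam * (lam + 1.0))

def disparity(d, m, lam):
    """rho(d, m) = sum_x G(delta(x)) m(x) with delta(x) = d(x)/m(x) - 1."""
    return sum(G(dx / mx - 1.0, lam) * mx for dx, mx in zip(d, m))

# Illustrative probability vectors (both sum to one, all entries positive).
d = [0.10, 0.30, 0.35, 0.20, 0.05]
m = [0.0625, 0.25, 0.375, 0.25, 0.0625]

# Closed forms of the three special cases.
pearson_half = sum((dx - mx) ** 2 / (2.0 * mx) for dx, mx in zip(d, m))
twice_sq_hellinger = 2.0 * sum((sqrt(dx) - sqrt(mx)) ** 2 for dx, mx in zip(d, m))
neyman_half = sum((dx - mx) ** 2 / (2.0 * dx) for dx, mx in zip(d, m))
```

The agreement relies on both vectors summing to one, since the linear terms in $δ$ then cancel.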

Under appropriate conditions, minimizing (1) and (2) leads to the estimating equations

\sum_{x} A (δ (x)) \nabla m_{β} (x) = 0,

\int A (δ (x)) \nabla m_{β}^{*} (x) d x = 0,

where $A (δ) = (δ + 1) G^{'} (δ) - G (δ)$ and the prime denotes differentiation with respect to $δ$ .
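For the discrete case, the step from the disparity (1) to its estimating equation can be sketched as follows, assuming that differentiation with respect to $β$ can be carried out term by term and that $d (x)$ does not depend on $β$ . The gradient of the disparity is, up to sign, the sum of $A (δ (x)) \nabla m_{β} (x)$ , so setting it to zero gives the discrete estimating equation.

```latex
\begin{aligned}
\delta(x) &= \frac{d(x)}{m_{\beta}(x)} - 1
  \quad\Longrightarrow\quad
  \nabla \delta(x) = -\frac{d(x)}{m_{\beta}^{2}(x)}\,\nabla m_{\beta}(x)
                   = -\frac{\delta(x) + 1}{m_{\beta}(x)}\,\nabla m_{\beta}(x), \\
\nabla \rho(d, m_{\beta})
  &= \sum_{x} \bigl\{ G'(\delta(x))\,\nabla \delta(x)\, m_{\beta}(x)
                      + G(\delta(x))\,\nabla m_{\beta}(x) \bigr\} \\
  &= -\sum_{x} \bigl\{ (\delta(x) + 1)\, G'(\delta(x)) - G(\delta(x)) \bigr\}\,
       \nabla m_{\beta}(x)
   = -\sum_{x} A(\delta(x))\,\nabla m_{\beta}(x).
\end{aligned}
```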

Lindsay [19] has shown that the structural characteristics of the function $A (δ)$ play an important role in the robustness and efficiency properties of these methods. Furthermore, without loss of generality, we can center and rescale $A (δ)$ , and define the RAF as follows.

Definition 1

(Lindsay [19]). Let $A (δ)$ be an increasing and twice differentiable function on $[- 1, \infty)$ defined as

$\begin{matrix} A (δ) & = (δ + 1) G^{'} (δ) - G (δ), \\ A (0) & = 0, \\ A^{'} (0) & = 1, \end{matrix}$

where G is strictly convex and twice differentiable with respect to δ on $[- 1, \infty)$ with $G (0) = 0$ . Then, $A (δ)$ is called residual adjustment function.

Remark 1.

Since $A^{'} (δ) = (1 + δ) G^{″} (δ)$ , the second order differentiability of G, in addition to its strict convexity, implies that $A (δ)$ is a strictly increasing function of δ on $[- 1, \infty)$ . Thus, we can define $A (δ)$ as above without changing the solutions of the aforementioned estimating equations in the discrete case (see Lindsay [19], p. 1089). In the continuous case, such standardization does not change the estimating properties of the associated disparities (see Basu and Lindsay [2], p. 687).
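To make the centering step concrete, the sketch below builds $A (δ) = (δ + 1) G^{'} (δ) - G (δ)$ for the power divergence G of this section, subtracts its value at zero, and compares the result with the closed form $\{{(1 + δ)}^{λ + 1} - 1\} / (λ + 1)$ , which is the standard RAF of the power divergence family. The function names are hypothetical, and the code assumes $λ \notin \{0, - 1\}$ . For this G the slope at zero already equals one, so only centering is needed.

```python
def G(delta, lam):
    """Power divergence G; requires lam not in {0, -1}."""
    return ((1.0 + delta) ** (lam + 1.0) - 1.0) / (lam * (lam + 1.0))

def G_prime(delta, lam):
    """Derivative of G with respect to delta."""
    return (1.0 + delta) ** lam / lam

def raw_A(delta, lam):
    """Uncentred A(delta) = (delta + 1) G'(delta) - G(delta)."""
    return (delta + 1.0) * G_prime(delta, lam) - G(delta, lam)

def raf(delta, lam):
    """Centred version, so that A(0) = 0."""
    return raw_A(delta, lam) - raw_A(0.0, lam)

def raf_closed_form(delta, lam):
    """Closed form ((1 + delta)^(lam + 1) - 1) / (lam + 1)."""
    return ((1.0 + delta) ** (lam + 1.0) - 1.0) / (lam + 1.0)

# Central-difference estimate of A'(0) for the Hellinger case (lam = -1/2).
step = 1e-5
slope_at_zero = (raf(step, -0.5) - raf(-step, -0.5)) / (2.0 * step)
```

For $λ = - 1 / 2$ the RAF is $2 (\sqrt{1 + δ} - 1)$ , which grows much more slowly than the likelihood RAF $A (δ) = δ$ for large positive residuals.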

Two fundamental and at the same time conflicting goals in robust statistics are the goals of robustness and efficiency. In the traditional literature on robustness, first order efficiency is sacrificed and, instead, safety of the estimation or testing method against outliers is guaranteed. Here, one adheres to the notion that information about robustness of a method is carried by the influence function. In our setting, using the influence function to characterize the robustness properties of the associated estimation procedures is misleading. Instead, the shape of the RAF, $A (\cdot)$ , provides information about the extent to which our procedures can be characterized as robust. The interested reader is directed to Lindsay [19] for further discussion on this topic.

3. Pearson Residual Systems

In this section, we define various Pearson residuals, appropriate for the measurement scale of the data. We introduce our notation first.

Let $(y_{i}, x_{i}), i = 1, 2, \dots, n$ be realizations from n independent and identically distributed random variables that follow a distribution with density $m_{β} (x, y)$ . Recall that we use the word density to denote a general probability function, independently of whether the random variables $X, Y$ are discrete, continuous or mixed. In what follows, we define different Pearson residual systems that account for the measurement scale of the data and study their properties.

Case 1: Both X and Y are discrete.

In this case, the pairs $(y_{i}, x_{i})$ follow a discrete probability mass function $m_{β} (x_{i}, y_{i})$ . Define the Pearson residual as

δ (x, y) = \frac{\frac{n_{x, y}}{n}}{m_{β} (y | x) π_{x}} - 1,

where $π_{x} = P (X = x) = g (x)$ , and $n_{x, y}$ is the number of observations in the cell with $Y = y$ and $X = x$ .

Note that this definition of the Pearson residual is nonparametric on the discrete support of X. In the case of regression, one can carry out a semiparametric argument to obtain the estimators of the vector $β$ and $π_{x}$ .

We now establish that, under correct model specification, the residual $δ (x, y)$ converges, almost surely, to zero.

Proposition 1.

When the model is correctly specified and as $n \to \infty$ ,

$δ (x, y) \overset{a . s .}{\to} 0 .$

Proof.

Write

$\begin{matrix} δ (x, y) & = \frac{\frac{n_{x, y}}{n}}{m_{β} (y | x) π_{x}} - 1 \\ & = \frac{\frac{n_{x, y}}{n_{x}} \cdot \frac{n_{x}}{n}}{m_{β} (y | x) π_{x}} - 1 . \end{matrix}$

Then

$\begin{matrix} \frac{n_{x}}{n} & = \frac{\text{number of observations in the sample equal to } x}{n} \\ & = \frac{1}{n} \sum_{i = 1}^{n} I (x_{i} = x), \end{matrix}$

where $I (\cdot)$ is the indicator function. Furthermore,

$E [I (X_{i} = x)] = P (X = x) < \infty,$

and by the strong law of large numbers

$\frac{n_{x}}{n} \overset{a . s .}{\to} E [I (X = x)] = P (X = x) = π_{x} \quad \text{as } n \to \infty .$

Similarly,

$\frac{n_{x, y}}{n_{x}} \overset{a . s .}{\to} m_{β} (y | x),$

therefore

$δ (x, y) \overset{a . s .}{\to} 0 \quad \text{as } n \to \infty$

under correct model specification. □
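Proposition 1 can be illustrated with a small simulation. The sketch below draws n pairs from a correctly specified model with binary X and Y and reports the largest absolute Pearson residual over the four cells. The probabilities, the seed and the function name are arbitrary choices made for illustration only.

```python
import random

def max_abs_residual(n, seed=0):
    """Largest |delta(x, y)| over all cells for a sample of size n."""
    rng = random.Random(seed)
    pi = {0: 0.4, 1: 0.6}        # P(X = x), hypothetical values
    p_y1 = {0: 0.3, 1: 0.7}      # P(Y = 1 | X = x), hypothetical values
    counts = {(x, y): 0 for x in (0, 1) for y in (0, 1)}
    for _ in range(n):
        x = 1 if rng.random() < pi[1] else 0
        y = 1 if rng.random() < p_y1[x] else 0
        counts[(x, y)] += 1
    worst = 0.0
    for (x, y), n_xy in counts.items():
        m_y_given_x = p_y1[x] if y == 1 else 1.0 - p_y1[x]
        # delta(x, y) = (n_xy / n) / (m(y | x) pi_x) - 1
        delta = (n_xy / n) / (m_y_given_x * pi[x]) - 1.0
        worst = max(worst, abs(delta))
    return worst
```

Calling `max_abs_residual` with increasing n (for example 100, 10,000 and 100,000) typically shows the residuals shrinking at roughly the rate $1 / \sqrt{n}$ , in line with the almost sure convergence stated above.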

Case 2: Y is continuous and X is discrete.

This is the case in some ANOVA models. We can still define the Pearson residual in this setting as

δ (x, y) = \frac{f_{n} (y, x)}{m_{β} (y, x)} - 1,

where

\begin{matrix} f_{n} (y, x) & = f_{n}^{*} (y | x) g (x) \\ & = \{\int k (y; t, h) d {\hat{F}}_{n} (t | x)\} \frac{n_{x}}{n} \end{matrix}

and

\begin{matrix} m_{β} (y, x) & = m_{β}^{*} (y | x) g (x) \\ & = \{\int k (y; t, h) d M_{β} (t | x)\} π_{x} . \end{matrix}

Then,

δ (x, y) = \frac{f_{n}^{*} (y | X = x) \frac{n_{x}}{n}}{m_{β}^{*} (y | X = x) π_{x}} - 1 .

Proposition 2.

Assume the model is correctly specified and $k (y, t, h)$ is a continuous function. Then,

$δ (x, y) \overset{a . s .}{\to} 0 \quad \text{as } n \to \infty .$

Proof.

By the strong law of large numbers,

$\frac{n_{x}}{n} \overset{a . s .}{\to} π_{x} .$

Under correct model specification, continuity of the kernel function, and the fact that ${\hat{F}}_{n}$ converges completely to F (an implication of the Glivenko-Cantelli theorem),

$\lim_{n \to \infty} \int k (y; t, h) d {\hat{F}}_{n} (t | x) = \int k (y; t, h) d F (t | x) = \int k (y; t, h) d M_{β} (t | x) = m_{β}^{*} (y | x)$

(extension of Helly-Bray lemma). Therefore,

$\frac{\frac{n_{x}}{n} f_{n}^{*} (y | x)}{π_{x} m_{β}^{*} (y | x)} \overset{a . s .}{\to} \frac{π_{x}}{π_{x}} \cdot \frac{m_{β}^{*} (y | x)}{m_{β}^{*} (y | x)} = 1$

and hence

$δ (x, y) = \frac{\frac{n_{x}}{n} f_{n}^{*} (y | x)}{π_{x} m_{β}^{*} (y | x)} - 1 \overset{a . s .}{\to} 1 - 1 = 0 .$

□
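The smoothed residual of Case 2 can be computed directly when both the model and the kernel are normal. The sketch below assumes, purely for illustration, that $Y | X = x \sim N (μ_{x}, σ^{2})$ and that k is the normal density with mean t and standard deviation h. Under these assumptions the smoothed model density is again normal, with mean $μ_{x}$ and variance $σ^{2} + h^{2}$ . All numerical values and function names are hypothetical.

```python
import random
from math import exp, pi, sqrt

def normal_pdf(u, mean, sd):
    """Density of N(mean, sd^2) evaluated at u."""
    return exp(-0.5 * ((u - mean) / sd) ** 2) / (sd * sqrt(2.0 * pi))

def smoothed_residual(x, y, data, pi_x, mu, sigma, h):
    """delta(x, y) = f_n*(y|x) (n_x/n) / (m*(y|x) pi_x) - 1, normal kernel."""
    n = len(data)
    y_at_x = [yi for xi, yi in data if xi == x]
    n_x = len(y_at_x)
    # Kernel density estimate of Y given X = x, evaluated at y.
    f_star = sum(normal_pdf(y, yi, h) for yi in y_at_x) / n_x
    # Normal model convolved with the normal kernel: N(mu_x, sigma^2 + h^2).
    m_star = normal_pdf(y, mu[x], sqrt(sigma ** 2 + h ** 2))
    return f_star * (n_x / n) / (m_star * pi_x[x]) - 1.0

# Simulate from the assumed (correct) model with illustrative parameters.
rng = random.Random(1)
pi_x = {0: 0.5, 1: 0.5}
mu = {0: 0.0, 1: 2.0}
sigma, h = 1.0, 0.5
data = []
for _ in range(20_000):
    x = 0 if rng.random() < pi_x[0] else 1
    data.append((x, rng.gauss(mu[x], sigma)))

delta = smoothed_residual(1, 2.0, data, pi_x, mu, sigma, h)
```

Because the same kernel smooths both the data and the model, the residual is close to zero under correct specification for any fixed bandwidth h, which is the practical appeal of this construction.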

Case 3: Y is continuous and X is continuous.

In this case, the pairs $(y_{i}, x_{i})$ follow a continuous probability distribution. The Pearson residual is then defined as

δ (x, y) = \frac{f_{n}^{*} (y, x)}{m_{β}^{*} (y, x)} - 1,

where

\begin{matrix} f_{n}^{*} (x, y) & = \int k (x, y; t_{1}, t_{2}) d {\hat{F}}_{n} (t_{1}, t_{2}), \\ m_{β}^{*} (x, y) & = \int k (x, y; t_{1}, t_{2}) m_{β} (t_{1}, t_{2}) d t_{1} d t_{2} . \end{matrix}

As an example, we take the linear regression model with random carriers X, and $ϵ_{i} \sim N (0, σ^{2})$ . Furthermore, assume that the random carriers follow a normal distribution with mean vector $μ$ and covariance matrix $Σ$ . In this case, $y_{i} = x_{i}^{T} β + ϵ_{i}$ and the quantities $z_{i} = (y_{i} - x_{i}^{T} β) / σ$ are independent, identically distributed random variables when $β$ represents the vector of true parameters. Hence, the $z_{i}$ ’s represent realizations of a random variable Z that has a completely known density $f (z)$ . Thus,

m_{β} (x, y) = m_{β} (z | x) \cdot g (x), z = (y - x^{T} β) / σ

and hence

\begin{matrix} m_{β}^{*} (x, y) & = m_{β}^{*} (y - x^{T} β | X = x) g^{*} (x), \\ m_{β}^{*} (y - x^{T} β | X = x) & = m_{β}^{*} (z | x) = \int k (z, t, h) d M_{β} (t | x), \\ g^{*} (x) & = \int k^{'} (x, t^{'}, h^{'}) g (t^{'}) d t^{'} . \end{matrix}

The kernel $k (z, t, h)$ is selected so that it facilitates easy computation. Kernels that do not entail loss of information when they are used to smooth the assumed parametric model are called transparent kernels (Basu and Lindsay [2]). Basu and Lindsay [2] provide a formal definition of transparent kernels and an insightful discussion of why transparent kernels do not exhibit information loss when convolved with the hypothesized model (see Section 3.1 of Basu and Lindsay [2]).

4. Estimating Equations

In this section, we concentrate on Cases 1 and 2 presented in the previous section. We carefully outline the optimization problems and discuss the associated estimating equations for these two cases. The case where both X and Y are continuous has been discussed in the literature, see, for example, Markatou et al. [21].

Case 1: Both X and Y are discrete.

In this case, the minimum distance estimators of the parameter vector $β$ and $π_{x}$ are obtained by solving the following optimization problem

min_{β, π_{x}} ρ (d, m_{β})

(3)

subject to

\sum_{x} π_{x} = 1 .

Optimization problem (3) is equivalent to the problem

min \sum_{x, y} G (δ (x, y)) m_{β} (x, y)

subject to

\sum_{x} π_{x} = 1 .

The class of G functions that we use creates distances that belong in the family of $ϕ$ -divergences.

Proposition 3.

The estimating equations for β and $π_{x}$ are given as:

$\begin{matrix} \sum_{x, y} w (δ (x, y)) n_{x, y} u (y | x; β) = 0, \\ \sum_{x, y} w (δ (x, y)) n_{x, y} \{\frac{I (X = x)}{π_{x}} - 1\} = 0 . \end{matrix}$ (4)

The function $w (δ (x, y))$ is a weight function, such that $0 \leq w (δ (x, y)) \leq 1$ , and it is defined as

$w (δ (x, y)) = min \{\frac{{[A (δ (x, y)) + 1]}^{+}}{δ (x, y) + 1}, 1\}$

with ${[\cdot]}^{+}$ indicating the positive part of the function $A (δ (x, y)) + 1$ .

Proof.

The main steps of the proof are provided in Appendix A.1. □
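The weight function of Proposition 3 is easy to evaluate once a RAF is chosen. The sketch below uses two RAFs as examples: the likelihood RAF $A (δ) = δ$ and the Hellinger RAF $A (δ) = 2 (\sqrt{1 + δ} - 1)$ . The function names are hypothetical, and returning 0 at $δ = - 1$ is a convention adopted here for empty cells, where the ratio is not defined and the weight is multiplied by $n_{x, y} = 0$ in any case.

```python
from math import sqrt

def raf_likelihood(delta):
    """RAF of the likelihood disparity: A(delta) = delta."""
    return delta

def raf_hellinger(delta):
    """RAF of the Hellinger distance: A(delta) = 2 (sqrt(1 + delta) - 1)."""
    return 2.0 * (sqrt(1.0 + delta) - 1.0)

def weight(delta, raf):
    """w(delta) = min{ [A(delta) + 1]^+ / (delta + 1), 1 }."""
    if delta <= -1.0:
        return 0.0  # convention for empty cells (assumption, see lead-in)
    positive_part = max(raf(delta) + 1.0, 0.0)
    return min(positive_part / (delta + 1.0), 1.0)
```

With the likelihood RAF every weight equals 1, whereas the Hellinger weights decay toward 0 for large positive residuals, so cells that are surprising under the model contribute little to the estimating equations.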

Remark 2.

1. The above two estimating equations can be solved with respect to β and $π_{x}$ . In an iterative algorithm, we can solve the second equation in (4) explicitly for $π_{x}$ to obtain
$π_{x} = \frac{\sum_{y} w (δ (x, y)) n_{x, y}}{\sum_{x, y} w (δ (x, y)) n_{x, y}} .$
This means that if the model does not fit well any of the y values observed at a particular x, the weight for this x will drop as well.

2. When $A (δ (x, y)) = δ (x, y)$ the corresponding estimating equation for β becomes $\sum_{x, y} n_{x, y} u (y | x; β) = 0$ and the MLE is obtained. This is because the corresponding weight function is $w (δ (x, y)) = 1$ . In this case, the estimating equations for the $π_{x}$ ’s become $\sum n_{x, y} [\frac{I (X = x)}{π_{x}} - 1] = 0$ , the estimating equations for the MLEs of $π_{x}$ .

3. The Fisher consistency property of the function that introduces the estimates guarantees that the expectation of the corresponding estimating function is 0, under the correct model specification.
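One way to organize the iterative algorithm mentioned in item 1 is a fixed-point scheme that alternates between computing the weights and solving the weighted equations. The sketch below is not the authors' implementation. It assumes a toy model in which $Y | X = x$ is Poisson with a common mean β, so that $u (y | x; β) = y / β - 1$ and the weighted equation for β has an explicit solution, and it uses Hellinger weights. The data, the contamination at $y = 20$ , the seed and all function names are hypothetical.

```python
import random
from math import exp, lgamma, log, sqrt

def poisson_pmf(y, beta):
    """Poisson probability of y with mean beta, computed on the log scale."""
    return exp(-beta + y * log(beta) - lgamma(y + 1))

def draw_poisson(rng, beta):
    """Knuth's multiplication method; adequate for small beta."""
    limit, k, prod = exp(-beta), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def hellinger_weight(delta):
    """w(delta) of Proposition 3 with A(delta) = 2 (sqrt(1 + delta) - 1)."""
    if delta <= -1.0:
        return 0.0
    a_plus_one = max(2.0 * (sqrt(1.0 + delta) - 1.0) + 1.0, 0.0)
    return min(a_plus_one / (delta + 1.0), 1.0)

def fit(counts, n_iter=200):
    """Fixed-point iteration for (beta, pi_x) from cell counts {(x, y): n_xy}."""
    n = sum(counts.values())
    xs = sorted({x for x, _ in counts})
    beta = sum(y * c for (_, y), c in counts.items()) / n   # MLE as start
    pi = {x: sum(c for (x2, _), c in counts.items() if x2 == x) / n for x in xs}
    for _ in range(n_iter):
        wn = {}                                  # w(delta(x, y)) * n_xy
        for (x, y), c in counts.items():
            delta = (c / n) / (poisson_pmf(y, beta) * pi[x]) - 1.0
            wn[(x, y)] = hellinger_weight(delta) * c
        total = sum(wn.values())
        # Weighted score equation sum w n (y / beta - 1) = 0 solved for beta.
        beta = sum(y * v for (_, y), v in wn.items()) / total
        # Explicit solution of the second equation in (4) for pi_x.
        pi = {x: sum(v for (x2, _), v in wn.items() if x2 == x) / total for x in xs}
    return beta, pi

# 950 clean observations with beta = 2 plus 50 gross outliers at y = 20.
rng = random.Random(0)
counts = {}
for i in range(1000):
    x = rng.randrange(3)
    y = 20 if i < 50 else draw_poisson(rng, 2.0)
    counts[(x, y)] = counts.get((x, y), 0) + 1

mle = sum(y * c for (_, y), c in counts.items()) / 1000
beta_hat, pi_hat = fit(counts)
```

In this toy example the MLE of β is pulled toward the outliers, while the weighted estimate stays near the value used to generate the clean observations, because the cells at $y = 20$ have very large Pearson residuals and therefore receive weights close to zero.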

Case 2: Y is continuous and X is discrete.

In this case, the estimates of the parameters $β$ and $π_{x}$ are obtained by solving the following optimization problem

min_{β, π_{x}} \sum_{x} \int G (δ (x, y)) m_{β}^{*} (y, x) d y

subject to

\sum_{x} π_{x} = 1 .

In general $m_{β}^{*} (y, x) = m_{β}^{*} (y | x) π_{x}$ ; in the case where $y, x$ are independent $m_{β}^{*} (y, x) = m_{β}^{*} (y) π_{x}$ , and the optimization problem stated above is equivalent to

min_{β, π_{x}} \sum_{x} π_{x} \int G (δ (x, y)) m_{β}^{*} (y) d y

(5)

subject to

\sum_{x} π_{x} = 1 .

Proposition 4.

The estimating equations for β and $π_{x}$ in the case of independence of $y, x$ are given as follows:

$\begin{matrix} \sum_{x} π_{x} \int A (δ (x, y)) \nabla_{β} m_{β}^{*} (y) d y = 0, \\ \sum_{x} π_{x} \int A (δ (x, y)) [\frac{I (X = x)}{π_{x}} - 1] m_{β}^{*} (y) d y = 0, \end{matrix}$ (6)

where $A (δ)$ is the residual adjustment function (RAF) that corresponds to the function G, and $G^{'} (δ)$ is the derivative of G with respect to δ.

Proof.

Straightforward, after differentiating the Lagrangian with respect to $β$ and $π_{x}$ . □

Case 3: Y is continuous and X is continuous.

In this case, we refer the reader to Basu and Lindsay [2].

5. Robustness Properties

Hampel et al. [29] and Hampel [30,31] define robust statistics as the “statistics of approximate parametric models”, and introduce one of the fundamental tools of robust statistics, the concept of the influence function, in order to investigate the behavior of a statistic $T_{n}$ expressed as a functional $T (G)$ . The influence function is a heuristic tool with the intuitive interpretation of measuring the bias caused by an infinitesimal contamination at a point x on the estimate standardized by the mass of contamination. Its formal definition is as follows:

Definition 2.

The influence function of a functional T at the distribution F is given as

$I F (x; T, F) = lim_{t \to 0} \frac{T ((1 - t) F + t Δ_{x}) - T (F)}{t},$

in those $x \in X$ where the limit exists, $0 \leq t \leq 1$ and $Δ_{x}$ is the Dirac measure defined as

$Δ_{x} (u) = \begin{cases} 1, & u = x, \\ 0, & u \neq x . \end{cases}$ (7)

If an estimator has a bounded influence function, the estimator is considered to be robust to outliers, that is, data that lie away from the pattern set by the majority of the data. The effect of bounding the influence function is the sacrifice of efficiency; estimators with bounded influence function, while not affected by outlying points, are not fully efficient under the correct model specification.
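Definition 2 can be made concrete with the simplest functional, the mean. The sketch below evaluates the difference quotient of the definition at a small contamination mass t, with F taken to be the empirical distribution of a made-up sample. For the mean, the quotient equals $x - T (F)$ for every t, so the influence function is unbounded in x. The sample and function names are hypothetical.

```python
def mean_at_mixture(sample, x, t):
    """Mean functional T evaluated at (1 - t) F_n + t * Delta_x."""
    mean_f = sum(sample) / len(sample)
    return (1.0 - t) * mean_f + t * x

def influence_of_mean(sample, x, t=1e-4):
    """Difference quotient [T((1 - t) F_n + t Delta_x) - T(F_n)] / t."""
    base = sum(sample) / len(sample)
    return (mean_at_mixture(sample, x, t) - base) / t

sample = [1.0, 2.0, 2.0, 3.0, 4.0]   # illustrative data with mean 2.4
```

Moving the contamination point x further away increases the influence without limit, which is the sense in which the sample mean is not robust to outliers.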

Our goal in calculating the influence function is to show the full efficiency of the proposed estimators. That is, the influence function of the proposed estimators, under correct model specification, equals the influence function of the corresponding maximum likelihood estimators. In our context, robustness of the estimators is quantified by the associated RAFs (see Lindsay [19] and Basu and Lindsay [2]).

In what follows, we will derive the influence function of the estimators for the parameter vector $β$ in the case where both $y, x$ are discrete. Similar calculations provide the influence functions of estimators obtained under the remaining scenarios. To do so, we need to resort to the estimators’ functional form, denoted by $β_{ϵ}$ , with corresponding estimating equations

\sum_{s, t} w (δ_{ϵ} (s, t)) u (t | s; β_{ϵ}) d_{ϵ} (s, t) = 0,

where $d_{ϵ} (s, t) = (1 - ϵ) d (s, t) + ϵ Δ_{x, y} (s, t) .$ The influence function is then obtained by differentiating the aforementioned estimating equations with respect to $ϵ$ and then evaluating the derivative at $ϵ = 0$ .

Proposition 5.

The influence function of the β estimator is given by

$β_{0}^{'} = {[A (d)]}^{- 1} B (x, y; d),$

where

$\begin{matrix} A (d) = & \sum_{s, t} [δ_{0} (s, t) + 1] w^{'} (δ_{0} (s, t)) u (t | s; β_{0}) u^{T} (t | s; β_{0}) d (s, t) \\ & - \sum_{s, t} w (δ_{0} (s, t)) \nabla u (t | s; β_{0}) d (s, t), \end{matrix}$

$\begin{matrix} B (x, y; d) = & \sum_{s, t} [\frac{I (s = x, t = y)}{m_{β_{0}} (t | s) π_{s}} - \frac{d (s, t)}{m_{β_{0}} (t | s) π_{s}}] w^{'} (δ_{0} (s, t)) u (t | s; β_{0}) d (s, t) \\ & - \sum_{s, t} w (δ_{0} (s, t)) u (t | s; β_{0}) d (s, t) + w (δ_{0} (x, y)) u (y | x; β_{0}), \end{matrix}$

with $u (t | s; β) = \nabla ln m_{β} (t | s)$ , and the subscript 0 indicates evaluation at a parametric model.

Proof.

The proof is obtained via straightforward differentiation and its main steps are provided in Appendix A.2. □

Proposition 6.

Under the assumption that the model is correct, the influence function derived above reduces to the influence function of the MLE of β.

Proof.

Under the assumption that the adopted model is the correct model, the density $d (s, t)$ is $m_{β_{0}} (s, t)$ , so that $δ (s, t) = 0$ . Now recall that $w (0) = 1$ and $w^{'} (0) = 0$ , so the expression $A (d)$ reduces to

$\begin{matrix} A (d) & = - \sum_{s, t} \nabla u (t | s; β_{0}) m_{β_{0}} (s, t) \\ & = i (β; x, y) . \end{matrix}$ (8)

Furthermore, the expression $B (x, y; d)$ reduces to $u (y | x; β_{0})$ , where we assume exchangeability of differentiation and integration and use the fact that $u (t | s; β_{0}) = u (s, t; β_{0})$ . Hence, the influence function is given as

$i^{- 1} (β; x, y) u (y | x; β_{0}),$

which is exactly the influence function of the MLE. Therefore, full efficiency is preserved under the model. □

6. Asymptotic Properties

In what follows, we establish asymptotic normality of the estimators in the case of discrete variables. The techniques for obtaining asymptotic normality in the mixed-scale case are similar and not presented here.

Case 1: Both X and Y are discrete.

Recall that the $k$-th estimating equation is given as $\sum_{x, y} w (δ_{β} (x, y)) n_{x, y} u_{k} (y | x; β) = 0$ , which can be expanded in a Taylor series in the neighborhood of the true parameter $β_{0}$ to obtain:

\frac{1}{n} \sum_{x, y} w (δ_{β} (x, y)) n_{x, y} u_{k} (y | x; β) ≅ A_{n} + {(β - β_{0})}^{T} B_{n} + \frac{1}{2} {(β - β_{0})}^{T} C_{n} (β - β_{0}),

(9)

where

\begin{matrix} A_{n} & = \frac{1}{n} \sum_{x, y} w (δ_{β} (x, y)) n_{x, y} u_{k} (y | x; β_{0}), \\ B_{n} & = \nabla_{β} \{\frac{1}{n} \sum_{x, y} w (δ_{β} (x, y)) n_{x, y} u_{k} (y | x; β)\} |_{β_{0}}, \end{matrix}

(10)

$C_{n}$ is a $p \times p$ Hessian matrix whose $(t, e)$-th element is given as

\frac{\partial^{2}}{\partial β_{t} \partial β_{e}} \{\frac{1}{n} \sum_{x, y} w (δ_{β} (x, y)) n_{x, y} u_{k} (y | x; β)\} |_{β_{0}} .

Under Assumptions 1–8, listed in Appendix A.3, we have the following theorem.

Theorem 1.

The minimum disparity estimators of the parameter vector β are asymptotically normal with asymptotic variance $I^{- 1} (β_{0})$ , where $I (\cdot)$ indicates the Fisher information matrix.

7. Simulations

The simulation study presented below has two aims. The first is to illustrate the versatility of the disparity methods for different data measurement scales. The second is to exemplify and study the robustness of these methods under different contamination scenarios.

Case 1: Both X and Y are discrete.

The Cressie-Read family of power divergence is given by

P W D (d, m_{β}) = \sum m_{β} (x, y) \cdot \frac{{[1 + δ (x, y)]}^{λ + 1} - 1}{λ (λ + 1)} = \sum d (x, y) \cdot \frac{{[d (x, y) / m_{β} (x, y)]}^{λ} - 1}{λ (λ + 1)},

where $d (x, y) = n_{x, y} / n$ is the proportion of observations with value $x, y$ and $m_{β} (x, y) = m_{β} (y | x) π_{x}$ is the density function of the model of interest.

To evaluate the performance of our algorithmic procedure, we use the following disparity measures, that is,

Likelihood disparity ( $λ = 0$ ):

L D (d, m_{β}) = \sum d (x, y) \cdot \log [d (x, y) / m_{β} (x, y)],

Twice-squared Hellinger’s ( $λ = - 1 / 2$ ):

H D (d, m_{β}) = 2 \cdot \sum {[\sqrt{d (x, y)} - \sqrt{m_{β} (x, y)}]}^{2},

Pearson’s chi-squared divided by 2 ( $λ = 1$ ):

P C S (d, m_{β}) = \sum \frac{{[d (x, y) - m_{β} (x, y)]}^{2}}{2 \cdot m_{β} (x, y)},

Symmetric chi-squared ( $G (δ (x, y)) = \frac{2 {[δ (x, y)]}^{2}}{δ (x, y) + 2}$ ):

S C S (d, m_{β}) = 2 \cdot \sum \frac{{[m_{β} (x, y) - d (x, y)]}^{2}}{m_{β} (x, y) + d (x, y)} .

The data are generated in four different ways using three different sample sizes: $N = 100$ , $N = 1000$ and $N = 10{,}000$ . The data format used can be represented in a $5 \times 5$ contingency table, with $n_{i, j}$ , $i = 1, 2, \dots, 5$ ; $j = 1, 2, \dots, 5$ denoting the counts in the $i j$-th cell, and $n_{i •}$ and $n_{• j}$ representing the row and column totals, respectively. Furthermore, the variable x indicates the columns, while y indicates the rows. In each of the aforementioned scenarios, 10,000 tables were generated, which corresponds to the number of Monte Carlo (MC) replications. Our purpose is to get the mean values of the estimates of the parameters $m_{β} (y | x)$ ’s and $π_{x}$ ’s along with their corresponding standard deviations (SDs). Notice that, in this setting, the estimation of $π_{x}$ and $m_{β} (y | x)$ is completely nonparametric, that is, no model is assumed for estimating the marginal probabilities of X and Y.

The table was generated by fixing either the total sample size N or the marginal probabilities. These two data-generating schemes correspond to two different sampling schemes, with consequences for the probability model one would use. For example, with a fixed total sample size the counts follow a multinomial distribution, whereas if the row margin is fixed in advance the counts follow a product of multinomial distributions (a product binomial distribution when there are only two columns). In the former case of fixed N, we explored two scenarios: a balanced and an imbalanced one. The imbalanced scenario allows for one zero cell in the contingency table, whereas the balanced scenario does not. In the latter case of fixed marginal probabilities, the row marginal probabilities ( $m_{β} (y | x)$ ’s) were fixed, while the column marginals ( $π_{x}$ ’s) were randomly chosen, and these values were used to obtain the contingency table. In this case, we also explored a balanced and an imbalanced scenario, according to whether or not the row marginal probabilities were chosen to be equal to each other.
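The difference between the two sampling schemes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the seed, the cell probabilities and the row totals below are hypothetical and are not the settings used in the simulations.

```python
import random

def multinomial_table(n_total, cell_probs, rng):
    """Fixed total N: all N observations are spread over the cells at once."""
    rows, cols = len(cell_probs), len(cell_probs[0])
    flat = [p for row in cell_probs for p in row]
    table = [[0] * cols for _ in range(rows)]
    for k in rng.choices(range(rows * cols), weights=flat, k=n_total):
        table[k // cols][k % cols] += 1
    return table

def product_multinomial_table(row_totals, col_probs, rng):
    """Fixed row margin: each row is an independent multinomial draw."""
    table = []
    for n_row in row_totals:
        counts = [0] * len(col_probs)
        for j in rng.choices(range(len(col_probs)), weights=col_probs, k=n_row):
            counts[j] += 1
        table.append(counts)
    return table

rng = random.Random(2021)                    # hypothetical seed
balanced = [[1 / 25] * 5 for _ in range(5)]  # equal cell probabilities
table_fixed_n = multinomial_table(100, balanced, rng)
table_fixed_rows = product_multinomial_table([20] * 5, [0.2] * 5, rng)

print([sum(row) for row in table_fixed_n])     # row totals vary from draw to draw
print([sum(row) for row in table_fixed_rows])  # row totals are fixed by design
```

Under the first scheme only the grand total is fixed, so the row totals are random; under the second scheme the row totals are fixed in advance and only the split within each row is random.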

Specifically, under Scenario Ia, where the total sample size N was fixed and the balanced design was used, no cell count was zero ( $n_{i j} \neq 0, \forall i, j = 1, 2, 3, 4, 5$ ) and the row and column marginal probabilities were all equal. Table 1 presents the means of the 10,000 estimates and the corresponding SDs for all four distances ( $P C S, H D, S C S, L D$ ) when N is fixed under the balanced scenario. Table 1 shows that all distances provide estimates approximately equal to 0.200 regardless of the sample size used. Furthermore, as the sample size increases, the SDs decrease noticeably.

Table 1.

Scenario Ia: Means and standard deviations (SDs) of 4 distances ( $P C S, H D, S C S, L D$ ). A $5 \times 5$ contingency table was generated having fixed the total sample size N under a balanced design with $n_{i j} \neq 0, \forall i, j = 1, 2, 3, 4, 5$ . The number of Monte Carlo (MC) replications used is 10,000.

N	Statistical Distance	Summary	Estimates
			Means and SDs over 10,000 Replications
			${\hat{m}}_{β_{1}}$	${\hat{m}}_{β_{2}}$	${\hat{m}}_{β_{3}}$	${\hat{m}}_{β_{4}}$	${\hat{m}}_{β_{5}}$	${\hat{π}}_{x_{1}}$	${\hat{π}}_{x_{2}}$	${\hat{π}}_{x_{3}}$	${\hat{π}}_{x_{4}}$	${\hat{π}}_{x_{5}}$
100	PCS	Mean	0.199	0.199	0.201	0.201	0.200	0.201	0.200	0.199	0.200	0.201
		SD	0.038	0.041	0.039	0.039	0.039	0.038	0.038	0.037	0.038	0.038
	HD	Mean	0.199	0.200	0.200	0.200	0.201	0.200	0.200	0.200	0.200	0.200
		SD	0.037	0.041	0.037	0.037	0.037	0.037	0.037	0.035	0.036	0.037
	SCS	Mean	0.199	0.201	0.200	0.200	0.200	0.200	0.200	0.199	0.200	0.201
		SD	0.037	0.041	0.038	0.038	0.038	0.032	0.033	0.030	0.031	0.032
	LD	Mean	0.199	0.200	0.200	0.200	0.200	0.200	0.200	0.200	0.200	0.200
		SD	0.035	0.039	0.036	0.036	0.036	0.035	0.036	0.036	0.034	0.035
1000	PCS	Mean	0.200	0.200	0.200	0.200	0.200	0.200	0.200	0.200	0.200	0.200
		SD	0.014	0.015	0.016	0.016	0.014	0.017	0.015	0.015	0.013	0.016
	HD	Mean	0.200	0.200	0.200	0.200	0.200	0.200	0.200	0.200	0.200	0.200
		SD	0.013	0.015	0.013	0.013	0.013	0.013	0.012	0.012	0.012	0.013
	SCS	Mean	0.200	0.200	0.200	0.200	0.200	0.200	0.200	0.200	0.200	0.200
		SD	0.014	0.015	0.013	0.013	0.013	0.008	0.009	0.011	0.012	0.008
	LD	Mean	0.200	0.200	0.200	0.200	0.200	0.200	0.200	0.200	0.200	0.200
		SD	0.013	0.015	0.013	0.013	0.013	0.013	0.013	0.012	0.012	0.013
10,000	PCS	Mean	0.200	0.200	0.200	0.200	0.200	0.200	0.200	0.200	0.200	0.200
		SD	0.008	0.007	0.006	0.006	0.009	0.010	0.010	0.007	0.008	0.006
	HD	Mean	0.200	0.200	0.200	0.200	0.200	0.200	0.200	0.200	0.200	0.200
		SD	0.004	0.005	0.004	0.004	0.004	0.004	0.004	0.004	0.004	0.004
	SCS	Mean	0.200	0.200	0.200	0.200	0.200	0.200	0.200	0.200	0.200	0.200
		SD	0.004	0.005	0.004	0.004	0.004	0.007	0.005	0.008	0.008	0.004
	LD	Mean	0.200	0.200	0.200	0.200	0.200	0.200	0.200	0.200	0.200	0.200
		SD	0.004	0.005	0.004	0.004	0.004	0.004	0.004	0.004	0.004	0.004


In Scenario IIa, where the total sample size N was fixed and the contingency table was structured using the imbalanced design, the presence of a zero cell ( $n_{11} = 0$ ) was allowed. The results of this scenario are presented in Table 2, where the estimates were calculated using all four disparity measures. For the $L D$ , $n_{11}$ was set equal to $10^{- 8}$ . The presence of zero cells in contingency tables has a long history in the literature on contingency table analysis, where several options are provided for the analysis of such tables (Fienberg [32], Agresti [33], Johnson and May [34], Poon et al. [35]). From Table 2, one can infer that the different distances handle the zero cell differently. This difference is reflected in the estimate ${\hat{m}}_{β} (y_{1} | x) = {\hat{m}}_{β_{1}}$ , because it is affected by the zero value of $n_{11}$ . The strongest control is provided by the Hellinger and symmetric chi-squared distances. All distances estimate the parameters $π_{x_{i}}$ similarly, with the bias in their estimation being between $2.7 %$ and $5.2 %$ . The SDs are almost the same for all distances per estimate, and they are smallest for $N =$ 10,000.

Table 2.

Scenario IIa: Means and SDs of 4 distances ( $P C S, H D, S C S, L D$ ). A $5 \times 5$ contingency table was generated having fixed the total sample size N under an imbalanced design with $n_{11} = 0$ . The number of MC replications used is 10,000.

N	Statistical Distance	Summary	Estimates
			Means and SDs over 10,000 Replications
			${\hat{m}}_{β_{1}}$	${\hat{m}}_{β_{2}}$	${\hat{m}}_{β_{3}}$	${\hat{m}}_{β_{4}}$	${\hat{m}}_{β_{5}}$	${\hat{π}}_{x_{1}}$	${\hat{π}}_{x_{2}}$	${\hat{π}}_{x_{3}}$	${\hat{π}}_{x_{4}}$	${\hat{π}}_{x_{5}}$
100	PCS	Mean	0.052	0.197	0.198	0.198	0.355	0.165	0.173	0.172	0.245	0.245
		SD	0.028	0.045	0.044	0.044	0.053	0.041	0.039	0.044	0.044	0.047
	HD	Mean	0.026	0.202	0.202	0.202	0.368	0.156	0.168	0.168	0.254	0.254
		SD	0.019	0.049	0.045	0.045	0.054	0.041	0.042	0.041	0.046	0.049
	SCS	Mean	0.033	0.209	0.209	0.209	0.340	0.166	0.172	0.171	0.245	0.246
		SD	0.022	0.047	0.045	0.045	0.051	0.036	0.036	0.033	0.038	0.040
	LD	Mean	0.040	0.200	0.200	0.200	0.360	0.160	0.170	0.170	0.250	0.250
		SD	0.020	0.043	0.040	0.040	0.048	0.037	0.038	0.036	0.042	0.044
1000	PCS	Mean	0.044	0.197	0.197	0.197	0.365	0.164	0.170	0.170	0.248	0.248
		SD	0.011	0.017	0.014	0.014	0.018	0.013	0.014	0.013	0.015	0.015
	HD	Mean	0.034	0.203	0.202	0.202	0.359	0.156	0.170	0.170	0.252	0.252
		SD	0.005	0.015	0.013	0.013	0.016	0.011	0.012	0.012	0.013	0.014
	SCS	Mean	0.038	0.210	0.210	0.210	0.332	0.166	0.169	0.169	0.248	0.248
		SD	0.006	0.015	0.014	0.014	0.016	0.014	0.013	0.011	0.013	0.014
	LD	Mean	0.040	0.200	0.200	0.200	0.360	0.160	0.170	0.170	0.250	0.250
		SD	0.006	0.015	0.013	0.013	0.016	0.012	0.012	0.011	0.013	0.014
10,000	PCS	Mean	0.044	0.197	0.196	0.196	0.367	0.164	0.170	0.170	0.248	0.248
		SD	0.002	0.006	0.007	0.007	0.010	0.007	0.006	0.005	0.007	0.008
	HD	Mean	0.034	0.203	0.202	0.202	0.359	0.156	0.171	0.171	0.252	0.252
		SD	0.002	0.005	0.004	0.004	0.005	0.004	0.004	0.004	0.004	0.005
	SCS	Mean	0.038	0.210	0.210	0.210	0.332	0.166	0.169	0.169	0.248	0.248
		SD	0.002	0.005	0.004	0.004	0.005	0.007	0.006	0.004	0.006	0.006
	LD	Mean	0.040	0.200	0.200	0.200	0.360	0.160	0.170	0.170	0.250	0.250
		SD	0.002	0.005	0.004	0.004	0.005	0.004	0.004	0.004	0.004	0.004


A referee suggested that in certain cases interest may center on smaller samples. We therefore generated $2 \times 3$ tables with fixed total sample sizes of 50 and 70 observations. Table 3 and Table 4 describe the results when the contingency tables were generated under a balanced and an imbalanced design, corresponding to Scenarios Ib and IIb, respectively. More precisely, Table 3 presents the estimates of the marginal row and column probabilities obtained when the $P C S$ , $H D$ , $S C S$ and $L D$ distances are used. We notice that increasing the sample size decreases the overall absolute bias in estimation, defined as $\sum_{ℓ = 1}^{L} | {\hat{θ}}_{ℓ} - θ_{0, ℓ} |$ , where ${\hat{θ}}_{ℓ}$ is the estimate of the ℓ-th component of an $L \times 1$ vector $θ$ and $θ_{0, ℓ}$ is the corresponding true value. In our case, $θ^{T} = (m_{β_{1}}, m_{β_{2}}, π_{x_{1}}, π_{x_{2}}, π_{x_{3}})$ . This observation applies to all distances used in our calculations. Table 4 presents results associated with the imbalanced case. The generated $2 \times 3$ tables contain two empty cells ( $n_{12} = n_{21} = 0$ ). Once again, for calculating the $L D$ , the empty cells were set to $n_{12} = n_{21} = 10^{- 8}$ . We notice that the bias associated with the estimates is rather large for all the distances, and an increased sample size does not alleviate the observed bias. Basu and Basu [9] have proposed an empty cell penalty for the minimum power-divergence estimators. This penalty leads to estimators with improved small sample properties. See also Alin and Kurt [36] for a discussion of the need for penalization in small samples.
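The overall absolute bias defined above is a plain sum of component-wise absolute errors. A small Python sketch follows; the function name is illustrative, the estimates are the PCS means for $N = 50$ reported in Table 3, and the true values $(0.5, 0.5, 1/3, 1/3, 1/3)$ are an assumption inferred from the balanced design rather than stated explicitly in the text.

```python
def overall_absolute_bias(estimates, truth):
    """Sum over components of |theta_hat - theta_0|."""
    return sum(abs(e - t) for e, t in zip(estimates, truth))

# PCS means for N = 50 from Table 3: (m_beta1, m_beta2, pi_x1, pi_x2, pi_x3).
pcs_means_n50 = [0.5008, 0.4992, 0.3339, 0.3336, 0.3325]
# Assumed true values under the balanced 2 x 3 design.
assumed_truth = [0.5, 0.5, 1 / 3, 1 / 3, 1 / 3]

bias = overall_absolute_bias(pcs_means_n50, assumed_truth)
print(round(bias, 4))
```

The result is close to the tabulated value of 0.0034; a small discrepancy is expected because the table entries are rounded to four decimals before being summed.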

Table 3.

Scenario Ib: Means and Biases of 4 distances ( $P C S, H D, S C S, L D$ ). A $2 \times 3$ contingency table was generated having fixed the total sample size N under a balanced design with $n_{i j} \neq 0, \forall i = 1, 2, j = 1, 2, 3$ . The number of MC replications used is 10,000.

N	Statistical Distance	Summary	Estimates
			Means and Biases over 10,000 Replications
			${\hat{m}}_{β_{1}}$	${\hat{m}}_{β_{2}}$	${\hat{π}}_{x_{1}}$	${\hat{π}}_{x_{2}}$	${\hat{π}}_{x_{3}}$
50	PCS	Mean	0.5008	0.4992	0.3339	0.3336	0.3325
		Abs.Biases	0.0008	0.0008	0.0006	0.0003	0.0009
		Overall Bias			0.0034
	HD	Mean	0.5008	0.4992	0.3339	0.3335	0.3326
		Abs.Biases	0.0008	0.0008	0.0006	0.0002	0.0007
		Overall Bias			0.0031
	SCS	Mean	0.5007	0.4993	0.3338	0.3335	0.3326
		Abs.Biases	0.0007	0.0007	0.0005	0.0002	0.0007
		Overall Bias			0.0028
	LD	Mean	0.5008	0.4992	0.3339	0.3335	0.3326
		Abs.Biases	0.0008	0.0008	0.0006	0.0002	0.0008
		Overall Bias			0.0032
70	PCS	Mean	0.4998	0.5002	0.3333	0.3331	0.3337
		Abs.Biases	0.0002	0.0002	0.0001	0.0003	0.0003
		Overall Bias			0.0011
	HD	Mean	0.4998	0.5002	0.3333	0.3330	0.3336
		Abs.Biases	0.0002	0.0002	0.0000	0.0003	0.0003
		Overall Bias			0.0009
	SCS	Mean	0.4998	0.5002	0.3334	0.3331	0.3335
		Abs.Biases	0.0002	0.0002	0.0000	0.0002	0.0002
		Overall Bias			0.0008
	LD	Mean	0.4999	0.5001	0.3333	0.3330	0.3336
		Abs.Biases	0.0001	0.0001	0.0000	0.0003	0.0003
		Overall Bias			0.0009


Table 4.

Scenario IIb: Means and Biases of 4 distances ( $P C S, H D, S C S, L D$ ). A $2 \times 3$ contingency table was generated having fixed the total sample size N under an imbalanced design with $n_{12} = n_{21} = 0$ . The number of MC replications used is 10,000.

N	Statistical Distance	Summary	Estimates
			Means and Biases over 10,000 Replications
			${\hat{m}}_{β_{1}}$	${\hat{m}}_{β_{2}}$	${\hat{π}}_{x_{1}}$	${\hat{π}}_{x_{2}}$	${\hat{π}}_{x_{3}}$
50	PCS	Mean	0.6391	0.3609	0.3489	0.2278	0.4234
		Abs.Biases	0.0276	0.0276	0.0155	0.0611	0.0766
		Overall Bias			0.2084
	HD	Mean	0.7815	0.2185	0.3346	0.0497	0.6157
		Abs.Biases	0.1149	0.1149	0.0013	0.1170	0.1157
		Overall Bias			0.4638
	SCS	Mean	0.6420	0.3580	0.3510	0.2726	0.3765
		Abs.Biases	0.0247	0.0247	0.0176	0.1059	0.1235
		Overall Bias			0.2964
	LD	Mean	0.6677	0.3323	0.3342	0.1660	0.4998
		Abs.Biases	0.0010	0.0010	0.0009	0.0007	0.0002
		Overall Bias			0.0038
70	PCS	Mean	0.6377	0.3623	0.3483	0.2297	0.4220
		Abs.Biases	0.0290	0.0290	0.0150	0.0631	0.0780
		Overall Bias			0.2141
	HD	Mean	0.7812	0.2188	0.3328	0.0491	0.6180
		Abs.Biases	0.1145	0.1145	0.0005	0.1175	0.1180
		Overall Bias			0.4650
	SCS	Mean	0.6395	0.3605	0.3505	0.2739	0.3756
		Abs.Biases	0.0271	0.0271	0.0172	0.1072	0.1244
		Overall Bias			0.3030
	LD	Mean	0.6657	0.3343	0.3331	0.1671	0.4998
		Abs.Biases	0.0010	0.0010	0.0002	0.0004	0.0002
		Overall Bias			0.0028


Table 5 provides the results obtained under Scenario III. In this case, the parameter estimates were calculated using the $P C S$ , $H D$ , $S C S$ and $L D$ distances when the $5 \times 5$ contingency table was constructed by fixing the row marginal probabilities so that they were all set at 0.20, that is, $(0.20, 0.20, 0.20, 0.20, 0.20)$ . The column marginals were randomly chosen in the interval $[0, 1]$ so that they sum to 1. In this case, the resulting column marginal probabilities were $(0.1472, 0.2365, 0.3196, 0.2370, 0.0597)$ . The simulation study reveals that the estimates of the parameters $m_{β} (y | x)$ ’s and $π_{x}$ ’s do not differ substantially from the respective row and column marginal probabilities for any of the four distances utilized. The SDs are approximately the same across distances and decrease as N increases.

Table 5.

Scenario III: Means and SDs of 4 distances ( $P C S, H D, S C S, L D$ ). A $5 \times 5$ contingency table was generated having fixed the row marginal probabilities at (0.20, 0.20, 0.20, 0.20, 0.20). The number of MC replications used is 10,000.

N	Statistical Distance	Summary	Estimates
			Means and SDs over 10,000 Replications
			${\hat{m}}_{β_{1}}$	${\hat{m}}_{β_{2}}$	${\hat{m}}_{β_{3}}$	${\hat{m}}_{β_{4}}$	${\hat{m}}_{β_{5}}$	${\hat{π}}_{x_{1}}$	${\hat{π}}_{x_{2}}$	${\hat{π}}_{x_{3}}$	${\hat{π}}_{x_{4}}$	${\hat{π}}_{x_{5}}$
100	PCS	Mean	0.199	0.200	0.200	0.200	0.201	0.153	0.230	0.302	0.229	0.086
		SD	0.037	0.037	0.037	0.037	0.037	0.034	0.039	0.043	0.039	0.023
	HD	Mean	0.200	0.200	0.200	0.200	0.200	0.147	0.230	0.311	0.230	0.082
		SD	0.039	0.040	0.039	0.039	0.040	0.033	0.043	0.037	0.042	0.019
	SCS	Mean	0.200	0.200	0.200	0.200	0.200	0.153	0.230	0.302	0.230	0.085
		SD	0.039	0.085	0.038	0.038	0.038	0.033	0.039	0.043	0.039	0.022
	LD	Mean	0.200	0.200	0.200	0.200	0.200	0.150	0.230	0.307	0.230	0.083
		SD	0.038	0.038	0.038	0.038	0.038	0.033	0.041	0.045	0.040	0.019
1000	PCS	Mean	0.200	0.200	0.200	0.200	0.200	0.148	0.236	0.319	0.236	0.061
		SD	0.013	0.013	0.013	0.013	0.014	0.012	0.014	0.017	0.015	0.011
	HD	Mean	0.200	0.200	0.200	0.200	0.200	0.147	0.237	0.320	0.237	0.059
		SD	0.013	0.013	0.013	0.013	0.013	0.011	0.014	0.015	0.014	0.008
	SCS	Mean	0.200	0.200	0.200	0.200	0.200	0.148	0.236	0.319	0.237	0.060
		SD	0.015	0.015	0.015	0.015	0.015	0.011	0.014	0.016	0.014	0.013
	LD	Mean	0.200	0.200	0.200	0.200	0.200	0.147	0.237	0.320	0.237	0.059
		SD	0.013	0.013	0.013	0.013	0.013	0.011	0.014	0.015	0.013	0.008
10,000	PCS	Mean	0.200	0.200	0.200	0.200	0.200	0.147	0.236	0.320	0.237	0.060
		SD	0.006	0.006	0.006	0.006	0.006	0.008	0.006	0.011	0.006	0.008
	HD	Mean	0.200	0.200	0.200	0.200	0.200	0.147	0.236	0.320	0.237	0.060
		SD	0.004	0.004	0.004	0.004	0.004	0.004	0.004	0.005	0.004	0.002
	SCS	Mean	0.200	0.200	0.200	0.200	0.200	0.147	0.236	0.320	0.237	0.060
		SD	0.005	0.005	0.005	0.005	0.005	0.004	0.006	0.008	0.006	0.008
	LD	Mean	0.200	0.200	0.200	0.200	0.200	0.147	0.236	0.320	0.237	0.060
		SD	0.004	0.004	0.004	0.004	0.004	0.004	0.005	0.005	0.005	0.002


Finally, Table 6 reports the results for Scenario IV, in which the row marginal probabilities were fixed at unequal values, while the column marginals were randomly chosen in the interval $[0, 1]$ so that they sum to 1. In particular, the row marginal probabilities were fixed at values $(0.04, 0.20, 0.20, 0.20, 0.36)$ , while the column marginals used were $(0.2171, 0.1676, 0.2347, 0.1178, 0.2628)$ . When $N = 100$ , the value of ${\hat{m}}_{β} (y_{1} | x) = {\hat{m}}_{β_{1}}$ is approximately 0.07, rather than the true value 0.04, for all distances. However, when $N = 1000$ or $N =$ 10,000, we obtain better estimates irrespective of the choice of disparity measure. The SDs are approximately the same and they become smaller as the sample size increases.

Table 6.

Scenario IV: Means and SDs of 4 distances ( $P C S, H D, S C S, L D$ ). A $5 \times 5$ contingency table was generated having fixed the row marginal probabilities at (0.04, 0.20, 0.20, 0.20, 0.36). The number of MC replications used is 10,000.

N	Statistical Distance	Summary	Estimates
			Means and SDs over 10,000 Replications
			${\hat{m}}_{β_{1}}$	${\hat{m}}_{β_{2}}$	${\hat{m}}_{β_{3}}$	${\hat{m}}_{β_{4}}$	${\hat{m}}_{β_{5}}$	${\hat{π}}_{x_{1}}$	${\hat{π}}_{x_{2}}$	${\hat{π}}_{x_{3}}$	${\hat{π}}_{x_{4}}$	${\hat{π}}_{x_{5}}$
100	PCS	Mean	0.074	0.197	0.197	0.197	0.335	0.214	0.173	0.228	0.132	0.253
		SD	0.022	0.037	0.038	0.038	0.045	0.038	0.035	0.039	0.031	0.041
	HD	Mean	0.070	0.194	0.195	0.195	0.346	0.215	0.170	0.231	0.126	0.258
		SD	0.015	0.039	0.039	0.039	0.048	0.041	0.037	0.042	0.030	0.044
	SCS	Mean	0.074	0.194	0.195	0.195	0.342	0.214	0.173	0.229	0.131	0.253
		SD	0.015	0.039	0.039	0.039	0.048	0.038	0.035	0.040	0.030	0.041
	LD	Mean	0.071	0.195	0.196	0.196	0.342	0.214	0.172	0.230	0.128	0.256
		SD	0.015	0.037	0.038	0.038	0.046	0.040	0.036	0.041	0.030	0.042
1000	PCS	Mean	0.042	0.200	0.200	0.200	0.358	0.217	0.168	0.234	0.119	0.262
		SD	0.011	0.014	0.013	0.013	0.017	0.014	0.013	0.014	0.014	0.015
	HD	Mean	0.039	0.200	0.200	0.200	0.361	0.217	0.167	0.235	0.118	0.263
		SD	0.006	0.013	0.013	0.013	0.015	0.013	0.012	0.013	0.010	0.014
	SCS	Mean	0.039	0.200	0.200	0.200	0.361	0.217	0.168	0.234	0.118	0.263
		SD	0.007	0.013	0.013	0.013	0.016	0.016	0.013	0.014	0.010	0.015
	LD	Mean	0.040	0.200	0.200	0.200	0.360	0.217	0.167	0.235	0.118	0.263
		SD	0.006	0.013	0.013	0.013	0.015	0.013	0.012	0.013	0.010	0.014
10,000	PCS	Mean	0.040	0.200	0.200	0.200	0.360	0.217	0.167	0.235	0.118	0.263
		SD	0.008	0.005	0.007	0.007	0.009	0.006	0.005	0.005	0.007	0.006
	HD	Mean	0.040	0.200	0.200	0.200	0.360	0.217	0.167	0.235	0.118	0.263
		SD	0.002	0.004	0.004	0.004	0.005	0.004	0.004	0.004	0.003	0.004
	SCS	Mean	0.040	0.200	0.200	0.200	0.360	0.217	0.167	0.235	0.118	0.263
		SD	0.002	0.004	0.004	0.004	0.005	0.006	0.005	0.007	0.003	0.008
	LD	Mean	0.040	0.200	0.200	0.200	0.360	0.217	0.167	0.235	0.118	0.263
		SD	0.002	0.004	0.004	0.004	0.005	0.004	0.004	0.005	0.003	0.005


We also notice from Table 1, Table 5 and Table 6 that, in all cases, the standard deviation of the estimates obtained with distances other than the likelihood disparity is approximately the same as the standard deviation of the likelihood estimates, which illustrates the asymptotic efficiency of the disparity estimators.

All calculations were performed using the R language. Given that the problem described in this section can be viewed as a general non-linear optimization problem, the solnp function of the Rsolnp package (Ye [37]) was used to obtain the aforementioned estimates. For our calculations, we tried a variety of initial values ( ${\hat{π}}_{x}^{(0)}$ ’s and ${\hat{m}}_{β}^{(0)} (y | x)$ ’s); we notice that, no matter how the initial values were chosen, the estimates were always very similar and very close to the observed values ( $n_{i •} / N$ and $n_{• j} / N$ for $i, j = 1, 2, 3, 4, 5$ ). Only the number of iterations needed for convergence was slightly affected. Consequently, random numbers from a Uniform distribution in the interval $[0, 1]$ were set as initial values (which did not necessarily sum to 1). The solnp function has a built-in stopping rule and there was no need to set our own stopping rule. We only set the boundary constraints to be in the interval $[0, 1]$ for all estimates, which were also subject to $\sum π_{x} = \sum m_{β} (y | x) = 1$ .

Other functions may also be used to obtain the estimates. For example, we used the auglag function of the nloptr package with local solvers “lbfgs” or “SLSQP” (Conn et al. [38], Birgin and Martínez [39]), which implements an augmented Lagrangian method. However, convergence using the solnp function (the number of iterations was on average 2) was much faster than using the auglag function (the average number of iterations was approximately 100). For this reason, the results presented in Table 1, Table 2, Table 3, Table 4, Table 5 and Table 6 were based only on the function solnp.
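For readers who want to experiment with the constrained minimization without a general-purpose solver, the Hellinger case admits a simple alternative. This is a sketch under stated assumptions and is not the solnp procedure used for the tables: it assumes the model factorizes as $m_{β} (y) π_{x}$ (one probability vector for the rows and one for the columns), and it uses the identity $H D = 4 - 4 \sum \sqrt{d (x, y)} \sqrt{m_{β} (y)} \sqrt{π_{x}}$ , which holds when both $d$ and the model sum to 1. Minimizing $H D$ is then equivalent to maximizing a bilinear form over unit vectors $a_{y} = \sqrt{m_{β} (y)}$ and $b_{x} = \sqrt{π_{x}}$ , which can be done by alternating normalized updates (a power iteration). The function name and the iteration count are hypothetical choices.

```python
import math

def min_hellinger_independence(d, n_iter=200):
    """Minimum Hellinger distance fit of m_y * pi_x to proportions d[y][x].

    d is a list of rows (index y) of cell proportions summing to 1.
    Returns (m_hat, pi_hat), each summing to 1.
    """
    n_rows, n_cols = len(d), len(d[0])
    s = [[math.sqrt(v) for v in row] for row in d]  # S = elementwise sqrt(d)
    b = [1.0 / math.sqrt(n_cols)] * n_cols          # b_x = sqrt(pi_x), unit norm
    a = [1.0 / math.sqrt(n_rows)] * n_rows          # a_y = sqrt(m_y), unit norm
    for _ in range(n_iter):
        # For fixed b, the unit vector a maximizing a' S b is S b normalized.
        a = [sum(s[y][x] * b[x] for x in range(n_cols)) for y in range(n_rows)]
        norm = math.sqrt(sum(v * v for v in a))
        a = [v / norm for v in a]
        # For fixed a, the unit vector b maximizing a' S b is S' a normalized.
        b = [sum(s[y][x] * a[y] for y in range(n_rows)) for x in range(n_cols)]
        norm = math.sqrt(sum(v * v for v in b))
        b = [v / norm for v in b]
    return [v * v for v in a], [v * v for v in b]

# Marginals of Scenario IV; d is built to satisfy the model exactly.
m_true = [0.04, 0.20, 0.20, 0.20, 0.36]
pi_true = [0.2171, 0.1676, 0.2347, 0.1178, 0.2628]
d_exact = [[my * px for px in pi_true] for my in m_true]

m_hat, pi_hat = min_hellinger_independence(d_exact)
print([round(v, 4) for v in m_hat])
print([round(v, 4) for v in pi_hat])
```

When the observed proportions satisfy the model exactly, as in this example, the procedure recovers the generating marginals. Because the estimates are squares of unit-norm vectors, the constraints $\sum π_{x} = \sum m_{β} (y) = 1$ hold automatically. For the other disparities no such shortcut is available, which is one reason a general constrained optimizer is a natural choice.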

Case 2: X is discrete and Y is continuous

In this section, we are interested in solving the optimization problem (5) when X is discrete, Y is continuous and $X, Y$ are independent of each other. To evaluate the performance of our procedure, we used Hellinger’s distance, which in this case takes on the following form:

H D (f^{*}, m_{β}^{*}) = \int \sum_{x} {[\sqrt{f_{N}^{*} (x, y)} - \sqrt{m_{β}^{*} (x, y)}]}^{2} d y = \int \sum_{x} {[\sqrt{f_{Y}^{*} (y) \cdot \frac{n_{X}}{N}} - \sqrt{m_{X} (x) \cdot m_{Y}^{*} (y)}]}^{2} d y .

The aim of this simulation is to obtain the minimum Hellinger distance estimators of $π_{x}$ and $μ$ assuming (without loss of generality) that $σ^{2}$ is known to be equal to 1. All calculations were performed in the R language.

For this purpose, we generated mixed-type data of size N using the package OrdNor (Amatya and Demirtas [40]). More precisely, the data comprise one categorical variable X with three levels and probability vector $(1 / 3, 1 / 3, 1 / 3)$ , while the continuous part comes from a trivariate normal distribution; in symbols, $Y = (Y_{1}, Y_{2}, Y_{3}) \sim M V N_{3} (μ, I_{3})$ , where $μ^{T} = (μ_{1}, μ_{2}, μ_{3})$ . We used two different mean vectors: $μ^{T} = (0, 0, 0)$ and $μ^{T} = (0, 3, 6)$ . The ordinal and normal variables were generated concurrently using an overall correlation matrix $Σ$ , which consists of three components/sub-matrices: $Σ_{O O}$ , $Σ_{O N}$ and $Σ_{N N}$ , with O and N corresponding to “Ordinal” and “Normal” variables, respectively. More precisely, the overall correlation matrix $Σ$ used is the following

Σ = (\begin{matrix} 1 & ρ_{O N} & ρ_{O N} & ρ_{O N} \\ ρ_{O N} & 1 & 0 & 0 \\ ρ_{O N} & 0 & 1 & 0 \\ ρ_{O N} & 0 & 0 & 1 \end{matrix}),

where $Σ_{O O} = 1$ , $Σ_{N N} = I_{3}$ , $Σ_{O N} = (\begin{matrix} ρ_{O N} & ρ_{O N} & ρ_{O N} \end{matrix})$ and $ρ_{O N}$ represents the polyserial correlations for the $O N$ combinations (for more information on polyserial correlations refer to Olsson et al. [41]). Since $X, Y$ were assumed to be independent, we set $ρ_{O N} = 0.0$ . However, we also used weak correlations, namely $ρ_{O N} = 0.1$ and $0.2$ , to investigate whether the estimates we obtain in these cases remain reasonable.

The kernel function was the multivariate normal density $M V N_{3} (0, H)$ , with $H$ estimated from the data using the kde function of the ks package (Duong [42]); $m_{Y}^{*} (y)$ represented the multivariate normal density $M V N_{3} (μ, Σ + H)$ and $m_{X} (x)$ was the multinomial mass function. This choice of smoothing parameter stemmed from the fact that we were interested in evaluating the performance, in terms of robustness, of standard bandwidth selection.

To solve the optimization problem, the solnp function of the Rsolnp package (Ye [37]) was used. Specifically, the initial values set for the probabilities $π_{x_{1}}, π_{x_{2}}, π_{x_{3}}$ associated with the X variable were random uniform numbers in the interval $[0, 1]$ , while the initial values for the means $μ_{y_{1}}, μ_{y_{2}}, μ_{y_{3}}$ were random numbers in the interval $[Q 1 (Y_{i}), Q 3 (Y_{i})]$ for $i = 1, 2, 3$ , where $Q 1$ and $Q 3$ stand for the 25th and 75th percentiles, respectively, of each component of the continuous part. Following the procedure of Basu and Lindsay [2] for the univariate continuous case, the numerical evaluation of the integrals in the mixed case was also based on Simpson’s 1/3 rule, using the sintegral function of the Bolstad2 package (Bolstad [43]). Moreover, we calculated the mean values, the SDs, as well as the percentages of bias of the mean and the probability vectors for three different sample sizes, $N = 50$ , $N = 100$ and $N = 1000$ , over 1000 MC replications. The bias is defined as the difference of the estimates from their “true” values, that is, $b i a s (μ_{y_{i}}) = {\hat{μ}}_{y_{i}} - μ_{i}$ and $b i a s (π_{x_{i}}) = {\hat{π}}_{x_{i}} - 1 / 3$ for $i = 1, 2, 3$ . The results are shown in Table 7 and Table 8.
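To make the role of Simpson's 1/3 rule concrete, the sketch below integrates a squared Hellinger integrand numerically in a simplified univariate setting. It is an assumption-laden illustration rather than the computation used for the tables: it compares two unit-variance normal densities (no kernel smoothing, no discrete component), and the function names, integration limits and number of subintervals are hypothetical. For two normal densities with common variance 1 the integral has the closed form $2 - 2 \exp (- (μ_{1} - μ_{2})^{2} / 8)$ , which provides a check.

```python
import math

def simpson(f, a, b, n=1000):
    """Composite Simpson 1/3 rule on [a, b] using an even number n of subintervals."""
    if n % 2:
        n += 1  # the 1/3 rule needs an even number of subintervals
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

def normal_pdf(y, mu, sigma=1.0):
    """Density of N(mu, sigma^2) at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def squared_hellinger(mu1, mu2, lo=-12.0, hi=12.0):
    """Integral of (sqrt(f) - sqrt(g))^2 for two unit-variance normal densities."""
    def integrand(y):
        return (math.sqrt(normal_pdf(y, mu1)) - math.sqrt(normal_pdf(y, mu2))) ** 2
    return simpson(integrand, lo, hi)

numeric = squared_hellinger(0.0, 3.0)
closed_form = 2.0 - 2.0 * math.exp(-(3.0 ** 2) / 8.0)
print(numeric, closed_form)
```

The two printed values should agree to several decimal places. In the mixed-scale problem the same rule would be applied to the integrand of the Hellinger distance above for each level of X, with the integration limits chosen wide enough to cover the data.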

Table 7.

Means, Absolute Biases and Overall Absolute Bias of the Hellinger’s distance ( $H D$ ). The data were concurrently generated with a given correlation structure (an overall correlation matrix $Σ$ ) and consist of a discrete variable X with marginal probability vector $(1 / 3, 1 / 3, 1 / 3)$ and a continuous vector $Y = (Y_{1}, Y_{2}, Y_{3}) \sim M V N_{3} (μ, I_{3})$ , where $μ^{T} = (0, 0, 0)$ and $I_{3}$ is a $(3 \times 3)$ identity matrix. The number of MC replications used is 1000.

$ρ_{ON}$	N	Summary	Estimates
			Means, Biases over 1000 Replications
			${\hat{π}}_{x_{1}}$	${\hat{π}}_{x_{2}}$	${\hat{π}}_{x_{3}}$	${\hat{μ}}_{y_{1}}$	${\hat{μ}}_{y_{2}}$	${\hat{μ}}_{y_{3}}$
0.0	50	Mean	0.332	0.340	0.329	0.016	0.011	−0.011
		Abs. Biases	0.001	0.007	0.004	0.016	0.011	0.011
		Overall Bias	0.050
	100	Mean	0.330	0.350	0.320	0.017	−0.018	−0.010
		Abs. Biases	0.003	0.017	0.013	0.017	0.018	0.010
		Overall Bias	0.078
	1000	Mean	0.324	0.337	0.339	0.001	−0.008	0.007
		Abs. Biases	0.009	0.004	0.006	0.001	0.008	0.007
		Overall Bias	0.035
0.1	50	Mean	0.351	0.320	0.329	−0.006	0.003	0.005
		Abs. Biases	0.018	0.013	0.004	0.006	0.003	0.005
		Overall Bias	0.049
	100	Mean	0.330	0.323	0.347	0.001	0.005	−0.004
		Abs. Biases	0.003	0.010	0.014	0.001	0.005	0.004
		Overall Bias	0.037
	1000	Mean	0.327	0.343	0.330	−0.021	0.008	0.003
		Abs. Biases	0.006	0.010	0.003	0.021	0.008	0.003
		Overall Bias	0.051


Table 8.

Means, Absolute Biases and Overall Absolute Bias of the Hellinger’s distance ( $H D$ ). The data were concurrently generated with a given correlation structure (an overall correlation matrix $Σ$ ) and consist of a discrete variable X with marginal probability vector $(1 / 3, 1 / 3, 1 / 3)$ and a continuous vector $Y = (Y_{1}, Y_{2}, Y_{3}) \sim M V N_{3} (μ, I_{3})$ , where $μ^{T} = (0, 3, 6)$ and $I_{3}$ is a $(3 \times 3)$ identity matrix. The number of MC replications used is 1000.

$ρ_{ON}$	N	Summary	Estimates
			Means, Biases over 1000 Replications
			${\hat{π}}_{x_{1}}$	${\hat{π}}_{x_{2}}$	${\hat{π}}_{x_{3}}$	${\hat{μ}}_{y_{1}}$	${\hat{μ}}_{y_{2}}$	${\hat{μ}}_{y_{3}}$
0.0	50	Mean	0.340	0.328	0.332	−0.004	2.606	5.227
		Abs. Biases	0.007	0.005	0.001	0.004	0.394	0.773
		Overall Bias	1.184
	100	Mean	0.313	0.350	0.337	−0.004	2.777	5.593
		Abs. Biases	0.020	0.017	0.004	0.004	0.223	0.407
		Overall Bias	0.675
	1000	Mean	0.338	0.334	0.328	0.012	2.972	5.958
		Abs. Biases	0.005	0.001	0.005	0.012	0.028	0.042
		Overall Bias	0.093
0.1	50	Mean	0.347	0.323	0.330	−0.021	2.628	5.249
		Abs. Biases	0.014	0.010	0.003	0.021	0.372	0.751
		Overall Bias	1.171
	100	Mean	0.317	0.343	0.340	0.017	2.817	5.615
		Abs. Biases	0.016	0.010	0.007	0.017	0.183	0.385
		Overall Bias	0.618
	1000	Mean	0.334	0.320	0.346	−0.013	2.988	5.956
		Abs. Biases	0.001	0.013	0.013	0.013	0.012	0.044
		Overall Bias	0.096
0.2	50	Mean	0.324	0.333	0.343	−0.004	2.589	5.240
		Abs. Biases	0.009	0.000	0.010	0.004	0.411	0.760
		Overall Bias	1.194
	100	Mean	0.329	0.350	0.321	0.024	2.763	5.549
		Abs. Biases	0.004	0.017	0.012	0.024	0.237	0.451
		Overall Bias	0.745
	1000	Mean	0.337	0.344	0.319	−0.011	2.971	5.951
		Abs. Biases	0.004	0.011	0.014	0.011	0.029	0.049
		Overall Bias	0.118


In particular, Table 7 reports the mean values, the absolute biases and the overall absolute bias of the corresponding minimum Hellinger distance estimators, over 1000 MC replications, for the different sample sizes and polyserial correlations, when $μ = {(0, 0, 0)}^{T}$ . The estimates for the $π_{x_{i}}$ are approximately equal to $1 / 3 = 0.333$ , while the $μ_{y_{i}}$ estimates are almost zero, even in the cases of weak correlations. When $ρ_{O N} = 0.0$ , the sample size choice does not seem to affect the values of the estimates either overall or per component of the $X, Y$ variables. Specifically, we observe that the total absolute bias, computed as the sum of the individual component-wise absolute biases of the vectors $π^{T} = (π_{1}, π_{2}, π_{3})$ and $μ^{T} = (μ_{1}, μ_{2}, μ_{3})$ , is approximately the same across sample sizes, with larger samples providing slightly smaller biases at the expense of a higher computational cost.

In Table 8, analogous results are presented with the difference that the mean vector used was $μ = {(0, 3, 6)}^{T}$ . The $π_{x_{i}}$ estimates are very close to $1 / 3 (= 0.333)$ for all X components, no matter which sample size or correlation is used. On the contrary, the interpretation of the $μ_{i}$ estimates slightly differs in this case. We also calculated the overall absolute bias as well as the individual, per parameter, absolute biases. In this case, larger samples clearly provide estimates with smaller bias for both parameter vectors $π$ , $μ$ and for both cases, the case of independence as well as the case of weak correlations. However, the computational time increases.

In what follows, we also present, for illustration purposes, a small simulation example using a mixed-type, contaminated data set of size $N = 1000$ , which was generated using the OrdNor package with $ρ_{O N} = 0.0$ . Once again, the data comprised one categorical variable X with three levels and probability vector $(1 / 3, 1 / 3, 1 / 3)$ , and a trivariate continuous vector $Y = (Y_{1}, Y_{2}, Y_{3})$ . The contamination occurs only in the continuous part, for $α \in \{1.00, 0.95, 0.90, 0.85, 0.80\}$ , as follows: $Y \sim α \times M V N_{3} (0, I_{3}) + (1 - α) \times M V N_{3} (μ, I_{3})$ , where $μ^{T} = (3, 3, 3)$ . This means that $N_{1} = α \times N$ data points were generated with Y coming from the multivariate standard normal distribution, and the remaining $N_{2} = N - N_{1}$ data points followed a multivariate normal distribution with mean vector $μ^{T} = (3, 3, 3)$ . When $α = 1.00$ , there is no contamination. Here, we are still considering the same optimization problem as the one described above and, consequently, we are interested in evaluating the minimum Hellinger distance estimators over 1000 MC replications by examining to what extent the contamination level affects these estimates.
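The contamination scheme for the continuous part can be sketched as follows. This is an illustrative Python stand-in, not the OrdNor-based generator used in the study: the function name and the seed are hypothetical, only the continuous vector Y is generated, and the $N_{1}$ clean points are simply followed by the $N_{2}$ contaminating points.

```python
import random

def contaminated_sample(n, alpha, mu_contam, rng):
    """n points: round(alpha * n) from MVN(0, I), the rest from MVN(mu_contam, I)."""
    n_clean = round(alpha * n)
    dim = len(mu_contam)
    # Independent unit-variance components, since the covariance is the identity.
    clean = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_clean)]
    shifted = [[rng.gauss(mu, 1.0) for mu in mu_contam] for _ in range(n - n_clean)]
    return clean + shifted

rng = random.Random(7)  # hypothetical seed
sample = contaminated_sample(1000, 0.95, [3.0, 3.0, 3.0], rng)

# Component-wise sample means; their expected value is 3 * (1 - alpha) = 0.15.
sample_means = [sum(point[k] for point in sample) / len(sample) for k in range(3)]
print(sample_means)
```

The sample mean, which is the maximum likelihood estimate of the location under the normal model, has expected value $3 (1 - α)$ per component under this scheme, that is, 0.15, 0.30, 0.45 and 0.60 for $α = 0.95, 0.90, 0.85$ and $0.80$ . These values are a useful reference point when reading the Hellinger-based means in Table 9, which lie below them at every contamination level.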

As Table 9 indicates, when there is no contamination in the data $(α = 1.00)$ , the estimates for the $π_{x_{i}}$ ’s are almost equal to $1 / 3$ , while the $μ_{y}$ ’s estimates are almost equal to zero. As the data become more contaminated (i.e., the value of $α$ decreases), the minimum disparity estimators corresponding to the X variable remain close to their true values. However, this is not the case with the estimates for the $μ_{y_{i}}$ ’s, which deteriorate as the value of $α$ moves away from the target/null value, that is, $1.00$ .

Table 9.

Means and SDs of the Hellinger’s distance ( $H D$ ). The data were concurrently generated with a given correlation structure (an overall correlation matrix $Σ$ ) and consist of a discrete variable X with marginal probability vector $(1 / 3, 1 / 3, 1 / 3)$ and a continuous trivariate vector $Y = (Y_{1}, Y_{2}, Y_{3}) \sim α \times M V N_{3} (0, I_{3}) + (1 - α) \times M V N_{3} (μ, I_{3})$ , where $μ^{T} = (3, 3, 3)$ , $I_{3}$ is a $(3 \times 3)$ identity matrix and $α = 1.00 (0.05) 0.80$ indicates the contamination level. The number of MC replications used is 1000.

$ρ_{ON}$	N	$α$	Summary	Estimates
				Means and SDs over 1000 Replications
				${\hat{π}}_{x_{1}}$	${\hat{π}}_{x_{2}}$	${\hat{π}}_{x_{3}}$	${\hat{μ}}_{y_{1}}$	${\hat{μ}}_{y_{2}}$	${\hat{μ}}_{y_{3}}$
0.0	1000	1.00	Mean	0.324	0.337	0.339	0.001	−0.008	0.007
			SD	0.293	0.293	0.298	0.378	0.378	0.386
		0.95	Mean	0.327	0.326	0.347	0.068	0.090	0.079
			SD	0.304	0.299	0.309	0.413	0.413	0.413
		0.90	Mean	0.318	0.331	0.351	0.188	0.170	0.189
			SD	0.300	0.305	0.306	0.443	0.450	0.436
		0.85	Mean	0.324	0.337	0.339	0.292	0.283	0.312
			SD	0.293	0.293	0.297	0.484	0.487	0.491
		0.80	Mean	0.324	0.337	0.338	0.447	0.436	0.470
			SD	0.293	0.293	0.297	0.552	0.547	0.559


The mean parameters are estimated with reasonable bias (maximum bias is $9 %$ for the second component of the mean) when $α = 0.95$ , that is, when the contamination is $5 %$ . When the contamination is $10 %$ , the bias of the mean components is relatively high but still below $19 %$ . With higher contamination, the percentage of bias in the mean components is in the interval $[28.3 %, 47 %]$ . This is the result of using standard density estimation to obtain the smoothing parameters for the different mean components. Smaller values of these component smoothing parameters result in substantial bias reduction.

We also looked at the case where the continuous model was contaminated by a trivariate normal with mean $μ^{T} = (1.5, 1.5, 1.5)$ and covariance matrix $I$ . In this case (results not shown), when the contamination is $5 %$ , the maximum bias of the mean components is $6.6 %$ , while when the contamination is $10 %$ , the maximum bias of the mean components is $13.5 %$ . Again, in this case the bandwidth parameters were obtained by fitting a unimodal density to the data.
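For comparison, it may help to see what a non-robust estimator does under the same contamination. The sketch below draws the continuous part of the model only, $α \times M V N_{3} (0, I_{3}) + (1 - α) \times M V N_{3} (μ, I_{3})$ , and computes the ordinary sample mean, whose expectation is $(1 - α) μ$ . It is an illustration under simplifying assumptions: the discrete variable X and the correlation structure are ignored, the sampler `sample_contaminated` is hypothetical and is not the OrdNor-based generator used for Table 9, and the sample size is chosen much larger than 1000 so that the bias is visible above the sampling noise.

```python
import numpy as np


def sample_contaminated(n, alpha, mu, rng):
    """Draw n points from alpha * MVN(0, I) + (1 - alpha) * MVN(mu, I)."""
    mu = np.asarray(mu, dtype=float)
    # Each row is contaminated with probability 1 - alpha.
    contaminated = rng.random(n) >= alpha
    y = rng.standard_normal((n, mu.size))
    y[contaminated] += mu  # shift only the contaminated rows
    return y


rng = np.random.default_rng(2021)
n = 200_000  # illustrative; far larger than the N = 1000 of Table 9
results = {}
for mu_value in (3.0, 1.5):
    for alpha in (0.95, 0.90):
        y = sample_contaminated(n, alpha, [mu_value] * 3, rng)
        results[(mu_value, alpha)] = y.mean(axis=0)
        expected = (1 - alpha) * mu_value
        print(mu_value, alpha, np.round(results[(mu_value, alpha)], 3),
              round(expected, 3))
```

The expected sample-mean shifts are 0.15 and 0.30 for mean components equal to 3, and 0.075 and 0.15 for mean components equal to 1.5, at 5% and 10% contamination respectively. These can be set against the maximum biases reported above for the minimum disparity estimates (0.090 and 0.189, and 0.066 and 0.135).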

The above results are not surprising. A judicious selection of the smoothing parameter decreases the bias of the component estimates of the mean. Agostinelli and Markatou [44] provide suggestions of how to select the smoothing parameter that can be extended and applied in this context.

8. Discussion and Conclusions

In this paper, we discuss Pearson residual systems that conform to the measurement scale of the data. We place emphasis on the mixed-scale measurements scenario, which is equivalent to having both discrete (categorical or nominal) and continuous type random variables, and obtain robust estimators of the parameters of the joint probability distribution that describes those variables. We show that disparity methods can be used to guard against model misspecification and the presence of outliers, and that these methods provide reasonable results.

The scale and nature of measurement of the data impose additional challenges, both computationally and statistically. Detecting outliers in this multidimensional space is an open research question (Eiras-Franco et al. [45]). The concept of outliers has a long history in the field of statistics, and outlier detection methods have broad applications in many scientific fields such as security (Diehl and Hampshire [46], Portnoy et al. [47]), health care (Tran et al. [48]) and insurance (Konijn and Kowalczyk [49]), to mention just a few.

Classical outlier detection methods are largely designed for single measurement scale data. Handling mixed measurement scales is a challenge, with only a few works coming from both the field of statistics (Fraley and Wilkinson [50], Wilkinson [51]) and the fields of engineering and computer science (Do et al. [52], Koufakou et al. [53]). All these works use some version of a probabilistic outlier, either by looking for regions in the space of data that have low density (Do et al. [52], Koufakou et al. [53]) or by attaching a probability, under a model, to the suspicious data point (Fraley and Wilkinson [50], Wilkinson [51]).

Our concept of a probabilistic outlier discussed here and expressed via the construction of appropriate Pearson residuals can unify the different measurement scales, and the class of disparity functions discussed above can provide estimators for the model parameters that are not influenced unduly by potential outliers.
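To make the notion of a probabilistic outlier concrete, the sketch below computes Pearson residuals $δ = d / m - 1$ for a purely discrete example, where $d$ is the observed proportion and $m$ the model probability of a cell. The counts, the model probabilities and the cut-off are all hypothetical and chosen for illustration; the mixed-scale residuals of this paper additionally involve a kernel density estimate for the continuous part, which is not reproduced here.

```python
import numpy as np

# Hypothetical counts over four categories and hypothetical model probabilities.
counts = np.array([29, 48, 18, 5])
model_probs = np.array([0.30, 0.50, 0.19, 0.01])

d = counts / counts.sum()        # empirical proportions
delta = d / model_probs - 1.0    # Pearson residuals: delta = d / m - 1

# Flag cells whose observed proportion far exceeds the model probability.
THRESHOLD = 2.0  # illustrative cut-off, not taken from the paper
flagged = np.flatnonzero(delta > THRESHOLD)

print(np.round(delta, 3), flagged)
```

Here the last cell receives five observations although the model assigns it probability 0.01, so its residual is large and positive, whereas the residuals of the other cells stay close to zero.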

One of the important parameters that controls the robustness of these methods is the smoothing parameter(s) used to compute the density estimator of the continuous part of the model. In our computations, we use standard smoothing parameters obtained from appropriate R functions for density estimation. The results show that, depending on the level of contamination and the type of contaminating probability model, the performance of the methods is satisfactory. Specifically, a small simulation study using the model reported in the caption of Table 9 shows that the overall bias associated with the mean components of the standard multivariate normal model is low when the contamination by a multivariate normal model with mean components equal to 3 is less than or equal to $10 %$ . When the percentage of contamination is greater than $10 %$ , however, the bias increases if the smoothing parameter used is the one obtained from the R density function. Here, smaller values of the smoothing parameter guarantee reduction of the bias.
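As an illustration of what a "standard" smoothing parameter may look like, the sketch below reimplements the rule of thumb used by default in R's `density()` function (`bw.nrd0`), namely $0.9 \min (s, I Q R / 1.34) n^{- 1 / 5}$ . This is an assumption: the text does not state which R selector was used, and a multivariate selector such as those in the ks package [42] may have been used instead. The fallback for a degenerate spread is simplified relative to R, and the list of smaller candidate bandwidths is purely illustrative.

```python
import numpy as np


def bw_nrd0(x):
    """Rule-of-thumb bandwidth: 0.9 * min(sd, IQR / 1.34) * n^(-1/5)."""
    x = np.asarray(x, dtype=float)
    sd = x.std(ddof=1)
    q75, q25 = np.percentile(x, [75, 25])
    spread = min(sd, (q75 - q25) / 1.34)
    if spread == 0:  # simplified fallback when the IQR is degenerate
        spread = sd
    return 0.9 * spread * x.size ** (-0.2)


x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
h = bw_nrd0(x)

# Smaller smoothing parameters that one might try when contamination is suspected.
candidate_bandwidths = [h, h / 2, h / 4]
print(candidate_bandwidths)
```

Refitting the minimum disparity estimator over such a grid of bandwidths is one simple way to explore how sensitive the estimates of the mean components are to the amount of smoothing.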

Devising rules for selecting the smoothing parameter(s) in the context of mixed-scale measurements that can guarantee robustness for larger than $5 %$ levels of contamination may be possible. However, it is the opinion of the authors that greater levels of data inhomogeneity may indicate model failure, a case where assessing model goodness of fit is of importance.

Abbreviations

The following abbreviations are used in this manuscript:

ALT	Alanine Aminotransferase
HD	Twice-Squared Hellinger’s Disparity
LD	Likelihood Disparity
MC	Monte Carlo Replications
MDE	Minimum Distance Estimators
MLE	Maximum Likelihood Estimator
PCS	Pearson’s Chi-Squared Disparity Divided by 2
PWD	Power Divergence Disparity
RAF	Residual Adjustment Function
SCS	Symmetric Chi-Squared Disparity
SD	Standard Deviation


Appendix A

Appendix A.1. Proof of Proposition 3

Proof.

Equations (4) are obtained by solving optimization problem (3). To solve this problem we need to form the corresponding Lagrangian, which is

$\sum_{x, y} G (δ (x, y)) m_{β} (y | x) π_{x} - λ (\sum π_{x} - 1) .$

(i) Let $\nabla_{β}$ denote gradient with respect to $β$ . The estimators of $β$ are obtained as solutions of the set of equations:

$\nabla_{β} \{\sum_{x, y} G (δ (x, y)) m_{β} (y | x) π_{x} - λ (\sum π_{x} - 1)\} = 0,$

which can be equivalently expressed as follows,

$\sum_{x, y} π_{x} [\nabla_{β} G (δ (x, y))] m_{β} (y | x) + \sum_{x, y} π_{x} G (δ (x, y)) \nabla_{β} m_{β} (y | x) = 0 .$

Notice that the $\nabla_{β}$ of $G (δ (x, y))$ is given by

$\nabla_{β} G (δ (x, y)) = - G^{'} (δ (x, y)) (δ (x, y) + 1) u (y | x; β),$

where the prime denotes the derivative with respect to $δ$ , $δ (x, y)$ is the Pearson residual and

$u (y | x; β) = \frac{\nabla_{β} m_{β} (y | x)}{m_{β} (y | x)} = \nabla_{β} ln [m_{β} (y | x)]$

is the score for $β$ in the conditional distribution of y given x. Therefore,

$\sum_{x, y} A (δ (x, y)) π_{x} u (y | x; β) m_{β} (y | x) = 0,$

where

$A (δ (x, y)) = G^{'} (δ (x, y)) [δ (x, y) + 1] - G (δ (x, y)) .$

By making use of the fact that $\sum_{x, y} π_{x} \nabla_{β} m_{β} (y | x) = 0$ , the resulting equations can be represented as

$\sum_{x, y} \frac{A (δ (x, y)) + 1}{δ (x, y) + 1} n_{x, y} u (y | x; β) = 0,$

or equivalently,

$\sum_{x, y} w (δ (x, y)) n_{x, y} u (y | x; β) = 0 .$

Without loss of generality, we can take,

$w (δ (x, y)) = min \{\frac{{[A (δ (x, y)) + 1]}^{+}}{δ (x, y) + 1}, 1\}, w (δ (x, y)) \leq 1 .$
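The relation between $G$ , the residual adjustment function $A$ and the weight $w$ can be checked numerically. The sketch below evaluates $A (δ) = G^{'} (δ) (δ + 1) - G (δ)$ and $w (δ) = min \{{[A (δ) + 1]}^{+} / (δ + 1), 1\}$ for two disparities, assuming the standard forms $G (δ) = (δ + 1) ln (δ + 1) - δ$ for the likelihood disparity and $G (δ) = 2 {[{(δ + 1)}^{1 / 2} - 1]}^{2}$ for the twice-squared Hellinger disparity. The function names are illustrative only.

```python
import numpy as np


def raf(G, G_prime, delta):
    """Residual adjustment function A = G'(delta) * (delta + 1) - G(delta)."""
    return G_prime(delta) * (delta + 1.0) - G(delta)


def weight(A_values, delta):
    """Weight w = min{ [A + 1]^+ / (delta + 1), 1 }, valid for delta > -1."""
    return np.minimum(np.maximum(A_values + 1.0, 0.0) / (delta + 1.0), 1.0)


# Likelihood disparity (assumed standard form).
def G_ld(delta):
    return (delta + 1.0) * np.log(delta + 1.0) - delta


def G_ld_prime(delta):
    return np.log(delta + 1.0)


# Twice-squared Hellinger disparity (assumed standard form).
def G_hd(delta):
    return 2.0 * (np.sqrt(delta + 1.0) - 1.0) ** 2


def G_hd_prime(delta):
    return 2.0 * (1.0 - 1.0 / np.sqrt(delta + 1.0))


delta = np.array([-0.75, 0.0, 3.0, 8.0])
A_ld = raf(G_ld, G_ld_prime, delta)
A_hd = raf(G_hd, G_hd_prime, delta)
w_ld = weight(A_ld, delta)
w_hd = weight(A_hd, delta)

print("LD:", np.round(A_ld, 4), np.round(w_ld, 4))
print("HD:", np.round(A_hd, 4), np.round(w_hd, 4))
```

For the likelihood disparity, $A (δ) = δ$ and every weight equals 1, so the estimating equations reduce to the likelihood equations. For the Hellinger disparity, $A (δ) = 2 [{(δ + 1)}^{1 / 2} - 1]$ , and the weight falls below 1 as the residual moves away from zero in either direction, which is how cells with large Pearson residuals are downweighted.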

(ii) We now need to obtain ${\hat{π}}_{x}$ , which is found by setting the gradient of the Lagrangian with respect to $π_{z}$ equal to zero, that is, from the following equations:

$\sum_{y} G^{'} (δ (z, y)) [\nabla_{π_{z}} δ (z, y)] m_{β} (y | z) π_{z} + \sum_{y} G (δ (z, y)) m_{β} (y | z) - λ = 0 .$

Recalling that $A (δ (z, y)) = G^{'} (δ (z, y)) [δ (z, y) + 1] - G (δ (z, y))$ and $δ (z, y) + 1 = \frac{n_{z, y} / n}{m_{β} (y | z) π_{z}}$ , the above equations reduce to

$\sum_{y} A (δ (z, y)) m_{β} (z, y) \frac{1}{π_{z}} + λ = 0$
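
The reduction to this equation rests on one intermediate step, sketched here from the definitions already given. Since $δ (z, y) + 1 = \frac{n_{z, y} / n}{m_{β} (y | z) π_{z}}$ , differentiating with respect to $π_{z}$ gives

$\nabla_{π_{z}} δ (z, y) = - \frac{n_{z, y} / n}{m_{β} (y | z) π_{z}^{2}} = - \frac{δ (z, y) + 1}{π_{z}},$

so the first sum in the gradient equation becomes $- \sum_{y} G^{'} (δ (z, y)) [δ (z, y) + 1] m_{β} (y | z)$ . Combining it with the second sum yields $- \sum_{y} A (δ (z, y)) m_{β} (y | z) - λ = 0$ ; writing $m_{β} (y | z) = m_{β} (z, y) / π_{z}$ and multiplying through by $- 1$ gives the equation displayed above.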

and we readily conclude that,

$π_{z} = - \frac{1}{λ} \sum_{y} A (δ (z, y)) m_{β} (z, y), \forall z .$

Furthermore, to satisfy the constraint $\sum_{x} π_{x} = 1$ , we obtain

$λ = - \sum_{x, y} A (δ (x, y)) m_{β} (x, y) .$

Therefore, we get

$\sum_{x, y} A (δ (x, y)) m_{β} (y, x) [\frac{I (X = z)}{π_{x}} - 1] = 0$

and by making use of the fact that $\sum_{x, y} m_{β} (x, y) [\frac{I (X = z)}{π_{x}} - 1] = 0$ , the above equation can be represented as

$\sum_{x, y} w (δ (x, y)) n_{x, y} [\frac{I (X = x)}{π_{x}} - 1] = 0$

for any x, where $I (X = x)$ is the indicator function of the event $\{X = x\}$ . □

Appendix A.2. Proof of Proposition 5

Recall that $β_{ϵ}$ is a solution of the set of estimating equations

$\sum_{s, t} w (δ_{ϵ} (s, t)) u (t | s; β_{ϵ}) d_{ϵ} (s, t) = 0,$

(A1)

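where $d_{ϵ} (s, t) = (1 - ϵ) d (s, t) + ϵ \nabla_{x, y} (s, t)$ , with $\nabla_{x, y} (s, t)$ denoting the point mass at $(x, y)$ (it equals 1 when $(s, t) = (x, y)$ and 0 otherwise, and is not a gradient), and $u (t | s; β) = \frac{\nabla_{β} m_{β} (s, t)}{m_{β} (s, t)} = \nabla_{β} ln [m_{β} (s, t)]$ is a p-dimensional vector.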

The influence function of $β$ is calculated by differentiating, with respect to $ϵ$ , the quantity (A1), and evaluating the derivative at $ϵ = 0$ . Thus, we need

$\frac{d}{d ϵ} \Big\{ \sum_{s, t} w (δ_{ϵ} (s, t)) u (t | s; β_{ϵ}) d (s, t) - ϵ \sum_{s, t} w (δ_{ϵ} (s, t)) u (t | s; β_{ϵ}) d (s, t) + ϵ \sum_{s, t} w (δ_{ϵ} (s, t)) u (t | s; β_{ϵ}) \nabla_{x, y} (s, t) \Big\} \Big|_{ϵ = 0} = 0 .$

(A2)

Taking into account that $δ_{ϵ} (s, t) = \frac{d_{ϵ} (s, t)}{m_{β} (s, t)} - 1 = \frac{d_{ϵ} (s, t)}{m_{β} (t | s) π_{s}} - 1$ , the aforementioned evaluation implies

$\begin{aligned} & \Big\{ \sum_{s, t} (δ_{0} (s, t) + 1) w^{'} (δ_{0} (s, t)) u (t | s; β_{0}) u^{T} (t | s; β_{0}) d (s, t) - \sum_{s, t} w (δ_{0} (s, t)) \nabla u (t | s; β_{0}) d (s, t) \Big\} β_{0}^{'} \\ & \quad = \sum_{s, t} \Big\{ \frac{I (s = x, t = y)}{m_{β_{0}} (t | s) π_{s}} - \frac{d (s, t)}{m_{β_{0}} (t | s) π_{s}} \Big\} w^{'} (δ_{0} (s, t)) u (t | s; β_{0}) d (s, t) \\ & \qquad - \sum_{s, t} w (δ_{0} (s, t)) u (t | s; β_{0}) d (s, t) + w (δ_{0} (x, y)) u (y | x; β_{0}), \end{aligned}$

(A3)

which implies that

$β_{0}^{'} = I F (β; F) = {[A (d)]}^{- 1} B (x, y; d) .$
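
The passage from (A2) to (A3) can be traced through the chain rule. The following is a sketch under the definitions above, holding $π_{s}$ fixed, writing $I (s = x, t = y)$ for the point mass at $(x, y)$ and $β_{0}^{'}$ for the derivative of $β_{ϵ}$ with respect to $ϵ$ at $ϵ = 0$ . Differentiating the Pearson residual gives

$\frac{d}{d ϵ} δ_{ϵ} (s, t) \Big|_{ϵ = 0} = \frac{I (s = x, t = y) - d (s, t)}{m_{β_{0}} (t | s) π_{s}} - (δ_{0} (s, t) + 1) u^{T} (t | s; β_{0}) β_{0}^{'},$

while $\frac{d}{d ϵ} d_{ϵ} (s, t) = I (s = x, t = y) - d (s, t)$ and $\frac{d}{d ϵ} u (t | s; β_{ϵ}) |_{ϵ = 0} = \nabla u (t | s; β_{0}) β_{0}^{'}$ . Substituting these three derivatives into the product rule applied to (A1), and collecting on the left-hand side the terms that multiply $β_{0}^{'}$ , yields (A3).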

Appendix A.3. Assumptions of Theorem 1

The following assumptions are needed to be able to establish asymptotic normality of the estimators.

1. The weight functions are nonnegative, bounded and differentiable with respect to $δ$ .
2. The weight function is regular, that is, $w^{'} (δ) (δ + 1)$ is bounded, where $w^{'} (δ)$ is the derivative of w with respect to $δ$ .
3. $\sum_{x, y} m^{\frac{1}{2}} (x, y) E [u_{k}^{2} (y | x; β_{0})] < \infty .$
4. The elements of the Fisher information matrix are finite and the Fisher information matrix is nonsingular.
5. $\sum_{x, y} m^{\frac{1}{2}} (x, y) E [u_{i}^{2} (y | x; β_{0}) u_{j}^{2} (y | x; β_{0})] < \infty \forall i, j = 1, 2, \dots, p .$
6. If $β_{0}$ denotes the true value of $β$ , there exist functions $M_{i j k} (x)$ such that $| u_{i j k} (y | x; β_{0}) | \leq M_{i j k} (x)$ , $\forall β$ with $‖ β - β_{0} ‖^{2} < r (β_{0})$ , $r (β_{0}) > 0$ and $E_{β_{0}} | M_{i j k} (y | x) | < \infty, \forall i, j, k .$
7. If $β_{0}$ denotes the true value of $β$ , there is a neighborhood $N (β_{0})$ such that for $β \in N (β_{0})$ the quantity $| u_{t} (y | x; β_{0}) u_{i} (y | x; β_{0}) u_{e} (y | x; β_{0}) |$ are bounded by $M_{1} (y | x)$ and $M_{2} (y | x)$ respectively, such that their corresponding expectations are finite.
8. $A^{″} (δ + 1) (δ + 1)$ is bounded, where $A^{″}$ denotes the second derivative of A with respect to $δ$ .

Author Contributions

The authors of this paper have contributed as follows. Conceptualization: M.M.; Methodology: M.M., E.M.S., R.L.; Software: E.M.S., H.W.; Writing-original draft presentation: M.M., E.M.S., R.L., H.W.; Supervision, funding acquisition and project administration: M.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Troup Fund, KALEIDA Health Foundation, under award number 82114, to Markatou who supported the work of the first and the third author of the paper.

Conflicts of Interest

The authors declare no conflict of interest.

Footnotes

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Beran R. Minimum Hellinger Distance Estimates for Parametric Models. Ann. Stat. 1977;5:445–463. doi: 10.1214/aos/1176343842.
2. Basu A., Lindsay B.G. Minimum Disparity Estimation for Continuous Models: Efficiency, Distributions and Robustness. Ann. Inst. Stat. Math. 1994;46:683–705. doi: 10.1007/BF00773476.
3. Pardo J.A., Pardo L., Pardo M.C. Minimum ϕ-Divergence Estimator in Logistic Regression Models. Stat. Pap. 2005;47:91–108. doi: 10.1007/s00362-005-0274-7.
4. Pardo J.A., Pardo L., Pardo M.C. Testing In Logistic Regression Models on ϕ-Divergences Measures. J. Stat. Plan. Inference. 2006;136:982–1006. doi: 10.1016/j.jspi.2004.08.008.
5. Pardo J.A., Pardo M.C. Minimum ϕ-Divergence Estimator and ϕ-Divergence Statistics in Generalized Linear Models with Binary Data. Methodol. Comput. Appl. Probab. 2008;10:357–379. doi: 10.1007/s11009-007-9054-2.
6. Simpson D.G. Minimum Hellinger Distance Estimation for the Analysis of Count Data. J. Am. Stat. Assoc. 1987;82:802–807. doi: 10.1080/01621459.1987.10478501.
7. Simpson D.G. Hellinger Deviance Tests: Efficiency, Breakdown Points, and Examples. J. Am. Stat. Assoc. 1989;84:104–113. doi: 10.1080/01621459.1989.10478744.
8. Markatou M., Basu A., Lindsay B.G. Weighted Likelihood Estimating Equations: The Discrete Case with Applications to Logistic Regression. J. Stat. Plan. Inference. 1997;57:215–232. doi: 10.1016/S0378-3758(96)00045-6.
9. Basu A., Basu S. Penalized Minimum Disparity Methods for Multinomial Models. Stat. Sin. 1998;8:841–860.
10. Gupta A.K., Nguyen T., Pardo L. Inference Procedures for Polytomous Logistic Regression Models Based on ϕ-Divergence Measures. Math. Methods Stat. 2006;15:269–288.
11. Martín N., Pardo L. New Influence Measures in Polytomous Logistic Regression Models Based on Phi-Divergence Measures. Commun. Stat. Theory Methods. 2014;43:2311–2321. doi: 10.1080/03610926.2013.839038.
12. Castilla E., Ghosh A., Martín N., Pardo L. New Robust Statistical Procedures for Polytomous Logistic Regression Models. Biometrics. 2018;74:1282–1291. doi: 10.1111/biom.12890.
13. Martín N., Pardo L. Minimum Phi-Divergence Estimators for Loglinear Models with Linear Constraints and Multinomial Sampling. Stat. Pap. 2008;49:2311–2321. doi: 10.1007/s00362-006-0370-3.
14. Pardo L., Martín N. Minimum Phi-Divergence Estimators and Phi-Divergence Test for Statistics in Contingency Tables with Symmetric Structure: An Overview. Symmetry. 2010;2:1108–1120. doi: 10.3390/sym2021108.
15. Pardo L., Pardo M.C. Minimum Power-Divergence Estimator in Three-Way Contingency Tables. J. Stat. Comput. Simul. 2003;73:819–831. doi: 10.1080/0094965031000097782.
16. Pardo L., Pardo M.C., Zografos K. Minimum ϕ-Divergence Estimator for Homogeneity in Multinomial Populations. Sankhyā Indian J. Stat. Ser. A (1961–2002) 2001;63:72–92.
17. Basu A., Harris I.A., Hjort N.L., Jones M.C. Robust and Efficient Estimation by Minimising a Density Power Divergence. Biometrika. 1998;85:549–559. doi: 10.1093/biomet/85.3.549.
18. Csiszár I. Information-Type Measures of Difference of Probability Distributions and Indirect Observations. Stud. Sci. Math. Hung. 1967;25:299–318.
19. Lindsay B.G. Efficiency Versus Robustness: The Case for Minimum Hellinger Distance and Related Methods. Ann. Stat. 1994;22:1081–1114. doi: 10.1214/aos/1176325512.
20. Tamura R.N., Boos D.D. Minimum Hellinger Distance Estimation for Multivariate Location and Covariance. J. Am. Stat. Assoc. 1986;81:223–229. doi: 10.1080/01621459.1986.10478264.
21. Markatou M., Basu A., Lindsay B.G. Weighted Likelihood Equations with Bootstrap Root Search. J. Am. Stat. Assoc. 1998;93:740–750. doi: 10.1080/01621459.1998.10473726.
22. Haberman S.J. Generalized Residuals for Log-Linear Models; Proceedings of the 9th International Biometrics Conference; Boston, MA, USA. 22–27 August 1976; pp. 104–122.
23. Haberman S.J., Sinharay S. Generalized Residuals for General Models for Contingency Tables with Application to Item Response Theory. J. Am. Stat. Assoc. 2013;108:1435–1444. doi: 10.1080/01621459.2013.835660.
24. Pierce D.A., Schafer D.W. Residuals in Generalized Linear Models. J. Am. Stat. Assoc. 1986;81:977–986. doi: 10.1080/01621459.1986.10478361.
25. Aerts M., Molenberghs G., Geys H., Ryan L. Topics in Modelling of Clustered Data. Volume 96 Chapman & Hall/CRC Press; New York, NY, USA: 1986. Monographs on Statistics and Applied Probability.
26. Olkin I., Tate R.F. Multivariate Correlation Models with Mixed Discrete and Continuous Variables. Ann. Math. Stat. 1961;32:448–465. doi: 10.1214/aoms/1177705052. With correction in 1961, 36, 343–344.
27. Genest C., Nešlehová J. A Primer on Copulas for Count Data. ASTIN Bull. 2007;37:475–515. doi: 10.2143/AST.37.2.2024077.
28. Lauritzen S., Wermuth N. Graphical Models for Associations between Variables, some of which are Qualitative and some Quantitative. Ann. Stat. 1989;17:31–57. doi: 10.1214/aos/1176347003.
29. Hampel F.R., Ronchetti E.M., Rousseeuw P.J., Stahel W.A. Robust Statistics: The Approach Based on Influence Functions. Wiley; New York, NY, USA: 1986. Wiley Series in Probability and Mathematical Statistics. Probability and Mathematical Statistics.
30. Hampel F.R. Ph.D. Thesis. Department of Statistics, University of California, Berkeley; Berkeley, CA, USA: 1968. Contributions to the Theory of Robust Estimation. Unpublished.
31. Hampel F.R. The Influence Curve and its Role in Robust Estimation. J. Am. Stat. Assoc. 1974;69:383–393. doi: 10.1080/01621459.1974.10482962.
32. Fienberg S.E. The Analysis of Incomplete Multi-Way Contingency Tables. Biometrics. 1972;28:177–202. doi: 10.2307/2528967.
33. Agresti A. Categorical Data Analysis. 3rd ed. John Wiley & Sons; Hoboken, NJ, USA: 2013.
34. Johnson W.D., May W.L. Combining 2 × 2 Tables That Contain Structural Zeros. Biometrics. 1972;14:1901–1911. doi: 10.1002/sim.4780141706.
35. Poon W.Y., Tang M.L., Wang S.J. Influence Measures in Contingency Tables with Application in Sampling Zeros. Sociol. Methods Res. 2003;31:439–452. doi: 10.1177/0049124103251946.
36. Alin A., Kurt S. Ordinary and Penalized Minimum Power-Divergence Estimators in Two-Way Contingency Tables. Comput. Stat. 2008;23:455–468. doi: 10.1007/s00180-007-0088-2.
37. Ye Y. Ph.D. Thesis. Department of Engineering-Economic Systems, Stanford University; Stanford, CA, USA: 1987. Interior Algorithms for Linear, Quadratic, and Linearly Constrained Convex Programming. Unpublished.
38. Conn A.R., Gould N.I.M., Toint P. A Globally Convergent Augmented Lagrangian Algorithm for Optimization with General Constraints and Simple Bounds. SIAM J. Numer. Anal. 1991;28:545–572. doi: 10.1137/0728030.
39. Birgin E.G., Martínez J.M. Improving Ultimate Convergence of an Augmented Lagrangian Method. Optim. Methods Softw. 2008;23:177–195. doi: 10.1080/10556780701577730.
40. Amatya A., Demirtas H. OrdNor: An R Package for Concurrent Generation of Correlated Ordinal and Normal Data. J. Stat. Softw. 2015;68:1–14. doi: 10.18637/jss.v068.c02.
41. Olsson U., Drasgow F., Dorans N.J. The Polyserial Correlation Coefficient. Psychometrika. 1982;47:337–347. doi: 10.1007/BF02294164.
42. Duong T. ks: Kernel Density Estimation and Kernel Discriminant Analysis for Multivariate Data in R. J. Stat. Softw. 2007;21:1–16. doi: 10.18637/jss.v021.i07.
43. Bolstad W.M. Understanding Computational Bayesian Statistics. John Wiley & Sons; Hoboken, NJ, USA: 2010.
44. Agostinelli C., Markatou M. Test of Hypotheses Based on the Weighted Likelihood Methodology. Stat. Sin. 2001;11:499–514.
45. Eiras-Franco C., Martínez-Rego D., Guijarro-Berdiñas B., Alonso-Betanzos A., Bahamonde A. Large Scale Anomaly Detection in Mixed Numerical and Categorical Input Spaces. Inf. Sci. 2019;487:115–127. doi: 10.1016/j.ins.2019.03.013.
46. Diehl C., Hampshire J. Real-Time Object Classification and Novelty Detection for Collaborative Video Surveillance; Proceedings of the 2002 International Joint Conference on Neural Networks. IJCNN’02 (Cat. No.02CH37290); Honolulu, HI, USA. 12–17 May 2002; pp. 2620–2625.
47. Portnoy L., Eskin E., Stolfo S. Intrusion Detection with Unlabeled Data Using Clustering; Proceedings of the ACM CSS Workshop on Data Mining Applied to Security (DMSA-2001); Philadelphia, PA, USA. 5–8 November 2001; pp. 5–8.
48. Tran T., Phung D., Luo W., Harvey R., Berk M., Venkatesh S. An Integrated Framework for Suicide Risk Prediction; Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Chicago, IL, USA. 11–14 August 2013; New York, NY, USA: ACM; 2013. pp. 1410–1418.
49. Konijn R.M., Kowalczyk W. Finding Fraud in Health Insurance Data with Two-Layer Outlier Detection Approach. In: Cuzzocrea A., Dayal U., editors. Data Warehousing and Knowledge Discovery, DaWak 2011. Springer; Berlin/Heidelberg, Germany: 2011. pp. 394–405.
50. Fraley C., Wilkinson L. Package ‘HDoutliers’. R Package. [(accessed on 31 December 2020)];2020 Available online: https://cran.r-project.org/web/packages/HDoutliers/index.html.
51. Wilkinson L. Visualizing Outliers. [(accessed on 31 December 2020)];2016 Available online: https://www.cs.uic.edu/~wilkinson/Publications/outliers.pdf.
52. Do K., Tran T., Phung D., Venkatesh S. Outlier Detection on Mixed-Type Data: An Energy-Based Approach. In: Li J., Li X., Wang S., Li J., Sheng Q.Z., editors. Advanced Data Mining and Applications. Springer; Cham, Switzerland: 2016. pp. 111–125.
53. Koufakou A., Georgiopoulos M., Anagnostopoulos G.C. Detecting Outliers in High-Dimensional Datasets with Mixed Attributes; Proceedings of the 2008 International Conference on Data Mining, DMIN; Las Vegas, NV, USA. 14–17 July 2008; pp. 427–433.

[B1-entropy-23-00107] 1.Beran R. Minimum Hellinger Distance Estimates for Parametric Models. Ann. Stat. 1977;5:445–463. doi: 10.1214/aos/1176343842. [DOI] [Google Scholar]

[B2-entropy-23-00107] 2.Basu A., Lindsay B.G. Minimum Disparity Estimation for Continuous Models: Efficiency, Distributions and Robustness. Ann. Inst. Stat. Math. 1994;46:683–705. doi: 10.1007/BF00773476. [DOI] [Google Scholar]

[B3-entropy-23-00107] 3.Pardo J.A., Pardo L., Pardo M.C. Minimum ϕ-Divergence Estimator in Logistic Regression Models. Stat. Pap. 2005;47:91–108. doi: 10.1007/s00362-005-0274-7. [DOI] [Google Scholar]

[B4-entropy-23-00107] 4.Pardo J.A., Pardo L., Pardo M.C. Testing In Logistic Regression Models on ϕ-Divergences Measures. J. Stat. Plan. Inference. 2006;136:982–1006. doi: 10.1016/j.jspi.2004.08.008. [DOI] [Google Scholar]

[B5-entropy-23-00107] 5.Pardo J.A., Pardo M.C. Minimum ϕ-Divergence Estimator and ϕ-Divergence Statistics in Generalized Linear Models with Binary Data. Methodol. Comput. Appl. Probab. 2008;10:357–379. doi: 10.1007/s11009-007-9054-2. [DOI] [Google Scholar]

[B6-entropy-23-00107] 6.Simpson D.G. Minimum Hellinger Distance Estimation for the Analysis of Count Data. J. Am. Stat. Assoc. 1987;82:802–807. doi: 10.1080/01621459.1987.10478501. [DOI] [Google Scholar]

[B7-entropy-23-00107] 7.Simpson D.G. Hellinger Deviance Tests: Efficiency, Breakdown Points, and Examples. J. Am. Stat. Assoc. 1989;84:104–113. doi: 10.1080/01621459.1989.10478744. [DOI] [Google Scholar]

[B8-entropy-23-00107] 8.Markatou M., Basu A., Lindsay B.G. Weighted Likelihood Estimating Equations: The Discrete Case with Applications to Logistic Regression. J. Stat. Plan. Inference. 1997;57:215–232. doi: 10.1016/S0378-3758(96)00045-6. [DOI] [Google Scholar]

[B9-entropy-23-00107] 9.Basu A., Basu S. Penalized Minimum Disparity Methods for Multinomial Models. Stat. Sin. 1998;8:841–860. [Google Scholar]

[B10-entropy-23-00107] 10.Gupta A.K., Nguyen T., Pardo L. Inference Procedures for Polytomous Logistic Regression Models Based on ϕ-Divergence Measures. Math. Methods Stat. 2006;15:269–288. [Google Scholar]

[B11-entropy-23-00107] 11.Martín N., Pardo L. New Influence Measures in Polytomous Logistic Regression Models Based on Phi-Divergence Measures. Commun. Stat. Theory Methods. 2014;43:2311–2321. doi: 10.1080/03610926.2013.839038. [DOI] [Google Scholar]

[B12-entropy-23-00107] 12.Castilla E., Ghosh A., Martín N., Pardo L. New Robust Statistical Procedures for Polytomous Logistic Regression Models. Biometrics. 2018;74:1282–1291. doi: 10.1111/biom.12890. [DOI] [PubMed] [Google Scholar]

[B13-entropy-23-00107] 13.Martín N., Pardo L. Minimum Phi-Divergence Estimators for Loglinear Models with Linear Constraints and Multinomial Sampling. Stat. Pap. 2008;49:2311–2321. doi: 10.1007/s00362-006-0370-3. [DOI] [Google Scholar]

[B14-entropy-23-00107] 14.Pardo L., Martín N. Minimum Phi-Divergence Estimators and Phi-Divergence Test for Statistics in Contingency Tables with Symmetric Structure: An Overview. Symmetry. 2010;2:1108–1120. doi: 10.3390/sym2021108. [DOI] [Google Scholar]

[B15-entropy-23-00107] 15.Pardo L., Pardo M.C. Minimum Power-Divergence Estimator in Three-Way Contingency Tables. J. Stat. Comput. Simul. 2003;73:819–831. doi: 10.1080/0094965031000097782. [DOI] [Google Scholar]

[B16-entropy-23-00107] 16.Pardo L., Pardo M.C., Zografos K. Minimum ϕ-Divergence Estimator for Homogeneity in Multinomial Populations. Sankhyā Indian J. Stat. Ser. A (1961–2002) 2001;63:72–92. [Google Scholar]

[B17-entropy-23-00107] 17.Basu A., Harris I.A., Hjort N.L., Jones M.C. Robust and Efficient Estimation by Minimising a Density Power Divergence. Biometrika. 1998;85:549–559. doi: 10.1093/biomet/85.3.549. [DOI] [Google Scholar]

[B18-entropy-23-00107] 18.Csiszár I. Information-Type Measures of Difference of Probability Distributions and Indirect Observations. Stud. Sci. Math. Hung. 1967;25:299–318. [Google Scholar]

[B19-entropy-23-00107] 19.Lindsay B.G. Efficiency Versus Robustness: The Case for Minimum Hellinger Distance and Related Methods. Ann. Stat. 1994;22:1081–1114. doi: 10.1214/aos/1176325512. [DOI] [Google Scholar]

[B20-entropy-23-00107] 20.Tamura R.N., Boos D.D. Minimum Hellinger Distance Estimation for Multivariate Location and Covariance. J. Am. Stat. Assoc. 1986;81:223–229. doi: 10.1080/01621459.1986.10478264. [DOI] [Google Scholar]

[B21-entropy-23-00107] 21.Markatou M., Basu A., Lindsay B.G. Weighted Likelihood Equations with Bootstrap Root Search. J. Am. Stat. Assoc. 1998;93:740–750. doi: 10.1080/01621459.1998.10473726. [DOI] [Google Scholar]

[B22-entropy-23-00107] 22.Haberman S.J. Generalized Residuals for Log-Linear Models; Proceedings of the 9th International Biometrics Conference; Boston, MA, USA. 22–27 August 1976; pp. 104–122. [Google Scholar]

[B23-entropy-23-00107] 23.Haberman S.J., Sinharay S. Generalized Residuals for General Models for Contingency Tables with Application to Item Response Theory. J. Am. Stat. Assoc. 2013;108:1435–1444. doi: 10.1080/01621459.2013.835660. [DOI] [Google Scholar]

[B24-entropy-23-00107] 24.Pierce D.A., Schafer D.W. Residuals in Generalized Linear Models. J. Am. Stat. Assoc. 1986;81:977–986. doi: 10.1080/01621459.1986.10478361. [DOI] [Google Scholar]

[B25-entropy-23-00107] 25.Aerts M., Molenberghs G., Geys H., Ryan L. Topics in Modelling of Clustered Data. Volume 96 Chapman & Hall/CRC Press; New York, NY, USA: 1986. Monographs on Statistics and Applied Probability. [Google Scholar]

[B26-entropy-23-00107] 26.Olkin I., Tate R.F. Multivariate Correlation Models with Mixed Discrete and Continuous Variables. Ann. Math. Stat. 1961;32:448–465. doi: 10.1214/aoms/1177705052. With correction in 1961, 36, 343–344. [DOI] [Google Scholar]

[B27-entropy-23-00107] 27.Genest C., Nešlehová J. A Primer on Copulas for Count Data. ASTIN Bull. 2007;37:475–515. doi: 10.2143/AST.37.2.2024077. [DOI] [Google Scholar]

[B28-entropy-23-00107] 28.Lauritzen S., Wermuth N. Graphical Models for Associations between Variables, some of which are Qualitative and some Quantitative. Ann. Stat. 1989;17:31–57. doi: 10.1214/aos/1176347003. [DOI] [Google Scholar]

[B29-entropy-23-00107] 29.Hampel F.R., Ronchetti E.M., Rousseeuw P.J., Stahel W.A. Robust Statistics: The Approach Based on Influence Functions. Wiley; New York, NY, USA: 1986. Wiley Series in Probability and Mathematical Statistics. Probability and Mathematical Statistics. [Google Scholar]

[B30-entropy-23-00107] 30.Hampel F.R. Ph.D. Thesis. Department of Statistics, University of California, Berkeley; Berkeley, CA, USA: 1968. Contributions to the Theory of Robust Estimation. Unpublished. [Google Scholar]

[B31-entropy-23-00107] 31.Hampel F.R. The Influence Curve and its Role in Robust Estimation. J. Am. Stat. Assoc. 1974;69:383–393. doi: 10.1080/01621459.1974.10482962. [DOI] [Google Scholar]

[B32-entropy-23-00107] 32.Fienberg S.E. The Analysis of Incomplete Multi-Way Contingency Tables. Biometrics. 1972;28:177–202. doi: 10.2307/2528967. [DOI] [Google Scholar]

[B33-entropy-23-00107] 33.Agresti A. Categorical Data Analysis. 3rd ed. John Wiley & Sons; Hoboken, NJ, USA: 2013. [Google Scholar]

[B34-entropy-23-00107] 34.Johnson W.D., May W.L. Combining 2 × 2 Tables That Contain Structural Zeros. Biometrics. 1972;14:1901–1911. doi: 10.1002/sim.4780141706. [DOI] [PubMed] [Google Scholar]

[B35-entropy-23-00107] 35.Poon W.Y., Tang M.L., Wang S.J. Influence Measures in Contingency Tables with Application in Sampling Zeros. Sociol. Methods Res. 2003;31:439–452. doi: 10.1177/0049124103251946. [DOI] [Google Scholar]

[B36-entropy-23-00107] 36.Alin A., Kurt S. Ordinary and Penalized Minimum Power-Divergence Estimators in Two-Way Contingency Tables. Comput. Stat. 2008;23:455–468. doi: 10.1007/s00180-007-0088-2. [DOI] [Google Scholar]

[B37-entropy-23-00107] 37.Ye Y. Ph.D. Thesis. Department of Engineering-Economic Systems, Stanford University; Stanford, CA, USA: 1987. Interior Algorithms for Linear, Quadratic, and Linearly Constrained Convex Programming. Unpublished. [Google Scholar]

38. Conn A.R., Gould N.I.M., Toint P. A Globally Convergent Augmented Lagrangian Algorithm for Optimization with General Constraints and Simple Bounds. SIAM J. Numer. Anal. 1991;28:545–572. doi: 10.1137/0728030.

39. Birgin E.G., Martínez J.M. Improving Ultimate Convergence of an Augmented Lagrangian Method. Optim. Methods Softw. 2008;23:177–195. doi: 10.1080/10556780701577730.

40. Amatya A., Demirtas H. OrdNor: An R Package for Concurrent Generation of Correlated Ordinal and Normal Data. J. Stat. Softw. 2015;68:1–14. doi: 10.18637/jss.v068.c02.

41. Olsson U., Drasgow F., Dorans N.J. The Polyserial Correlation Coefficient. Psychometrika. 1982;47:337–347. doi: 10.1007/BF02294164.

42. Duong T. ks: Kernel Density Estimation and Kernel Discriminant Analysis for Multivariate Data in R. J. Stat. Softw. 2007;21:1–16. doi: 10.18637/jss.v021.i07.

43. Bolstad W.M. Understanding Computational Bayesian Statistics. John Wiley & Sons; Hoboken, NJ, USA: 2010.

44. Agostinelli C., Markatou M. Test of Hypotheses Based on the Weighted Likelihood Methodology. Stat. Sin. 2001;11:499–514.

45. Eiras-Franco C., Martínez-Rego D., Guijarro-Berdiñas B., Alonso-Betanzos A., Bahamonde A. Large Scale Anomaly Detection in Mixed Numerical and Categorical Input Spaces. Inf. Sci. 2019;487:115–127. doi: 10.1016/j.ins.2019.03.013.

46. Diehl C., Hampshire J. Real-Time Object Classification and Novelty Detection for Collaborative Video Surveillance; Proceedings of the 2002 International Joint Conference on Neural Networks. IJCNN’02 (Cat. No.02CH37290); Honolulu, HI, USA. 12–17 May 2002; pp. 2620–2625.

47. Portnoy L., Eskin E., Stolfo S. Intrusion Detection with Unlabeled Data Using Clustering; Proceedings of the ACM CSS Workshop on Data Mining Applied to Security (DMSA-2001); Philadelphia, PA, USA. 5–8 November 2001; pp. 5–8.

48. Tran T., Phung D., Luo W., Harvey R., Berk M., Venkatesh S. An Integrated Framework for Suicide Risk Prediction; Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Chicago, IL, USA. 11–14 August 2013; New York, NY, USA: ACM; 2013. pp. 1410–1418.

49. Konijn R.M., Kowalczyk W. Finding Fraud in Health Insurance Data with Two-Layer Outlier Detection Approach. In: Cuzzocrea A., Dayal U., editors. Data Warehousing and Knowledge Discovery, DaWaK 2011. Springer; Berlin/Heidelberg, Germany: 2011. pp. 394–405.

50. Fraley C., Wilkinson L. Package ‘HDoutliers’. R Package. 2020. Available online: https://cran.r-project.org/web/packages/HDoutliers/index.html (accessed on 31 December 2020).

51. Wilkinson L. Visualizing Outliers. 2016. Available online: https://www.cs.uic.edu/~wilkinson/Publications/outliers.pdf (accessed on 31 December 2020).

52. Do K., Tran T., Phung D., Venkatesh S. Outlier Detection on Mixed-Type Data: An Energy-Based Approach. In: Li J., Li X., Wang S., Li J., Sheng Q.Z., editors. Advanced Data Mining and Applications. Springer; Cham, Switzerland: 2016. pp. 111–125.

53. Koufakou A., Georgiopoulos M., Anagnostopoulos G.C. Detecting Outliers in High-Dimensional Datasets with Mixed Attributes; Proceedings of the 2008 International Conference on Data Mining, DMIN; Las Vegas, NV, USA. 14–17 July 2008; pp. 427–433.
