Journal of Applied Statistics. 2020 Sep 18;49(3):574–598. doi: 10.1080/02664763.2020.1822304

Regularized robust estimation in binary regression models

Qingguo Tang, Rohana J. Karunamuni, Boxiao Liu
PMCID: PMC9041772  PMID: 35706765

Abstract

In this paper, we investigate robust parameter estimation and variable selection for binary regression models with grouped data. We study estimation procedures based on the minimum-distance approach. In particular, we employ the minimum Hellinger and minimum symmetric chi-squared distance criteria and propose regularized minimum-distance estimators. These estimators appear to possess a certain degree of automatic robustness against model misspecification and/or potential outliers. We show that the proposed non-penalized and penalized minimum-distance estimators are efficient under the model and simultaneously have excellent robustness properties. We study their asymptotic properties, such as consistency, asymptotic normality, and the oracle properties. Using Monte Carlo studies, we examine the small-sample and robustness properties of the proposed estimators and compare them with traditional likelihood estimators. We also present two real-data applications to illustrate our methods. The numerical studies indicate satisfactory finite-sample performance of our procedures.

Keywords: Binary regression, maximum likelihood, minimum-distance methods, variable selection, efficiency, robustness

2010 Mathematics Subject Classification: 62F35

1. Introduction

Data involving explanatory (covariate) variables and binary responses are common in many disciplines, including health, medicine, environmental science, agriculture, behavioral science, social science, and education. If the response is one of two possible outcomes and covariates are observed for each experimental subject, binary regression models are commonly used [35]. In a typical binary regression model, we have a random sample of response variables Y_j ∈ {0, 1} and covariates x_j ∈ R^p, j = 1,…,K. The probability of a positive response is modeled as a function of a linear combination of the covariates: P(Y_j = 1 | x_j) = p_j, where p_j = F(x_j^T β), with F denoting a known cumulative distribution function (CDF), commonly known as the link function, and β the unknown regression parameter that needs to be estimated. When F is the logistic distribution we have a logit model, and when F is the standard normal distribution we have a probit model. Chapter 4 of McCullagh and Nelder [35] provides an excellent account of methods of estimation and data analysis procedures based on binary regression models.

In many experiments, the units under study can be classified into K groups in such a way that the individuals in a group have identical values for all the covariates. Thus, for the jth combination of experimental conditions, characterized by the p-dimensional vector x_j, observations are available for n_j individuals. Then, of the N = ∑_{j=1}^K n_j individuals under study, n_j share the covariate vector x_j, j = 1,…,K. These groups are known as covariate classes [35]. Working with grouped data has the additional advantage that, depending on the size of the groups, it becomes possible to test the goodness of fit of the model.

A classic example where grouped data arise naturally is the 'effective dose level' estimation of dose–response studies; see, e.g. Bhattacharya and Kong [5], Li and Wiens [28] and Karunamuni et al. [26]. Specifically, in pharmacology or toxicology studies, experimenters are often interested in estimating the effective dose level ED_p, where 0 < p < 1. The ED_p is the dose at which 100p% of the subjects show a response. Generally, K groups of test subjects characterized by different dose levels x_j (j = 1,…,K) are collected, where the subjects are sampled independently. The number of subjects in group j is n_j (1 ≤ j ≤ K), and the number of subjects showing a response at dose level x_j is m_j. In the dose–response context, it is generally assumed that x_1 < x_2 < ⋯ < x_K; that is, the outcome of interest is usually measured at several increasing dose levels. For every subject, a binary response is recorded: '1' indicates a response, and '0' indicates no response. The model then reduces to P(Y_j = 1 | x_j) = F(x_j^T β) with β = (β_0, β_1)^T and x_j = (1, x_j)^T for parameters β_0 and β_1 > 0. Many such examples can be found in the literature; see, e.g. McCullagh and Nelder [35] and Tutz [47].

For statistical inference in binary regression models, the maximum likelihood approach is by far the most widely used method. Specifically, let n_j denote the number of observations in group j, and let Y_j denote the number of units with the attribute of interest in group j, j = 1,…,K. Then the conditional distribution of Y_j given x_j is binomial with parameters n_j and p_j = F(x_j^T β), j = 1,…,K. Hence the log-likelihood function of the data {(Y_1, x_1), (Y_2, x_2),…,(Y_K, x_K)} is given by

l_N(β) = ∑_{j=1}^K { Y_j ln(F(x_j^T β)) + (n_j − Y_j) ln(1 − F(x_j^T β)) },  (1)

and the maximum likelihood estimator (MLE) β̂_MLE of the parameter β is obtained by maximizing l_N(β). For simultaneous estimation and variable selection, an estimator is generally constructed by maximizing the penalized log-likelihood function l_N(β) − P_N(β), where P_N(β) is a penalty function on β. This idea defines a penalized (regularized) estimator of β as

β̂_PMLE = argmax_β { l_N(β) − P_N(β) }.  (2)

Under some assumptions, β̂_MLE exists and is asymptotically unique; it is consistent and asymptotically normal as N → ∞ and all n_j → ∞, j = 1,…,K. Moreover, it is asymptotically efficient within a wide class of estimators [12,18]. However, both β̂_MLE and β̂_PMLE are sensitive to atypical data and model misspecification. In particular, observations with extreme covariates have a large influence on these estimators, and if they are accompanied by misclassified responses or a misspecified link function, the resulting estimates can be seriously biased [23].
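As a concrete illustration, the grouped log-likelihood (1) can be maximized numerically. The sketch below uses a logit link and simulated grouped data; the design matrix, group sizes, and true parameter value are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic CDF, F(t) = 1/(1 + exp(-t))

def neg_log_lik(beta, X, y, n):
    """Negative of the grouped binomial log-likelihood l_N(beta) in (1).

    X: (K, p) matrix of covariate vectors x_j; y: successes Y_j per group;
    n: group sizes n_j."""
    F = np.clip(expit(X @ beta), 1e-10, 1 - 1e-10)  # guard the logarithms
    return -np.sum(y * np.log(F) + (n - y) * np.log1p(-F))

# Illustrative design: K = 10 dose-like groups with 50 subjects each.
rng = np.random.default_rng(0)
K = 10
beta_true = np.array([-1.5, 0.4])          # hypothetical true parameter
X = np.column_stack([np.ones(K), np.arange(1.0, K + 1)])
n = np.full(K, 50)
y = rng.binomial(n, expit(X @ beta_true))  # grouped responses Y_j

fit = minimize(neg_log_lik, x0=np.zeros(2), args=(X, y, n), method="BFGS")
beta_mle = fit.x  # the MLE maximizing l_N(beta)
```

With a moderate total sample size (N = 500 here), the maximizer recovers the generating parameter closely.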

For grouped data, the lack of robustness of βˆMLE has been examined by several authors; see, e.g. Barnett [1], Victoria-Feser and Ronchetti [48] and Hosseinian and Morgenthaler [21]. For non-grouped data, the non-robustness of βˆMLE has been extensively discussed; see, e.g. Pregibon [37,38], Stefanski et al. [42], Copas [9], Künsch et al. [24], Morgenthaler [36], Carroll and Pederson [8], Bianco and Yohai [6], Markatou et al. [34], Cantoni and Ronchetti [7], Croux and Haesbroeck [10], Müller and Neykov [33], Gervini [17], and Hosseinian and Morgenthale [21], among others.

A robust methodology is vital in data analysis because outliers and model misspecifications are common in practical applications. Moreover, efficient methods are essential in practice. These considerations have motivated our research. We propose regularized estimation of β using the minimum-distance approach. Minimum-distance estimators possess a certain degree of automatic robustness to model misspecification [11]. Furthermore, certain minimum-distance estimators achieve efficiency under the model. In particular, minimum Hellinger distance (MHD) estimators for parametric models attain efficiency under the model and have excellent robustness properties in the presence of outliers and/or model misspecification [3,4,39]. Moreover, Lindsay [30] has shown that the maximum likelihood and MHD estimators are members of a larger class of efficient estimators with various second-order efficiency properties. For discrete data, Simpson [39] has shown that the breakdown point of MHD estimators is 1/2; that is, they achieve maximum robustness in the presence of outliers (see also [20]). Another distance measure that is intimately related to the Hellinger distance is the symmetric chi-squared distance introduced by Lindsay [29]. Lindsay [29,30] studied non-regularized estimators using several distance measures and showed that the minimum symmetric chi-squared distance (MSCD) criterion generates highly efficient and robust estimators.

In this paper, we investigate simultaneous robust estimation and variable selection for binary regression models with grouped data. For this purpose, we employ the squared Hellinger distance or the symmetric chi-squared distance as the measure of adequacy, combined with a penalty function. Specifically, the proposed penalized (regularized) estimator of β is constructed by minimizing D(P,Q)+PN(β), where D(P,Q) is a distance measure between two probability distributions P and Q based on an estimated model and the true model, respectively, and PN(β) is a penalty function. We use distance measures instead of the log-likelihood function (1) to develop robust estimators for β. We investigate asymptotic properties including consistency and asymptotic normality of the proposed estimators. In particular, we show that the proposed regularized estimators of β have desirable asymptotic properties, such as oracle properties [15]. Using Monte Carlo studies we examine their small-sample and robustness properties and compare them with the corresponding regularized MLE of β.

The remainder of this paper is organized as follows. Section 2 develops the proposed robust regularized regression estimators. Section 3 gives the asymptotic properties of the estimators. Section 4 presents the finite-sample performance of the estimators. In Section 5, we illustrate and compare the proposed estimators using two real-data applications. Finally, Section 6 contains a discussion of the results. The proofs of the main results are given in the Appendix.

2. Regularized minimum-distance estimators

To develop the methodology, we first consider two discrete probability distributions P = {p_i : i ∈ I} and Q = {q_i : i ∈ I}, where I is some discrete set, p_i, q_i > 0 for all i ∈ I, and ∑ p_i = ∑ q_i = 1. The squared Hellinger distance between P and Q is defined by H²(P, Q) = ∑_{i∈I} (√p_i − √q_i)² [3,39], and the symmetric chi-squared distance between P and Q is defined as C²(P, Q) = 2 ∑_{i∈I} (p_i − q_i)²/(p_i + q_i) [29,30]. Using the inequalities p_i + q_i ≤ (√p_i + √q_i)² ≤ 2(p_i + q_i), it follows that (1/4)C²(P, Q) ≤ H²(P, Q) ≤ (1/2)C²(P, Q) [27]. Therefore, there is a very strong near-equivalence relationship between the Hellinger distance and the symmetric chi-squared distance. That is, the two distances generate equivalent topologies on the space of distributions: there is a C²-ball inside every H²-ball and vice versa. The Hellinger distance H is not strongly affected by the presence of a few outliers, and these bounds show that this property carries over to some degree to the symmetric chi-squared distance C². On the other hand, both distances are closely linked to the total variation distance, defined by V(P, Q) = (1/2)∑_{i∈I} |p_i − q_i|, via the relationships V²(P, Q) ≤ C²(P, Q)/4 ≤ V(P, Q) and V²(P, Q) ≤ H²(P, Q) ≤ 2V(P, Q).
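These definitions and the sandwich bounds are easy to check numerically. The sketch below computes the three distances for a pair of arbitrary example distributions (the example vectors are our own, not from the paper); the asserted bounds hold for any pair of discrete distributions.

```python
import numpy as np

def hellinger_sq(p, q):
    """Squared Hellinger distance H^2(P, Q) = sum_i (sqrt(p_i) - sqrt(q_i))^2."""
    return np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

def sym_chisq(p, q):
    """Symmetric chi-squared distance C^2(P, Q) = 2 sum_i (p_i - q_i)^2 / (p_i + q_i)."""
    return 2.0 * np.sum((p - q) ** 2 / (p + q))

def total_variation(p, q):
    """Total variation distance V(P, Q) = (1/2) sum_i |p_i - q_i|."""
    return 0.5 * np.sum(np.abs(p - q))

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(6))  # two arbitrary probability vectors on 6 points
q = rng.dirichlet(np.ones(6))
H2, C2, V = hellinger_sq(p, q), sym_chisq(p, q), total_variation(p, q)

# The sandwich bounds quoted in the text:
assert C2 / 4 <= H2 <= C2 / 2
assert V ** 2 <= H2 <= 2 * V
assert V ** 2 <= C2 / 4 <= V
```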

In order to construct estimators of β for binary regression with grouped data, we compute Hellinger and symmetric chi-squared distances with the following two discrete probability distributions:

P_N = ( w_{1,N} p̂_1, …, w_{K,N} p̂_K, w_{1,N}(1 − p̂_1), …, w_{K,N}(1 − p̂_K) )^T  (3)
Q_N = ( w_{1,N} p_1, …, w_{K,N} p_K, w_{1,N}(1 − p_1), …, w_{K,N}(1 − p_K) )^T,  (4)

where p̂_j = Y_j/n_j, p_j = F(x_j^T β), w_{j,N} = n_j/N, n_j is the number of observations in group j, Y_j is the number of units with the attribute of interest in group j, j = 1,…,K, and N = ∑_{j=1}^K n_j. Note that p̂_j is a consistent estimator of p_j, 1 ≤ j ≤ K. Thus, the probability distributions P_N and Q_N are based on an estimated model and the true model, respectively. The squared Hellinger distance between P_N and Q_N is given by

H²(P_N, Q_N) = ∑_{j=1}^K w_{j,N} [ (√p̂_j − √p_j)² + (√(1 − p̂_j) − √(1 − p_j))² ] = ∑_{j=1}^K w_{j,N} [ (√p̂_j − √F(x_j^T β))² + (√(1 − p̂_j) − √(1 − F(x_j^T β)))² ],  (5)

and the symmetric chi-squared distance C2(PN,QN) is given by

C²(P_N, Q_N) = 2 ∑_{j=1}^K w_{j,N} { [p̂_j − F(x_j^T β)]² / [p̂_j + F(x_j^T β)] + [(1 − p̂_j) − (1 − F(x_j^T β))]² / [(1 − p̂_j) + (1 − F(x_j^T β))] }.  (6)

Then MHD and MSCD estimators of β can be obtained by minimizing H²(P_N, Q_N) and C²(P_N, Q_N), respectively, with respect to β. By simplifying (5) and (6), the MHD and MSCD estimators can equivalently be obtained as follows:

β̂_MHD = argmax_β ∑_{j=1}^K w_{j,N} [ √(p̂_j F(x_j^T β)) + √((1 − p̂_j)(1 − F(x_j^T β))) ]  (7)

and

β̂_MSCD = argmin_β ∑_{j=1}^K w_{j,N} [p̂_j − F(x_j^T β)]² / { [p̂_j + F(x_j^T β)] [2 − p̂_j − F(x_j^T β)] }.  (8)

The asymptotic properties of β̂_MHD and β̂_MSCD can be established following the techniques developed in Stather [41] and Karunamuni et al. [26]. The estimators β̂_MHD and β̂_MSCD have excellent robustness properties in the presence of outliers and/or model misspecification [3,4,20,29,30,39,44].
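For illustration, the criteria (7) and (8) can be optimized numerically with a general-purpose optimizer. The sketch below uses a logit link and simulated grouped data; the design, group sizes, true β, and optimizer choice are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def neg_affinity(beta, X, p_hat, w):
    """Negative of the objective in (7); minimizing it yields the MHDE."""
    F = np.clip(expit(X @ beta), 1e-10, 1 - 1e-10)
    return -np.sum(w * (np.sqrt(p_hat * F) + np.sqrt((1 - p_hat) * (1 - F))))

def mscd_objective(beta, X, p_hat, w):
    """Objective in (8): weighted symmetric chi-squared discrepancy."""
    F = np.clip(expit(X @ beta), 1e-10, 1 - 1e-10)
    return np.sum(w * (p_hat - F) ** 2 / ((p_hat + F) * (2 - p_hat - F)))

# Illustrative grouped data with a logit link (values are assumptions).
rng = np.random.default_rng(2)
K, n = 10, 200
beta_true = np.array([-1.5, 0.4])
X = np.column_stack([np.ones(K), np.arange(1.0, K + 1)])
y = rng.binomial(n, expit(X @ beta_true))
p_hat, w = y / n, np.full(K, 1.0 / K)  # equal group sizes give w_{j,N} = 1/K

beta_mhd = minimize(neg_affinity, np.zeros(2), args=(X, p_hat, w),
                    method="Nelder-Mead").x
beta_mscd = minimize(mscd_objective, np.zeros(2), args=(X, p_hat, w),
                     method="Nelder-Mead").x
```

With clean data, both estimators land close to each other and to the generating parameter, in line with their shared efficiency under the model.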

We now discuss the simultaneous estimation and variable selection problem. A penalty function generally facilitates variable selection in regression models. As discussed in the Introduction, in the present context simultaneous estimation and variable selection is generally carried out by maximizing the penalized log-likelihood function l_N(β) − P_N(β), where l_N(β) is the (conditional) log-likelihood function defined by (1) and P_N(β) is a penalty function on β. It can be shown that the resulting estimator β̂_PMLE (see (2)) has nice properties, including the oracle properties [15]; namely, it performs as well as if the true underlying model were given in advance. Such regularized methods have been widely used for simultaneous coefficient estimation and variable selection, identifying the covariates that are associated with the response variable. However, penalized likelihood procedures are not robust to outliers and model misspecification [15]. In other words, the estimator β̂_PMLE can be highly unstable if the model is not completely correct or if outliers are present.

We propose the following approach: we replace the log-likelihood function l_N(β) with a distance measure, such as H²(P, Q) or C²(P, Q), to develop a robust regularized estimator. In view of (5), a penalized MHD estimator of β is obtained by minimizing H²(P_N, Q_N) + P_N(β) with respect to β. Equivalently, in view of (7), a regularized MHD estimator of β can be obtained as

β̃_PMHD = argmax_β { ∑_{j=1}^K w_{j,N} [ √(p̂_j F(x_j^T β)) + √((1 − p̂_j)(1 − F(x_j^T β))) ] − P_N(β) }.  (9)

In view of (8), one can construct a regularized MSCD estimator of β as

β̃_PMSCD = argmin_β { ∑_{j=1}^K w_{j,N} [p̂_j − F(x_j^T β)]² / { [p̂_j + F(x_j^T β)] [2 − p̂_j − F(x_j^T β)] } + P_N(β) }.  (10)

In variable selection problems, it is assumed that some components of β are equal to zero. The goal is to identify and estimate the subset model. It has been argued that folded concave penalties are preferable to convex penalties such as the l1-penalty in terms of both model-estimation accuracy and variable selection consistency [16,32].

Let p_{λ_N}(|t|) denote a folded concave penalty function defined on t ∈ (−∞, +∞) satisfying:

  (a) p_{λ_N}(t) is increasing and concave in t ∈ [0, +∞);

  (b) p_{λ_N}(t) is differentiable in t ∈ (0, +∞) with p'_{λ_N}(0) := p'_{λ_N}(0+) ≥ a_1 λ_N, p'_{λ_N}(t) ≥ a_1 λ_N for t ∈ (0, a_2 λ_N], p'_{λ_N}(t) ≤ a_3 λ_N for t ∈ [0, +∞), and p'_{λ_N}(t) = 0 for t ∈ [a λ_N, +∞) for a prespecified constant a > a_2, where p'_{λ_N} denotes the first derivative of p_{λ_N}, and a_1, a_2, and a_3 are fixed positive constants.

The above family of general folded concave penalties contains several popular penalties, including the SCAD penalty [15] and the MCP penalty [53].
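For concreteness, the first derivative of the SCAD penalty (with the conventional choice a = 3.7 from [15]) can be used to form the adaptive weights p'_{λ_N}(|β_k^{(0)}|) appearing in the penalized criteria below. The following is a sketch of that standard formula; the initial estimate and tuning value are hypothetical.

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    """First derivative p'_lambda(t) of the SCAD penalty for t >= 0:
    lam for t <= lam, (a*lam - t)_+ / (a - 1) for t > lam (zero beyond a*lam)."""
    t = np.asarray(t, dtype=float)
    return lam * (t <= lam) + np.maximum(a * lam - t, 0.0) / (a - 1.0) * (t > lam)

# Adaptive weights from a hypothetical initial robust estimate beta^(0):
beta0 = np.array([1.1, 0.05, 0.8, 0.0])
weights = scad_deriv(np.abs(beta0), lam=0.2)
# Coefficients with large initial estimates are left unpenalized (weight 0),
# while coefficients near zero keep the full weight lam.
```

This is exactly the mechanism that makes the adaptive penalty asymptotically unbiased: the weight vanishes once |β_k^{(0)}| exceeds aλ_N.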

With a concave penalty function and in view of (9), the proposed regularized MHD estimator of β is defined by

β̂_PMHD = argmax_β { ∑_{j=1}^K w_{j,N} [ √(p̂_j F(x_j^T β)) + √((1 − p̂_j)(1 − F(x_j^T β))) ] − ∑_{k=1}^p p'_{λ_N}(|β_k^{(0)}|) |β_k| },  (11)

where β^{(0)} = (β_1^{(0)},…,β_p^{(0)})^T is an initial robust estimator of β. For example, β^{(0)} can be obtained from (7) or (8) above. We will show that β̂_PMHD has the oracle properties. Similarly, in view of (10), the proposed regularized MSCD estimator of β is defined by

β̂_PMSCD = argmin_β { ∑_{j=1}^K w_{j,N} [p̂_j − F(x_j^T β)]² / { [p̂_j + F(x_j^T β)] [2 − p̂_j − F(x_j^T β)] } + ∑_{k=1}^p p'_{λ_N}(|β_k^{(0)}|) |β_k| }.  (12)

As briefly mentioned in the Introduction, it has been argued in the literature that all minimum-distance estimators, including the MHD and MSCD estimators, are automatically robust with respect to the stability of the quantity being estimated [11]. In other words, they are only slightly affected by small departures from the true model. Furthermore, MHD estimators for count data have excellent robustness to outliers and model misspecification [39]. This can be attributed to the fact that both the Hellinger and symmetric chi-squared distances generate topologies that are equivalent to the topology of the total variation distance, which is known to produce highly robust estimators. For discrete data, MHD estimators attain the highest breakdown point, 1/2 [20,39]. The breakdown point of an estimator is the proportion of incorrect observations (i.e. arbitrary values) it can handle before giving an arbitrarily large result [19,22,23]. Further, Lindsay [29] has shown that the non-regularized MSCD estimators are highly efficient and robust in general. Thus, we can also expect the proposed regularized MHD and MSCD estimators, namely β̂_PMHD and β̂_PMSCD, to be highly efficient and robust.

Apart from Lindsay [29] and Karunamuni et al. [26], we are not aware of any significant work on the use of C2(P,Q) in statistical inference. However, H2(P,Q) has been widely implemented for both continuous and discrete distributions. The literature is too extensive for a complete listing here. Some recent developments and important references can be found in the articles Wu and Karunamuni [50–52] and Tang and Karunamuni [45,46], and in the monograph of Basu et al. [2].

3. Asymptotic properties of estimators

We first introduce some notation. Let I^K = [0,1] × [0,1] × ⋯ × [0,1] (K terms) denote the K-fold product of the interval [0,1], and let W = {w : w = (w_1,…,w_K) ∈ I^K, w_j > 0, ∑_{j=1}^K w_j = 1}. Define G_K = I^K × W. For 1 ≤ j ≤ K, let w_{j,N} = n_j/N and p̂_j = Y_j/n_j, with Y_j ∼ B(n_j, p_j) and N = ∑_{j=1}^K n_j as defined in Section 2. We assume that the Y_j's are independent, 1 ≤ j ≤ K. Let w_N and p̂_N denote the K-dimensional vectors with components w_{j,N} (1 ≤ j ≤ K) and p̂_j (1 ≤ j ≤ K), respectively. Then (p̂_N, w_N) ∈ G_K. Let Θ ⊂ R^p denote the parameter space of β.

The main result of this paper is Theorem 3.4 below, which establishes the oracle properties of the penalized MHD estimator β̂_PMHD defined by (11). We first present some asymptotic properties of the MHD estimator β̂_MHD defined by (7); these results are helpful for establishing the oracle properties of β̂_PMHD. Generally, it is convenient to formulate β̂_MHD as a functional value. We define a functional T : G_K → Θ such that T(π, w) is a value of β ∈ Θ defined by

argmax_β ∑_{j=1}^K w_j [ √(π_j F(x_j^T β)) + √((1 − π_j)(1 − F(x_j^T β))) ].  (13)

If T(π, w) is not uniquely defined, then we choose one of the possible values arbitrarily. In terms of the functional T, the MHD estimator β̂_MHD defined by (7) is equal to T(p̂_N, w_N). This formulation of the MHD estimator makes it easier to prove the asymptotic results (e.g. [3,26,41,45], among others).

The following theorem establishes the consistency of the MHD estimator β̂_MHD. The proofs of the theorems given below are relegated to the Appendix. Let β_0 denote the true value of β.

Theorem 3.1

Suppose that Θ is a compact subset of R^p and F is continuous and strictly increasing on R. Suppose further that π = (π_1,…,π_K)^T with π_j = F(x_j^T β_0) for 1 ≤ j ≤ K, and that the x_j's span R^p. Assume that w_{j,N} → w_j > 0 as N → ∞ for 1 ≤ j ≤ K. Then the MHD estimator β̂_MHD is a consistent estimator of β_0; i.e. β̂_MHD →_P β_0 as N → ∞, where →_P stands for convergence in probability.

The next theorem lays the necessary foundation for a result on the asymptotic normality of the MHD estimator β̂_MHD.

Theorem 3.2

Suppose Θ is a compact subset of R^p and let C = {x_j^T β : β ∈ Θ, 1 ≤ j ≤ K}. Suppose further that F is strictly increasing and thrice differentiable with derivatives f, f^(1), and f^(2) bounded on C. Assume that F(C) ⊂ [δ, 1 − δ] for some δ > 0. Let (π, w) ∈ G_K be such that T(π, w) is unique, and let {(π_n, w_n) ∈ G_K : n ≥ 1} be a deterministic sequence such that (π_n, w_n) → (π, w) as n → ∞. Let Σ(β) be the p × p matrix defined by Σ(β) = ∑_{j=1}^K w_j x_j x_j^T G_j^(1)(x_j^T β), and let λ(π, w, β) be the p × 1 vector defined by λ(π, w, β) = ∑_{j=1}^K w_j x_j G_j(x_j^T β), where G_j(y) = (d/dy){ √(π_j F(y)) + √((1 − π_j)(1 − F(y))) } for 1 ≤ j ≤ K. Assume that the matrix Σ(β) is nonsingular. Then, as n → ∞, we have

T(π_n, w_n) − T(π, w) = [Σ^{-1}(β) + o(1)] λ(π_n, w_n, β),  (14)

where λ(π_n, w_n, β) is obtained from λ(π, w, β) by replacing (π, w) with (π_n, w_n). Then we also have

T(π_n, w_n) − T(π, w) = 4[Σ_*^{-1}(β) + o(1)] λ(π_n, w_n, β),  (15)

where Σ_*(β) is the p × p matrix defined by

Σ_*(β) = ∑_{j=1}^K w_j { f²(x_j^T β) [ (π_j F(x_j^T β))^{1/2} + ((1 − π_j)(1 − F(x_j^T β)))^{1/2} ]² / [ F(x_j^T β)(1 − F(x_j^T β)) ] } x_j x_j^T.

The next theorem establishes the asymptotic normality of the MHD estimator β̂_MHD.

Theorem 3.3

Assume that the conditions of Theorems 3.1 and 3.2 hold. Further, assume that the expansion (14) holds for T(p̂_N, w_N), with the o(1) term replaced by o_p(1), where β̂_MHD = T(p̂_N, w_N). Let (π, w) ∈ G_K be such that T(π, w) is unique and T(π, w) = β_0. Then, as N → ∞, we have

√N (β̂_MHD − β_0) →_D N( 0, (1/16) Σ^{-1}(β_0) Σ_*(β_0) Σ^{-1}(β_0) ),  (16)

where Σ(β) and Σ_*(β) are as defined in Theorem 3.2. If π_j = F(x_j^T β_0) for 1 ≤ j ≤ K, then

√N (β̂_MHD − β_0) →_D N(0, Σ̄^{-1}(β_0)),  (17)

where Σ̄(β_0) = ∑_{j=1}^K w_j x_j x_j^T f²(x_j^T β_0) / [F(x_j^T β_0)(1 − F(x_j^T β_0))].

In the next theorem we show that the penalized MHD estimator β̂_PMHD defined by (11) has the oracle properties. Without loss of generality, write β = (β_1^T, β_2^T)^T, where β_1 ∈ R^d and β_2 ∈ R^{p−d}. The vector of true parameters is denoted by β_0 = (β_{01}^T, β_{02}^T)^T, with each element of β_{01} nonzero and β_{02} = 0.

Theorem 3.4

Assume that the conditions of Theorem 3.3 hold. Let p_{λ_N}(·) be a general folded concave penalty function satisfying assumptions (a) and (b) in Section 2. If λ_N → 0 and √N λ_N → ∞ as N → ∞, then the penalized MHD estimator β̂_PMHD = (β̂_PMHD1^T, β̂_PMHD2^T)^T defined by (11) satisfies:

  1. Sparsity: P(β̂_PMHD2 = 0) → 1;

  2. Asymptotic normality:
    √N (β̂_PMHD1 − β_{01}) →_D N( 0, (1/16) Σ_1^{-1}(β_0) Σ_{*1}(β_0) Σ_1^{-1}(β_0) ),  (18)
    where Σ_1(β) = ∑_{j=1}^K w_j x_{j1} x_{j1}^T G_j^(1)(x_j^T β) and
    Σ_{*1}(β) = ∑_{j=1}^K w_j { f²(x_j^T β) [ (π_j F(x_j^T β))^{1/2} + ((1 − π_j)(1 − F(x_j^T β)))^{1/2} ]² / [ F(x_j^T β)(1 − F(x_j^T β)) ] } x_{j1} x_{j1}^T,
    with x_j = (x_{j1}^T, x_{j2}^T)^T and x_{j1} ∈ R^d. If π_j = F(x_j^T β_0) for 1 ≤ j ≤ K, then we have
    √N (β̂_PMHD1 − β_{01}) →_D N(0, Σ̄_1^{-1}(β_0)),  (19)
    where Σ̄_1(β_0) = ∑_{j=1}^K w_j x_{j1} x_{j1}^T f²(x_j^T β_0) / [F(x_j^T β_0)(1 − F(x_j^T β_0))].

Recall the regularized MLE β̂_PMLE defined by (2) in the Introduction. For a general folded concave penalty function p_{λ_N}(·) as in Theorem 3.4, it can be shown that β̂_PMLE satisfies an asymptotic normality property such as (19) when π_j = F(x_j^T β_0) for 1 ≤ j ≤ K. That is, we have √N (β̂_PMLE1 − β_{01}) →_D N(0, Σ̄_1^{-1}(β_0)) as N → ∞ under some regularity conditions. By comparing this result with (19), one can see the asymptotic equivalence of the penalized MHD estimator β̂_PMHD and the penalized MLE β̂_PMLE. Thus, the penalized MHD and MLE estimators share some (asymptotic) optimality properties. It can be shown that the penalized MSCD estimator β̂_PMSCD defined by (12) is also asymptotically equivalent to the penalized MLE β̂_PMLE under some conditions. The advantage of the penalized and non-penalized MHD and MSCD estimators is that they possess excellent robustness properties (see Section 5), which the corresponding MLEs generally lack.

It can be shown that the MLE β̂_MLE exhibits the same asymptotic normality property as (17). By comparing (17) and (19), we observe that the penalized MHD estimator β̂_PMHD possesses the oracle properties and is asymptotically as efficient as the MLE for estimating β_{01} when β_{02} = 0 is known in advance. Thus, β̂_PMHD is a fully efficient oracle procedure with excellent robustness properties. The penalized MSCD estimator β̂_PMSCD possesses the same properties. This is the most striking feature of β̂_PMHD and β̂_PMSCD; to the best of our knowledge, no other estimators have this property in the present context. Also, from Theorem 3.4 we note that β̂_PMHD is asymptotically unbiased. This is because we have employed an adaptive penalty function in defining β̂_PMHD. For non-adaptive penalty functions, such as the SCAD penalty [15], there would be an extra bias term, which would be a function of p'_{λ_N}(|t|), the first derivative of p_{λ_N}(|t|); this extra term would, however, be negligible for large |t| in the case of the SCAD penalty. To select the regularization parameter λ_N in practice, we can use a data-driven method, such as cross-validation, AIC, or BIC.

4. Monte Carlo studies

4.1. Estimation

In this subsection, we conduct simulation studies to compare the finite-sample performance of the MHD and MSCD estimators (denoted MHDE and MSCDE in this section) defined by (7) and (8), respectively, with that of the traditional MLE (see (1)). The behavior of MHDE and MSCDE is studied under contamination models. For our simulations, we considered K = 10 groups. Within each group, the sample size is set to n_j = n, j = 1,…,10. Thus, for the jth group, we generated n data points from a Bernoulli distribution with probability of success F(x_j^T β), j = 1,…,10, where F(·) is a CDF. In this subsection, we used the CDF of the Logistic(1, 1.2) distribution for F. We computed the bias and the mean squared error (MSE) of each estimator based on M replications as follows:

Bias(β̂_m) = (1/M) ∑_{i=1}^M (β̂_{m,i} − β_m),  MSE(β̂_m) = (1/M) ∑_{i=1}^M (β̂_{m,i} − β_m)²,  m = 0, 1, 2,

that is, the average of each performance measure over the M repetitions. In this subsection, we set M = 1000. We also set β^T = (β_0, β_1) = (1.5, 0.4) as the true parameter vector and x_j = (1, x_j)^T, where x_j = j for j = 1,…,10. We first examined the following two models:

  • Model I: F1(y)=F(y);

  • Model II: F2(y)=0.9F(y)+0.1.

Model I is the clean model (i.e. there is no contamination), and Model II represents an overall increase in the response for 10% of the observations.
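The bias/MSE computation described above can be sketched as follows for the MLE under the clean Model I. The number of replications is reduced from the paper's 1000 for brevity, and the optimizer choice is an implementation detail of this sketch, not taken from the paper.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def F(y, mu=1.0, s=1.2):
    """CDF of the Logistic(1, 1.2) distribution used for the link in Section 4.1."""
    return expit((y - mu) / s)

def mle(X, y, sizes):
    """Grouped-data MLE maximizing the log-likelihood (1) for this link."""
    def nll(beta):
        p = np.clip(F(X @ beta), 1e-10, 1 - 1e-10)
        return -np.sum(y * np.log(p) + (sizes - y) * np.log1p(-p))
    return minimize(nll, np.zeros(X.shape[1]), method="Nelder-Mead").x

rng = np.random.default_rng(3)
K, n, M = 10, 50, 200                  # M reduced from 1000 for speed
beta_true = np.array([1.5, 0.4])       # true parameter vector from the text
X = np.column_stack([np.ones(K), np.arange(1.0, K + 1)])
sizes = np.full(K, n)

est = np.empty((M, 2))
for i in range(M):                     # Model I: clean data, no contamination
    y = rng.binomial(sizes, F(X @ beta_true))
    est[i] = mle(X, y, sizes)
bias = est.mean(axis=0) - beta_true
mse = ((est - beta_true) ** 2).mean(axis=0)
```

The same loop, with `mle` swapped for a routine optimizing (7) or (8), produces the MHDE and MSCDE columns of the tables.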

Table 1 presents the simulation results for the bias and MSE of the three estimators under Models I and II. For Model I, the biases of the MLE are considerably lower than those of MHDE and MSCDE. Indeed, the MLE performs better than MSCDE and MHDE under Model I, but the MSE differences are small. For Model II, the biases of the three methods are comparable. Based on the MSE values for Model II, MSCDE performs best, followed by MHDE. MSCDE is preferable in applications where subjects may show a response not caused by the stimulus/treatment under examination, e.g. if they recover naturally.

Table 1. Biases and MSEs of MHDE, MSCDE, and MLE for Models I and II.

Model n Estimator Bias(β0) Bias(β1) MSE(β0) MSE(β1)
  30 MLE 0.0075 −0.0007 0.1111 0.0028
I 30 MHDE −0.0291 0.0051 0.1124 0.0030
  30 MSCDE −0.0730 0.0126 0.1208 0.0032
  50 MLE 0.0021 0.0001 0.0767 0.0020
I 50 MHDE −0.0210 0.0038 0.0763 0.0020
  50 MSCDE −0.0473 0.0082 0.0892 0.0023
  30 MLE 0.6113 −0.0532 0.4678 0.0056
II 30 MHDE 0.5850 −0.0481 0.4419 0.0052
  30 MSCDE 0.5805 −0.0428 0.4255 0.0049
  50 MLE 0.5992 −0.0497 0.4259 0.0043
II 50 MHDE 0.5838 −0.0466 0.4081 0.0040
  50 MSCDE 0.5721 −0.0426 0.4032 0.0037

Next, we considered a case where different covariates are assigned to different groups; see, e.g. Stephenson et al. [43]. Specifically, we set x_j = 0 for j = 1, 2, x_j = j for j = 3, 4, x_j ∼ N(j, 1) for j = 5, 6, 7, and x_j ∼ U[j − 1, j + 1] for j = 8, 9, 10. Table 2 displays the simulation results for the bias and MSE of the three estimators under Models I and II. We observe that the results in Table 2 are similar to those in Table 1; that is, the MLE performs better than MSCDE and MHDE under Model I, whereas MSCDE and MHDE do better than the MLE under Model II.

Table 2. Simulation results for Models I and II with different covariates for different groups.

Model n Estimator Bias(β0) Bias(β1) MSE(β0) MSE(β1)
  30 MLE −0.2879 0.0099 0.1426 0.0017
I 30 MHDE −0.3270 0.0153 0.1584 0.0017
  30 MSCDE −0.3326 0.0178 0.1789 0.0020
  50 MLE −0.2959 0.0120 0.1190 0.0009
I 50 MHDE −0.3158 0.0147 0.1295 0.0009
  50 MSCDE −0.3218 0.0158 0.1353 0.0014
  30 MLE 0.3330 −0.0416 0.3033 −0.0367
II 30 MHDE 0.3033 −0.0367 0.1898 0.0040
  30 MSCDE 0.2940 −0.0319 0.1700 0.0037
  50 MLE 0.3252 −0.0381 0.1744 0.0034
II 50 MHDE 0.3014 −0.0349 0.1612 0.0031
  50 MSCDE 0.2777 −0.0315 0.1520 0.0030

We now generate the data from the following model:

  • Model III: P(Y_{ji} = 1 | x_j) = F(x_j^T β + ε_{ji}),

where β^T = (β_1, β_2, β_3) = (1, 0.4, 0.8) and x_j = (1, x_{j1}, x_{j2})^T; x_{j1} and x_{j2} are mutually independent with x_{jk} ∼ U[0, 2] for k = 1, 2; the ε_{ji} are independent random variables with common mixture distribution (1 − α)N(0, 0.5²) + αN(8, 0.2²); and F is the CDF of the Logistic(1, 1.2) distribution. Model III includes the case where there are errors in the measurement or recording of the x_j. We consider two cases: α = 0 and α = 0.1. When α = 0, the ε_{ji} follow the normal distribution N(0, 0.5²). When α = 0.1, the ε_{ji} are drawn from a contaminated normal distribution in which about 10% come from a N(8, 0.2²) distribution, which can be interpreted as an outlier distribution.

Table 3 reports the simulation results for the bias and MSE of MHDE, MSCDE, and MLE for Model III with n = 30, based on 1000 replications. One can see from Table 3 that when α = 0, the MLE has a smaller MSE than MSCDE and MHDE, but MHDE and MSCDE have smaller absolute biases than the MLE. When α = 0.1, MSCDE and MHDE outperform the MLE, and MSCDE behaves better than MHDE. These results indicate that MHDE and MSCDE are more robust than the MLE to outliers. We also see from Tables 1–3 that MSCDE is more robust than MHDE. Simulations using the CDF of the N(0, 1) distribution for F gave similar results and are omitted.

Table 3. Biases and MSEs of MHDE, MSCDE, and MLE for Model III.

α Method Bias(β0) Bias(β1) Bias(β2) MSE(β0) MSE(β1) MSE(β2)
  MLE 0.0712 −0.0223 −0.0258 0.2441 0.1128 0.0976
0 MHDE 0.0425 −0.0168 −0.0137 0.2486 0.1155 0.1006
  MSCDE −0.0021 −0.0045 0.0012 0.2705 0.1246 0.1078
  MLE 0.5833 −0.0702 −0.1466 0.5178 0.1808 0.1258
0.1 MHDE 0.5682 −0.0668 −0.1397 0.5037 0.1775 0.1153
  MSCDE 0.5515 −0.0649 −0.1316 0.4905 0.1767 0.1146

4.2. Variable selection

In this subsection, we carried out a simulation study to compare the performance of the penalized MHDE and penalized MSCDE defined by (11) and (12), respectively, with that of the penalized MLE defined by (2). For all three estimators, we used an adaptive penalty function of the form ∑_{k=1}^p p'_{λ_N}(|β_k^{(0)}|)|β_k|, with β^{(0)} being the corresponding non-penalized MHDE or MSCDE defined in Section 2, or the MLE based on the log-likelihood function l_N(β) defined by (1). Further, we employed the SCAD penalty function with a = 3.7 for p_{λ_N}(·). The tuning parameter λ_N in p_{λ_N}(·) is chosen by the method given in Fan et al. [14].

We considered K = 20 groups. For each group, the sample size is set to n_j = n, j = 1,…,K. In this section, we set n = 10, 20. For the jth group, we generated n data points from a Bernoulli distribution with probability of success F(x_j^T β), j = 1,…,K. The simulation results presented in this section are based on 500 replications.

We first considered the case where the data are generated from the following model:

Model IV:F(xjTβ)=L(xjTβ),

where L(y) denotes the CDF of the Logistic(2, 3) distribution, β^T = (β_1, β_2, β_3, β_4) = (1.2, 0, 0.9, 0), and x_j = (1, x_{j1}, x_{j2}, x_{j3})^T, where x_{j1}, x_{j2}, and x_{j3} are mutually independent with x_{jk} ∼ U[0, 2] for k = 1, 2, 3.

We measured the estimation accuracy by the average l_1-losses |β̂_1 − β_1| and |β̂_3 − β_3| over the 500 replications. We also evaluated the selection accuracy by the average counts of false positives (FPs) and false negatives (FNs), i.e. the number of noise covariates included in the model and the number of signal covariates not included. Table 4 gives the results for Model IV. From Table 4, we observe that both the penalized MSCDE and the penalized MHDE are comparable to the penalized MLE in variable selection but have larger absolute biases than the penalized MLE. Table 4 also shows that the penalized MSCDE outperforms the penalized MHDE.

Table 4. Comparison of (penalized) MHDE, MSCDE, and MLE for Model IV.

  Method FP FN |βˆ1β1| |βˆ3β3|
K = 20, n = 10 MLE 0.2880 0.2640 0.4381 0.3891
  MHDE 0.1820 0.3880 0.4823 0.4677
  MSCDE 0.1760 0.2800 0.4519 0.4096
K = 20, n = 20 MLE 0.2100 0.2100 0.3969 0.3527
  MHDE 0.1440 0.2580 0.4234 0.3869
  MSCDE 0.1260 0.2540 0.3953 0.3826

Next, we generated the data from the following model:

Model V: F(x_j^T β) = (1 − ς)L(x_j^T β) + ς,

where 0 < ς < 1, and L(y), β, and x_j are as defined in Model IV. Table 5 gives the results for Model V. In this case, we see from Table 5 that both the penalized MSCDE and the penalized MHDE outperform the penalized MLE. Table 5 also shows that when ς changes from 0.1 to 0.2, the FP, FN, and l_1-losses of all three methods become larger.

Table 5. Comparison of (penalized) MHDE, MSCDE, and MLE for Model V.

  ς Method FP FN |βˆ1β1| |βˆ3β3|
K = 20, n = 10 0.1 MLE 0.2320 0.1940 0.5417 0.3335
    MHDE 0.1540 0.1040 0.5356 0.3123
    MSCDE 0.1520 0.1260 0.5054 0.3073
K = 20, n = 20 0.1 MLE 0.1500 0.1740 0.5036 0.3180
    MHDE 0.1120 0.0920 0.4970 0.2851
    MSCDE 0.1360 0.0780 0.4833 0.2672
K = 20, n = 10 0.2 MLE 0.3080 0.1960 0.8085 0.3576
    MHDE 0.2040 0.1180 0.8022 0.3133
    MSCDE 0.2120 0.1240 0.7912 0.3082
K = 20, n = 20 0.2 MLE 0.2420 0.1640 0.8678 0.3294
    MHDE 0.1160 0.0960 0.8509 0.3121
    MSCDE 0.1320 0.0760 0.8405 0.3085

We now generate the data from the following model:

Model VI:P(Yji=1|xj)=F(xjTβ+εji),

where βT=(β1,…,β7)=(0.6,0,0.9,0,0,1.1,0); xj=(1,xj1,…,xj6)T, where xj1,…,xj6 are mutually independent and xjk∼U[0,2] for k=1,…,6; the εji are independent random variables with common mixture distribution (1−α)N(0,0.5²)+αN(0,10²); and F is the CDF of the Logistic (2,3) distribution. Table 6 reports the simulation results for Model VI with α=0, 0.1 based on 500 replications. Table 6 shows that when α=0, the penalized MSCDE and penalized MHDE perform better than the penalized MLE in variable selection as well as in the estimation of β1, whereas the penalized MLE has lower l1-losses for β3 and β6 than the penalized MSCDE and penalized MHDE. When α=0.1, the penalized MHDE and penalized MSCDE outperform the penalized MLE. The values in Table 6 also reveal that the estimation and variable selection of all three methods improve as n increases. Based on Tables 4-6, we conclude that the penalized MSCDE and penalized MHDE are more robust than the penalized MLE in the presence of outliers.
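A sketch of the Model VI generating process (ours; the per-observation mixture errors are drawn by first selecting a standard deviation of 0.5 or 10 with probabilities 1−α and α):

```python
import numpy as np

def simulate_model_vi(K=20, n=20, alpha=0.1, seed=0):
    """Model VI: latent linear predictor plus per-observation errors from the
    mixture (1 - α) N(0, 0.5²) + α N(0, 10²), through the Logistic(2, 3) CDF."""
    rng = np.random.default_rng(seed)
    beta = np.array([0.6, 0.0, 0.9, 0.0, 0.0, 1.1, 0.0])
    x = np.column_stack([np.ones(K), rng.uniform(0.0, 2.0, size=(K, 6))])
    eta = x @ beta
    # per-observation mixture indicator: sd 10 with prob alpha, else sd 0.5
    sd = np.where(rng.random((K, n)) < alpha, 10.0, 0.5)
    eps = rng.normal(0.0, 1.0, size=(K, n)) * sd
    p = 1.0 / (1.0 + np.exp(-(eta[:, None] + eps - 2.0) / 3.0))
    y = rng.binomial(1, p)   # individual binary responses Y_ji
    return x, y

x, y = simulate_model_vi()
```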

Table 6. Comparison of (penalized) MHDE, MSCDE, and MLE for Model VI.

    Method FP FN |βˆ1 − β1| |βˆ3 − β3| |βˆ6 − β6|
α=0, K = 20, n = 20 MLE 0.5140 0.5920 0.4797 0.2390 0.2234
    MHDE 0.3340 0.4000 0.4580 0.2611 0.2308
    MSCDE 0.1880 0.5380 0.4670 0.2707 0.2498
α=0, K = 20, n = 40 MLE 0.3820 0.5280 0.4366 0.2187 0.2032
    MHDE 0.2780 0.3300 0.3711 0.2209 0.2165
    MSCDE 0.2120 0.4300 0.4294 0.2203 0.2316
α=0.1, K = 20, n = 20 MLE 0.6600 0.6260 0.4861 0.2683 0.2575
    MHDE 0.2880 0.3820 0.4589 0.2625 0.2422
    MSCDE 0.2280 0.5600 0.4759 0.2468 0.2352
α=0.1, K = 20, n = 40 MLE 0.5100 0.6020 0.4605 0.2466 0.2392
    MHDE 0.3040 0.3180 0.3742 0.2308 0.2207
    MSCDE 0.2280 0.5600 0.4759 0.2468 0.2352

5. Real-data applications

In this section, we analyze two real-data sets. For each data set, we estimate the vector β using the MHDE and MSCDE defined by (7) and (8), respectively, and the MLE βˆMLE (see (1)).

5.1. Example 1

We first apply our methods to analyze a data set on Caesarian birth previously analyzed by Fahrmeir and Tutz [13] using the maximum likelihood approach. The response variable of interest is the occurrence or nonoccurrence of infection of type I or II. Three dichotomous covariates that may influence the risk of infection were considered: Was the Caesarian section planned or not? Were risk factors such as diabetes or excessive weight present? Were antibiotics given as a prophylaxis? The aim is to analyze the effects of the covariates on the risk of infection, and in particular to determine whether antibiotics can decrease this risk. We define a binary response variable Y with Y = 0 if there is no infection and Y = 1 if there is infection of either type. We constructed logistic and probit models to fit this data set:

P(Yj=1|xj)=F(xjTβ)=exjTβ/(1+exjTβ), (20)
P(Yj=1|xj)=F(xjTβ)=Φ(xjTβ), (21)

where xj=(1,xj1,xj2,xj3)T, β=(β0,β1,β2,β3)T, and Φ() is the CDF of the standard normal distribution. We set xj1=1 if the Caesarian was not planned and xj1=0 if it was planned; xj2=1 indicates that antibiotics were given, and xj3=1 indicates that there were risk factors present.

Fahrmeir and Tutz [13] implemented the maximum likelihood method to estimate the unknown parameter vector β. We computed the MHDE, MSCDE and MLE of β. Table 7 gives the parameter estimates for the three methods and the MSE defined by MSE = (1/7)∑_{j=1}^{7}(pˆj − pj)², where pj is the probability of infection for the jth group and pˆj=F(xjTβˆ), with βˆ being an estimator of β. Figure 1 exhibits the scatterplots of the points (pj,pˆj), j=1,…,7, for models (20) and (21) under the three methods, where pˆj is estimated with the jth group's data deleted. From the MSE values in Table 7, we note that the MSCDE is clearly the best, with the MHDE a close second. The scatterplots in Figure 1 are generally similar for the three methods, but Panels (c) and (f) of Figure 1 suggest that the MLE takes large values for some groups with p = 0. In this case, the MSCDE and MHDE appear to be more robust with respect to possible deviations of the postulated model from the true model. In Figure 1, comparing (a), (b) and (c) with (d), (e) and (f), respectively, we note that the scatterplots for model (20) are similar to those for model (21), suggesting that neither model may be the true one.
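The group-level MSE criterion of Table 7 can be computed as in the following sketch (ours; the Caesarian data themselves are not reproduced here, so the design matrix `X` and observed proportions `p_obs` are placeholders to be supplied):

```python
import numpy as np
from math import erf, sqrt

def group_mse(beta_hat, X, p_obs, link="logit"):
    """MSE = (1/K) * sum_j (p_hat_j - p_j)^2 over the K groups, where
    p_hat_j = F(x_j' beta_hat) under a logit (model (20)) or probit
    (model (21)) link."""
    eta = X @ np.asarray(beta_hat, dtype=float)
    if link == "logit":
        p_hat = 1.0 / (1.0 + np.exp(-eta))
    else:  # probit: standard normal CDF via the error function
        p_hat = np.array([0.5 * (1.0 + erf(e / sqrt(2.0))) for e in eta])
    return float(np.mean((p_hat - np.asarray(p_obs)) ** 2))
```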

Table 7. Parametric estimates and MSEs for models (20) and (21).

Model Method βˆ0 βˆ1 βˆ2 βˆ3 MSE
  MHDE −3.0 0.58 3.55 −3.3 0.0099
(20) MSCDE −3.9 1.52 4.0 −3.5 0.0064
  MLE −1.89 1.07 2.03 −3.25 0.0163
  MHDE −2.1 0.45 2.4 −1.95 0.0095
(21) MSCDE −2.95 0.9 3.0 −2.05 0.0066
  MLE −1.09 0.61 1.2 −1.9 0.0174

Figure 1.

Scatterplot for points (pj,pˆj),j=1,,7. Panels (a), (b), and (c) are for the MHDE, MSCDE and MLE for model (20). Panels (d), (e), and (f) are for the MHDE, MSCDE and MLE for model (21). The diagonal line is y = x.

5.2. Example 2

Next we analyzed the data set given in Little [31] using our methods and the maximum likelihood method. These data present a distribution of 1607 married and fecund women interviewed in the Fiji Fertility Survey of 1975, classified by age, level of education, desire for more children, and contraceptive use. We view the use of contraception as the response variable Y and age ( X1), education ( X2), and desire for more children ( X3) as the covariates. We set Y = 1 for contraceptive use and Y = 0 for no contraceptives. The women are classified into K = 16 groups in terms of their age, education, and desire for more children. We use the following logistic model:

P(Yj=1|xj)=F(xjTβ)=exjTβ/(1+exjTβ), (22)

where xj=(1,xj1,xj2,xj3)T and β=(β0,β1,β2,β3)T. For age, we set xj1=21.5 for j=1,,4; xj1=27 for j=5,,8; xj1=34.5 for j=9,,12; and xj1=44.5 for j=13,,16. We also set xj2=1 for a higher level of education and xj2=0 for a lower level. Further, xj3=1 denotes a desire for more children and xj3=0 denotes no such desire.
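With this coding, the 16 × 4 design matrix can be constructed as follows (a sketch; the ordering of the education/desire combinations within each age block is our assumption):

```python
import numpy as np
from itertools import product

def fiji_design():
    """K = 16 group design matrix: intercept, age (four values shared in
    blocks of four groups), education (0/1), desire for more children (0/1)."""
    ages = [21.5, 27.0, 34.5, 44.5]
    rows = [[1.0, age, float(educ), float(desire)]
            for age, educ, desire in product(ages, (0, 1), (0, 1))]
    return np.array(rows)

X = fiji_design()
```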

The parameters of model (22) are estimated using the MHDE, MSCDE and MLE. Table 8 gives the estimates for the three methods and the MSE defined by MSE = (1/16)∑_{j=1}^{16}(pˆj − pj)², where pj is the proportion of contraceptive use for the jth group and pˆj=F(xjTβˆ). The MHDE and MSCDE give almost the same results, and they have lower MSE values than the MLE.

Table 8. Parametric estimates and MSEs for the three methods.

Method βˆ0 βˆ1 βˆ2 βˆ3 MSE
MHDE −2.4500 0.0600 0.4000 −0.9000 0.0064
MSCDE −2.4500 0.0600 0.4000 −0.9500 0.0064
MLE −0.3110 0.0740 −0.0196 −1.1224 0.0193

To evaluate the prediction performance of the three methods, we applied leave-one-out cross-validation to the data; i.e. to predict the proportion of contraceptive use for the jth group, we omitted the data for this group when fitting the model. Figure 2 displays the boxplots of the absolute prediction errors |pj − pˆj|, j=1,…,16, for the three methods. The mean values of these errors for the MHDE, MSCDE, and MLE are 0.0900, 0.0892, and 0.1226, respectively. These observations and Figure 2 suggest that the MSCDE and MHDE have better prediction performance than the MLE. Since we set the age xj1=21.5 for j=1,…,4 and set the age to the median for the other groups, it is more appropriate to use the following model to fit this data set:

P(Yji=1|xj)=F(xjTβ+εji), j=1,…,16, i=1,…,nj,

where the εji are random errors and the CDF F is as in (22). As observed in the simulation results of Section 4, the MSCDE and MHDE are more robust than the MLE under this contaminated model. In practical applications, the true model for a given data set is generally unknown, and the postulated model is usually not correct. The simulation results in Section 4 suggest that the MSCDE and MHDE may offer some protection when the postulated model deviates from the true model. This is an added advantage of using minimum-distance methods such as MSCD and MHD, which are generally robust to model misspecification.
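The leave-one-group-out scheme used above can be sketched generically as follows (ours; the `fit` argument stands in for an MHDE, MSCDE, or MLE routine, which is not reproduced here):

```python
import numpy as np

def loo_prediction_errors(X, p_obs, fit):
    """Absolute prediction errors |p_j - p_hat_j| under leave-one-group-out:
    group j is dropped when fitting, then its proportion is predicted."""
    K = X.shape[0]
    errors = np.empty(K)
    for j in range(K):
        keep = np.arange(K) != j                 # drop group j when fitting
        beta_hat = np.asarray(fit(X[keep], p_obs[keep]))
        p_hat = 1.0 / (1.0 + np.exp(-float(X[j] @ beta_hat)))  # logistic link (22)
        errors[j] = abs(p_obs[j] - p_hat)
    return errors

# illustration with a dummy 'fit' that always returns the zero vector
demo = loo_prediction_errors(np.ones((4, 2)),
                             np.array([0.5, 0.6, 0.4, 0.5]),
                             lambda Xk, pk: np.zeros(2))
```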

Figure 2.

Boxplots of the absolute prediction errors |pj − pˆj|, j=1,…,16, for the three methods. Here 1, 2, and 3 denote the boxplots for the MHDE, MSCDE, and MLE, respectively.

6. Discussion

In this paper, we have investigated simultaneous robust estimation and variable selection for binary regression models with grouped data. In many practical situations, the data are available only in a grouped form, or the observed data can be formed into groups based on the covariates observed for each subject. Working with grouped data has the additional advantage that it is possible to test the goodness of fit of a postulated model. The maximum likelihood approach is the most widely used method of inference for binary regression models. However, MLEs are sensitive to atypical data and model misspecification, and their lack of robustness has motivated researchers to develop more robust approaches. A well-known alternative is the minimum-distance approach, which has been observed to produce estimators with excellent robustness properties against model misspecification and the presence of outliers. We have examined two minimum-distance estimation methods, namely MHD and MSCD, for binary regression models with grouped data. The results of our simulations and two real-data analyses show that the MHD and MSCD estimators have good robustness properties; the MSCD estimator is marginally better than its MHD counterpart in small samples, but both are equally efficient asymptotically. Further, they outperform the MLE in the presence of outliers and model misspecification.

Regularization methods play an important role in identifying the covariates that truly affect a response in models containing covariates and a response variable, and they have been widely used for simultaneous coefficient estimation and variable selection. The importance of robust procedures has also been stressed for regularization methods [14,15,25,40,46,49]. Many well-known regularization methods are based on solving an optimization problem formed by the sum of a ‘loss function’ and a penalty function. We have constructed regularized estimators using the squared Hellinger and symmetric chi-squared distances as loss functions combined with an adaptive l1-penalty function. These techniques produce regularized estimators for binary regression models with grouped data that are both robust and efficient, and we have shown that our estimators satisfy the oracle properties. Such optimal regularized procedures were not previously available for binary regression models with grouped data. Furthermore, our numerical studies have shown that our penalized estimators are more stable in the presence of outliers and model misspecification than the corresponding MLE. Overall, full efficiency combined with excellent robustness and computational feasibility makes the proposed estimators very appealing in practice, and we expect our regularized methods to be useful in practical applications.
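The ‘loss plus adaptive l1-penalty’ structure described above can be illustrated with a generic sketch (ours, not the paper's exact criterion; the paper maximizes a distance-based objective, whereas this sketch is phrased as minimizing a loss, with the standard adaptive weights 1/|pilot estimate|):

```python
import numpy as np

def adaptive_l1_objective(beta, loss, beta_pilot, lam, eps=1e-6):
    """Penalized criterion: loss(beta) + lam * sum_k w_k |beta_k|, with
    adaptive weights w_k = 1 / |beta_pilot_k| so that coefficients with
    large pilot estimates are penalized lightly. The intercept (index 0)
    is left unpenalized (our convention for this sketch)."""
    beta = np.asarray(beta, dtype=float)
    w = 1.0 / np.maximum(np.abs(np.asarray(beta_pilot, dtype=float)), eps)
    penalty = lam * float(np.sum(w[1:] * np.abs(beta[1:])))
    return loss(beta) + penalty

# with a zero loss and unit pilot estimates, only the penalty contributes
val = adaptive_l1_objective([1.0, 0.5, 0.0], lambda b: 0.0,
                            [1.0, 1.0, 1.0], lam=0.1)
```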

Acknowledgments

We wish to thank the Editor, an Associate Editor and three reviewers for careful reading of our paper and for their helpful comments that led to a substantial improvement in this paper. Q. Tang's research was supported by the National Social Science Foundation of China (16BTJ019) and the Natural Science Foundation of Jiangsu Province of China (Grant No. BK20151481). R.J. Karunamuni's research was supported by a grant from the Natural Sciences and Engineering Research Council of Canada.

Appendix.

Here we give detailed proofs of the theorems stated in Section 3. The proofs are similar to those in Stather [41] and Karunamuni et al. [26], but we extend them here to binary regression models.

Let β0 denote the true parameter value of β. The next two lemmas show that, under certain conditions, T(π,w) is unique and continuous.

Lemma A.1

Suppose that Θ is a compact subset of Rp and F is a continuous CDF on R. Then T(π,w) exists for all (π,w)∈GK. Further, suppose that F is continuous and strictly increasing on R and that πj=F(xjTβ0), 1≤j≤K, where the xj's span Rp. Then T(π,w)=β0 uniquely for any w.

Proof.

Observe that

  1. F(xjTβ) is a continuous function of β. Therefore, {πjF(xjTβ)}^{1/2} and {(1−πj)(1−F(xjTβ))}^{1/2} are continuous functions of β.

  2. g(β)=∑_{j=1}^{K} wj[{πjF(xjTβ)}^{1/2}+{(1−πj)(1−F(xjTβ))}^{1/2}] is a continuous function of β.

  3. g(β) is bounded (when K is fixed), because |wj|≤1, 0≤πj≤1 and 0≤F(xjTβ)≤1.

From these facts and the compactness of Θ, the maximum of

∑_{j=1}^{K} wj[{πjF(xjTβ)}^{1/2}+{(1−πj)(1−F(xjTβ))}^{1/2}]

is attained at at least one point.

For the second part, we argue as follows. Using basic calculus, it is easy to show that the maximum of {πjF(xjTβ)}^{1/2}+{(1−πj)(1−F(xjTβ))}^{1/2} is attained when πj=F(xjTβ). That is, if πj=F(xjTβ0), 1≤j≤K, then T(π,w)=β0 is a solution for β. Since F is one-to-one by assumption, it then follows that xjTβ=xjTβ0 for j=1,…,K. We then have β=β0 since the xj's span Rp. Hence the result.
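The second part of the proof rests on the fact that a ↦ (πa)^{1/2}+((1−π)(1−a))^{1/2} is maximized at a = π; this is easy to check numerically (our illustration):

```python
import numpy as np

def affinity(pi, a):
    # Hellinger-type affinity between the Bernoulli(pi) and Bernoulli(a) laws
    return np.sqrt(pi * a) + np.sqrt((1.0 - pi) * (1.0 - a))

pi = 0.3
grid = np.linspace(1e-6, 1.0 - 1e-6, 200001)
a_star = float(grid[np.argmax(affinity(pi, grid))])  # numerical maximizer
```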

Lemma A.2

Suppose that Θ is a compact subset of Rp, F is continuous and strictly increasing on R, and (π,w)∈GK is such that T(π,w) is unique, with π=(π1,…,πK)T and 0<πj<1 for 1≤j≤K. Suppose that the xj's span Rp. Then T is continuous at (π,w).

Proof.

Let πn=(π1,n,π2,n,…,πK,n)T and wn=(w1,n,w2,n,…,wK,n)T. Assume (πn,wn)∈GK and (πn,wn)→(π,w) as n→∞. We will show that T(πn,wn)→T(π,w) as n→∞, so that T is continuous at (π,w). Denote β=T(π,w) and βn=T(πn,wn). Define the functions

gj,n(β)={πj,nF(xjTβ)}^{1/2}+{(1−πj,n)(1−F(xjTβ))}^{1/2}, gn(β)=∑_{j=1}^{K} wj,n gj,n(β), gj(β)={πjF(xjTβ)}^{1/2}+{(1−πj)(1−F(xjTβ))}^{1/2},

and

g(β)=∑_{j=1}^{K} wj gj(β).

It is enough to show that

sup{|gn(β)−g(β)| : β∈Θ}→0. (A1)

Then from (A1) we have |gn(βn)−g(βn)|→0. We also obtain g(βn)→g(β) as n→∞ using the continuity of g. It then follows that gn(βn)→g(β) as n→∞. Then, by the uniqueness of T(π,w) and the compactness of Θ, it follows that T(πn,wn)→T(π,w) as n→∞.

We now verify (A1). First note that

|gn(β)−g(β)|≤∑_{j=1}^{K} gj(β)|wj,n−wj|+∑_{j=1}^{K} wj,n|gj,n(β)−gj(β)|. (A2)

Denote Fj=F(xjTβ) and define

Δj,n(β)=gj,n(β)−gj(β)=[{πj,nFj}^{1/2}+{(1−πj,n)(1−Fj)}^{1/2}]−[{πjFj}^{1/2}+{(1−πj)(1−Fj)}^{1/2}]=Fj^{1/2}[πj,n^{1/2}−πj^{1/2}]+(1−Fj)^{1/2}[(1−πj,n)^{1/2}−(1−πj)^{1/2}] (A3)

for 1≤j≤K. Then, using the algebraic equality b^{1/2}−a^{1/2}=(1/2)(b−a)a^{−1/2}−(1/2)(b^{1/2}−a^{1/2})²a^{−1/2} for b≥0, a>0, we obtain

Δj,n(β)=(1/2){Fj/πj}^{1/2}[(πj,n−πj)−(πj,n^{1/2}−πj^{1/2})²]−(1/2){(1−Fj)/(1−πj)}^{1/2}[(πj,n−πj)+((1−πj,n)^{1/2}−(1−πj)^{1/2})²].

Now since (a^{1/2}−b^{1/2})²≤a^{−1}(b−a)² for b≥0, a>0, we have

|Δj,n(β)|≤(1/2)|πj,n−πj|[{Fj/πj}^{1/2}+{(1−Fj)/(1−πj)}^{1/2}]+(1/2)|πj,n−πj|²[{Fj/πj^{3}}^{1/2}+{(1−Fj)/(1−πj)^{3}}^{1/2}].

But {Fj/πj}^{1/2}≤πj^{−1}[{Fjπj}^{1/2}+{(1−Fj)(1−πj)}^{1/2}], and a similar inequality holds for {(1−Fj)/(1−πj)}^{1/2}. It then follows that

|Δj,n(β)|≤(1/2)[{Fjπj}^{1/2}+{(1−Fj)(1−πj)}^{1/2}][|πj,n−πj|{πj^{−1}+(1−πj)^{−1}}+|πj,n−πj|²{πj^{−2}+(1−πj)^{−2}}]. (A4)

Now combining (A2)–(A4) and using πj,n→πj, it follows that sup{|gn(β)−g(β)| : β∈Θ}→0. This completes the proof of (A1).
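The algebraic identity b^{1/2}−a^{1/2}=(1/2)(b−a)a^{−1/2}−(1/2)(b^{1/2}−a^{1/2})²a^{−1/2} (b≥0, a>0) invoked in this proof can be verified numerically (our check):

```python
import math

def identity_gap(a, b):
    """|LHS - RHS| of the identity used in the proof of Lemma A.2."""
    lhs = math.sqrt(b) - math.sqrt(a)
    rhs = (0.5 * (b - a) / math.sqrt(a)
           - 0.5 * (math.sqrt(b) - math.sqrt(a)) ** 2 / math.sqrt(a))
    return abs(lhs - rhs)

# the gap should be at floating-point roundoff level for all valid (a, b)
gap = max(identity_gap(a, b) for a in (0.1, 0.5, 1.0, 2.0)
          for b in (0.0, 0.3, 1.0, 4.0))
```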

Proof of Theorem 3.1 —

The proof follows from the continuity of T(·,·) and the fact that (πN,wN)→P(π,w) as N→∞.

Proof of Theorem 3.2 —

Let T(πn,wn)=βn and T(π,w)=β. Then βn satisfies the following equation:

0=(∂/∂β)∑_{j=1}^{K} wj,n[{πj,nF(xjTβ)}^{1/2}+{(1−πj,n)(1−F(xjTβ))}^{1/2}]=(1/2)∑_{j=1}^{K} wj,n xj f(xjTβ)[{πj,n/F(xjTβ)}^{1/2}−{(1−πj,n)/(1−F(xjTβ))}^{1/2}].

Let

Gj,n(y)=(∂/∂y)[{πj,nF(y)}^{1/2}+{(1−πj,n)(1−F(y))}^{1/2}]=(1/2)f(y)[{πj,n/F(y)}^{1/2}−{(1−πj,n)/(1−F(y))}^{1/2}].

The first derivative of Gj,n(y) is

Gj,n^{(1)}(y)=(1/2)f^{(1)}(y)[{πj,n/F(y)}^{1/2}−{(1−πj,n)/(1−F(y))}^{1/2}]−(1/4)f(y)²[{πj,n/F(y)^{3}}^{1/2}+{(1−πj,n)/(1−F(y))^{3}}^{1/2}], (A5)

and λ(πn,wn,βn)=∑_{j=1}^{K} wj,n xj Gj,n(xjTβn). Using a Taylor expansion, we have

Gj,n(xjTβn)=Gj,n(xjTβ)+Gj,n^{(1)}(xjTβ)xjT(βn−β)+(1/2)Gj,n^{(2)}(xjTβn∗)[xjT(βn−β)]²,

where βn∗ is a value between βn and β. Note that Gj,n^{(2)} is bounded because F has bounded derivatives and F is bounded away from zero and one under the assumptions of the lemma. Further, Gj,n^{(1)}(y)→Gj^{(1)}(y) uniformly in y since πj,n→πj, where Gj^{(1)}(y) is the expression in (A5) with πj,n replaced by πj. Now, substituting this expansion into the equation ∑_{j=1}^{K} wj,n xj Gj,n(xjTβn)=0, we obtain

0=∑_{j=1}^{K} wj,n xj[Gj,n(xjTβ)+Gj,n^{(1)}(xjTβ)xjT(βn−β)+(1/2)(βn−β)TxjxjTGj,n^{(2)}(xjTβn∗)(βn−β)]=∑_{j=1}^{K} wj,n xjGj,n(xjTβ)+∑_{j=1}^{K} wj,n xjxjT[Gj,n^{(1)}(xjTβ)+(1/2)Gj,n^{(2)}(xjTβn∗)xjT(βn−β)](βn−β).

Since wj,n→wj as n→∞, it follows that

∑_{j=1}^{K} wj,n xjxjTGj^{(1)}(xjTβ)→∑_{j=1}^{K} wj xjxjTGj^{(1)}(xjTβ)=Σ(β) (A6)

and the elements of the matrix Σn=(1/2)∑_{j=1}^{K} wj xjxjTGj,n^{(2)}(xjTβn∗)xjT(βn−β) go to zero as n→∞. Denote λ(πn,wn,β)=∑_{j=1}^{K} wj,n xjGj,n(xjTβ). Then we have λ(πn,wn,β)+(Σ(β)+Σn)(βn−β)=0. Since Σ(β) is nonsingular, the matrix Σ(β)+Σn is nonsingular for large n. Then (14) follows, and (15) follows from (14) and (A6). This completes the proof of Theorem 3.2.

Proof of Theorem 3.3 —

Assume that 0<F(y)<1 and πj,N→πj as N→∞ with 0<πj<1, j=1,…,K. Then, applying the algebraic equality b^{1/2}−a^{1/2}=(1/2)(b−a)a^{−1/2}−(1/2)(b^{1/2}−a^{1/2})²a^{−1/2} (b≥0, a>0) to πj,N/F(y) and (1−πj,N)/(1−F(y)) separately, we have

{πj,N/F(y)}^{1/2}−{(1−πj,N)/(1−F(y))}^{1/2}={πj/F(y)}^{1/2}−{(1−πj)/(1−F(y))}^{1/2}+(1/2)(πj,N−πj)[{πjF(y)}^{−1/2}+{(1−πj)(1−F(y))}^{−1/2}]+o(|πj,N−πj|). (A7)

Using (A7), as N→∞, we obtain

Gj,N(y)−Gj(y)=(∂/∂y)[{πj,NF(y)}^{1/2}+{(1−πj,N)(1−F(y))}^{1/2}]−(∂/∂y)[{πjF(y)}^{1/2}+{(1−πj)(1−F(y))}^{1/2}]=(f(y)/2)[{πj,N/F(y)}^{1/2}−{(1−πj,N)/(1−F(y))}^{1/2}−{πj/F(y)}^{1/2}+{(1−πj)/(1−F(y))}^{1/2}]=(f(y)/4)(πj,N−πj)[{πjF(y)}^{−1/2}+{(1−πj)(1−F(y))}^{−1/2}]+o(|πj,N−πj|). (A8)

Define Gˆj(y)=(∂/∂y)[{pˆjF(y)}^{1/2}+{(1−pˆj)(1−F(y))}^{1/2}]. Note that pˆN→P π as N→∞. Then, since T(π,w)=β0 and λ(π,w,β0)=0, we have from (14) and (A8) that

βˆMHD−β0=T(pˆN,wN)−T(π,w)=−λ(pˆN,wN,β0)[Σ^{−1}(β0)+oP(1)]=−∑_{j=1}^{K} wj,N xj{Gˆj(xjTβ0)−Gj(xjTβ0)}[Σ^{−1}(β0)+oP(1)]=−(1/4)Σ^{−1}(β0)∑_{j=1}^{K} wj,N xj f(xjTβ0)(pˆj−πj)×[{πjF(xjTβ0)}^{−1/2}+{(1−πj)(1−F(xjTβ0))}^{−1/2}](1+oP(1)). (A9)

The result (16) now follows from (A9) and the fact that {Nwj,N}^{1/2}(pˆj−πj)→D N(0,πj(1−πj)) as N→∞, for 1≤j≤K. This completes the proof of Theorem 3.3.

Proof of Theorem 3.4 —

Let

DN(β)=∑_{j=1}^{K} wj,N[{pˆjF(xjTβ)}^{1/2}+{(1−pˆj)(1−F(xjTβ))}^{1/2}]−∑_{k=1}^{p} pλN^{(1)}(|βk^{(0)}|)|βk|.

By arguments similar to those used in the proofs of Lemmas A.1 and A.2 of Tang and Karunamuni [46], there exists an N^{1/2}-consistent local maximizer βˇ=(βˇ1,0T)T of (11). By a Taylor expansion, with probability tending to 1, we have

DN((βˆPMHD1,βˆPMHD2))=DN((βˇ1,0))+(βˆPMHD−βˇ)T∑_{j=1}^{K} wj,N xjGj,N(xjTβˇ)+(1/2)(βˆPMHD−βˇ)T∑_{j=1}^{K} wj,N xjxjTGj,N^{(1)}(xjTβ∗)(βˆPMHD−βˇ)−∑_{k=d+1}^{p} pλN^{(1)}(|βk^{(0)}|)|βˆPMHD,k|, (A10)

where β∗ is between βˆPMHD and βˇ. Similar to the proof of Theorem 2.2 of Tang and Karunamuni [45], it holds that βˆPMHD→P β0. Then by (A6), we obtain

∑_{j=1}^{K} wj,N xjxjTGj,N^{(1)}(xjTβ∗)=Σ(β0)[1+op(1)]. (A11)

Since ∑_{j=1}^{K} wj,N xj1Gj,N(xjTβˇ)=0, we have

(βˆPMHD−βˇ)T∑_{j=1}^{K} wj,N xjGj,N(xjTβˇ)=βˆPMHD2T∑_{j=1}^{K} wj,N xj2Gj,N(xjTβˇ). (A12)

Using a Taylor expansion, we obtain

∑_{j=1}^{K} wj,N xj2Gj,N(xjTβˇ)=∑_{j=1}^{K} wj,N xj2Gj,N(xjTβ0)+Σ21(β0)(βˇ1−β01)[1+op(1)], (A13)

where Σ21(β0)=∑_{j=1}^{K} wj xj2xj1TGj^{(1)}(xjTβ0). By (A9) and (16), it follows that

∑_{j=1}^{K} wj,N xj2Gj,N(xjTβ0)=Op(N^{−1/2}).

If βˆPMHD≠βˇ, then by (A10)–(A13) and the fact that βˇ1−β01=Op(N^{−1/2}), we have DN((βˆPMHD1,βˆPMHD2))<DN((βˇ1,0)). This contradicts the fact that βˆPMHD is a maximizer of (11). So βˆPMHD2=0 and βˆPMHD1=βˇ1.

We now prove the asymptotic normality part. Consider DN((β1,0)) as a function of β1. With probability tending to 1, βˆPMHD1 is the N^{1/2}-consistent maximizer of DN((β1,0)) and satisfies

(∂/∂β1)DN((β1,0))|β1=βˆPMHD1=∑_{j=1}^{K} wj,N xj1Gj,N(xjTβˆPMHD)=0.

Using a Taylor expansion, we obtain

∑_{j=1}^{K} wj,N xj1Gj,N(xjTβˆPMHD)=∑_{j=1}^{K} wj,N xj1Gj,N(xjTβ0)+Σ1(β0)(βˆPMHD1−β01)[1+op(1)].

Hence, it follows that

Σ1(β0)(βˆPMHD1−β01)[1+op(1)]=−∑_{j=1}^{K} wj,N xj1Gj,N(xjTβ0). (A14)

By (A9), we have

N^{1/2}∑_{j=1}^{K} wj,N xj1Gj,N(xjTβ0)→D N(0,(1/16)Σ1(β0)).

Now (18) follows from the preceding expression and (A14). This completes the proof of Theorem 3.4. Furthermore, (19) follows easily from (18).

Funding Statement

Q. Tang's research was supported by the National Social Science Foundation of China [grant number 16BTJ019] and the Natural Science Foundation of Jiangsu Province of China [grant number BK20151481]. R.J. Karunamuni's research was supported by a grant from the Natural Sciences and Engineering Research Council of Canada.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • 1.Barnett V., Unusual outliers, in Data Analysis and Statistical Inference, Festschrift in Honour of Prof. Dr. Friedhelm Eicker, S. Schach and G. Trenkler, eds., Joseph Eul Verlag, Köln, 1992, pp. 93–113.
  • 2.Basu A., Shioya H., and Park C., Statistical Inference: The Minimum Distance Approach, CRC Press, Florida, 2011. [Google Scholar]
  • 3.Beran R., Minimum Hellinger distance estimators for parametric models, Ann. Stat. 5 (1977), pp. 445–463. doi: 10.1214/aos/1176343842 [DOI] [Google Scholar]
  • 4.Beran R., An efficient and robust adaptive estimator of location, Ann. Stat. 6 (1978), pp. 292–313. doi: 10.1214/aos/1176344125 [DOI] [Google Scholar]
  • 5.Bhattacharya R. and Kong M., Consistency and asymptotic normality of the estimated effective doses in bioassay, J. Stat. Plan. Inference 137 (2007), pp. 643–658. doi: 10.1016/j.jspi.2006.06.027 [DOI] [Google Scholar]
  • 6.Bianco A.M. and Yohai V.J., Robust estimation in the logistic regression model, in Robust Statistics, Data Analysis, and Computer Intensive Methods (Schloss Thurnau, 1994), volume 109 of Lecture Notes in Statistics, Springer, New York, 1996, pp. 17–34.
  • 7.Cantoni E. and Ronchetti E., Robust inference for generalized linear models, J. Am. Stat. Assoc. 96 (2001), pp. 1022–1030. doi: 10.1198/016214501753209004 [DOI] [Google Scholar]
  • 8.Carroll R.J. and Pederson S., On robustness in the logistic regression model, J. R. Stat. Soc. Ser. B 55 (1993), pp. 693–706. [Google Scholar]
  • 9.Copas J.B., Binary regression models for contaminated data (with discussion), J. R. Stat. Soc. Ser. B 50 (1988), pp. 225–265. [Google Scholar]
  • 10.Croux C. and Haesbroeck G., Implementing the Bianco and Yohai estimator for logistic regression, Comput. Stat. Data Anal. 44 (2003), pp. 273–295. doi: 10.1016/S0167-9473(03)00042-2 [DOI] [Google Scholar]
  • 11.Donoho D.L. and Liu R.C., The “automatic” robustness of minimum distance functionals, Ann. Stat. 16 (1988), pp. 552–586. doi: 10.1214/aos/1176350820 [DOI] [Google Scholar]
  • 12.Fahrmeir L. and Kaufmann H., Consistency and asymptotic normality of the maximum likelihood estimator in generalized linear models, Ann. Stat. 13 (1985), pp. 342–368. doi: 10.1214/aos/1176346597 [DOI] [Google Scholar]
  • 13.Fahrmeir L. and Tutz G., Multivariate Statistical Modelling Based on Generalized Linear Models, 2nd ed., Springer, New York, 2001. [Google Scholar]
  • 14.Fan J., Fan Y., and Barut E., Adaptive robust variable selection, Ann. Stat. 42 (2014), pp. 324–351. doi: 10.1214/13-AOS1191 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Fan J. and Li R., Variable selection via nonconcave penalized likelihood and its oracle properties, J. Am. Stat. Assoc. 96 (2001), pp. 1348–1360. doi: 10.1198/016214501753382273 [DOI] [Google Scholar]
  • 16.Fan J. and Lv J., Non-concave penalized likelihood with NP-dimensionality, IEEE Trans. Inf. Theory 57 (2011), pp. 5467–5484. doi: 10.1109/TIT.2011.2158486 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Gervini D., Robust adaptive estimators for binary regression models, J. Stat. Plan. Inference 131 (2005), pp. 297–311. doi: 10.1016/j.jspi.2004.02.006 [DOI] [Google Scholar]
  • 18.Haberman S.J., Maximum likelihood estimates in exponential response models, Ann. Stat. 5 (1977), pp. 815–841. doi: 10.1214/aos/1176343941 [DOI] [Google Scholar]
  • 19.Hampel F.R., The influence curve and its role in robust estimation, J. Am. Stat. Assoc. 69 (1974), pp. 383–393. doi: 10.1080/01621459.1974.10482962 [DOI] [Google Scholar]
  • 20.He X. and Simpson D., Lower bounds for contamination bias: Globally minimax versus locally linear estimation, Ann. Stat. 21 (1993), pp. 314–337. doi: 10.1214/aos/1176349028 [DOI] [Google Scholar]
  • 21.Hosseinian S. and Morgenthaler S., Robust binary regression, J. Stat. Plan. Inference 141 (2011), pp. 1497–1509. doi: 10.1016/j.jspi.2010.11.015 [DOI] [Google Scholar]
  • 22.Huber P.J., Robust estimation of a location parameter, Ann. Math. Stat. 35 (1964), pp. 73–101. doi: 10.1214/aoms/1177703732 [DOI] [Google Scholar]
  • 23.Huber P.J., Robust Statistics, Wiley, New York, 1981. [Google Scholar]
  • 24.Künsch H.R., Stefanski L.A., and Carroll R.J., Conditionally unbiased bounded-influence estimation in general regression models, with applications to generalized linear models, J. Am. Stat. Assoc. 84 (1989), pp. 460–466. [Google Scholar]
  • 25.Karunamuni R.J., Kong L., and Wei T., Efficient robust doubly adaptive regularized regression with applications, Stat. Methods Med. Res. 28 (2019), pp. 2210–2226. doi: 10.1177/0962280218757560 [DOI] [PubMed] [Google Scholar]
  • 26.Karunamuni R.J., Tang Q., and Zhao B., Robust and efficient estimation of effective dose, Comput. Stat. Data Anal. 90 (2015), pp. 47–60. doi: 10.1016/j.csda.2015.04.001 [DOI] [Google Scholar]
  • 27.Le Cam L., Asymptotic Methods in Statistical Decision Theory, Springer, New York, 1986. [Google Scholar]
  • 28.Li P. and Wiens D.P., Robustness of design in dose-response studies, J. R. Stat. Soc. Ser. B 73 (2011), pp. 215–238. doi: 10.1111/j.1467-9868.2010.00763.x [DOI] [Google Scholar]
  • 29.Lindsay B.G., Efficiency versus robustness: The case for minimum Hellinger distance and related methods, Ann. Stat. 22 (1994), pp. 1081–1114. doi: 10.1214/aos/1176325512 [DOI] [Google Scholar]
  • 30.Lindsay B.G., Statistical distances as loss functions in assessing model adequacy, in The Nature of Scientific Evidence: Statistical, Philosophical, and Empirical Considerations, M.L. Taper and S.R. Lele, eds., The University of Chicago Press, Chicago, 2004, pp. 439–464.
  • 31.Little R.J.A., Generalized linear models for cross-classified data from the WFS. World Fertility Survey Technical Bulletins. Number 5, 1978.
  • 32.Lv J. and Fan J., A unified approach to model selection and sparse recovery using regularized least squares, Ann. Stat. 37 (2009), pp. 3498–3528. doi: 10.1214/09-AOS683 [DOI] [Google Scholar]
  • 33.Müller C.H. and Neykov N., Breakdown points of trimmed likelihood estimators and related estimators in generalized linear models, J. Stat. Plan. Inference 116 (2003), pp. 503–519. doi: 10.1016/S0378-3758(02)00265-3 [DOI] [Google Scholar]
  • 34.Markatou M., Basu A., and Lindsay B., Weighted likelihood estimating equations: The discrete case with applications to logistic regression, J. Stat. Plan. Inference 57 (1997), pp. 215–232. doi: 10.1016/S0378-3758(96)00045-6 [DOI] [Google Scholar]
  • 35.McCullagh P. and Nelder J.A., Generalized Linear Models, 2nd ed., Chapman & Hall, London, 1989. [Google Scholar]
  • 36.Morgenthaler S., Least-absolute-deviations fits for generalized linear models, Biometrika 79 (1992), pp. 747–754. doi: 10.1093/biomet/79.4.747 [DOI] [Google Scholar]
  • 37.Pregibon D., Logistic regression diagnostics, Ann. Stat. 9 (1981), pp. 705–724. doi: 10.1214/aos/1176345513 [DOI] [Google Scholar]
  • 38.Pregibon D., Resistant fits for some commonly used logistic models with medical applications, Biometrics 38 (1982), pp. 485–498. doi: 10.2307/2530463 [DOI] [PubMed] [Google Scholar]
  • 39.Simpson D.G., Minimum Hellinger distance estimation for the analysis of count data, J. Am. Stat. Assoc. 82 (1987), pp. 802–807. doi: 10.1080/01621459.1987.10478501 [DOI] [Google Scholar]
  • 40.Smucler E. and Yohai V.J., Robust and sparse estimators for linear regression models, Comput. Stat. Data Anal. 111 (2017), pp. 116–130. doi: 10.1016/j.csda.2017.02.002 [DOI] [Google Scholar]
  • 41.Stather C., Robust statistical inference using Hellinger distance methods, Ph.D. diss., LaTrobe University, Australia, 1981.
  • 42.Stefanski L.A., Carroll R.J., and Ruppert D., Optimally bounded score functions for generalized linear models with applications to logistic regression, Biometrika 73 (1986), pp. 413–424. [Google Scholar]
  • 43.Stephenson B.J.K., Herring A.H., and Olshan A., Robust clustering with subpopulation-specific deviations, J. Am. Stat. Assoc. (2020), to appear. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Tamura R. and Boos D.D., Minimum Hellinger distance estimation for multivariate location and covariance, J. Am. Stat. Assoc. 81 (1986), pp. 223–229. doi: 10.1080/01621459.1986.10478264 [DOI] [Google Scholar]
  • 45.Tang Q. and Karunamuni R.J., Minimum distance estimation in a finite mixture regression model, J. Multivar. Anal. 120 (2013), pp. 185–204. doi: 10.1016/j.jmva.2013.05.008 [DOI] [Google Scholar]
  • 46.Tang Q. and Karunamuni R.J., Robust variable selection for finite mixture regression models, Ann. Inst. Stat. Math. 70 (2018), pp. 489–521. doi: 10.1007/s10463-017-0602-4 [DOI] [Google Scholar]
  • 47.Tutz G., Regression for Categorical Data, Cambridge University Press, Cambridge, UK, 2012. [Google Scholar]
  • 48.Victoria-Feser M. and Ronchetti E., Robust estimation for grouped data, J. Am. Stat. Assoc. 92 (1997), pp. 333–340. doi: 10.1080/01621459.1997.10473631 [DOI] [Google Scholar]
  • 49.Wang X., Jiang Y., Huang M., and Zhang H., Robust variable selection with exponential squared loss, J. Am. Stat. Assoc. 108 (2013), pp. 632–643. doi: 10.1080/01621459.2013.766613 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Wu J. and Karunamuni R.J., Efficient Hellinger distance estimates for semiparametric models, J. Multivar. Anal. 107 (2012), pp. 1–23. doi: 10.1016/j.jmva.2012.01.007 [DOI] [Google Scholar]
  • 51.Wu J. and Karunamuni R.J., Profile Hellinger distance estimation, Statistics 49 (2015), pp. 711–740. doi: 10.1080/02331888.2014.946928 [DOI] [Google Scholar]
  • 52.Wu J. and Karunamuni R.J., Efficient and robust tests for semiparametric models, Ann. Inst. Stat. Math. 70 (2018), pp. 761–788. doi: 10.1007/s10463-017-0608-y [DOI] [Google Scholar]
  • 53.Zhang C.-H., Nearly unbiased variable selection under mini-max concave penalty, Ann. Stat. 38 (2010), pp. 894–942. doi: 10.1214/09-AOS729 [DOI] [Google Scholar]
