Journal of Applied Statistics. 2020 Sep 18;49(3):574–598. doi: 10.1080/02664763.2020.1822304

Regularized robust estimation in binary regression models

Qingguo Tang, Rohana J. Karunamuni, Boxiao Liu
PMCID: PMC9041772  PMID: 35706765

Abstract

In this paper, we investigate robust parameter estimation and variable selection for binary regression models with grouped data. We study estimation procedures based on the minimum-distance approach. In particular, we employ the minimum Hellinger and minimum symmetric chi-squared distance criteria and propose regularized minimum-distance estimators. These estimators appear to possess a certain degree of automatic robustness against model misspecification and/or potential outliers. We show that the proposed non-penalized and penalized minimum-distance estimators are efficient under the model and simultaneously have excellent robustness properties. We study their asymptotic properties, such as consistency, asymptotic normality, and the oracle properties. Using Monte Carlo studies, we examine the small-sample and robustness properties of the proposed estimators and compare them with traditional likelihood estimators. We also present two real-data applications to illustrate our methods. The numerical studies indicate satisfactory finite-sample performance of our procedures.

Keywords: Binary regression, maximum likelihood, minimum-distance methods, variable selection, efficiency, robustness

2010 Mathematics Subject Classification: 62F35

1. Introduction

Data involving explanatory (covariate) variables and binary responses are common in many disciplines, including health, medicine, environmental science, agriculture, behavioral science, social science, and education. If the response is one of two possible outcomes and covariates are observed for each experimental subject, binary regression models are commonly used [35]. In a typical binary regression model, we have a random sample of response variables Y_j ∈ {0, 1} and covariates x_j ∈ R^p, j = 1,…,K. The probability of a positive response is modeled as a function of a linear combination of the covariates: P(Y_j = 1 | x_j) = p_j, where p_j = F(x_j^T β), with F denoting a known cumulative distribution function (CDF), commonly known as the link function, and β the unknown regression parameter that needs to be estimated. When F is the logistic distribution we have a logit model, and when F is the standard normal distribution we have a probit model. Chapter 4 of McCullagh and Nelder [35] provides an excellent account of methods of estimation and data analysis procedures based on binary regression models.

In many experiments, the units under study can be classified into K groups in such a way that the individuals in a group have identical values for all the covariates. Thus, for the jth combination of experimental conditions, characterized by the p-dimensional vector x_j, observations are available for n_j individuals. Then, of the N = ∑_{j=1}^K n_j individuals under study, n_j share the covariate vector x_j, j = 1,…,K. These groups are known as covariate classes [35]. Working with grouped data has the additional advantage that, depending on the size of the groups, it becomes possible to test the goodness of fit of the model.

A classic example where grouped data arise naturally is the 'effective dose level' estimation of dose–response studies; see, e.g. Bhattacharya and Kong [5], Li and Wiens [28] and Karunamuni et al. [26]. Specifically, in pharmacology or toxicology studies, experimenters are often interested in estimating the effective dose level ED_p, where 0 < p < 1. The ED_p is the dose at which 100p% of the subjects show a response. Generally, K groups of test subjects characterized by different dose levels x_j (j = 1,…,K) are collected, where the subjects are sampled independently. The number of subjects in group j is n_j (1 ≤ j ≤ K), and the number of subjects showing a response at dose level x_j is m_j. In the dose–response context, it is generally assumed that x_1 < x_2 < ⋯ < x_K; that is, the outcome of interest is usually measured at several increasing dose levels. For every subject, a binary response is recorded: '1' indicates a response, and '0' indicates no response. The model then reduces to P(Y_j = 1 | x_j) = F(x_j^T β) with β = (β_0, β_1)^T and x_j = (1, x_j)^T for parameters β_0 and β_1 > 0. Many such examples can be found in the literature; see, e.g. McCullagh and Nelder [35] and Tutz [47].

For statistical inference in binary regression models, the maximum likelihood approach is by far the most widely used method. Specifically, let n_j denote the number of observations in group j, and let Y_j denote the number of units with the attribute of interest in group j, j = 1,…,K. Then the conditional distribution of Y_j given x_j is binomial with parameters n_j and p_j = F(x_j^T β), j = 1,…,K. Hence the log-likelihood function of the data {(Y_1, x_1), (Y_2, x_2),…,(Y_K, x_K)} is given by

l_N(β) = ∑_{j=1}^K { Y_j ln(F(x_j^T β)) + (n_j − Y_j) ln(1 − F(x_j^T β)) },  (1)

and the maximum likelihood estimator (MLE) β̂_MLE of the parameter β is obtained by maximizing l_N(β). For simultaneous estimation and variable selection, an estimator is generally constructed by maximizing the penalized log-likelihood function l_N(β) − P_N(β), where P_N(β) is a penalty function on β. This idea defines a penalized (regularized) estimator of β as

β̂_PMLE = argmax_β { l_N(β) − P_N(β) }.  (2)

Under some assumptions, β̂_MLE exists and is asymptotically unique; it is consistent and asymptotically normal as N → ∞ and all n_j → ∞, j = 1,…,K. Moreover, it is asymptotically efficient within a wide class of estimators [12,18]. However, both β̂_MLE and β̂_PMLE are sensitive to atypical data and model misspecification. In particular, observations with extreme covariates have a large influence on these estimators, and if they are accompanied by misclassified responses or a misspecified link function, the resulting estimates can be seriously biased [23].
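As a concrete illustration, the grouped log-likelihood (1) can be maximized numerically. The sketch below uses a logit link and simulated grouped data; the design matrix, group sizes, and true parameter value are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic CDF, F(t) = 1/(1 + exp(-t))

def neg_log_lik(beta, X, y, n):
    """Negative of the grouped binomial log-likelihood l_N(beta) in (1).

    X: (K, p) matrix of covariate vectors x_j; y: successes Y_j per group;
    n: group sizes n_j."""
    F = np.clip(expit(X @ beta), 1e-10, 1 - 1e-10)  # guard the logarithms
    return -np.sum(y * np.log(F) + (n - y) * np.log1p(-F))

# Illustrative design: K = 10 dose-like groups with 50 subjects each.
rng = np.random.default_rng(0)
K = 10
beta_true = np.array([-1.5, 0.4])          # hypothetical true parameter
X = np.column_stack([np.ones(K), np.arange(1.0, K + 1)])
n = np.full(K, 50)
y = rng.binomial(n, expit(X @ beta_true))  # grouped responses Y_j

fit = minimize(neg_log_lik, x0=np.zeros(2), args=(X, y, n), method="BFGS")
beta_mle = fit.x  # the MLE maximizing l_N(beta)
```

With a moderate total sample size (N = 500 here), the maximizer recovers the generating parameter closely.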

For grouped data, the lack of robustness of βˆMLE has been examined by several authors; see, e.g. Barnett [1], Victoria-Feser and Ronchetti [48] and Hosseinian and Morgenthaler [21]. For non-grouped data, the non-robustness of βˆMLE has been extensively discussed; see, e.g. Pregibon [37,38], Stefanski et al. [42], Copas [9], Künsch et al. [24], Morgenthaler [36], Carroll and Pederson [8], Bianco and Yohai [6], Markatou et al. [34], Cantoni and Ronchetti [7], Croux and Haesbroeck [10], Müller and Neykov [33], Gervini [17], and Hosseinian and Morgenthale [21], among others.

A robust methodology is vital in data analysis because outliers and model misspecifications are common in practical applications. Moreover, efficient methods are essential in practice. These considerations have motivated our research. We propose regularized estimation of β using the minimum-distance approach. Minimum-distance estimators possess a certain degree of automatic robustness to model misspecification [11]. Furthermore, certain minimum-distance estimators achieve efficiency under the model. In particular, minimum Hellinger distance (MHD) estimators for parametric models attain efficiency under the model and have excellent robustness properties in the presence of outliers and/or model misspecification [3,4,39]. Moreover, Lindsay [30] has shown that the maximum likelihood and MHD estimators are members of a larger class of efficient estimators with various second-order efficiency properties. For discrete data, Simpson [39] has shown that the breakdown point of MHD estimators is 1/2; that is, they achieve maximum robustness in the presence of outliers (see also [20]). Another distance measure that is intimately related to the Hellinger distance is the symmetric chi-squared distance introduced by Lindsay [29]. Lindsay [29,30] studied non-regularized estimators using several distance measures and showed that the minimum symmetric chi-squared distance (MSCD) criterion generates highly efficient and robust estimators.

In this paper, we investigate simultaneous robust estimation and variable selection for binary regression models with grouped data. For this purpose, we employ the squared Hellinger distance or the symmetric chi-squared distance as the measure of adequacy, combined with a penalty function. Specifically, the proposed penalized (regularized) estimator of β is constructed by minimizing D(P,Q)+PN(β), where D(P,Q) is a distance measure between two probability distributions P and Q based on an estimated model and the true model, respectively, and PN(β) is a penalty function. We use distance measures instead of the log-likelihood function (1) to develop robust estimators for β. We investigate asymptotic properties including consistency and asymptotic normality of the proposed estimators. In particular, we show that the proposed regularized estimators of β have desirable asymptotic properties, such as oracle properties [15]. Using Monte Carlo studies we examine their small-sample and robustness properties and compare them with the corresponding regularized MLE of β.

The remainder of this paper is organized as follows. Section 2 develops the proposed robust regularized regression estimators. Section 3 gives the asymptotic properties of the estimators. Section 4 presents the finite-sample performance of the estimators. In Section 5, we illustrate and compare the proposed estimators using two real-data applications. Finally, Section 6 contains a discussion of the results. The proofs of the main results are given in the Appendix.

2. Regularized minimum-distance estimators

To develop the methodology, we first consider two discrete probability distributions P = {p_i : i ∈ I} and Q = {q_i : i ∈ I}, where I is some discrete set, p_i, q_i > 0 for all i ∈ I, and ∑ p_i = ∑ q_i = 1. The squared Hellinger distance between P and Q is defined by H²(P, Q) = ∑_{i∈I} (√p_i − √q_i)² [3,39], and the symmetric chi-squared distance between P and Q is defined as C²(P, Q) = 2 ∑_{i∈I} (p_i − q_i)²/(p_i + q_i) [29,30]. Using the inequalities p_i + q_i ≤ (√p_i + √q_i)² ≤ 2(p_i + q_i), it follows that (1/4)C²(P, Q) ≤ H²(P, Q) ≤ (1/2)C²(P, Q) [27]. Therefore, there is a very strong near-equivalence relationship between the Hellinger distance and the symmetric chi-squared distance. That is, the two distances generate equivalent topologies on the space of distributions: there is a C²-ball inside every H²-ball and vice versa. The Hellinger distance H is not strongly affected by the presence of a few outliers, and these bounds show that this property carries over to some degree to the symmetric chi-squared distance C². On the other hand, both distances are closely linked to the total variation distance, defined by V(P, Q) = (1/2)∑_{i∈I} |p_i − q_i|, via the relationships V²(P, Q) ≤ C²(P, Q)/4 ≤ V(P, Q) and V²(P, Q) ≤ H²(P, Q) ≤ 2V(P, Q).
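These definitions and the sandwich bounds are easy to check numerically. The sketch below computes the three distances for a pair of arbitrary example distributions (the example vectors are our own, not from the paper); the asserted bounds hold for any pair of discrete distributions.

```python
import numpy as np

def hellinger_sq(p, q):
    """Squared Hellinger distance H^2(P, Q) = sum_i (sqrt(p_i) - sqrt(q_i))^2."""
    return np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

def sym_chisq(p, q):
    """Symmetric chi-squared distance C^2(P, Q) = 2 sum_i (p_i - q_i)^2 / (p_i + q_i)."""
    return 2.0 * np.sum((p - q) ** 2 / (p + q))

def total_variation(p, q):
    """Total variation distance V(P, Q) = (1/2) sum_i |p_i - q_i|."""
    return 0.5 * np.sum(np.abs(p - q))

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(6))  # two arbitrary probability vectors on 6 points
q = rng.dirichlet(np.ones(6))
H2, C2, V = hellinger_sq(p, q), sym_chisq(p, q), total_variation(p, q)

# The sandwich bounds quoted in the text:
assert C2 / 4 <= H2 <= C2 / 2
assert V ** 2 <= H2 <= 2 * V
assert V ** 2 <= C2 / 4 <= V
```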

In order to construct estimators of β for binary regression with grouped data, we compute Hellinger and symmetric chi-squared distances with the following two discrete probability distributions:

P_N = ( w_{1,N} p̂_1, …, w_{K,N} p̂_K, w_{1,N}(1 − p̂_1), …, w_{K,N}(1 − p̂_K) )^T  (3)
Q_N = ( w_{1,N} p_1, …, w_{K,N} p_K, w_{1,N}(1 − p_1), …, w_{K,N}(1 − p_K) )^T,  (4)

where p̂_j = Y_j/n_j, p_j = F(x_j^T β), w_{j,N} = n_j/N, n_j is the number of observations in group j, Y_j is the number of units with the attribute of interest in group j, j = 1,…,K, and N = ∑_{j=1}^K n_j. Note that p̂_j is a consistent estimator of p_j, 1 ≤ j ≤ K. Thus, the probability distributions P_N and Q_N are based on an estimated model and the true model, respectively. The squared Hellinger distance between P_N and Q_N is given by

H²(P_N, Q_N) = ∑_{j=1}^K w_{j,N} [ (√p̂_j − √p_j)² + (√(1 − p̂_j) − √(1 − p_j))² ] = ∑_{j=1}^K w_{j,N} [ (√p̂_j − √F(x_j^T β))² + (√(1 − p̂_j) − √(1 − F(x_j^T β)))² ],  (5)

and the symmetric chi-squared distance C2(PN,QN) is given by

C²(P_N, Q_N) = 2 ∑_{j=1}^K w_{j,N} { [p̂_j − F(x_j^T β)]² / [p̂_j + F(x_j^T β)] + [(1 − p̂_j) − (1 − F(x_j^T β))]² / [(1 − p̂_j) + (1 − F(x_j^T β))] }.  (6)

Then MHD and MSCD estimators of β can be obtained by minimizing H²(P_N, Q_N) and C²(P_N, Q_N), respectively, with respect to β. By simplifying (5) and (6), the MHD and MSCD estimators can equivalently be obtained as follows:

β̂_MHD = argmax_β ∑_{j=1}^K w_{j,N} [ √(p̂_j F(x_j^T β)) + √((1 − p̂_j)(1 − F(x_j^T β))) ]  (7)

and

β̂_MSCD = argmin_β ∑_{j=1}^K w_{j,N} [p̂_j − F(x_j^T β)]² / { [p̂_j + F(x_j^T β)] [2 − p̂_j − F(x_j^T β)] }.  (8)

The asymptotic properties of β̂_MHD and β̂_MSCD can be established following the techniques developed in Stather [41] and Karunamuni et al. [26]. The estimators β̂_MHD and β̂_MSCD have excellent robustness properties in the presence of outliers and/or model misspecification [3,4,20,29,30,39,44].
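For illustration, the criteria (7) and (8) can be optimized numerically with a general-purpose optimizer. The sketch below uses a logit link and simulated grouped data; the design, group sizes, true β, and optimizer choice are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def neg_affinity(beta, X, p_hat, w):
    """Negative of the objective in (7); minimizing it yields the MHDE."""
    F = np.clip(expit(X @ beta), 1e-10, 1 - 1e-10)
    return -np.sum(w * (np.sqrt(p_hat * F) + np.sqrt((1 - p_hat) * (1 - F))))

def mscd_objective(beta, X, p_hat, w):
    """Objective in (8): weighted symmetric chi-squared discrepancy."""
    F = np.clip(expit(X @ beta), 1e-10, 1 - 1e-10)
    return np.sum(w * (p_hat - F) ** 2 / ((p_hat + F) * (2 - p_hat - F)))

# Illustrative grouped data with a logit link (values are assumptions).
rng = np.random.default_rng(2)
K, n = 10, 200
beta_true = np.array([-1.5, 0.4])
X = np.column_stack([np.ones(K), np.arange(1.0, K + 1)])
y = rng.binomial(n, expit(X @ beta_true))
p_hat, w = y / n, np.full(K, 1.0 / K)  # equal group sizes give w_{j,N} = 1/K

beta_mhd = minimize(neg_affinity, np.zeros(2), args=(X, p_hat, w),
                    method="Nelder-Mead").x
beta_mscd = minimize(mscd_objective, np.zeros(2), args=(X, p_hat, w),
                     method="Nelder-Mead").x
```

With clean data, both estimators land close to each other and to the generating parameter, in line with their shared efficiency under the model.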

We now discuss the simultaneous estimation and variable selection problem. A penalty function generally facilitates variable selection in regression models. As discussed in the Introduction, in the present context simultaneous estimation and variable selection is generally carried out by maximizing the penalized log-likelihood function l_N(β) − P_N(β), where l_N(β) is the (conditional) log-likelihood function defined by (1) and P_N(β) is a penalty function on β. It can be shown that the resulting estimator β̂_PMLE (see (2)) has nice properties, including the oracle properties [15]; namely, it performs as well as if the true underlying model were given in advance. Such regularized methods have been widely used for simultaneous coefficient estimation and variable selection, identifying the covariates that are associated with the response variable. However, penalized likelihood procedures are not robust to outliers and model misspecification [15]. In other words, the estimator β̂_PMLE can be highly unstable if the model is not completely correct or if outliers are present.

We propose the following approach: we replace the log-likelihood function l_N(β) with a distance measure, such as H²(P, Q) or C²(P, Q), to develop a robust regularized estimator. In view of (5), a penalized MHD estimator of β is obtained by minimizing H²(P_N, Q_N) + P_N(β) with respect to β. Equivalently, in view of (7), a regularized MHD estimator of β can be obtained as

β̃_PMHD = argmax_β { ∑_{j=1}^K w_{j,N} [ √(p̂_j F(x_j^T β)) + √((1 − p̂_j)(1 − F(x_j^T β))) ] − P_N(β) }.  (9)

In view of (8), one can construct a regularized MSCD estimator of β as

β̃_PMSCD = argmin_β { ∑_{j=1}^K w_{j,N} [p̂_j − F(x_j^T β)]² / { [p̂_j + F(x_j^T β)] [2 − p̂_j − F(x_j^T β)] } + P_N(β) }.  (10)

In variable selection problems, it is assumed that some components of β are equal to zero. The goal is to identify and estimate the subset model. It has been argued that folded concave penalties are preferable to convex penalties such as the l1-penalty in terms of both model-estimation accuracy and variable selection consistency [16,32].

Let p_{λ_N}(|t|) denote a folded concave penalty function defined on t ∈ (−∞, +∞) satisfying:

  (a) p_{λ_N}(t) is increasing and concave in t ∈ [0, +∞);

  (b) p_{λ_N}(t) is differentiable in t ∈ (0, +∞) with p'_{λ_N}(0) := p'_{λ_N}(0+) ≥ a_1 λ_N, p'_{λ_N}(t) ≥ a_1 λ_N for t ∈ (0, a_2 λ_N], p'_{λ_N}(t) ≤ a_3 λ_N for t ∈ [0, +∞), and p'_{λ_N}(t) = 0 for t ∈ [a λ_N, +∞) for a prespecified constant a > a_2, where p'_{λ_N} denotes the first derivative of p_{λ_N}, and a_1, a_2, and a_3 are fixed positive constants.

The above family of general folded concave penalties contains several popular penalties, including the SCAD penalty [15] and the MCP penalty [53].
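For concreteness, the first derivative of the SCAD penalty (with the conventional choice a = 3.7 from [15]) can be used to form the adaptive weights p'_{λ_N}(|β_k^{(0)}|) appearing in the penalized criteria below. The following is a sketch of that standard formula; the initial estimate and tuning value are hypothetical.

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    """First derivative p'_lambda(t) of the SCAD penalty for t >= 0:
    lam for t <= lam, (a*lam - t)_+ / (a - 1) for t > lam (zero beyond a*lam)."""
    t = np.asarray(t, dtype=float)
    return lam * (t <= lam) + np.maximum(a * lam - t, 0.0) / (a - 1.0) * (t > lam)

# Adaptive weights from a hypothetical initial robust estimate beta^(0):
beta0 = np.array([1.1, 0.05, 0.8, 0.0])
weights = scad_deriv(np.abs(beta0), lam=0.2)
# Coefficients with large initial estimates are left unpenalized (weight 0),
# while coefficients near zero keep the full weight lam.
```

This is exactly the mechanism that makes the adaptive penalty asymptotically unbiased: the weight vanishes once |β_k^{(0)}| exceeds aλ_N.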

With a concave penalty function and in view of (9), the proposed regularized MHD estimator of β is defined by

β̂_PMHD = argmax_β { ∑_{j=1}^K w_{j,N} [ √(p̂_j F(x_j^T β)) + √((1 − p̂_j)(1 − F(x_j^T β))) ] − ∑_{k=1}^p p'_{λ_N}(|β_k^{(0)}|) |β_k| },  (11)

where β^{(0)} = (β_1^{(0)},…,β_p^{(0)})^T is an initial robust estimator of β. For example, β^{(0)} can be obtained from (7) or (8) above. We will show that β̂_PMHD has the oracle properties. Similarly, in view of (10), the proposed regularized MSCD estimator of β is defined by

β̂_PMSCD = argmin_β { ∑_{j=1}^K w_{j,N} [p̂_j − F(x_j^T β)]² / { [p̂_j + F(x_j^T β)] [2 − p̂_j − F(x_j^T β)] } + ∑_{k=1}^p p'_{λ_N}(|β_k^{(0)}|) |β_k| }.  (12)

As briefly mentioned in the Introduction, it has been argued in the literature that all minimum-distance estimators, including the MHD and MSCD estimators, are automatically robust with respect to the stability of the quantity being estimated [11]. In other words, they are only slightly affected by small departures from the true model. Furthermore, MHD estimators for count data have excellent robustness to outliers and model misspecification [39]. This can be attributed to the fact that both the Hellinger and symmetric chi-squared distances generate topologies that are equivalent to the topology of the total variation distance, which is known to produce highly robust estimators. For discrete data, MHD estimators attain the highest breakdown point, 1/2 [20,39]. The breakdown point of an estimator is the proportion of incorrect observations (i.e. arbitrary values) it can handle before giving an arbitrarily large result [19,22,23]. Further, Lindsay [29] has shown that the non-regularized MSCD estimators are highly efficient and robust in general. Thus, we can also expect the proposed regularized MHD and MSCD estimators, namely β̂_PMHD and β̂_PMSCD, to be highly efficient and robust.

Apart from Lindsay [29] and Karunamuni et al. [26], we are not aware of any significant work on the use of C2(P,Q) in statistical inference. However, H2(P,Q) has been widely implemented for both continuous and discrete distributions. The literature is too extensive for a complete listing here. Some recent developments and important references can be found in the articles Wu and Karunamuni [50–52] and Tang and Karunamuni [45,46], and in the monograph of Basu et al. [2].

3. Asymptotic properties of estimators

We first introduce some notation. Let I^K = [0,1] × [0,1] × ⋯ × [0,1] (K terms) denote the K-fold product of the interval [0,1], and let W = {w : w = (w_1,…,w_K) ∈ I^K, w_j > 0, ∑_{j=1}^K w_j = 1}. Define G_K = I^K × W. For 1 ≤ j ≤ K, let w_{j,N} = n_j/N and p̂_j = Y_j/n_j, with Y_j ∼ B(n_j, p_j) and N = ∑_{j=1}^K n_j as defined in Section 2. We assume that the Y_j's are independent, 1 ≤ j ≤ K. Let w_N and p̂_N denote the K-dimensional vectors with components w_{j,N} (1 ≤ j ≤ K) and p̂_j (1 ≤ j ≤ K), respectively. Then (p̂_N, w_N) ∈ G_K. Let Θ ⊂ R^p denote the parameter space of β.

The main result of this paper is Theorem 3.4 below, which establishes the oracle properties of the penalized MHD estimator β̂_PMHD defined by (11). We first present some asymptotic properties of the MHD estimator β̂_MHD defined by (7); these results are helpful for establishing the oracle properties of β̂_PMHD. Generally, it is convenient to formulate β̂_MHD as a functional value. We define a functional T : G_K → Θ such that T(π, w) is a value of β ∈ Θ defined by

argmax_β ∑_{j=1}^K w_j [ √(π_j F(x_j^T β)) + √((1 − π_j)(1 − F(x_j^T β))) ].  (13)

If T(π, w) is not uniquely defined, then we choose one of the possible values arbitrarily. In terms of the functional T, the MHD estimator β̂_MHD defined by (7) is equal to T(p̂_N, w_N). This formulation of the MHD estimator makes it easier to prove the asymptotic results (e.g. [3,26,41,45], among others).

The following theorem establishes the consistency of the MHD estimator β̂_MHD. The proofs of the theorems given below are relegated to the Appendix. Let β_0 denote the true value of β.

Theorem 3.1

Suppose that Θ is a compact subset of R^p and F is continuous and strictly increasing on R. Suppose further that π = (π_1,…,π_K)^T with π_j = F(x_j^T β_0) for 1 ≤ j ≤ K, and that the x_j's span R^p. Assume that w_{j,N} → w_j > 0 as N → ∞ for 1 ≤ j ≤ K. Then the MHD estimator β̂_MHD is a consistent estimator of β_0; i.e. β̂_MHD →_P β_0 as N → ∞, where →_P stands for convergence in probability.

The next theorem lays the necessary foundation for a result on the asymptotic normality of the MHD estimator β̂_MHD.

Theorem 3.2

Suppose Θ is a compact subset of R^p and let C = {x_j^T β : β ∈ Θ, 1 ≤ j ≤ K}. Suppose further that F is strictly increasing and thrice differentiable with derivatives f, f^(1), and f^(2) bounded on C. Assume that F(C) ⊂ [δ, 1 − δ] for some δ > 0. Let (π, w) ∈ G_K be such that T(π, w) is unique, and let {(π_n, w_n) ∈ G_K : n ≥ 1} be a deterministic sequence such that (π_n, w_n) → (π, w) as n → ∞. Let Σ(β) be the p × p matrix defined by Σ(β) = ∑_{j=1}^K w_j x_j x_j^T G_j^(1)(x_j^T β), and let λ(π, w, β) be the p × 1 vector defined by λ(π, w, β) = ∑_{j=1}^K w_j x_j G_j(x_j^T β), where G_j(y) = (d/dy){ √(π_j F(y)) + √((1 − π_j)(1 − F(y))) } for 1 ≤ j ≤ K. Assume that the matrix Σ(β) is nonsingular. Then, as n → ∞, we have

T(π_n, w_n) − T(π, w) = [Σ^{-1}(β) + o(1)] λ(π_n, w_n, β),  (14)

where λ(π_n, w_n, β) is obtained from λ(π, w, β) by replacing (π, w) with (π_n, w_n). Then we also have

T(π_n, w_n) − T(π, w) = 4[Σ_*^{-1}(β) + o(1)] λ(π_n, w_n, β),  (15)

where Σ_*(β) is the p × p matrix defined by

Σ_*(β) = ∑_{j=1}^K w_j { f²(x_j^T β) [ (π_j F(x_j^T β))^{1/2} + ((1 − π_j)(1 − F(x_j^T β)))^{1/2} ]² / [ F(x_j^T β)(1 − F(x_j^T β)) ] } x_j x_j^T.

The next theorem establishes the asymptotic normality of the MHD estimator β̂_MHD.

Theorem 3.3

Assume that the conditions of Theorems 3.1 and 3.2 hold. Further, assume that the expansion (14) holds for T(p̂_N, w_N), with the o(1) term replaced by o_p(1), where β̂_MHD = T(p̂_N, w_N). Let (π, w) ∈ G_K be such that T(π, w) is unique and T(π, w) = β_0. Then, as N → ∞, we have

√N (β̂_MHD − β_0) →_D N( 0, (1/16) Σ^{-1}(β_0) Σ_*(β_0) Σ^{-1}(β_0) ),  (16)

where Σ(β) and Σ_*(β) are as defined in Theorem 3.2. If π_j = F(x_j^T β_0) for 1 ≤ j ≤ K, then

√N (β̂_MHD − β_0) →_D N(0, Σ̄^{-1}(β_0)),  (17)

where Σ̄(β_0) = ∑_{j=1}^K w_j x_j x_j^T f²(x_j^T β_0) / [F(x_j^T β_0)(1 − F(x_j^T β_0))].

In the next theorem we show that the penalized MHD estimator β̂_PMHD defined by (11) has the oracle properties. Without loss of generality, write β = (β_1^T, β_2^T)^T, where β_1 ∈ R^d and β_2 ∈ R^{p−d}. The vector of true parameters is denoted by β_0 = (β_{01}^T, β_{02}^T)^T, with each element of β_{01} nonzero and β_{02} = 0.

Theorem 3.4

Assume that the conditions of Theorem 3.3 hold. Let p_{λ_N}(·) be a general folded concave penalty function satisfying assumptions (a) and (b) in Section 2. If λ_N → 0 and √N λ_N → ∞ as N → ∞, then the penalized MHD estimator β̂_PMHD = (β̂_PMHD1^T, β̂_PMHD2^T)^T defined by (11) satisfies:

  1. Sparsity: P(β̂_PMHD2 = 0) → 1;

  2. Asymptotic normality:
    √N (β̂_PMHD1 − β_{01}) →_D N( 0, (1/16) Σ_1^{-1}(β_0) Σ_{*1}(β_0) Σ_1^{-1}(β_0) ),  (18)
    where Σ_1(β) = ∑_{j=1}^K w_j x_{j1} x_{j1}^T G_j^(1)(x_j^T β) and
    Σ_{*1}(β) = ∑_{j=1}^K w_j { f²(x_j^T β) [ (π_j F(x_j^T β))^{1/2} + ((1 − π_j)(1 − F(x_j^T β)))^{1/2} ]² / [ F(x_j^T β)(1 − F(x_j^T β)) ] } x_{j1} x_{j1}^T,
    with x_j = (x_{j1}^T, x_{j2}^T)^T and x_{j1} ∈ R^d. If π_j = F(x_j^T β_0) for 1 ≤ j ≤ K, then we have
    √N (β̂_PMHD1 − β_{01}) →_D N(0, Σ̄_1^{-1}(β_0)),  (19)
    where Σ̄_1(β_0) = ∑_{j=1}^K w_j x_{j1} x_{j1}^T f²(x_j^T β_0) / [F(x_j^T β_0)(1 − F(x_j^T β_0))].

Recall the regularized MLE β̂_PMLE defined by (2) in the Introduction. For a general folded concave penalty function p_{λ_N}(·) as in Theorem 3.4, it can be shown that β̂_PMLE satisfies an asymptotic normality property such as (19) when π_j = F(x_j^T β_0) for 1 ≤ j ≤ K. That is, we have √N (β̂_PMLE1 − β_{01}) →_D N(0, Σ̄_1^{-1}(β_0)) as N → ∞ under some regularity conditions. By comparing this result with (19), one can see the asymptotic equivalence of the penalized MHD estimator β̂_PMHD and the penalized MLE β̂_PMLE. Thus, the penalized MHD and MLE estimators share some (asymptotic) optimality properties. It can be shown that the penalized MSCD estimator β̂_PMSCD defined by (12) is also asymptotically equivalent to the penalized MLE β̂_PMLE under some conditions. The advantage of the penalized and non-penalized MHD and MSCD estimators is that they possess excellent robustness properties (see Section 5), which the corresponding MLEs generally lack.

It can be shown that the MLE β̂_MLE exhibits the same asymptotic normality property as (17). By comparing (17) and (19), we observe that the penalized MHD estimator β̂_PMHD possesses the oracle properties and is asymptotically as efficient as the MLE for estimating β_{01} when β_{02} = 0 is known in advance. Thus, β̂_PMHD is a fully efficient oracle procedure with excellent robustness properties. The penalized MSCD estimator β̂_PMSCD possesses the same properties. This is the most striking feature of β̂_PMHD and β̂_PMSCD; to the best of our knowledge, no other estimators have this property in the present context. Also, from Theorem 3.4 we note that β̂_PMHD is asymptotically unbiased. This is because we have employed an adaptive penalty function in defining β̂_PMHD. For non-adaptive penalty functions, such as the SCAD penalty [15], there would be an extra bias term, which would be a function of p'_{λ_N}(|t|), the first derivative of p_{λ_N}(|t|); this extra term would, however, be negligible for large |t| in the case of the SCAD penalty. To select the regularization parameter λ_N in practice, we can use a data-driven method, such as cross-validation, AIC, or BIC.

4. Monte Carlo studies

4.1. Estimation

In this subsection, we conduct simulation studies to compare the finite-sample performance of the MHD and MSCD estimators (denoted MHDE and MSCDE in this section) defined by (7) and (8), respectively, with that of the traditional MLE (see (1)). The behavior of MHDE and MSCDE is studied under contamination models. For our simulations, we considered K = 10 groups. Within each group, the sample size is set to n_j = n, j = 1,…,10. Thus, for the jth group, we generated n data points from a Bernoulli distribution with probability of success F(x_j^T β), j = 1,…,10, where F(·) is a CDF. In this subsection, we used the CDF of the Logistic(1, 1.2) distribution for F. We computed the bias and the mean squared error (MSE) of each estimator based on M replications as follows:

Bias(β̂_m) = (1/M) ∑_{i=1}^M (β̂_{m,i} − β_m),  MSE(β̂_m) = (1/M) ∑_{i=1}^M (β̂_{m,i} − β_m)²,  m = 0, 1, 2,

that is, the average of each performance measure over the M repetitions. In this subsection, we set M = 1000. We also set β^T = (β_0, β_1) = (1.5, 0.4) as the true parameter vector and x_j = (1, x_j)^T, where x_j = j for j = 1,…,10. We first examined the following two models:

  • Model I: F1(y)=F(y);

  • Model II: F2(y)=0.9F(y)+0.1.

Model I is the clean model (i.e. there is no contamination), and Model II represents an overall increase in the response for 10% of the observations.
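The bias/MSE computation described above can be sketched as follows for the MLE under the clean Model I. The number of replications is reduced from the paper's 1000 for brevity, and the optimizer choice is an implementation detail of this sketch, not taken from the paper.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def F(y, mu=1.0, s=1.2):
    """CDF of the Logistic(1, 1.2) distribution used for the link in Section 4.1."""
    return expit((y - mu) / s)

def mle(X, y, sizes):
    """Grouped-data MLE maximizing the log-likelihood (1) for this link."""
    def nll(beta):
        p = np.clip(F(X @ beta), 1e-10, 1 - 1e-10)
        return -np.sum(y * np.log(p) + (sizes - y) * np.log1p(-p))
    return minimize(nll, np.zeros(X.shape[1]), method="Nelder-Mead").x

rng = np.random.default_rng(3)
K, n, M = 10, 50, 200                  # M reduced from 1000 for speed
beta_true = np.array([1.5, 0.4])       # true parameter vector from the text
X = np.column_stack([np.ones(K), np.arange(1.0, K + 1)])
sizes = np.full(K, n)

est = np.empty((M, 2))
for i in range(M):                     # Model I: clean data, no contamination
    y = rng.binomial(sizes, F(X @ beta_true))
    est[i] = mle(X, y, sizes)
bias = est.mean(axis=0) - beta_true
mse = ((est - beta_true) ** 2).mean(axis=0)
```

The same loop, with `mle` swapped for a routine optimizing (7) or (8), produces the MHDE and MSCDE columns of the tables.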

Table 1 presents the simulation results for the bias and MSE of the three estimators under Models I and II. For Model I, the biases of the MLE are considerably lower than those of MHDE and MSCDE. Indeed, the MLE performs better than MSCDE and MHDE under Model I, but the MSE differences are small. For Model II, the biases of the three methods are comparable. Based on the MSE values for Model II, MSCDE performs best, followed by MHDE. MSCDE is preferable in applications where subjects may show a response not caused by the stimulus/treatment under examination, e.g. if they recover naturally.

Table 1. Biases and MSEs of MHDE, MSCDE, and MLE for Models I and II.

Model n Estimator Bias(β0) Bias(β1) MSE(β0) MSE(β1)
  30 MLE 0.0075 −0.0007 0.1111 0.0028
I 30 MHDE −0.0291 0.0051 0.1124 0.0030
  30 MSCDE −0.0730 0.0126 0.1208 0.0032
  50 MLE 0.0021 0.0001 0.0767 0.0020
I 50 MHDE −0.0210 0.0038 0.0763 0.0020
  50 MSCDE −0.0473 0.0082 0.0892 0.0023
  30 MLE 0.6113 −0.0532 0.4678 0.0056
II 30 MHDE 0.5850 −0.0481 0.4419 0.0052
  30 MSCDE 0.5805 −0.0428 0.4255 0.0049
  50 MLE 0.5992 −0.0497 0.4259 0.0043
II 50 MHDE 0.5838 −0.0466 0.4081 0.0040
  50 MSCDE 0.5721 −0.0426 0.4032 0.0037

Next, we considered a case where different covariates are assigned to different groups; see, e.g. Stephenson et al. [43]. Specifically, we set x_j = 0 for j = 1, 2, x_j = j for j = 3, 4, x_j ∼ N(j, 1) for j = 5, 6, 7, and x_j ∼ U[j − 1, j + 1] for j = 8, 9, 10. Table 2 displays the simulation results for the bias and MSE of the three estimators under Models I and II. We observe that the results in Table 2 are similar to those in Table 1; that is, the MLE performs better than MSCDE and MHDE under Model I, whereas MSCDE and MHDE do better than the MLE under Model II.

Table 2. Simulation results for Models I and II with different covariates for different groups.

Model n Estimator Bias(β0) Bias(β1) MSE(β0) MSE(β1)
  30 MLE −0.2879 0.0099 0.1426 0.0017
I 30 MHDE −0.3270 0.0153 0.1584 0.0017
  30 MSCDE −0.3326 0.0178 0.1789 0.0020
  50 MLE −0.2959 0.0120 0.1190 0.0009
I 50 MHDE −0.3158 0.0147 0.1295 0.0009
  50 MSCDE −0.3218 0.0158 0.1353 0.0014
  30 MLE 0.3330 −0.0416 0.3033 −0.0367
II 30 MHDE 0.3033 −0.0367 0.1898 0.0040
  30 MSCDE 0.2940 −0.0319 0.1700 0.0037
  50 MLE 0.3252 −0.0381 0.1744 0.0034
II 50 MHDE 0.3014 −0.0349 0.1612 0.0031
  50 MSCDE 0.2777 −0.0315 0.1520 0.0030

We now generate the data from the following model:

  • Model III: P(Y_{ji} = 1 | x_j) = F(x_j^T β + ε_{ji}),

where β^T = (β_1, β_2, β_3) = (1, 0.4, 0.8) and x_j = (1, x_{j1}, x_{j2})^T; x_{j1} and x_{j2} are mutually independent with x_{jk} ∼ U[0, 2] for k = 1, 2; the ε_{ji} are independent random variables with common mixture distribution (1 − α)N(0, 0.5²) + αN(8, 0.2²); and F is the CDF of the Logistic(1, 1.2) distribution. Model III includes the case where there are errors in the measurement or recording of the x_j. We consider two cases: α = 0 and α = 0.1. When α = 0, the ε_{ji} follow the normal distribution N(0, 0.5²). When α = 0.1, the ε_{ji} are drawn from a contaminated normal distribution in which about 10% come from a N(8, 0.2²) distribution, which can be interpreted as an outlier distribution.

Table 3 reports the simulation results for the bias and MSE of MHDE, MSCDE, and MLE for Model III with n = 30, based on 1000 replications. One can see from Table 3 that when α = 0, the MLE has a smaller MSE than MSCDE and MHDE, but MHDE and MSCDE have smaller absolute biases than the MLE. When α = 0.1, MSCDE and MHDE outperform the MLE, and MSCDE behaves better than MHDE. These results indicate that MHDE and MSCDE are more robust than the MLE to outliers. We also see from Tables 1–3 that MSCDE is more robust than MHDE. Simulations using the CDF of the N(0, 1) distribution for F gave similar results and are omitted.

Table 3. Biases and MSEs of MHDE, MSCDE, and MLE for Model III.

α Method Bias(β0) Bias(β1) Bias(β2) MSE(β0) MSE(β1) MSE(β2)
  MLE 0.0712 −0.0223 −0.0258 0.2441 0.1128 0.0976
0 MHDE 0.0425 −0.0168 −0.0137 0.2486 0.1155 0.1006
  MSCDE −0.0021 −0.0045 0.0012 0.2705 0.1246 0.1078
  MLE 0.5833 −0.0702 −0.1466 0.5178 0.1808 0.1258
0.1 MHDE 0.5682 −0.0668 −0.1397 0.5037 0.1775 0.1153
  MSCDE 0.5515 −0.0649 −0.1316 0.4905 0.1767 0.1146

4.2. Variable selection

In this subsection, we carried out a simulation study to compare the performance of the penalized MHDE and penalized MSCDE defined by (11) and (12), respectively, with that of the penalized MLE defined by (2). For all three estimators, we used an adaptive penalty function of the form ∑_{k=1}^p p'_{λ_N}(|β_k^{(0)}|)|β_k|, with β^{(0)} being the corresponding non-penalized MHDE or MSCDE defined in Section 2, or the MLE based on the log-likelihood function l_N(β) defined by (1). Further, we employed the SCAD penalty function with a = 3.7 for p_{λ_N}(·). The tuning parameter λ_N in p_{λ_N}(·) is chosen by the method given in Fan et al. [14].

We considered K = 20 groups. For each group, the sample size is set to n_j = n, j = 1,…,K. In this section, we set n = 10, 20. For the jth group, we generated n data points from a Bernoulli distribution with probability of success F(x_j^T β), j = 1,…,K. The simulation results presented in this section are based on 500 replications.

We first considered the case where the data are generated from the following model:

Model IV:F(xjTβ)=L(xjTβ),

where L(y) denotes the CDF of the Logistic(2, 3) distribution, β^T = (β_1, β_2, β_3, β_4) = (1.2, 0, 0.9, 0), and x_j = (1, x_{j1}, x_{j2}, x_{j3})^T, where x_{j1}, x_{j2}, and x_{j3} are mutually independent with x_{jk} ∼ U[0, 2] for k = 1, 2, 3.

We measured the estimation accuracy by the average l_1-losses |β̂_1 − β_1| and |β̂_3 − β_3| over the 500 replications. We also evaluated the selection accuracy by the average counts of false positives (FPs) and false negatives (FNs), i.e. the number of noise covariates included in the model and the number of signal covariates not included. Table 4 gives the results for Model IV. From Table 4, we observe that both the penalized MSCDE and the penalized MHDE are comparable to the penalized MLE in variable selection but have larger absolute biases than the penalized MLE. Table 4 also shows that the penalized MSCDE outperforms the penalized MHDE.

Table 4. Comparison of (penalized) MHDE, MSCDE, and MLE for Model IV.

  Method FP FN |βˆ1β1| |βˆ3β3|
K = 20, n = 10 MLE 0.2880 0.2640 0.4381 0.3891
  MHDE 0.1820 0.3880 0.4823 0.4677
  MSCDE 0.1760 0.2800 0.4519 0.4096
K = 20, n = 20 MLE 0.2100 0.2100 0.3969 0.3527
  MHDE 0.1440 0.2580 0.4234 0.3869
  MSCDE 0.1260 0.2540 0.3953 0.3826

Next, we generated the data from the following model:

Model V: F(x_j^T β) = (1 − ς)L(x_j^T β) + ς,

where 0 < ς < 1, and L(y), β, and x_j are as defined in Model IV. Table 5 gives the results for Model V. In this case, we see from Table 5 that both the penalized MSCDE and the penalized MHDE outperform the penalized MLE. Table 5 also shows that when ς changes from 0.1 to 0.2, the FP, FN, and l_1-losses of all three methods become larger.

Table 5. Comparison of (penalized) MHDE, MSCDE, and MLE for Model V.

  ς Method FP FN |βˆ1β1| |βˆ3β3|
K = 20, n = 10 0.1 MLE 0.2320 0.1940 0.5417 0.3335
    MHDE 0.1540 0.1040 0.5356 0.3123
    MSCDE 0.1520 0.1260 0.5054 0.3073
K = 20, n = 20 0.1 MLE 0.1500 0.1740 0.5036 0.3180
    MHDE 0.1120 0.0920 0.4970 0.2851
    MSCDE 0.1360 0.0780 0.4833 0.2672
K = 20, n = 10 0.2 MLE 0.3080 0.1960 0.8085 0.3576
    MHDE 0.2040 0.1180 0.8022 0.3133
    MSCDE 0.2120 0.1240 0.7912 0.3082
K = 20, n = 20 0.2 MLE 0.2420 0.1640 0.8678 0.3294
    MHDE 0.1160 0.0960 0.8509 0.3121
    MSCDE 0.1320 0.0760 0.8405 0.3085

We now generate the data from the following model:

Model VI:P(Yji=1|xj)=F(xjTβ+εji),

where βT=(β1,…,β7)=(0.6,0,0.9,0,0,1.1,0); xj=(1,xj1,…,xj6)T, where xj1,…,xj6 are mutually independent and xjk∼U[0,2] for k=1,…,6; the εji are independent random variables with common mixture distribution (1−α)N(0,0.5²)+αN(0,10²); and F is the CDF of the Logistic (2,3) distribution. Table 6 reports the simulation results for Model VI with α=0, 0.1 based on 500 replications. Table 6 shows that when α=0, the penalized MSCDE and penalized MHDE perform better than the penalized MLE in variable selection as well as in the estimation of β1, whereas the penalized MLE has lower l1-losses for β3 and β6 than the penalized MSCDE and penalized MHDE. When α=0.1, the penalized MHDE and penalized MSCDE outperform the penalized MLE. The values in Table 6 also reveal that the estimation and variable selection of all three methods improve as n increases. Based on Tables 4-6, we conclude that the penalized MSCDE and penalized MHDE are more robust than the penalized MLE in the presence of outliers.
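A sketch of the Model VI generating process (ours; the per-observation mixture errors are drawn by first selecting a standard deviation of 0.5 or 10 with probabilities 1−α and α):

```python
import numpy as np

def simulate_model_vi(K=20, n=20, alpha=0.1, seed=0):
    """Model VI: latent linear predictor plus per-observation errors from the
    mixture (1 - α) N(0, 0.5²) + α N(0, 10²), through the Logistic(2, 3) CDF."""
    rng = np.random.default_rng(seed)
    beta = np.array([0.6, 0.0, 0.9, 0.0, 0.0, 1.1, 0.0])
    x = np.column_stack([np.ones(K), rng.uniform(0.0, 2.0, size=(K, 6))])
    eta = x @ beta
    # per-observation mixture indicator: sd 10 with prob alpha, else sd 0.5
    sd = np.where(rng.random((K, n)) < alpha, 10.0, 0.5)
    eps = rng.normal(0.0, 1.0, size=(K, n)) * sd
    p = 1.0 / (1.0 + np.exp(-(eta[:, None] + eps - 2.0) / 3.0))
    y = rng.binomial(1, p)   # individual binary responses Y_ji
    return x, y

x, y = simulate_model_vi()
```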

Table 6. Comparison of (penalized) MHDE, MSCDE, and MLE for Model VI.

    Method FP FN |βˆ1 − β1| |βˆ3 − β3| |βˆ6 − β6|
α=0, K = 20, n = 20 MLE 0.5140 0.5920 0.4797 0.2390 0.2234
    MHDE 0.3340 0.4000 0.4580 0.2611 0.2308
    MSCDE 0.1880 0.5380 0.4670 0.2707 0.2498
α=0, K = 20, n = 40 MLE 0.3820 0.5280 0.4366 0.2187 0.2032
    MHDE 0.2780 0.3300 0.3711 0.2209 0.2165
    MSCDE 0.2120 0.4300 0.4294 0.2203 0.2316
α=0.1, K = 20, n = 20 MLE 0.6600 0.6260 0.4861 0.2683 0.2575
    MHDE 0.2880 0.3820 0.4589 0.2625 0.2422
    MSCDE 0.2280 0.5600 0.4759 0.2468 0.2352
α=0.1, K = 20, n = 40 MLE 0.5100 0.6020 0.4605 0.2466 0.2392
    MHDE 0.3040 0.3180 0.3742 0.2308 0.2207
    MSCDE 0.2280 0.5600 0.4759 0.2468 0.2352

5. Real-data applications

In this section, we analyze two real-data sets. For each data set, we estimate the vector β using the MHDE and MSCDE defined by (7) and (8), respectively, and the MLE βˆMLE (see (1)).

5.1. Example 1

We first apply our methods to analyze a data set on Caesarian birth previously analyzed by Fahrmeir and Tutz [13] using the maximum likelihood approach. The response variable of interest is the occurrence or nonoccurrence of infection of type I or II. Three dichotomous covariates that may influence the risk of infection were considered: Was the Caesarian section planned or not? Were risk factors such as diabetes or excessive weight present? Were antibiotics given as a prophylaxis? The aim is to analyze the effects of the covariates on the risk of infection, and in particular to determine whether antibiotics can decrease this risk. We define a binary response variable Y with Y = 0 if there is no infection and Y = 1 if there is infection of either type. We constructed logistic and probit models to fit this data set:

P(Yj=1|xj)=F(xjTβ)=exjTβ/(1+exjTβ), (20)
P(Yj=1|xj)=F(xjTβ)=Φ(xjTβ), (21)

where xj=(1,xj1,xj2,xj3)T, β=(β0,β1,β2,β3)T, and Φ() is the CDF of the standard normal distribution. We set xj1=1 if the Caesarian was not planned and xj1=0 if it was planned; xj2=1 indicates that antibiotics were given, and xj3=1 indicates that there were risk factors present.

Fahrmeir and Tutz [13] implemented the maximum likelihood method to estimate the unknown parameter vector β. We computed the MHDE, MSCDE and MLE of β. Table 7 gives the parameter estimates for the three methods and the MSE defined by MSE = (1/7)∑_{j=1}^{7}(pˆj − pj)², where pj is the probability of infection for the jth group and pˆj=F(xjTβˆ), with βˆ being an estimator of β. Figure 1 exhibits the scatterplots of the points (pj,pˆj), j=1,…,7, for models (20) and (21) under the three methods, where pˆj is estimated with the jth group's data deleted. From the MSE values in Table 7, we note that the MSCDE is clearly the best, with the MHDE a close second. The scatterplots in Figure 1 are generally similar for the three methods, but Panels (c) and (f) of Figure 1 suggest that the MLE takes large values for some groups with p = 0. In this case, the MSCDE and MHDE appear to be more robust with respect to possible deviations of the postulated model from the true model. In Figure 1, comparing (a), (b) and (c) with (d), (e) and (f), respectively, we note that the scatterplots for model (20) are similar to those for model (21), suggesting that neither model may be the true one.
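The group-level MSE criterion of Table 7 can be computed as in the following sketch (ours; the Caesarian data themselves are not reproduced here, so the design matrix `X` and observed proportions `p_obs` are placeholders to be supplied):

```python
import numpy as np
from math import erf, sqrt

def group_mse(beta_hat, X, p_obs, link="logit"):
    """MSE = (1/K) * sum_j (p_hat_j - p_j)^2 over the K groups, where
    p_hat_j = F(x_j' beta_hat) under a logit (model (20)) or probit
    (model (21)) link."""
    eta = X @ np.asarray(beta_hat, dtype=float)
    if link == "logit":
        p_hat = 1.0 / (1.0 + np.exp(-eta))
    else:  # probit: standard normal CDF via the error function
        p_hat = np.array([0.5 * (1.0 + erf(e / sqrt(2.0))) for e in eta])
    return float(np.mean((p_hat - np.asarray(p_obs)) ** 2))
```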

Table 7. Parametric estimates and MSEs for models (20) and (21).

Model Method βˆ0 βˆ1 βˆ2 βˆ3 MSE
  MHDE −3.0 0.58 3.55 −3.3 0.0099
(20) MSCDE −3.9 1.52 4.0 −3.5 0.0064
  MLE −1.89 1.07 2.03 −3.25 0.0163
  MHDE −2.1 0.45 2.4 −1.95 0.0095
(21) MSCDE −2.95 0.9 3.0 −2.05 0.0066
  MLE −1.09 0.61 1.2 −1.9 0.0174

Figure 1.

Scatterplot for points (pj,pˆj),j=1,,7. Panels (a), (b), and (c) are for the MHDE, MSCDE and MLE for model (20). Panels (d), (e), and (f) are for the MHDE, MSCDE and MLE for model (21). The diagonal line is y = x.

5.2. Example 2

Next we analyzed the data set given in Little [31] using our methods and the maximum likelihood method. These data present a distribution of 1607 married and fecund women interviewed in the Fiji Fertility Survey of 1975, classified by age, level of education, desire for more children, and contraceptive use. We view the use of contraception as the response variable Y and age ( X1), education ( X2), and desire for more children ( X3) as the covariates. We set Y = 1 for contraceptive use and Y = 0 for no contraceptives. The women are classified into K = 16 groups in terms of their age, education, and desire for more children. We use the following logistic model:

P(Yj=1|xj)=F(xjTβ)=exjTβ/(1+exjTβ), (22)

where xj=(1,xj1,xj2,xj3)T and β=(β0,β1,β2,β3)T. For age, we set xj1=21.5 for j=1,,4; xj1=27 for j=5,,8; xj1=34.5 for j=9,,12; and xj1=44.5 for j=13,,16. We also set xj2=1 for a higher level of education and xj2=0 for a lower level. Further, xj3=1 denotes a desire for more children and xj3=0 denotes no such desire.
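With this coding, the 16 × 4 design matrix can be constructed as follows (a sketch; the ordering of the education/desire combinations within each age block is our assumption):

```python
import numpy as np
from itertools import product

def fiji_design():
    """K = 16 group design matrix: intercept, age (four values shared in
    blocks of four groups), education (0/1), desire for more children (0/1)."""
    ages = [21.5, 27.0, 34.5, 44.5]
    rows = [[1.0, age, float(educ), float(desire)]
            for age, educ, desire in product(ages, (0, 1), (0, 1))]
    return np.array(rows)

X = fiji_design()
```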

The parameters of model (22) are estimated using the MHDE, MSCDE and MLE. Table 8 gives the estimates for the three methods and the MSE defined by MSE = (1/16)∑_{j=1}^{16}(pˆj − pj)², where pj is the proportion of contraceptive use for the jth group and pˆj=F(xjTβˆ). The MHDE and MSCDE give almost the same results, and they have lower MSE values than the MLE.

Table 8. Parametric estimates and MSEs for the three methods.

Method βˆ0 βˆ1 βˆ2 βˆ3 MSE
MHDE −2.4500 0.0600 0.4000 −0.9000 0.0064
MSCDE −2.4500 0.0600 0.4000 −0.9500 0.0064
MLE −0.3110 0.0740 −0.0196 −1.1224 0.0193

To evaluate the prediction performance of the three methods, we applied leave-one-out cross-validation to the data; i.e. to predict the proportion of contraceptive use for the jth group, we omitted the data for this group when fitting the model. Figure 2 displays the boxplots of the absolute prediction errors |pj − pˆj|, j=1,…,16, for the three methods. The mean values of these errors for the MHDE, MSCDE, and MLE are 0.0900, 0.0892, and 0.1226, respectively. These observations and Figure 2 suggest that the MSCDE and MHDE have better prediction performance than the MLE. Since we set the age xj1=21.5 for j=1,…,4 and set the age to the median for the other groups, it is more appropriate to use the following model to fit this data set:

P(Yji=1|xj)=F(xjTβ+εji), j=1,…,16, i=1,…,nj,

where the εji are random errors and the CDF F is as in (22). As observed in the simulation results of Section 4, the MSCDE and MHDE are more robust than the MLE under this contaminated model. In practical applications, the true model for a given data set is generally unknown, and the postulated model is usually not correct. The simulation results in Section 4 suggest that the MSCDE and MHDE may offer some protection when the postulated model deviates from the true model. This is an added advantage of using minimum-distance methods such as MSCD and MHD, which are generally robust to model misspecification.
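The leave-one-group-out scheme used above can be sketched generically as follows (ours; the `fit` argument stands in for an MHDE, MSCDE, or MLE routine, which is not reproduced here):

```python
import numpy as np

def loo_prediction_errors(X, p_obs, fit):
    """Absolute prediction errors |p_j - p_hat_j| under leave-one-group-out:
    group j is dropped when fitting, then its proportion is predicted."""
    K = X.shape[0]
    errors = np.empty(K)
    for j in range(K):
        keep = np.arange(K) != j                 # drop group j when fitting
        beta_hat = np.asarray(fit(X[keep], p_obs[keep]))
        p_hat = 1.0 / (1.0 + np.exp(-float(X[j] @ beta_hat)))  # logistic link (22)
        errors[j] = abs(p_obs[j] - p_hat)
    return errors

# illustration with a dummy 'fit' that always returns the zero vector
demo = loo_prediction_errors(np.ones((4, 2)),
                             np.array([0.5, 0.6, 0.4, 0.5]),
                             lambda Xk, pk: np.zeros(2))
```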

Figure 2.

Boxplots of the absolute prediction errors |pj − pˆj|, j=1,…,16, for the three methods. Here 1, 2, and 3 denote the boxplots for the MHDE, MSCDE, and MLE, respectively.

6. Discussion

In this paper, we have investigated simultaneous robust estimation and variable selection for binary regression models with grouped data. In many practical situations, the data are available only in a grouped form, or the observed data can be formed into groups based on the covariates observed for each subject. Working with grouped data has the additional advantage that it is possible to test the goodness of fit of a postulated model. The maximum likelihood approach is the most widely used method of inference for binary regression models. However, MLEs are sensitive to atypical data and model misspecification, and their lack of robustness has motivated researchers to develop more robust approaches. A well-known alternative is the minimum-distance approach, which has been observed to produce estimators with excellent robustness properties against model misspecification and the presence of outliers. We have examined two minimum-distance estimation methods, namely MHD and MSCD, for binary regression models with grouped data. The results of our simulations and two real-data analyses show that the MHD and MSCD estimators have good robustness properties; the MSCD estimator is marginally better than its MHD counterpart in small samples, but both are equally efficient asymptotically. Further, they outperform the MLE in the presence of outliers and model misspecification.

Regularization methods play an important role in identifying the covariates that truly affect a response in models containing covariates and a response variable, and they have been widely used for simultaneous coefficient estimation and variable selection. The importance of robust procedures has also been stressed for regularization methods [14,15,25,40,46,49]. Many well-known regularization methods are based on solving an optimization problem formed by the sum of a ‘loss function’ and a penalty function. We have constructed regularized estimators using the squared Hellinger and symmetric chi-squared distances as loss functions combined with an adaptive l1-penalty function. These techniques produce regularized estimators for binary regression models with grouped data that are both robust and efficient, and we have shown that our estimators satisfy the oracle properties. Such optimal regularized procedures were not previously available for binary regression models with grouped data. Furthermore, our numerical studies have shown that our penalized estimators are more stable in the presence of outliers and model misspecification than the corresponding MLE. Overall, full efficiency combined with excellent robustness and computational feasibility makes the proposed estimators very appealing in practice, and we expect our regularized methods to be useful in practical applications.
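The ‘loss plus adaptive l1-penalty’ structure described above can be illustrated with a generic sketch (ours, not the paper's exact criterion; the paper maximizes a distance-based objective, whereas this sketch is phrased as minimizing a loss, with the standard adaptive weights 1/|pilot estimate|):

```python
import numpy as np

def adaptive_l1_objective(beta, loss, beta_pilot, lam, eps=1e-6):
    """Penalized criterion: loss(beta) + lam * sum_k w_k |beta_k|, with
    adaptive weights w_k = 1 / |beta_pilot_k| so that coefficients with
    large pilot estimates are penalized lightly. The intercept (index 0)
    is left unpenalized (our convention for this sketch)."""
    beta = np.asarray(beta, dtype=float)
    w = 1.0 / np.maximum(np.abs(np.asarray(beta_pilot, dtype=float)), eps)
    penalty = lam * float(np.sum(w[1:] * np.abs(beta[1:])))
    return loss(beta) + penalty

# with a zero loss and unit pilot estimates, only the penalty contributes
val = adaptive_l1_objective([1.0, 0.5, 0.0], lambda b: 0.0,
                            [1.0, 1.0, 1.0], lam=0.1)
```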

Acknowledgments

We wish to thank the Editor, an Associate Editor and three reviewers for careful reading of our paper and for their helpful comments that led to a substantial improvement in this paper. Q. Tang's research was supported by the National Social Science Foundation of China (16BTJ019) and the Natural Science Foundation of Jiangsu Province of China (Grant No. BK20151481). R.J. Karunamuni's research was supported by a grant from the Natural Sciences and Engineering Research Council of Canada.

Appendix.

Here we give detailed proofs of the theorems stated in Section 3. The proofs are similar to those in Stather [41] and Karunamuni et al. [26], but we extend them here to binary regression models.

Let β0 denote the true parameter value of β. The next two lemmas show that, under certain conditions, T(π,w) is unique and continuous.

Lemma A.1

Suppose that Θ is a compact subset of Rp and F is a continuous CDF on R. Then T(π,w) exists for all (π,w)∈GK. Further, suppose that F is continuous and strictly increasing on R and that πj=F(xjTβ0), 1≤j≤K, where the xj's span Rp. Then T(π,w)=β0 uniquely for any w.

Proof.

Observe that

  1. F(xjTβ) is a continuous function of β. Therefore, {πjF(xjTβ)}^{1/2} and {(1−πj)(1−F(xjTβ))}^{1/2} are continuous functions of β.

  2. g(β)=∑_{j=1}^{K} wj[{πjF(xjTβ)}^{1/2}+{(1−πj)(1−F(xjTβ))}^{1/2}] is a continuous function of β.

  3. g(β) is bounded (when K is fixed), because |wj|≤1, 0≤πj≤1 and 0≤F(xjTβ)≤1.

From these facts and the compactness of Θ, the maximum of

∑_{j=1}^{K} wj[{πjF(xjTβ)}^{1/2}+{(1−πj)(1−F(xjTβ))}^{1/2}]

is attained at at least one point.

For the second part, we argue as follows. Using basic calculus, it is easy to show that the maximum of {πjF(xjTβ)}^{1/2}+{(1−πj)(1−F(xjTβ))}^{1/2} is attained when πj=F(xjTβ). That is, if πj=F(xjTβ0), 1≤j≤K, then T(π,w)=β0 is a solution for β. Since F is one-to-one by assumption, it then follows that xjTβ=xjTβ0 for j=1,…,K. We then have β=β0 since the xj's span Rp. Hence the result.
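The second part of the proof rests on the fact that a ↦ (πa)^{1/2}+((1−π)(1−a))^{1/2} is maximized at a = π; this is easy to check numerically (our illustration):

```python
import numpy as np

def affinity(pi, a):
    # Hellinger-type affinity between the Bernoulli(pi) and Bernoulli(a) laws
    return np.sqrt(pi * a) + np.sqrt((1.0 - pi) * (1.0 - a))

pi = 0.3
grid = np.linspace(1e-6, 1.0 - 1e-6, 200001)
a_star = float(grid[np.argmax(affinity(pi, grid))])  # numerical maximizer
```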

Lemma A.2

Suppose that Θ is a compact subset of Rp, F is continuous and strictly increasing on R, and (π,w)∈GK is such that T(π,w) is unique, with π=(π1,…,πK)T and 0<πj<1 for 1≤j≤K. Suppose that the xj's span Rp. Then T is continuous at (π,w).

Proof.

Let πn=(π1,n,π2,n,…,πK,n)T and wn=(w1,n,w2,n,…,wK,n)T. Assume (πn,wn)∈GK and (πn,wn)→(π,w) as n→∞. We will show that T(πn,wn)→T(π,w) as n→∞, so that T is continuous at (π,w). Denote β=T(π,w) and βn=T(πn,wn). Define the functions

gj,n(β)={πj,nF(xjTβ)}^{1/2}+{(1−πj,n)(1−F(xjTβ))}^{1/2}, gn(β)=∑_{j=1}^{K} wj,n gj,n(β), gj(β)={πjF(xjTβ)}^{1/2}+{(1−πj)(1−F(xjTβ))}^{1/2},

and

g(β)=∑_{j=1}^{K} wj gj(β).

It is enough to show that

sup{|gn(β)−g(β)| : β∈Θ}→0. (A1)

Then from (A1) we have |gn(βn)−g(βn)|→0. We also obtain g(βn)→g(β) as n→∞ using the continuity of g. It then follows that gn(βn)→g(β) as n→∞. Then, by the uniqueness of T(π,w) and the compactness of Θ, it follows that T(πn,wn)→T(π,w) as n→∞.

We now verify (A1). First note that

|gn(β)−g(β)|≤∑_{j=1}^{K} gj(β)|wj,n−wj|+∑_{j=1}^{K} wj,n|gj,n(β)−gj(β)|. (A2)

Denote Fj=F(xjTβ) and define

Δj,n(β)=gj,n(β)−gj(β)=[{πj,nFj}^{1/2}+{(1−πj,n)(1−Fj)}^{1/2}]−[{πjFj}^{1/2}+{(1−πj)(1−Fj)}^{1/2}]=Fj^{1/2}[πj,n^{1/2}−πj^{1/2}]+(1−Fj)^{1/2}[(1−πj,n)^{1/2}−(1−πj)^{1/2}] (A3)

for 1≤j≤K. Then, using the algebraic equality b^{1/2}−a^{1/2}=(1/2)(b−a)a^{−1/2}−(1/2)(b^{1/2}−a^{1/2})²a^{−1/2} for b≥0, a>0, we obtain

Δj,n(β)=(1/2){Fj/πj}^{1/2}[(πj,n−πj)−(πj,n^{1/2}−πj^{1/2})²]−(1/2){(1−Fj)/(1−πj)}^{1/2}[(πj,n−πj)+((1−πj,n)^{1/2}−(1−πj)^{1/2})²].

Now since (a^{1/2}−b^{1/2})²≤a^{−1}(b−a)² for b≥0, a>0, we have

|Δj,n(β)|≤(1/2)|πj,n−πj|[{Fj/πj}^{1/2}+{(1−Fj)/(1−πj)}^{1/2}]+(1/2)|πj,n−πj|²[{Fj/πj^{3}}^{1/2}+{(1−Fj)/(1−πj)^{3}}^{1/2}].

But {Fj/πj}^{1/2}≤πj^{−1}[{Fjπj}^{1/2}+{(1−Fj)(1−πj)}^{1/2}], and a similar inequality holds for {(1−Fj)/(1−πj)}^{1/2}. It then follows that

|Δj,n(β)|≤(1/2)[{Fjπj}^{1/2}+{(1−Fj)(1−πj)}^{1/2}][|πj,n−πj|{πj^{−1}+(1−πj)^{−1}}+|πj,n−πj|²{πj^{−2}+(1−πj)^{−2}}]. (A4)

Now combining (A2)–(A4) and using πj,n→πj, it follows that sup{|gn(β)−g(β)| : β∈Θ}→0. This completes the proof of (A1).
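The algebraic identity b^{1/2}−a^{1/2}=(1/2)(b−a)a^{−1/2}−(1/2)(b^{1/2}−a^{1/2})²a^{−1/2} (b≥0, a>0) invoked in this proof can be verified numerically (our check):

```python
import math

def identity_gap(a, b):
    """|LHS - RHS| of the identity used in the proof of Lemma A.2."""
    lhs = math.sqrt(b) - math.sqrt(a)
    rhs = (0.5 * (b - a) / math.sqrt(a)
           - 0.5 * (math.sqrt(b) - math.sqrt(a)) ** 2 / math.sqrt(a))
    return abs(lhs - rhs)

# the gap should be at floating-point roundoff level for all valid (a, b)
gap = max(identity_gap(a, b) for a in (0.1, 0.5, 1.0, 2.0)
          for b in (0.0, 0.3, 1.0, 4.0))
```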

Proof of Theorem 3.1 —

The proof follows from the continuity of T(·,·) and the fact that (πN,wN)→P(π,w) as N→∞.

Proof of Theorem 3.2 —

Let T(πn,wn)=βn and T(π,w)=β. Then βn satisfies the following equation:

0=(∂/∂β)∑_{j=1}^{K} wj,n[{πj,nF(xjTβ)}^{1/2}+{(1−πj,n)(1−F(xjTβ))}^{1/2}]=(1/2)∑_{j=1}^{K} wj,n xj f(xjTβ)[{πj,n/F(xjTβ)}^{1/2}−{(1−πj,n)/(1−F(xjTβ))}^{1/2}].

Let

Gj,n(y)=(∂/∂y)[{πj,nF(y)}^{1/2}+{(1−πj,n)(1−F(y))}^{1/2}]=(1/2)f(y)[{πj,n/F(y)}^{1/2}−{(1−πj,n)/(1−F(y))}^{1/2}].

The first derivative of Gj,n(y) is

Gj,n^{(1)}(y)=(1/2)f^{(1)}(y)[{πj,n/F(y)}^{1/2}−{(1−πj,n)/(1−F(y))}^{1/2}]−(1/4)f(y)²[{πj,n/F(y)^{3}}^{1/2}+{(1−πj,n)/(1−F(y))^{3}}^{1/2}], (A5)

and λ(πn,wn,βn)=∑_{j=1}^{K} wj,n xj Gj,n(xjTβn). Using a Taylor expansion, we have

Gj,n(xjTβn)=Gj,n(xjTβ)+Gj,n^{(1)}(xjTβ)xjT(βn−β)+(1/2)Gj,n^{(2)}(xjTβn∗)[xjT(βn−β)]²,

where βn∗ is a value between βn and β. Note that Gj,n^{(2)} is bounded because F has bounded derivatives and F is bounded away from zero and one under the assumptions of the lemma. Further, Gj,n^{(1)}(y)→Gj^{(1)}(y) uniformly in y since πj,n→πj, where Gj^{(1)}(y) is the expression in (A5) with πj,n replaced by πj. Now, substituting this expansion into the equation ∑_{j=1}^{K} wj,n xj Gj,n(xjTβn)=0, we obtain

0=∑_{j=1}^{K} wj,n xj[Gj,n(xjTβ)+Gj,n^{(1)}(xjTβ)xjT(βn−β)+(1/2)(βn−β)TxjxjTGj,n^{(2)}(xjTβn∗)(βn−β)]=∑_{j=1}^{K} wj,n xjGj,n(xjTβ)+∑_{j=1}^{K} wj,n xjxjT[Gj,n^{(1)}(xjTβ)+(1/2)Gj,n^{(2)}(xjTβn∗)xjT(βn−β)](βn−β).

Since wj,n→wj as n→∞, it follows that

∑_{j=1}^{K} wj,n xjxjTGj^{(1)}(xjTβ)→∑_{j=1}^{K} wj xjxjTGj^{(1)}(xjTβ)=Σ(β) (A6)

and the elements of the matrix Σn=(1/2)∑_{j=1}^{K} wj xjxjTGj,n^{(2)}(xjTβn∗)xjT(βn−β) go to zero as n→∞. Denote λ(πn,wn,β)=∑_{j=1}^{K} wj,n xjGj,n(xjTβ). Then we have λ(πn,wn,β)+(Σ(β)+Σn)(βn−β)=0. Since Σ(β) is nonsingular, the matrix Σ(β)+Σn is nonsingular for large n. Then (14) follows, and (15) follows from (14) and (A6). This completes the proof of Theorem 3.2.

Proof of Theorem 3.3 —

Assume that 0<F(y)<1 and πj,N→πj as N→∞ with 0<πj<1, j=1,…,K. Then, applying the algebraic equality b^{1/2}−a^{1/2}=(1/2)(b−a)a^{−1/2}−(1/2)(b^{1/2}−a^{1/2})²a^{−1/2} (b≥0, a>0) to πj,N/F(y) and (1−πj,N)/(1−F(y)) separately, we have

{πj,N/F(y)}^{1/2}−{(1−πj,N)/(1−F(y))}^{1/2}={πj/F(y)}^{1/2}−{(1−πj)/(1−F(y))}^{1/2}+(1/2)(πj,N−πj)[{πjF(y)}^{−1/2}+{(1−πj)(1−F(y))}^{−1/2}]+o(|πj,N−πj|). (A7)

Using (A7), as N→∞, we obtain

Gj,N(y)−Gj(y)=(∂/∂y)[{πj,NF(y)}^{1/2}+{(1−πj,N)(1−F(y))}^{1/2}]−(∂/∂y)[{πjF(y)}^{1/2}+{(1−πj)(1−F(y))}^{1/2}]=(f(y)/2)[{πj,N/F(y)}^{1/2}−{(1−πj,N)/(1−F(y))}^{1/2}−{πj/F(y)}^{1/2}+{(1−πj)/(1−F(y))}^{1/2}]=(f(y)/4)(πj,N−πj)[{πjF(y)}^{−1/2}+{(1−πj)(1−F(y))}^{−1/2}]+o(|πj,N−πj|). (A8)

Define Gˆj(y)=(∂/∂y)[{pˆjF(y)}^{1/2}+{(1−pˆj)(1−F(y))}^{1/2}]. Note that pˆN→P π as N→∞. Then, since T(π,w)=β0 and λ(π,w,β0)=0, we have from (14) and (A8) that

βˆMHD−β0=T(pˆN,wN)−T(π,w)=−λ(pˆN,wN,β0)[Σ^{−1}(β0)+oP(1)]=−∑_{j=1}^{K} wj,N xj{Gˆj(xjTβ0)−Gj(xjTβ0)}[Σ^{−1}(β0)+oP(1)]=−(1/4)Σ^{−1}(β0)∑_{j=1}^{K} wj,N xj f(xjTβ0)(pˆj−πj)×[{πjF(xjTβ0)}^{−1/2}+{(1−πj)(1−F(xjTβ0))}^{−1/2}](1+oP(1)). (A9)

The result (16) now follows from (A9) and the fact that {Nwj,N}^{1/2}(pˆj−πj)→D N(0,πj(1−πj)) as N→∞, for 1≤j≤K. This completes the proof of Theorem 3.3.

Proof of Theorem 3.4 —

Let

DN(β)=∑_{j=1}^{K} wj,N[{pˆjF(xjTβ)}^{1/2}+{(1−pˆj)(1−F(xjTβ))}^{1/2}]−∑_{k=1}^{p} pλN^{(1)}(|βk^{(0)}|)|βk|.

By arguments similar to those used in the proofs of Lemmas A.1 and A.2 of Tang and Karunamuni [46], there exists an N^{1/2}-consistent local maximizer βˇ=(βˇ1,0T)T of (11). By a Taylor expansion, with probability tending to 1, we have

DN((βˆPMHD1,βˆPMHD2))=DN((βˇ1,0))+(βˆPMHD−βˇ)T∑_{j=1}^{K} wj,N xjGj,N(xjTβˇ)+(1/2)(βˆPMHD−βˇ)T∑_{j=1}^{K} wj,N xjxjTGj,N^{(1)}(xjTβ∗)(βˆPMHD−βˇ)−∑_{k=d+1}^{p} pλN^{(1)}(|βk^{(0)}|)|βˆPMHD,k|, (A10)

where β∗ is between βˆPMHD and βˇ. Similar to the proof of Theorem 2.2 of Tang and Karunamuni [45], it holds that βˆPMHD→P β0. Then by (A6), we obtain

∑_{j=1}^{K} wj,N xjxjTGj,N^{(1)}(xjTβ∗)=Σ(β0)[1+op(1)]. (A11)

Since ∑_{j=1}^{K} wj,N xj1Gj,N(xjTβˇ)=0, we have

(βˆPMHD−βˇ)T∑_{j=1}^{K} wj,N xjGj,N(xjTβˇ)=βˆPMHD2T∑_{j=1}^{K} wj,N xj2Gj,N(xjTβˇ). (A12)

Using a Taylor expansion, we obtain

∑_{j=1}^{K} wj,N xj2Gj,N(xjTβˇ)=∑_{j=1}^{K} wj,N xj2Gj,N(xjTβ0)+Σ21(β0)(βˇ1−β01)[1+op(1)], (A13)

where Σ21(β0)=∑_{j=1}^{K} wj xj2xj1TGj^{(1)}(xjTβ0). By (A9) and (16), it follows that

∑_{j=1}^{K} wj,N xj2Gj,N(xjTβ0)=Op(N^{−1/2}).

If βˆPMHD≠βˇ, then by (A10)–(A13) and the fact that βˇ1−β01=Op(N^{−1/2}), we have DN((βˆPMHD1,βˆPMHD2))<DN((βˇ1,0)). This contradicts the fact that βˆPMHD is a maximizer of (11). So βˆPMHD2=0 and βˆPMHD1=βˇ1.

We now prove the asymptotic normality part. Consider DN((β1,0)) as a function of β1. With probability tending to 1, βˆPMHD1 is the N^{1/2}-consistent maximizer of DN((β1,0)) and satisfies

(∂/∂β1)DN((β1,0))|β1=βˆPMHD1=∑_{j=1}^{K} wj,N xj1Gj,N(xjTβˆPMHD)=0.

Using a Taylor expansion, we obtain

∑_{j=1}^{K} wj,N xj1Gj,N(xjTβˆPMHD)=∑_{j=1}^{K} wj,N xj1Gj,N(xjTβ0)+Σ1(β0)(βˆPMHD1−β01)[1+op(1)].

Hence, it follows that

Σ1(β0)(βˆPMHD1−β01)[1+op(1)]=−∑_{j=1}^{K} wj,N xj1Gj,N(xjTβ0). (A14)

By (A9), we have

N^{1/2}∑_{j=1}^{K} wj,N xj1Gj,N(xjTβ0)→D N(0,(1/16)Σ1(β0)).

Now (18) follows from the preceding expression and (A14). This completes the proof of Theorem 3.4. Furthermore, (19) follows easily from (18).

Funding Statement

Q. Tang's research was supported by the National Social Science Foundation of China [grant number 16BTJ019] and the Natural Science Foundation of Jiangsu Province of China [grant number BK20151481]. R.J. Karunamuni's research was supported by a grant from the Natural Sciences and Engineering Research Council of Canada.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • 1.Barnett V., Unusual outliers, in Data Analysis and Statistical Inference, Festschrift in Honour of Prof. Dr. Friedhelm Eicker, S. Schach and G. Trenkler, eds., Joseph Eul Verlag, Köln, 1992, pp. 93–113.
  • 2.Basu A., Shioya H., and Park C., Statistical Inference: The Minimum Distance Approach, CRC Press, Florida, 2011. [Google Scholar]
  • 3.Beran R., Minimum Hellinger distance estimators for parametric models, Ann. Stat. 5 (1977), pp. 445–463. doi: 10.1214/aos/1176343842 [DOI] [Google Scholar]
  • 4.Beran R., An efficient and robust adaptive estimator of location, Ann. Stat. 6 (1978), pp. 292–313. doi: 10.1214/aos/1176344125 [DOI] [Google Scholar]
  • 5.Bhattacharya R. and Kong M., Consistency and asymptotic normality of the estimated effective doses in bioassay, J. Stat. Plan. Inference 137 (2007), pp. 643–658. doi: 10.1016/j.jspi.2006.06.027 [DOI] [Google Scholar]
  • 6.Bianco A.M. and Yohai V.J., Robust estimation in the logistic regression model, in Robust Statistics, Data Analysis, and Computer Intensive Methods (Schloss Thurnau, 1994), volume 109 of Lecture Notes in Statistics, Springer, New York, 1996, pp. 17–34.
  • 7.Cantoni E. and Ronchetti E., Robust inference for generalized linear models, J. Am. Stat. Assoc. 96 (2001), pp. 1022–1030. doi: 10.1198/016214501753209004 [DOI] [Google Scholar]
  • 8.Carroll R.J. and Pederson S., On robustness in the logistic regression model, J. R. Stat. Soc. Ser. B 55 (1993), pp. 693–706. [Google Scholar]
  • 9.Copas J.B., Binary regression models for contaminated data (with discussion), J. R. Stat. Soc. Ser. B 50 (1988), pp. 225–265. [Google Scholar]
  • 10.Croux C. and Haesbroeck G., Implementing the Bianco and Yohai estimator for logistic regression, Comput. Stat. Data Anal. 44 (2003), pp. 273–295. doi: 10.1016/S0167-9473(03)00042-2 [DOI] [Google Scholar]
  • 11.Donoho D.L. and Liu R.C., The “automatic” robustness of minimum distance functionals, Ann. Stat. 16 (1988), pp. 552–586. doi: 10.1214/aos/1176350820 [DOI] [Google Scholar]
  • 12.Fahrmeir L. and Kaufmann H., Consistency and asymptotic normality of the maximum likelihood estimator in generalized linear models, Ann. Stat. 13 (1985), pp. 342–368. doi: 10.1214/aos/1176346597 [DOI] [Google Scholar]
  • 13.Fahrmeir L. and Tutz G., Multivariate Statistical Modelling Based on Generalized Linear Models, 2nd ed., Springer, New York, 2001. [Google Scholar]
  • 14.Fan J., Fan Y., and Barut E., Adaptive robust variable selection, Ann. Stat. 42 (2014), pp. 324–351. doi: 10.1214/13-AOS1191 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Fan J. and Li R., Variable selection via nonconcave penalized likelihood and its oracle properties, J. Am. Stat. Assoc. 96 (2001), pp. 1348–1360. doi: 10.1198/016214501753382273 [DOI] [Google Scholar]
  • 16.Fan J. and Lv J., Non-concave penalized likelihood with NP-dimensionality, IEEE Trans. Inf. Theory 57 (2011), pp. 5467–5484. doi: 10.1109/TIT.2011.2158486 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Gervini D., Robust adaptive estimators for binary regression models, J. Stat. Plan. Inference 131 (2005), pp. 297–311. doi: 10.1016/j.jspi.2004.02.006 [DOI] [Google Scholar]
  • 18.Haberman S.J., Maximum likelihood estimates in exponential response models, Ann. Stat. 5 (1977), pp. 815–841. doi: 10.1214/aos/1176343941 [DOI] [Google Scholar]
  • 19.Hampel F.R., The influence curve and its role in robust estimation, J. Am. Stat. Assoc. 69 (1974), pp. 383–393. doi: 10.1080/01621459.1974.10482962 [DOI] [Google Scholar]
  • 20.He X. and Simpson D., Lower bounds for contamination bias: Globally minimax versus locally linear estimation, Ann. Stat. 21 (1993), pp. 314–337. doi: 10.1214/aos/1176349028 [DOI] [Google Scholar]
  • 21.Hosseinian S. and Morgenthaler S., Robust binary regression, J. Stat. Plan. Inference 141 (2011), pp. 1497–1509. doi: 10.1016/j.jspi.2010.11.015 [DOI] [Google Scholar]
  • 22.Huber P.J., Robust estimation of a location parameter, Ann. Math. Stat. 35 (1964), pp. 73–101. doi: 10.1214/aoms/1177703732 [DOI] [Google Scholar]
  • 23.Huber P.J., Robust Statistics, Wiley, New York, 1981. [Google Scholar]
  • 24.Künsch H.R., Stefanski L.A., and Carroll R.J., Conditionally unbiased bounded-influence estimation in general regression models, with applications to generalized linear models, J. Am. Stat. Assoc. 84 (1989), pp. 460–466. [Google Scholar]
  • 25.Karunamuni R.J., Kong L., and Wei T., Efficient robust doubly adaptive regularized regression with applications, Stat. Methods Med. Res. 28 (2019), pp. 2210–2226. doi: 10.1177/0962280218757560 [DOI] [PubMed] [Google Scholar]
  • 26.Karunamuni R.J., Tang Q., and Zhao B., Robust and efficient estimation of effective dose, Comput. Stat. Data Anal. 90 (2015), pp. 47–60. doi: 10.1016/j.csda.2015.04.001 [DOI] [Google Scholar]
  • 27.Le Cam L., Asymptotic Methods in Statistical Decision Theory, Springer, New York, 1986. [Google Scholar]
  • 28.Li P. and Wiens D.P., Robustness of design in dose-response studies, J. R. Stat. Soc. Ser. B 73 (2011), pp. 215–238. doi: 10.1111/j.1467-9868.2010.00763.x [DOI] [Google Scholar]
  • 29.Lindsay B.G., Efficiency versus robustness: The case for minimum Hellinger distance and related methods, Ann. Stat. 22 (1994), pp. 1081–1114. doi: 10.1214/aos/1176325512 [DOI] [Google Scholar]
  • 30.Lindsay B.G., Statistical distances as loss functions in assessing model adequacy, in The Nature of Scientific Evidence: Statistical, Philosophical, and Empirical Considerations, M.L. Taper and S.R. Lele, eds., The University of Chicago Press, Chicago, 2004, pp. 439–464.
  • 31.Little R.J.A., Generalized linear models for cross-classified data from the WFS. World Fertility Survey Technical Bulletins. Number 5, 1978.
  • 32.Lv J. and Fan J., A unified approach to model selection and sparse recovery using regularized least squares, Ann. Stat. 37 (2009), pp. 3498–3528. doi: 10.1214/09-AOS683 [DOI] [Google Scholar]
  • 33.Müller C.H. and Neykov N., Breakdown points of trimmed likelihood estimators and related estimators in generalized linear models, J. Stat. Plan. Inference 116 (2003), pp. 503–519. doi: 10.1016/S0378-3758(02)00265-3 [DOI] [Google Scholar]
  • 34.Markatou M., Basu A., and Lindsay B., Weighted likelihood estimating equations: The discrete case with applications to logistic regression, J. Stat. Plan. Inference 57 (1997), pp. 215–232. doi: 10.1016/S0378-3758(96)00045-6 [DOI] [Google Scholar]
  • 35.McCullagh P. and Nelder J.A., Generalized Linear Models, 2nd ed., Chapman & Hall, London, 1989. [Google Scholar]
  • 36.Morgenthaler S., Least-absolute-deviations fits for generalized linear models, Biometrika 79 (1992), pp. 747–754. doi: 10.1093/biomet/79.4.747 [DOI] [Google Scholar]
  • 37.Pregibon D., Logistic regression diagnostics, Ann. Stat. 9 (1981), pp. 705–724. doi: 10.1214/aos/1176345513 [DOI] [Google Scholar]
  • 38.Pregibon D., Resistant fits for some commonly used logistic models with medical applications, Biometrics 38 (1982), pp. 485–498. doi: 10.2307/2530463 [DOI] [PubMed] [Google Scholar]
  • 39.Simpson D.G., Minimum Hellinger distance estimation for the analysis of count data, J. Am. Stat. Assoc. 82 (1987), pp. 802–807. doi: 10.1080/01621459.1987.10478501 [DOI] [Google Scholar]
  • 40.Smucler E. and Yohai V.J., Robust and sparse estimators for linear regression models, Comput. Stat. Data Anal. 111 (2017), pp. 116–130. doi: 10.1016/j.csda.2017.02.002 [DOI] [Google Scholar]
  • 41.Stather C., Robust statistical inference using Hellinger distance methods, Ph.D. diss., LaTrobe University, Australia, 1981.
  • 42.Stefanski L.A., Carroll R.J., and Ruppert D., Optimally bounded score functions for generalized linear models with applications to logistic regression, Biometrika 73 (1986), pp. 413–424. [Google Scholar]
  • 43.Stephenson B.J.K., Herring A.H., and Olshan A., Robust clustering with subpopulation-specific deviations, J. Am. Stat. Assoc. (2020), to appear. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Tamura R. and Boos D.D., Minimum Hellinger distance estimation for multivariate location and covariance, J. Am. Stat. Assoc. 81 (1986), pp. 223–229. doi: 10.1080/01621459.1986.10478264 [DOI] [Google Scholar]
  • 45.Tang Q. and Karunamuni R.J., Minimum distance estimation in a finite mixture regression model, J. Multivar. Anal. 120 (2013), pp. 185–204. doi: 10.1016/j.jmva.2013.05.008 [DOI] [Google Scholar]
  • 46.Tang Q. and Karunamuni R.J., Robust variable selection for finite mixture regression models, Ann. Inst. Stat. Math. 70 (2018), pp. 489–521. doi: 10.1007/s10463-017-0602-4 [DOI] [Google Scholar]
  • 47.Tutz G., Regression for Categorical Data, Cambridge University Press, Cambridge, UK, 2012. [Google Scholar]
  • 48.Victoria-Feser M. and Ronchetti E., Robust estimation for grouped data, J. Am. Stat. Assoc. 92 (1997), pp. 333–340. doi: 10.1080/01621459.1997.10473631 [DOI] [Google Scholar]
  • 49.Wang X., Jiang Y., Huang M., and Zhang H., Robust variable selection with exponential squared loss, J. Am. Stat. Assoc. 108 (2013), pp. 632–643. doi: 10.1080/01621459.2013.766613 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Wu J. and Karunamuni R.J., Efficient Hellinger distance estimates for semiparametric models, J. Multivar. Anal. 107 (2012), pp. 1–23. doi: 10.1016/j.jmva.2012.01.007 [DOI] [Google Scholar]
  • 51.Wu J. and Karunamuni R.J., Profile Hellinger distance estimation, Statistics 49 (2015), pp. 711–740. doi: 10.1080/02331888.2014.946928 [DOI] [Google Scholar]
  • 52.Wu J. and Karunamuni R.J., Efficient and robust tests for semiparametric models, Ann. Inst. Stat. Math. 70 (2018), pp. 761–788. doi: 10.1007/s10463-017-0608-y [DOI] [Google Scholar]
  • 53.Zhang C.-H., Nearly unbiased variable selection under mini-max concave penalty, Ann. Stat. 38 (2010), pp. 894–942. doi: 10.1214/09-AOS729 [DOI] [Google Scholar]
