Journal of Applied Statistics. 2025 Feb 5;52(13):2384–2412. doi: 10.1080/02664763.2025.2461715

Influence diagnostics in the Heckman selection models based on EM algorithms

Marcos S Oliveira, Marcos O Prates, Christian E Galarza, Victor H Lachos
PMCID: PMC12490367  PMID: 41048364

Abstract

This study presents diagnostic techniques for Heckman selection models estimated using the EM algorithm. The focus is on the selection t and normal models, based on the bivariate Student's-t and bivariate normal distributions, respectively. The Heckman selection model is a key econometric tool for estimating relationships while addressing selection bias. Relying on the EM-type algorithm, we develop global and local influence analyses based on the conditional expectation of the complete-data log-likelihood function, exploring four perturbation schemes for local influence analysis. To assess the effectiveness of the proposed diagnostic measures in identifying influential observations, we conducted a simulation study, complemented by two real-data applications that demonstrate how these techniques can effectively identify influential points. The proposed algorithms and methodologies are incorporated into the R package HeckmanEM.

Keywords: Case-deletion, Heckman selection model, local influence, model perturbation, multivariate Student's-t

1. Introduction

The Heckman selection model, introduced by Heckman [14], is a widely used method in econometrics and statistics to correct for selection bias in datasets where the outcomes are observed only for a non-random subset of the population. The model originally assumes bivariate normality (SLn), simplifying its mathematical formulation. However, real-world data often deviate from this assumption, exhibiting skewness or heavy tails, which can lead to biased estimates if not properly addressed. To address these limitations, Marchenko and Genton [24] extended the model to incorporate a bivariate Student's-t error distribution (SLt), offering greater flexibility for modeling heavier tails with the addition of the degrees of freedom parameter. This extension enhances resistance to outliers, a common issue in empirical datasets.

Numerous studies have advanced the understanding of Heckman selection models and their variations. For example, Lee [20] provided a generalized framework for selection models, while Bastos and Barreto-Souza [3] introduced a sample selection model based on the bivariate Birnbaum–Saunders distribution. Subsequently, Bastos et al. [4] generalized the Heckman model by allowing selection bias and dispersion parameters to depend on covariates. Lee [21] developed nonparametric methods for estimating treatment effects under selection bias, and more recently, Saulo et al. [33] extended the model to a broad class of symmetric distributions. Despite these advancements and the relevance of the area, there remains a significant gap in diagnostic methodologies specifically tailored to Heckman selection models, particularly regarding resistance and influence analysis.

Recognizing the need for reliable estimation methods under the SLn and SLt models, Zhao et al. [36] and Lachos et al. [19] proposed an EM-type algorithm that improves the computation of maximum likelihood estimates by using, in the E-step, the first two moments of the corresponding truncated multivariate distribution. This estimation methodology serves as the foundation for the diagnostic methods for Heckman models developed in this work.

Influence diagnostics matter for Heckman selection models because these models are sensitive to influential observations. Identifying the impact of specific data points is crucial, as even a few influential cases can disproportionately affect parameter estimates and predictions, leading to potentially misleading conclusions.

This study seeks to fill the existing gap by adapting diagnostic procedures to Heckman selection models and offering practical tools to enhance the validity of analyses in real-world applications. Although diagnostic methodologies for regression models have been extensively studied, influence diagnostics for Heckman selection models, particularly in local influence analysis, remain unexplored. The observed log-likelihood function of the SLn model includes intractable integrals, complicating the application of Cook's approach (see [7]). To address these challenges, Zhu and Lee [38] introduced local influence analysis using the Q-displacement function, aligning with the E-step of the EM algorithm; additionally, Zhu et al. [39] explored case-deletion methods for identifying influential observations.

Building on the work of Zhu and Lee [38], this study adapts their diagnostic procedures for both the SLn and SLt Heckman models and demonstrates their application with real-world data. Our framework identifies influential observations and assesses their impact on the stability and reliability of model estimates. These diagnostic tools enable practitioners to refine their models, ensuring that decisions are not unduly influenced by anomalous data points. The proposed methods are implemented in the R package HeckmanEM, providing researchers and analysts with accessible solutions for applying Heckman selection models.

The paper is structured as follows. Section 2 introduces the multivariate Student's-t distribution, its truncated version, and the extended multivariate skew-t and extended skew-normal distributions. Section 3 discusses the SLn and SLt models and the EM-type algorithm for maximum likelihood estimates. Section 4 derives diagnostic measures for global and local influence, considering four perturbation schemes. The simulation study and two real-data applications are presented in Sections 5 and 6, respectively, while Section 7 provides final remarks and conclusions. Proofs for the derived quantities are in the Appendix.

2. Background

2.1. The multivariate Student's-t distribution and its truncated version

A $p$-dimensional random vector $W$ following a multivariate Student's-t (MVT) distribution with location vector $\mu$, positive-definite scale-covariance matrix $\Sigma$, and degrees of freedom $\nu$ is denoted by $W\sim t_p(\mu,\Sigma,\nu)$, with its probability density function (pdf) denoted by $t_p(w\mid\mu,\Sigma,\nu)$. For $\alpha=(\alpha_1,\ldots,\alpha_p)^\top$ and $\beta=(\beta_1,\ldots,\beta_p)^\top$, the cumulative distribution function (cdf) is represented by:

$$T_p(\alpha,\beta;\mu,\Sigma,\nu)=\int_{\alpha}^{\beta}t_p(w\mid\mu,\Sigma,\nu)\,dw.$$

Special cases include $T_p(\beta;\mu,\Sigma,\nu)$ for $\alpha=-\infty$, and $T_p(\beta;\nu)$ and $t_p(\beta\mid\nu)$ when $\mu=0$ and $\Sigma=I_p$. When $p=1$, the subscript $p$ is omitted. As $\nu$ approaches infinity, $W$ converges in distribution to a multivariate normal (MVN) distribution $N_p(\mu,\Sigma)$. A key property of $W$ is its representation as a scale mixture of an MVN random vector and a positive random variable:

$$W=\mu+U^{-1/2}Z, \tag{1}$$

where $Z\sim N_p(0,\Sigma)$ is independent of $U\sim G(\nu/2,\nu/2)$, with $G(\alpha,\beta)$ being a gamma distribution with mean $\alpha/\beta$.
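The scale-mixture representation (1) can be illustrated with a short simulation. The following is a minimal univariate sketch using only the Python standard library; the function name `sample_student_t`, the seed, and the choice $\nu=10$ are illustrative assumptions and are not part of the HeckmanEM package.

```python
import math
import random
import statistics


def sample_student_t(mu, sigma2, nu, rng):
    """Draw one univariate Student's-t variate through W = mu + U^{-1/2} Z."""
    # U ~ Gamma(shape = nu/2, rate = nu/2), so that E[U] = 1.
    # random.gammavariate expects a SCALE parameter, hence scale = 2/nu.
    u = rng.gammavariate(nu / 2.0, 2.0 / nu)
    # Z ~ N(0, sigma2), drawn independently of U.
    z = rng.gauss(0.0, math.sqrt(sigma2))
    return mu + z / math.sqrt(u)


rng = random.Random(2025)  # illustrative seed
nu = 10.0                  # illustrative degrees of freedom
draws = [sample_student_t(0.0, 1.0, nu, rng) for _ in range(50_000)]

sample_mean = statistics.mean(draws)
sample_var = statistics.variance(draws)
theory_var = nu / (nu - 2.0)  # variance of t(0, 1, nu), valid for nu > 2
print(f"sample variance = {sample_var:.3f}, theoretical = {theory_var:.3f}")
```

With a large number of draws, the sample variance should be close to $\nu/(\nu-2)$, which is the variance implied by the mixture when $\Sigma=1$.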

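Considering the Borel set $B$ within $\mathbb{R}^p$:

$$B=\{(w_1,\ldots,w_p)^\top\in\mathbb{R}^p:\alpha_1\le w_1\le\beta_1,\ldots,\alpha_p\le w_p\le\beta_p\}=\{w\in\mathbb{R}^p:\alpha\le w\le\beta\}, \tag{2}$$

a $p$-dimensional random vector $X$ following a doubly truncated Student's-t (TMVT) distribution within the truncation region $B$, denoted by $X\sim Tt_p(\mu,\Sigma,\nu;B)$, has the pdf:

$$Tt_p(x\mid\mu,\Sigma,\nu;B)=\frac{t_p(x\mid\mu,\Sigma,\nu)}{T_p(\alpha,\beta;\mu,\Sigma,\nu)},\qquad \alpha\le x\le\beta.$$

The cdf of $X$ within $\alpha\le x\le\beta$ is:

$$TT_p(x\mid\mu,\Sigma,\nu;B)=\frac{1}{T_p(\alpha,\beta;\mu,\Sigma,\nu)}\int_{\alpha}^{x}t_p(y\mid\mu,\Sigma,\nu)\,dy=\frac{T_p(\alpha,x;\mu,\Sigma,\nu)}{T_p(\alpha,\beta;\mu,\Sigma,\nu)}.$$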

In the next subsection, we will introduce the multivariate extended skew-t and skew-normal distributions, which are pivotal for improving computational efficiency in moment calculations within the EM algorithm for Heckman selection models.

2.2. The multivariate extended skew-t distribution

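The multivariate extended skew-t (EST) distribution for a $p$-dimensional random vector $X$ with location vector $\mu$, positive-definite dispersion matrix $\Sigma$, skewness vector $\lambda$, extension parameter $\tau\in\mathbb{R}$, and degrees of freedom $\nu$ is denoted as $X\sim EST_p(\mu,\Sigma,\lambda,\tau,\nu)$. The pdf is:

$$EST_p(x\mid\mu,\Sigma,\lambda,\tau,\nu)=\frac{t_p(x\mid\mu,\Sigma,\nu)}{T(\tau/\sqrt{1+\lambda^\top\lambda};\nu)}\,T\left\{\sqrt{\frac{\nu+p}{\nu+\delta(x)}}\,\bigl(\tau+\lambda^\top\Sigma^{-1/2}(x-\mu)\bigr);\nu+p\right\}. \tag{3}$$

From Valeriano et al. [35], the mean vector and variance-covariance matrix of $X$ are given by:

$$E[X]=\mu+\eta_1\Sigma^{1/2}\Delta,\qquad \mathrm{Cov}[X]=\Sigma^{1/2}\left[\gamma(I_p-\Delta\Delta^\top)+(\eta_2-\eta_1^2)\Delta\Delta^\top\right]\Sigma^{1/2}, \tag{4}$$

where $\Delta=\lambda/(1+\lambda^\top\lambda)^{1/2}$, $\gamma=\dfrac{\nu+\eta_2}{\nu-1}$, $\eta_1=\dfrac{\nu}{\nu-1}\left(1+\dfrac{\tilde\tau^2}{\nu}\right)\dfrac{t(\tilde\tau\mid\nu)}{T(\tilde\tau;\nu)}$ for $\nu>1$, and $\eta_2=\dfrac{\nu(\nu-1)}{\nu-2}\,\dfrac{T(\sqrt{(\nu-2)/\nu}\,\tilde\tau;\nu-2)}{T(\tilde\tau;\nu)}-\nu$ for $\nu>2$, with $\tilde\tau=\tau/(1+\lambda^\top\lambda)^{1/2}$. When $\tau=0$, the distribution simplifies to the skew-t distribution as described by Lachos et al. [18]. In the limits $\tau\to+\infty$ and $\nu\to+\infty$, it converges to the Student's-t and multivariate extended skew-normal (ESN) distributions, respectively. The ESN pdf is given by:

$$ESN_p(x\mid\mu,\Sigma,\lambda,\tau)=\frac{\phi_p(x\mid\mu,\Sigma)\,\Phi\bigl(\tau+\lambda^\top\Sigma^{-1/2}(x-\mu)\bigr)}{\Phi(\tau/\sqrt{1+\lambda^\top\lambda})}, \tag{5}$$

denoted as $X\sim ESN_p(\mu,\Sigma,\lambda,\tau)$. From Galarza et al. [11], the mean vector and variance-covariance matrix of an ESN random vector are:

$$E[X]=\mu+\eta\Sigma^{1/2}\lambda,\qquad \mathrm{Cov}[X]=\Sigma-\eta\Sigma^{1/2}\lambda\left(\eta\lambda+\frac{\tau}{1+\lambda^\top\lambda}\lambda\right)^{\top}\Sigma^{1/2}, \tag{6}$$

where $\eta=\phi(\tau\mid 0,1+\lambda^\top\lambda)/\Phi(\tau/\sqrt{1+\lambda^\top\lambda})$. Here, $\phi_p(\cdot\mid\mu,\Sigma)$ and $\Phi_p(\cdot\mid\mu,\Sigma)$ represent the pdf and cdf of $N_p(\mu,\Sigma)$, respectively, with the subscript $p$ omitted for $p=1$. Refer to Galarza et al. [10] and Galarza et al. [11] for more details on the EST and ESN distribution properties.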

The moments from Equation (4) are essential for the E-step in the EM algorithm of the Heckman selection-t model. Despite their asymmetry, the EST and ESN distributions arise naturally in selection models and belong to a broader class known as the multivariate selection elliptical family (see [1]).
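As a sanity check on the ESN moments in Equation (6), the univariate case ($p=1$, $\Sigma=\sigma^2$) can be compared against numerical integration of the density in Equation (5). The sketch below uses only the Python standard library; all function names and the parameter values are illustrative assumptions.

```python
import math


def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def esn_pdf(x, mu, sigma, lam, tau):
    """Univariate ESN density, Equation (5) with p = 1 and Sigma = sigma^2."""
    z = (x - mu) / sigma
    denom = norm_cdf(tau / math.sqrt(1.0 + lam * lam))
    return (norm_pdf(z) / sigma) * norm_cdf(tau + lam * z) / denom


def esn_mean_var(mu, sigma, lam, tau):
    """Closed-form mean and variance, Equation (6) with p = 1."""
    s2 = 1.0 + lam * lam
    # eta = phi(tau | 0, 1 + lam^2) / Phi(tau / sqrt(1 + lam^2))
    eta = (norm_pdf(tau / math.sqrt(s2)) / math.sqrt(s2)) / norm_cdf(tau / math.sqrt(s2))
    mean = mu + eta * sigma * lam
    var = sigma ** 2 - eta * sigma ** 2 * lam * (eta * lam + tau * lam / s2)
    return mean, var


def numeric_mean_var(mu, sigma, lam, tau, half_width=12.0, steps=48_000):
    """Midpoint-rule moments of the ESN density, used only as a cross-check."""
    lo = mu - half_width * sigma
    h = 2.0 * half_width * sigma / steps
    m0 = m1 = m2 = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        f = esn_pdf(x, mu, sigma, lam, tau) * h
        m0 += f
        m1 += x * f
        m2 += x * x * f
    mean = m1 / m0
    return mean, m2 / m0 - mean ** 2


# Illustrative parameter values (not taken from the paper).
closed = esn_mean_var(mu=1.0, sigma=2.0, lam=0.75, tau=-0.5)
numeric = numeric_mean_var(mu=1.0, sigma=2.0, lam=0.75, tau=-0.5)
print("closed form:", closed)
print("numerical  :", numeric)
```

With $\tau=0$ the closed form reduces to the familiar skew-normal mean $\mu+\sigma\delta\sqrt{2/\pi}$, where $\delta=\lambda/\sqrt{1+\lambda^2}$.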

3. The Heckman selection model

Sample selection bias and missing data often pose significant challenges in research. The SL model tackles these issues using two equations: a linear equation for the dependent variable and a Probit equation for the sample selection process. The linear equation describes the relationship between the independent and dependent variables, while the Probit equation estimates the probability of a sample being selected. The outcome equation is:

$$Y_{1i}=x_i^\top\beta+\epsilon_{1i}, \tag{7}$$

and the sample selection mechanism is described by the latent linear equation:

$$Y_{2i}=w_i^\top\gamma+\epsilon_{2i}, \tag{8}$$

for $i\in\{1,\ldots,n\}$. Here, $\beta\in\mathbb{R}^p$ and $\gamma\in\mathbb{R}^q$ are unknown parameters, and $x_i^\top=(x_{i1},\ldots,x_{ip})$ and $w_i^\top=(w_{i1},\ldots,w_{iq})$ are known characteristics. The covariates in $x_i$ and $w_i$ can overlap, and the exclusion restriction is met when at least one element of $w_i$ is not in $x_i$. The sample selection indicator is $C_i=I(Y_{2i}>0)$. We observe the outcome $V_{1i}$ only if $C_i>0$, which means $Y_{1i}=V_{1i}$ if $C_i=1$, and $Y_{1i}=\mathrm{NA}$ (missing data) if $C_i=0$. Therefore, the observed data for the $i$th subject is $(V_i,C_i)$, where $V_i$ represents the vector of censored readings and $C_i=I(Y_{2i}>0)$ is the censoring indicator.

3.1. The classical Heckman selection model

Heckman [15] assumes that the error terms are independently distributed according to a bivariate normal distribution:

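$$\begin{pmatrix}\epsilon_{1i}\\ \epsilon_{2i}\end{pmatrix}\sim N_2(0,\Sigma),\qquad \Sigma=\begin{pmatrix}\sigma^2 & \rho\sigma\\ \rho\sigma & 1\end{pmatrix}, \tag{9}$$

where the second diagonal element equals one due to the probit link associated with the latent variable $Y_{2i}$, ensuring model identifiability. The model defined in (7)–(9) is referred to as the Heckman selection (SLn) model, with parameter vector $\theta=(\beta^\top,\gamma^\top,\sigma^2,\rho)^\top$. When the selection effect is absent ($\rho=0$), it indicates that the unobserved outcomes are missing at random.

Using Bayes' rule, the conditional pdf of an observed outcome $Y_{1i}=V_{1i}$ ($C_i=1$) is (see [24]):

$$f(Y_{1i}\mid C_i=1,x_i,w_i;\theta)=\phi(V_{1i}\mid x_i^\top\beta,\sigma^2)\,\Phi\left(\frac{w_i^\top\gamma+\frac{\rho}{\sigma}(V_{1i}-x_i^\top\beta)}{\sqrt{1-\rho^2}}\right)\Big/\Phi(w_i^\top\gamma), \tag{10}$$

which belongs to the ESN family, as discussed in Subsection 2.2, i.e.

$$Y_{1i}=V_{1i}\mid(C_i=1)\sim ESN\left(\mu=x_i^\top\beta,\ \Sigma=\sigma^2,\ \lambda=\frac{\rho}{\sqrt{1-\rho^2}},\ \tau=\frac{w_i^\top\gamma}{\sqrt{1-\rho^2}}\right).$$

From (6), we have $\eta=\sqrt{1-\rho^2}\,\phi(w_i^\top\gamma)/\Phi(w_i^\top\gamma)$ and $\Sigma^{1/2}\lambda=\sigma\rho/\sqrt{1-\rho^2}$, thus the mean equation for the observed outcomes is:

$$E[Y_{1i}\mid C_i=1,x_i,w_i;\theta]=x_i^\top\beta+\rho\sigma\lambda_N(w_i^\top\gamma), \tag{11}$$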

where $\lambda_N(a)=\phi(a)/\Phi(a)$ is the inverse Mills ratio. The SLn problem can be viewed as a model misspecification case, combining a linear component, $x_i^\top\beta$, with a nonlinear correction term, $\rho\sigma\lambda_N(w_i^\top\gamma)$. Heckman [15] proposed a two-step procedure to address this, which is less efficient than ML estimation but remains robust even if the error terms are not jointly normal. In the two-step procedure, the standard probit model $P(C_i=c_i\mid w_i;\gamma)=\{\Phi(w_i^\top\gamma)\}^{c_i}\{1-\Phi(w_i^\top\gamma)\}^{1-c_i}$ provides the estimate $\hat\gamma$; then $\lambda_N(w_i^\top\hat\gamma)$ is an additional covariate in (11), and the least squares coefficient of $\lambda_N(w_i^\top\hat\gamma)$ estimates $\rho\sigma$. This can be implemented in R using the sampleSelection library [16]. Alternatively, the ML estimate of $\theta$ can be computed by maximizing the likelihood function given the observed data $(V,C)$:

$$L(\theta\mid V,C)=\prod_{i=1}^{n}\left\{\phi(V_{1i}\mid x_i^\top\beta,\sigma^2)\,\Phi\left(\frac{w_i^\top\gamma+\frac{\rho}{\sigma}(V_{1i}-x_i^\top\beta)}{\sqrt{1-\rho^2}}\right)\right\}^{c_i}\left\{\Phi(-w_i^\top\gamma)\right\}^{1-c_i}, \tag{12}$$

or via the ECM algorithm discussed in Zhao et al. [36] and Lachos et al. [19], implemented in the R package HeckmanEM [19].
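To make the roles of the inverse Mills ratio in (11) and of the two factors in (12) concrete, here is a minimal sketch in plain Python. It is not the HeckmanEM or sampleSelection implementation; the function names, the seed, and the parameter values in the Monte Carlo check are illustrative assumptions, and the linear predictors $x_i^\top\beta$ and $w_i^\top\gamma$ are passed in precomputed.

```python
import math
import random


def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def inverse_mills(a):
    """lambda_N(a) = phi(a) / Phi(a)."""
    return norm_pdf(a) / norm_cdf(a)


def selected_mean(xb, wg, sigma, rho):
    """Equation (11): E[Y1 | C = 1], with xb = x'beta and wg = w'gamma."""
    return xb + rho * sigma * inverse_mills(wg)


def sln_loglik(v1, c, xb, wg, sigma, rho):
    """Logarithm of Equation (12); v1[i] is ignored whenever c[i] == 0."""
    total = 0.0
    for v, ci, m, s in zip(v1, c, xb, wg):
        if ci == 1:
            z = (v - m) / sigma  # standardized outcome residual
            arg = (s + rho * z) / math.sqrt(1.0 - rho * rho)
            total += math.log(norm_pdf(z) / sigma) + math.log(norm_cdf(arg))
        else:
            total += math.log(norm_cdf(-s))  # P(C = 0) = Phi(-w'gamma)
    return total


# Monte Carlo check of (11) for one covariate profile (illustrative values).
rng = random.Random(7)
xb, wg, sigma, rho = 1.0, 0.5, 2.0, 0.6
selected = []
for _ in range(200_000):
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    e2 = z2                                                   # Var = 1
    e1 = sigma * (rho * z2 + math.sqrt(1.0 - rho * rho) * z1)  # Cov(e1, e2) = rho*sigma
    if wg + e2 > 0.0:                                          # selected: Y2 > 0
        selected.append(xb + e1)
mc_mean = sum(selected) / len(selected)
print(f"Monte Carlo = {mc_mean:.3f}, formula (11) = {selected_mean(xb, wg, sigma, rho):.3f}")
```

The gap between the selected-sample mean and $x_i^\top\beta$ is exactly the correction term $\rho\sigma\lambda_N(w_i^\top\gamma)$ that ordinary least squares on the selected sample ignores.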

3.2. The Heckman selection-t model

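Marchenko and Genton [24] introduced the selection-t (SLt) model to handle heavy-tailed distributions. This model assumes that the error terms in Equations (7)–(8) follow a bivariate Student's-t distribution with unknown degrees of freedom, $\nu$:

$$\begin{pmatrix}\epsilon_{1i}\\ \epsilon_{2i}\end{pmatrix}\sim t_2(0,\Sigma,\nu),\qquad \Sigma=\begin{pmatrix}\sigma^2 & \rho\sigma\\ \rho\sigma & 1\end{pmatrix}. \tag{13}$$

The model, defined by (7)–(8) and (13), is known as the Heckman selection-t (SLt) model, with parameter vector $\theta=(\beta^\top,\gamma^\top,\sigma^2,\rho,\nu)^\top$. According to Miao et al. [28] (Theorem 5), all SLt model parameters are identifiable from the observed data $P(Y_{1i},C_i=1)$, $i=1,\ldots,n$.

The conditional pdf of an observed outcome $Y_{1i}=V_{1i}$ ($C_i=1$) (see [24]) is given by:

$$f(Y_{1i}\mid C_i=1,x_i,w_i;\theta)=t(V_{1i}\mid x_i^\top\beta,\sigma^2,\nu)\,T\left(\frac{w_i^\top\gamma+\frac{\rho}{\sigma}(V_{1i}-x_i^\top\beta)}{\sqrt{1-\rho^2}}\sqrt{\frac{\nu+1}{\nu+\delta(V_{1i})}};\nu+1\right)\Big/T(w_i^\top\gamma;\nu), \tag{14}$$

that is,

$$Y_{1i}=V_{1i}\mid(C_i=1)\sim EST\left(\mu=x_i^\top\beta,\ \Sigma=\sigma^2,\ \lambda=\frac{\rho}{\sqrt{1-\rho^2}},\ \tau=\frac{w_i^\top\gamma}{\sqrt{1-\rho^2}},\ \nu\right).$$

Using Equation (4), the conditional expectation for the observed outcome simplifies to

$$E[Y_{1i}\mid C_i=1,x_i,w_i]=x_i^\top\beta+\sigma\rho\lambda_\nu(w_i^\top\gamma),\qquad \nu>1, \tag{15}$$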

by identifying the terms $\tilde\tau=w_i^\top\gamma$, $\eta_1=\dfrac{\nu+(w_i^\top\gamma)^2}{\nu-1}\,\dfrac{t(w_i^\top\gamma\mid\nu)}{T(w_i^\top\gamma;\nu)}$, $\Sigma^{1/2}\Delta=\sigma\rho$, and defining $\lambda_\nu(a)=\dfrac{\nu+a^2}{\nu-1}\,\dfrac{t(a;\nu)}{T(a;\nu)}$, the Student's-t analogue of the inverse Mills ratio. As in the SLn model, traditional OLS regression produces inconsistent results when $\rho\neq 0$. Marchenko and Genton [24, Figure 1] found that for negative values of the selection linear predictor $w_i^\top\gamma$, the conditional expectation (15) is typically underestimated in the SLn model when the degrees of freedom $\nu$ are moderate. However, this bias diminishes as the degrees of freedom increase.
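The behaviour of $\lambda_\nu$ relative to the inverse Mills ratio $\lambda_N$ can be checked numerically. The sketch below evaluates the Student's-t pdf in closed form and its cdf by composite Simpson integration, so that only the Python standard library is needed; the function names and the evaluation point $a=-1$ are illustrative assumptions.

```python
import math


def t_pdf(a, nu):
    """Standard Student's-t density t(a; nu), computed on the log scale."""
    log_c = math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0) - 0.5 * math.log(nu * math.pi)
    return math.exp(log_c - 0.5 * (nu + 1.0) * math.log1p(a * a / nu))


def t_cdf(a, nu, steps=2000):
    """T(a; nu) via composite Simpson on [0, a] plus the symmetry T(0; nu) = 0.5."""
    if a == 0.0:
        return 0.5
    h = a / steps  # negative for a < 0, which yields a negative integral; steps is even
    total = t_pdf(0.0, nu) + t_pdf(a, nu)
    for k in range(1, steps):
        total += (4.0 if k % 2 else 2.0) * t_pdf(k * h, nu)
    return 0.5 + total * h / 3.0


def lambda_nu(a, nu):
    """lambda_nu(a) = (nu + a^2) / (nu - 1) * t(a; nu) / T(a; nu), for nu > 1."""
    if nu <= 1.0:
        raise ValueError("lambda_nu requires nu > 1")
    return (nu + a * a) / (nu - 1.0) * t_pdf(a, nu) / t_cdf(a, nu)


def inverse_mills(a):
    """Normal-theory counterpart lambda_N(a) = phi(a) / Phi(a)."""
    pdf = math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))
    return pdf / cdf


a = -1.0  # a negative selection linear predictor (illustrative)
for nu in (3.0, 10.0, 100.0, 10_000.0):
    print(f"nu = {nu:>8}: lambda_nu = {lambda_nu(a, nu):.4f}")
print(f"normal limit : lambda_N  = {inverse_mills(a):.4f}")
```

For this negative predictor the correction term is larger under small $\nu$ than under normality and approaches $\lambda_N$ as $\nu$ grows, in line with the underestimation reported for the SLn model.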

The likelihood function of θ for the SLt model, given the observed sample (V,C), is expressed as:

$$L(\theta\mid V,C)=\prod_{i=1}^{n}\left\{t(V_{1i}\mid x_i^\top\beta,\sigma^2,\nu)\,T\left(\frac{w_i^\top\gamma+\frac{\rho}{\sigma}(V_{1i}-x_i^\top\beta)}{\sqrt{1-\rho^2}}\sqrt{\frac{\nu+1}{\nu+\delta(V_{1i})}};\nu+1\right)\right\}^{c_i}\times\left\{T(-w_i^\top\gamma;\nu)\right\}^{1-c_i}. \tag{16}$$

As seen, it is similar to the SLn case, with the probit selection probabilities replaced by $P(C_i=c_i\mid w_i;\gamma)=\{T(w_i^\top\gamma;\nu)\}^{c_i}\{1-T(w_i^\top\gamma;\nu)\}^{1-c_i}$. There are no closed-form expressions for the ML estimates of the parameters in (16); thus, the ML estimates are obtained numerically or via the ECM algorithm implemented in the R package HeckmanEM. We briefly outline the EM-type algorithm from Lachos et al. [19], where all parameters are updated (M-step) by treating both the outcome ($Y_{1i}$) and sample selection ($Y_{2i}$) as missing data [26,34]. Further technical details are in Lachos et al. [19].

Disregarding censoring momentarily, consider observations for n independent individuals:

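$$Y_i\sim t_2(\mu_i,\Sigma,\nu),\qquad i\in\{1,\ldots,n\}.$$

Here, $Y_i=(Y_{1i},Y_{2i})^\top$ represents the vector of independent responses for sample unit $i$, with

$$\mu_i=X_{ic}\beta_c,\qquad X_{ic}=\begin{pmatrix}x_i^\top & 0\\ 0 & w_i^\top\end{pmatrix},\qquad \beta_c=\begin{pmatrix}\beta\\ \gamma\end{pmatrix},$$

and the dispersion matrix $\Sigma$ depends on an unknown parameter vector $(\sigma,\rho)$, as defined in (9). Using representation (1) and temporarily disregarding censoring, the distribution of $Y_i$ can be hierarchically expressed as follows:

$$Y_i\mid U_i=u_i\sim N_2(\mu_i,u_i^{-1}\Sigma),\qquad U_i\sim G(\nu/2,\nu/2). \tag{17}$$

Consider $y=(y_1^\top,\ldots,y_n^\top)^\top$, $V=(V_1,\ldots,V_n)^\top$, $C=(C_1,\ldots,C_n)^\top$, and $u=(U_1,\ldots,U_n)^\top$, where we observe $(V_i,C_i)$ for the $i$th subject. In the estimation process, $y$ and $u$ are considered as hypothetical missing data and augmented with the observed dataset to form $y_c=(V^\top,C^\top,y^\top,u^\top)^\top$. Therefore, the EM-type algorithm is applied to the complete-data log-likelihood function:

$$\ell_c(\theta\mid y_c)=\sum_{i=1}^{n}\ell_{ic}(\theta),$$

where

$$\ell_{ic}(\theta)=-\frac{1}{2}\left\{\ln|\Sigma|+u_i(y_i-\mu_i)^\top\Sigma^{-1}(y_i-\mu_i)\right\}+\ln h(u_i\mid\nu)+c.$$

Here, $c$ represents a constant that does not depend on $\theta$, and $h(u_i\mid\nu)$ denotes the pdf of the $G(\nu/2,\nu/2)$ distribution. The EM algorithm for the SLt model can be outlined in the following two steps:

  • E-step:
    with the current estimate $\theta=\hat\theta^{(k)}$ at the $k$th stage of the algorithm, the E-step provides the conditional expectation of the complete-data log-likelihood function
    $$Q(\theta\mid\hat\theta^{(k)})=E[\ell_c(\theta\mid y_c)\mid V,C,\hat\theta^{(k)}]=\sum_{i=1}^{n}E[\ell_{ic}(\theta\mid y_c)\mid V_i,C_i,\hat\theta^{(k)}]=\sum_{i=1}^{n}Q_i(\theta\mid\hat\theta^{(k)}), \tag{18}$$
    where
    $$Q_i(\theta\mid\hat\theta^{(k)})=Q_i(\beta,\gamma,\sigma^2,\rho,\nu\mid\hat\theta^{(k)})=-\frac{1}{2}\ln|\Sigma|-\frac{1}{2}\mathrm{tr}\left[\left(\widehat{uy_i^2}^{(k)}-\widehat{uy_i}^{(k)}\mu_i^\top-\mu_i(\widehat{uy_i}^{(k)})^\top+\hat u_i^{(k)}\mu_i\mu_i^\top\right)\Sigma^{-1}\right], \tag{19}$$
    with $\hat u_i^{(k)}=E(U_i\mid V_i,C_i,\hat\theta^{(k)})$, $\widehat{uy_i}^{(k)}=E(U_iY_i\mid V_i,C_i,\hat\theta^{(k)})$, and $\widehat{uy_i^2}^{(k)}=E(U_iY_iY_i^\top\mid V_i,C_i,\hat\theta^{(k)})$. Computing $\hat\kappa_i^{(k)}=E\{\ln h(U_i\mid\nu)\mid V_i,C_i,\hat\theta^{(k)}\}$ directly is analytically challenging. To circumvent this, Lachos et al. [19] use a constrained ML step (CML-step) to update $\nu$ instead. In the CML-step, the actual log-likelihood function is maximized under specific constraints, rather than the Q-function. The parameter transformations $\psi=\sigma^2(1-\rho^2)$ and $\rho^{*}=\rho\sigma$ are used to derive closed-form expressions in the M-step.
  • M-step:
    the conditional maximization of $Q(\theta\mid\hat\theta^{(k)})$ is performed with respect to $\beta_c$, $\sigma^2$, and $\rho$, yielding updated estimates $\hat\beta_c^{(k+1)}$, $\hat\sigma^{2(k+1)}$, and $\hat\rho^{(k+1)}$. The closed-form expressions for these estimates, along with a suitable approach for computing ML estimate standard errors for the SLt model and residual analysis, are detailed in Lachos et al. [19]. These methodologies are implemented in the R package HeckmanEM.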
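The per-observation term $Q_i$ in Equation (19) only requires the three conditional moments produced by the E-step. The following sketch evaluates it for given moments using plain Python and explicit $2\times 2$ algebra. The truncated-moment computations of the E-step itself are not reproduced here, and the helper `complete_case_moments` is a hypothetical stand-in that covers only the trivial case of a fully observed pair with $u_i=1$.

```python
import math


def inv2(m):
    """Inverse and determinant of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]], det


def q_i(u_hat, uy_hat, uy2_hat, mu, sigma2, rho):
    """Equation (19): -0.5 * ln|Sigma| - 0.5 * tr[Gamma_i Sigma^{-1}]."""
    s = math.sqrt(sigma2)
    sigma_mat = [[sigma2, rho * s], [rho * s, 1.0]]
    s_inv, det = inv2(sigma_mat)
    # Gamma_i = uy2 - uy mu' - mu uy' + u mu mu'
    gamma = [
        [uy2_hat[r][c] - uy_hat[r] * mu[c] - mu[r] * uy_hat[c] + u_hat * mu[r] * mu[c]
         for c in range(2)]
        for r in range(2)
    ]
    trace = sum(gamma[r][c] * s_inv[c][r] for r in range(2) for c in range(2))
    return -0.5 * math.log(det) - 0.5 * trace


def complete_case_moments(y):
    """Hypothetical helper: moments when y is fully observed and u = 1."""
    return 1.0, list(y), [[y[r] * y[c] for c in range(2)] for r in range(2)]


# With u = 1 and y observed, Q_i is the bivariate normal log-density up to a constant.
u_hat, uy_hat, uy2_hat = complete_case_moments([3.0, -1.0])
value = q_i(u_hat, uy_hat, uy2_hat, mu=[1.0, -1.0], sigma2=4.0, rho=0.5)
print(f"Q_i = {value:.6f}")
```

In an actual EM iteration the three moments would come from the truncated Student's-t (or normal) distribution of the unobserved components, and the M-step would maximize the sum of these terms over $\beta_c$, $\sigma^2$, and $\rho$.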

4. Influence diagnostics

Techniques for diagnosing influence focus on determining how sensitive a model's parameter estimates are to changes in the dataset or the model's underlying assumptions. Two main approaches are employed to identify influential data points. The first approach, known as the case-deletion technique, assesses the impact of excluding a particular observation by comparing the parameter estimates before and after its removal. This involves fitting one or more models without the observation and evaluating the differences using metrics like likelihood or Cook's distance [6]. The second approach, termed the local influence method, explores the effect of making minor adjustments to an observation on the analysis results, instead of completely omitting it (Cook [7]). Building on the work of Zhu and Lee [38], we introduce case-deletion measures and local influence measures for the Heckman selection-t model, utilizing the Q-function derived during the E-step of the EM algorithm; see Zhu et al. [39]. We start by discussing the case-deletion measures, then move on to the local influence measures, and finally describe the perturbation schemes used.

4.1. Case-deletion measures

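The case-deletion approach is frequently employed to examine the impact of removing the $i$th observation from a dataset. In this context, any quantity with the subscript '$[i]$' represents the original quantity with the $i$th observation excluded. For example, $y_{c[i]}=(V_{[i]}^\top,C_{[i]}^\top,y_{[i]}^\top,u_{[i]}^\top)^\top$ refers to the complete data with the $i$th observation removed. Let $\hat\theta_{[i]}=(\hat\beta_{c[i]}^\top,\hat\sigma^2_{[i]},\hat\rho_{[i]})^\top$ be the maximizer of the function $Q_{[i]}(\theta\mid\hat\theta)=E[\ell_c(\theta\mid y_{c[i]})\mid V,C,\hat\theta]$, where $\hat\theta$ represents the ML estimates of $\theta$. To assess the effect of the $i$th observation on $\hat\theta$, we compare the difference between $\hat\theta_{[i]}$ and $\hat\theta$. If the removal of an observation significantly alters the estimates, it indicates that the observation is influential. In other words, if $\hat\theta_{[i]}$ markedly differs from $\hat\theta$, the $i$th observation may be considered influential. Since calculating $\hat\theta_{[i]}$ for every observation can be computationally intensive, the following one-step approximation $\tilde\theta_{[i]}$ is used to alleviate the computational load (see [8,39]):

$$\tilde\theta_{[i]}=\hat\theta+\{-\ddot Q(\hat\theta\mid\hat\theta)\}^{-1}\dot Q_{[i]}(\hat\theta\mid\hat\theta),\qquad \text{for } i\in\{1,\ldots,n\}, \tag{20}$$

where

$$\dot Q_{[i]}(\hat\theta\mid\hat\theta)=\frac{\partial Q_{[i]}(\theta\mid\hat\theta)}{\partial\theta}\Big|_{\theta=\hat\theta}\quad\text{and}\quad \ddot Q(\hat\theta\mid\hat\theta)=\frac{\partial^2Q(\theta\mid\hat\theta)}{\partial\theta\,\partial\theta^\top}\Big|_{\theta=\hat\theta}, \tag{21}$$

represent the gradient vector and the Hessian matrix evaluated at $\hat\theta$, respectively. Specifically, the Hessian matrix plays a critical role in the method developed by Zhu et al. [39] (see also [37]) for computing case-deletion diagnostic measures and for assessing the local influence of a given perturbation scheme. These formulas can be derived straightforwardly from Equation (18). The elements of the gradient vector $\dot Q_{[i]}(\hat\theta\mid\hat\theta)=(\dot Q_{[i]\beta_c}(\hat\theta\mid\hat\theta)^\top,\dot Q_{[i]\sigma^2}(\hat\theta\mid\hat\theta),\dot Q_{[i]\rho}(\hat\theta\mid\hat\theta))^\top$ are given by:

$$\dot Q_{[i]\beta_c}(\hat\theta\mid\hat\theta)=\frac{\partial Q_{[i]}(\theta\mid\hat\theta)}{\partial\beta_c}\Big|_{\theta=\hat\theta}=\sum_{j\neq i}X_{jc}^\top\Sigma^{-1}\widehat{uy_j}-\sum_{j\neq i}\hat u_jX_{jc}^\top\Sigma^{-1}X_{jc}\beta_c,$$

$$\dot Q_{[i]\sigma^2}(\hat\theta\mid\hat\theta)=\frac{\partial Q_{[i]}(\theta\mid\hat\theta)}{\partial\sigma^2}\Big|_{\theta=\hat\theta}=-\frac{1}{2}\sum_{j\neq i}\mathrm{tr}(\Sigma^{-1}B)+\frac{1}{2}\sum_{j\neq i}\mathrm{tr}(\hat\Gamma_j\Sigma^{-1}B\Sigma^{-1}),$$

$$\dot Q_{[i]\rho}(\hat\theta\mid\hat\theta)=\frac{\partial Q_{[i]}(\theta\mid\hat\theta)}{\partial\rho}\Big|_{\theta=\hat\theta}=-\frac{1}{2}\sum_{j\neq i}\mathrm{tr}(\Sigma^{-1}D)+\frac{1}{2}\sum_{j\neq i}\mathrm{tr}(\hat\Gamma_j\Sigma^{-1}D\Sigma^{-1}),$$

where

$$\hat\Gamma_j=\left\{\widehat{uy_j^2}-\widehat{uy_j}\mu_j^\top-\mu_j(\widehat{uy_j})^\top+\hat u_j\mu_j\mu_j^\top\right\},\qquad B=\frac{\partial\Sigma}{\partial\sigma^2}=\begin{pmatrix}1 & \frac{\rho}{2\sigma}\\ \frac{\rho}{2\sigma} & 0\end{pmatrix},\quad\text{and}\quad D=\frac{\partial\Sigma}{\partial\rho}=\begin{pmatrix}0 & \sigma\\ \sigma & 0\end{pmatrix}. \tag{22}$$

The elements of the Hessian matrix $\ddot Q(\theta\mid\hat\theta)=\sum_{i=1}^{n}\partial^2Q_i(\theta\mid\hat\theta)/\partial\theta\,\partial\theta^\top$, where $\theta=(\beta_c^\top,\sigma^2,\rho)^\top$ is the parameter vector, are given by:

$$\frac{\partial^2Q_i(\theta\mid\hat\theta)}{\partial\beta_c\,\partial\beta_c^\top}=-\hat u_iX_{ic}^\top\Sigma^{-1}X_{ic},$$

$$\frac{\partial^2Q_i(\theta\mid\hat\theta)}{\partial\beta_c\,\partial\sigma^2}=-X_{ic}^\top\Sigma^{-1}B\Sigma^{-1}\widehat{uy_i}+\hat u_iX_{ic}^\top\Sigma^{-1}B\Sigma^{-1}X_{ic}\beta_c,$$

$$\frac{\partial^2Q_i(\theta\mid\hat\theta)}{\partial\beta_c\,\partial\rho}=-X_{ic}^\top\Sigma^{-1}D\Sigma^{-1}\widehat{uy_i}+\hat u_iX_{ic}^\top\Sigma^{-1}D\Sigma^{-1}X_{ic}\beta_c,$$

$$\frac{\partial^2Q_i(\theta\mid\hat\theta)}{\partial\sigma^2\,\partial\sigma^2}=\frac{1}{2}\mathrm{tr}(\Sigma^{-1}B\Sigma^{-1}B-\Sigma^{-1}E)+\frac{1}{2}\mathrm{tr}\left[\hat\Gamma_i(\Sigma^{-1}E\Sigma^{-1}-2\Sigma^{-1}B\Sigma^{-1}B\Sigma^{-1})\right],$$

$$\frac{\partial^2Q_i(\theta\mid\hat\theta)}{\partial\sigma^2\,\partial\rho}=\frac{1}{2}\mathrm{tr}(\Sigma^{-1}D\Sigma^{-1}B-\Sigma^{-1}F)+\frac{1}{2}\mathrm{tr}\left[\hat\Gamma_i(\Sigma^{-1}F\Sigma^{-1}-2\Sigma^{-1}B\Sigma^{-1}D\Sigma^{-1})\right],$$

$$\frac{\partial^2Q_i(\theta\mid\hat\theta)}{\partial\rho\,\partial\rho}=\frac{1}{2}\mathrm{tr}(\Sigma^{-1}D\Sigma^{-1}D)-\mathrm{tr}(\hat\Gamma_i\Sigma^{-1}D\Sigma^{-1}D\Sigma^{-1}),$$

where $\hat\Gamma_i$, $B$, and $D$ are defined as in (22), $E=\dfrac{\partial^2\Sigma}{\partial(\sigma^2)^2}=\begin{pmatrix}0 & -\frac{\rho}{4\sigma^3}\\ -\frac{\rho}{4\sigma^3} & 0\end{pmatrix}$, and $F=\dfrac{\partial^2\Sigma}{\partial\sigma^2\,\partial\rho}=\begin{pmatrix}0 & \frac{1}{2\sigma}\\ \frac{1}{2\sigma} & 0\end{pmatrix}$. We obtain the Hessian matrix $\ddot Q(\hat\theta\mid\hat\theta)$ by evaluating these second-order derivatives at $\theta=\hat\theta$.

To assess influential observations, we can develop case-deletion measures such as the generalized Cook's distance and the likelihood distance. Using the distance metric proposed by Zhu et al. [39] to quantify the difference between $\hat\theta_{[i]}$ and $\hat\theta$, we define the generalized Cook's distance as follows:

$$GD_i=(\hat\theta_{[i]}-\hat\theta)^\top\{-\ddot Q(\hat\theta\mid\hat\theta)\}(\hat\theta_{[i]}-\hat\theta),\qquad i=1,\ldots,n. \tag{23}$$

By substituting (20) into (23), we derive the following approximation of the generalized Cook's distance:

$$GD_i^1=\dot Q_{[i]}(\hat\theta\mid\hat\theta)^\top\{-\ddot Q(\hat\theta\mid\hat\theta)\}^{-1}\dot Q_{[i]}(\hat\theta\mid\hat\theta).$$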
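The one-step estimate (20) and the approximate generalized Cook's distance $GD_i^1$ depend only on per-observation gradients of $Q_j$ and on the Hessian of $Q$. The sketch below implements that algebra generically in plain Python and, as a deliberately simple stand-in for the Heckman Q-function, applies it to a normal mean model with unit variance, where $Q_j(\theta)=-\tfrac{1}{2}(y_j-\theta)^2$. The function names and the toy data are illustrative assumptions.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (small systems)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def case_deletion(theta_hat, grads, hessian):
    """Return [(theta_tilde_[i], GD_i^1)] from Equation (20) and the GD_i^1 formula.

    grads[j] is the gradient of Q_j at theta_hat; hessian is the Hessian of Q.
    """
    d = len(theta_hat)
    neg_h = [[-hessian[r][c] for c in range(d)] for r in range(d)]
    total = [sum(g[r] for g in grads) for r in range(d)]
    out = []
    for g in grads:
        q_dot = [total[r] - g[r] for r in range(d)]   # gradient of Q_[i]: sum over j != i
        step = solve(neg_h, q_dot)                    # {-Q''}^{-1} Q'_[i]
        theta_tilde = [t + s for t, s in zip(theta_hat, step)]
        gd = sum(q * s for q, s in zip(q_dot, step))  # Q'_[i]' {-Q''}^{-1} Q'_[i]
        out.append((theta_tilde, gd))
    return out


# Toy stand-in model: y_j ~ N(theta, 1), so Q_j(theta) = -0.5 * (y_j - theta)^2.
y = [0.0, 0.0, 0.0, 4.0]
theta_hat = [sum(y) / len(y)]
grads = [[yj - theta_hat[0]] for yj in y]
hessian = [[-float(len(y))]]

results = case_deletion(theta_hat, grads, hessian)
gd = [r[1] for r in results]
for i, (theta_tilde, gd_i) in enumerate(results, start=1):
    print(f"i = {i}: one-step estimate = {theta_tilde[0]:.3f}, GD_i^1 = {gd_i:.3f}")
```

In this toy model the last observation receives by far the largest $GD_i^1$, and its one-step estimate moves toward the mean of the remaining observations without refitting the model.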

4.2. Local influence

In this subsection, we calculate the normal curvature of local influence, following the method described by Cook [7], for several standard perturbation schemes applied to the model or data. Specifically, we will investigate case-weight perturbation, scale matrix perturbation, explanatory variable perturbation, and response perturbation. By examining these methods, we aim to accurately assess the sensitivity and resistance of our model, enabling us to make any necessary adjustments to enhance its reliability.

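Let $\upsilon=(\upsilon_1,\ldots,\upsilon_g)^\top$ be the perturbation vector varying within an open region $\Upsilon\subset\mathbb{R}^g$, and let $\ell_c(\theta,\upsilon\mid y_c)$ denote the complete-data log-likelihood function of the perturbed model. We assume there exists $\upsilon_0\in\Upsilon$ such that $\ell_c(\theta,\upsilon_0\mid y_c)=\ell_c(\theta\mid y_c)$ for all $\theta$. Let us define

$$Q(\theta,\upsilon\mid\hat\theta)=E[\ell_c(\theta,\upsilon\mid y_c)\mid V,C,\hat\theta]\quad\text{and}\quad \hat\theta(\upsilon)=\arg\max_{\theta}\{Q(\theta,\upsilon\mid\hat\theta)\}=(\hat\beta_c(\upsilon)^\top,\hat\sigma^2(\upsilon),\hat\rho(\upsilon))^\top.$$

The influence graph is defined as $\alpha(\upsilon)=(\upsilon^\top,f_Q(\upsilon))^\top$, with $f_Q(\upsilon)$ representing the Q-displacement function, given by:

$$f_Q(\upsilon)=2\left[Q(\hat\theta\mid\hat\theta)-Q(\hat\theta(\upsilon)\mid\hat\theta)\right].$$

Building on the methodology of Cook [7] and Zhu and Lee [38], the normal curvature $C_{f_Q,h}$ of $\alpha(\upsilon)$ at $\upsilon_0$ in the direction of a unit vector $h$ can be employed to analyze the local behavior of the Q-displacement function. We define

$$\Delta_{\upsilon}=\frac{\partial^2Q(\theta,\upsilon\mid\hat\theta)}{\partial\theta\,\partial\upsilon^\top}\Big|_{\theta=\hat\theta(\upsilon)}\quad\text{and}\quad \ddot Q_{\upsilon_0}=\frac{\partial^2Q(\hat\theta(\upsilon)\mid\hat\theta)}{\partial\upsilon\,\partial\upsilon^\top}\Big|_{\upsilon=\upsilon_0}.$$

Next, it can be demonstrated that

$$C_{f_Q,h}=-2h^\top\ddot Q_{\upsilon_0}h=2h^\top\Delta_{\upsilon_0}^\top\{-\ddot Q(\hat\theta\mid\hat\theta)\}^{-1}\Delta_{\upsilon_0}h,$$

with $\ddot Q(\hat\theta\mid\hat\theta)$ as defined in (21).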

Adopting the approach described by Cook [7], the symmetric matrix $-\ddot Q_{\upsilon_0}$ offers crucial insights for detecting influential observations. We begin by applying the spectral decomposition:

$$-2\ddot Q_{\upsilon_0}=\sum_{k=1}^{g}\xi_k\epsilon_k\epsilon_k^\top,$$

where $\{(\xi_k,\epsilon_k),k=1,\ldots,g\}$ are eigenvalue-eigenvector pairs of $-2\ddot Q_{\upsilon_0}$ with $\xi_1\ge\cdots\ge\xi_r>\xi_{r+1}=\cdots=0$, and orthonormal eigenvectors $\epsilon_k$, for $k=1,\ldots,g$. Zhu and Lee [38] proposed examining all eigenvectors corresponding to nonzero eigenvalues to extract additional insights, employing the following approach:

$$\tilde\xi_k=\frac{\xi_k}{\xi_1+\cdots+\xi_r},\qquad \epsilon_k^2=(\epsilon_{k1}^2,\ldots,\epsilon_{kg}^2)^\top\quad\text{and}\quad M(0)=\sum_{k=1}^{r}\tilde\xi_k\epsilon_k^2.$$

Let $M(0)_l=\sum_{k=1}^{r}\tilde\xi_k\epsilon_{kl}^2$ represent the $l$th component of $M(0)$, where the assessment of influential cases involves visually inspecting $M(0)_l$ for $l=1,\ldots,g$, plotted against the index $l$, and considering a case influential if $M(0)_l$ exceeds a specified benchmark.

Using normal curvature to evaluate observation influence can be problematic due to the variability of $C_{f_Q,h}$, which is not invariant under uniform scale changes. To address this issue, Zhu and Lee [38] introduced the concept of conformal normal curvature, inspired by Poon and Poon [32], defined as:

$$B_{f_Q,h}=\frac{C_{f_Q,h}}{\mathrm{tr}[-2\ddot Q_{\upsilon_0}]},$$

which is straightforward to compute and satisfies $0\le B_{f_Q,h}\le 1$. Let $h_l$ be a basic perturbation vector where the $l$th entry is 1 and all other entries are 0. Zhu and Lee [38] demonstrated that $M(0)_l=B_{f_Q,h_l}$ for all $l$, allowing us to derive $M(0)_l$ from $B_{f_Q,h_l}$.

At present, there is no standard guideline for determining the influence magnitude of a given case. Let $\overline{M(0)}$ and $SM(0)$ represent the mean and standard error of $M(0)_l$, with $l=1,\ldots,g$. Since the vectors $\epsilon_k$ are orthonormal, it is straightforward to establish that $\overline{M(0)}=1/g$. Poon and Poon [32] suggested using $2\overline{M(0)}$ as a reference for $M(0)$, though alternative functions can be used. For example, Zhu and Lee [38] proposed $\overline{M(0)}+2SM(0)$ to account for the variance of $M(0)_l$. According to Lee and Xu [22], choosing a benchmark function based on $\overline{M(0)}$ is subjective; they recommended $\overline{M(0)}+c\,SM(0)$, where $c$ is a constant adaptable to specific applications. In this study, we use $c=3.5$, a choice supported by Massuia et al. [25], who found it effective in empirical research.
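Because $M(0)_l=B_{f_Q,h_l}$, the diagnostic can be computed from the diagonal of $\Delta_{\upsilon_0}^\top\{-\ddot Q(\hat\theta\mid\hat\theta)\}^{-1}\Delta_{\upsilon_0}$ divided by its trace, without an explicit eigendecomposition. The sketch below does this in plain Python and applies the benchmark $\overline{M(0)}+c\,SM(0)$ with $c=3.5$. Several points are assumptions made for illustration: the function names, the use of the sample standard deviation for $SM(0)$, the toy normal mean model standing in for the Heckman Q-function, and the convention that under case-weight perturbation the $l$th column of $\Delta_{\upsilon_0}$ is the gradient of $Q_l$ at $\hat\theta$.

```python
import statistics


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (small systems)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def conformal_curvature(delta, neg_hessian):
    """M(0)_l: diagonal of Delta' {-Q''}^{-1} Delta divided by its trace.

    delta has one row per parameter and one column per perturbation component.
    The factor 2 in the normal curvature cancels in the ratio.
    """
    d, g = len(delta), len(delta[0])
    diag = []
    for l in range(g):
        col = [delta[r][l] for r in range(d)]
        sol = solve(neg_hessian, col)
        diag.append(sum(c * s for c, s in zip(col, sol)))
    trace = sum(diag)
    return [v / trace for v in diag]


def flag_influential(m0, c=3.5):
    """Benchmark mean + c * SD (sample SD assumed) and indices exceeding it."""
    bench = statistics.mean(m0) + c * statistics.stdev(m0)
    return bench, [l for l, v in enumerate(m0) if v > bench]


# Toy stand-in: normal mean model with case-weight perturbation.
y = [0.0] * 19 + [10.0]
y_bar = sum(y) / len(y)
delta = [[yj - y_bar for yj in y]]   # 1 x n matrix: gradient of each Q_l at theta_hat
neg_hessian = [[float(len(y))]]      # -Q'' = n

m0 = conformal_curvature(delta, neg_hessian)
bench, flagged = flag_influential(m0)
print(f"mean of M(0) = {statistics.mean(m0):.4f} (1/g = {1 / len(y):.4f})")
print(f"benchmark = {bench:.4f}, flagged indices = {flagged}")
```

The components of $M(0)$ sum to one, so their mean is $1/g$ as stated in the text, and only the outlying last case exceeds the benchmark in this toy example.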

4.3. Perturbation schemes

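This section examines the matrix $\Delta_{\upsilon_0}$ across various perturbation strategies within the Heckman selection-t model. Case-weight perturbation is employed to identify observations that notably impact the log-likelihood function, potentially exerting significant influence on maximum likelihood estimates. Scale perturbation involves adjustments to the scale matrix $\Sigma$, highlighting individuals whose likelihood displacement within the scale structure is most pronounced. Response perturbation focuses on varying response values to identify observations that strongly affect their predicted outcomes. Explanatory variable perturbation helps pinpoint the values of continuous explanatory variables that are highly sensitive, as indicated by changes in the log-likelihood. Each perturbation scheme is considered in its partitioned form:

$$\Delta_{\upsilon_0}=(\Delta_{\beta_c}^\top,\Delta_{\sigma^2}^\top,\Delta_{\rho}^\top)^\top,$$

where

$$\Delta_{\beta_c}=\frac{\partial^2Q(\theta,\upsilon\mid\hat\theta)}{\partial\beta_c\,\partial\upsilon^\top}\Big|_{\theta=\hat\theta(\upsilon_0)},\qquad \Delta_{\sigma^2}=\frac{\partial^2Q(\theta,\upsilon\mid\hat\theta)}{\partial\sigma^2\,\partial\upsilon^\top}\Big|_{\theta=\hat\theta(\upsilon_0)}\quad\text{and}\quad \Delta_{\rho}=\frac{\partial^2Q(\theta,\upsilon\mid\hat\theta)}{\partial\rho\,\partial\upsilon^\top}\Big|_{\theta=\hat\theta(\upsilon_0)},$$

with $\Delta_{\beta_c}\in\mathbb{R}^{(p+q)\times g}$, $\Delta_{\sigma^2}\in\mathbb{R}^{1\times g}$ and $\Delta_{\rho}\in\mathbb{R}^{1\times g}$. Analytical expressions are provided in the following four propositions, with proofs available in the Appendix section.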

4.3.1. Case-weight perturbation

We investigate assigning arbitrary weights to the expected value of the complete-data log-likelihood function (perturbed Q-function), allowing us to account for deviations in various directions through

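$$Q(\theta,\upsilon\mid\hat\theta)=E[\ell_c(\theta,\upsilon\mid y_c)\mid V,C,\hat\theta]=\sum_{i=1}^{n}\upsilon_iE[\ell_{ic}(\theta\mid y_c)\mid V,C,\hat\theta]=\sum_{i=1}^{n}\upsilon_iQ_i(\theta\mid\hat\theta). \tag{24}$$

Here, $\upsilon=(\upsilon_1,\ldots,\upsilon_n)^\top$ is an $n\times 1$ vector, and $\upsilon_0=(1,\ldots,1)^\top$. Note that if $\upsilon_i=0$ and $\upsilon_j=1$ for $j\neq i$, the $i$th observation is excluded from the complete-data log-likelihood function.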

Proposition 4.1

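Under the case-weight perturbation scheme defined in (24), the elements of the matrix $\Delta_{\upsilon_0}$ are given by

$$\Delta_{\beta_c}=\sum_{i=1}^{n}X_{ic}^\top\Sigma^{-1}\widehat{uy_i}-\sum_{i=1}^{n}\hat u_iX_{ic}^\top\Sigma^{-1}X_{ic}\beta_c,\qquad \Delta_{\sigma^2}=-\frac{1}{2}\sum_{i=1}^{n}\mathrm{tr}(\Sigma^{-1}B)+\frac{1}{2}\sum_{i=1}^{n}\mathrm{tr}(\hat\Gamma_i\Sigma^{-1}B\Sigma^{-1})\quad\text{and}\quad \Delta_{\rho}=-\frac{1}{2}\sum_{i=1}^{n}\mathrm{tr}(\Sigma^{-1}D)+\frac{1}{2}\sum_{i=1}^{n}\mathrm{tr}(\hat\Gamma_i\Sigma^{-1}D\Sigma^{-1}),$$

where $\hat\Gamma_i$, $B$, and $D$ are defined as in (22).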

Proof.

See the Appendix.

4.3.2. Scale perturbation

To assess deviations from the assumption about the scale matrix Σ, we examine the perturbation given by

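$$\Sigma(\upsilon_i)=\upsilon_i^{-1}\Sigma,\qquad i=1,\ldots,n. \tag{25}$$

In this perturbation scheme, the original model corresponds to $\upsilon_0=(1,\ldots,1)^\top\in\mathbb{R}^n$. Furthermore, the perturbed Q-function, as in (18), replaces $\Sigma$ with $\Sigma(\upsilon_i)$.

Proposition 4.2

Under the scale perturbation scheme defined in (25), the elements of the matrix $\Delta_{\upsilon_0}$ are given by:

$$\Delta_{\beta_c}=\sum_{i=1}^{n}X_{ic}^\top\Sigma^{-1}\widehat{uy_i}-\sum_{i=1}^{n}\hat u_iX_{ic}^\top\Sigma^{-1}X_{ic}\beta_c,\qquad \Delta_{\sigma^2}=\frac{1}{2}\sum_{i=1}^{n}\mathrm{tr}(\hat\Gamma_i\Sigma^{-1}B\Sigma^{-1})\quad\text{and}\quad \Delta_{\rho}=\frac{1}{2}\sum_{i=1}^{n}\mathrm{tr}(\hat\Gamma_i\Sigma^{-1}D\Sigma^{-1}),$$

where $\hat\Gamma_i$, $B$, and $D$ are defined as in (22).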

Proof.

See the Appendix.

4.3.3. Response perturbation

To introduce a perturbation of the response variables $(y_1,\ldots,y_n)$, we replace $y_i$ with

$$y_i(\upsilon)=y_i+\upsilon_i1_2, \tag{26}$$

where $1_2$ is a $2\times 1$ vector of ones. The perturbed Q-function follows (18), replacing $y_i$ with $y_i(\upsilon)$. Here, the vector $\upsilon_0=0\in\mathbb{R}^n$ signifies no perturbation.

Proposition 4.3

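Under the response perturbation scheme defined in (26), the elements of the matrix $\Delta_{\upsilon_0}$ are as follows:

$$\Delta_{\beta_c}=\sum_{i=1}^{n}\hat u_iX_{ic}^\top\Sigma^{-1}1_2,\qquad \Delta_{\sigma^2}=-\frac{1}{2}\sum_{i=1}^{n}\mathrm{tr}(\hat\Gamma_i\Sigma^{-1}B\Sigma^{-1})\quad\text{and}\quad \Delta_{\rho}=-\frac{1}{2}\sum_{i=1}^{n}\mathrm{tr}(\hat\Gamma_i\Sigma^{-1}D\Sigma^{-1}),$$

where $\hat\Gamma_i=\left\{-2(\widehat{uy_i})1_2^\top+2\hat u_i\mu_i1_2^\top\right\}$, and $B$ and $D$ are defined as in (22).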

Proof.

See the Appendix.

4.3.4. Explanatory variable perturbation

There are three potential methods for perturbing a specific continuous explanatory variable: as a covariate in the primary regression (outcome model), as a covariate in the selection equation (selection model), or simultaneously in both equations. Here, we will focus on the first scenario, while the other two can be addressed similarly.

In this scenario, we aim to perturb a specific continuous explanatory variable in the primary regression. Under this condition, the perturbed explanatory matrix is given by:

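$$X_{ic}(\upsilon)=\begin{pmatrix}x_i^\top(\upsilon_i) & 0\\ 0 & w_i^\top\end{pmatrix}, \tag{27}$$

where $x_i^\top(\upsilon_i)=x_i^\top+\upsilon_i1_u^\top$, and $1_u^\top=(0,\ldots,1,\ldots,0)$ is a $1\times p$ vector with 1 in the $u$th column, $u=1,\ldots,p$. This approach addresses situations where the continuous covariate $x_i$ is measured with error. The perturbed Q-function follows (18), with $X_{ic}(\upsilon)$ replacing $X_{ic}$. The unperturbed case is achieved by setting $\upsilon_0=0\in\mathbb{R}^n$.

Proposition 4.4

Under the explanatory variable perturbation scheme defined in (27), $\Delta_{\upsilon_0}$ has the following elements:

$$\Delta_{\beta_c}=\sum_{i=1}^{n}G_i^\top\Sigma^{-1}\widehat{uy_i}-\sum_{i=1}^{n}\hat u_iG_i^\top\Sigma^{-1}X_{ic}\beta_c-\sum_{i=1}^{n}\hat u_iX_{ic}^\top\Sigma^{-1}G_i\beta_c,\qquad \Delta_{\sigma^2}=\frac{1}{2}\sum_{i=1}^{n}\mathrm{tr}(\hat\Gamma_i\Sigma^{-1}B\Sigma^{-1})\quad\text{and}\quad \Delta_{\rho}=\frac{1}{2}\sum_{i=1}^{n}\mathrm{tr}(\hat\Gamma_i\Sigma^{-1}D\Sigma^{-1}),$$

where $\hat\Gamma_i=\left\{-\widehat{uy_i}(G_i\beta_c)^\top-(G_i\beta_c)\widehat{uy_i}^\top+\hat u_i(G_i\beta_c)(X_{ic}\beta_c)^\top+\hat u_i(X_{ic}\beta_c)(G_i\beta_c)^\top\right\}$ and $G_i$ is a matrix of dimension $2\times(p+q)$ obtained by

$$G_i=\frac{\partial X_{ic}(\upsilon)}{\partial\upsilon_i}=\begin{pmatrix}0\ \cdots\ 1\ \cdots\ 0 & 0_{1\times q}\\ 0_{1\times p} & 0_{1\times q}\end{pmatrix},$$

with 1 in the first row of the $u$th column, $u=1,\ldots,p$. $B$ and $D$ are defined as in (22).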

Proof.

See the Appendix.

Note that while it is not feasible to cover all pertinent perturbation schemes in detail, the key lies in finding an appropriate $\upsilon$. As long as the perturbed complete-data log-likelihood function $\ell_c(\theta,\upsilon\mid y_c)$ remains sufficiently smooth, ensuring that all necessary derivatives for diagnostic measures are well-defined, conducting local influence analysis becomes feasible without significant complications. Finally, it is important to note that our approach to developing diagnostic techniques is based on $\ddot Q(\hat\theta\mid\hat\theta)$. In the context of linear mixed models, Pan and Foster [31] suggest using $E[\ddot Q(\hat\theta\mid\hat\theta)]$ instead of $\ddot Q(\hat\theta\mid\hat\theta)$ to detect potentially influential observations. They demonstrate that $E[\ddot Q(\hat\theta\mid\hat\theta)]$ is block-diagonal, which facilitates the interpretation of diagnostic measures in this setting.

The next section provides a brief simulation study to evaluate the effectiveness of the proposed diagnostic measures in identifying outliers.

5. Simulation studies

To evaluate the effectiveness of the proposed diagnostic measures, we conducted a simulation study focusing on the SLn and SLt models. The Monte Carlo simulations were designed to assess the ability of the global diagnostic measure (GD), derived via the case-deletion technique (see Section 4.1), to detect influential points in the response variable. Although analogous analyses can be extended to local influence measures, this simulation study concentrates on the global approach for simplicity.

The simulations were based on the classical Heckman selection model, as defined in Equations (7) and (8), with a sample size of $n=100$. The regression coefficients were set to $\beta=(1.0,0.5)^\top$, and the selection parameters to $\gamma=(1.28,0.30,0.50)^\top$. Covariates were specified as $X=(1,\mathrm{Unif}(-1,1))$ and $W=(X,N(0,\sigma^2))$, with $\sigma=1$ and $\rho=0.6$. The error terms followed a bivariate normal distribution, as outlined in Equation (9), and were assumed to be independent.

Data generation adhered to the Heckman model framework, implemented using the rHeckman function from the HeckmanEM package in R. We considered three levels of censoring (10%, 20%, and 40%) to examine whether the degree of censoring affects the detection of influential points.
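The data-generating mechanism described above can be sketched independently of the rHeckman function. The following plain-Python stand-in is not the HeckmanEM implementation: the function name, the seed, and the sample size used for the check are illustrative assumptions, the third selection covariate is taken to be standard normal, and the parameter values are used as printed in the text.

```python
import math
import random


def simulate_heckman(n, beta, gamma, sigma, rho, rng):
    """Simulate one SLn sample following Equations (7)-(9)."""
    rows = []
    for _ in range(n):
        x1 = rng.uniform(-1.0, 1.0)   # outcome covariate, Unif(-1, 1)
        w3 = rng.gauss(0.0, 1.0)      # extra selection covariate (exclusion restriction)
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        e2 = z2                                                   # Var(e2) = 1
        e1 = sigma * (rho * z2 + math.sqrt(1.0 - rho * rho) * z1)  # Cov(e1, e2) = rho * sigma
        y1 = beta[0] + beta[1] * x1 + e1
        y2 = gamma[0] + gamma[1] * x1 + gamma[2] * w3 + e2
        c = 1 if y2 > 0.0 else 0
        rows.append({
            "x": (1.0, x1),
            "w": (1.0, x1, w3),
            "y1": y1 if c == 1 else None,  # outcome is missing when not selected
            "c": c,
        })
    return rows


rng = random.Random(123)  # illustrative seed
sample = simulate_heckman(20_000, beta=(1.0, 0.5), gamma=(1.28, 0.30, 0.50),
                          sigma=1.0, rho=0.6, rng=rng)
censored_fraction = sum(1 for r in sample if r["c"] == 0) / len(sample)
print(f"censored fraction = {censored_fraction:.3f}")
```

In a sketch like this, the intercept of the selection equation is the natural knob for the censoring level: lowering it increases the share of unobserved outcomes.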

To identify these points, we perturbed the minimum and maximum values within each sample by $k$ standard deviations, with $k$ ranging from 0 to 3. Specifically, the minimum value was reduced and the maximum value increased by $k$ standard deviations. Even without perturbation, these extreme points were treated as potential influential candidates due to their positions in the distribution tails. Each sample was subsequently fitted with the SLn and SLt models, and the GD measure was computed. For the SLt model, two fitting strategies were used: (1) the degrees of freedom were fixed (SLt with fixed $\nu$) at the value estimated when no synthetic influential observation was added; (2) the degrees of freedom were re-estimated (SLt with adjusted $\nu$) on the datasets that include the outliers. A point was classified as influential if both the minimum and maximum values exceeded the GD benchmark threshold. This process was repeated across 500 Monte Carlo replicates.
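The contamination step can be written in a few lines. The sketch below shifts the smallest observed outcome down and the largest up by $k$ standard deviations; the function name is hypothetical, and computing the standard deviation from the observed (uncensored) outcomes only is an assumption made for illustration.

```python
import statistics


def perturb_extremes(y, k):
    """Shift the minimum down and the maximum up by k sample standard deviations.

    Entries equal to None (censored outcomes) are left untouched.
    """
    observed_idx = [i for i, v in enumerate(y) if v is not None]
    sd = statistics.stdev(y[i] for i in observed_idx)
    i_min = min(observed_idx, key=lambda i: y[i])
    i_max = max(observed_idx, key=lambda i: y[i])
    out = list(y)
    out[i_min] = y[i_min] - k * sd
    out[i_max] = y[i_max] + k * sd
    return out


original = [1.0, None, 2.0, 3.0]
shifted = perturb_extremes(original, k=2)
print(shifted)
```

Setting $k=0$ returns the data unchanged, which corresponds to the baseline columns of the simulation table.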

The results, summarized in Table 1, present the percentage of detected influential points for both the SLn and SLt models, alongside the mean and standard deviation (SD) of the influence measure, compared to the benchmark value. As expected, for the SLn model, the percentage of influential points increased with higher values of k, regardless of the censoring level, demonstrating the capacity of the diagnostic to detect influential observations as they become more severe. Meanwhile, the SLt fit shows two different behaviors. When the ν parameter is fixed, the SLt model loses its capacity to adapt to outliers, and the diagnostic therefore detects the shifted observations as influential. However, when the degrees of freedom are re-estimated in the model fit, the model absorbs the outliers by making the tails heavier, so the GD values do not exceed the diagnostic threshold. This helps explain why the SLt model is considerably more resistant than the SLn model.

Table 1.

Influence diagnostic analysis: case-deletion in SLn and SLt models under different censoring types and threshold variations ($Y_{\min} = Y_{\min} - k\,\mathrm{sd}(y)$, $Y_{\max} = Y_{\max} + k\,\mathrm{sd}(y)$).

| Censoring | Statistic | SLn, k=0 | SLn, k=1 | SLn, k=2 | SLn, k=3 | SLt fixed ν, k=0 | SLt fixed ν, k=1 | SLt fixed ν, k=2 | SLt fixed ν, k=3 | SLt adjusted ν, k=0 | SLt adjusted ν, k=1 | SLt adjusted ν, k=2 | SLt adjusted ν, k=3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10% | % Influential¹ | 2.0 | 46.4 | 95.4 | 99.6 | 0.8 | 27.7 | 81.8 | 93.0 | 0.0 | 0.2 | 0.2 | 0.0 |
| 10% | Mean measure | 0.020 | 0.025 | 0.032 | 0.044 | 0.021 | 0.022 | 0.028 | 0.035 | 0.020 | 0.019 | 0.019 | 0.018 |
| 10% | SD² measure | 0.029 | 0.055 | 0.117 | 0.203 | 0.025 | 0.039 | 0.079 | 0.134 | 0.025 | 0.022 | 0.019 | 0.017 |
| 10% | Benchmark | 0.140 | | | | 0.160 | | | | 0.160 | | | |
| 20% | % Influential | 1.8 | 44.8 | 91.2 | 99.8 | 0.2 | 28.1 | 81.6 | 93.5 | 0.0 | 0.2 | 0.2 | 0.0 |
| 20% | Mean measure | 0.021 | 0.024 | 0.033 | 0.043 | 0.021 | 0.023 | 0.029 | 0.036 | 0.020 | 0.019 | 0.019 | 0.018 |
| 20% | SD measure | 0.027 | 0.051 | 0.110 | 0.190 | 0.023 | 0.038 | 0.079 | 0.134 | 0.025 | 0.022 | 0.019 | 0.017 |
| 20% | Benchmark | 0.140 | | | | 0.160 | | | | 0.160 | | | |
| 40% | % Influential | 1.2 | 31.6 | 83.0 | 97.6 | 0.0 | 18.6 | 69.3 | 88.8 | 0.0 | 0.2 | 0.0 | 0.0 |
| 40% | Mean measure | 0.020 | 0.023 | 0.029 | 0.036 | 0.020 | 0.021 | 0.026 | 0.031 | 0.020 | 0.020 | 0.019 | 0.019 |
| 40% | SD measure | 0.025 | 0.044 | 0.089 | 0.147 | 0.021 | 0.034 | 0.067 | 0.109 | 0.022 | 0.020 | 0.018 | 0.016 |
| 40% | Benchmark | 0.140 | | | | 0.160 | | | | 0.160 | | | |

¹ % Influential: the percentage of Monte Carlo replicates in which both the minimum and maximum observations were jointly detected as influential (exceeded the benchmark value).

² SD: standard deviation.

Moreover, in the SLt with adjusted ν model, the mean and standard deviation of the GD measure remained stable or slightly decreased as k increased, indicating resistance. This behavior contrasts with the SLn and SLt with fixed ν models, where both the mean and standard deviation of GD increased with k. Additionally, higher censoring levels led to a slight decrease in the detection of influential points for the SLn and SLt with fixed ν models, whereas the SLt with adjusted ν model remained unaffected, emphasizing its resilience to outliers regardless of the degree of censoring when the degrees of freedom are estimated in the process.

Figure 1 shows, for one of the Monte Carlo iterations with 10% censoring, how the GD measure changes with k. Clearly, for the SLn and the SLt with fixed ν models, the influential observations became more extreme as k increased, with the effect being more severe for the SLn model. However, the GD measure remained stable for the SLt with adjusted ν. In this case, there was an inverse association between ν and k, that is, the estimate of ν decreased as k increased. This reduction in ν makes the tails heavier, which allows the model to absorb the outliers and illustrates the resistance of the SLt model.

Figure 1.

Approximate generalized Cook's distance ( GDi) for the SLn (left), SLt with fixed ν (center), and SLt with adjusted ν (right) models in a single Monte Carlo replication, considering minimum and maximum perturbations of k standard deviations (k = 0, 1, 2, 3) with 10% censoring.

This clearly demonstrates that the proposed diagnostic tools are effective in accurately identifying whether an observation is influential, depending on the fitted distribution. It is well established that heavy-tailed models are robust to outliers, and the diagnostic measures effectively capture this robustness.

The next section presents two real-world applications that further illustrate the efficacy of the proposed methodology.

6. Applications

6.1. Ambulatory expenditures

To illustrate the methodologies discussed, we applied them to analyze ambulatory expenditures data originally from Cameron and Trivedi [5] and later re-examined by Marchenko and Genton [24] using ML estimation in Stata, by Ding [9] using Bayesian methods, and by Lachos et al. [19] utilizing an efficient EM-type algorithm.

For our analysis, we selected the same covariates as used by Marchenko and Genton [24], Ding [9], and Lachos et al. [19]. Specifically, we focused on log expenditures (lambexp) as the outcome variable. The covariates in the outcome equation included x=(1, age, blhisp, educ, female, ins, totchr), representing age, ethnicity, education status, gender, insurance status, and number of chronic diseases, respectively. The income variable was included in the selection equation, w=(x, income), to ensure the exclusion restriction. The dataset had 526 missing values out of a total of 3328 observations.

Using the HeckmanEM package in R, we fitted both SLn and SLt models. Covariates educ and ins were found non-significant in both models, leading to their exclusion upon model readjustment to x=(1, age, blhisp, female, totchr) with w=(x, income). The estimation results are detailed in Table 2. Notably, all covariates proved significant in both the outcome and selection models for SLn and SLt. As noted by Marchenko and Genton [24], Ding [9], and Lachos et al. [19], the SLn model's 95% confidence interval of ρ contained zero (−0.563, 0.249), indicating weak evidence of selection bias. In contrast, the SLt model exhibited a 95% confidence interval of ρ of (−0.618, −0.116), suggesting a significant selection bias effect.
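The intervals for ρ can be recomputed from the estimates and standard errors reported in Table 2. The sketch below assumes a Wald-type interval (estimate ± 1.96 × standard error) on the original scale of ρ; this reproduces the reported limits to three decimals, although the text does not state which interval construction was used.

```python
def wald_ci(estimate, std_error, z=1.96):
    """Two-sided Wald interval: estimate +/- z * standard error."""
    half_width = z * std_error
    return estimate - half_width, estimate + half_width


# Estimates and standard errors of rho as reported in Table 2.
ci_sln = wald_ci(-0.157, 0.207)
ci_slt = wald_ci(-0.367, 0.128)

print("SLn:", tuple(round(v, 3) for v in ci_sln))
print("SLt:", tuple(round(v, 3) for v in ci_slt))
```

Only the SLn interval covers zero, which matches the contrast drawn between the two models.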

Table 2.

Ambulatory expenditures data: ML estimates, standard errors, and information criteria.

| Parameter | SLn Estimate | SLn Std. error | SLn z | SLn p | SLt Estimate | SLt Std. error | SLt z | SLt p |
|---|---|---|---|---|---|---|---|---|
| Outcome model | | | | | | | | |
| intercept | 5.317 | 0.173 | 30.745 | 0.000 | 5.471 | 0.135 | 40.374 | 0.000 |
| age | 0.209 | 0.024 | 8.653 | 0.000 | 0.202 | 0.023 | 8.738 | 0.000 |
| blhisp | −0.233 | 0.068 | −3.411 | 0.001 | −0.200 | 0.059 | −3.367 | 0.001 |
| female | 0.342 | 0.071 | 4.831 | 0.000 | 0.295 | 0.059 | 4.973 | 0.000 |
| totchr | 0.534 | 0.051 | 10.413 | 0.000 | 0.505 | 0.041 | 12.438 | 0.000 |
| Selection model | | | | | | | | |
| intercept | 0.126 | 0.115 | 1.088 | 0.277 | 0.091 | 0.124 | 0.735 | 0.462 |
| age | 0.088 | 0.026 | 3.344 | 0.001 | 0.098 | 0.029 | 3.427 | 0.001 |
| blhisp | −0.435 | 0.060 | −7.189 | 0.000 | −0.465 | 0.064 | −7.235 | 0.000 |
| female | 0.687 | 0.060 | 11.404 | 0.000 | 0.748 | 0.066 | 11.308 | 0.000 |
| totchr | 0.780 | 0.068 | 11.457 | 0.000 | 0.869 | 0.083 | 10.437 | 0.000 |
| income | 0.005 | 0.001 | 4.512 | 0.000 | 0.006 | 0.001 | 4.492 | 0.000 |
| σ | 1.274 | 0.021 | | | 1.203 | 0.025 | | |
| ρ | −0.157 | 0.207 | | | −0.367 | 0.128 | | |
| ν | | | | | 13.089 | | | |
| AIC | 11713.780 | | | | 11686.060 | | | |
| BIC | 11719.890 | | | | 11692.170 | | | |

Lachos et al. [19] performed a residual analysis and concluded that the SLt model provided a better fit than the SLn model. Subsequently, we investigated the dataset for influential observations using the case-deletion approach ($GD_i$), the measure $M(0)$ derived from the conformal normal curvature $B_{f_Q, d_l}$, and the perturbation schemes outlined in Section 4.3.

Figure 2 presents the approximate generalized Cook's distance $GD_i$ for the SLn (left panel) and SLt (right panel) fitted models. Higher $GD_i$ values indicate a greater impact of the ith observation on the ML parameter estimates. Adapting the suggestion of Barros et al. [2], we used (2×npar)/n as a benchmark for $GD_i$, where npar denotes the number of estimated model parameters and n the sample size. To enhance clarity, we highlighted observations with high $GD_i$ values. Notably, fewer observations exceeded the benchmark in the SLt model (18 points, 0.54% of data) compared to the SLn model (55 points, 1.65% of data), underscoring the resistance of the heavy-tailed model, as expected. This highlights the efficacy of our diagnostic approach in identifying influential observations. Heavy-tailed distributions are recognized for their resilience to influential observations (see, e.g., [12,25,27]). Figure 2 confirms the resistant behavior of the Student-t distribution in assessing $GD_i$. Additionally, it is noteworthy that all influential observations identified in the SLt model also appeared in the SLn model, though with lower $GD_i$ values.
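The benchmark rule is simple enough to sketch directly. In the Python illustration below, the function names are hypothetical and the GD values are made up; the parameter counts are read off the tables (7 and 8 parameters in the simulation design, and 13 and 14 for the ambulatory data when counting 5 outcome coefficients, 6 selection coefficients, σ, ρ, and, for SLt, ν).

```python
def gd_benchmark(n_par, n_obs):
    """Benchmark 2 * npar / n for the generalized Cook's distance."""
    return 2.0 * n_par / n_obs


def flag_influential(gd_values, n_par):
    """Return the 0-based indices whose GD value exceeds the benchmark."""
    cutoff = gd_benchmark(n_par, len(gd_values))
    return [i for i, gd in enumerate(gd_values) if gd > cutoff]


# Simulation setting (n = 100): 7 parameters for SLn, 8 for SLt.
print(gd_benchmark(7, 100), gd_benchmark(8, 100))  # 0.14 0.16, as in Table 1

# Ambulatory data (n = 3328), with parameter counts inferred from Table 2.
print(round(gd_benchmark(13, 3328), 5), round(gd_benchmark(14, 3328), 5))

# Hypothetical GD values, for illustration only.
toy_gd = [0.01, 0.02, 0.90, 0.03, 0.05, 0.04, 0.02, 0.01, 0.03, 0.02]
print(flag_influential(toy_gd, n_par=2))
```

Because the cutoff shrinks with n, the same absolute GD value is more likely to be flagged in a large dataset, which is worth keeping in mind when comparing counts across applications.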

Figure 2.

Ambulatory expenditures data: approximate generalized Cook's distance ( GDi) for the SLn (left) and SLt (right) fitted models.

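Next, we conducted a local influence study based on M(0), guided by Sections 4.2 and 4.3. Here, we used the criterion $M(0)_l > \overline{M(0)} + 3.5\,S_{M(0)}$, for $l = 1, \ldots, 3328$, to identify influential observations, where $\overline{M(0)}$ and $S_{M(0)}$ denote the mean and standard deviation of the $M(0)_l$ values.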
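The mean-plus-c-standard-deviations cutoff can be illustrated as follows. This is a sketch in Python with hypothetical names and made-up M(0) values; it assumes the sample standard deviation (divisor n − 1), which the text does not specify.

```python
import statistics


def m0_cutoff(m0, c=3.5):
    """Benchmark mean(M(0)) + c * SD(M(0)); the sample SD is an assumption."""
    return statistics.mean(m0) + c * statistics.stdev(m0)


def flag_m0(m0, c=3.5):
    """Return the 1-based indices l whose M(0)_l exceeds the benchmark."""
    cutoff = m0_cutoff(m0, c)
    return [l for l, value in enumerate(m0, start=1) if value > cutoff]


# Hypothetical M(0) values: 49 unremarkable observations and one spike.
toy_m0 = [0.01] * 49 + [0.51]
print(round(m0_cutoff(toy_m0), 4), flag_m0(toy_m0))
```

A larger c makes the rule more conservative; the value c = 3.5 used here follows the benchmark quoted in the figure captions.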

Figure 3 presents SLn and SLt model results under case-weight perturbation, scale matrix perturbation, and response perturbation schemes. To enhance clarity, we highlighted points with high M(0)l values. Analysis of case-weight and scale matrix perturbations revealed a greater number of influential points detected in the SLn model compared to the SLt model, consistent with observations in Figure 2. Conversely, response perturbation yielded similar findings of influential observations across both models.

Figure 3.

Ambulatory expenditures data: index plot of M(0) under case-weight, scale matrix, and response perturbations for SLn (left) and SLt (right) fitted models. Horizontal lines indicate the Lee and Xu [22] benchmark for M(0) with c=3.5.

Figure 4 illustrates the explanatory variable perturbation for the two continuous covariates included in the primary regression. As anticipated, the SLn model identifies several influential points when perturbing the age and totchr covariates. In contrast, the SLt model designates only a few observations as influential points when perturbing the totchr covariate (number of chronic diseases). Regarding the age covariate, the SLt model effectively accommodates the observations, with no additional influential points identified.

Figure 4.

Ambulatory expenditures data: index plot of M(0) under explanatory variable perturbation for SLn (left) and SLt (right) fitted models. Horizontal lines indicate the Lee and Xu [22] benchmark for M(0) with c=3.5.

To further assess the effectiveness of the proposed diagnostic measures, we refitted the SLn and SLt models by excluding specific data points. Based on the points identified as potentially influential by our proposal, we implemented the following strategy: for the SLn fit, we initially excluded 55 non-influential observations (with the lowest values of GD), and alternatively, for comparison purposes, we removed all 55 observations with GD values above the benchmark. Similarly, for the SLt fit, we excluded 18 non-influential observations (with the lowest values of GD) and then eliminated all 18 observations with GD values above the benchmark. Table 3 displays the relative percentage changes (RC) in these estimates, calculated as

$$RC_{\hat{\eta}} = \left| \frac{\hat{\eta} - \hat{\eta}_{[i]}}{\hat{\eta}} \right| \times 100\%, \qquad (28)$$

where $\eta = \beta_0, \ldots, \beta_4, \gamma_0, \ldots, \gamma_5, \sigma, \rho, \nu$ and $\hat{\eta}_{[i]}$ denotes the ML estimate of $\eta$ after the corresponding set of observations has been removed. As expected, when we remove points with low values of GD (considered non-influential), we observe in Table 3 that the relative percentage changes are very small, indicating that their removal does not substantially affect the ML estimates in either model. Conversely, when we exclude the points identified by the GD measure as influential, we observe substantial relative changes in the ML estimates for both models, several exceeding 10%. Therefore, our proposal correctly discriminates the influential points from the non-influential ones.
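As a small numerical illustration of Equation (28), the sketch below applies the relative-change formula to two SLn estimates for the Mroz data that are reported later in Tables 4 and 5 (before and after excluding observations). The function name is hypothetical, and the example only demonstrates the arithmetic; it does not reproduce Table 3.

```python
def relative_change(original, refit):
    """Relative change (in %) of an estimate after refitting, as in Equation (28)."""
    return abs((original - refit) / original) * 100.0


# SLn estimates for the Mroz data: Table 4 (all data) vs. Table 5 (after exclusion).
rc_fatheduc = relative_change(-0.020, -0.023)
rc_rho = relative_change(-0.780, -0.851)

print(round(rc_fatheduc, 1))  # fatheduc coefficient in the selection model
print(round(rc_rho, 1))       # correlation parameter rho
```

Because the change is divided by the original estimate, coefficients close to zero can show very large relative changes even when the absolute shift is small, which should be kept in mind when reading the intercept rows of Table 3.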

Table 3.

Ambulatory expenditures data: relative changes (in %) of ML estimates.

| Parameter | SLn: dropping 55 points without influenceᵃ | SLn: dropping 55 points with influenceᵇ | SLt: dropping 18 points without influenceᵃ | SLt: dropping 18 points with influenceᵇ |
|---|---|---|---|---|
| Outcome model | | | | |
| intercept | 0.12 | 0.68 | 0.08 | 0.17 |
| age | 0.06 | 3.44 | 0.35 | 1.47 |
| blhisp | 0.44 | 6.00 | 0.26 | 5.48 |
| female | 0.97 | 4.66 | 0.42 | 4.04 |
| totchr | 0.03 | 4.10 | 0.18 | 2.40 |
| Selection model | | | | |
| intercept | 2.27 | 120.75 | 4.16 | 78.05 |
| age | 1.96 | 18.12 | 1.38 | 3.72 |
| blhisp | 0.96 | 7.75 | 0.47 | 3.06 |
| female | 0.19 | 13.14 | 0.71 | 1.23 |
| totchr | 0.69 | 22.55 | 0.61 | 6.11 |
| income | 3.78 | 51.65 | 0.55 | 45.02 |
| σ | 0.90 | 3.15 | 0.62 | 0.61 |
| ρ | 6.35 | 8.79 | 0.70 | 8.71 |
| ν | | | 3.80 | 0.33 |

ᵃ Lower GD values.

ᵇ GD values above the benchmark.

6.2. Mroz: labor supply data

In this second application, our focus is on analyzing missing econometric data through a reexamination of the dataset originally introduced by Mroz [29]. Our goal is to estimate the wage offer function for married women using the diagnostic tools proposed in the methodology. The dataset, referred to as the ‘Mroz data’, consists of observations on 753 married white women across 21 variables. This dataset is available in the R package AER [17], accessible via the command data("PSID1976"). Notably, the variable of interest, female wage, is missing for 325 (43%) of the 753 women in the sample. To illustrate the diagnostic techniques, we adopt the same set of covariates used by Ogundimu and Hutton [30]. Specifically, the logarithm of wage depends on education status and city, represented as x=(1,educ,city). The selection equation incorporates husband's wage, number of children aged 5 years or younger, marginal tax rate of the wife, and the wife's father's educational attainment, alongside the education and city variables. Thus, w=(x,huswage,kids5,mtr,fatheduc).

We fitted both SLn and SLt models using the HeckmanEM package in R. Table 4 summarizes the parameter estimates and their corresponding p-values. Notably, while the statistical significance of the covariates is similar in both models, the SLt model yielded a small estimated value of ν=3.001, which points to heavy tails and suggests that the SLn model is inadequate for the Mroz data. Moreover, the estimate of σ decreased from 0.8 in the SLn model to 0.5 in the SLt model. Both models yielded strongly negative estimates of ρ (−0.780 for SLn and −0.733 for SLt), indicating non-random sample selection.

Table 4.

Mroz data: ML estimates, standard errors, and information criteria.

| Parameter | SLn Estimate | SLn Std. error | SLn z | SLn p | SLt Estimate | SLt Std. error | SLt z | SLt p |
|---|---|---|---|---|---|---|---|---|
| Outcome model | | | | | | | | |
| intercept | 0.669 | 0.239 | 2.798 | 0.005 | 0.332 | 0.170 | 1.959 | 0.051 |
| educ | 0.066 | 0.018 | 3.559 | 0.000 | 0.087 | 0.013 | 6.719 | 0.000 |
| city | 0.107 | 0.082 | 1.306 | 0.192 | 0.094 | 0.059 | 1.602 | 0.110 |
| Selection model | | | | | | | | |
| intercept | 3.802 | 0.764 | 4.975 | 0.000 | 5.934 | 0.953 | 6.228 | 0.000 |
| huswage | −0.103 | 0.015 | −6.812 | 0.000 | −0.153 | 0.021 | −7.387 | 0.000 |
| kids5 | −0.415 | 0.078 | −5.345 | 0.000 | −0.585 | 0.108 | −5.438 | 0.000 |
| mtr | −5.782 | 0.847 | −6.825 | 0.000 | −8.448 | 1.089 | −7.756 | 0.000 |
| fatheduc | −0.020 | 0.013 | −1.617 | 0.106 | −0.012 | 0.016 | −0.793 | 0.428 |
| educ | 0.112 | 0.024 | 4.653 | 0.000 | 0.118 | 0.029 | 4.140 | 0.000 |
| city | −0.040 | 0.107 | −0.370 | 0.712 | −0.097 | 0.123 | −0.784 | 0.433 |
| σ | 0.800 | 0.028 | | | 0.501 | 0.030 | | |
| ρ | −0.780 | 0.040 | | | −0.733 | 0.061 | | |
| ν | | | | | 3.001 | | | |
| AIC | 1765.604 | | | | 1678.728 | | | |
| BIC | 1770.228 | | | | 1683.352 | | | |

Figure 5 presents the normal probability plot of residuals generated by the HeckmanEM package, highlighting a better fit for the SLt model, corroborated by lower AIC and BIC values (as seen in Table 4). Additionally, Figure 6 shows the approximate generalized Cook's distance ( GDi) for the Mroz data in both SLn and SLt models. Interestingly, the SLt model identified only 4 influential points (specifically 84, 176, 369, and 423), representing a 73.3% reduction from the 15 influential points identified by the SLn model.

Figure 5.

Mroz data: normal probability plot of normalized quantile residual for SLn (left) and SLt (right) models.

Figure 6.

Mroz data: approximate generalized Cook's distance ( GDi) for the SLn (left) and SLt (right) fitted models.

Furthermore, a local influence study based on M(0) (Sections 4.2 and 4.3) was conducted. Figure 7 displays results for both SLn (left) and SLt (right) models under various perturbations (case-weight, scale matrix, response, and explanatory variables). Regarding the latter, we exclusively present the perturbation graph for the continuous covariate educ.

Figure 7.

Mroz data: index plot of M(0) under perturbations of case-weight, scale matrix, response, and explanatory variables (covariate educ) for SLn (left) and SLt (right) fitted models. Horizontal lines mark the benchmark for M(0) by Lee and Xu [22] with c=3.5.

Based on Figure 7, it is evident that several points identified in Figure 6 ( GDi) reappear during the case-weight, scale matrix, and explanatory variable educ perturbations, but exclusively in the SLn model fit. In contrast, these points do not exhibit prominence in the SLt model fit. As previously discussed, these outcomes align with the inherent characteristics of the SLn and SLt models. The efficacy of the proposed diagnostic techniques in detecting these discrepancies is noteworthy. Lastly, concerning the response perturbation, similar patterns among the highlighted points are observed across both model fits.

Based on the influence methods, the resistance of the SLt model to atypical observations is reinforced. In particular, these diagnostic tools enable us to quantify how much the ML estimates of θ are affected by altering a single observation $Y_i$ by ξ units. Specifically, we modify a single observation $y_i$ to $y_i(\xi) = y_i + \xi$, and then compute the relative change in the estimates, $(\hat{\theta}(\xi) - \hat{\theta})/\hat{\theta}$, where $\hat{\theta}$ represents the original estimate and $\hat{\theta}(\xi)$ denotes the estimate obtained from the contaminated data. In this instance, we manipulated the observation corresponding to subject 369, varying ξ from −5 to 5 in increments of 1. Figure 8 illustrates the relative changes in the estimate of $\beta = (\beta_0, \beta_1, \beta_2)$, corresponding to (intercept, educ, city) of the outcome model, for different levels of ξ, under both SLn and SLt models. As anticipated, the SLt model exhibits less pronounced fluctuations in the estimates than the SLn model as ξ varies.
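The contamination exercise has a simple generic structure: shift one response, refit, and record the percentage change of each estimate. The sketch below shows that loop in Python. It is only an illustration: ordinary least squares with one covariate is used as a stand-in estimator so that the example is self-contained, and it is not the SLn or SLt fit. The data and all function names are hypothetical.

```python
def ols_fit(x, y):
    """Least-squares (intercept, slope) for a single covariate."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def contamination_curve(x, y, index, shifts, fit=ols_fit):
    """Percentage change of each estimate when y[index] is shifted by xi."""
    base = fit(x, y)
    curve = {}
    for xi in shifts:
        y_cont = list(y)
        y_cont[index] = y[index] + xi       # contaminate a single response
        refit = fit(x, y_cont)
        curve[xi] = tuple(100.0 * (r - b) / b for r, b in zip(refit, base))
    return curve


# Hypothetical data; the shifted case plays the role of the contaminated subject.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [3.2, 3.9, 5.2, 5.8, 7.1, 8.1]
curve = contamination_curve(x, y, index=5, shifts=range(-5, 6))
for xi in (-5, 0, 5):
    print(xi, tuple(round(v, 1) for v in curve[xi]))
```

The `fit` argument is the only model-specific piece, so any routine returning a tuple of estimates could be substituted for the stand-in used here.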

Figure 8.

Mroz data: relative changes in ML estimates of β0, β1, and β2 for SLn and SLt models under different contamination levels ξ for subject 369. Percentage change = $100 \times (\hat{\theta}(\xi) - \hat{\theta})/\hat{\theta}$, where $\hat{\theta}$ denotes the original estimate and $\hat{\theta}(\xi)$ denotes the estimate for the contaminated data.

An important point to emphasize is the significant role that diagnostic techniques play in model inference. To illustrate this, we re-estimated the SLn and SLt models using the Mroz dataset, excluding all points identified as influential in the response perturbation analysis (see Figure 7). The revised parameter estimates and their corresponding p-values are presented in Table 5. A notable change was observed in the covariate fatheduc, included in the selection equation. Initially, in the SLn model, its p-value was 0.106 (see Table 4), making it non-significant at the 5% level. However, after re-estimation, the p-value dropped to 0.042, rendering the variable statistically significant and altering the corresponding inference. In the SLt model, no substantive change in the conclusions was observed when the potentially influential observations were removed, which is an indication of the resistance of the SLt model. Regardless, these results underscore the importance of diagnostic tools in detecting possible influential observations and the need to study their effects on the model fit, since they can even change the conclusions of the model, as observed for SLn.

Table 5.

Mroz data: ML estimates, standard errors, and information criteria after excluding influential points identified in response perturbation.

| Parameter | SLn Estimate | SLn Std. error | SLn z | SLn p | SLt Estimate | SLt Std. error | SLt z | SLt p |
|---|---|---|---|---|---|---|---|---|
| Outcome model | | | | | | | | |
| intercept | 0.749 | 0.229 | 3.273 | 0.001 | 0.343 | 0.170 | 2.016 | 0.044 |
| educ | 0.062 | 0.018 | 3.484 | 0.001 | 0.086 | 0.013 | 6.640 | 0.000 |
| city | 0.095 | 0.080 | 1.190 | 0.235 | 0.095 | 0.059 | 1.604 | 0.109 |
| Selection model | | | | | | | | |
| intercept | 3.768 | 0.743 | 5.075 | 0.000 | 5.910 | 0.953 | 6.201 | 0.000 |
| huswage | −0.099 | 0.014 | −6.993 | 0.000 | −0.148 | 0.021 | −7.136 | 0.000 |
| kids5 | −0.366 | 0.072 | −5.056 | 0.000 | −0.558 | 0.108 | −5.144 | 0.000 |
| mtr | −5.770 | 0.825 | −6.996 | 0.000 | −8.436 | 1.090 | −7.738 | 0.000 |
| fatheduc | −0.023 | 0.011 | −2.040 | 0.042 | −0.013 | 0.016 | −0.862 | 0.389 |
| educ | 0.113 | 0.024 | 4.787 | 0.000 | 0.117 | 0.028 | 4.125 | 0.000 |
| city | −0.056 | 0.105 | −0.533 | 0.594 | −0.103 | 0.123 | −0.837 | 0.403 |
| σ | 0.814 | 0.027 | | | 0.503 | 0.030 | | |
| ρ | −0.851 | 0.028 | | | −0.740 | 0.060 | | |
| ν | | | | | 3.001 | | | |
| AIC | 1721.049 | | | | 1676.327 | | | |
| BIC | 1725.661 | | | | 1680.941 | | | |

In conclusion, our diagnostic methodology effectively identified influential points in the analysis of real data. Moreover, it confirmed the greater resistance of the SLt model, which yielded considerably fewer influential observations than the SLn model.

7. Conclusions

To the best of our knowledge, this article is the first to introduce diagnostic tools designed to identify outliers and influential observations in Heckman selection models, filling this gap in the literature. Specifically, this is done for the SLt and SLn models, which assume that the joint distribution of the outcome and sample selection follows either the Student's-t or the normal distribution. Our approach utilizes the Q-function derived from the EM algorithm specific to these models. Nevertheless, the techniques employed can be extended to any selection model that relies on the EM algorithm.

From the real data analysis and simulation studies, we found that our diagnostic tools effectively distinguish between influential and non-influential observations. Additionally, our findings complement the resistant likelihood-based inference methods pioneered by Lachos et al. [19] for analyzing SLt (and SLn) models, particularly suited for selection bias scenarios. Further, the simulation study not only showed that our diagnostic tools correctly detect influential observations but also, as a side effect, helps the reader understand why the SLt model is resistant to outliers. Our proposed methodology has been integrated into the R package HeckmanEM, offering practitioners a user-friendly tool for applying these diagnostics in various domains. Furthermore, this package promotes the reproducibility of our research outcomes, supporting transparency and reliability in subsequent applications.

Future work includes developing diagnostic tools for other Heckman selection models that rely on EM-type algorithms, e.g., the Heckman selection contaminated normal model (SLcn) introduced by Lim et al. [23], and studying the relationship between the SLn and SLt models and the broader family of extended skew-elliptical distributions [10,11] in order to build influence diagnostics from a broader perspective.

Appendix.

The following results of matrix differentiation will be used in the proofs of some propositions.

Lemma A.1

Let A be an n×n symmetric matrix, and let x, t, and a be vectors of dimension n×1. Then

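$$\frac{\partial\, t^{\top} A t}{\partial t} = 2At, \qquad \frac{\partial\, a^{\top} t}{\partial t} = a, \qquad \frac{\partial\, At}{\partial t^{\top}} = A, \qquad \frac{\partial\, At}{\partial t} = \mathrm{vec}(A).$$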

Proof.

The proof can be found in Graham [13].

Lemma A.2

Let A denote a positive definite n×n matrix, which is therefore symmetric, and let t be a scalar. Then,

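$$\frac{\partial A^{-1}}{\partial t} = -A^{-1}\frac{\partial A}{\partial t}A^{-1}, \qquad \frac{\partial\, \mathrm{tr}(A)}{\partial t} = \mathrm{tr}\!\left(\frac{\partial A}{\partial t}\right), \qquad \frac{\partial |A|}{\partial t} = |A|\,\mathrm{tr}\!\left(A^{-1}\frac{\partial A}{\partial t}\right), \qquad \frac{\partial \log|A|}{\partial t} = \mathrm{tr}\!\left(A^{-1}\frac{\partial A}{\partial t}\right).$$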

Proof.

The proof can be found in Graham [13].
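Two of the identities in Lemma A.2, those for the inverse and for the log-determinant, can be checked numerically with central finite differences. The sketch below does this in plain Python for a hypothetical 2×2 positive definite matrix A(t); it is a sanity check on a single example, not a proof.

```python
import math


def A(t):
    """A symmetric positive definite 2x2 matrix depending on a scalar t."""
    return [[2.0 + t * t, 0.5 * t], [0.5 * t, 1.0 + t]]


def dA(t):
    """Element-wise derivative of A(t) with respect to t."""
    return [[2.0 * t, 0.5], [0.5, 1.0]]


def det(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def inv(m):
    d = det(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def trace(m):
    return m[0][0] + m[1][1]


t, h = 0.7, 1e-6

# d log|A| / dt: central finite difference vs. tr(A^{-1} dA/dt).
numeric_logdet = (math.log(det(A(t + h))) - math.log(det(A(t - h)))) / (2 * h)
analytic_logdet = trace(matmul(inv(A(t)), dA(t)))

# d A^{-1} / dt: finite difference vs. -A^{-1} (dA/dt) A^{-1}, entry (0, 1).
numeric_inv = (inv(A(t + h))[0][1] - inv(A(t - h))[0][1]) / (2 * h)
analytic_inv = -matmul(matmul(inv(A(t)), dA(t)), inv(A(t)))[0][1]

print(abs(numeric_logdet - analytic_logdet) < 1e-6)
print(abs(numeric_inv - analytic_inv) < 1e-6)
```

The same pattern extends to the trace and determinant identities by swapping the function that is differenced.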

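Proof of Proposition 4.1.

Starting from Equation (24), we can express $Q(\theta,\upsilon \mid \hat{\theta})$ as a summation: $Q(\theta,\upsilon \mid \hat{\theta}) = \sum_{i=1}^{n} \upsilon_i Q_i(\theta \mid \hat{\theta})$. Substituting $Q_i(\theta \mid \hat{\theta}^{(k)})$ as defined in (19), and omitting the superscript $(k)$ for simplicity, we obtain:

$$Q(\theta,\upsilon \mid \hat{\theta}) = \sum_{i=1}^{n} Q_i(\theta,\upsilon \mid \hat{\theta}) = \sum_{i=1}^{n}\left[-\frac{1}{2}\upsilon_i \ln|\Sigma| - \frac{1}{2}\upsilon_i\,\mathrm{tr}\!\left(\hat{\Gamma}_i \Sigma^{-1}\right)\right],$$

where $\hat{\Gamma}_i = \widehat{uy_i^2} - \widehat{uy_i}\,\mu_i^{\top} - \mu_i(\widehat{uy_i})^{\top} + \hat{u}_i\,\mu_i\mu_i^{\top}$, and $\mu_i = X_{ic}\beta_c$. Now, applying the results of Lemmas A.1 and A.2, and considering $B = \partial\Sigma/\partial\sigma^2$ and $D = \partial\Sigma/\partial\rho$, we have:

$$\frac{\partial Q_i(\theta,\upsilon \mid \hat{\theta})}{\partial \beta_c} = \upsilon_i\left[X_{ic}^{\top}\Sigma^{-1}\widehat{uy_i} - \hat{u}_i X_{ic}^{\top}\Sigma^{-1}X_{ic}\beta_c\right],$$

$$\frac{\partial Q_i(\theta,\upsilon \mid \hat{\theta})}{\partial \sigma^2} = -\frac{1}{2}\upsilon_i\,\mathrm{tr}\!\left(\Sigma^{-1}B\right) + \frac{1}{2}\upsilon_i\,\mathrm{tr}\!\left(\hat{\Gamma}_i\Sigma^{-1}B\Sigma^{-1}\right),$$

$$\frac{\partial Q_i(\theta,\upsilon \mid \hat{\theta})}{\partial \rho} = -\frac{1}{2}\upsilon_i\,\mathrm{tr}\!\left(\Sigma^{-1}D\right) + \frac{1}{2}\upsilon_i\,\mathrm{tr}\!\left(\hat{\Gamma}_i\Sigma^{-1}D\Sigma^{-1}\right).$$

Now, differentiating with respect to $\upsilon$ and evaluating at $\theta = \hat{\theta}(\upsilon_0)$, we obtain:

$$\Delta_{\beta_c} = \left.\frac{\partial^2 Q(\theta,\upsilon \mid \hat{\theta})}{\partial \beta_c\,\partial \upsilon^{\top}}\right|_{\theta=\hat{\theta}(\upsilon_0)} = \sum_{i=1}^{n}\left.\frac{\partial^2 Q_i(\theta,\upsilon \mid \hat{\theta})}{\partial \beta_c\,\partial \upsilon^{\top}}\right|_{\theta=\hat{\theta}(\upsilon_0)} = \sum_{i=1}^{n}\left[X_{ic}^{\top}\Sigma^{-1}\widehat{uy_i} - \hat{u}_i X_{ic}^{\top}\Sigma^{-1}X_{ic}\beta_c\right],$$

$$\Delta_{\sigma^2} = \sum_{i=1}^{n}\left[-\frac{1}{2}\mathrm{tr}\!\left(\Sigma^{-1}B\right) + \frac{1}{2}\mathrm{tr}\!\left(\hat{\Gamma}_i\Sigma^{-1}B\Sigma^{-1}\right)\right] \quad\text{and}\quad \Delta_{\rho} = \sum_{i=1}^{n}\left[-\frac{1}{2}\mathrm{tr}\!\left(\Sigma^{-1}D\right) + \frac{1}{2}\mathrm{tr}\!\left(\hat{\Gamma}_i\Sigma^{-1}D\Sigma^{-1}\right)\right].$$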
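The derivative with respect to $\beta_c$ used in this proof can be verified by a short expansion. As a sketch, using only the definition of $\hat{\Gamma}_i$, the relation $\mu_i = X_{ic}\beta_c$, the symmetry of $\Sigma^{-1}$, and Lemma A.1:

```latex
\begin{align*}
-\tfrac{1}{2}\,\mathrm{tr}\!\left(\hat{\Gamma}_i \Sigma^{-1}\right)
  &= -\tfrac{1}{2}\,\mathrm{tr}\!\left(\widehat{uy_i^2}\,\Sigma^{-1}\right)
     + \mu_i^{\top}\Sigma^{-1}\widehat{uy_i}
     - \tfrac{1}{2}\,\hat{u}_i\,\mu_i^{\top}\Sigma^{-1}\mu_i, \\
\frac{\partial}{\partial \beta_c}
  \left[-\tfrac{1}{2}\,\mathrm{tr}\!\left(\hat{\Gamma}_i \Sigma^{-1}\right)\right]
  &= X_{ic}^{\top}\Sigma^{-1}\widehat{uy_i}
     - \hat{u}_i\,X_{ic}^{\top}\Sigma^{-1}X_{ic}\beta_c .
\end{align*}
```

The first line uses $\mathrm{tr}(\widehat{uy_i}\,\mu_i^{\top}\Sigma^{-1}) = \mathrm{tr}(\mu_i(\widehat{uy_i})^{\top}\Sigma^{-1}) = \mu_i^{\top}\Sigma^{-1}\widehat{uy_i}$; multiplying the second line by the case weight $\upsilon_i$ gives the expression for $\partial Q_i/\partial\beta_c$ stated in the proof, since the log-determinant term does not depend on $\beta_c$.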

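Proof of Proposition 4.2.

The perturbed Q-function is as defined in (18), where $\Sigma(\upsilon_i) = \upsilon_i^{-1}\Sigma$ is used in place of $\Sigma$. Therefore, we have:

$$Q(\theta,\upsilon \mid \hat{\theta}) = \sum_{i=1}^{n} Q_i(\theta,\upsilon \mid \hat{\theta}) = \sum_{i=1}^{n}\left[-\frac{1}{2}\ln\!\left[\upsilon_i^{-1}|\Sigma|\right] - \frac{1}{2}\mathrm{tr}\!\left(\hat{\Gamma}_i\,\upsilon_i\Sigma^{-1}\right)\right].$$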

By leveraging Lemmas A.1 and A.2 and proceeding through analogous steps as those in the proof of Proposition 4.1, the result ensues.
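The analogous steps can be sketched as follows, taking the perturbed Q-function exactly as displayed in this proof and reusing $B$, $D$, and $\hat{\Gamma}_i$ from the proof of Proposition 4.1. This is an illustrative outline under that form of the Q-function, and the statement of Proposition 4.2 remains the reference for the final expressions.

```latex
\begin{align*}
Q_i(\theta,\upsilon \mid \hat{\theta})
  &= \tfrac{1}{2}\ln \upsilon_i - \tfrac{1}{2}\ln|\Sigma|
     - \tfrac{1}{2}\,\upsilon_i\,\mathrm{tr}\!\left(\hat{\Gamma}_i \Sigma^{-1}\right), \\
\frac{\partial Q_i}{\partial \beta_c}
  &= \upsilon_i\left[X_{ic}^{\top}\Sigma^{-1}\widehat{uy_i}
     - \hat{u}_i\,X_{ic}^{\top}\Sigma^{-1}X_{ic}\beta_c\right], \\
\frac{\partial Q_i}{\partial \sigma^2}
  &= -\tfrac{1}{2}\,\mathrm{tr}\!\left(\Sigma^{-1}B\right)
     + \tfrac{1}{2}\,\upsilon_i\,\mathrm{tr}\!\left(\hat{\Gamma}_i \Sigma^{-1} B \Sigma^{-1}\right).
\end{align*}
```

The derivative with respect to $\rho$ has the same form with $D$ in place of $B$. Differentiating once more with respect to $\upsilon_i$, only the terms multiplied by $\upsilon_i$ survive, because $\tfrac{1}{2}\ln\upsilon_i$ does not involve $\theta$ and $\mathrm{tr}(\Sigma^{-1}B)$ does not involve $\upsilon_i$.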

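Proof of Proposition 4.3.

The perturbed Q-function follows (18), where $y_i$ is replaced with $y_i(\upsilon) = y_i + \upsilon_i\mathbf{1}_2$. Therefore, the perturbed Q-function is expressed as

$$Q(\theta,\upsilon \mid \hat{\theta}) = \sum_{i=1}^{n} Q_i(\theta,\upsilon \mid \hat{\theta}) = \sum_{i=1}^{n}\left[-\frac{1}{2}\ln|\Sigma| - \frac{1}{2}\mathrm{tr}\!\left(\hat{\Gamma}_i\Sigma^{-1}\right)\right],$$

where $\hat{\Gamma}_i$ is updated according to the proposed perturbation, specifically:

$$\hat{\Gamma}_i = \widehat{uy_i^2} - \widehat{uy_i}\,\mu_i^{\top} - \mu_i(\widehat{uy_i})^{\top} + \hat{u}_i\,\mu_i\mu_i^{\top} - 2(\widehat{uy_i})\,\upsilon_i\mathbf{1}_2^{\top} + 2\hat{u}_i\upsilon_i^2 + 2\hat{u}_i\upsilon_i\,\mu_i\mathbf{1}_2^{\top}.$$

Now, applying the results of Lemmas A.1 and A.2, and considering the same $B$ and $D$ as in the proof of Proposition 4.1, we obtain:

$$\frac{\partial Q_i(\theta,\upsilon \mid \hat{\theta})}{\partial \beta_c} = X_{ic}^{\top}\Sigma^{-1}\widehat{uy_i} - \hat{u}_i X_{ic}^{\top}\Sigma^{-1}\left(\upsilon_i\mathbf{1}_2 + \mu_i\right),$$

$$\frac{\partial Q_i(\theta,\upsilon \mid \hat{\theta})}{\partial \sigma^2} = -\frac{1}{2}\mathrm{tr}\!\left(\Sigma^{-1}B\right) + \frac{1}{2}\mathrm{tr}\!\left(\hat{\Gamma}_i\Sigma^{-1}B\Sigma^{-1}\right), \qquad \frac{\partial Q_i(\theta,\upsilon \mid \hat{\theta})}{\partial \rho} = -\frac{1}{2}\mathrm{tr}\!\left(\Sigma^{-1}D\right) + \frac{1}{2}\mathrm{tr}\!\left(\hat{\Gamma}_i\Sigma^{-1}D\Sigma^{-1}\right),$$

with the updated $\hat{\Gamma}_i$. Again, differentiating with respect to $\upsilon$ and evaluating at $\theta = \hat{\theta}(\upsilon_0)$, we obtain:

$$\Delta_{\beta_c} = \left.\frac{\partial^2 Q(\theta,\upsilon \mid \hat{\theta})}{\partial \beta_c\,\partial \upsilon^{\top}}\right|_{\theta=\hat{\theta}(\upsilon_0)} = \sum_{i=1}^{n}\left.\frac{\partial^2 Q_i(\theta,\upsilon \mid \hat{\theta})}{\partial \beta_c\,\partial \upsilon^{\top}}\right|_{\theta=\hat{\theta}(\upsilon_0)} = -\sum_{i=1}^{n}\hat{u}_i X_{ic}^{\top}\Sigma^{-1}\mathbf{1}_2,$$

$$\Delta_{\sigma^2} = \sum_{i=1}^{n}\left[\frac{1}{2}\mathrm{tr}\!\left(\hat{\Gamma}_i\Sigma^{-1}B\Sigma^{-1}\right)\right] \quad\text{and}\quad \Delta_{\rho} = \sum_{i=1}^{n}\left[\frac{1}{2}\mathrm{tr}\!\left(\hat{\Gamma}_i\Sigma^{-1}D\Sigma^{-1}\right)\right],$$

where, in these last two expressions, $\hat{\Gamma}_i = \left\{-2(\widehat{uy_i})\mathbf{1}_2^{\top} + 2\hat{u}_i\,\mu_i\mathbf{1}_2^{\top}\right\}$.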

Proof of Proposition 4.4.

Consider the perturbed explanatory matrix

$$X_{ic}(\upsilon) = \begin{pmatrix} x_i^{\top}(\upsilon_i) & 0 \\ 0 & w_i^{\top} \end{pmatrix},$$

where $x_i^{\top}(\upsilon_i) = x_i^{\top} + \upsilon_i\mathbf{1}_u$. Here, $\mathbf{1}_u = (0, \ldots, 1, \ldots, 0)$ is a $1 \times p$ vector with 1 in the $u$th column, $u = 1, \ldots, p$. The perturbed Q-function is defined as in (18), with $X_{ic}$ replaced by $X_{ic}(\upsilon)$. The unperturbed case corresponds to $\upsilon_0 = 0 \in \mathbb{R}^n$. When $X_{ic}$ is replaced by $X_{ic}(\upsilon)$, $\hat{\Gamma}_i$ is updated accordingly, following the procedure outlined in the proof of Proposition 4.3. The conclusion follows from applying Lemmas A.1 and A.2 and following the same steps as in the proofs of the preceding propositions.

Funding Statement

The research conducted by Marcos S. Oliveira was supported by Grant no. 401418/2022-7 Conselho Nacional de Desenvolvimento Científico e Tecnológico - CNPq. Marcos O. Prates acknowledges support from CNPq grant 309186/2021-8, Fundação de Amparo à Pesquisa do Estado de Minas Gerais (FAPEMIG) grant APQ-01837-22, and Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES). Christian E. Galarza acknowledges the support from the ESPOL Dean of Research. Victor Lachos acknowledges partial financial support from UConn - CLAS's Summer Research Funding Initiative 2023 and Research Excellence Program - UConn.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • 1. Arellano-Valle R., Branco M., and Genton M., A unified view on skewed distributions arising from selections, Can. J. Stat. 34 (2006), pp. 581–601.
  • 2. Barros M., Galea M., González M., and Leiva V., Influence diagnostics in the Tobit censored response model, Stat. Methods Appl. 19 (2010), pp. 379–397.
  • 3. Bastos F.S. and Barreto-Souza W., Birnbaum–Saunders sample selection model, J. Appl. Stat. 48 (2021), pp. 1896–1916.
  • 4. Bastos F.S., Barreto-Souza W., and Genton M.G., A generalized Heckman model with varying sample selection bias and dispersion parameters, Stat. Sin. 32 (2022), pp. 1911–1938.
  • 5. Cameron A.C. and Trivedi P.K., Microeconometrics Using Stata, Vol. 5, Stata Press, College Station, TX, 2009.
  • 6. Cook R.D., Detection of influential observation in linear regression, Technometrics 19 (1977), pp. 15–18.
  • 7. Cook R.D., Assessment of local influence, J. R. Stat. Soc. Ser. B 48 (1986), pp. 133–169.
  • 8. Cook R.D. and Weisberg S., Residuals and Influence in Regression, Chapman & Hall/CRC, Boca Raton, FL, 1982.
  • 9. Ding P., Bayesian robust inference of sample selection using selection-t models, J. Multivar. Anal. 124 (2014), pp. 451–464.
  • 10. Galarza C.E., Matos L.A., Castro L.M., and Lachos V.H., Moments of the doubly truncated selection elliptical distributions with emphasis on the unified multivariate skew-t distribution, J. Multivar. Anal. 189 (2022), p. 104944.
  • 11. Galarza C.E., Matos L.A., Dey D.K., and Lachos V.H., On moments of folded and doubly truncated multivariate extended skew-normal distributions, J. Comput. Graph. Stat. 31 (2022), pp. 455–465.
  • 12. Garay A., Castro L., Leskow J., and Lachos V.H., Censored linear regression models for irregularly observed longitudinal data using the multivariate-t distribution, Stat. Methods Med. Res. 26 (2014), pp. 542–566.
  • 13. Graham A., Kronecker Products and Matrix Calculus: With Applications, Ellis Horwood series in mathematics and its applications, Horwood, 1981.
  • 14. Heckman J., Shadow prices, market wages, and labor supply, Econometrica 42 (1974), pp. 679–694.
  • 15. Heckman J., Sample selection bias as a specification error, Econometrica 47 (1979), pp. 153–161.
  • 16. Henningsen A., Toomet O., and Petersen S., sampleSelection: Sample selection models, R Package Version 1.2-0 https://cran.r-project.org/web/packages/sampleSelection/index.html (2019).
  • 17. Kleiber C. and Zeileis A., Applied Econometrics with R, Springer-Verlag, New York, 2008.
  • 18. Lachos V.H., Ghosh P., and Arellano-Valle R.B., Likelihood based inference for skew–normal independent linear mixed models, Stat. Sin. 20 (2010), pp. 303–322.
  • 19. Lachos V.H., Prates M.O., and Dey D.K., Heckman selection-t model: Parameter estimation via the EM-algorithm, J. Multivar. Anal. 184 (2021), p. 104737.
  • 20. Lee L.F., Generalized econometric models with selectivity, Econometrica 51 (1983), pp. 507–512.
  • 21. Lee M.-j., Treatment effects in sample selection models and their nonparametric estimation, J. Econom. 167 (2012), pp. 317–329.
  • 22. Lee S.Y. and Xu L., Influence analysis of nonlinear mixed-effects models, Comput. Stat. Data Anal. 45 (2004), pp. 321–341.
  • 23. Lim H., Ordonez J.A., Lachos V.H., and Punzo A., Heckman selection contaminated normal model, arXiv preprint arXiv:2409.12348 (2024).
  • 24. Marchenko Y.V. and Genton M.G., A Heckman selection-t model, J. Am. Stat. Assoc. 107 (2012), pp. 304–317.
  • 25. Massuia M.B., Cabral C.R.B., Matos L.A., and Lachos V.H., Influence diagnostics for Student-t censored linear regression models, Statistics 49 (2015), pp. 1074–1094.
  • 26. Matos L.A., Lachos V.H., Balakrishnan N., and Labra F.V., Influence diagnostics in linear and nonlinear mixed-effects models with censored data, Comput. Stat. Data Anal. 57 (2013), pp. 450–464.
  • 27. Matos L.A., Prates M.O., Chen M.H., and Lachos V.H., Likelihood-based inference for mixed-effects models with censored response using the multivariate-t distribution, Stat. Sin. 23 (2013), pp. 1323–1342.
  • 28. Miao W., Ding P., and Geng Z., Identifiability of normal and normal mixture models with nonignorable missing data, J. Am. Stat. Assoc. 111 (2016), pp. 1673–1683.
  • 29. Mroz T.A., The sensitivity of an empirical model of married women's hours of work to economic and statistical assumptions, Econometrica 55 (1987), pp. 765–799.
  • 30. Ogundimu E.O. and Hutton J.L., A sample selection model with skew-normal distribution, Scand. J. Stat. 43 (2016), pp. 172–190.
  • 31. Pan J., Fei Y., and Foster P., Case-deletion diagnostics for linear mixed models, Technometrics 56 (2014), pp. 269–281.
  • 32. Poon W.Y. and Poon Y.S., Conformal normal curvature and assessment of local influence, J. R. Stat. Soc. Ser. B 61 (1999), pp. 51–61.
  • 33. Saulo H., Vila R., Cordeiro S.S., and Leiva V., Bivariate symmetric Heckman models and their characterization, J. Multivar. Anal. 193 (2023), p. 105097.
  • 34. Vaida F. and Liu L., Fast implementation for normal mixed effects models with censored response, J. Comput. Graph. Stat. 18 (2009), pp. 797–817.
  • 35. Valeriano K.A., Galarza C.E., Matos L.A., and Lachos V.H., Likelihood-based inference for the multivariate skew-t regression with censored or missing responses, J. Multivar. Anal. 196 (2023), p. 105174.
  • 36. Zhao J., Kim H.-J., and Kim H.-M., New EM-type algorithms for the Heckman selection model, Comput. Stat. Data Anal. 146 (2020), p. 106930.
  • 37. Zhu H., Ibrahim J.G., and Shi X., Diagnostic measures for generalized linear models with missing covariates, Scand. J. Stat. 36 (2009), pp. 686–712.
  • 38. Zhu H. and Lee S., Local influence for incomplete-data models, J. R. Stat. Soc. Ser. B 63 (2001), pp. 111–126.
  • 39. Zhu H., Lee S., Wei B., and Zhou J., Case-deletion measures for models with incomplete data, Biometrika 88 (2001), pp. 727–737.

Articles from Journal of Applied Statistics are provided here courtesy of Taylor & Francis
