Abstract
This study presents diagnostic techniques for Heckman selection models estimated using the EM algorithm. The focus is on the selection t and normal models, based on the bivariate Student's-t and bivariate normal distributions, respectively. The Heckman selection model is a key econometric tool for estimating relationships while addressing selection bias. Relying on the EM-type algorithm, we develop global and local influence analyses based on the conditional expectation of the complete-data log-likelihood function, exploring four perturbation schemes for local influence analysis. To assess the effectiveness of the proposed diagnostic measures in identifying influential observations, we conducted a simulation study, complemented by two real-data applications that demonstrate how these techniques can effectively identify influential points. The proposed algorithms and methodologies are incorporated into the R package HeckmanEM.
Keywords: Case-deletion, Heckman selection model, local influence, model perturbation, multivariate Student's-t
1. Introduction
The Heckman selection model, introduced by Heckman [14], is a widely used method in econometrics and statistics to correct for selection bias in datasets where the outcomes are observed only for a non-random subset of the population. The model originally assumes bivariate normality (SLn), simplifying its mathematical formulation. However, real-world data often deviate from this assumption, exhibiting skewness or heavy tails, which can lead to biased estimates if not properly addressed. To address these limitations, Marchenko and Genton [24] extended the model to incorporate a bivariate Student's-t error distribution (SLt), offering greater flexibility for modeling heavier tails with the addition of the degrees of freedom parameter. This extension enhances resistance to outliers, a common issue in empirical datasets.
Numerous studies have advanced the understanding of Heckman selection models and their variations. For example, Lee [20] provided a generalized framework for selection models, while Bastos and Barreto-Souza [3] introduced a sample selection model based on the bivariate Birnbaum–Saunders distribution. Subsequently, Bastos et al. [4] generalized the Heckman model by allowing selection bias and dispersion parameters to depend on covariates. Lee [21] developed nonparametric methods for estimating treatment effects under selection bias, and more recently, Saulo et al. [33] extended the model to a broad class of symmetric distributions. Despite these advancements and the relevance of the area, there remains a significant gap in diagnostic methodologies specifically tailored to Heckman selection models, particularly regarding resistance and influence analysis.
Recognizing the need for reliable estimation methods under the SLn and SLt models, Zhao et al. [36] (SLn) and Lachos et al. [19] (SLt) proposed EM-type algorithms that improve the computation of maximum likelihood estimates; the E-step involves the first two moments of the corresponding truncated multivariate distribution. This estimation methodology serves as the foundation for the diagnostic methods for Heckman models presented in this work.
Influence diagnostics matter for Heckman selection models because these models are sensitive to influential observations. Identifying the impact of specific data points is crucial: even a few influential cases can disproportionately affect parameter estimates and predictions, leading to potentially misleading conclusions.
This study seeks to fill the existing gap by adapting diagnostic procedures to Heckman selection models and offering practical tools to enhance the validity of analyses in real-world applications. Although diagnostic methodologies for regression models have been extensively studied, influence diagnostics for Heckman selection models, particularly in local influence analysis, remain unexplored. The observed log-likelihood function of the SLn model includes intractable integrals, complicating the application of Cook's approach (see [7]). To address these challenges, Zhu and Lee [38] introduced local influence analysis using the Q-displacement function, aligning with the E-step of the EM algorithm; additionally, Zhu et al. [39] explored case-deletion methods for identifying influential observations.
Building on the work of Zhu and Lee [38], this study adapts their diagnostic procedures for both the SLn and SLt Heckman models and demonstrates their application with real-world data. Our framework identifies influential observations and assesses their impact on the stability and reliability of model estimates. These diagnostic tools enable practitioners to refine their models, ensuring that decisions are not unduly influenced by anomalous data points. The proposed methods are implemented in the R package HeckmanEM, providing researchers and analysts with accessible solutions for applying Heckman selection models.
The paper is structured as follows. Section 2 introduces the multivariate Student's-t distribution, its truncated version, and the extended multivariate skew-t and extended skew-normal distributions. Section 3 discusses the SLn and SLt models and the EM-type algorithm for maximum likelihood estimates. Section 4 derives diagnostic measures for global and local influence, considering four perturbation schemes. The simulation study and two real-data applications are presented in Sections 5 and 6, respectively, while Section 7 provides final remarks and conclusions. Proofs for the derived quantities are in the Appendix.
2. Background
2.1. The multivariate Student's-t distribution and its truncated version
A p-dimensional random variable following a multivariate Student's-t (MVT) distribution with location vector , positive-definite scale-covariance matrix , and degrees of freedom ν is denoted by , with its probability density function (pdf) denoted by . For and , the cumulative distribution function (cdf) is represented by:
Special cases include for , and and when and . When p = 1, the subscript p is omitted. As ν approaches infinity, converges in distribution to a multivariate normal (MVN) distribution . A key property of is its representation as a scale mixture of an MVN random vector and a positive random variable:
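$$\mathbf{X} = \boldsymbol{\mu} + U^{-1/2}\,\mathbf{Z}, \qquad (1)$$
where $\mathbf{Z} \sim N_p(\mathbf{0}, \boldsymbol{\Sigma})$ and is independent of $U \sim G(\nu/2, \nu/2)$, with $G(\alpha, \beta)$ being a gamma distribution with mean $\alpha/\beta$.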
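The scale-mixture representation can be illustrated with a short simulation. The sketch below is a minimal example, not the authors' implementation: it draws a bivariate Student's-t vector as a normal vector divided by the square root of a gamma mixing variable with shape and rate both equal to ν/2. The function name `sample_bivariate_t`, the particular scale matrix, and the numeric values are assumptions made for illustration.

```python
import math
import random

def sample_bivariate_t(mu, sigma, rho, nu, rng):
    """Return one draw (x1, x2) and its mixing variable u.

    Assumes the scale matrix [[sigma^2, rho*sigma], [rho*sigma, 1]]
    (an illustrative choice, not taken from this subsection).
    """
    # U ~ Gamma(shape = nu/2, rate = nu/2), so E[U] = 1.
    # random.gammavariate expects a scale parameter, i.e. 1/rate = 2/nu.
    u = rng.gammavariate(nu / 2.0, 2.0 / nu)
    # Z ~ N_2(0, Sigma), built from two independent standard normals.
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    e1 = sigma * (rho * z2 + math.sqrt(1.0 - rho ** 2) * z1)
    e2 = z2
    # X = mu + U^{-1/2} Z
    s = 1.0 / math.sqrt(u)
    return (mu[0] + s * e1, mu[1] + s * e2), u

rng = random.Random(2024)
draws = [sample_bivariate_t((0.0, 0.0), 1.5, 0.5, 4.0, rng) for _ in range(20000)]
mean_u = sum(u for _, u in draws) / len(draws)
print(f"average mixing variable: {mean_u:.3f}")  # should be near 1
```

Small values of the mixing variable inflate the normal draw, which is how the heavy tails arise; as ν grows the mixing variable concentrates around 1 and the draws approach the normal case.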
Considering the Borel set within :
| (2) |
a p-dimensional random vector following a doubly truncated Student's-t (TMVT) distribution within the truncation region , denoted by , has the pdf:
The cdf of within is:
In the next subsection, we will introduce the multivariate extended skew-t and skew-normal distributions, which are pivotal for improving computational efficiency in moment calculations within the EM algorithm for Heckman selection models.
2.2. The multivariate extended skew-t distribution
The multivariate extended skew-t (EST) distribution for a p-dimensional random vector with location vector , positive-definite dispersion matrix , skewness vector , , and degrees of freedom ν is denoted as . The pdf is:
| (3) |
From Valeriano et al. [35], the mean vector and variance-covariance matrix of are given by:
| (4) |
where , , and with . When , the distribution simplifies to the skew-t distribution as described by Lachos et al. [18]. In the limits and , it converges to the Student's-t and multivariate extended skew-normal (ESN) distributions, respectively. The ESN pdf is given by:
| (5) |
denoted as . From Galarza et al. [11], the mean vector and variance-covariance matrix of an ESN random vector are:
| (6) |
where . Here, and represent the pdf and cdf of , respectively, with the subscript p omitted for p = 1. Refer to Galarza et al. [10] and Galarza et al. [11] for more details on the EST and ESN distribution properties.
The moments from Equation (4) are essential for the E-step in the EM algorithm of the Heckman selection-t model. Despite their asymmetry, the EST and ESN distributions naturally arise in selection models, belonging to a broader class known as the multivariate selection elliptical family (see, [1]).
3. The Heckman selection model
Sample selection bias and missing data often pose significant challenges in research. The SL model tackles these issues using two equations: a linear equation for the dependent variable and a Probit equation for the sample selection process. The linear equation describes the relationship between the independent and dependent variables, while the Probit equation estimates the probability of a sample being selected. The outcome equation is:
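$$Y_{1i} = \mathbf{x}_i^\top \boldsymbol{\beta} + \varepsilon_{1i}, \qquad (7)$$
and the sample selection mechanism is described by the latent linear equation:
$$Y_{2i} = \mathbf{w}_i^\top \boldsymbol{\gamma} + \varepsilon_{2i}, \qquad (8)$$
for $i = 1, \ldots, n$. Here, $\boldsymbol{\beta}$ and $\boldsymbol{\gamma}$ are unknown parameters, and $\mathbf{x}_i$ and $\mathbf{w}_i$ are known characteristics. The covariates in $\mathbf{x}_i$ and $\mathbf{w}_i$ can overlap, and the exclusion restriction is met when at least one element of $\mathbf{w}_i$ is not in $\mathbf{x}_i$. The sample selection indicator is $I(Y_{2i} > 0)$. We observe the outcome $Y_{1i}$ only if $Y_{2i} > 0$; otherwise it is recorded as missing. Therefore, the observed data for the ith subject is $(V_i, C_i)$, where $V_i$ represents the vector of censored readings and $C_i$ is the censoring indicator.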
3.1. The classical Heckman selection model
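Heckman [15] assumes that the error terms are independently distributed according to a bivariate normal distribution:
$$(\varepsilon_{1i}, \varepsilon_{2i})^\top \overset{\text{iid}}{\sim} N_2(\mathbf{0}, \boldsymbol{\Sigma}), \qquad \boldsymbol{\Sigma} = \begin{pmatrix} \sigma^2 & \rho\sigma \\ \rho\sigma & 1 \end{pmatrix}, \qquad (9)$$
where the second diagonal element equals one due to the probit link associated with the latent variable $Y_{2i}$, ensuring model identifiability. The model defined in (7)–(9) is referred to as the Heckman selection (SLn) model, with parameter vector $\boldsymbol{\theta} = (\boldsymbol{\beta}^\top, \boldsymbol{\gamma}^\top, \sigma^2, \rho)^\top$. When the selection effect is absent ($\rho = 0$), the unobserved outcomes are missing at random.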
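The data-generating mechanism of the classical model can be sketched in a few lines. This is an illustrative simulation only (the paper itself relies on the R package HeckmanEM): the function name `simulate_selection_sample`, the covariate design, and all parameter values are hypothetical. The errors are built so that the outcome error has variance σ², the selection error has variance 1, and their correlation is ρ.

```python
import math
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def simulate_selection_sample(n, beta, gamma, sigma, rho, rng):
    """Simulate n subjects; y1 is None whenever the latent y2 <= 0."""
    rows = []
    for _ in range(n):
        x = (1.0, rng.gauss(0.0, 1.0))     # outcome covariates (intercept + one regressor)
        w = x + (rng.gauss(0.0, 1.0),)     # selection covariates: x plus one extra variable
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        e2 = z2                            # unit variance (probit normalization)
        e1 = sigma * (rho * z2 + math.sqrt(1.0 - rho ** 2) * z1)
        y1 = dot(x, beta) + e1             # outcome equation
        y2 = dot(w, gamma) + e2            # latent selection equation
        selected = y2 > 0.0
        rows.append({"x": x, "w": w, "selected": selected,
                     "y1": y1 if selected else None})
    return rows

beta, gamma = (1.0, 0.5), (0.5, 0.3, -0.5)   # hypothetical values
sample = simulate_selection_sample(5000, beta, gamma, sigma=2.0, rho=0.7,
                                   rng=random.Random(7))
observed = [r for r in sample if r["selected"]]
mean_resid = sum(r["y1"] - dot(r["x"], beta) for r in observed) / len(observed)
print(len(observed), round(mean_resid, 3))
```

With a positive ρ, the average error among selected subjects is positive rather than zero, which is the selection bias the model is designed to correct.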
Using Bayes' rule, the conditional pdf of an observation is (see, [24]):
| (10) |
which belongs to the ESN family, as discussed in Subsection 2.2, i.e.
From (6), we have and , thus the mean equation for the observed outcomes is:
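$$E[Y_{1i} \mid Y_{2i} > 0] = \mathbf{x}_i^\top \boldsymbol{\beta} + \rho\sigma\,\lambda(\mathbf{w}_i^\top \boldsymbol{\gamma}), \qquad (11)$$
where $\lambda(\cdot) = \phi(\cdot)/\Phi(\cdot)$ is the inverse Mills ratio. The SLn problem can be viewed as a model misspecification case, combining a linear component, $\mathbf{x}_i^\top \boldsymbol{\beta}$, with a nonlinear correction term, $\rho\sigma\,\lambda(\mathbf{w}_i^\top \boldsymbol{\gamma})$. Heckman [15] proposed a two-step procedure to address this, which is less efficient than ML estimation but remains robust even if the error terms are not jointly normal. In the two-step procedure, the standard probit model provides the estimate $\hat{\boldsymbol{\gamma}}$; then $\lambda(\mathbf{w}_i^\top \hat{\boldsymbol{\gamma}})$ is an additional covariate in (11), and its least squares coefficient estimates $\rho\sigma$. This can be implemented in R using the sampleSelection library [16]. Alternatively, the ML estimate of $\boldsymbol{\theta}$ can be computed by maximizing the likelihood function given the observed data: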
| (12) |
or via the ECM algorithm discussed in Zhao et al. [36] and Lachos et al. [19], implemented in the R package HeckmanEM [19].
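The correction term in the mean equation for the observed outcomes is easy to evaluate numerically. The following sketch assumes the standard form of that equation, namely the linear predictor plus ρσ times the inverse Mills ratio of the selection linear predictor; the function names and input values are illustrative and do not come from HeckmanEM or sampleSelection.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def inverse_mills(z):
    """lambda(z) = phi(z) / Phi(z)."""
    return norm_pdf(z) / norm_cdf(z)

def selected_mean(xb, wg, sigma, rho):
    """Mean of the outcome among selected units: xb + rho * sigma * lambda(wg)."""
    return xb + rho * sigma * inverse_mills(wg)

# Hypothetical linear predictors for one subject.
xb, wg = 1.0, 0.0
no_selection = selected_mean(xb, wg, sigma=2.0, rho=0.0)     # equals xb
neg_selection = selected_mean(xb, wg, sigma=2.0, rho=-0.5)   # shifted below xb
print(no_selection, neg_selection)
```

Because the inverse Mills ratio is positive, the sign of ρ determines the direction of the bias that ordinary least squares on the selected sample would incur; the bias vanishes only when ρ = 0.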
3.2. The Heckman selection-t model
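Marchenko and Genton [24] introduced the selection-t (SLt) model to handle heavy-tailed distributions. This model assumes that the error terms in Equations (7)–(8) follow a bivariate Student's-t distribution with unknown degrees of freedom, ν:
$$(\varepsilon_{1i}, \varepsilon_{2i})^\top \overset{\text{iid}}{\sim} t_2(\mathbf{0}, \boldsymbol{\Sigma}, \nu), \qquad (13)$$
with $\boldsymbol{\Sigma}$ as in (9). The model, defined by (7)–(8) and (13), is known as the Heckman selection-t (SLt) model, with parameter vector $\boldsymbol{\theta} = (\boldsymbol{\beta}^\top, \boldsymbol{\gamma}^\top, \sigma^2, \rho, \nu)^\top$. According to Miao et al. [28] (Theorem 5), all SLt model parameters are identifiable from the observed data.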
The conditional pdf of an observed outcome (see, [24]) is given by:
| (14) |
that is,
Using Equation (4), the conditional expectation for the observed outcome simplifies to
| (15) |
by identifying the terms , , , and defining as a specialized function. Like the SLn model, the traditional OLS regression produces inconsistent results when . Marchenko and Genton [24, Figure 1] found that for negative values of the selection linear predictor , the conditional expectation (15) is typically underestimated in the SLn model when the degrees of freedom ν are moderate. However, this bias diminishes as the degrees of freedom increase.
The likelihood function of for the SLt model, given the observed sample , is expressed as:
| (16) |
As seen, it is similar to the SLn case due to the standard probit model . There are no closed-form expressions for the ML estimates of the parameters in (16); thus, the ML estimates are obtained numerically or via the ECM algorithm implemented in the R package HeckmanEM. We briefly outline the EM-type algorithm from Lachos et al. [19], where all parameters are updated (M-step) by treating both the outcome and sample selection ( ) as missing data [26,34]. Further technical details are in Lachos et al. [19].
Disregarding censoring momentarily, consider observations for n independent individuals:
Here, represents the vector of independent responses for sample unit i, with
and the dispersion matrix depends on an unknown parameter vector , as defined in (9). Using representation (1) and temporarily disregarding censoring, the distribution of can be hierarchically expressed as follows:
| (17) |
Consider , , , , where we observe for the ith subject. In the estimation process, and are considered as hypothetical missing data and augmented with the observed dataset to form . Therefore, the EM-type algorithm is applied to the complete-data log-likelihood function:
where
Here, c represents a constant that does not depend on , and denotes the pdf of the distribution. The EM algorithm for the SLt model can be outlined in the following two steps:
- E-step: Given the current estimate at the kth stage of the algorithm, the E-step provides the conditional expectation of the complete-data log-likelihood function
where
| (18) |
with , , and . Computing directly poses analytical challenges. To circumvent this, Lachos et al. [19] use a constrained ML step (CML-step) to update ν instead: the actual log-likelihood function, rather than the Q-function, is maximized under specific constraints. Parameter transformations and are utilized to derive closed-form expressions in the M-step.
| (19) |
- M-step: The conditional maximization of is performed with respect to , yielding updated estimates , , . The closed-form expressions for these estimates, along with a suitable approach for computing ML estimate standard errors for the SLt model and residual analysis, are detailed in Lachos et al. [19]. These methodologies are implemented in the R package HeckmanEM.
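To give a feel for what the E-step computes, the sketch below evaluates the conditional expectation of the mixing variable for a fully observed bivariate Student's-t pair, which is (ν + 2)/(ν + δ²) with δ² the squared Mahalanobis distance. This is a simplified, complete-case analogue under the hierarchical representation and is stated here as an assumption: the censored cases handled in Lachos et al. [19] require truncated moments that are not reproduced. The function name `t_estep_weight` is hypothetical.

```python
def t_estep_weight(y, mu, sigma, rho, nu):
    """E[U | y] for a fully observed bivariate t pair (no censoring).

    Scale matrix assumed: [[sigma^2, rho*sigma], [rho*sigma, 1]].
    """
    r1, r2 = y[0] - mu[0], y[1] - mu[1]
    det = sigma ** 2 * (1.0 - rho ** 2)
    # Squared Mahalanobis distance r' Sigma^{-1} r for the 2x2 scale matrix.
    delta2 = (r1 ** 2 - 2.0 * rho * sigma * r1 * r2 + sigma ** 2 * r2 ** 2) / det
    return (nu + 2.0) / (nu + delta2)

mu = (0.0, 0.0)
w_center = t_estep_weight((0.0, 0.0), mu, sigma=1.5, rho=0.3, nu=4.0)
w_near = t_estep_weight((0.1, 0.0), mu, sigma=1.5, rho=0.3, nu=4.0)
w_far = t_estep_weight((10.0, 0.0), mu, sigma=1.5, rho=0.3, nu=4.0)
print(w_center, w_near, w_far)
```

Observations far from the location receive small weights in the subsequent M-step, which is the mechanism behind the resistance of the SLt fit to outlying responses.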
4. Influence diagnostics
Techniques for diagnosing influence focus on determining how sensitive a model's parameter estimates are to changes in the dataset or the model's underlying assumptions. Two main approaches are employed to identify influential data points. The first approach, known as the case-deletion technique, assesses the impact of excluding a particular observation by comparing the parameter estimates before and after its removal. This involves fitting one or more models without the observation and evaluating the differences using metrics like the likelihood distance or Cook's distance [6]. The second approach, termed the local influence method, explores the effect of making minor adjustments to an observation on the analysis results, instead of completely omitting it [7]. Building on the work of Zhu and Lee [38] and Zhu et al. [39], we introduce case-deletion measures and local influence measures for the Heckman selection-t model, utilizing the Q-function derived during the E-step of the EM algorithm. We start by discussing the case-deletion measures, then move on to the local influence measures, and finally describe the perturbation schemes used.
4.1. Case-deletion measures
The case-deletion approach is frequently employed to examine the impact of removing the ith observation from a dataset. In this context, any quantity with the subscript ‘ ’ represents the original quantity with the ith observation excluded. For example, refers to the complete data with the ith observation removed. Let be the maximizer of the function , where represents the ML estimates of . To assess the effect of the ith observation on , we compare the difference between and . If the removal of an observation significantly alters the estimates, it indicates that the observation is influential. In other words, if markedly differs from , the ith observation may be considered influential. Since calculating for every observation can be computationally intensive, the following one-step approximation is used to alleviate the computational load (see [8,39]):
| (20) |
where
| (21) |
represent the gradient vector and the Hessian matrix evaluated at , respectively. Specifically, the Hessian matrix plays a critical role in the method developed by Zhu et al. [39] (see also [37]) for computing case-deletion diagnostic measures and for assessing the local influence of a given perturbation scheme. These formulas can be derived straightforwardly from Equation (18). The elements of the gradient vector are given by:
where
| (22) |
The elements of the Hessian matrix , where is the parameter vector, are given by:
where , , and are defined as in (22), , and . We obtain the Hessian matrix by evaluating these second-order derivatives at .
To assess influential observations, we can develop case-deletion measures such as the generalized Cook's distance and the likelihood distance. Using the distance metric proposed by Zhu et al. [39] to quantify the difference between and , we define the generalized Cook's distance as follows:
| (23) |
By substituting (20) into (23), we derive the following approximation of the generalized Cook's distance:
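The one-step idea can be sketched on a toy problem. The code assumes the usual Q-function form of these quantities: the one-step estimate adds the solution of the linear system (negative Hessian) × step = (deleted-case gradient) to the full-data estimate, and the approximate generalized Cook's distance is the gradient times that step. The helper names `solve` and `one_step_case_deletion`, and the toy normal-mean model with unit variance, are illustrative assumptions rather than the SLn or SLt expressions.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def one_step_case_deletion(theta_hat, grad_deleted, neg_hessian):
    """Return the one-step deleted estimate and the approximate Cook's distance."""
    step = solve(neg_hessian, grad_deleted)
    theta_i = [t + s for t, s in zip(theta_hat, step)]
    gd = sum(g * s for g, s in zip(grad_deleted, step))
    return theta_i, gd

# Toy model: normal mean with unit variance, Q(mu) = -0.5 * sum((y - mu)^2).
y = [1.0, 2.0, 3.0, 10.0]
mu_hat = sum(y) / len(y)
neg_hessian = [[float(len(y))]]   # minus the second derivative of Q for the full data
results = []
for i in range(len(y)):
    # Gradient of Q with case i removed, evaluated at the full-data estimate.
    grad_deleted = [sum(yj - mu_hat for j, yj in enumerate(y) if j != i)]
    results.append(one_step_case_deletion([mu_hat], grad_deleted, neg_hessian))
gd = [g for _, g in results]
print(gd)
```

In this toy example the last observation has by far the largest distance, and no model refit is needed: only the full-data estimate, the per-case gradients, and one Hessian are used.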
4.2. Local influence
In this subsection, we calculate the normal curvature of local influence, following the method described by Cook [7], for several standard perturbation schemes applied to the model or data. Specifically, we will investigate case-weight perturbation, scale matrix perturbation, explanatory variable perturbation, and response perturbation. By examining these methods, we aim to accurately assess the sensitivity and resistance of our model, enabling us to make any necessary adjustments to enhance its reliability.
Let be the perturbation vector varying within an open region . Consider that denote the complete-data log-likelihood function of the perturbed model. We assume there exists such that for all . Let us define
The influence graph is defined as , with representing the Q-displacement function, given by:
Building on the methodology of Cook [7] and Zhu and Lee [38], the normal curvature of at in the direction of a unit vector can be employed to analyze the local behavior of the Q-displacement function. We define
Next, it can be demonstrated that
with as defined in (21).
Adopting the approach described by Cook [7], the symmetric matrix offers crucial insights for detecting influential observations. We begin by applying the spectral decomposition:
where are eigenvalue-eigenvector pairs of with , and orthonormal eigenvectors . Zhu and Lee [38] proposed examining all eigenvectors corresponding to nonzero eigenvalues to extract additional insights, employing the following approach:
Let represent the lth component of , where the assessment of influential cases involves visually inspecting for , plotted against the index l, and considering a case influential if exceeds a specified benchmark.
Using normal curvature to evaluate observation influence can be problematic due to the variability of , which is not invariant under uniform scale changes. To address this issue, Zhu and Lee [38] introduced the concept of conformal normal curvature, inspired by Poon and Poon [32], defined as:
which is straightforward to compute and satisfies . Let be a basic perturbation vector where the lth entry is 1 and all other entries are 0. Zhu and Lee [38] demonstrated that for all l, allowing us to derive from .
At present, there is no standard guideline for determining the influence magnitude of a given case. Let and represent the mean and standard error of , with . Since the vectors are orthonormal, it is straightforward to establish that . Poon and Poon [32] suggested using as a reference for , though alternative functions can be used. For example, Zhu and Lee [38] proposed to account for the variance of . According to Lee and Xu [22], choosing a benchmark function based on is subjective; they recommended , where is a constant adaptable to specific applications. In this study, we use , a choice supported by Massuia et al. [25], who found it effective in empirical research.
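A benchmark of the form "mean plus a multiple of the standard deviation" can be applied as follows. The sketch assumes the trace-normalized form of the conformal normal curvature, so that the contribution of case l is the lth diagonal entry of the curvature matrix divided by its trace and the contributions sum to one. The constant `c_star`, the use of the population standard deviation, and the example diagonal are assumptions for illustration, not values taken from this study.

```python
import statistics

def conformal_contributions(f_diag):
    """Normalize the diagonal of the curvature matrix so the entries sum to 1."""
    total = sum(f_diag)
    return [d / total for d in f_diag]

def flag_influential(m0, c_star):
    """Return the benchmark mean + c_star * sd and the indices exceeding it."""
    mean_m0 = statistics.mean(m0)
    sd_m0 = statistics.pstdev(m0)      # population SD; an assumption
    benchmark = mean_m0 + c_star * sd_m0
    flagged = [l for l, value in enumerate(m0) if value > benchmark]
    return benchmark, flagged

# Hypothetical diagonal of the curvature matrix for q = 10 cases.
f_diag = [11.0] + [1.0] * 9
m0 = conformal_contributions(f_diag)
benchmark, flagged = flag_influential(m0, c_star=2.0)
print(round(benchmark, 3), flagged)
```

Because the contributions sum to one, their mean is 1/q regardless of the data, so only the standard deviation and the chosen constant vary between applications.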
4.3. Perturbation schemes
This section examines the matrix across various perturbation strategies within the Heckman selection-t model. Case-weight perturbation is employed to identify observations that notably impact the log-likelihood function, potentially exerting significant influence on maximum likelihood estimates. Scale perturbation involves adjustments to the scale matrix , highlighting individuals whose likelihood displacement within the scale structure is most pronounced. Response perturbation focuses on varying response values to identify observations that strongly affect their predicted outcomes. Explanatory variable perturbation helps pinpoint the values of continuous explanatory variables that are highly sensitive, as indicated by changes in the log-likelihood. Each perturbation scheme is considered in its partitioned form:
where
with , and . Analytical expressions are provided in the following four propositions, with proofs available in the Appendix section.
4.3.1. Case-weight perturbation
We investigate assigning arbitrary weights to the expected value of the complete-data log-likelihood function (perturbed Q-function), allowing us to account for deviations in various directions through
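$$Q(\boldsymbol{\theta}, \boldsymbol{\omega} \mid \hat{\boldsymbol{\theta}}) = \sum_{i=1}^{n} \omega_i\, Q_i(\boldsymbol{\theta} \mid \hat{\boldsymbol{\theta}}). \qquad (24)$$
Here, $\boldsymbol{\omega} = (\omega_1, \ldots, \omega_n)^\top$ is an $n \times 1$ vector, and $\boldsymbol{\omega}_0 = (1, \ldots, 1)^\top$ corresponds to the unperturbed model. Note that if $\omega_i = 0$ and $\omega_j = 1$ for $j \neq i$, the ith observation is excluded from the complete-data log-likelihood function.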
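The link between case weights and case deletion can be checked numerically. The sketch below assumes the perturbed Q-function is the weighted sum of per-observation contributions, and uses a toy normal-mean model with unit variance in place of the Heckman Q-function; in that toy setting, the mixed derivative with respect to the parameter and the ith weight is simply the ith case's score. All names and values are hypothetical.

```python
import math

def q_terms(y, mu):
    """Per-observation contributions Q_i(mu) = -0.5 * (y_i - mu)^2 (toy model)."""
    return [-0.5 * (yi - mu) ** 2 for yi in y]

def perturbed_q(y, mu, weights):
    """Case-weight perturbed Q-function: sum_i w_i * Q_i(mu)."""
    return sum(w * q for w, q in zip(weights, q_terms(y, mu)))

def delta_case_weight(y, mu_hat):
    """d^2 Q / (d mu d w_i): the score of case i evaluated at the estimate."""
    return [yi - mu_hat for yi in y]

y = [1.0, 2.0, 3.0, 10.0]
mu_hat = sum(y) / len(y)
full = perturbed_q(y, mu_hat, [1.0, 1.0, 1.0, 1.0])       # unperturbed model
without_last = perturbed_q(y, mu_hat, [1.0, 1.0, 1.0, 0.0])  # weight 0 deletes case 4
delta = delta_case_weight(y, mu_hat)
print(full, without_last, delta)
```

Setting a single weight to zero reproduces the deleted-case Q-function, while small deviations of the weights from one probe the same direction locally.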
Proposition 4.1
Under the case-weight perturbation scheme defined in (24), the elements of the matrix are given by
where , , and are defined as in (22).
Proof.
See the Appendix.
4.3.2. Scale perturbation
To assess deviations from the assumption about the scale matrix , we examine the perturbation given by
| (25) |
In this perturbation scheme, the original model corresponds to . Furthermore, the perturbed Q-function, as in (18), replaces with .
Proposition 4.2
Under the scale perturbation scheme defined in (25), the elements of the matrix are given by:
where , , and are defined as in (22).
Proof.
See the Appendix.
4.3.3. Response perturbation
To introduce a perturbation of the response variables , we replace with
| (26) |
where is a vector of ones. The perturbed Q-function follows (18), replacing with . Here, the vector signifies no perturbation.
Proposition 4.3
Under the response perturbation scheme defined in (26), the elements of the matrix are as follows:
where , and and are defined as in (22).
Proof.
See the Appendix.
4.3.4. Explanatory variable perturbation
There are three potential methods for perturbing a specific continuous explanatory variable: as a covariate in the primary regression (outcome model), as a covariate in the selection equation (selection model), or simultaneously in both equations. Here, we will focus on the first scenario, while the other two can be addressed similarly.
In this scenario, we aim to perturb a specific continuous explanatory variable in the primary regression. Under this condition, the perturbed explanatory matrix is given by:
| (27) |
where , is a vector with 1 in the uth column, . This approach addresses situations where the continuous covariate is measured with error. The perturbed Q-function follows (18), with replacing . The unperturbed case is achieved by setting .
Proposition 4.4
Under the explanatory variable perturbation scheme defined in (27), has the following elements:
where and is a matrix of dimension obtained by
with 1 in the first row of the uth column, . and are defined as in (22).
Proof.
See the Appendix.
Note that while it is not feasible to cover all pertinent perturbation schemes in detail, the key lies in finding an appropriate . As long as the perturbed complete-data log-likelihood function remains sufficiently smooth, ensuring that all necessary derivatives for diagnostic measures are well-defined, conducting local influence analysis becomes feasible without significant complications. Finally, it is important to note that our approach to developing diagnostic techniques is based on . In the context of linear mixed models, Pan and Foster [31] suggest using instead of to detect potentially influential observations. They demonstrate that is block-diagonal, which facilitates the interpretation of diagnostic measures in this setting.
The next section provides a brief simulation study to evaluate the effectiveness of the proposed diagnostic measures in identifying outliers.
5. Simulation studies
To evaluate the effectiveness of the proposed diagnostic measures, we conducted a simulation study focusing on the SLn and SLt models. The Monte Carlo simulations were designed to assess the ability of the global diagnostic measure (GD), derived via the case-deletion technique (see Section 4.1), to detect influential points in the response variable. Although analogous analyses can be extended to local influence measures, this simulation study concentrates on the global approach for simplicity.
The simulations were based on the classical Heckman selection model, as defined in Equations (7) and (8), with a sample size of . The regression coefficients were set to , and the selection parameters to . Covariates were specified as and , with and . The error terms followed a bivariate normal distribution, as outlined in Equation (9), and were assumed to be independent.
Data generation adhered to the Heckman model framework, implemented using the rHeckman function from the HeckmanEM package in R. We considered three levels of censoring (10%, 20%, and 40%) to examine whether the degree of censoring affects the detection of influential points.
To create potentially influential points, we perturbed the minimum and maximum values within each sample by k standard deviations, with k ranging from 0 to 3: the minimum value was reduced, and the maximum value increased, by k standard deviations. Even without perturbation (k = 0), these extreme points were treated as potential influential candidates due to their positions in the distribution tails. Each sample was subsequently fitted with the SLn and SLt models, and the GD measure was computed. For the SLt model, two fitting strategies were used: (1) the degrees of freedom were fixed at the value estimated when no synthetic influential observation was added (SLt with fixed ν); (2) the degrees of freedom were re-estimated on the datasets that included the outliers (SLt with adjusted ν). A replicate was counted as a detection only if both the minimum and the maximum observations exceeded the GD benchmark threshold. This process was repeated across 500 Monte Carlo replicates.
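The perturbation of the extreme observations can be sketched as below. This is a hypothetical helper, not code from the study: it assumes the shift uses the sample standard deviation of the observed values and that exactly one minimum and one maximum are modified.

```python
import math
import statistics

def perturb_extremes(y, k):
    """Lower the minimum and raise the maximum of y by k sample standard deviations."""
    sd = statistics.stdev(y)             # sample SD; an assumption
    out = list(y)
    i_min = min(range(len(y)), key=y.__getitem__)
    i_max = max(range(len(y)), key=y.__getitem__)
    out[i_min] = y[i_min] - k * sd
    out[i_max] = y[i_max] + k * sd
    return out

y = [1.0, 2.0, 3.0, 4.0, 5.0]
unchanged = perturb_extremes(y, 0)
shifted = perturb_extremes(y, 2)
print(unchanged, shifted)
```

With k = 0 the sample is returned unchanged, which matches the baseline case in which the extremes are candidates only because of their position in the tails.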
The results, summarized in Table 1, present the percentage of detected influential points for the SLn and SLt models, alongside the mean and standard deviation (SD) of the influence measure, compared to the benchmark value. As expected, for the SLn model, the percentage of influential points increased with higher values of k, regardless of the censoring level, showing that the diagnostic detects the perturbed observations more often as they become more extreme. The SLt fit, in contrast, exhibits two different behaviors. When ν is fixed, the SLt model loses its capacity to adapt to outliers, so the diagnostic flags the shifted observations as influential. When the degrees of freedom are re-estimated in the model fit, however, the model absorbs the outliers by making the tails heavier, and the measure no longer exceeds the diagnostic threshold. This helps explain why the SLt model is considerably more resistant than the SLn model.
Table 1.
Influence diagnostic analysis: case-deletion in SLn and SLt models under different censoring types and threshold variations ( ).
| Statistic | SLn, k = 0 | SLn, k = 1 | SLn, k = 2 | SLn, k = 3 | SLt fixed ν, k = 0 | SLt fixed ν, k = 1 | SLt fixed ν, k = 2 | SLt fixed ν, k = 3 | SLt adjusted ν, k = 0 | SLt adjusted ν, k = 1 | SLt adjusted ν, k = 2 | SLt adjusted ν, k = 3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10% Censoring | | | | | | | | | | | | |
| % Influential¹ | 2.0 | 46.4 | 95.4 | 99.6 | 0.8 | 27.7 | 81.8 | 93.0 | 0.0 | 0.2 | 0.2 | 0.0 |
| Mean measure | 0.020 | 0.025 | 0.032 | 0.044 | 0.021 | 0.022 | 0.028 | 0.035 | 0.020 | 0.019 | 0.019 | 0.018 |
| SD² measure | 0.029 | 0.055 | 0.117 | 0.203 | 0.025 | 0.039 | 0.079 | 0.134 | 0.025 | 0.022 | 0.019 | 0.017 |
| Benchmark | 0.140 | 0.140 | 0.140 | 0.140 | 0.160 | 0.160 | 0.160 | 0.160 | 0.160 | 0.160 | 0.160 | 0.160 |
| 20% Censoring | | | | | | | | | | | | |
| % Influential | 1.8 | 44.8 | 91.2 | 99.8 | 0.2 | 28.1 | 81.6 | 93.5 | 0.0 | 0.2 | 0.2 | 0.0 |
| Mean measure | 0.021 | 0.024 | 0.033 | 0.043 | 0.021 | 0.023 | 0.029 | 0.036 | 0.020 | 0.019 | 0.019 | 0.018 |
| SD measure | 0.027 | 0.051 | 0.110 | 0.190 | 0.023 | 0.038 | 0.079 | 0.134 | 0.025 | 0.022 | 0.019 | 0.017 |
| Benchmark | 0.140 | 0.140 | 0.140 | 0.140 | 0.160 | 0.160 | 0.160 | 0.160 | 0.160 | 0.160 | 0.160 | 0.160 |
| 40% Censoring | | | | | | | | | | | | |
| % Influential | 1.2 | 31.6 | 83.0 | 97.6 | 0.0 | 18.6 | 69.3 | 88.8 | 0.0 | 0.2 | 0.0 | 0.0 |
| Mean measure | 0.020 | 0.023 | 0.029 | 0.036 | 0.020 | 0.021 | 0.026 | 0.031 | 0.020 | 0.020 | 0.019 | 0.019 |
| SD measure | 0.025 | 0.044 | 0.089 | 0.147 | 0.021 | 0.034 | 0.067 | 0.109 | 0.022 | 0.020 | 0.018 | 0.016 |
| Benchmark | 0.140 | 0.140 | 0.140 | 0.140 | 0.160 | 0.160 | 0.160 | 0.160 | 0.160 | 0.160 | 0.160 | 0.160 |

¹ % Influential: the percentage of Monte Carlo replicates in which both the minimum and maximum observations were jointly detected as influential (exceeded the benchmark value). The benchmark is a single value per model and does not vary with k.
² SD: standard deviation.
Moreover, in the SLt with adjusted ν model, the mean and standard deviation of the GD measure remained stable or slightly decreased as k increased, indicating resistance. This behavior contrasts with the SLn and SLt with fixed ν models, where both the mean and standard deviation of GD increased with k. Additionally, higher censoring levels led to a slight decrease in the detection of influential points for the SLn and SLt with fixed ν models, whereas the SLt with adjusted ν model remained unaffected, emphasizing its resilience to outliers regardless of the degree of censoring when the degrees of freedom are estimated in the process.
Figure 1 shows, for one of the Monte Carlo replicates with 10% censoring, how the GD measure changes with k. For the SLn and the SLt with fixed ν models, the GD values of the perturbed observations grew as k increased, and more sharply for the SLn model. In contrast, the GD measure remained stable for the SLt with adjusted ν. In this case, there was an inverse association between ν and k: the estimated ν decreased as k increased. The lower ν yields heavier tails that absorb the outliers, illustrating the resistance of the SLt model.
Figure 1.
Approximate generalized Cook's distance ( ) for the SLn (left), SLt with fixed ν (center), and SLt with adjusted ν (right) models in a single Monte Carlo replication, considering minimum and maximum perturbations of k standard deviations (k = 0, 1, 2, 3) with 10% censoring.
This clearly demonstrates that the proposed diagnostic tools are effective in accurately identifying whether an observation is influential, depending on the fitted distribution. It is well established that heavy-tailed models are robust to outliers, and the diagnostic measures effectively capture this robustness.
The next section presents two real-world applications that further illustrate the efficacy of the proposed methodology.
6. Applications
6.1. Ambulatory expenditures
To illustrate the methodologies discussed, we applied them to analyze ambulatory expenditures data originally from Cameron and Trivedi [5] and later re-examined by Marchenko and Genton [24] using ML estimation in Stata, by Ding [9] using Bayesian methods, and by Lachos et al. [19] utilizing an efficient EM-type algorithm.
For our analysis, we selected the same covariates as used by Marchenko and Genton [24], Ding [9], and Lachos et al. [19]. Specifically, we focused on log expenditures (lambexp) as the outcome variable. The covariates in the outcome equation included age, blhisp, educ, female, ins, totchr), representing age, ethnicity, education status, gender, insurance status, and number of chronic diseases, respectively. The income variable was included in the selection equation, , income), to ensure the exclusion restriction. The dataset had 526 missing values out of a total of 3328 observations.
Using the HeckmanEM package in R, we fitted both SLn and SLt models. The covariates educ and ins were non-significant in both models, so we excluded them and refitted the models with $\mathbf{x} = (1, \text{age}, \text{blhisp}, \text{female}, \text{totchr})$ and $\mathbf{w} = (\mathbf{x}, \text{income})$. The estimation results are detailed in Table 2. With the exception of the intercept of the selection model, all coefficients were significant in both the outcome and selection models for SLn and SLt. As noted by Marchenko and Genton [24], Ding [9], and Lachos et al. [19], the 95% confidence interval for ρ under the SLn model contained zero, indicating weak evidence of selection bias. In contrast, the 95% confidence interval for ρ under the SLt model excluded zero, suggesting a significant selection bias effect.
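As a rough check on the statements about ρ, a Wald-type 95% interval can be formed from the point estimates and standard errors reported in Table 2. This is only a sketch: the symmetric normal approximation and the helper name `wald_ci` are assumptions, and the intervals reported by the authors may have been computed differently.

```python
def wald_ci(estimate, std_error, z=1.96):
    """Symmetric Wald interval: estimate +/- z * std_error."""
    half_width = z * std_error
    return estimate - half_width, estimate + half_width


# Estimates and standard errors of rho taken from Table 2.
rho = {"SLn": (-0.157, 0.207), "SLt": (-0.367, 0.128)}

for model, (est, se) in rho.items():
    lower, upper = wald_ci(est, se)
    contains_zero = lower < 0.0 < upper
    print(f"{model}: ({lower:.3f}, {upper:.3f}), contains zero: {contains_zero}")
```

Under this approximation the SLn interval covers zero, whereas the SLt interval lies entirely below zero, which agrees with the qualitative conclusion in the text.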
Table 2.
Ambulatory expenditures data: ML estimates, standard errors, and information criteria.
| | SLn | | | | SLt | | | |
|---|---|---|---|---|---|---|---|---|
| Parameter | Estimate | Std. error | z | p | Estimate | Std. error | z | p |
| Outcome model | ||||||||
| intercept | 5.317 | 0.173 | 30.745 | 0.000 | 5.471 | 0.135 | 40.374 | 0.000 |
| age | 0.209 | 0.024 | 8.653 | 0.000 | 0.202 | 0.023 | 8.738 | 0.000 |
| blhisp | −0.233 | 0.068 | −3.411 | 0.001 | −0.200 | 0.059 | −3.367 | 0.001 |
| female | 0.342 | 0.071 | 4.831 | 0.000 | 0.295 | 0.059 | 4.973 | 0.000 |
| totchr | 0.534 | 0.051 | 10.413 | 0.000 | 0.505 | 0.041 | 12.438 | 0.000 |
| Selection model | ||||||||
| intercept | 0.126 | 0.115 | 1.088 | 0.277 | 0.091 | 0.124 | 0.735 | 0.462 |
| age | 0.088 | 0.026 | 3.344 | 0.001 | 0.098 | 0.029 | 3.427 | 0.001 |
| blhisp | −0.435 | 0.060 | −7.189 | 0.000 | −0.465 | 0.064 | −7.235 | 0.000 |
| female | 0.687 | 0.060 | 11.404 | 0.000 | 0.748 | 0.066 | 11.308 | 0.000 |
| totchr | 0.780 | 0.068 | 11.457 | 0.000 | 0.869 | 0.083 | 10.437 | 0.000 |
| income | 0.005 | 0.001 | 4.512 | 0.000 | 0.006 | 0.001 | 4.492 | 0.000 |
| σ | 1.274 | 0.021 | | | 1.203 | 0.025 | | |
| ρ | −0.157 | 0.207 | | | −0.367 | 0.128 | | |
| ν | | | | | 13.089 | | | |
| AIC | 11713.780 | | | | 11686.060 | | | |
| BIC | 11719.890 | | | | 11692.170 | | | |
Lachos et al. [19] performed a residual analysis and concluded that the SLt model provided a better fit than the SLn model. Building on that result, we investigated the dataset for influential observations using the case-deletion approach ($GD_i$), the local influence measure derived from the conformal curvature, and the perturbation schemes outlined in Section 4.3.
Figure 2 presents the approximate generalized Cook's distance for the SLn (left panel) and SLt (right panel) fitted models. Higher values indicate a greater impact of the ith observation on the ML parameter estimates. Adapting the suggestion of Barros et al. [2], we used a benchmark for $GD_i$ that depends on the number of estimated model parameters. To enhance clarity, we highlighted observations with high $GD_i$ values. Notably, fewer observations exceeded the benchmark in the SLt model (18 points, 0.54% of the data) than in the SLn model (55 points, 1.65% of the data), underscoring the resistance of the heavy-tailed model, as expected. Heavy-tailed distributions are recognized for their resilience to influential observations (see, e.g., [12,25,27]), and Figure 2 confirms this resistant behavior of the Student's-t distribution in terms of $GD_i$. Additionally, all influential observations identified in the SLt model were also flagged in the SLn model, though with lower values.
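The flagging step described above amounts to comparing each case-deletion value with a benchmark and reporting the share of flagged cases. The following sketch illustrates that bookkeeping; the function name `flag_influential`, the toy values, and the benchmark of 0.05 are hypothetical and are not the quantities used in the paper.

```python
def flag_influential(gd, benchmark):
    """Return 1-based indices with GD above the benchmark and their share (%)."""
    flagged = [i + 1 for i, value in enumerate(gd) if value > benchmark]
    share = 100.0 * len(flagged) / len(gd)
    return flagged, share


# Toy GD values and a made-up benchmark, for illustration only.
gd_values = [0.002, 0.051, 0.004, 0.120, 0.003, 0.001, 0.049, 0.006]
flagged, share = flag_influential(gd_values, benchmark=0.05)
print(flagged, f"{share:.2f}%")

# The shares quoted in the text follow from the counts and the sample size.
share_sln = round(100 * 55 / 3328, 2)
share_slt = round(100 * 18 / 3328, 2)
print(share_sln, share_slt)
```

The last two lines reproduce the percentages quoted for the SLn and SLt fits from the reported counts (55 and 18) and the sample size (3328).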
Figure 2.
Ambulatory expenditures data: approximate generalized Cook's distance ($GD_i$) for the SLn (left) and SLt (right) fitted models.
Next, we conducted a local influence study guided by Sections 4.2 and 4.3, using the benchmark of Lee and Xu [22] to identify influential observations.
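The benchmark of Lee and Xu [22] is commonly written as the mean of the influence measure plus c* times its standard deviation. The sketch below assumes that form; the function name `lee_xu_flags`, the toy values, the use of the sample standard deviation, and the choice c* = 2 are hypothetical and are not taken from the paper.

```python
from statistics import mean, stdev


def lee_xu_flags(values, c_star):
    """Flag entries above mean(values) + c_star * sd(values).

    Returns the cutoff and the 0-based positions of the flagged entries.
    The sample standard deviation is used here as an assumption.
    """
    cutoff = mean(values) + c_star * stdev(values)
    flagged = [i for i, v in enumerate(values) if v > cutoff]
    return cutoff, flagged


# Toy influence values: nine unremarkable cases and one that stands out.
toy = [0.1] * 9 + [1.0]
cutoff, flagged = lee_xu_flags(toy, c_star=2.0)
print(f"cutoff = {cutoff:.3f}, flagged positions = {flagged}")
```

Larger values of c* make the rule more conservative, so fewer observations are flagged.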
Figure 3 presents SLn and SLt model results under case-weight perturbation, scale matrix perturbation, and response perturbation schemes. To enhance clarity, we highlighted points with high values. Analysis of case-weight and scale matrix perturbations revealed a greater number of influential points detected in the SLn model compared to the SLt model, consistent with observations in Figure 2. Conversely, response perturbation yielded similar findings of influential observations across both models.
Figure 3.
Ambulatory expenditures data: index plots of the local influence measure under case-weight, scale matrix, and response perturbations for SLn (left) and SLt (right) fitted models. Horizontal lines indicate the Lee and Xu [22] benchmark.
Figure 4 illustrates the explanatory variable perturbation for the two continuous covariates included in the primary regression. As anticipated, the SLn model identifies several influential points when perturbing the age and totchr covariates. In contrast, the SLt model designates only a few observations as influential points when perturbing the totchr covariate (number of chronic diseases). Regarding the age covariate, the SLt model effectively accommodates the observations, with no additional influential points identified.
Figure 4.
Ambulatory expenditures data: index plots of the local influence measure under explanatory variable perturbation for SLn (left) and SLt (right) fitted models. Horizontal lines indicate the Lee and Xu [22] benchmark.
To further assess the effectiveness of the proposed diagnostic measures, we refitted the SLn and SLt models after excluding specific data points. Based on the points identified as potentially influential by our proposal, we implemented the following strategy: for the SLn fit, we first excluded 55 non-influential observations (those with the lowest values of $GD_i$) and then, for comparison, removed all 55 observations with $GD_i$ values above the benchmark. Similarly, for the SLt fit, we excluded 18 non-influential observations (those with the lowest values of $GD_i$) and then removed all 18 observations with $GD_i$ values above the benchmark. Table 3 displays the relative percentage changes (RC) in these estimates, calculated as
$$\mathrm{RC}_{\eta} = \left| \frac{\hat{\eta} - \hat{\eta}_{[I]}}{\hat{\eta}} \right| \times 100\%, \qquad (28)$$

where $\hat{\eta}$ is the ML estimate of a model parameter $\eta$ from the full data and $\hat{\eta}_{[I]}$ denotes the ML estimate of $\eta$ after the set $I$ of observations has been removed. As expected, when we remove points with low values of $GD_i$ (considered non-influential), Table 3 shows that the relative percentage changes are very small, indicating that their removal does not materially affect the ML estimates for either model. Conversely, when we exclude the points identified by the measure as influential, we observe substantial relative changes in the ML estimates for both models (several exceeding 10%). Therefore, our proposal correctly discriminates the influential points from the non-influential ones.
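The relative change computation can be sketched as follows, assuming the usual absolute relative difference between the full-data estimate and the estimate obtained after removing a set of observations. The function name `relative_change` and the numerical values are hypothetical and do not come from the fitted models.

```python
def relative_change(full, reduced):
    """Relative change (in %) of each estimate after dropping a set of cases."""
    return {
        name: abs((full[name] - reduced[name]) / full[name]) * 100.0
        for name in full
    }


# Hypothetical estimates, not taken from the paper's fits.
full_fit = {"intercept": 5.0, "age": 0.20, "rho": -0.40}
refit = {"intercept": 4.9, "age": 0.21, "rho": -0.30}

rc = relative_change(full_fit, refit)
for name, value in rc.items():
    print(f"{name}: {value:.2f}%")
```

Because the change is scaled by the full-data estimate, parameters whose estimates are close to zero (such as the selection intercept in Table 2) can show very large relative changes even when the absolute change is small.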
Table 3.
Ambulatory expenditures data: relative changes (in %) of ML estimates.
| | SLn | | SLt | |
|---|---|---|---|---|
| | Dropping 55 points | Dropping 55 points | Dropping 18 points | Dropping 18 points |
| Parameter | without influence† | with influence‡ | without influence† | with influence‡ |
| Outcome model | ||||
| intercept | 0.12 | 0.68 | 0.08 | 0.17 |
| age | 0.06 | 3.44 | 0.35 | 1.47 |
| blhisp | 0.44 | 6.00 | 0.26 | 5.48 |
| female | 0.97 | 4.66 | 0.42 | 4.04 |
| totchr | 0.03 | 4.10 | 0.18 | 2.40 |
| Selection model | ||||
| intercept | 2.27 | 120.75 | 4.16 | 78.05 |
| age | 1.96 | 18.12 | 1.38 | 3.72 |
| blhisp | 0.96 | 7.75 | 0.47 | 3.06 |
| female | 0.19 | 13.14 | 0.71 | 1.23 |
| totchr | 0.69 | 22.55 | 0.61 | 6.11 |
| income | 3.78 | 51.65 | 0.55 | 45.02 |
| σ | 0.90 | 3.15 | 0.62 | 0.61 |
| ρ | 6.35 | 8.79 | 0.70 | 8.71 |
| ν | | | 3.80 | 0.33 |
†Lowest $GD_i$ values. ‡$GD_i$ values above the benchmark.
6.2. Mroz: labor supply data
In this second application, we reexamine the dataset originally introduced by Mroz [29], in which the outcome is missing for a large part of the sample. Our goal is to estimate the wage offer function for married women and to apply the proposed diagnostic tools. The dataset, referred to as the ‘Mroz data’, consists of observations on 753 married white women across 21 variables. It is available in the R package AER [17] via the command data("PSID1976"). Notably, the variable of interest, female wage, is missing for 325 (43%) of the 753 women in the sample. To illustrate the diagnostic techniques, we adopt the same set of covariates used by Ogundimu and Hutton [30]. Specifically, the logarithm of wage depends on education status and city, represented as $\mathbf{x} = (1, \text{educ}, \text{city})$. The selection equation incorporates the husband's wage, the number of children aged 5 years or younger, the marginal tax rate of the wife, and the wife's father's educational attainment, alongside the education and city variables. Thus, $\mathbf{w} = (1, \text{huswage}, \text{kids5}, \text{mtr}, \text{fatheduc}, \text{educ}, \text{city})$.
We fitted both SLn and SLt models using the HeckmanEM package in R. Table 4 summarizes the parameter estimates and their corresponding p-values. Although the statistical significance of the covariates is similar in both models, the SLt model yielded a small estimated value of ν (3.001), which suggests that the SLn model is inadequate for the Mroz data. Moreover, the estimate of σ decreased from 0.8 in the SLn model to 0.5 in the SLt model. Both models yielded a large negative estimate of ρ (−0.780 for SLn and −0.733 for SLt), indicating non-random sample selection.
Table 4.
Mroz data: ML estimates, standard errors, and information criteria.
| | SLn | | | | SLt | | | |
|---|---|---|---|---|---|---|---|---|
| Parameter | Estimate | Std. error | z | p | Estimate | Std. error | z | p |
| Outcome model | ||||||||
| intercept | 0.669 | 0.239 | 2.798 | 0.005 | 0.332 | 0.170 | 1.959 | 0.051 |
| educ | 0.066 | 0.018 | 3.559 | 0.000 | 0.087 | 0.013 | 6.719 | 0.000 |
| city | 0.107 | 0.082 | 1.306 | 0.192 | 0.094 | 0.059 | 1.602 | 0.110 |
| Selection model | ||||||||
| intercept | 3.802 | 0.764 | 4.975 | 0.000 | 5.934 | 0.953 | 6.228 | 0.000 |
| huswage | −0.103 | 0.015 | −6.812 | 0.000 | −0.153 | 0.021 | −7.387 | 0.000 |
| kids5 | −0.415 | 0.078 | −5.345 | 0.000 | −0.585 | 0.108 | −5.438 | 0.000 |
| mtr | −5.782 | 0.847 | −6.825 | 0.000 | −8.448 | 1.089 | −7.756 | 0.000 |
| fatheduc | −0.020 | 0.013 | −1.617 | 0.106 | −0.012 | 0.016 | −0.793 | 0.428 |
| educ | 0.112 | 0.024 | 4.653 | 0.000 | 0.118 | 0.029 | 4.140 | 0.000 |
| city | −0.040 | 0.107 | −0.370 | 0.712 | −0.097 | 0.123 | −0.784 | 0.433 |
| σ | 0.800 | 0.028 | | | 0.501 | 0.030 | | |
| ρ | −0.780 | 0.040 | | | −0.733 | 0.061 | | |
| ν | | | | | 3.001 | | | |
| AIC | 1765.604 | | | | 1678.728 | | | |
| BIC | 1770.228 | | | | 1683.352 | | | |
Figure 5 presents the normal probability plot of the residuals generated by the HeckmanEM package, highlighting a better fit for the SLt model, which is corroborated by its lower AIC and BIC values (see Table 4). Additionally, Figure 6 shows the approximate generalized Cook's distance ($GD_i$) for the Mroz data under both the SLn and SLt models. The SLt model identified only 4 influential points (observations 84, 176, 369, and 423), a 73.3% reduction from the 15 influential points identified by the SLn model.
Figure 5.
Mroz data: normal probability plot of normalized quantile residuals for SLn (left) and SLt (right) models.
Figure 6.
Mroz data: approximate generalized Cook's distance ($GD_i$) for the SLn (left) and SLt (right) fitted models.
Furthermore, we conducted a local influence study following Sections 4.2 and 4.3. Figure 7 displays the results for both the SLn (left) and SLt (right) models under the four perturbation schemes (case-weight, scale matrix, response, and explanatory variables). For the explanatory variable scheme, we present only the perturbation of the continuous covariate educ.
Figure 7.
Mroz data: index plots of the local influence measure under perturbations of case-weight, scale matrix, response, and explanatory variables (covariate educ) for SLn (left) and SLt (right) fitted models. Horizontal lines mark the Lee and Xu [22] benchmark.
Figure 7 shows that several points identified in Figure 6 ($GD_i$) reappear under the case-weight, scale matrix, and explanatory variable (educ) perturbations, but only in the SLn model fit. These points do not stand out in the SLt model fit. As previously discussed, these outcomes align with the inherent characteristics of the SLn and SLt models, and the proposed diagnostic techniques succeed in detecting these differences. Lastly, under the response perturbation, the highlighted points show similar patterns across both model fits.
The influence analysis reinforces the resistance of the SLt model to atypical observations. In particular, these diagnostic tools let us quantify how much the ML estimates are affected when a single observation is altered by ξ units. Specifically, we perturb a single observation by ξ and then compute the relative change in each estimate, comparing the original estimate with the estimate obtained from the contaminated data. In this instance, we contaminated the observation corresponding to subject 369, varying ξ up to 5 in increments of 1. Figure 8 shows the relative changes in the estimated outcome-model coefficients (intercept, educ, city) for different levels of ξ under both the SLn and SLt models. As anticipated, the estimates from the SLt model fluctuate less with ξ than those from the SLn model.
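The contamination experiment can be sketched in a few lines. To keep the example self-contained, ordinary least squares on synthetic data is swapped in for the Heckman EM fit, and the perturbation is assumed to be an additive shift of one response value; the function names, the simulated data, and the grid of ξ values are all hypothetical.

```python
import random


def fit_line(x, y):
    """Closed-form ordinary least squares for y = b0 + b1 * x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1


def contamination_curve(x, y, index, shifts):
    """Relative change (%) in (b0, b1) after adding each shift to y[index]."""
    base = fit_line(x, y)
    curve = {}
    for shift in shifts:
        y_pert = list(y)
        y_pert[index] += shift  # contaminate a single response value
        pert = fit_line(x, y_pert)
        curve[shift] = tuple(
            abs((b - p) / b) * 100.0 for b, p in zip(base, pert)
        )
    return curve


# Synthetic stand-in data (not the Mroz data).
random.seed(1)
x = [random.uniform(8, 17) for _ in range(200)]
y = [0.3 + 0.09 * xi + random.gauss(0, 0.5) for xi in x]

curve = contamination_curve(x, y, index=0, shifts=range(0, 6))
for shift, (rc_b0, rc_b1) in curve.items():
    print(f"shift = {shift}: intercept {rc_b0:.2f}%, slope {rc_b1:.2f}%")
```

With a least squares fit the change grows linearly in the shift, which mimics the behavior expected of a normal-based model; a fit that downweights extreme observations would produce a flatter curve.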
Figure 8.
Mroz data: relative changes in the ML estimates of the outcome-model coefficients for SLn and SLt models under different levels of contamination ξ on subject 369. The percentage change compares the original estimate with the estimate for the contaminated data.
Diagnostic techniques also play an important role in model inference. To illustrate this, we re-estimated the SLn and SLt models using the Mroz dataset, excluding all points identified as influential in the response perturbation analysis (see Figure 7). The revised parameter estimates and their corresponding p-values are presented in Table 5. A notable change was observed for the covariate fatheduc, included in the selection equation. Initially, in the SLn model, its p-value was 0.106 (see Table 4), making it non-significant at the 5% level. After re-estimation, the p-value dropped to 0.042, rendering the variable statistically significant and altering the corresponding inference. In the SLt model, the conclusions did not change when the potentially influential observations were removed, which is a further indication of the resistance of the SLt model. These results underscore the importance of diagnostic tools for detecting potentially influential observations and of studying their effect on the model fit, since they can even change the conclusions of the analysis, as observed for the SLn model.
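The change in significance can be checked directly from the z statistics reported for fatheduc in the SLn selection equation (Tables 4 and 5). The sketch below assumes a two-sided test based on the standard normal distribution; the helper name `two_sided_p` is hypothetical, and the last digit may differ slightly from the tables because the z statistics are rounded.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# z statistics of fatheduc in the SLn selection equation (Tables 4 and 5).
for label, z in (("all observations", -1.617), ("after exclusion", -2.040)):
    p = two_sided_p(z)
    print(f"{label}: p = {p:.3f}, significant at 5%: {p < 0.05}")
```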
Table 5.
Mroz data: ML estimates, standard errors, and information criteria after excluding influential points identified in response perturbation.
| | SLn | | | | SLt | | | |
|---|---|---|---|---|---|---|---|---|
| Parameter | Estimate | Std. error | z | p | Estimate | Std. error | z | p |
| Outcome model | ||||||||
| intercept | 0.749 | 0.229 | 3.273 | 0.001 | 0.343 | 0.170 | 2.016 | 0.044 |
| educ | 0.062 | 0.018 | 3.484 | 0.001 | 0.086 | 0.013 | 6.640 | 0.000 |
| city | 0.095 | 0.080 | 1.190 | 0.235 | 0.095 | 0.059 | 1.604 | 0.109 |
| Selection model | ||||||||
| intercept | 3.768 | 0.743 | 5.075 | 0.000 | 5.910 | 0.953 | 6.201 | 0.000 |
| huswage | −0.099 | 0.014 | −6.993 | 0.000 | −0.148 | 0.021 | −7.136 | 0.000 |
| kids5 | −0.366 | 0.072 | −5.056 | 0.000 | −0.558 | 0.108 | −5.144 | 0.000 |
| mtr | −5.770 | 0.825 | −6.996 | 0.000 | −8.436 | 1.090 | −7.738 | 0.000 |
| fatheduc | −0.023 | 0.011 | −2.040 | 0.042 | −0.013 | 0.016 | −0.862 | 0.389 |
| educ | 0.113 | 0.024 | 4.787 | 0.000 | 0.117 | 0.028 | 4.125 | 0.000 |
| city | −0.056 | 0.105 | −0.533 | 0.594 | −0.103 | 0.123 | −0.837 | 0.403 |
| σ | 0.814 | 0.027 | | | 0.503 | 0.030 | | |
| ρ | −0.851 | 0.028 | | | −0.740 | 0.060 | | |
| ν | | | | | 3.001 | | | |
| AIC | 1721.049 | | | | 1676.327 | | | |
| BIC | 1725.661 | | | | 1680.941 | | | |
In conclusion, our diagnostic methodology effectively identified influential points in both real-data analyses. Moreover, it confirmed the greater resistance of the SLt model, which yielded substantially fewer influential observations than the SLn model.
7. Conclusions
To the best of our knowledge, this article is the first to introduce diagnostic tools designed to identify outliers and influential observations in Heckman selection models, filling this gap in the literature. Specifically, this is done for the SLt and SLn models, which assume that the joint distribution of outcome and sample selection follows either the Student's-t or the normal distribution. Our approach utilizes the Q-function derived from the EM algorithm specific to these models. Nevertheless, the techniques employed can be extended to any selection model that relies on the EM algorithm.
From the real data analysis and simulation studies, we found that our diagnostic tools effectively distinguish between influential and non-influential observations. Additionally, our findings complement the resistant likelihood-based inference methods pioneered by Lachos et al. [19] for analyzing SLt (and SLn) models, which are particularly suited to selection bias scenarios. Further, the simulation study not only showed that our diagnostic tools correctly detect influential observations but, as a side effect, also helps the reader understand why the SLt model is resistant to outliers. Our proposed methodology has been integrated into the R package HeckmanEM, offering practitioners a user-friendly tool for applying these diagnostics in various domains. Furthermore, this package promotes the reproducibility of our research outcomes, supporting transparency and reliability in subsequent applications.
Future work includes developing diagnostic tools for other types of Heckman selection models that rely on EM-type algorithms, e.g., the Heckman selection contaminated normal model (SLcn) introduced by Lim et al. [23], and studying the relationship between the SLn and SLt models and the broader family of extended skew-elliptical distributions [10,11] in order to build influence diagnostics from a broader perspective.
Appendix.
The following results of matrix differentiation will be used in the proofs of some propositions.
Lemma A.1
Let be an symmetric matrix, and let , , and be vectors of dimension . Then
Proof.
The proof can be found in Graham [13].
Lemma A.2
Let denote a positive definite matrix, which is therefore symmetric, and let t be a scalar. Then,
Proof.
The proof can be found in Graham [13].
Proof of Proposition 4.1.
Starting from Equation (24), we can express as a summation: Substituting as defined in (19), and omitting the superscript (k) for simplicity, we obtain:
where , and . Now, applying the results of Lemmas A.1 and A.2, and considering and , we have: and
Now, differentiating with respect to and evaluating at , we obtain:
Proof of Proposition 4.2.
The perturbed Q-function is as defined in (18), where is used in place of . Therefore, we have:
By leveraging Lemmas A.1 and A.2 and proceeding through analogous steps as those in the proof of Proposition 4.1, the result ensues.
Proof of Proposition 4.3.
The perturbed Q-function follows (18), where we replace with . Therefore, the perturbed Q-function is expressed as
where is updated from the proposed perturbation, specifically: . Now, applying the results of Lemmas A.1 and A.2, and considering the same and as in the proof of Proposition 4.1, we obtain: and , with updated . Again, differentiating with respect to and evaluating at , we obtain:
where .
Proof of Proposition 4.4.
Consider the perturbed explanatory matrix
where . Here, is a vector with 1 in the uth column, . The perturbed Q-function is defined as in (18), by replacing with . The unperturbed case corresponds to . When substituting with , is updated accordingly, following the procedure outlined in the proof of Proposition 4.3. The conclusion follows from applying Lemmas A.1 and A.2 and following the same steps as in the proofs of the preceding propositions.
Funding Statement
The research conducted by Marcos S. Oliveira was supported by Grant no. 401418/2022-7 Conselho Nacional de Desenvolvimento Científico e Tecnológico - CNPq. Marcos O. Prates acknowledges support from CNPq grant 309186/2021-8, Fundação de Amparo à Pesquisa do Estado de Minas Gerais (FAPEMIG) grant APQ-01837-22, and Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES). Christian E. Galarza acknowledges the support from the ESPOL Dean of Research. Victor Lachos acknowledges partial financial support from UConn - CLAS's Summer Research Funding Initiative 2023 and Research Excellence Program - UConn.
Disclosure statement
No potential conflict of interest was reported by the author(s).
References
- 1. Arellano-Valle R., Branco M., and Genton M., A unified view on skewed distributions arising from selections, Can. J. Stat. 34 (2006), pp. 581–601.
- 2. Barros M., Galea M., González M., and Leiva V., Influence diagnostics in the Tobit censored response model, Stat. Methods Appl. 19 (2010), pp. 379–397.
- 3. Bastos F.S. and Barreto-Souza W., Birnbaum–Saunders sample selection model, J. Appl. Stat. 48 (2021), pp. 1896–1916.
- 4. Bastos F.S., Barreto-Souza W., and Genton M.G., A generalized Heckman model with varying sample selection bias and dispersion parameters, Stat. Sin. 32 (2022), pp. 1911–1938.
- 5. Cameron A.C. and Trivedi P.K., Microeconometrics Using Stata, Vol. 5, Stata Press, College Station, TX, 2009.
- 6. Cook R.D., Detection of influential observation in linear regression, Technometrics 19 (1977), pp. 15–18.
- 7. Cook R.D., Assessment of local influence, J. R. Stat. Soc. Ser. B 48 (1986), pp. 133–169.
- 8. Cook R.D. and Weisberg S., Residuals and Influence in Regression, Chapman & Hall/CRC, Boca Raton, FL, 1982.
- 9. Ding P., Bayesian robust inference of sample selection using selection-t models, J. Multivar. Anal. 124 (2014), pp. 451–464.
- 10. Galarza C.E., Matos L.A., Castro L.M., and Lachos V.H., Moments of the doubly truncated selection elliptical distributions with emphasis on the unified multivariate skew-t distribution, J. Multivar. Anal. 189 (2022), p. 104944.
- 11. Galarza C.E., Matos L.A., Dey D.K., and Lachos V.H., On moments of folded and doubly truncated multivariate extended skew-normal distributions, J. Comput. Graph. Stat. 31 (2022), pp. 455–465.
- 12. Garay A., Castro L., Leskow J., and Lachos V.H., Censored linear regression models for irregularly observed longitudinal data using the multivariate-t distribution, Stat. Methods Med. Res. 26 (2014), pp. 542–566.
- 13. Graham A., Kronecker Products and Matrix Calculus: With Applications, Ellis Horwood Series in Mathematics and its Applications, Horwood, 1981.
- 14. Heckman J., Shadow prices, market wages, and labor supply, Econometrica 42 (1974), pp. 679–694.
- 15. Heckman J., Sample selection bias as a specification error, Econometrica 47 (1979), pp. 153–161.
- 16. Henningsen A., Toomet O., and Petersen S., sampleSelection: Sample selection models, R Package Version 1.2-0 https://cran.r-project.org/web/packages/sampleSelection/index.html (2019).
- 17. Kleiber C. and Zeileis A., Applied Econometrics with R, Springer-Verlag, New York, 2008.
- 18. Lachos V.H., Ghosh P., and Arellano-Valle R.B., Likelihood based inference for skew–normal independent linear mixed models, Stat. Sin. 20 (2010), pp. 303–322.
- 19. Lachos V.H., Prates M.O., and Dey D.K., Heckman selection-t model: Parameter estimation via the EM-algorithm, J. Multivar. Anal. 184 (2021), p. 104737.
- 20. Lee L.F., Generalized econometric models with selectivity, Econometrica 51 (1983), pp. 507–512.
- 21. Lee M.-j., Treatment effects in sample selection models and their nonparametric estimation, J. Econom. 167 (2012), pp. 317–329.
- 22. Lee S.Y. and Xu L., Influence analysis of nonlinear mixed-effects models, Comput. Stat. Data Anal. 45 (2004), pp. 321–341.
- 23. Lim H., Ordonez J.A., Lachos V.H., and Punzo A., Heckman selection contaminated normal model, arXiv preprint arXiv:2409.12348 (2024).
- 24. Marchenko Y.V. and Genton M.G., A Heckman selection-t model, J. Am. Stat. Assoc. 107 (2012), pp. 304–317.
- 25. Massuia M.B., Cabral C.R.B., Matos L.A., and Lachos V.H., Influence diagnostics for Student-t censored linear regression models, Statistics 49 (2015), pp. 1074–1094.
- 26. Matos L.A., Lachos V.H., Balakrishnan N., and Labra F.V., Influence diagnostics in linear and nonlinear mixed-effects models with censored data, Comput. Stat. Data Anal. 57 (2013), pp. 450–464.
- 27. Matos L.A., Prates M.O., Chen M.H., and Lachos V.H., Likelihood-based inference for mixed-effects models with censored response using the multivariate-t distribution, Stat. Sin. 23 (2013), pp. 1323–1342.
- 28. Miao W., Ding P., and Geng Z., Identifiability of normal and normal mixture models with nonignorable missing data, J. Am. Stat. Assoc. 111 (2016), pp. 1673–1683.
- 29. Mroz T.A., The sensitivity of an empirical model of married women's hours of work to economic and statistical assumptions, Econometrica 55 (1987), pp. 765–799.
- 30. Ogundimu E.O. and Hutton J.L., A sample selection model with skew-normal distribution, Scand. J. Stat. 43 (2016), pp. 172–190.
- 31. Pan J., Fei Y., and Foster P., Case-deletion diagnostics for linear mixed models, Technometrics 56 (2014), pp. 269–281.
- 32. Poon W.Y. and Poon Y.S., Conformal normal curvature and assessment of local influence, J. R. Stat. Soc. Ser. B 61 (1999), pp. 51–61.
- 33. Saulo H., Vila R., Cordeiro S.S., and Leiva V., Bivariate symmetric Heckman models and their characterization, J. Multivar. Anal. 193 (2023), p. 105097.
- 34. Vaida F. and Liu L., Fast implementation for normal mixed effects models with censored response, J. Comput. Graph. Stat. 18 (2009), pp. 797–817.
- 35. Valeriano K.A., Galarza C.E., Matos L.A., and Lachos V.H., Likelihood-based inference for the multivariate skew-t regression with censored or missing responses, J. Multivar. Anal. 196 (2023), p. 105174.
- 36. Zhao J., Kim H.-J., and Kim H.-M., New EM-type algorithms for the Heckman selection model, Comput. Stat. Data Anal. 146 (2020), p. 106930.
- 37. Zhu H., Ibrahim J.G., and Shi X., Diagnostic measures for generalized linear models with missing covariates, Scand. J. Stat. 36 (2009), pp. 686–712.
- 38. Zhu H. and Lee S., Local influence for incomplete-data models, J. R. Stat. Soc. Ser. B 63 (2001), pp. 111–126.
- 39. Zhu H., Lee S., Wei B., and Zhou J., Case-deletion measures for models with incomplete data, Biometrika 88 (2001), pp. 727–737.








