Journal of Applied Statistics. 2022 Jan 13;50(6):1255–1282. doi: 10.1080/02664763.2021.2024798

Heteroscedastic partially linear model under skew-normal distribution with application in ragweed pollen concentration

Clécio S Ferreira, Camila Borelli Zeller, Rafael R de Oliveira Garcia
PMCID: PMC10071991  PMID: 37025282

Abstract

We introduce a new class of heteroscedastic partially linear models (PLMs) with skew-normal distribution. Maximum likelihood estimation of the model parameters by the ECM algorithm (Expectation/Conditional Maximization) as well as influence diagnostics for the new model are investigated. In addition, a Likelihood Ratio test for assessing the homogeneity of the scale parameter is presented. Simulation studies for assessing the performance of the ECM algorithm and the Likelihood Ratio test statistics for homogeneity of variance are developed. Also, a study for misspecification of the structure function is considered. Finally, an application of the new heteroscedastic PLM to a real data set on ragweed pollen concentration is presented to show that it provides a better fit than the classic homoscedastic PLM. We hope that the proposed model may attract applications in different areas of knowledge.

Keywords: Partially linear models, skew-normal distribution, heteroscedasticity, ECM algorithm, local influence

1. Introduction

Partially linear models (PLMs) or semiparametric models have been studied by various authors (see, for instance, [16,30], and references therein). These models add a nonparametric component to the usual linear relation between the response and explanatory variables. When the data set presents heterogeneity of variance, the PLMs can be extended by incorporating, for example, a positive continuously differentiable function involving a subset of explanatory variables, named here as Heteroscedastic Partially Linear Models (HPLM). In this context, Chen and You [5] proposed to estimate the nonparametric component through kernel smoothing and constructed a semiparametric generalized least-squares estimator for the parametric component. Ma et al. [26] studied the HPLMs with an unspecified partial baseline component and a nonparametric variance function, proposing a family of consistent estimators and investigating their asymptotic properties. They showed that the optimal semiparametric efficiency bound can be reached by a semiparametric kernel estimator in this family. You et al. [35] proposed a test of heteroscedasticity, a two-step estimator of the heteroscedastic variance function, semiparametric generalized least-squares estimators of the parametric and nonparametric components of the model, and a bootstrap goodness-of-fit test to see whether the nonparametric component can be parametrized. More recently, Keilegom and Wang [21] studied a general class of location-dispersion regression models, in which both the location function and the dispersion function are semiparametrically modeled. Note that these works use the normal (Gaussian) distribution to model data, including the use of the traditional method of least-squares in the models.

However, in some situations, where the data are asymmetric, the above methods may not be appropriate, particularly when the response assumes real values. In this sense, a distribution that accommodates skewness, and includes the normal distribution as a special case, was introduced by Azzalini [2], named skew-normal. This class has been studied by various authors in different contexts (see, for example, [4,23,27], among others). An extensive, but not exhaustive, list of the publications involving the skew-normal distribution can be accessed on Azzalini's homepage. Ferreira and Paula [12] proposed estimation and diagnostics for homoscedastic PLMs in an asymmetric context using the skew-normal distribution (PLM-SN), by developing the expectation-maximization (EM) algorithm for linear regression models and diagnostic analysis through local influence as well as generalized leverage, following the approach of Zhu and Lee [37]. To our knowledge, there is no research in the scientific literature involving HPLMs with an asymmetric error.

So, a natural extension is to propose a PLM model using SN distribution in the presence of heteroscedasticity and to develop influence diagnostics for detecting influential observations, which is the objective of this work. It is important to note that we implemented the ECM algorithm (Expectation/Conditional Maximization) and obtained closed-form expressions for all the estimators of the parameters of the proposed model, except for the parameter of heterogeneity. Influence analysis is an important and key step in data analysis after parameter estimation. There are two approaches for detecting influential observations: the case-deletion approach [7,38] and the local influence approach [6,37]. Since the estimation of the parameters of the HPLM model using skew-normal distribution will be via the ECM algorithm, then, in this work, the local influence approach will be based on the complete-data technique that uses the conditional expectation of the complete-data log-likelihood function (Q-function), from the ECM algorithm, as proposed by Zhu and Lee [37].

Another interesting aspect is that in the PLM, the standard assumption is that all observations have equal variances, but in some cases these models do not comply with this assumption, affecting the efficiency of the estimators. Therefore, it is important to develop tests that allow us to determine the presence or absence of such homogeneity. In this paper, we propose a Likelihood Ratio test to check the homogeneity of the scale parameter in the HPLM-SN model.

The rest of the work is organized as follows. In Section 3, the heteroscedastic partially linear model, under the assumption that the errors follow the skew-normal distribution, is presented and a penalized log-likelihood function is considered for the parameter estimation. The ECM algorithm to obtain the maximum likelihood estimate of the parameters of the HPLM-SN model is given in Section 4, where the estimation of degrees of freedom and goodness-of-fit are also discussed. Residual analysis is given in Section 5 to identify atypical observations and/or model misspecification, since residuals are measures of agreement between the data and the fitted model. Section 6 presents local influence measures using the methodology proposed by Zhu and Lee [37]. The Hessian and the corresponding matrices of four perturbation schemes are also derived. In Section 7, we discuss the Likelihood Ratio test for the homogeneity of the scale parameter. The properties of the Likelihood Ratio test statistics are investigated through Monte Carlo simulations. Section 8 deals with simulation studies to evaluate the efficiency of the ECM algorithm and the Likelihood Ratio test for homogeneity of variance. In addition, a study for misspecification of the structure function is considered. Finally, in Section 9, we illustrate the methodology by considering an application with a real data set, where the main interest is to explain the daily pollen concentration. Section 10 summarizes the contributions of the paper.

2. Motivating example

We illustrate our proposed methods with a data set obtained from [32], which is available, for example, in the R package SemiPar. This data set contains data on ragweed levels and meteorological variables for 335 days in Kalamazoo, Michigan, USA, from 1991 to 1994. Ferreira and Paula [12] analyzed this data set using a homoscedastic partially linear model under asymmetric distributions. We now revisit this data set with the aim of extending the inferential results to the heteroscedastic partially linear skew-normal model. Following [12], the explanatory variables included in our proposed model are the indicator of significant rain (1: at least 3 hours of steady or brief but intense rain, 0: otherwise), temperature (degrees Fahrenheit), wind speed forecast for the following day (mph) and time ($t$, days in season). The response variable ($y$) is the square root of the ragweed pollen concentration (grains/m$^3$). First, we fit a PLM-SN model to the data as specified by [12]

$$y_i=\beta_1\,\mathrm{rain}_i+\beta_2\,\mathrm{temperature}_i+\beta_3\,\mathrm{wind}_i+f(t_i)+\epsilon_i,\quad i=1,\ldots,335, \tag{1}$$

where the $\epsilon_i$ are iid errors following a skew-normal distribution, i.e. $\mathrm{SN}(0,\sigma^2,\lambda)$. To detect deviations from the error model, we use the standard residuals $e_{i0}$, $i=1,\ldots,335$, defined in Section 5. Figure 1(a) presents the plot of the standard residuals $e_{i0}$, where we may see a heteroscedastic behaviour of the residuals. Furthermore, Figures 1(b) and 1(c) show the scatter plots between these residuals and the explanatory variables temperature and wind. We can note that the heteroscedasticity of the residuals is due to the temperature.

Figure 1. Standard residuals ($e_{i0}$'s) of the homoscedastic PLM-SN model fitted to the ragweed levels data: (a) plot of the residuals, (b) scatter plot between residuals and temperature, and (c) scatter plot between residuals and wind.

Thus, this paper aims to introduce a new class of heteroscedastic PLM under the skew-normal distribution. When skewness and heteroscedasticity are both a concern in a partially linear regression model, this manuscript may be a useful reference for coping with those problems.

3. The proposed model

In this section, we propose the heteroscedastic partially linear model under the assumption that the errors follow the skew-normal distribution and discuss the penalized function method, which is often required to maximize the penalized likelihood function.

3.1. Skew-normal distribution

We start with the definition of the skew-normal (SN) distribution that will be used in this article; see [2] for more details. A random variable $Y\sim\mathrm{SN}(\mu,\sigma^2,\lambda)$ if its probability density function (pdf) is given by

$$f(y\mid\mu,\sigma^2,\lambda)=2\,\phi(y\mid\mu,\sigma^2)\,\Phi\!\left(\frac{\lambda(y-\mu)}{\sigma}\right),\quad y\in\mathbb{R}, \tag{2}$$

where $\phi(\cdot\mid\mu,\sigma^2)$ stands for the pdf of the normal distribution with mean $\mu$ and variance $\sigma^2$, and $\Phi(\cdot)$ represents the cumulative distribution function (cdf) of the standard normal distribution. Its stochastic representation is given by
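As a minimal sketch, the density in Equation (2) can be evaluated with the Python standard library alone. The function names `normal_pdf`, `normal_cdf` and `sn_pdf` are illustrative and do not come from the paper.

```python
import math


def normal_pdf(y, mu, sigma2):
    """Density of N(mu, sigma2) evaluated at y."""
    return math.exp(-0.5 * (y - mu) ** 2 / sigma2) / math.sqrt(2.0 * math.pi * sigma2)


def normal_cdf(u):
    """Standard normal cdf, written with the error function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))


def sn_pdf(y, mu, sigma2, lam):
    """Skew-normal density of Equation (2): 2 * phi(y | mu, sigma2) * Phi(lam * (y - mu) / sigma)."""
    sigma = math.sqrt(sigma2)
    return 2.0 * normal_pdf(y, mu, sigma2) * normal_cdf(lam * (y - mu) / sigma)
```

Setting `lam = 0` makes the factor `normal_cdf(0)` equal to 1/2, so `sn_pdf` reduces to the normal density, in line with the special case noted in the text.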

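$$Y\overset{d}{=}\mu+\sigma\left(\delta\,|T_0|+(1-\delta^2)^{1/2}T_1\right),\quad\text{with}\quad\delta=\frac{\lambda}{\sqrt{1+\lambda^2}}, \tag{3}$$

where $|T_0|$ denotes the absolute value of $T_0$, $T_0\sim N(0,1)$ and $T_1\sim N(0,1)$ are independent, and '$\overset{d}{=}$' means 'distributed as'. This convenient hierarchical representation facilitates EM-type implementation for the maximum-likelihood estimation and can be used to simulate data. Note that if $Y\sim\mathrm{SN}(\mu,\sigma^2,\lambda)$, then $Z=(Y-\mu)/\sigma\sim\mathrm{SN}(0,1,\lambda)$. A particular case of this distribution is the normal distribution ($Y\sim N(\mu,\sigma^2)$) when $\lambda=0$. From (3) it follows that the expectation and variance of $Y$ are given, respectively, by

$$E[Y]=\mu+c\,\sigma\,\delta\quad\text{and}\quad\mathrm{Var}[Y]=\sigma^2(1-c^2\delta^2),\quad\text{with}\quad c=\sqrt{2/\pi}. \tag{4}$$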
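The stochastic representation (3) gives a direct way to simulate skew-normal draws, and the moments in (4) can be checked against a Monte Carlo sample. The sketch below is illustrative only: the helper names, the seed and the parameter values ($\mu=1$, $\sigma^2=4$, $\lambda=3$) are arbitrary choices, not values from the paper.

```python
import math
import random


def sn_delta(lam):
    """delta = lambda / sqrt(1 + lambda^2), as in Equation (3)."""
    return lam / math.sqrt(1.0 + lam * lam)


def sn_sample(mu, sigma2, lam, rng):
    """One draw from SN(mu, sigma2, lam) via Y = mu + sigma * (delta*|T0| + sqrt(1 - delta^2)*T1)."""
    t0 = rng.gauss(0.0, 1.0)
    t1 = rng.gauss(0.0, 1.0)
    d = sn_delta(lam)
    return mu + math.sqrt(sigma2) * (d * abs(t0) + math.sqrt(1.0 - d * d) * t1)


def sn_mean_var(mu, sigma2, lam):
    """Theoretical mean and variance from Equation (4), with c = sqrt(2 / pi)."""
    c = math.sqrt(2.0 / math.pi)
    d = sn_delta(lam)
    return mu + c * math.sqrt(sigma2) * d, sigma2 * (1.0 - c * c * d * d)


# Illustrative check with arbitrary parameter values and a fixed seed.
rng = random.Random(2022)
draws = [sn_sample(1.0, 4.0, 3.0, rng) for _ in range(100_000)]
sample_mean = sum(draws) / len(draws)
sample_var = sum((x - sample_mean) ** 2 for x in draws) / (len(draws) - 1)
theory_mean, theory_var = sn_mean_var(1.0, 4.0, 3.0)
```

With a sample of this size, `sample_mean` and `sample_var` should lie close to `theory_mean` and `theory_var`, which is a quick sanity check on both the sampler and the moment formulas.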

3.2. Model specification

In this section, we define the heteroscedastic partially linear model under skew-normal distribution. First, consider the homoscedastic partially linear model under skew-normal distribution (PLM-SN model), as defined by Ferreira and Paula [12], given by

$$Y_i=x_i^\top\beta+f(t_i)+\epsilon_i,\quad i=1,\ldots,n, \tag{5}$$

where $Y_i$ denotes the response of the $i$th experimental unit, $x_i$ is a known $p\times 1$ covariate vector, $\beta$ is a $p$-dimensional vector of unknown regression coefficients, $t_i$ is a scalar that may represent a value of a continuous covariate, for example, time, $f(\cdot)$ is a smooth function, and the $\epsilon_i$ are independent random errors such that $\epsilon_i\sim\mathrm{SN}(0,\sigma^2,\lambda)$. However, the actual scale parameter may be related to the $i$th observation $Y_i$, and thus its variance is nonconstant.

We extend the PLM-SN model defined in (5) by allowing for heteroscedasticity under the skew-normal structure. Thus, for the $i$th experimental unit,

$$\epsilon_i\sim\mathrm{SN}(0,\sigma_i^2,\lambda),\quad\sigma_i^2=\sigma^2 m_i(\rho,z_i),\quad i=1,\ldots,n, \tag{6}$$

where $m_i=m(\rho,z_i)$ is a known continuously differentiable positive function, $z_i$ contains the values of the explanatory variables, which generally constitute, although not necessarily, a subset of $x_i$, and $\rho$ is a $p\times 1$ vector of unknown parameters. If the variances depend on the values of some explanatory variables $z_i$, a specific form of $m_i$ is, for example, the log-linear model given by $m_i(\rho,z_i)=\exp\left(\sum_{j=1}^{p}\rho_j z_{ij}\right)$; see [11,22,34] and references therein for more details. It is assumed that there is a $\rho_0$ such that $m(\rho_0,z_i)=1$ for all $i=1,\ldots,n$. We call the structure defined by (5) and (6) the HPLM-SN model (heteroscedastic partially linear skew-normal model).
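To make the variance structure in (6) concrete, the following sketch evaluates the log-linear form of $m_i$ and the resulting scale parameters $\sigma_i^2$. The function names are hypothetical; for this form, $\rho_0=0$ gives $m(\rho_0,z_i)=1$ for every $z_i$.

```python
import math


def m_loglinear(rho, z):
    """Log-linear variance function m(rho, z) = exp(sum_j rho_j * z_j)."""
    return math.exp(sum(r * zj for r, zj in zip(rho, z)))


def scale_parameters(sigma2, rho, Z):
    """sigma_i^2 = sigma^2 * m(rho, z_i) for each row z_i of Z (a list of lists)."""
    return [sigma2 * m_loglinear(rho, z) for z in Z]
```

For instance, `scale_parameters(0.5, [0.1], [[0.0], [10.0]])` returns one scale parameter equal to 0.5 and another equal to 0.5 times e, showing how the variance grows with the covariate when the coefficient is positive.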

Alternatively, the model (5)–(6) can be written in matrix form as follows

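$$Y=X\beta+Nf+\epsilon, \tag{7}$$

where $Y=(Y_1,\ldots,Y_n)^\top$, $X$ is an $n\times p$ design matrix with rows $x_i^\top$, $N$ is an $n\times q$ incidence matrix, $f$ is a $q\times 1$ vector (unknown smooth curve) and $\epsilon=(\epsilon_1,\ldots,\epsilon_n)^\top$ is an $n\times 1$ vector of random errors.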

3.3. Penalized log-likelihood function

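From expressions (6)–(7), we have that $Y_i\sim\mathrm{SN}(\mu_i,\sigma_i^2,\lambda)$, where $\mu_i=x_i^\top\beta+n_i^\top f$ and $n_i^\top$ is the $i$th row of $N$. Hence, the observed-data log-likelihood function of $\theta=(\beta^\top,\sigma^2,\lambda,\rho^\top,f^\top)^\top$ is given by

$$\ell(\theta)\propto-\frac{n}{2}\log\sigma^2-\frac{1}{2}\sum_{i=1}^{n}\log m_i-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\frac{(y_i-\mu_i)^2}{m_i}+\sum_{i=1}^{n}\log\Phi\!\left(\frac{\lambda(y_i-\mu_i)}{\sigma\,m_i^{1/2}}\right). \tag{8}$$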
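The observed-data log-likelihood (8) is straightforward to evaluate once the means $\mu_i$ and the variance weights $m_i$ are available. The sketch below takes them as plain lists; the function names are illustrative, and additive constants are dropped exactly as in (8).

```python
import math


def log_norm_cdf(u):
    """log Phi(u); erfc is used for accuracy, but this still underflows for u below about -38."""
    return math.log(0.5 * math.erfc(-u / math.sqrt(2.0)))


def sn_loglik(y, mu, sigma2, lam, m):
    """Observed-data log-likelihood of Equation (8), up to an additive constant.

    y, mu, m are equal-length lists holding y_i, mu_i and m_i(rho, z_i).
    """
    n = len(y)
    sigma = math.sqrt(sigma2)
    total = -0.5 * n * math.log(sigma2)
    for yi, mui, mi in zip(y, mu, m):
        r = yi - mui
        total += -0.5 * math.log(mi) - 0.5 * r * r / (sigma2 * mi)
        total += log_norm_cdf(lam * r / (sigma * math.sqrt(mi)))
    return total
```

The penalized version in (9) would subtract $(\alpha/2)\,f^\top K f$ from this value; that term is omitted here because it requires the matrix $K$.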

The direct maximization of (8) is difficult due to the term $\Phi(\cdot)$. In addition, maximization of (8) without imposing restrictions on the function $f(\cdot)$ may cause overfitting and nonidentification of $\beta$ (see, for instance, [14]). A well-known procedure that is based on the idea of log-likelihood penalization consists in incorporating a penalty function in the log-likelihood, such that

$$\ell_p(\theta,\alpha)=\ell(\theta)-\frac{\alpha}{2}J(f), \tag{9}$$

where $J(f)$ denotes the penalty function over $f(\cdot)$ and $\alpha$ is a smoothing parameter that controls the tradeoff between the goodness-of-fit, measured by large values of $\ell(\theta)$, and the smoothness of the estimated function, measured by small values of $J(f)$. Therefore, the determination of the parameter $\alpha$ is a crucial part of the estimation process, for which different methods of choice are available in the literature, such as the Akaike information criterion or the Bayesian information criterion. Following [15,18], we use a natural cubic spline to estimate $f$ and the penalty $J(f)=\int_a^b[f^{(2)}(t)]^2\,dt$, where $f^{(2)}(t)$ denotes the second derivative of $f(t)$ and $[a,b]$ contains $t_1^0,\ldots,t_q^0$, the distinct and ordered values of the $t_i$. Moreover, from Theorem 2.1 in [15], page 13, the penalty function satisfies $\int_a^b[f^{(2)}(t)]^2\,dt=f^\top Kf$, where $K\in\mathbb{R}^{q\times q}$ is a nonnegative definite matrix that depends only on the knots. In this article, we use the methods of [15] (first method) and [10] (second method) to construct the matrices $N$ and $K$. They differ in the choice of knots used to estimate the curve $f$. In the first method, $f=(f(t_1^0),\ldots,f(t_q^0))^\top$, with $t_1^0,\ldots,t_q^0$ being the distinct and ordered values of the $t_i$, and $N$ is an $n\times q$ incidence matrix whose $(i,j)$th element equals the indicator function $I(t_i=t_j^0)$ for $j=1,\ldots,q$. In the second method, the knots are chosen according to a pre-established interval using B-splines and we estimate $f(\cdot)$ as a B-spline of order 3 [3], i.e. $f(x)=\sum_{j=1}^{q}f_jB_j(x)$. In this case, the elements of $N$ are given by $n_{ij}=B_j(t_i)$, $i=1,\ldots,n$ and $j=1,\ldots,q$. A function 'bspline(t,q,df)' in R [29] is used, where q is the number of equidistant knots desired by the user and df is the order of the polynomial, which equals 3 in this paper (a cubic B-spline); see Appendix 2 for more details. The expression of the matrix $K$ is also described in Appendix 2.
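For the first method, the incidence matrix $N$ only records which distinct knot each observation falls on. A minimal sketch of that construction follows; the function name `incidence_matrix` is hypothetical, and the construction of $K$ (Appendix 2) is not attempted here.

```python
def incidence_matrix(t):
    """Build the distinct ordered knots and the n x q incidence matrix N.

    N[i][j] = 1 if t[i] equals the j-th distinct ordered value of t, and 0 otherwise.
    """
    knots = sorted(set(t))
    column = {value: j for j, value in enumerate(knots)}
    N = [[0] * len(knots) for _ in t]
    for i, ti in enumerate(t):
        N[i][column[ti]] = 1
    return knots, N
```

For `t = [3, 1, 3, 2]` this gives three knots and a 4 x 3 matrix in which every row contains exactly one 1, so that $Nf$ simply picks out $f(t_i)$ for each observation.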

4. Statistical inference

In this section, we discuss some inferential aspects in the HPLM-SN model as well as the penalized maximum likelihood estimation using the ECM algorithm. The ECM algorithm [28] to obtain the maximum likelihood estimate of $\theta$ and a discussion of degrees of freedom estimation are given in Section 4.1. The standard error estimation of $\hat\theta$ is presented in Appendix 1.

4.1. Parameter estimation using the ECM algorithm

In this section, we present an ECM algorithm for the ML estimation of the HPLM-SN model. To explore the ECM algorithm, we present the HPLM-SN model in an incomplete data framework, using the results presented in Section 3. Thus, from Equation (3), the set-up defined above can be written hierarchically as

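$$Y_i\mid W_i=w_i\sim N\!\left(\mu_i+\sigma_i\delta w_i,\ \sigma_i^2(1-\delta^2)\right), \tag{10}$$
$$W_i\sim TN(0,1;(0,+\infty)), \tag{11}$$

for $i=1,\ldots,n$, all independent, where $TN(r,s;(a,b))$ denotes the univariate normal distribution $N(r,s)$ truncated on the interval $(a,b)$ [20]. Let $y=(y_1,\ldots,y_n)^\top$ and $w=(w_1,\ldots,w_n)^\top$. Then, under the hierarchical representation (10)–(11), it follows that the complete log-likelihood function associated with $y_c=(y^\top,w^\top)^\top$ is

$$\ell_c(\theta\mid y_c)\propto-n\log\sigma^2-\sum_{i=1}^{n}\log m_i-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\frac{1}{m_i}\left[(1+\lambda^2)(y_i-\mu_i)^2-2\lambda w_i(y_i-\mu_i)+w_i^2\right].$$

As in the original proposal of [9], the E-step of our algorithm consists of taking the conditional expectation $Q(\theta\mid\hat\theta^{(k)})=E[\ell_c(\theta\mid y_c)\mid y,\hat\theta^{(k)}]$, where $\hat\theta^{(k)}=(\hat\beta^{(k)\top},\hat f^{(k)\top},\widehat{\sigma^2}^{(k)},\hat\lambda^{(k)},\hat\rho^{(k)\top})^\top$ is the current estimate of $\theta$ at the $k$th iteration. The maximum penalized likelihood estimate (MPLE) of $\theta$ is the value that maximizes the function

$$Q_p(\theta\mid\hat\theta^{(k)})=Q(\theta\mid\hat\theta^{(k)})-\frac{\alpha}{2}J(f). \tag{12}$$

Given $\hat\alpha^{(k)}$, the M-step consists of the maximization of $Q_p(\theta\mid\hat\theta^{(k)})$ with respect to $\theta$. It follows, after some simple algebra, that the conditional expectation of the complete log-likelihood function has the form

$$Q(\theta\mid\hat\theta^{(k)})\propto-n\log\sigma^2-\sum_{i=1}^{n}\log m_i-\frac{(1+\lambda^2)}{2\sigma^2}(y-X\beta-Nf)^\top H(y-X\beta-Nf)-\frac{1}{2\sigma^2}\widehat{w^2}^{(k)\top}H1_n+\frac{\lambda}{\sigma^2}(y-X\beta-Nf)^\top H\hat w^{(k)}, \tag{13}$$

where $1_n$ is an $n\times 1$ vector of ones, $H$ is the diagonal matrix formed from the vector $(m_1^{-1}(\rho,z_1),\ldots,m_n^{-1}(\rho,z_n))^\top$, and $\hat w^{(k)}=(\hat w_1^{(k)},\ldots,\hat w_n^{(k)})^\top$ and $\widehat{w^2}^{(k)}=(\widehat{w^2}_1^{(k)},\ldots,\widehat{w^2}_n^{(k)})^\top$ are $n\times 1$ vectors, with $\hat w_i^{(k)}=E[W_i\mid y_i,\hat\theta^{(k)}]$ and $\widehat{w^2}_i^{(k)}=E[W_i^2\mid y_i,\hat\theta^{(k)}]$.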

4.1.1. Step-by-step instructions for the ECM algorithm

The ECM algorithm for the HPLM-SN model can be summarized in the following steps:

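  1. E-step: Given the current estimates $\hat\theta^{(k)}$ and $\hat\alpha^{(k)}$ at the $k$th iteration, we obtain the conditional expectation of the complete-data log-likelihood function given the observed $y$, named the Q-function, which is given by (13), such that
    $$\hat w_i^{(k)}=\hat\lambda^{(k)}e_i^{(k)}+\hat\sigma_i^{(k)}\,W_\Phi\!\left(\frac{\hat\lambda^{(k)}e_i^{(k)}}{\hat\sigma_i^{(k)}}\right), \tag{14}$$
    $$\widehat{w^2}_i^{(k)}=\left[\hat\lambda^{(k)}e_i^{(k)}\right]^2+\widehat{\sigma^2}_i^{(k)}+\hat\lambda^{(k)}\hat\sigma_i^{(k)}e_i^{(k)}\,W_\Phi\!\left(\frac{\hat\lambda^{(k)}e_i^{(k)}}{\hat\sigma_i^{(k)}}\right), \tag{15}$$
    where $e_i^{(k)}=y_i-x_i^\top\hat\beta^{(k)}-n_i^\top\hat f^{(k)}$, $\widehat{\sigma^2}_i^{(k)}=\widehat{\sigma^2}^{(k)}m_i(\hat\rho^{(k)},z_i)$, $i=1,\ldots,n$, and $W_\Phi(u)=\phi(u)/\Phi(u)$.

    The conditional maximization steps (CM-steps) are given as follows.

  2. CM-step 1: Fix $\hat\alpha^{(k)}$ and update $\hat\beta^{(k)}$, $\hat f^{(k)}$, $\widehat{\sigma^2}^{(k)}$ and $\hat\lambda^{(k)}$ as
    $$\begin{aligned}
    \hat\beta^{(k+1)}&=\left(X^\top\hat H^{(k)}X\right)^{-1}X^\top\hat H^{(k)}\left[y-N\hat f^{(k)}-\frac{\hat\lambda^{(k)}}{1+(\hat\lambda^{(k)})^2}\hat w^{(k)}\right],\\
    \hat f^{(k+1)}&=\left(N^\top\hat H^{(k)}N+\frac{\alpha\,\widehat{\sigma^2}^{(k)}}{1+(\hat\lambda^{(k)})^2}K\right)^{-1}N^\top\hat H^{(k)}\left(y-X\hat\beta^{(k)}-\frac{\hat\lambda^{(k)}}{1+(\hat\lambda^{(k)})^2}\hat w^{(k)}\right),\\
    \widehat{\sigma^2}^{(k+1)}&=\frac{1}{2n}\left\{1_n^\top\hat H^{(k)}\widehat{w^2}^{(k)}-2\hat\lambda^{(k)}\hat w^{(k)\top}\hat H^{(k)}(y-\hat\mu^{(k)})+\left[1+(\hat\lambda^{(k)})^2\right]S(\hat\beta^{(k)},\hat f^{(k)},\hat\rho^{(k)})\right\},\\
    \hat\lambda^{(k+1)}&=\frac{\hat w^{(k)\top}\hat H^{(k)}(y-\hat\mu^{(k)})}{S(\hat\beta^{(k)},\hat f^{(k)},\hat\rho^{(k)})},
    \end{aligned} \tag{16}$$
    where $S(\beta,f,\rho)=(y-\mu)^\top H(y-\mu)$ and $\mu=X\beta+Nf$.
  3. CM-step 2: Fix $\hat\alpha^{(k)}$ and, given $\hat\beta^{(k+1)}$, $\hat f^{(k+1)}$, $\widehat{\sigma^2}^{(k+1)}$ and $\hat\lambda^{(k+1)}$, update $\hat\rho^{(k)}$ as
    $$\hat\rho^{(k+1)}=\arg\max_{\rho}Q\left(\rho\mid\hat\beta^{(k+1)},\hat f^{(k+1)},\widehat{\sigma^2}^{(k+1)},\hat\lambda^{(k+1)}\right).$$
  4. Given $\hat\theta^{(k+1)}$, update $\hat\alpha^{(k)}$ as
    $$\hat\alpha^{(k+1)}=\arg\min_{\alpha}\mathrm{BIC}(\alpha\mid\hat\theta^{(k+1)}).$$
    We use the Bayesian information criterion (BIC) to select the best value of $\alpha$, given by
    $$\mathrm{BIC}(\alpha)=-2\,\ell_p(\hat\theta,\alpha)+p(\alpha)\log n,$$
    where $\ell_p(\hat\theta,\alpha)$ denotes the penalized log-likelihood function defined in (9), evaluated at $\hat\theta$ for a fixed $\alpha$, and $n$ is the sample size. Note that maximizing the penalized log-likelihood function is equivalent to minimizing the BIC. This procedure requires a one-dimensional search, which can be easily accomplished by using, for example, the 'optim' routine in R [29] to estimate $\alpha$, with $\alpha$ between $0.001$ and $10^3$. In additive linear models, degrees of freedom are defined as approximately the number of effective parameters involved in modeling the nonparametric effects [17,19]. In our case, using the expression of $\hat f$ in (16), we derive the effective degrees of freedom as
    $$\mathrm{df}(\alpha)=\mathrm{tr}\left\{N\left(N^\top\hat HN+\frac{\alpha\,\hat\sigma^2}{1+\hat\lambda^2}K\right)^{-1}N^\top\hat H\right\},$$
    where $H$ is defined in (13). Therefore, one has a total of $p(\alpha)=p+p+2+\mathrm{df}(\alpha)$ parameters to be estimated.

    Notes on implementation

    The iterations of the above algorithm are repeated until a suitable convergence rule is satisfied, e.g. $\|\theta^{(k+1)}-\theta^{(k)}\|$ is sufficiently small, say $10^{-6}$. A set of reasonable starting values may be obtained by computing $\hat\beta^{(0)}$ and $\widehat{\sigma^2}^{(0)}$ as the solution of the least-squares regression of $y$ on $X$. Then $\hat f^{(0)}=(N^\top N+\alpha\,\widehat{\sigma^2}^{(0)}K)^{-1}N^\top(y-X\hat\beta^{(0)})$, and $\hat\lambda^{(0)}$ can be the sample skewness coefficient of $y-X\hat\beta^{(0)}-N\hat f^{(0)}$. The value of $\hat\rho^{(0)}$ can be the value $\rho_0$ such that $m_i(\rho_0,z_i)=1$ for all $i=1,\ldots,n$ (homoscedasticity of variance).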
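The E-step moments (14)–(15) and the closed-form updates of $\lambda$ and $\sigma^2$ in (16) involve only elementwise operations once the current residuals $e_i^{(k)}$ and weights $m_i$ are known. The sketch below illustrates just these pieces; the function names are hypothetical, and the updates of $\beta$, $f$, $\rho$ and $\alpha$, which need linear solves and a numerical optimizer, are deliberately left out.

```python
import math


def w_phi(u):
    """W_Phi(u) = phi(u) / Phi(u); Phi(u) underflows to zero for u below about -38."""
    phi = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    Phi = 0.5 * math.erfc(-u / math.sqrt(2.0))
    return phi / Phi


def e_step(e, sigma2, lam, m):
    """Conditional moments of Equations (14)-(15) for residuals e and weights m."""
    w1, w2 = [], []
    for ei, mi in zip(e, m):
        s = math.sqrt(sigma2 * mi)       # sigma_i = sqrt(sigma^2 * m_i)
        a = lam * ei
        ratio = w_phi(a / s)
        w1.append(a + s * ratio)                  # E[W_i | y_i]
        w2.append(a * a + s * s + a * s * ratio)  # E[W_i^2 | y_i]
    return w1, w2


def update_lambda_sigma2(e, w1, w2, lam, m):
    """Closed-form CM updates of lambda and sigma^2 from Equation (16)."""
    n = len(e)
    S = sum(ei * ei / mi for ei, mi in zip(e, m))               # (y - mu)' H (y - mu)
    cross = sum(wi * ei / mi for wi, ei, mi in zip(w1, e, m))   # w' H (y - mu)
    sum_w2 = sum(wi / mi for wi, mi in zip(w2, m))              # 1' H w2
    sigma2_new = (sum_w2 - 2.0 * lam * cross + (1.0 + lam * lam) * S) / (2.0 * n)
    lam_new = cross / S
    return lam_new, sigma2_new
```

A useful special case for checking an implementation: when `lam = 0`, the first moment equals $\sigma_i\sqrt{2/\pi}$ and the second moment equals $\sigma_i^2$, so the $\sigma^2$ update reduces to the average of the current $\sigma^2$ and $S/n$ under homoscedastic weights.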

4.2. Goodness-of-fit

The Mahalanobis distance $d_i^2=(Y_i-x_i^\top\beta-n_i^\top f)^2/\sigma_i^2$, $i=1,\ldots,n$, is extremely useful in testing the goodness-of-fit and in detecting outliers. According to [33], it can be shown that the distribution of $d_i^2$ is the same as under the normal distribution. So, under the SN distribution, $d_i^2=(Y_i-x_i^\top\beta-n_i^\top f)^2/\sigma_i^2\sim\chi_1^2$. This result is useful because it allows statistical models to be evaluated in practice. Substituting the maximum likelihood estimates of $\beta$, $f$ and $\sigma^2$ into the Mahalanobis distance $d_i^2$, we can evaluate the fit of the models by constructing quantile-quantile plots with simulated confidence bands of $100\gamma\%$, $0<\gamma<1$ [1]. In addition, by plotting the Mahalanobis distance and considering as a benchmark the quantile $\nu$ of the quadratic form $d_i^2$, we can identify outliers. For instance, for the skew-normal case, we have that $\nu=\chi^2(\varepsilon)$, where $0<\varepsilon<1$.
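The outlier check described above can be sketched as follows. It is assumed here that the benchmark $\nu$ is the $\varepsilon$ quantile of the $\chi_1^2$ distribution, which equals the square of the standard normal quantile at $(1+\varepsilon)/2$; the function names and the default $\varepsilon=0.99$ are illustrative choices.

```python
from statistics import NormalDist


def mahalanobis_d2(y, mu, sigma2, m):
    """d_i^2 = (y_i - mu_i)^2 / (sigma^2 * m_i), with mu_i the fitted mean x_i'beta + n_i'f."""
    return [(yi - mui) ** 2 / (sigma2 * mi) for yi, mui, mi in zip(y, mu, m)]


def chi2_1_quantile(eps):
    """eps-quantile of chi-square(1): if Z ~ N(0,1), then P(Z^2 <= q) = 2*Phi(sqrt(q)) - 1."""
    z = NormalDist().inv_cdf(0.5 * (1.0 + eps))
    return z * z


def flag_outliers(d2, eps=0.99):
    """Indices whose Mahalanobis distance exceeds the chi-square(1) benchmark (assumed cutoff)."""
    cut = chi2_1_quantile(eps)
    return [i for i, d in enumerate(d2) if d > cut]
```

In practice `mu`, `sigma2` and `m` would be replaced by their estimates from the fitted model before calling `flag_outliers`.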

5. Residuals

Residual analysis aims at identifying atypical observations and/or model misspecification, since residuals are measures of agreement between the data and the fitted model. Under the heteroscedastic PLM-SN model, we define the following standardized residual

$$e_{i1}=\frac{y_i-x_i^\top\hat\beta-n_i^\top\hat f}{\sqrt{\widehat{\sigma^2}\,m_i(\hat\rho,z_i)}},\quad i=1,\ldots,n, \tag{17}$$

where $\hat\beta$, $\hat f$, $\widehat{\sigma^2}$ and $\hat\rho$ denote the MPLE of $\beta$, $f$, $\sigma^2$ and $\rho$, respectively, from the ECM algorithm described in Section 4.1. Note that when $\rho=0$, under the homoscedastic PLM-SN model, we get the following standardized residual

$$e_{i0}=\frac{y_i-x_i^\top\hat\beta-n_i^\top\hat f}{\sqrt{\widehat{\sigma^2}}}, \tag{18}$$

where $\hat\beta$, $\hat f$ and $\widehat{\sigma^2}$ denote the MPLE of $\beta$, $f$ and $\sigma^2$, respectively, from the EM algorithm described in Section 4 of Ferreira and Paula [12].

Based on the residuals $e_{i1}$ and $e_{i0}$, we can detect incorrect specification of the error distribution as well as the presence of outlying observations.
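A short sketch of the two residuals follows, assuming that the denominators in (17) and (18) are the estimated standard deviations $\sqrt{\widehat{\sigma^2}\,m_i}$ and $\sqrt{\widehat{\sigma^2}}$. The function name and argument layout are hypothetical.

```python
import math


def standardized_residuals(y, fitted, sigma2_hat, m_hat=None):
    """Standardized residuals (y_i - fitted_i) / sqrt(sigma2_hat * m_i).

    With m_hat=None every m_i is taken as 1, which gives the homoscedastic
    residual of Equation (18); otherwise Equation (17) is used.
    The square root in the denominator is an assumption of this sketch.
    """
    if m_hat is None:
        m_hat = [1.0] * len(y)
    return [(yi - fi) / math.sqrt(sigma2_hat * mi)
            for yi, fi, mi in zip(y, fitted, m_hat)]
```

Plotting these residuals against the index or against each covariate, as in Figure 1, is then a direct way to look for remaining heteroscedasticity.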

6. Influence diagnostics

Cook [6] proposed a unified approach for the assessment of local influence in minor perturbations of a statistical model, which can be viewed as a generalization of the robustness concept for studying and detecting influential subsets of data. Following [12], a direct application of this approach involves extensive algebraic manipulation for the HPLM-SN model. In this article, we will apply the general approach of Zhu and Lee [37] to achieve diagnostic measures for local influence analysis.

6.1. Description of the local influence approach

The general approach developed by Zhu and Lee [37] for local influence analysis of general statistical models with missing data will be utilized to obtain diagnostic measures for the HPLM-SN model. For completeness and to introduce notation, this approach is briefly outlined here. Consider a perturbation vector $\omega=(\omega_1,\ldots,\omega_g)^\top$ varying in an open region $\Omega\subset\mathbb{R}^g$, and the perturbed statistical model $\mathcal{M}=\{f(y_c,\theta,\omega):\omega\in\Omega\}$, where $f(y_c,\theta,\omega)$ is the probability density function for the complete data $y_c$ perturbed by $\omega$, and $\ell_{cp}(\theta,\omega\mid y_c)=\log f(y_c,\theta,\omega)$ is its corresponding complete penalized log-likelihood function. We assume there is an $\omega_0$ such that $\ell_{cp}(\theta,\omega_0\mid y_c)=\ell_{cp}(\theta\mid y_c)$ for all $\theta$. Let $\hat\theta(\omega)$ be the maximizer of the function $Q_p(\theta,\omega\mid\hat\theta)=E[\ell_{cp}(\theta,\omega\mid y_c)\mid y,\hat\theta]$. Then, the influence graph is defined as $\alpha(\omega)=(\omega^\top,f_Q(\omega))^\top$, where $f_Q(\omega)$ is the Q-displacement function defined as $f_Q(\omega)=2[Q_p(\hat\theta\mid\hat\theta)-Q_p(\hat\theta(\omega)\mid\hat\theta)]$. Following the approach developed by [6,37], the normal curvature $C_{f_Q,v}$ of $\alpha(\omega)$ at $\omega_0$ in the direction of some unit vector $v$ is used to summarize the local behavior of the Q-displacement function. It can be shown that $C_{f_Q,v}=-2v^\top\ddot Q_{\omega_0}v$ and $-\ddot Q_{\omega_0}=\Delta_{\omega_0}^\top\left\{-\left.\frac{\partial^2Q_p(\theta\mid\hat\theta)}{\partial\theta\,\partial\theta^\top}\right|_{\theta=\hat\theta}\right\}^{-1}\Delta_{\omega_0}$, where $\Delta_{\omega_0}=\left.\frac{\partial^2Q_p(\theta,\omega\mid\hat\theta)}{\partial\theta\,\partial\omega^\top}\right|_{\theta=\hat\theta,\omega=\omega_0}$. As in [6], the symmetric matrix $-2\ddot Q_{\omega_0}$ is fundamental for detecting influential observations, and its spectral decomposition is given by $-2\ddot Q_{\omega_0}=\sum_{k=1}^{g}\zeta_k\varepsilon_k\varepsilon_k^\top$, where $\{(\zeta_k,\varepsilon_k),k=1,\ldots,g\}$ are eigenvalue–eigenvector pairs of $-2\ddot Q_{\omega_0}$ with $\zeta_1\ge\cdots\ge\zeta_r>\zeta_{r+1}=\cdots=0$ and orthonormal eigenvectors $\{\varepsilon_k,k=1,\ldots,g\}$. Lesaffre and Verbeke [25] and Zhu and Lee [37] proposed inspecting all eigenvectors corresponding to non-zero eigenvalues for more revealing information. Based on Zhu and Lee [37], we consider the following aggregated contribution vector of all eigenvectors corresponding to non-zero eigenvalues. Let $\tilde\zeta_k=\zeta_k/(\zeta_1+\cdots+\zeta_r)$, $\varepsilon_k^2=(\varepsilon_{k1}^2,\ldots,\varepsilon_{kg}^2)^\top$ and $M(0)=\sum_{k=1}^{r}\tilde\zeta_k\varepsilon_k^2$. The $j$th component of $M(0)$, $M(0)_j$, is equal to $\sum_{k=1}^{r}\tilde\zeta_k\varepsilon_{kj}^2$.
The evaluation of influential cases is based on the visual inspection of $\{M(0)_j,j=1,\ldots,g\}$ plotted against the index $j$. The $j$th case may be regarded as influential if $M(0)_j$ is larger than a reference value (benchmark).

The inconvenience involved in the use of the normal curvature lies in deciding the influence of the observations, since $C_{f_Q,v}(\theta)$ may assume any value and is not invariant under a uniform change of scale. Zhu and Lee [37] considered the conformal normal curvature $B_{f_Q,v}(\theta)=C_{f_Q,v}(\theta)/\mathrm{tr}[-2\ddot Q_{\omega_0}]$, which has the useful property $0\le B_{f_Q,v}(\theta)\le 1$ for any unit direction $v$. Now, let $v_j$ be a basic perturbation vector with the $j$th entry equal to 1 and zeros elsewhere. Zhu and Lee [37] showed that $M(0)_j=B_{f_Q,v_j}$ for all $j$. Hence, $M(0)_j$ can be obtained from $B_{f_Q,v_j}$, whose computation is very simple. We refer the reader to Zhu and Lee [37] for other theoretical properties of $B_{f_Q,v_j}$, such as invariance under reparameterizations of $\theta$. Lee and Xu [24] propose to use $1/m+c\,SM(0)$ as a benchmark for establishing the $j$th case as influential, where $c$ is a selected constant that may be chosen suitably and $SM(0)$ is the standard deviation of $\{M(0)_j,j=1,\ldots,g\}$. In this paper, we consider $c=3$ unless otherwise indicated.
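Because $M(0)_j=B_{f_Q,v_j}$ and $v_j$ is a basis vector, each $M(0)_j$ is simply the $j$th diagonal element of $-2\ddot Q_{\omega_0}$ divided by its trace. The sketch below uses that shortcut. It takes the matrix as a list of lists, and it assumes that the leading term of the benchmark is the mean of the $M(0)_j$ (which equals $1/g$, since the components sum to one); both the function names and that reading of the benchmark are assumptions of this sketch.

```python
import math


def conformal_contributions(neg2_Q):
    """M(0)_j = (-2 Q''_{omega0})_{jj} / tr(-2 Q''_{omega0}) for a g x g list-of-lists matrix."""
    g = len(neg2_Q)
    trace = sum(neg2_Q[j][j] for j in range(g))
    return [neg2_Q[j][j] / trace for j in range(g)]


def influential_cases(M0, c=3.0):
    """Benchmark mean(M0) + c * SD(M0) and the indices that exceed it.

    Using the mean of M0 as the leading term is an assumption of this sketch.
    """
    g = len(M0)
    mean = sum(M0) / g
    sd = math.sqrt(sum((v - mean) ** 2 for v in M0) / (g - 1))
    cut = mean + c * sd
    return cut, [j for j, v in enumerate(M0) if v > cut]
```

The contributions always sum to one, so a case is flagged only when its share of the total curvature is far above the equal-share value.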

In the following sections, we derive the Hessian matrix for the proposed HPLM-SN model, including a brief discussion on the perturbation schemes employed for our development.

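6.2. The Hessian matrix $\ddot Q_\theta(\hat\theta)$

To obtain the diagnostic measures for the local influence of a particular perturbation scheme, it is necessary to compute $\ddot Q_\theta(\hat\theta)=\left.\frac{\partial^2Q_p(\theta\mid\hat\theta)}{\partial\theta\,\partial\theta^\top}\right|_{\theta=\hat\theta}$, where $\theta=(\beta^\top,\sigma^2,\lambda,\rho^\top,f^\top)^\top$. Hence, the Hessian matrix has elements given by

$$\begin{aligned}
\frac{\partial^2Q_p(\theta\mid\hat\theta)}{\partial\beta\,\partial\beta^\top}&=-\frac{(1+\lambda^2)}{\sigma^2}X^\top HX,\\
\frac{\partial^2Q_p(\theta\mid\hat\theta)}{\partial\beta\,\partial\sigma^2}&=-\frac{1}{\sigma^4}X^\top H\left[(1+\lambda^2)(y-\mu)-\lambda\hat w\right],\\
\frac{\partial^2Q_p(\theta\mid\hat\theta)}{\partial\beta\,\partial\lambda}&=\frac{1}{\sigma^2}X^\top H\left[2\lambda(y-\mu)-\hat w\right],\\
\frac{\partial^2Q_p(\theta\mid\hat\theta)}{\partial\rho\,\partial\beta^\top}&=\dot H\left[\frac{(1+\lambda^2)}{\sigma^2}D(y-\mu)-\frac{\lambda}{\sigma^2}D(\hat w)\right]X,\\
\frac{\partial^2Q_p(\theta\mid\hat\theta)}{\partial f\,\partial\beta^\top}&=-\frac{1+\lambda^2}{\sigma^2}N^\top HX,\\
\frac{\partial^2Q_p(\theta\mid\hat\theta)}{\partial\rho\,\partial\sigma^2}&=\dot H\left[\frac{1}{2\sigma^4}D(\widehat{w^2})1_n-\frac{\lambda}{\sigma^4}D(\hat w)(y-\mu)\right],\\
\frac{\partial^2Q_p(\theta\mid\hat\theta)}{\partial\sigma^2\,\partial\sigma^2}&=\frac{n}{\sigma^4}-\frac{1}{\sigma^6}\left[\widehat{w^2}^\top H1_n-2\lambda(y-\mu)^\top H\hat w+(1+\lambda^2)S(\beta,f,\rho)\right],\\
\frac{\partial^2Q_p(\theta\mid\hat\theta)}{\partial\sigma^2\,\partial\lambda}&=\frac{1}{\sigma^4}\left[\lambda S(\beta,f,\rho)-(y-\mu)^\top H\hat w\right],\\
\frac{\partial^2Q_p(\theta\mid\hat\theta)}{\partial f\,\partial\sigma^2}&=-\frac{1}{\sigma^4}N^\top H\left[(1+\lambda^2)(y-\mu)-\lambda\hat w\right],\\
\frac{\partial^2Q_p(\theta\mid\hat\theta)}{\partial\lambda^2}&=-\frac{1}{\sigma^2}S(\beta,f,\rho),\\
\frac{\partial^2Q_p(\theta\mid\hat\theta)}{\partial f\,\partial\lambda}&=\frac{1}{\sigma^2}N^\top H\left[2\lambda(y-\mu)-\hat w\right],\\
\frac{\partial^2Q_p(\theta\mid\hat\theta)}{\partial\rho\,\partial\lambda}&=-\frac{1}{\sigma^2}\dot HD(y-\mu)\left[\lambda(y-\mu)-\hat w\right],\\
\frac{\partial^2Q_p(\theta\mid\hat\theta)}{\partial\rho\,\partial\rho^\top}&=\sum_{i=1}^{n}\dot M_i-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\ddot H_i\left[(1+\lambda^2)(y_i-\mu_i)^2-2\lambda\hat w_i(y_i-\mu_i)+\widehat{w^2}_i\right],\\
\frac{\partial^2Q_p(\theta\mid\hat\theta)}{\partial\rho\,\partial f^\top}&=\frac{1}{\sigma^2}\dot HD\left((1+\lambda^2)(y-\mu)-\lambda\hat w\right)N,\\
\frac{\partial^2Q_p(\theta\mid\hat\theta)}{\partial f\,\partial f^\top}&=-\frac{(1+\lambda^2)}{\sigma^2}N^\top HN-\alpha K,
\end{aligned}$$

where $D(x)$ is the diagonal matrix of the vector $x$, $\dot H=(\dot H_1,\ldots,\dot H_n)$, with $\dot H_i=-\frac{1}{m_i^2}\frac{\partial m_i}{\partial\rho}$, $\dot M_i=\left[-\frac{1}{m_i}\frac{\partial^2m_i}{\partial\rho\,\partial\rho^\top}+\frac{1}{m_i^2}\frac{\partial m_i}{\partial\rho}\frac{\partial m_i}{\partial\rho^\top}\right]$ and $\ddot H_i=-\frac{1}{m_i^2}\frac{\partial^2m_i}{\partial\rho\,\partial\rho^\top}+\frac{2}{m_i^3}\frac{\partial m_i}{\partial\rho}\frac{\partial m_i}{\partial\rho^\top}$.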

6.3. Perturbation schemes

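In this section, we will consider four different perturbation schemes for the HPLM-SN model. For each perturbation scheme, one has the partitioned form $\Delta_{\omega_0}=(\Delta_\beta^\top,\Delta_{\sigma^2}^\top,\Delta_\lambda^\top,\Delta_\rho^\top,\Delta_f^\top)^\top$, where $\Delta_\beta=\left.\frac{\partial^2Q_p(\theta,\omega\mid\hat\theta)}{\partial\beta\,\partial\omega^\top}\right|_{\theta=\hat\theta,\omega=\omega_0}$, $\Delta_{\sigma^2}=\left.\frac{\partial^2Q_p(\theta,\omega\mid\hat\theta)}{\partial\sigma^2\,\partial\omega^\top}\right|_{\theta=\hat\theta,\omega=\omega_0}$, $\Delta_\lambda=\left.\frac{\partial^2Q_p(\theta,\omega\mid\hat\theta)}{\partial\lambda\,\partial\omega^\top}\right|_{\theta=\hat\theta,\omega=\omega_0}$, $\Delta_\rho=\left.\frac{\partial^2Q_p(\theta,\omega\mid\hat\theta)}{\partial\rho\,\partial\omega^\top}\right|_{\theta=\hat\theta,\omega=\omega_0}$ and $\Delta_f=\left.\frac{\partial^2Q_p(\theta,\omega\mid\hat\theta)}{\partial f\,\partial\omega^\top}\right|_{\theta=\hat\theta,\omega=\omega_0}$.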

6.3.1. Case-weight perturbation

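First, consider the following arbitrary allocation of weights for the expected value of the complete-data penalized log-likelihood function (perturbed Q-function), which may capture departures in general directions, given by

$$Q_p(\theta,\omega\mid\hat\theta)=\sum_{i=1}^{n}\omega_iQ_i(\theta\mid\hat\theta)-\frac{\alpha}{2}f^\top Kf,$$

where the contribution from the $i$th experimental unit to the Q-function is $Q_i(\theta\mid\hat\theta)\propto-\log(\sigma^2)-\log(m_i)-\frac{1}{2m_i\sigma^2}\left[(1+\lambda^2)(y_i-\mu_i)^2-2\lambda\hat w_i(y_i-\mu_i)+\widehat{w^2}_i\right]$, with $\omega=(\omega_1,\ldots,\omega_n)^\top$ an $n\times 1$ vector, $0\le\omega_i\le 1$ for $i=1,\ldots,n$, and $\omega_0=(1,\ldots,1)^\top$. For this perturbation scheme, we find $\Delta_{\omega_0}$ with the following elements:

$$\begin{aligned}
\Delta_\beta&=\frac{1}{\sigma^2}X^\top HD\left(-\lambda\hat w+(1+\lambda^2)(y-\mu)\right),\\
\Delta_{\sigma^2}&=-\frac{1}{\sigma^2}1_n^\top+\frac{1}{2\sigma^4}\left\{(y-\mu)^\top H\left[(1+\lambda^2)D(y-\mu)-2\lambda D(\hat w)\right]+\widehat{w^2}^\top H\right\},\\
\Delta_\lambda&=\frac{1}{\sigma^2}(y-\mu)^\top HD\left(\hat w-\lambda(y-\mu)\right),\\
\Delta_\rho&=-M-\frac{1}{2\sigma^2}\dot H\left\{D(y-\mu)\left[(1+\lambda^2)D(y-\mu)-2\lambda D(\hat w)\right]+D(\widehat{w^2})\right\},\\
\Delta_f&=\frac{1}{\sigma^2}N^\top HD\left(-\lambda\hat w+(1+\lambda^2)(y-\mu)\right),
\end{aligned}$$

where $M=(M_1,\ldots,M_n)$, with $M_i=\frac{1}{m_i}\frac{\partial m_i}{\partial\rho}$.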

6.3.2. Response variable perturbation

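A perturbation of the response variables $y=(y_1,\ldots,y_n)^\top$ is defined as $y_\omega=y+S_y\omega$, where $S_y$ is the standard deviation of $y$. In this case, $\omega_0=0\in\mathbb{R}^n$ and the perturbed Q-function is as in Equations (12)–(13), with $y$ replaced by $y_\omega$. It follows that the matrix $\Delta_{\omega_0}$ has the following elements:

$$\begin{aligned}
\Delta_\beta&=\frac{S_y}{\sigma^2}\left[(1+\lambda^2)X^\top H\right],\\
\Delta_{\sigma^2}&=\frac{S_y}{\sigma^4}\left[(1+\lambda^2)(y-\mu)-\lambda\hat w\right]^\top H,\\
\Delta_\lambda&=\frac{S_y}{\sigma^2}\left[\hat w-2\lambda(y-\mu)\right]^\top H,\\
\Delta_\rho&=\frac{S_y}{\sigma^2}\dot H\left[-(1+\lambda^2)D(y-\mu)+\lambda D(\hat w)\right],\\
\Delta_f&=\frac{S_y}{\sigma^2}\left[(1+\lambda^2)N^\top H\right].
\end{aligned}$$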

6.3.3. Explanatory variable perturbation

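In this section, we consider the influence that a perturbation of a specific continuous explanatory variable may have on the parameter estimates. A perturbation of the explanatory variable $x_r$ is defined as $x_{r\omega}=x_r+S_r\omega$, $r\in\{1,\ldots,p\}$, where $S_r$ is the standard deviation of the explanatory variable $x_r$. In this case, $\omega_0=0\in\mathbb{R}^n$ and the perturbed Q-function is as in Equations (12)–(13), with $X$ replaced by $X_\omega$. Consequently, the matrix $\Delta_{\omega_0}$ has the following elements:

$$\begin{aligned}
\Delta_\beta&=\frac{S_r}{\sigma^2}\left\{(1+\lambda^2)\left[I_r^0(y-\mu)^\top-\beta_rX^\top\right]-\lambda I_r^0\hat w^\top\right\}H,\\
\Delta_{\sigma^2}&=\frac{\beta_rS_r}{\sigma^4}\left[\lambda\hat w-(1+\lambda^2)(y-\mu)\right]^\top H,\\
\Delta_\lambda&=\frac{\beta_rS_r}{\sigma^2}\left[-\hat w+2\lambda(y-\mu)\right]^\top H,\\
\Delta_\rho&=\frac{\beta_rS_r}{\sigma^2}\dot H\left[(1+\lambda^2)D(y-\mu)-\lambda D(\hat w)\right],\\
\Delta_f&=-\frac{\beta_rS_r}{\sigma^2}(1+\lambda^2)N^\top H,
\end{aligned}$$

where $I_r^0$ denotes a $p\times 1$ vector of zeros with one in the $r$th position.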

6.3.4. Perturbation of the skewness parameter

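Consider the perturbed model $Y_i=x_i^\top\beta+f(t_i)+\epsilon_i$, $i=1,\ldots,n$, with $\epsilon_i\sim\mathrm{SN}(0,\sigma_i^2,\lambda_i)$, $\sigma_i^2=\sigma^2m_i(\rho,z_i)$ and $\lambda_i=\lambda s_i(\omega,z_i)$, where $s_i=s(\omega,z_i)$ is a known positive continuously differentiable function, $z_i$ contains values of the explanatory variables, which constitute in general, although not necessarily, a subset of $x_i$, and $\omega$ is an $l\times 1$ perturbation vector. It is assumed that there is an $\omega_0$ such that $s(\omega_0,z_i)=1$ for all $i=1,\ldots,n$. The perturbed Q-function is as in Equations (12)–(13), with $\lambda$ replaced by $\lambda_i$. It follows that the matrix $\Delta_{\omega_0}$ has the following elements:

$$\begin{aligned}
\Delta_\beta&=\frac{\lambda}{\sigma^2}X^\top H\left[2\lambda D(y-\mu)-D(\hat w)\right],\\
\Delta_{\sigma^2}&=\frac{\lambda}{\sigma^4}(y-\mu)^\top H\left[\lambda D(y-\mu)-D(\hat w)\right],\\
\Delta_\lambda&=\frac{1}{\sigma^2}(y-\mu)^\top H\left[D(\hat w)-2\lambda D(y-\mu)\right],\\
\Delta_\rho&=-\frac{\lambda}{\sigma^2}\dot HD(y-\mu)\left[D(y-\mu)-D(\hat w)\right],\\
\Delta_f&=\frac{\lambda}{\sigma^2}N^\top H\left[2\lambda D(y-\mu)-D(\hat w)\right].
\end{aligned}$$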

In the next section, for simplicity, we consider the diagnostics for the scale parameter in the HPLM-SN model. However, the method proposed here can be used to test for homogeneity of any parameter involved in the variance, as discussed by Xie et al. [34] and Zeller et al. [36].

7. Likelihood ratio test for homogeneity in the HPLM-SN model

The HPLM-SN model defined in Equations (5)–(6) supposes that the variance of the model is not constant, with the scale parameter given by $\sigma_i^2=\sigma^2m_i$, where $m_i=m(\rho,z_i)$. If the variance depends on the values of some explanatory variables $z_i$, some specific forms of $m_i$ are usually taken to model the varying dispersion: (i) the log-linear model $m_i(\rho,z_i)=\exp\left(\sum_{j=1}^{p}\rho_jz_{ij}\right)$ and (ii) the power product model $m_i(\rho,z_i)=\prod_{j=1}^{p}z_{ij}^{\rho_j}=\exp\left(\sum_{j=1}^{p}\rho_j\log(z_{ij})\right)$; see [11,22,34] and references therein for more details. Of course, (ii) requires that the $z_{ij}$ be strictly positive, while no such restriction is needed for (i). Furthermore, it is assumed that there is a unique value $\rho=\rho_0$ such that $m_i(\rho_0,z_i)=1$ for all $i=1,\ldots,n$; then $\sigma_i^2=\sigma^2$ and the $Y_i$'s have constant variance. Hence the test for homogeneity of the scale parameter is equivalent to testing the following hypotheses

$$H_0:\rho=\rho_0\quad\text{vs}\quad H_1:\rho\ne\rho_0.$$

In this article, we use a Likelihood Ratio (LR) test statistic to check $H_0$, where $LR=-2\left(\ell_p(\hat\theta,\hat\alpha)-\ell_p(\tilde\theta,\tilde\alpha)\right)$, and $(\hat\theta,\hat\alpha)$ and $(\tilde\theta,\tilde\alpha)$ are the restricted (under $H_0$) and unrestricted ML estimators of $(\theta,\alpha)$, respectively. When $H_0$ is true, the statistic $LR$ is asymptotically distributed as $\chi_p^2$.
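Once the restricted and unrestricted penalized log-likelihoods are available, the test reduces to comparing the LR statistic with a chi-square distribution. The sketch below computes the statistic and an asymptotic p-value using only the standard library. The chi-square survival function is built from the standard recurrence for integer degrees of freedom; the function names are hypothetical, and `df` stands for the dimension of $\rho$.

```python
import math


def chi2_sf(x, df):
    """P(chi2_df > x) for integer df >= 1.

    Uses Q(x; 1) = erfc(sqrt(x/2)), Q(x; 2) = exp(-x/2) and the recurrence
    Q(x; k+2) = Q(x; k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1).
    """
    if x <= 0.0:
        return 1.0
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(x / 2.0)), 1
    else:
        q, k = math.exp(-x / 2.0), 2
    while k < df:
        q += math.exp((k / 2.0) * math.log(x / 2.0) - x / 2.0 - math.lgamma(k / 2.0 + 1.0))
        k += 2
    return q


def lr_test(loglik_restricted, loglik_unrestricted, df):
    """LR = -2 * (restricted - unrestricted) penalized log-likelihood and its asymptotic p-value."""
    lr = -2.0 * (loglik_restricted - loglik_unrestricted)
    return lr, chi2_sf(lr, df)
```

For example, `lr_test(-105.0, -100.0, 1)` gives a statistic of 10.0 with a p-value below 0.01, which would lead to rejecting homogeneity at the usual levels.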

8. Simulation studies

In this section, we present four simulation studies. The first study evaluates the performance of the ML estimates of the HPLM-SN model parameters determined from the ECM algorithm. In the second and third studies, the performance of the asymptotic distribution and the power of the LR test statistic are examined. The fourth simulation study also evaluates the performance of the proposed test by providing evidence regarding its behavior when the underlying structure function is misspecified. In all simulation studies, we used the method of Eilers and Marx [10] to construct the matrices $N$ and $K$, with the number of knots being 14 and 12 for scenarios 1 and 2, respectively, as described in Section 8.1.

8.1. Study I: parameter recovery

In this subsection, we consider two scenarios for simulation in order to verify whether we can estimate the true parameter values accurately by using the proposed estimation method. This is the first step to ensure that the estimation procedure works satisfactorily. We fit the HPLM-SN model defined in Section 3.2 to data that were artificially generated from model (5)–(6), where $f(t_i)=\cos(t_i)$, $t_i\in(-3\pi,3\pi)$ (scenario 1) or $f(t_i)=\cos(4\pi t_i)\exp(-t_i^2/2)$, $t_i\in(0.6,1.6)$ (Doppler effect, scenario 2), such that we assume equidistant values for $t_i$ in each specific interval, $x_i\sim U(0.2,2)$, $\beta=(\beta_0,\beta_1)^\top=(0,2)^\top$, $\sigma^2=0.1$, $\lambda=3$ and $\rho=0.1$, with $m_i=e^{\rho x_i}$ or $m_i=x_i^\rho$, $i=1,\ldots,n$.
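The data-generating step for scenario 1 can be sketched by combining model (5)–(6) with the stochastic representation (3). Everything below is an illustration under stated assumptions: the function name is hypothetical, the equidistant $t_i$ are placed strictly inside $(-3\pi,3\pi)$, the intercept is taken as zero, and all parameter values are passed as arguments rather than fixed.

```python
import math
import random


def simulate_hplm_sn(n, beta1, sigma2, lam, rho, rng, variance="loglinear"):
    """Generate (t, x, y) from scenario 1: y_i = beta1*x_i + cos(t_i) + eps_i.

    eps_i ~ SN(0, sigma2 * m_i, lam), drawn via the stochastic representation (3),
    with m_i = exp(rho * x_i) ("loglinear") or m_i = x_i ** rho (any other value).
    """
    delta = lam / math.sqrt(1.0 + lam * lam)
    # Equidistant points strictly inside (-3*pi, 3*pi); the exact grid is an assumption.
    t = [-3.0 * math.pi + 6.0 * math.pi * (i + 1) / (n + 1) for i in range(n)]
    x, y = [], []
    for ti in t:
        xi = rng.uniform(0.2, 2.0)
        mi = math.exp(rho * xi) if variance == "loglinear" else xi ** rho
        si = math.sqrt(sigma2 * mi)
        t0 = rng.gauss(0.0, 1.0)
        t1 = rng.gauss(0.0, 1.0)
        eps = si * (delta * abs(t0) + math.sqrt(1.0 - delta * delta) * t1)
        x.append(xi)
        y.append(beta1 * xi + math.cos(ti) + eps)
    return t, x, y
```

Passing a seeded `random.Random` instance as `rng` makes each replicate reproducible, which is convenient when repeating the experiment over many samples.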

We generated 2000 samples from each scenario, for n = 200, 500 and 1000. The average values (mean) and the corresponding standard deviations (SD) of the estimates made by the ECM algorithm in all samples are presented in Tables 14. Moreover, these tables contain approximate standard errors (SE) calculated via the observed information matrix for (β1,σ2,λ,ρ); see Appendix 1 for more details.
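A minimal R sketch of the data-generating step for scenario 1 is given below. It rests on assumptions that this section does not spell out: the response is taken to be $y_i = \beta_1 x_i + f(t_i) + \epsilon_i$ with $\epsilon_i \sim SN(0, \sigma^2 m_i, \lambda)$, and skew-normal draws use the standard stochastic representation $\delta |U_0| + \sqrt{1 - \delta^2}\, U_1$ with $\delta = \lambda/\sqrt{1 + \lambda^2}$. The names `rsn_sketch` and `simulate_hplm_sn` are hypothetical.

```r
# Draw n variates from SN(0, scale^2, lambda) via the stochastic representation
# Z = delta * |U0| + sqrt(1 - delta^2) * U1, with U0, U1 independent N(0, 1).
rsn_sketch <- function(n, scale, lambda) {
  delta <- lambda / sqrt(1 + lambda^2)
  u0 <- abs(rnorm(n))
  u1 <- rnorm(n)
  scale * (delta * u0 + sqrt(1 - delta^2) * u1)
}

# Simulate one sample from scenario 1 (f(t) = cos(t), beta0 = 0).
# The additive model form is an assumption made for this sketch.
simulate_hplm_sn <- function(n, beta1 = 2, sigma2 = 0.1, lambda = 3, rho = 0.1,
                             dispersion = c("loglinear", "power")) {
  dispersion <- match.arg(dispersion)
  # n equidistant points strictly inside (-3*pi, 3*pi)
  t <- seq(-3 * pi, 3 * pi, length.out = n + 2)[-c(1, n + 2)]
  x <- runif(n, 0.2, 2)
  m <- if (dispersion == "loglinear") exp(rho * x) else x^rho
  eps <- rsn_sketch(n, scale = sqrt(sigma2 * m), lambda = lambda)
  data.frame(y = beta1 * x + cos(t) + eps, x = x, t = t, m = m)
}

set.seed(2022)
dat <- simulate_hplm_sn(200)
head(dat)
```

Repeating the call 2000 times and fitting the model to each sample would mirror the Monte Carlo design described above; the fitting step (the ECM algorithm) is not reproduced here.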

Table 2.

Mean and standard deviation (SD) of the ECM estimates based on 2000 samples from the HPLM-SN model in scenario 2 with $m_i = e^{\rho x_i}$. SE is the average of the estimated standard errors.

                        n = 200                 n = 500                 n = 1000
Parameter  True value   Mean   SD     SE        Mean   SD     SE        Mean   SD     SE
β1         2.0          1.942  0.948  0.559     1.928  0.554  0.415     1.994  0.382  0.309
σ²         0.6          0.551  0.185  0.096     0.566  0.098  0.069     0.577  0.069  0.054
λ          3.0          3.314  0.953  0.471     3.107  0.470  0.335     3.058  0.307  0.244
ρ          4.6          4.693  0.246  0.126     4.651  0.128  0.091     4.631  0.086  0.069

Table 3.

Mean and standard deviation (SD) of the ECM estimates based on 2000 samples from the HPLM-SN model in scenario 1 with $m_i = x_i^{\rho}$. SE is the average of the estimated standard errors.

                        n = 200                 n = 500                 n = 1000
Parameter  True value   Mean   SD     SE        Mean   SD     SE        Mean   SD     SE
β1         2.0          1.951  0.183  0.285     1.982  0.062  0.059     1.993  0.043  0.042
σ²         1.0          0.962  0.214  0.157     0.984  0.097  0.090     0.991  0.069  0.064
λ          3.0          4.445  2.883  2.558     3.223  0.627  0.558     3.082  0.375  0.358
ρ          1.6          1.822  0.234  0.211     1.672  0.120  0.115     1.623  0.079  0.081

Table 1.

Mean and standard deviation (SD) of the ECM estimates based on 2000 samples from the HPLM-SN model in scenario 1 with $m_i = e^{\rho x_i}$. SE is the average of the estimated standard errors.

                        n = 200                 n = 500                 n = 1000
Parameter  True value   Mean   SD     SE        Mean   SD     SE        Mean   SD     SE
β1         2.0          1.816  1.071  0.777     1.882  0.544  0.478     1.963  0.390  0.348
σ²         0.6          0.483  0.180  0.109     0.545  0.093  0.077     0.569  0.067  0.061
λ          3.0          3.572  1.443  0.782     3.156  0.471  0.471     3.075  0.311  0.276
ρ          4.6          4.792  0.260  0.176     4.674  0.125  0.125     4.640  0.086  0.079

Table 4.

Mean and standard deviation (SD) of the ECM estimates based on 2000 samples from the HPLM-SN model in scenario 2 with $m_i = x_i^{\rho}$. SE is the average of the estimated standard errors.

                        n = 200                 n = 500                 n = 1000
Parameter  True value   Mean   SD     SE        Mean   SD     SE        Mean   SD     SE
β1         2.0          1.956  0.190  0.257     1.980  0.065  0.066     1.989  0.042  0.044
σ²         1.0          0.962  0.209  0.166     0.979  0.098  0.096     0.991  0.068  0.068
λ          3.0          4.393  2.871  2.594     3.212  0.626  0.601     3.085  0.376  0.378
ρ          1.6          1.811  0.248  0.231     1.675  0.129  0.131     1.639  0.085  0.086

The point estimates are close to the true values in all scenarios considered, and the bias shrinks as $n$ grows; the largest discrepancy occurs for $\lambda$ at n = 200 under the power product structure (Tables 3 and 4). Thus, the results suggest that the proposed algorithm produces satisfactory estimates. In addition, the SD of the estimates and the average SE become closer to each other and both decrease as the sample size increases, which suggests that the standard errors computed from the observed information matrix are reliable.

Finally, Figure 2 shows the 2000 estimated functions $f(\cdot)$ of the nonparametric component for the two scenarios under the log-linear dispersion structure. The fitted curves track the true curves closely in both scenarios. Results under the power product dispersion structure are similar, so the corresponding figure is omitted to save space. As the sample size increases, the variability among the estimates of the nonparametric function decreases and their mean moves closer to the true values, for both scenarios and both dispersion structures. This is an indication of the consistency of the nonparametric estimator.

Figure 2.

Plots of the nonparametric components with 2000 replications. Fitted curves (gray lines) and true curves (black lines): scenario 1 (first column) and scenario 2 (second column), with $m_i = e^{\rho x_i}$. (a and b) n = 200; (c and d) n = 500; (e and f) n = 1000.

8.2. Study II: the empirical distribution of the LR test statistic

In this subsection, we examine how well the asymptotic distribution approximates the distribution of the LR test statistic, following the procedure described in [13,34]. To this end, we compare the empirical distribution with the theoretical distribution via Monte Carlo simulation.

The design considered in this simulation study is the same as scenario 1 of Study I, but with $\rho = 0$ (under $H_0$), $\beta = (\beta_0, \beta_1)^\top = (0, 2)^\top$, $\sigma^2 = 0.1$, $\lambda = 3$ and n = 50, 100, 200, 300, 400, 500, 600 and 700. As suggested by Cook and Weisberg [8], the power function and the exponential function are usually employed in practice. Thus, we assume that $m_i = e^{\rho x_i}$ and $m_i = x_i^{\rho}$. Each simulated case was replicated 2000 times, with the values of the explanatory variable $x$ kept fixed throughout the simulations. Under $H_0$, the LR test statistic is expected to follow a $\chi^2_1$ distribution. Using the 2000 values of the LR statistic, we obtained the empirical distribution function (edf). Figure 3 compares the edf of the LR statistic with the theoretical $\chi^2_1$ distribution for n = 100, 300 and 700. As $n$ increases, the edfs become very close to the theoretical distribution for the model considered in our study.

Figure 3.

Simulated comparisons between the empirical distributions of the LR statistic and the $\chi^2_1$ distribution for n = 100 (a) and (d), n = 300 (b) and (e), n = 700 (c) and (f). First row: exponential function; second row: power function.
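The comparison shown in Figure 3 is graphical. As a complement, the short R sketch below computes one numeric summary: the largest absolute gap between the edf and the $\chi^2_1$ distribution function, evaluated at the observed statistics. This summary is not used in the paper, the function name `edf_max_gap` is hypothetical, and the vector `lr_placeholder` is drawn directly from a $\chi^2_1$ distribution as a stand-in for the 2000 simulated LR statistics.

```r
# Largest absolute difference between the empirical distribution function of
# the LR statistics and the chi-square(df) cdf, evaluated at the observed values.
edf_max_gap <- function(lr, df = 1) {
  lr_sorted <- sort(lr)
  edf <- ecdf(lr)(lr_sorted)
  max(abs(edf - pchisq(lr_sorted, df)))
}

set.seed(123)
lr_placeholder <- rchisq(2000, df = 1)  # stand-in for simulated LR statistics
edf_max_gap(lr_placeholder)
```

A small gap indicates that the asymptotic approximation is adequate at that sample size.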

8.3. Study III: the empirical power of the LR test

To gain insight into the performance of the homogeneity LR test in the HPLM-SN model, we performed a simulation study and examined the power function for various values of $\rho$. We use the same parameter settings as in the previous experiment, with $\rho = 0, 0.2, 0.4, 0.6, 0.8$ and $1.0$. The sample sizes n = 50, 100, 200, 300, 400, 500, 600 and 700 were chosen to evaluate the behavior of the test for small and midsize samples. Each simulated case was replicated 2000 times, with the values of the explanatory variable $x$ kept fixed throughout the simulations.

Tables 5 and 6 report the empirical type I error probability (under the null hypothesis) and the empirical power of the LR test under the alternative hypotheses at $\alpha = 0.05$, i.e. the proportion of replications in which the statistic exceeds the upper 5% point of the reference $\chi^2_1$ distribution. The significance level $\alpha = 0.05$ was chosen to reflect the usual practice in statistical studies. From Tables 5 and 6, the empirical size is inflated for small samples but is close to the nominal level from about n = 300 onward, and the test is successful in detecting heteroscedasticity in the scale parameter for the model considered in our study. As expected, the performance of the test statistic improves with increasing $n$: as the sample size and $\rho$ increase, the empirical power increases, approaching 1. As pointed out by Xie et al. [34], the score test statistic is not very sensitive to the functional form in the test for homogeneity of the variance parameter. This might also be true for the LR test statistic in our study, since the results in Tables 5 and 6 are quite similar.
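The entries of Tables 5 and 6 are rejection proportions. A minimal R sketch of that computation follows; the function name `rejection_rate` is hypothetical, and the three LR values in the usage line are made-up numbers chosen only to show the arithmetic.

```r
# Proportion of LR statistics exceeding the upper alpha point of chi-square(df).
# Under H0 this estimates the type I error; under H1 it estimates the power.
rejection_rate <- function(lr, alpha = 0.05, df = 1) {
  mean(lr > qchisq(1 - alpha, df))
}

# Made-up LR values: two of the three exceed qchisq(0.95, 1), about 3.84.
rejection_rate(c(1, 4, 5))
```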

Table 5.

Empirical type I error probability (under $H_0$, $\rho = 0$) and empirical power of the LR test under $H_1$, assuming $m_i = e^{\rho x_i}$.

n      ρ = 0.0   ρ = 0.2   ρ = 0.4   ρ = 0.6   ρ = 0.8   ρ = 1.0
50     0.342     0.357     0.387     0.547     0.522     0.716
100    0.141     0.180     0.312     0.570     0.718     0.861
200    0.084     0.194     0.569     0.859     0.971     0.997
300    0.062     0.232     0.637     0.928     0.995     1.000
400    0.069     0.295     0.809     0.983     1.000     1.000
500    0.050     0.357     0.887     0.992     1.000     1.000
600    0.051     0.379     0.929     1.000     1.000     1.000
700    0.057     0.454     0.961     1.000     1.000     1.000

Table 6.

Empirical type I error probability (under $H_0$, $\rho = 0$) and empirical power of the LR test under $H_1$, assuming $m_i = x_i^{\rho}$.

n      ρ = 0.0   ρ = 0.2   ρ = 0.4   ρ = 0.6   ρ = 0.8   ρ = 1.0
50     0.378     0.377     0.413     0.561     0.528     0.685
100    0.136     0.184     0.304     0.556     0.684     0.814
200    0.083     0.199     0.544     0.832     0.965     0.992
300    0.058     0.221     0.620     0.921     0.994     1.000
400    0.056     0.336     0.862     0.988     1.000     1.000
500    0.055     0.367     0.909     1.000     1.000     1.000
600    0.055     0.377     0.909     1.000     1.000     1.000
700    0.060     0.427     0.949     1.000     1.000     1.000

8.4. Study IV: misspecification of the structure function

We report a simulation study to analyze the influence of misspecification of the structure function. The design is the same as scenario 1 of Study I, with data generated under $m_i = e^{\rho x_i}$ (case 1) or $m_i = x_i^{\rho}$ (case 2), but varying $\sigma^2 \in \{0.1, 1\}$ and $\rho \in \{0.1, 1, 5\}$. We generate 2000 Monte Carlo samples of size n = 1000 and compute the coverage rate (CR), given by the proportion of samples in which the 95% confidence interval contains the true parameter value, and the bias, given by the difference between the mean of the estimates and the true value of the parameter. We fit the model with the following structure functions for $m_i$: $m_i = 1$ (homoscedasticity), $m_i = e^{\rho x_i}$ and $m_i = x_i^{\rho}$.

For the CR we expect a value close to 95%, and for the bias a value close to 0. According to Tables 7 and 8, the true structure function generally meets these expectations in terms of both CR and bias, although its coverage for $\beta$ falls below the nominal level when $\rho = 5$ in Table 7. On the other hand, the other specifications of $m_i$ in general present a relatively large bias and a low CR in at least one parameter.
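The two summaries can be computed as in the following R sketch. It assumes Wald-type intervals, estimate ± normal quantile × SE, which this section does not state explicitly; the function name `coverage_and_bias` and the three estimates in the usage line are hypothetical.

```r
# Coverage rate (in %) of Wald-type confidence intervals and Monte Carlo bias.
# est, se: vectors of estimates and standard errors, one entry per replication.
coverage_and_bias <- function(est, se, truth, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  covered <- (est - z * se <= truth) & (truth <= est + z * se)
  c(CR = 100 * mean(covered), bias = mean(est) - truth)
}

# Made-up example: two of the three intervals contain the true value 2.
coverage_and_bias(est = c(1.9, 2.1, 3.0), se = c(0.1, 0.1, 0.1), truth = 2)
```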

Table 7.

Coverage rates (CR, in %) at the nominal level of 95% and bias for different fitted structure functions when the data are generated with $m_i = e^{\rho x_i}$ (true values of the parameters are in parentheses). A dash marks a cell with no entry, since $\rho$ is not estimated when $m_i = 1$.

                 m_i = 1              m_i = e^{ρ x_i}      m_i = x_i^ρ
Parameter        CR      bias         CR      bias         CR      bias
β (2)            93.85   0.005        96.25   −0.061       96.49   −0.127
σ² (0.1)         69.50   0.011        94.05   0.001        73.05   0.008
λ (3)            94.80   0.086        96.00   −0.035       96.09   −0.141
ρ (0.1)          -       -            92.25   0.016        86.27   0.009

β (2)            2.00    0.096        95.40   −0.055       93.93   −0.125
σ² (0.1)         0.00    0.235        92.45   0.001        8.72    0.204
λ (3)            93.20   0.079        96.05   −0.017       96.09   −0.086
ρ (1)            -       -            95.10   0.006        44.86   −0.165

β (2)            0.00    5.965        87.00   −0.279       18.16   −1.253
σ² (0.1)         0.00    263.417      91.85   0.010        18.41   36.800
λ (3)            27.35   0.940        96.15   −0.167       27.80   1.343
ρ (5)            -       -            94.30   −0.562       2.01    −1.350

β (2)            93.00   0.016        95.65   −0.075       95.64   −0.135
σ² (1)           69.35   0.111        93.30   0.039        73.63   0.044
λ (3)            93.95   0.122        96.34   −0.014       95.99   −0.107
ρ (0.1)          -       -            91.75   0.021        87.02   0.016

β (2)            1.80    0.302        94.80   −0.091       94.49   −0.097
σ² (1)           0.00    2.354        93.00   −0.056       20.46   1.761
λ (3)            93.30   0.112        95.25   −0.021       87.21   −0.374
ρ (1)            -       -            95.70   −0.004       44.43   −0.172

β (2)            0.00    18.438       73.20   −0.377       27.22   −3.129
σ² (1)           0.00    2620.733     91.35   −0.144       27.07   337.696
λ (3)            27.15   0.963        95.65   6.413        14.17   38.826
ρ (5)            -       -            94.05   −0.695       1.32    −1.570

Table 8.

Coverage rates (CR, in %) at the nominal level of 95% and bias for different fitted structure functions when the data are generated with $m_i = x_i^{\rho}$ (true values of the parameters are in parentheses). A dash marks a cell with no entry, since $\rho$ is not estimated when $m_i = 1$.

                 m_i = 1              m_i = e^{ρ x_i}      m_i = x_i^ρ
Parameter        CR      bias         CR      bias         CR      bias
β (2)            89.10   0.008        92.65   0.001        92.24   0.001
σ² (0.1)         94.40   −0.001       72.45   −0.012       91.70   −0.001
λ (3)            95.75   0.080        94.10   0.052        93.90   0.050
ρ (0.1)          -       -            93.25   0.011        92.40   0.003

β (2)            1.20    0.058        92.50   0.005        92.30   0.001
σ² (0.1)         83.35   0.007        0.00    −0.070       91.70   −0.001
λ (3)            92.70   0.085        92.75   0.089        92.90   0.043
ρ (1)            -       -            89.85   0.070        92.40   0.015

β (2)            0.00    0.094        89.60   0.011        93.10   0.001
σ² (0.1)         15.25   0.026        0.00    −0.085       91.40   −0.001
λ (3)            89.75   0.086        93.20   0.096        92.55   0.037
ρ (5)            -       -            87.45   0.092        92.25   0.018

β (2)            91.20   0.019        92.75   −0.001       92.35   −0.001
σ² (1)           93.85   −0.013       70.10   −0.130       91.10   −0.014
λ (3)            94.80   0.099        92.90   0.071        92.80   0.069
ρ (0.1)          -       -            91.60   0.013        91.95   0.005

β (2)            1.00    0.182        93.25   0.012        93.20   −0.003
σ² (1)           79.45   0.078        0.00    −0.708       91.20   −0.016
λ (3)            92.65   0.106        92.20   0.125        93.10   0.081
ρ (1)            -       -            87.70   0.081        92.45   0.020

β (2)            0.00    0.293        90.30   0.028        91.20   −0.007
σ² (1)           15.95   0.256        0.00    −0.854       90.70   −0.009
λ (3)            89.95   0.098        92.45   0.140        92.90   0.083
ρ (5)            -       -            84.45   0.102        90.55   0.032

9. Application

As suggested by the analysis presented in Section 2, we generalize model (1) by considering heteroscedastic errors, i.e. $\epsilon_i \sim SN(0, \sigma_i^2, \lambda)$, where $\sigma_i^2 = \sigma^2 m_i(\rho, z_i)$, with $z_i = (\text{temperature}_i - a)/(b - a)$, $a = \min\{\text{temperature}\}$ and $b = \max\{\text{temperature}\}$.

Following [34], to test the homogeneity of the scale parameter using the LR statistic given in Section 7, we first assume $m_i(\rho, z_i) = e^{\rho z_i}$ for simplicity. It is easily seen that when $\rho = 0$, then $m_i = 1$ for all $i$. We obtain LR = 118.687 (p-value ≈ 0), which indicates significant evidence of a varying scale parameter and consequently of heterogeneity in the ragweed data set. Alternatively, we assume that $m_i(\rho, z_i) = z_i^{\rho}$. The test is still $H_0: \rho = 0$, and a similar computation gives LR = 91.278 (p-value ≈ 0). These results are therefore similar to those obtained with the exponential function above.
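The two LR values can be reproduced, up to rounding in the last digit, from the maximized penalized log-likelihoods reported in Table 9. The R lines below show the arithmetic; the variable and function names are hypothetical.

```r
# Maximized penalized log-likelihoods as reported in Table 9.
loglik_homo      <- -730.822  # homoscedastic PLM-SN (restricted, rho = 0)
loglik_loglinear <- -671.479  # heteroscedastic, log-linear dispersion
loglik_power     <- -685.183  # heteroscedastic, power product dispersion

# LR = 2 * (unrestricted log-likelihood - restricted log-likelihood)
lr_stat <- function(loglik_unrestricted, loglik_restricted) {
  2 * (loglik_unrestricted - loglik_restricted)
}

lr_loglinear <- lr_stat(loglik_loglinear, loglik_homo)  # about 118.69
lr_power     <- lr_stat(loglik_power, loglik_homo)      # about 91.28

# p-values from the chi-square reference distribution with 1 degree of freedom
p_loglinear <- pchisq(lr_loglinear, df = 1, lower.tail = FALSE)
p_power     <- pchisq(lr_power, df = 1, lower.tail = FALSE)
```

Both p-values are far below any conventional significance level, in line with the reported p-value ≈ 0.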

The results of the fit in terms of log-likelihood and BIC are provided in Table 9. Looking at the BIC values, we see that the heteroscedastic PLM-SN models fit the data better than the homoscedastic PLM-SN model. In particular, the best fit is the heteroscedastic PLM-SN model with $m_i(\rho, z_i) = e^{\rho z_i}$. Table 10 summarizes the maximum penalized likelihood estimates (MPLE), including the effective degrees of freedom, for the fitted skew-normal partially linear models. Furthermore, we construct graphs of the standardized residuals $e_{i1}$, $i = 1, \ldots, 335$, defined in Section 5, under the HPLM-SN model (log-linear dispersion). Figures 4(a–c) show that the residuals $e_{i1}$ of the HPLM-SN model (log-linear dispersion) no longer present any trend, confirming that this model is more appropriate for the data set.

Table 9.

Maximized penalized log-likelihood and BIC for the models fitted to the ragweed levels data. The best fit is indicated by (*1).

                                 Heteroscedastic            Heteroscedastic
               Homoscedastic     (log-linear dispersion)    (power product dispersion)
ℓp(θ̂, α)       −730.822          −671.479                   −685.183
BIC            1556.033          1473.77 (*1)               1490.496

Table 10.

MPLE results and approximate standard errors (SE) for the skew-normal partially linear models fitted to the ragweed levels data. A dash marks a cell with no entry.

                Homoscedastic         Heteroscedastic (log-linear dispersion)
Effect          Estimate   SE         Estimate   SE
Rain            1.456      0.382      0.627      0.234
Temperature     0.088      0.0183     −0.006     0.013
Wind            0.228      0.037      0.130      0.025
σ²              9.760      1.362      0.601      0.093
λ               2.189      0.482      2.705      0.615
ρ               -          -          4.620      0.1450
α               211.486    -          96.057     -
df(α)           11.234     -          16.500     -
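The BIC values in Table 9 can be cross-checked against Table 10 with a few lines of R. The formula used here is an assumption, since it is not stated in this section: BIC $= -2\ell_p + \log(n)\,\{k + df(\alpha)\}$, where $k$ counts the parametric terms (three regression coefficients, $\sigma^2$ and $\lambda$, plus $\rho$ in the heteroscedastic model) and $n = 335$. Under that assumption the reported values are recovered to rounding error. The function name `bic_sketch` is hypothetical.

```r
# Assumed form: BIC = -2 * loglik + log(n) * (k + df_alpha),
# with df_alpha the effective degrees of freedom of the smooth term.
bic_sketch <- function(loglik, k, df_alpha, n) {
  -2 * loglik + log(n) * (k + df_alpha)
}

n <- 335  # number of observations in the ragweed data

# Homoscedastic model: Rain, Temperature, Wind, sigma^2, lambda -> k = 5
bic_homo <- bic_sketch(-730.822, k = 5, df_alpha = 11.234, n = n)

# Heteroscedastic model (log-linear dispersion): adds rho -> k = 6
bic_hetero <- bic_sketch(-671.479, k = 6, df_alpha = 16.500, n = n)

round(c(homoscedastic = bic_homo, heteroscedastic = bic_hetero), 2)
```

The power product model is left out because Table 10 does not report its effective degrees of freedom.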

Figure 4.

Residuals ($e_{i1}$) of the heteroscedastic PLM-SN model (exponential function) fitted to the ragweed levels data: (a) plot of residuals, (b) scatter plot of residuals against temperature and (c) scatter plot of residuals against wind speed.

To detect possible outlying observations and to assess the goodness-of-fit of the models, we constructed quantile-quantile plots with simulated 95% confidence bands [1] based on the Mahalanobis distance $d_i^2$, $i = 1, \ldots, 335$. In Figure 5(a), the HPLM-SN model (exponential function) presents no observations outside the confidence bands. Figure 5(b) shows that the fit under heteroscedastic skew-normal errors (exponential function) seems to capture the trend at the end of the season accurately.

Figure 5.

Quantile-quantile plots for the Mahalanobis distance and the 95% pointwise confidence bands for f (Days in season) from the HPLM-SN model (exponential function) fitted to the ragweed levels data.

Next, we identify influential cases in the data set using $M(0)$. Figure 6 presents the index graphs of $M(0)$ for the proposed perturbation schemes. From Figure 6(a), cases 87, 94, 239 and 270 stand out as observations with an outstanding contribution to the log-likelihood function, which may exert a high influence on the maximum-likelihood estimates. Case 87 seems to be the most influential on the ML estimators in the HPLM-SN model (exponential function) under the case-weight scheme. From Figure 6(b), observations 234, 276, 289, 293, 297, 323, 325 and 330 appear as possibly influential under response perturbation, which may indicate observations with a large influence on their own predicted values. Case 297 seems to be the most influential on the MPLE results in the HPLM-SN model (exponential function) under response variable perturbation. We now examine the effects of perturbing a specific explanatory variable, i.e. temperature or wind speed. Figures 6(c–d) show the index plots for perturbation of the temperature and wind speed, respectively. The observations that stand out as influential under response perturbation are also influential under perturbation of the explanatory variable wind speed. In addition, cases 87, 239, 265, 270, 276, 299 and 315 are identified as influential under perturbation of the temperature; among them, observations 239, 270 and 87 were also identified under the case-weight scheme. Finally, from Figure 6(e), observations 88, 96, 115, 276 and 315 appear as possibly influential under skewness perturbation, which may reveal the cases that are most influential, in the sense of likelihood displacement, on the skewness structure and consequently on the $\lambda$ estimate. Cases 276 and 315 seem to be the most influential on the MPLE results in the HPLM-SN model (exponential function) under skewness perturbation, and these observations were also identified as influential under perturbation of the explanatory variable temperature.

Figure 6.

Diagnostic graphs from the HPLM-SN model (exponential function) fitted to the ragweed level data with c=3: (a) case-weight perturbation; (b) response (square root of ragweed) perturbation; (c) explanatory (temperature) perturbation; (d) explanatory (wind speed) perturbation and (e) skewness perturbation.

Observations 87 and 239 present low levels of ragweed (equal to the first quartile of the response variable values) and low temperature (below the median temperature). However, observation 270 presents a high level of ragweed and temperature (above the median of the values of the respective variables). Observation 297 presents a low level of ragweed and wind speed (below the median of the values of the respective variables). Furthermore, observation 276 presents a high level of ragweed and temperature (above the third quartile of the values of the respective variables), while observation 315 has the lowest level of ragweed and high temperature (above the median temperature). Case 276 is in the right tail of the response variable distribution.

10. Conclusions

In this paper, an extension of the skew-normal partially linear model is developed by considering the case where the error terms are independent and follow a skew-normal distribution in the presence of heteroscedasticity. Our proposed model generalizes the recent work of Ferreira and Paula [12]. We developed the maximum likelihood estimator of the parameters based on the ECM algorithm and obtained analytic expressions for the E and M-steps, except for the heterogeneity parameter, which requires a CM-step in the algorithm. Local influence methods were implemented for the HPLM-SN model to evaluate the consequences of model perturbations under different perturbation schemes. In addition, we discussed the Likelihood Ratio test for homogeneity in the HPLM-SN model. To examine the performance and properties of the LR test, simulation studies under several situations were performed, including misspecification of the structure function. Lastly, the HPLM-SN model was applied to a real data set and compared with the homoscedastic version, showing the usefulness of the HPLM-SN model for fitting data sets with nonparametric components in which the responses are asymmetric and heteroscedastic. Finally, the model proposed in this paper can be extended to the context of generalized additive models (GAM), as an alternative that relaxes the assumption that the heteroscedasticity function is known. We thank the anonymous referee for valuable suggestions for future research. The authors hope to report these findings in a future paper.

Acknowledgments

We thank the Associate Editor and referees for their helpful comments and suggestions, leading to the improvement of the paper.

Appendices.

Appendix 1. Approximate standard errors.

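In this appendix, we derive the observed information matrix associated with the parameter vector $\theta$, which is used to calculate the standard errors of the estimate $\hat{\theta}$. Following the same procedure as Segal et al. [31], we obtain the variance-covariance matrix of $\hat{\theta}$ from the inverse of the observed information matrix, treating the penalized likelihood function (8) as a usual likelihood. Given the HPLM-SN model in Equations (5)–(6), the corresponding penalized log-likelihood function of $\theta = (\beta^\top, f^\top, \sigma^2, \lambda, \rho^\top)^\top$ is of the form $\ell_p(\theta) = \sum_{i=1}^{n} \ell_{pi}(\theta)$, with

$$\ell_{pi}(\theta) = \log 2 + \ell_{1pi}(\theta) + \log \Phi(\ell_{2i}(\theta)),$$

where $\ell_{1pi}(\theta) = \log \phi(y_i; \mu_i, \sigma_i^2) - \frac{\alpha}{2n} f^\top K f$ and $\ell_{2i}(\theta) = \lambda (y_i - \mu_i)/\sigma_i$, with $\mu_i = x_i^\top \beta + n_i^\top f$ and $\sigma_i^2 = \sigma^2 m_i(\rho, z_i)$. Thus, the observed information matrix for $\theta$ can be written as

$$I_{\theta\theta} = -\sum_{i=1}^{n} \frac{\partial^2 \ell_{pi}(\theta)}{\partial\theta\,\partial\theta^\top} = I_1(\theta) + I_2(\theta),$$

where

$$I_1(\theta) = -\sum_{i=1}^{n} \frac{\partial^2 \ell_{1pi}(\theta)}{\partial\theta\,\partial\theta^\top}, \qquad I_2(\theta) = -\sum_{i=1}^{n} \left[ W_\Phi(\ell_{2i}(\theta)) \frac{\partial^2 \ell_{2i}(\theta)}{\partial\theta\,\partial\theta^\top} + W'_\Phi(\ell_{2i}(\theta)) \frac{\partial \ell_{2i}(\theta)}{\partial\theta} \frac{\partial \ell_{2i}(\theta)}{\partial\theta^\top} \right],$$

with $W'_\Phi(x) = -W_\Phi(x)\,(x + W_\Phi(x))$ and $W_\Phi(x) = \phi(x)/\Phi(x)$. The first-order and second-order derivatives of $\ell_{1pi}(\theta)$ with respect to $\theta$ can be calculated as follows:

$$\begin{aligned}
\frac{\partial \ell_{1pi}(\theta)}{\partial\beta} &= \frac{(y_i - \mu_i)}{\sigma^2 m_i} x_i; &
\frac{\partial \ell_{1pi}(\theta)}{\partial f} &= \frac{(y_i - \mu_i)}{\sigma^2 m_i} n_i - \frac{\alpha}{n} K f; \\
\frac{\partial \ell_{1pi}(\theta)}{\partial\sigma^2} &= \frac{(d_i - 1)}{2\sigma^2}; &
\frac{\partial \ell_{1pi}(\theta)}{\partial\lambda} &= 0; \\
\frac{\partial \ell_{1pi}(\theta)}{\partial\rho} &= \frac{(d_i - 1)}{2 m_i} \frac{\partial m_i}{\partial\rho}; &
\frac{\partial^2 \ell_{1pi}(\theta)}{\partial\beta\,\partial\beta^\top} &= -\frac{1}{\sigma^2 m_i} x_i x_i^\top; \\
\frac{\partial^2 \ell_{1pi}(\theta)}{\partial f\,\partial\beta^\top} &= -\frac{1}{\sigma^2 m_i} n_i x_i^\top; &
\frac{\partial^2 \ell_{1pi}(\theta)}{\partial\sigma^2\,\partial\beta} &= -\frac{(y_i - \mu_i)}{\sigma^4 m_i} x_i; \\
\frac{\partial^2 \ell_{1pi}(\theta)}{\partial\rho\,\partial\beta^\top} &= -\frac{(y_i - \mu_i)}{\sigma^2 m_i^2} \frac{\partial m_i}{\partial\rho} x_i^\top; &
\frac{\partial^2 \ell_{1pi}(\theta)}{\partial f\,\partial f^\top} &= -\frac{1}{\sigma^2 m_i} n_i n_i^\top - \frac{\alpha}{n} K; \\
\frac{\partial^2 \ell_{1pi}(\theta)}{\partial\sigma^2\,\partial f} &= -\frac{(y_i - \mu_i)}{\sigma^4 m_i} n_i; &
\frac{\partial^2 \ell_{1pi}(\theta)}{\partial\rho\,\partial f^\top} &= -\frac{(y_i - \mu_i)}{\sigma^2 m_i^2} \frac{\partial m_i}{\partial\rho} n_i^\top; \\
\frac{\partial^2 \ell_{1pi}(\theta)}{\partial\sigma^4} &= \frac{1}{2\sigma^4} - \frac{d_i}{\sigma^4}; &
\frac{\partial^2 \ell_{1pi}(\theta)}{\partial\sigma^2\,\partial\rho} &= -\frac{d_i}{2\sigma^2 m_i} \frac{\partial m_i}{\partial\rho}; \\
\frac{\partial^2 \ell_{1pi}(\theta)}{\partial\rho\,\partial\rho^\top} &= \frac{1 - 2 d_i}{2 m_i^2} \frac{\partial m_i}{\partial\rho} \frac{\partial m_i}{\partial\rho^\top} + \frac{d_i - 1}{2 m_i} \frac{\partial^2 m_i}{\partial\rho\,\partial\rho^\top}; &
\frac{\partial^2 \ell_{1pi}(\theta)}{\partial\theta\,\partial\lambda} &= 0,
\end{aligned}$$

with $d_i = \dfrac{(y_i - \mu_i)^2}{\sigma^2 m_i}$. The first-order and second-order derivatives of $\ell_{2i}(\theta)$ with respect to $\theta$ are given by

$$\begin{aligned}
\frac{\partial \ell_{2i}(\theta)}{\partial\beta} &= -\frac{\lambda}{\sigma m_i^{1/2}} x_i; &
\frac{\partial \ell_{2i}(\theta)}{\partial f} &= -\frac{\lambda}{\sigma m_i^{1/2}} n_i; \\
\frac{\partial \ell_{2i}(\theta)}{\partial\sigma^2} &= -\frac{\lambda (y_i - \mu_i)}{2\sigma^3 m_i^{1/2}}; &
\frac{\partial \ell_{2i}(\theta)}{\partial\lambda} &= \frac{(y_i - \mu_i)}{\sigma m_i^{1/2}}; \\
\frac{\partial \ell_{2i}(\theta)}{\partial\rho} &= -\frac{\lambda}{2\sigma} (y_i - \mu_i)\, m_i^{-3/2} \frac{\partial m_i}{\partial\rho}; &
\frac{\partial^2 \ell_{2i}(\theta)}{\partial\beta\,\partial\beta^\top} &= \frac{\partial^2 \ell_{2i}(\theta)}{\partial f\,\partial f^\top} = \frac{\partial^2 \ell_{2i}(\theta)}{\partial\beta\,\partial f^\top} = 0; \\
\frac{\partial^2 \ell_{2i}(\theta)}{\partial\lambda^2} &= 0; &
\frac{\partial^2 \ell_{2i}(\theta)}{\partial\sigma^2\,\partial\beta} &= \frac{\lambda}{2\sigma^3 m_i^{1/2}} x_i; \\
\frac{\partial^2 \ell_{2i}(\theta)}{\partial\sigma^2\,\partial f} &= \frac{\lambda}{2\sigma^3 m_i^{1/2}} n_i; &
\frac{\partial^2 \ell_{2i}(\theta)}{\partial\lambda\,\partial\beta} &= -\frac{1}{\sigma m_i^{1/2}} x_i; \\
\frac{\partial^2 \ell_{2i}(\theta)}{\partial\lambda\,\partial f} &= -\frac{1}{\sigma m_i^{1/2}} n_i; &
\frac{\partial^2 \ell_{2i}(\theta)}{\partial\rho\,\partial\beta^\top} &= \frac{\lambda}{2\sigma m_i^{3/2}} \frac{\partial m_i}{\partial\rho} x_i^\top; \\
\frac{\partial^2 \ell_{2i}(\theta)}{\partial\rho\,\partial f^\top} &= \frac{\lambda}{2\sigma m_i^{3/2}} \frac{\partial m_i}{\partial\rho} n_i^\top; &
\frac{\partial^2 \ell_{2i}(\theta)}{\partial\sigma^4} &= \frac{3\lambda}{4\sigma^5} (y_i - \mu_i)\, m_i^{-1/2}; \\
\frac{\partial^2 \ell_{2i}(\theta)}{\partial\sigma^2\,\partial\lambda} &= -\frac{1}{2\sigma^3} (y_i - \mu_i)\, m_i^{-1/2}; &
\frac{\partial^2 \ell_{2i}(\theta)}{\partial\sigma^2\,\partial\rho} &= \frac{\lambda}{4\sigma^3} (y_i - \mu_i)\, m_i^{-3/2} \frac{\partial m_i}{\partial\rho}; \\
\frac{\partial^2 \ell_{2i}(\theta)}{\partial\lambda\,\partial\rho} &= -\frac{1}{2\sigma} (y_i - \mu_i)\, m_i^{-3/2} \frac{\partial m_i}{\partial\rho}; &
\frac{\partial^2 \ell_{2i}(\theta)}{\partial\rho\,\partial\rho^\top} &= -\frac{\lambda}{2\sigma} (y_i - \mu_i)\, m_i^{-3/2} \left[ -\frac{3}{2 m_i} \frac{\partial m_i}{\partial\rho} \frac{\partial m_i}{\partial\rho^\top} + \frac{\partial^2 m_i}{\partial\rho\,\partial\rho^\top} \right].
\end{aligned}$$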

Appendix 2. Calculus of the matrices N and K.

  • Eilers and Marx's method

Let $t = (t_1, \ldots, t_n)^\top$ and let ndx be the number of equidistant knots desired by the user.

Commands in R [29]:

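require(splines)  # provides splineDesign()

# B-spline basis of degree bdeg on ndx equal intervals over the range of x
bspline = function(x, ndx, bdeg) {
  xl = min(x)
  xr = max(x)
  dx = (xr - xl) / ndx
  # equally spaced knots, extended bdeg intervals beyond each end of the range
  knots = seq(xl - bdeg * dx, xr + bdeg * dx, by = dx)
  # spline order is bdeg + 1; derivs = 0 * x evaluates the basis functions themselves
  B = splineDesign(knots, x, bdeg + 1, 0 * x, outer.ok = TRUE)
  B  # return the basis matrix explicitly
}

B = bspline(t, ndx, 3)      # cubic B-spline basis
D = diag(ncol(B))
for (k in 1:2) D = diff(D)  # second-order difference matrix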

Thus, $N = B$ = bspline(t, ndx, 3) and $K = D^\top D$.

  • Green & Silverman's method:

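Let $t_1^0, \ldots, t_q^0$ be the $q$ distinct and ordered values of $t_i$, $i = 1, \ldots, n$, and let $N$ be an $(n-k) \times q$ incidence matrix whose $(i,j)$th element equals the indicator function $I(t_i = t_j^0)$, $j = 1, \ldots, q$. In addition, define $h_i = t_{i+1}^0 - t_i^0$, for $i = 1, \ldots, q-1$, and let $S$ be the $q \times (q-2)$ matrix with entries $s_{ij}$, for $i = 1, \ldots, q$ and $j = 2, \ldots, q-1$, given by

$$s_{j-1,j} = h_{j-1}^{-1}, \quad s_{jj} = -h_{j-1}^{-1} - h_j^{-1} \quad \text{and} \quad s_{j+1,j} = h_j^{-1}, \quad \text{for } j = 2, \ldots, q-1,$$

and $s_{ij} = 0$ for $|i - j| \geq 2$. Now, let $R$ be a $(q-2) \times (q-2)$ matrix with elements $r_{ij}$ given by

$$r_{ii} = (h_{i-1} + h_i)/3, \ \text{for } i = 2, \ldots, q-1, \qquad r_{i,i+1} = r_{i+1,i} = h_i/6, \ \text{for } i = 2, \ldots, q-2,$$

and $r_{ij} = 0$ for $|i - j| \geq 2$. Then, $K = S R^{-1} S^\top$.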
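In the same spirit as the R commands given for the Eilers and Marx method, the construction above can be sketched as follows. The function name `gs_matrices` is hypothetical, and the incidence matrix is built with one row per supplied observation, which is an assumption of this sketch. Rows and columns of $S$ and $R$ indexed by $j = 2, \ldots, q-1$ in the text are stored in positions $1, \ldots, q-2$.

```r
# Incidence matrix N and penalty matrix K = S R^{-1} S' (Green and Silverman).
gs_matrices <- function(tvals) {
  t0 <- sort(unique(tvals))  # q distinct ordered values
  q <- length(t0)
  stopifnot(q >= 3)
  N <- outer(tvals, t0, "==") * 1  # (i, j) entry is I(t_i = t_j^0)
  h <- diff(t0)                    # h_i = t_{i+1}^0 - t_i^0, i = 1, ..., q - 1
  S <- matrix(0, q, q - 2)
  R <- matrix(0, q - 2, q - 2)
  for (j in 2:(q - 1)) {
    col <- j - 1  # storage position of index j
    S[j - 1, col] <- 1 / h[j - 1]
    S[j, col]     <- -1 / h[j - 1] - 1 / h[j]
    S[j + 1, col] <- 1 / h[j]
    R[col, col]   <- (h[j - 1] + h[j]) / 3
    if (j < q - 1) {
      R[col, col + 1] <- h[j] / 6
      R[col + 1, col] <- h[j] / 6
    }
  }
  K <- S %*% solve(R, t(S))  # K = S R^{-1} S'
  list(N = N, S = S, R = R, K = K)
}

out <- gs_matrices(c(0, 1, 2, 3, 1))
dim(out$N)  # 5 observations, 4 distinct values
dim(out$K)  # 4 x 4
```

A quick sanity check is that $K$ annihilates constant and linear vectors, so the roughness penalty $f^\top K f$ is zero for a straight line.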

Funding Statement

This work was supported by CNPq and Fundação de Amparo à Pesquisa do Estado de Minas Gerais (FAPEMIG).

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  1. Atkinson A., Two graphical displays for outlying and influential observations in regression, Biometrika 68 (1981), pp. 13–20.
  2. Azzalini A., A class of distributions which includes the normal ones, Scand. J. Statist. 12 (1985), pp. 171–178.
  3. Boor C.D., A Practical Guide to Spline, Springer, Berlin, 1978.
  4. Cancho V.C., Lachos V.H., and Ortega E.M.M., A nonlinear regression model with skew-normal errors, Statist. Papers 51 (2010), pp. 547–558.
  5. Chen G. and You J., An asymptotic theory for semiparametric generalized least squares estimation in partially linear regression models, Statist. Papers 46 (2005), pp. 173–193.
  6. Cook R.D., Assessment of local influence, J. R. Stat. Soc. Ser. B 48 (1986), pp. 133–169.
  7. Cook R.D. and Weisberg S., Residuals and Influence in Regression, Chapman & Hall/CRC, Boca Raton, FL, 1982.
  8. Cook R.D. and Weisberg S., Diagnostics for heteroscedasticity in regression, Biometrika 70 (1983), pp. 1–10.
  9. Dempster A., Laird N., and Rubin D., Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc. Ser. B 39 (1977), pp. 1–38.
  10. Eilers P.H.C. and Marx B.D., Flexible smoothing with B-splines and penalties, Stat. Sci. 11 (1996), pp. 89–121.
  11. Ferreira C.S., Lachos V.H., and Garay A.M., Inference and diagnostics for heteroscedastic nonlinear regression models under skew scale mixtures of normal distributions, J. Appl. Stat. 47 (2020), pp. 1690–1719.
  12. Ferreira C.S. and Paula G.A., Estimation and diagnostic for skew-normal partially linear models, J. Appl. Stat. 44 (2017), pp. 3033–3053.
  13. Garay A.M., Lachos V.H., Labra F.V., and Ortega E.M.M., Statistical diagnostics for nonlinear regression models based on scale mixtures of skew-normal distributions, J. Stat. Comput. Simul. 84 (2014), pp. 1761–1778.
  14. Green P.J., Penalized likelihood for general semi-parametric regression models, Int. Stat. Rev. 55 (1987), pp. 245–259.
  15. Green P.J. and Silverman B.W., Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach, Chapman and Hall, Boca Raton, 1994.
  16. Härdle W., Müller M., Sperlich S., and Werwatz A., Nonparametric and Semiparametric Models, Springer, Berlin, 2004.
  17. Hastie T. and Tibshirani R., Generalized Additive Models, Chapman and Hall, London, 1990.
  18. Ibacache-Pulgar G. and Paula G.A., Local influence for student-t partially linear models, Comput. Stat. Data Anal. 55 (2011), pp. 1462–1478.
  19. Ibacache-Pulgar G., Paula G.A., and Cysneiros F.J.A., Semiparametric additive models under symmetric distributions, Test 22 (2013), pp. 103–121.
  20. Johnson N.L., Kotz S., and Balakrishnan N., Continuous Univariate Distributions, Vol. 1, John Wiley, New York, 1994.
  21. Keilegom I.V. and Wang L., Semiparametric modeling and estimation of heteroscedasticity in regression analysis of cross-sectional data, Electron. J. Stat. 4 (2010), pp. 133–160.
  22. Labra F.V., Garay A.M., Lachos V.H., and Ortega E.M.M., Estimation and diagnostics for heteroscedastic nonlinear regression models based on scale mixtures of skew-normal distributions, J. Stat. Plan. Inference 142 (2012), pp. 2149–2165.
  23. Lachos V.H., Montenegro L.C., and Bolfarine H., Inference and influence diagnostics for skew-normal null intercept measurement errors models, J. Stat. Comput. Simul. 78 (2008), pp. 395–419.
  24. Lee S.Y. and Xu L., Influence analysis of nonlinear mixed-effects models, Comput. Stat. Data Anal. 45 (2004), pp. 321–341.
  25. Lesaffre E. and Verbeke G., Local influence in linear mixed models, Biometrics 54 (1998), pp. 570–582.
  26. Ma B., Chiou J., and Wang A., Efficient semiparametric estimator for heteroscedastic partially linear models, Biometrika 93 (2006), pp. 75–84.
  27. Mattos T.B. and Ferreira C.S., The mean-shift outlier model under skew normal distribution, Commun. Stat. Simul. Comput. 45 (2016), pp. 1905–1917.
  28. Meng X.L. and Rubin D.B., Maximum likelihood estimation via the ECM algorithm: A general framework, Biometrika 81 (1993), pp. 633–648.
  29. R Core Team, R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2021. http://www.R-project.org/.
  30. Ruppert D., Wand M.P., and Carrol R., Semiparametric Regression, Cambridge University Press, New York, 2003.
  31. Segal M.R., Bacchetti P., and Jewell N.P., Variances for maximum penalized likelihood estimates obtained via the EM algorithm, J. R. Stat. Soc. Ser. B 56 (1994), pp. 345–352.
  32. Stark P.C., Ryan L.M., McDonald J.L., and Burge H.A., Using meteorologic data to predict daily ragweed pollen levels, Aerobiologia 13 (1997), pp. 177–184.
  33. Wang J. and Genton M.G., The multivariate skew-slash distribution, J. Stat. Plan. Inference 136 (2006), pp. 209–220.
  34. Xie F.C., Wei B.C., and Lin J.G., Homogeneity diagnostics for skew-normal nonlinear regression models, Stat. Probab. Lett. 79 (2009), pp. 821–827.
  35. You J., Chen G., and Zhou Y., Statistical inference of partially linear regression models with heteroscedastic errors, J. Multivar. Anal. 98 (2007), pp. 1539–1557.
  36. Zeller C.B., Carvalho R.R., and Lachos V.H., On diagnostics in multivariate measurement error models under asymmetric heavy-tailed distributions, Statist. Papers 53 (2012), pp. 665–683.
  37. Zhu H. and Lee S., Local influence for incomplete-data models, J. R. Stat. Soc. Ser. B 63 (2001), pp. 111–126.
  38. Zhu H., Lee S., Wei B., and Zhou J., Case-deletion measures for models with incomplete data, Biometrika 88 (2001), pp. 727–737.

Articles from Journal of Applied Statistics are provided here courtesy of Taylor & Francis
