Journal of Applied Statistics. 2020 Aug 7;49(1):248–267. doi: 10.1080/02664763.2020.1803812

The semiparametric regression model for bimodal data with different penalized smoothers applied to climatology, ethanol and air quality data

J C S Vasconcelos a, G M Cordeiro b, E M M Ortega a
PMCID: PMC9042003  PMID: 35707795

ABSTRACT

Semiparametric regressions can be used to model data when covariables and the response variable have a nonlinear relationship. In this work, we propose three flexible regression models for bimodal data, called the additive, additive partial and semiparametric regressions, based on the odd log-logistic generalized inverse Gaussian distribution under three types of penalized smoothers. The main idea is not to compare the three forms of smoothing but to show the versatility of the distribution under each of them. We present several Monte Carlo simulations carried out for different parameter configurations and sample sizes to verify the precision of the penalized maximum-likelihood estimators. The usefulness of the proposed regressions is illustrated empirically through three applications to climatology, ethanol and air quality data.

Keywords: Additive model, additive partial model, generalized inverse Gaussian distribution, semiparametric model, splines

1. Introduction

For many years, the normal linear regression model has been used to explain most random phenomena. Even when the phenomenon under study does not present a response for which the normality assumption is reasonable, some type of transformation is suggested to achieve the desired normal distribution. Another important problem in regression models occurs when there are both linear and nonlinear effects on the response variable in a single data set.

A great effort has been made to provide more flexible assumptions so that these regressions can model real situations with greater precision. However, these flexible assumptions lead to more complex regression models, which in some cases are very hard to interpret. Nowadays, the literature offers various types of regression models, such as the generalized linear semiparametric models pioneered by Green and Yandell [10], who added a nonparametric term to the linear predictor. Another extension of the generalized linear models is the generalized additive model (GAM) introduced by Hastie and Tibshirani [11], in which a term that was modeled parametrically is replaced by an arbitrary function, modeled nonparametrically and estimated by smooth curves (such as splines). Ruppert et al. [16] demonstrate that nonparametric regression can be considered a relatively simple extension of parametric regression and combine the two into what is referred to as semiparametric regression, which they approach through penalized regression splines and mixed models. Rigby and Stasinopoulos [15] developed the generalized additive model for location, scale and shape (GAMLSS), which has been widely used in various areas of science because it allows the location, scale and shape to be modeled simultaneously. Semiparametric regression methods are of great practical importance in real applications. For example, Fan and Hyndman proposed a new statistical method to predict short-term electricity demand based on a semiparametric additive model; Lebotsa et al. [14] presented an application of partially linear additive quantile regression models to predict short-term electricity demand using data from South Africa; Hudson et al. [12] showed the benefits of the GAMLSS in the modeling and interpretation of possible nonlinear climate impacts on eucalyptus tree growth; Del Giudice et al. [4] presented a hedonic price function constructed through a semiparametric additive model; and, more recently, Etienne et al. [7] used a semiparametric model and a stochastic frontier model to estimate the efficiency of corn production by smallholders in Zimbabwe.

On the other hand, the distributions commonly used in regression models are being modified and/or generalized to enable them to model different complex forms of data. Hence, it is convenient to consider parametric families of distributions that are flexible enough to capture a wide range of symmetric, asymmetric and bimodal behaviors.

In this article, we adopt as baseline the odd log-logistic generalized inverse Gaussian (OLLGIG) distribution introduced recently by Souza Vasconcelos et al. [17]. The fundamental objective is to propose additive, additive partial and semiparametric regression models for bimodal data based on the OLLGIG distribution with different penalized smoothers.

The inferential component is carried out using the asymptotic distribution of the maximum-likelihood estimators (MLEs). The models are presented together with some global influence methods. Additionally, we develop a residual analysis based on quantile residuals (qrs). For some parameter settings, additive terms and sample sizes, several Monte Carlo simulations are carried out to compare the empirical distribution of the qrs with the standard normal distribution. These simulations indicate that the empirical distribution of these residuals under the different penalized smoothers agrees with the standard normal distribution.

The rest of the paper is structured as follows. In Section 2, we define the OLLGIG semiparametric regression model based on different penalized smoothers, estimate its parameters by the penalized maximum-likelihood method, and discuss diagnostics and residual analysis. In Section 3, some properties of the maximum-likelihood estimators are evaluated using a simulation study. In Section 4, we show empirically the flexibility, practical relevance and applicability of the proposed regression models by means of three real data sets. Section 5 is devoted to some concluding remarks.

2. The OLLGIG semiparametric regression

The OLLGIG distribution was implemented as a new distribution in the gamlss package [18] of the R software, as described in Section 4.2 of [19]. For the regression analysis, we use the function gamlss(·) from the same package, with the regression structures and penalized smoothers described in Tables 5, 10 and 13.

Table 5. Systematic components of the OLLGIG, GIG and IG additive regressions and goodness-of-fit measures for climatology data.

Model Systematic structures GAIC
OLLGIG μi=exp[β0+cs(xi1)+cs(xi2)+cs(xi3)] 402.3305
GIG μi=exp[β0+cs(xi1)+cs(xi2)+cs(xi3)] 407.7017
IG μi=exp[β0+cs(xi1)+cs(xi2)+cs(xi3)] 413.6021
OLLGIG μi=exp[β0+ps(xi1)+ps(xi2)+ps(xi3)] 408.9545
GIG μi=exp[β0+ps(xi1)+ps(xi2)+ps(xi3)] 413.9161
IG μi=exp[β0+ps(xi1)+ps(xi2)+ps(xi3)] 419.9604
OLLGIG μi=exp[β0+pb(xi1)+pb(xi2)+pb(xi3)] 401.8478
GIG μi=exp[β0+pb(xi1)+pb(xi2)+pb(xi3)] 405.1313
IG μi=exp[β0+pb(xi1)+pb(xi2)+pb(xi3)] 410.8665

Table 10. Additive partial regressions and GAIC for some regressions fitted to the ethanol data.

Model Systematic structures GAIC
OLLGIG μi=exp[β0+β1wi1+cs(xi2)] 36.2462
GIG μi=exp[β0+β1wi1+cs(xi2)] 39.4854
IG μi=exp[β0+β1wi1+cs(xi2)] 91.4063
OLLGIG μi=exp[β0+β1wi1+ps(xi2)] 39.3340
GIG μi=exp[β0+β1wi1+ps(xi2)] 41.0862
IG μi=exp[β0+β1wi1+ps(xi2)] 92.2585
OLLGIG μi=exp[β0+β1wi1+pb(xi2)] 35.0047
GIG μi=exp[β0+β1wi1+pb(xi2)] 37.6757
IG μi=exp[β0+β1wi1+pb(xi2)] 91.0999

Table 13. Semiparametric regressions and GAIC statistic from the fitted regressions to the air quality data.

Model systematic components GAIC
OLLGIG μi=exp[β0+β1wi1+cs(xi2)+cs(xi3)] 936.3173
GIG μi=exp[β0+β1wi1+cs(xi2)+cs(xi3)] 940.2850
IG μi=exp[β0+β1wi1+cs(xi2)+cs(xi3)] 1019.6003
OLLGIG μi=exp[β0+β1wi1+ps(xi2)+ps(xi3)] 934.7905
GIG μi=exp[β0+β1wi1+ps(xi2)+ps(xi3)] 939.3618
IG μi=exp[β0+β1wi1+ps(xi2)+ps(xi3)] 1017.3683
OLLGIG μi=exp[β0+β1wi1+pb(xi2)+pb(xi3)] 941.1109
GIG μi=exp[β0+β1wi1+pb(xi2)+pb(xi3)] 941.7588
IG μi=exp[β0+β1wi1+pb(xi2)+pb(xi3)] 1018.3338

The inverse Gaussian (IG) and generalized inverse Gaussian (GIG) distributions are widely applied in various areas of science, for example survival and reliability analysis, meteorology, hydrology and engineering, among others. Recently, [17] defined the OLLGIG cdf (for $y > 0$) as

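$$F(y) = F(y;\mu,\sigma,\nu,\tau) = \frac{G_{\mu,\sigma,\nu}(y)^{\tau}}{G_{\mu,\sigma,\nu}(y)^{\tau} + [1 - G_{\mu,\sigma,\nu}(y)]^{\tau}}, \tag{1}$$

where

$$G_{\mu,\sigma,\nu}(y) = \int_0^y \left(\frac{b}{\mu}\right)^{\nu} \frac{t^{\nu-1}}{2K_{\nu}(\sigma^{-2})} \exp\left[-\frac{1}{2\sigma^{2}}\left(\frac{bt}{\mu} + \frac{\mu}{bt}\right)\right] dt \tag{2}$$

is the cdf of the GIG distribution, $\mu > 0$ represents the mean of the GIG distribution, $\sigma > 0$ is a scale parameter, $\nu \in \mathbb{R}$ and $\tau > 0$ are shape parameters, $b = K_{\nu+1}(\sigma^{-2})/K_{\nu}(\sigma^{-2})$, and $K_{\nu}(t) = \frac{1}{2}\int_0^{\infty} u^{\nu-1}\exp\left[-\frac{1}{2}t(u + u^{-1})\right] du$ is the modified Bessel function of the third kind and index $\nu$. Clearly, $G_{\mu,\sigma,\nu}(y)$ is a special case of (1) when $\tau = 1$. Further details and properties of the GIG distribution can be found in [13].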

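We write $\eta(y) = G_{\mu,\sigma,\nu}(y)$ to simplify the notation. Then, the OLLGIG density function (for $y > 0$) can be written as

$$f(y) = f(y;\mu,\sigma,\nu,\tau) = \left(\frac{b}{\mu}\right)^{\nu} \frac{\tau\, y^{\nu-1}}{2K_{\nu}(\sigma^{-2})} \exp\left[-\frac{1}{2\sigma^{2}}\left(\frac{by}{\mu} + \frac{\mu}{by}\right)\right] \times \frac{\{\eta(y)[1-\eta(y)]\}^{\tau-1}}{\{\eta(y)^{\tau} + [1-\eta(y)]^{\tau}\}^{2}}. \tag{3}$$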

The main motivation for the OLLGIG distribution is that it is more flexible with respect to asymmetry and kurtosis and that it allows bimodality when $0 < \tau < 1$. If Y is a random variable with density (3), then we write $Y \sim \text{OLLGIG}(\mu,\sigma,\nu,\tau)$. The OLLGIG distribution contains two important special cases: the GIG distribution when $\tau = 1$, and the IG distribution when $\tau = 1$, $\sigma = \sigma\mu^{1/2}$ and $\nu = -0.5$.

The OLLGIG model is easily simulated since its quantile function (qf) takes the simple form $y = Q_{GIG}\left(u^{1/\tau}/\{u^{1/\tau} + [1-u]^{1/\tau}\}\right)$ for $u \in (0,1)$, where $Q_{GIG}(u) = G^{-1}_{\mu,\sigma,\nu}(u)$ is the qf of the GIG distribution.
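The cdf in Equation (1) and the quantile function above depend on the baseline only through its cdf and qf, so the transform can be sketched independently of the GIG distribution. The Python sketch below is illustrative only: the function names are hypothetical, and a unit exponential baseline is used as a stand-in for the GIG cdf and qf (an assumption made purely to keep the example self-contained).

```python
import math
import random


def oll_cdf(g, tau):
    """Equation (1): map a baseline cdf value g = G(y) to the odd log-logistic cdf."""
    return g**tau / (g**tau + (1.0 - g) ** tau)


def oll_baseline_prob(u, tau):
    """Probability passed to the baseline quantile function in the OLL-G qf."""
    a = u ** (1.0 / tau)
    b = (1.0 - u) ** (1.0 / tau)
    return a / (a + b)


def oll_quantile(u, tau, baseline_quantile):
    """Quantile function: y = Q_baseline(u^(1/tau) / (u^(1/tau) + (1-u)^(1/tau)))."""
    return baseline_quantile(oll_baseline_prob(u, tau))


# Stand-in baseline (assumption): unit exponential instead of the GIG distribution.
def exp_cdf(y):
    return 1.0 - math.exp(-y)


def exp_quantile(p):
    return -math.log(1.0 - p)


def simulate(n, tau, baseline_quantile, rng):
    """Inverse-transform sampling: draw u ~ Uniform(0, 1) and apply the qf."""
    return [oll_quantile(rng.random(), tau, baseline_quantile) for _ in range(n)]


sample = simulate(5, tau=0.8, baseline_quantile=exp_quantile, rng=random.Random(1))
```

Replacing `exp_cdf` and `exp_quantile` with the GIG cdf and qf would give the OLLGIG case; with `tau = 1` the transform reduces to the baseline itself.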

In many research areas, there are continuous explanatory variables with nonlinear effects on the response variable, and more flexible models under less restrictive assumptions are desirable. A nonparametric approach for one or more covariables may then be a suitable choice to control the effects of the continuous covariables, or even to explain nonlinear tendencies of these variables. In this context, we propose three semiparametric regressions based on the OLLGIG distribution, namely: the OLLGIG additive regression, the OLLGIG additive partial regression and the OLLGIG semiparametric regression with different penalized smoothers. Likelihood ratio (LR) statistics can be adopted to discriminate among the OLLGIG, GIG and IG semiparametric regressions. The penalized likelihood function is used to fit the OLLGIG semiparametric regression.

Regression analysis involves specifications for the distribution of $Y_i$ given a vector of covariables $w_i = (w_{i1},\ldots,w_{ip})^T$. Let $x_i = (x_{i1},\ldots,x_{iJ})^T$ (for $i = 1,\ldots,n$) be the vector of covariables that have a nonlinear relationship with the response variable. The key feature of the OLLGIG semiparametric regression is that its parameters depend on $w_i$ and $x_i$. The parameters $\mu_i$ are related to the covariables by the link functions:

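$$\mu_i = \exp\left[\sum_{\xi=1}^{J} h_\xi(x_{i\xi})\right] \quad \text{OLLGIG additive regression model;} \tag{4}$$
$$\mu_i = \exp\left[w_i^T\beta + h(x_{i\xi})\right] \quad \text{OLLGIG additive partial regression model;} \tag{5}$$
$$\mu_i = \exp\left[w_i^T\beta + \sum_{\xi=1}^{J} h_\xi(x_{i\xi})\right] \quad \text{OLLGIG semiparametric regression model,} \tag{6}$$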

where $h_\xi(\cdot)$ is the smooth function of the continuous explanatory variable whose nonlinear effect is modeled nonparametrically (for $i = 1,\ldots,n$, $\xi = 1,\ldots,J$) and $\beta = (\beta_1,\ldots,\beta_p)^T$ is the vector of regression coefficients. Note that Equations (4) and (5) are particular cases of Equation (6).
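The three systematic components differ only in which terms are present, which the following Python sketch makes explicit. It is a minimal illustration, not the gamlss implementation: the function name is hypothetical, and the smooth functions are passed as known callables, whereas in the paper they are estimated by penalized smoothers.

```python
import math


def mu_semiparametric(w, beta, x, smooths):
    """Equation (6): mu_i = exp(w_i^T beta + sum over xi of h_xi(x_i,xi)).

    w, beta : linear covariables and their coefficients (empty lists drop the linear part)
    x       : covariables with nonlinear effects
    smooths : one callable h_xi per entry of x (assumed known here for illustration)
    """
    linear = sum(wj * bj for wj, bj in zip(w, beta))
    nonlinear = sum(h(xj) for h, xj in zip(smooths, x))
    return math.exp(linear + nonlinear)


def h(x):
    # Illustrative smooth function (the true curve later used in the simulation study)
    return 1.0 + math.sin(x)


# Equation (4), additive: no linear term, one or more smooth terms
mu_additive = mu_semiparametric([], [], [0.0, math.pi], [h, h])
# Equation (5), additive partial: linear term plus a single smooth term
mu_partial = mu_semiparametric([0.5, 1.0], [0.01, -1.0], [math.pi / 2], [h])
```

Calling the same function with an empty linear part or a single smooth term reproduces Equations (4) and (5), which is the sense in which they are particular cases of Equation (6).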

In this article, we shall discuss two smoothing functions called cubic spline and P-spline in the systematic structure.

  • Cubic spline

    The cubic spline is represented by the cs(·) function, available in the gamlss package [18], which uses the smooth.spline(·) command to smooth a curve. Let $y_1,\ldots,y_n$ be n observations from the OLLGIG$(\mu_i,\sigma,\nu,\tau)$ distribution. For the semiparametric models (4), (5) and (6), the fixed and random effects $\theta$ and $h$, respectively, are estimated by maximizing the penalized log-likelihood function (see, for instance, [9,11]), which has the form
    $$l_p(\theta,h) = l(\theta) - \sum_{\xi=1}^{J} \frac{\lambda_\xi}{2} h_\xi^T K_\xi h_\xi, \tag{7}$$
    where
    $$l(\theta) = n\log(\tau) + \nu\sum_{i=1}^{n}\log\left(\frac{b}{\mu_i}\right) + (\nu-1)\sum_{i=1}^{n}\log(y_i) - n\log\left[2K_{\nu}\left(\frac{1}{\sigma^{2}}\right)\right] - \frac{1}{2\sigma^{2}}\sum_{i=1}^{n}\left(\frac{by_i}{\mu_i} + \frac{\mu_i}{by_i}\right) + (\tau-1)\sum_{i=1}^{n}\log\{\eta(y_i)[1-\eta(y_i)]\} - 2\sum_{i=1}^{n}\log\{\eta^{\tau}(y_i) + [1-\eta(y_i)]^{\tau}\}, \tag{8}$$
    $\theta = (\beta^T,\sigma,\nu,\tau)^T$ is the parameter vector and $\lambda_\xi > 0$ is the smoothing parameter, which controls the smoothness of the fitted curve, for the vector of smooth function values $h_\xi = (h_\xi(x_{1\xi}),\ldots,h_\xi(x_{q\xi}))^T$. Here q is the number of distinct ordered values of the covariable that is modeled nonparametrically, $\xi = 1,\ldots,J$ indexes the covariables with nonlinear effects on $y_i$ ($i = 1,\ldots,n$), and $K_\xi = QR^{-1}Q^T$ is a $q \times q$ non-negative definite matrix, where Q is a matrix of order $q \times (q-2)$ and R is a matrix of order $(q-2) \times (q-2)$. For more details, see [9].
    Equation (7) is a general form of the penalized log-likelihood function of the semiparametric regression models:
    • if the linear term $w_i^T\beta$ is omitted, Equation (7) is the penalized log-likelihood function for the OLLGIG additive regression;
    • if there is a single smooth term ($\xi = 1$), it is the penalized log-likelihood function for the OLLGIG additive partial regression;
    • if there are several smooth terms ($\xi > 1$), it is the penalized log-likelihood function for the OLLGIG semiparametric regression.
  • P-spline

    The other smoothing function used in the paper is the P-spline [6], which involves penalized splines and is available through the ps(·) and pb(·) functions. The ps(·) smoother is based on the function of Brian Marx, while the pb(·) smoother follows the function defined by Paul Eilers. There are two main differences between the ps(·) and pb(·) functions:
    • the ps(·) function does not estimate the smoothing parameter;
    • in computational terms, the pb(·) function is faster than the ps(·) function.

    More details can be found in [20].

    The ps(·) and pb(·) functions can be written as $h(x) = N\gamma$, where N is the incidence matrix (the matrix of B-spline bases), which depends on the covariable x, and $\gamma$ is the parameter vector to be estimated. These smoothing functions also have a quadratic penalty of the form $\lambda\gamma^T G\gamma$, where $G = D^T D$ is a known penalty matrix, $\lambda$ is the hyperparameter that regulates the amount of smoothing, and the matrix D is described below.

    Given this, the penalized log-likelihood function for $\theta$ and $\gamma$ can be written as
    $$l_p(\theta,\gamma) = l(\theta) - \frac{1}{2}\sum_{\xi=1}^{J}\gamma_\xi^T G_\xi(\lambda_\xi)\gamma_\xi, \tag{9}$$
    where $\theta = (\beta^T,\sigma,\nu,\tau)^T$ is the vector of parameters, J is the number of smoothers, i.e. of covariables that are modeled nonparametrically, and $\gamma$ is a vector of penalized coefficients to be estimated. The penalty matrix G is defined as $G = (D_k)^T D_k$, where the matrix $D_k$ has order $(q-k) \times q$, recalling that q is the number of distinct values of the explanatory variable that is modeled nonparametrically. The order k to be applied depends on the smoothness of the curve describing the data. The commonly used penalty of order k = 2 can be written as
    $$\gamma^T (D_2)^T D_2\gamma = (\gamma_1 - 2\gamma_2 + \gamma_3)^2 + \cdots + (\gamma_{q-2} - 2\gamma_{q-1} + \gamma_q)^2. \tag{10}$$
    Thus, the matrix $D_2$ has the form
    $$D_2 = \begin{pmatrix} 1 & -2 & 1 & 0 & \cdots \\ 0 & 1 & -2 & 1 & \cdots \\ 0 & 0 & 1 & -2 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \ddots \end{pmatrix}.$$
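The difference penalty in Equation (10) can be checked numerically. The Python sketch below builds the matrix $D_k$ by differencing the identity matrix k times and evaluates $\lambda\gamma^T (D_k)^T D_k\gamma$ as a sum of squared k-th order differences. The function names are hypothetical and the code is an illustration of the penalty only, not of how ps(·) or pb(·) are implemented.

```python
def difference_matrix(q, k=2):
    """Return D_k, of order (q - k) x q, as a list of rows."""
    rows = [[1.0 if i == j else 0.0 for j in range(q)] for i in range(q)]
    for _ in range(k):
        # Each pass replaces the rows by differences of consecutive rows
        rows = [
            [b - a for a, b in zip(rows[i], rows[i + 1])]
            for i in range(len(rows) - 1)
        ]
    return rows


def difference_penalty(gamma, lam, k=2):
    """Quadratic penalty lam * gamma^T D_k^T D_k gamma."""
    d_matrix = difference_matrix(len(gamma), k)
    diffs = [sum(d * g for d, g in zip(row, gamma)) for row in d_matrix]
    return lam * sum(v * v for v in diffs)


D2 = difference_matrix(4)            # rows (1, -2, 1, 0) and (0, 1, -2, 1)
flat = difference_penalty([1.0, 2.0, 3.0, 4.0], lam=1.0)   # linear coefficients
bump = difference_penalty([0.0, 0.0, 1.0, 0.0], lam=2.0)   # one isolated jump
```

Coefficients that lie on a straight line have zero second differences and are therefore not penalized when k = 2, whereas an isolated jump is penalized in proportion to $\lambda$.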

The asymptotic distribution of $(\hat\theta - \theta)$ is multivariate normal $N_{p+3}(0, K(\theta)^{-1})$ under general regularity conditions, where $K(\theta)$ is the expected information matrix. The asymptotic covariance matrix $K(\theta)^{-1}$ of $\hat\theta$ can be approximated by the inverse of the $(p+3) \times (p+3)$ observed information matrix $J(\theta)$. Inference on the parameter vector $\theta$ can then be based on the multivariate normal distribution $N_{p+3}(0, J(\theta)^{-1})$ for $(\hat\theta - \theta)$, and a $100(1-\alpha)\%$ asymptotic confidence interval for any parameter $\theta_q$ follows as

$$ACI_q = \left(\hat\theta_q - z_{\alpha/2}\sqrt{\hat{J}^{q,q}},\ \hat\theta_q + z_{\alpha/2}\sqrt{\hat{J}^{q,q}}\right),$$

where $\hat{J}^{q,q}$ denotes the qth diagonal element of the inverse $J(\hat\theta)^{-1}$ of the estimated observed information matrix and $z_{\alpha/2}$ is the $1-\alpha/2$ quantile of the standard normal distribution.
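As a small numerical illustration of the asymptotic interval, the Python sketch below computes it from an estimate and its standard error, taking the standard error as the square root of the corresponding diagonal element of the inverse observed information matrix. The function name is hypothetical; the input values are the estimate and SE of β0 reported in Table 6 and are used here only as an example.

```python
from statistics import NormalDist


def asymptotic_ci(estimate, se, alpha=0.05):
    """100(1 - alpha)% interval: estimate -/+ z_{alpha/2} * se."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # 1 - alpha/2 standard normal quantile
    return estimate - z * se, estimate + z * se


# beta_0 of the OLLGIG additive regression with pb() (Table 6): estimate and SE
lower, upper = asymptotic_ci(-0.2576, 0.1694)
```

The resulting interval is roughly (−0.59, 0.07).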

LR statistics can be used to compare the OLLGIG semiparametric regression model with the models nested within it.

2.1. Diagnostic tools and residual analysis

In order to assess possible influential points, a global influence analysis may be carried out by case deletion. The case-deletion regressions with systematic components (4), (5) and (6) can be expressed as $\mu_l = \exp\left[\sum_{\xi=1}^{J} h_\xi(x_{l\xi})\right]$, $\mu_l = \exp\left[w_l^T\beta + h(x_l)\right]$ and $\mu_l = \exp\left[w_l^T\beta + \sum_{\xi=1}^{J} h_\xi(x_{l\xi})\right]$, respectively, for $\xi = 1,\ldots,J$, $l = 1,\ldots,n$, $l \neq i$. The first measure of global influence is the standardized norm of $\hat\theta_{(i)} - \hat\theta$, called the generalized Cook distance and defined by $GD_i(\theta) = (\hat\theta_{(i)} - \hat\theta)^T[-\ddot{L}(\theta)](\hat\theta_{(i)} - \hat\theta)$, where a quantity with subscript "(i)" means the original quantity with the ith observation deleted. Another popular measure of the difference between $\hat\theta_{(i)}$ and $\hat\theta$ is the likelihood distance defined by $LD_i(\theta) = 2[l(\hat\theta) - l(\hat\theta_{(i)})]$.

Once the model is chosen and fitted, the analysis of the residuals is an efficient way to check the model adequacy. For a residual analysis, we suggest working with the quantile residual [5]. The qrs for the OLLGIG semiparametric regression take the form

$$qr_i = \Phi^{-1}\left\{\frac{\eta(y_i)^{\tau}}{\eta(y_i)^{\tau} + [1-\eta(y_i)]^{\tau}}\right\}, \tag{11}$$

where $\eta(\cdot)$ is given in Equation (2) and $\Phi^{-1}(\cdot)$ is the inverse of the standard normal cdf. The construction of an envelope was suggested in [1] to allow a better interpretation of the normal probability plot of the residuals.
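Equation (11) is simply the standard normal quantile of the fitted cdf value, so it can be sketched in a few lines of Python. The function name is hypothetical, and the baseline cdf value η(y_i) is assumed to have been computed beforehand from the fitted GIG cdf in Equation (2).

```python
from statistics import NormalDist


def quantile_residual(eta, tau):
    """Equation (11): Phi^{-1}( eta^tau / (eta^tau + (1 - eta)^tau) ).

    eta : fitted baseline cdf value eta(y_i), strictly between 0 and 1
          (inv_cdf raises an error at exactly 0 or 1)
    tau : fitted shape parameter
    """
    p = eta**tau / (eta**tau + (1.0 - eta) ** tau)
    return NormalDist().inv_cdf(p)


r_median = quantile_residual(0.5, 0.8)   # an observation at the fitted median
r_low = quantile_residual(0.3, 0.8)
r_high = quantile_residual(0.7, 0.8)
```

If the model is adequate, the residuals computed in this way should behave approximately like a standard normal sample, which is what the normal probability plot with envelope assesses.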

3. Simulation study using different penalized smoothers

To verify the accuracy of the MLEs in the OLLGIG semiparametric regression with different penalized smoothers, and to explore the empirical distribution of the qrs, a simulation study was performed. The response and the covariables are generated as follows: $y_i \sim \text{OLLGIG}(\mu_i,\sigma,\nu,\tau)$, $w_{i1} \sim \text{Normal}(0,1)$, $w_{i2} \sim \text{Binomial}(1,0.5)$ and $x_{i3} \sim \text{Uniform}(0,7)$.

In this case, only the coefficients associated with the explanatory variables w1 and w2 are analyzed, since the coefficients associated with the penalized smoothers do not have a direct interpretation. A graphical analysis is then performed, where each plot presents the true smooth curve defined by $h(x_{i3}) = [1 + \sin(x_{i3})]$. We consider different sample sizes (n = 50, 100 and 250) under three scenarios, cs(·), ps(·) and pb(·), with the systematic component $\mu_i = \exp\{0.01w_{i1} - w_{i2} + [1 + \sin(x_{i3})]\}$. When n increases, the fitted curves approach the true curve (as expected).

For these scenarios, the numeric values of the parameters are taken as: $\beta_1 = 0.01$, $\beta_2 = -1$, $\sigma = 1.5$, $\nu = 6$ and $\tau = 0.8$. For each combination of n, $\beta_1$ and $\beta_2$, 1000 Monte Carlo samples are generated and, for each sample, the MLEs of the model parameters are computed. After all replications, we compute the average estimates (AEs), biases and mean squared errors (MSEs). Table 1 provides the different systematic components for the parameter μ.
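The three Monte Carlo summaries can be sketched as follows in Python. The definitions used here are the usual ones (AE as the mean of the replicated estimates, bias as AE minus the true value, MSE as the mean squared deviation from the true value); the function name and the toy input are hypothetical and are not the paper's simulation output.

```python
from statistics import fmean


def monte_carlo_summary(estimates, true_value):
    """Return (AE, bias, MSE) for the estimates obtained over the replications."""
    ae = fmean(estimates)
    bias = ae - true_value
    mse = fmean([(est - true_value) ** 2 for est in estimates])
    return ae, bias, mse


# Toy replicated estimates of a parameter whose true value is 1.0
ae, bias, mse = monte_carlo_summary([0.9, 1.1, 1.0, 1.2], true_value=1.0)
```

With this convention, an average estimate of −0.9924 for a true value of −1 gives a bias of 0.0076, which matches the first β2 entry of Table 2.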

Table 1. Systematic components for the parameters.

Regression Penalized smoothers Systematic components
  cs(·) μi=exp[β1wi1+β2wi2+cs(xi3)]
OLLGIG ps(·) μi=exp[β1wi1+β2wi2+ps(xi3)]
  pb(·) μi=exp[β1wi1+β2wi2+pb(xi3)]

Table 2 shows that the AEs approach the true parameter values when n increases. Further, the biases and MSEs of the estimates of β1 and β2 are small even when n is small, which supports that the asymptotic normal distribution provides an adequate approximation to the finite-sample distribution of the MLEs.

Table 2. AEs, biases and MSEs for the fitted OLLGIG regression with penalized smoothers under scenarios 1[cs(·)], 2[ps(·)] and 3[pb(·)].

Scenario 1
  n = 50 n = 100 n = 250
Parameters AE Bias MSE AE Bias MSE AE Bias MSE
β1 0.0104 0.0004 0.0052 0.0085 −0.0015 0.0023 0.0100 0.0000 0.0009
β2 −0.9924 0.0076 0.0200 −0.9969 0.0031 0.0087 −1.0033 −0.0033 0.0035
Scenario 2
  n = 50 n = 100 n = 250
Parameters AE Bias MSE AE Bias MSE AE Bias MSE
β1 0.0102 0.0002 0.0051 0.0085 −0.0015 0.0022 0.0090 −0.0010 0.0009
β2 −0.9900 0.0100 0.0199 −0.9972 0.0028 0.0089 −1.0031 −0.0031 0.0034
Scenario 3
  n = 50 n = 100 n = 250
Parameters AE Bias MSE AE Bias MSE AE Bias MSE
β1 0.0081 −0.0019 0.0053 0.0066 −0.0034 0.0022 0.0095 −0.0005 0.0009
β2 −1.0068 −0.0068 0.0194 −0.9974 0.0026 0.0093 −0.9988 0.0012 0.0037

In Figure 1, we plot the fitted and generated terms for the smooth functions representing the first, second and third scenarios with penalized smoothers cs(·), ps(·) and pb(·), respectively. For all scenarios, the fitted smooth functions approach the true curve when the sample size increases. Thus, we can conclude that the variability among the nonparametric function estimates is reduced when n increases. We can also note that the three smoothing functions have similar performances, i.e. we cannot say that any one of them is better than the others. Finally, we suggest that readers always work with the three smoothing functions. This same procedure is adopted in the examples in Section 4, using some goodness-of-fit statistics to choose one of the three smoothing functions.

Figure 1. The fitted and generated terms for the smooth functions based on 1,000 simulations. The first three plots (a) n = 50, (b) n = 100 and (c) n = 250 are related to the scenario (1) with penalized smoother cs(·). The middle three plots (d) n = 50, (e) n = 100 and (f) n = 250 refer to the scenario (2) with penalized smoother ps(·). The last three plots (g) n = 50, (h) n = 100 and (i) n = 250 are related to the scenario (3) with penalized smoother pb(·).

Empirical distribution of the residuals

We have carried out a simulation study to investigate the empirical distribution of the qrs for the OLLGIG semiparametric regression model. The simulation algorithm follows the same design as described at the beginning of this section. We also construct normal probability plots to assess the degree of deviation of the residuals from the normality hypothesis. Based on the plots in Figure 2, representing the first, second and third scenarios, respectively, we note that the empirical distribution of these residuals agrees with the standard normal distribution for all scenarios.

Figure 2. Normal probability plots for the qrs. The first three plots (a) n = 50, (b) n = 100 and (c) n = 250 are related to the scenario (1) with penalized smoother cs(·). The middle three plots (d) n = 50, (e) n = 100 and (f) n = 250 refer to the scenario (2) with penalized smoother ps(·). The last three plots (g) n = 50, (h) n = 100 and (i) n = 250 are related to the scenario (3) with penalized smoother pb(·).

4. Applications

In this section, we present three real data applications to illustrate empirically the flexibility of the OLLGIG additive, additive partial and semiparametric regressions with different penalized smoothers. All computations were carried out in the R software.

4.1. OLLGIG additive regression to climatology data

The first application concerns climatology data from the Department of Biosystems Engineering of the Luiz de Queiroz School of Agriculture, University of São Paulo (LEB-ESALQ-USP). The data set is available at http://www.leb.esalq.usp.br/leb/anos.html and was collected from 8 March to 8 August 2019. We consider the OLLGIG additive regression to explore the influence of the covariables (global radiation, relative humidity and maximum wind) on evaporation (the response variable). The variables considered for this application are:

  • yi: Evaporation (mm);

  • xi1: Global radiation (cal/cm²);

  • xi2: Relative humidity (%);

  • xi3: Maximum wind (m/s), for i = 1, …, 154.

Table 3 reports the MLEs, their standard errors (SEs) in parentheses, and the values of the Akaike information criterion (AIC) and global deviance (GD). Smaller values of these criteria indicate a better fit. The lower values of the two statistics for the OLLGIG distribution in this table support that it is suitable for modeling these data.

Table 3. MLEs and SEs of the model parameters for climatology data.

Model log(μ) log(σ) ν τ AIC GD
OLLGIG 1.2780 .19504 27.8990 0.2875 482.7119 474.7119
  (0.0226) (0.0554) (9.8170) (0.0193)    
GIG 1.3176 −1.0429 3.3460 1 491.1583 487.1583
  (0.0269) (0.1295) (5.3590) (–)    
IG 1.3178 −1.7319 −0.5 1 492.6715 486.6715
  (0.0276) (0.0569) (–) (–)    

The proposed distribution is compared with its two sub-models using LR statistics in Table 4. The figures in this table, especially the p-values, reveal that the OLLGIG model gives a better fit to these data than its two sub-models. Plots of the fitted OLLGIG, GIG and IG densities are displayed in Figure 3(b) to assess the appropriateness of the models. Plots of the estimated cumulative and empirical distributions are shown in Figure 3(c). They reveal that the OLLGIG distribution provides a good fit to the current data, capturing a slight bimodality with left asymmetry.

Table 4. LR tests for climatology data.

Models Hypotheses Statistic w p-value
OLLGIG vs GIG H0: τ=1 vs H1: H0 is false 11.9596 0.0005
OLLGIG vs IG H0: τ=1 and ν=−0.5 vs H1: H0 is false 12.4464 0.0019
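The p-values in Table 4 can be approximately recovered from the LR statistics under the usual assumption that w is asymptotically chi-square distributed, with degrees of freedom equal to the number of restrictions in H0 (one for τ = 1, two for τ = 1 and ν = −0.5). The Python sketch below relies on that assumption, which is not stated explicitly in the text, and on closed forms of the chi-square survival function that hold only for 1 and 2 degrees of freedom; the function name is hypothetical.

```python
import math


def lr_pvalue(w, df):
    """Upper-tail chi-square probability of the LR statistic w (df = 1 or 2 only)."""
    if df == 1:
        # P(Z^2 > w) = erfc(sqrt(w / 2)) for a standard normal Z
        return math.erfc(math.sqrt(w / 2.0))
    if df == 2:
        # Chi-square with 2 df is exponential with mean 2
        return math.exp(-w / 2.0)
    raise ValueError("closed form implemented only for df = 1 or df = 2")


p_gig = lr_pvalue(11.9596, df=1)   # OLLGIG vs GIG
p_ig = lr_pvalue(12.4464, df=2)    # OLLGIG vs IG
```

Both values are close to those reported in Table 4 (about 0.0005 and 0.002, respectively).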

Figure 3. (a) Histogram of the evaporation variable. (b) Estimated OLLGIG, GIG and IG densities for climatology data. (c) Estimated cdf of the OLLGIG, GIG and IG distributions and the empirical cdf.

Regression analysis with systematic components

We note in Figure 4 that there is a nonlinear relationship between the response variable and each of the covariables x1, x2 and x3. Thus, the OLLGIG additive regression model is a good option for modeling these data. The systematic components for the parameter μ in Table 5 represent the OLLGIG, GIG and IG additive regressions with penalized smoothers for the explanatory variables x1, x2 and x3. The generalized Akaike information criterion (GAIC) is adopted for model selection [15] because smoothing terms are included in the systematic components. The values of this statistic are displayed in Table 5 to compare the adequacy of all fitted models. They show that the fitted OLLGIG additive regression with the pb(·) smoother has the lowest GAIC among the fitted regressions. The MLEs of its parameters are listed in Table 6. Additional interpretations for this regression are given at the end of this subsection. Table 7 compares the proposed regression with its two sub-models via LR statistics, where the p-values indicate that the OLLGIG additive regression with pb(·) provides a better fit to the current data than the null models. We calculated the case-deletion measures GDi(θ) and LDi(θ) defined in Subsection 2.1. The index plots of these influence measures are presented in Figure 5. The plots reveal that the cases 61, 89, 122 and 146 are possible influential observations. We perform the residual analysis by plotting in Figure 5(c) the qrs (see Subsection 2.1) against the index of the observations. Figure 5(d) gives the normal probability plot with generated envelope. So, the OLLGIG additive regression with pb(·) is appropriate for these data: although three observations fall outside the envelope, they represent less than 5% of the sample.
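The GAIC-based choice described above amounts to picking the smallest entry of Table 5. A minimal Python sketch, with the nine GAIC values copied from Table 5 and a hypothetical helper name:

```python
# GAIC values of the nine additive regressions fitted to the climatology data (Table 5)
gaic_table5 = {
    ("OLLGIG", "cs"): 402.3305, ("GIG", "cs"): 407.7017, ("IG", "cs"): 413.6021,
    ("OLLGIG", "ps"): 408.9545, ("GIG", "ps"): 413.9161, ("IG", "ps"): 419.9604,
    ("OLLGIG", "pb"): 401.8478, ("GIG", "pb"): 405.1313, ("IG", "pb"): 410.8665,
}


def best_by_gaic(gaic):
    """Return the (distribution, smoother) pair with the smallest GAIC."""
    return min(gaic, key=gaic.get)


best = best_by_gaic(gaic_table5)

# Best distribution within each smoother, to see whether the ranking depends on it
best_per_smoother = {
    s: min((d for d, sm in gaic_table5 if sm == s), key=lambda d: gaic_table5[(d, s)])
    for s in ("cs", "ps", "pb")
}
```

In Table 5 the OLLGIG regression has the lowest GAIC under each of the three smoothers, and the overall minimum is attained with pb(·).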

Figure 4. Dispersion diagrams for climatology data. (a) y versus x1. (b) y versus x2. (c) y versus x3.

Table 6. MLEs, SEs and p-values for the OLLGIG additive regression with pb(·) fitted to climatology data.

Parameter Estimate SE p-value
β0 −0.2576 0.1694 0.1305
log(σ) 0.0169 0.2495  
ν 0.4622 0.4266  
τ 4.3021 0.6548  

Table 7. LR tests for comparing regressions.

Regressions Hypotheses Statistic w p-value
OLLGIG pb(·) vs GIG pb(·) H0: τ=1 vs H1: H0 is false 6.0701 0.0244
OLLGIG pb(·) vs IG pb(·) H0: τ=1 and ν=−0.5 vs H1: H0 is false 14.6503 0.0018

Figure 5. Index plots for θ: (a) LDi(θ) (likelihood distance) and (b) GDi(θ) (generalized Cook's distance). (c) Residual analysis of the OLLGIG additive regression with pb(·) smoother fitted to the climatology data. (d) Normal probability plot for the qrs with envelope.

Final interpretations

Figure 6 shows the estimated nonlinear effects. The horizontal axis in Figure 6(a) refers to the values of the covariable x1 and the vertical axis gives the contribution of the penalized smoother pb(·) to the fitted values of the response variable (evaporation in mm). Note that the global radiation has a nonlinear relation with evaporation, such that:

  • for days with global radiation (x1) of up to 300 cal/cm² (approximately), there is an increase in evaporation;

  • for days with global radiation between 300 cal/cm² and 380 cal/cm², there is a reduction of the evaporation;

  • for global radiation values near 380 cal/cm², the evaporation is increasing.

Figure 6. Shapes of the penalized smoothers pb(·) for the covariables (a) x1, (b) x2 and (c) x3 using the OLLGIG additive regression.

Relative humidity (in %) also has a nonlinear effect on evaporation [see Figure 6(b)]. For days with relative humidity (x2) up to 70% (approximately), evaporation is constant, but for days with relative humidity greater than 70% (approximately), evaporation increases. Further, the maximum wind speed (covariable x3) has a nonlinear effect on evaporation. For days with maximum wind speed between 2 m/s and 10 m/s (approximately), there is a rising evaporation rate, while on days with wind speeds greater than 10 m/s (approximately), the increase of evaporation is less pronounced, as can be noted in Figure 6(c).

4.2. OLLGIG additive partial regression fitted to ethanol data

This application concerns ethanol fuel burned in a single-cylinder engine. For various configurations of compression ratio and equivalence ratio, the emissions of nitrogen oxides (NOx) were recorded. The ethanol data frame contains 88 sets of measurements from an experiment in which ethanol was burned in a single-cylinder automobile test engine. For more details about the data, see [2]. We consider the OLLGIG additive partial regression given in Equation (5) in comparison with the GIG and IG additive partial regressions with three types of penalized smoothers in the linear predictor, namely: cs(·), ps(·) and pb(·).

The variables in this study are:

  • yi: NOx (concentration of nitrogen oxides (NO and NO2) in micrograms/J);

  • wi1: the compression ratio of the engine;

  • xi2: equivalence ratio, a measure of the richness of the air and ethanol fuel mixture (for i = 1, …, 88).

Table 8 lists the MLEs of the parameters, their SEs and the AIC and GD measures for the OLLGIG, GIG and IG distributions. The OLLGIG distribution presents the lowest values of both statistics among the fitted distributions, so it can be chosen as the best distribution for the current data.

Table 8. MLEs and SEs (in parentheses) of the model parameters for ethanol data.

Model log(μ) log(σ) ν τ AIC GD
OLLGIG 0.5341 .11017 22.8500 0.1951 250.8621 242.8621
  (0.0662) (0.9342) (9.8800) (0.0656)    
GIG 0.6717 −0.1787 1.7900 1 262.7169 256.7169
  (0.0678) (0.2475) (1.0310) (–)    
IG 0.6716 −0.6436 −0.5 1 265.1399 261.1399
  (0.0783) (0.0754) (–) (–)    
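The AIC and GD columns of Table 8 appear consistent with the usual relation AIC = GD + 2k, where k is the number of freely estimated parameters (4 for OLLGIG, 3 for GIG with τ fixed, 2 for IG with ν and τ fixed). The Python sketch below checks this relation on the reported values; the function name is hypothetical, and the general form GD + penalty × df is the GAIC definition used in gamlss, of which the AIC is the case penalty = 2.

```python
def gaic(global_deviance, df, penalty=2.0):
    """Generalized AIC: GD + penalty * df (penalty = 2 gives the AIC)."""
    return global_deviance + penalty * df


# (GD, number of free parameters, reported AIC) from Table 8, ethanol data
table8 = {
    "OLLGIG": (242.8621, 4, 250.8621),
    "GIG": (256.7169, 3, 262.7169),
    "IG": (261.1399, 2, 265.1399),
}

recomputed = {name: gaic(gd, k) for name, (gd, k, _) in table8.items()}
```

For all three rows, the recomputed value agrees with the reported AIC to four decimal places.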

The LR statistics for comparing the nested distributions are reported in Table 9. Clearly, the OLLGIG distribution outperforms the GIG and IG distributions. The plots of the fitted OLLGIG, GIG and IG densities are shown in Figure 7(a). It is clear that the histogram of the data has a bimodal shape and that the estimated OLLGIG density provides the closest fit to the histogram. The GIG and IG densities cannot take this shape. Further, the plots of the fitted OLLGIG, GIG and IG cdfs and the empirical cdf are shown in Figure 7(b). They also indicate that the wider distribution provides an appropriate fit to these data. Thus, the OLLGIG distribution is a good choice for modeling the current data.

Table 9. LR tests for ethanol data.

Models Hypotheses Statistic w p-value
OLLGIG vs GIG H0: τ=1 vs H1: H0 is false 13.8548 <0.001
OLLGIG vs IG H0: τ=1 and ν=−0.5 vs H1: H0 is false 18.2778 <0.001

Figure 7.

(a) Estimated OLLGIG, GIG and IG densities for ethanol data. (b) Estimated cumulative functions of the OLLGIG, GIG and IG distributions and the empirical cdf for ethanol data. (c) Scatter diagram: emission of NOx versus air/ethanol mix.

Regression analysis with systematic components

Figure 7(c) shows a nonlinear relationship between the response variable y and the explanatory variable x2, so we adopt the OLLGIG additive partial regression with different penalized smoothers. Table 10 gives the systematic components for the parameter μ, with the nonlinear effect in x2, for the OLLGIG, GIG and IG additive partial regressions, together with the GAIC values for the nine fitted regressions. The OLLGIG additive partial regression with the penalized smoother pb(·) has the smallest GAIC among the nine fitted regressions and can therefore be chosen as the best model for the current data. Table 11 gives the MLEs, SEs and p-values of the model parameters. The linear (w1) and nonlinear (x2) effects are statistically significant at the 5% level. The linear effect is interpreted as follows: as the compression ratio of the engine increases, so does the NOx emission. The interpretation of the nonlinear effect is addressed at the end of this application. To compare the regressions formally, we use LR statistics. The LR statistics for testing two sub-models of the OLLGIG additive partial regression are given in Table 12; they favor the OLLGIG additive partial regression with the pb(·) penalized smoother.

The case-deletion measures LDi(θ) and GDi(θ) are plotted in Figure 8(a,b), which indicate that cases 14, 24, 38 and 88 are possibly influential observations. Figure 8(c) plots the quantile residuals (qrs) against the fitted values to detect possible outliers in the OLLGIG additive partial regression with the pb(·) smoother. The residuals behave randomly and no observation lies outside the range [−3,3]. Figure 8(d) displays the normal probability plot for the qrs with the simulated envelope, which indicates that the fitted regression is adequate.

Finally, Figure 8(e) presents the estimated nonlinear effect. The horizontal axis shows the values of the covariable x2 and the vertical axis shows the contribution of the penalized smoother pb(·) to the fitted values of the NOx emission. As expected, the effect of the air/ethanol mix on the NOx emission is nonlinear. The NOx emission increases, with greater variability, for values of x2 up to around 0.9; beyond 0.9, it decreases with little variability as the equivalence ratio x2 grows.
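The residual check described above can be sketched as follows. A quantile residual maps each response to the standard normal scale through the fitted cdf, and observations with residuals outside [−3,3] are flagged. This sketch makes several assumptions: it uses the standard odd log-logistic construction F = G^τ / (G^τ + (1 − G)^τ), it substitutes an inverse Gaussian cdf for the baseline G because the GIG cdf has no simple closed form, and the responses and parameter values are hypothetical rather than estimates from the paper.

```python
from math import exp, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for Phi and its inverse


def ig_cdf(y, mu, lam):
    """Inverse Gaussian cdf with mean mu and shape lam (stand-in baseline G)."""
    a = sqrt(lam / y)
    return (STD_NORMAL.cdf(a * (y / mu - 1.0))
            + exp(2.0 * lam / mu) * STD_NORMAL.cdf(-a * (y / mu + 1.0)))


def oll_cdf(g, tau):
    """Odd log-logistic transform of a baseline cdf value g; tau = 1 returns g."""
    return g ** tau / (g ** tau + (1.0 - g) ** tau)


def quantile_residual(y, mu, lam, tau):
    """Map y to the standard normal scale through the fitted cdf."""
    return STD_NORMAL.inv_cdf(oll_cdf(ig_cdf(y, mu, lam), tau))


def outside_range(residuals, bound=3.0):
    """1-based indices of residuals outside [-bound, bound]."""
    return [i for i, r in enumerate(residuals, start=1) if abs(r) > bound]


# Hypothetical responses and parameter values, for illustration only
ys = [0.6, 0.9, 1.5, 2.5, 3.5]
residuals = [quantile_residual(y, mu=1.5, lam=3.0, tau=2.0) for y in ys]
flagged = outside_range(residuals)
print([round(r, 3) for r in residuals], flagged)
```

Setting tau = 1 in `oll_cdf` returns the baseline cdf unchanged, which is the restriction tested by the LR statistics.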

Table 11. MLEs, SEs and p-values for the OLLGIG additive partial regression with pb(·) fitted to ethanol data.

Parameter Estimate SE p-value
β0 −1.3744 0.0791 <0.001
β1 0.0261 0.0041 <0.001
log(σ) 20.3111 0.0051  
ν 4.4625 0.5352  
τ 3.3840 0.2694  

Table 12. LR statistics for testing some regressions.

Models Hypotheses Statistic w p-value
OLLGIG pb vs GIG pb H0: τ=1 vs H1: H0 is false 4.2999 0.0281
OLLGIG pb vs IG pb H0: τ=1 and ν=−0.5 vs H1: H0 is false 60.7959 <0.001

Figure 8.

(a) LDi(θ) (likelihood distance). (b) GDi(θ) (generalized Cook's distance). (c) Residual analysis of the OLLGIG additive partial regression with pb(·) smoother fitted to the ethanol data. (d) Normal probability plot for the qrs with envelope. (e) Shape of the penalized smoother pb(·) for the covariable x2.

4.3. OLLGIG semiparametric regression fitted to air quality data

This application uses the air quality data (airquality) available in the R software. Rows with missing values were omitted from the analysis. The data are daily air quality readings (from 1 May to 30 September 1973) obtained from the New York State Department of Environmental Conservation (ozone data) and the U.S. National Weather Service (meteorological data); for more details, see [3]. We fit the OLLGIG semiparametric regression and compare it with the GIG and IG sub-models, where the systematic component given in Equation (6) describes the relation between the air quality and the other covariables. As in the first application, we consider the penalized smoothers cs(·), ps(·) and pb(·) in the linear predictors. The variables are:

  • yi: average ozone concentration in parts per billion from 1:00 to 3:00 p.m. on Roosevelt Island;

  • wi1: the explanatory variable month, considered as a factor with five levels (May, June, July, August and September);

  • xi2: solar radiation in Langleys in the frequency range from 4000 to 7700 Angstroms from 8:00 a.m. to 12:00 noon in Central Park;

  • xi3: maximum daily temperature in degrees Fahrenheit at La Guardia Airport, i = 1, …, 111.
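The preprocessing described above (dropping rows with missing values and treating month as a five-level factor) can be sketched as follows. The rows are hypothetical toy values, not the airquality data. Taking May as the baseline level is an assumption: it is consistent with the four month coefficients reported for the fitted model, but the coding is not stated explicitly.

```python
# Factor levels for month; May is ASSUMED to be the baseline level
MONTHS = ["May", "June", "July", "August", "September"]


def complete_cases(rows):
    """Keep only rows in which no field is missing (None)."""
    return [r for r in rows if all(v is not None for v in r.values())]


def month_dummies(month):
    """Indicator values for the four non-baseline levels (June to September)."""
    return [1 if month == level else 0 for level in MONTHS[1:]]


# Hypothetical toy rows, for illustration only
rows = [
    {"ozone": 40.0, "solar": 200.0, "temp": 70.0, "month": "May"},
    {"ozone": None, "solar": 150.0, "temp": 65.0, "month": "May"},
    {"ozone": 30.0, "solar": None, "temp": 80.0, "month": "June"},
    {"ozone": 60.0, "solar": 250.0, "temp": 85.0, "month": "July"},
    {"ozone": 20.0, "solar": 100.0, "temp": 75.0, "month": "September"},
]

clean = complete_cases(rows)
design = [month_dummies(r["month"]) for r in clean]
print(len(clean), design)
```

With this coding, a baseline-month observation has all four indicators equal to zero, so each month coefficient measures a shift relative to the baseline on the scale of the linear predictor.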

Figure 9 shows that there is a nonlinear relationship between the response variable and each of the covariables x2 and x3. Then, we adopt the OLLGIG semiparametric regression with different penalized smoothers to analyze these data. Table 13 presents the OLLGIG, GIG and IG semiparametric regressions with different systematic components with nonlinear effects in the explanatory variables x2 and x3.

Figure 9.

Scatter diagram: (a) yi versus xi2. (b) yi versus xi3.

The values of the GAIC measure for the nine fitted regressions are listed in Table 13. The OLLGIG semiparametric regression with the ps(·) smoother has the smallest GAIC among the nine fitted regressions and can therefore be indicated as the best model. Table 14 gives the MLEs, SEs and p-values of the model parameters. At the 5% significance level, the explanatory variable w1 is significant. Since the estimates are negative, there is strong evidence of a lower average ozone level in June and September than in May. The LR statistics for testing two sub-models of the OLLGIG semiparametric regression with the ps(·) smoother, reported in Table 15, favor the wider semiparametric regression. The generalized Cook's distance GDi(θ) and the likelihood distance LDi(θ) are displayed in Figure 10(a,b). These plots indicate that cases 23 and 77 are possibly influential observations. The plot of the qrs versus the fitted values is given in Figure 10(c). The residuals behave randomly around the horizontal axis, and observation 17 lies outside the range [−3,3]. We verify the goodness of fit of the OLLGIG semiparametric regression by the normal probability plot for the qrs with the simulated envelope given in Figure 10(d). This plot supports the good fit of the OLLGIG semiparametric regression with ps(·) to the current data. The horizontal axes of Figure 11 show the values of the covariables x2 and x3, and the vertical axes show the contribution of the penalized smoother ps(·) for each of these covariables. As expected, the effects of solar radiation and temperature are nonlinear. We have two conclusions:

  • The penalized smoother for x2 in Figure 11(a) shows the median ozone level first increasing with solar radiation and the fitted curve then declining from approximately 240 onward. The variability remains roughly constant and increases only for solar radiation above about 300.

  • The functional form of the covariable x3 in Figure 11(b) shows a continuous increase in the median ozone level with temperature up to around 95 °F, after which the fitted curve tends to decrease. Further, the variability of the median ozone level increases considerably when the temperature is above 95 °F.

Table 14. MLEs, SEs and p-values for the fitted semiparametric OLLGIG regression with ps(·) to the air quality data.

Parameter Estimate SE p-value
β0 −0.6978 0.5333 0.1938
β11 −0.3783 0.1737 0.0318
β12 −0.1439 0.1715 0.4033
β13 −0.1056 0.1716 0.5395
β14 −0.3329 0.1418 0.0209
log(σ) 3.7104 0.3348  
ν 0.2315 0.0251  
τ 6.9203 0.4931  

Table 15. LR tests for some semiparametric regressions.

Models Hypotheses Statistic w p-value
OLLGIG ps vs GIG ps H0: τ=1 vs H1: H0 is false 6.5713 0.0104
OLLGIG ps vs IG ps H0: τ=1 and ν=−0.5 vs H1: H0 is false 86.5778 <0.001

Figure 10.

Index plots for θ: (a) LDi(θ) (likelihood distance) and (b) GDi(θ) (generalized Cook's distance). (c) Residual analysis of the fitted OLLGIG semiparametric regression to the current data. (d) Normal probability plot for the qrs with envelope.

Figure 11.

Shapes of the penalized smoothers ps(·) for the covariables x2 and x3 via the OLLGIG semiparametric regression model.

5. Concluding remarks

This paper presents the additive, additive partial and semiparametric regression models based on the odd log-logistic generalized inverse Gaussian (OLLGIG) distribution, which are very flexible for both unimodal and bimodal data. The proposed regressions include the generalized inverse Gaussian and inverse Gaussian regressions as embedded models, and their systematic components admit three types of penalized smoothers. They extend some existing additive, additive partial and semiparametric regressions and can be valuable additions to research on regression models and their extensions. The maximum penalized likelihood method for estimating the model parameters is detailed. The sensitivity of the penalized maximum-likelihood estimates and the checking of the fitted regressions by quantile residuals were also discussed. The versatility of the proposed regressions is demonstrated empirically through three applications to climatology, ethanol and air quality data.

Acknowledgments

This work was supported by CNPq and CAPES, Brazil.

Funding Statement

This work was supported by Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) and Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), Brazil.

Disclosure statement

No potential conflict of interest was reported by the authors.

References

  • 1. Atkinson A.C., Plots, Transformations and Regression: An Introduction to Graphical Methods of Diagnostic Regression Analysis, Oxford University Press, Oxford, 1985.
  • 2. Brinkman N.D., Ethanol fuel-Single-Cylinder engine study of efficiency and exhaust emissions, SAE Trans. (1981), pp. 1410–1424. doi: 10.4271/810345
  • 3. Chambers J.M., Cleveland W.S., Kleiner B., and Tukey P.A., Graphical Methods for Data Analysis, Wadsworth, Belmont, CA, 1983.
  • 4. Del Giudice V., Manganelli B., and De Paola P., Spline smoothing for estimating hedonic housing price models, in International Conference on Computational Science and Its Applications, Springer, Cham, 2015, pp. 210–219.
  • 5. Dunn P.K. and Smyth G.K., Randomized quantile residuals, J. Comput. Graph. Stat. 5 (1996), pp. 236–244.
  • 6. Eilers P.H.C. and Marx B.D., Flexible smoothing with B-splines and penalties, Stat. Sci. 11 (1996), pp. 89–121. doi: 10.1214/ss/1038425655
  • 7. Etienne X.L., Ferrara G., and Mugabe D., How efficient is maize production among smallholder farmers in Zimbabwe? A comparison of semiparametric and parametric frontier efficiency analyses, Appl. Econ. 51 (2019), pp. 2855–2871. doi: 10.1080/00036846.2018.1558363
  • 8. Fan S. and Hyndman R.J., Short-term load forecasting based on a semi-parametric additive model, IEEE Trans. Power Syst. 27 (2011), pp. 134–141. doi: 10.1109/TPWRS.2011.2162082
  • 9. Green P.J. and Silverman B.W., Nonparametric Regression and Generalized Linear Models, Chapman and Hall, London, 1994.
  • 10. Green P. and Yandell B., Semi-parametric generalized linear models, in Generalized Linear Models, Springer, New York, NY, 1985, pp. 44–55.
  • 11. Hastie T.J. and Tibshirani R.J., Generalized Additive Models, Chapman and Hall, London, 1990.
  • 12. Hudson I.L., Kim S.W., and Keatley M.R., Climatic influences on the flowering phenology of four Eucalypts: a GAMLSS approach, in Phenological Research, Springer, Dordrecht, 2010, pp. 209–228.
  • 13. Jørgensen B., Statistical Properties of the Generalized Inverse Gaussian Distribution, 2nd ed., Springer, New York, 1982.
  • 14. Lebotsa M.E., Sigauke C., Bere A., Fildes R., and Boylan J.E., Short term electricity demand forecasting using partially linear additive quantile regression with an application to the unit commitment problem, Appl. Energy 222 (2018), pp. 104–118. doi: 10.1016/j.apenergy.2018.03.155
  • 15. Rigby R.A. and Stasinopoulos D.M., Generalized additive models for location, scale and shape, J. R. Stat. Soc. Ser. C (Appl. Stat.) 54 (2005), pp. 507–554. doi: 10.1111/j.1467-9876.2005.00510.x
  • 16. Ruppert D., Wand M.P., and Carroll R.J., Semiparametric Regression, Cambridge University Press, 2003.
  • 17. Souza Vasconcelos J.C., Cordeiro G.M., Ortega E.M.M., and Araújo E.G., The new odd log-logistic generalized inverse Gaussian regression model, J. Probab. Stat. 2019 (2019), pp. 1–13. doi: 10.1155/2019/8575424
  • 18. Stasinopoulos D.M. and Rigby R.A., Generalized additive models for location scale and shape (GAMLSS) in R, J. Stat. Softw. 23 (2007), pp. 1–46. doi: 10.18637/jss.v023.i07
  • 19. Stasinopoulos M., Rigby B., and Akantziliotou C., Instructions on How to use the Gamlss Package in R, 2nd ed., 2008. Manual available at https://www.gamlss.com/wp-content/uploads/2013/01/gamlss-manual.pdf.
  • 20. Stasinopoulos D.M., Rigby R.A., Heller G.Z., Voudouris V., and De Bastiani F., Flexible Regression and Smoothing: Using GAMLSS in R, Chapman and Hall/CRC The R Series, Boca Raton, FL, 2017.

Articles from Journal of Applied Statistics are provided here courtesy of Taylor & Francis