The semiparametric regression model for bimodal data with different penalized smoothers applied to climatology, ethanol and air quality data

J C S Vasconcelos; G M Cordeiro; E M M Ortega

doi:10.1080/02664763.2020.1803812

2020 Aug 7;49(1):248–267. doi: 10.1080/02664763.2020.1803812

PMCID: PMC9042003 PMID: 35707795

ABSTRACT

Semiparametric regressions can be used to model data when covariables and the response variable have a nonlinear relationship. In this work, we propose three flexible regression models for bimodal data, called the additive, additive partial and semiparametric regressions, based on the odd log-logistic generalized inverse Gaussian distribution under three types of penalized smoothers. The main idea is not to compare the three forms of smoothing against each other but to show the versatility of the distribution under each of them. We present several Monte Carlo simulations, carried out for different parameter configurations and sample sizes, to verify the precision of the penalized maximum-likelihood estimators. The usefulness of the proposed regressions is demonstrated empirically through three applications to climatology, ethanol and air quality data.

Keywords: Additive model, additive partial model, generalized inverse Gaussian distribution, semiparametric model, splines

1. Introduction

For many years, the normal linear regression model has been used to explain most random phenomena. Even when the phenomenon under study does not present a response for which the normality assumption is reasonable, some type of transformation is often suggested to achieve the desired normal distribution. Another important problem in regression models occurs when there are both linear and nonlinear effects on the response variable in a single data set.

A great effort has been made to provide more flexible assumptions so that these regressions can model real situations with greater precision. However, these flexible assumptions lead to more complex regression models, which can be very hard to interpret in some cases. The literature now offers various types of regression models, such as the generalized linear semiparametric models pioneered by Green and Yandell [10], in which a nonparametric term is added to the linear predictor. Another extension of the generalized linear models is the generalized additive model (GAM) introduced by Hastie and Tibshirani [11], in which a term that enters the predictor in parametric form is replaced by an arbitrary function and is then estimated nonparametrically by smooth curves (such as splines). Ruppert et al. [16] demonstrate that nonparametric regression can be viewed as a relatively simple extension of parametric regression and combine the two in what they call semiparametric regression, which they approach through penalized regression splines and mixed models. Rigby and Stasinopoulos [15] developed the generalized additive model for location, scale and shape (GAMLSS), which has been widely used in various areas of science because it allows the location, scale and shape to be modeled simultaneously. Semiparametric regression methods are of great practical importance in real applications. For example, Fan and Hyndman proposed a new statistical method to predict short-term electricity demand based on a semiparametric additive model, Lebotsa et al. [14] presented an application of partially linear additive quantile regression models to predict short-term electricity demand using data from South Africa, Hudson et al. [12] showed the benefits of the GAMLSS in the modeling and interpretation of possible nonlinear climate impacts on eucalyptus tree growth, Del Giudice et al. [4] presented a hedonic price function constructed through a semiparametric additive model, and more recently, Etienne et al. [7] used a semiparametric model and a stochastic frontier model to estimate the efficiency of corn production by smallholders in Zimbabwe.

On the other hand, the distributions commonly used in regression models are being modified and/or generalized to enable them to model different complex forms of data. Hence, it is convenient to consider parametric families of distributions that are flexible enough to capture a wide range of symmetric, asymmetric and bimodal behaviors.

In this article, we adopt as baseline the odd log-logistic generalized inverse Gaussian (OLLGIG) distribution introduced recently by Souza Vasconcelos et al. [17]. The fundamental objective is to propose additive, additive partial and semiparametric regression models for bimodal data based on the OLLGIG distribution with different penalized smoothers.

The inferential component is carried out using the asymptotic distribution of the maximum-likelihood estimators (MLEs). Some global influence methods are presented for these models. Additionally, we develop a residual analysis based on quantile residuals (qrs). For some parameter settings, additive terms and sample sizes, several Monte Carlo simulations are carried out to compare the empirical distribution of the qrs with the standard normal distribution. These simulations indicate that the empirical distribution of these residuals, under the different penalized smoothers, agrees with the standard normal distribution.
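As a rough illustration of the quantile residuals mentioned above, the sketch below uses the usual construction: each response is passed through the fitted cumulative distribution function and then through the standard normal quantile function, so the residuals should look standard normal when the model is adequate. The OLLGIG cdf is not reproduced here; an exponential distribution with a known rate is used purely as a stand-in, and the names `quantile_residuals`, `exp_cdf` and `rate` are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def quantile_residuals(y, cdf):
    """Quantile residuals: standard normal quantile of the fitted cdf at each response."""
    std_normal = NormalDist()
    eps = 1e-12  # keep probabilities strictly inside (0, 1) for inv_cdf
    return [std_normal.inv_cdf(min(max(cdf(v), eps), 1.0 - eps)) for v in y]


# Stand-in "fitted" model (assumption): exponential with rate 2, NOT the OLLGIG
rate = 2.0


def exp_cdf(v):
    return 1.0 - math.exp(-rate * v)


random.seed(1)
y = [random.expovariate(rate) for _ in range(2000)]

qrs = quantile_residuals(y, exp_cdf)
# Under a correctly specified model, these should be close to 0 and 1
print(round(mean(qrs), 2), round(stdev(qrs), 2))
```

With a fitted regression, the cdf would be evaluated at the estimated parameters for each observation, and the residuals would then be compared with the standard normal distribution, for example through a normal probability plot.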

The rest of the paper is structured as follows. In Section 2, we define the OLLGIG semiparametric regression model based on different penalized smoothers, estimate its parameters by the penalized maximum-likelihood method, and discuss diagnostic and residual analysis. In Section 3, some properties of the maximum-likelihood estimators are evaluated through a simulation study. In Section 4, we show empirically the flexibility, practical relevance and applicability of the proposed regression models by means of three real data sets. Section 5 offers some concluding remarks.

2. The OLLGIG semiparametric regression

To model the OLLGIG distribution, we used the gamlss package [18] available in the R software and implemented it as a new distribution, as described in Section 4.2 of [19]. For the regression analysis, we use the function gamlss(·) from the same package, with the regression structures and penalized smoothers described in Tables 5, 10 and 13.

Table 5. Systematic components of the OLLGIG, GIG and IG additive regressions and goodness-of-fit measures for climatology data.

Model	Systematic structures	GAIC
OLLGIG	$μ_{i} = \exp[β_{0} + \mathrm{cs}(x_{i1}) + \mathrm{cs}(x_{i2}) + \mathrm{cs}(x_{i3})]$	402.3305
GIG	$μ_{i} = \exp[β_{0} + \mathrm{cs}(x_{i1}) + \mathrm{cs}(x_{i2}) + \mathrm{cs}(x_{i3})]$	407.7017
IG	$μ_{i} = \exp[β_{0} + \mathrm{cs}(x_{i1}) + \mathrm{cs}(x_{i2}) + \mathrm{cs}(x_{i3})]$	413.6021
OLLGIG	$μ_{i} = \exp[β_{0} + \mathrm{ps}(x_{i1}) + \mathrm{ps}(x_{i2}) + \mathrm{ps}(x_{i3})]$	408.9545
GIG	$μ_{i} = \exp[β_{0} + \mathrm{ps}(x_{i1}) + \mathrm{ps}(x_{i2}) + \mathrm{ps}(x_{i3})]$	413.9161
IG	$μ_{i} = \exp[β_{0} + \mathrm{ps}(x_{i1}) + \mathrm{ps}(x_{i2}) + \mathrm{ps}(x_{i3})]$	419.9604
OLLGIG	$μ_{i} = \exp[β_{0} + \mathrm{pb}(x_{i1}) + \mathrm{pb}(x_{i2}) + \mathrm{pb}(x_{i3})]$	401.8478
GIG	$μ_{i} = \exp[β_{0} + \mathrm{pb}(x_{i1}) + \mathrm{pb}(x_{i2}) + \mathrm{pb}(x_{i3})]$	405.1313
IG	$μ_{i} = \exp[β_{0} + \mathrm{pb}(x_{i1}) + \mathrm{pb}(x_{i2}) + \mathrm{pb}(x_{i3})]$	410.8665

Table 10. Additive partial regressions and GAIC for some regressions fitted to the ethanol data.

Model	Systematic structures	GAIC
OLLGIG	$μ_{i} = \exp[β_{0} + β_{1} w_{i1} + \mathrm{cs}(x_{i2})]$	36.2462
GIG	$μ_{i} = \exp[β_{0} + β_{1} w_{i1} + \mathrm{cs}(x_{i2})]$	39.4854
IG	$μ_{i} = \exp[β_{0} + β_{1} w_{i1} + \mathrm{cs}(x_{i2})]$	91.4063
OLLGIG	$μ_{i} = \exp[β_{0} + β_{1} w_{i1} + \mathrm{ps}(x_{i2})]$	39.3340
GIG	$μ_{i} = \exp[β_{0} + β_{1} w_{i1} + \mathrm{ps}(x_{i2})]$	41.0862
IG	$μ_{i} = \exp[β_{0} + β_{1} w_{i1} + \mathrm{ps}(x_{i2})]$	92.2585
OLLGIG	$μ_{i} = \exp[β_{0} + β_{1} w_{i1} + \mathrm{pb}(x_{i2})]$	35.0047
GIG	$μ_{i} = \exp[β_{0} + β_{1} w_{i1} + \mathrm{pb}(x_{i2})]$	37.6757
IG	$μ_{i} = \exp[β_{0} + β_{1} w_{i1} + \mathrm{pb}(x_{i2})]$	91.0999

Table 13. Semiparametric regressions and GAIC statistic from the fitted regressions to the air quality data.

Model	Systematic components	GAIC
OLLGIG	$μ_{i} = \exp[β_{0} + β_{1} w_{i1} + \mathrm{cs}(x_{i2}) + \mathrm{cs}(x_{i3})]$	936.3173
GIG	$μ_{i} = \exp[β_{0} + β_{1} w_{i1} + \mathrm{cs}(x_{i2}) + \mathrm{cs}(x_{i3})]$	940.2850
IG	$μ_{i} = \exp[β_{0} + β_{1} w_{i1} + \mathrm{cs}(x_{i2}) + \mathrm{cs}(x_{i3})]$	1019.6003
OLLGIG	$μ_{i} = \exp[β_{0} + β_{1} w_{i1} + \mathrm{ps}(x_{i2}) + \mathrm{ps}(x_{i3})]$	934.7905
GIG	$μ_{i} = \exp[β_{0} + β_{1} w_{i1} + \mathrm{ps}(x_{i2}) + \mathrm{ps}(x_{i3})]$	939.3618
IG	$μ_{i} = \exp[β_{0} + β_{1} w_{i1} + \mathrm{ps}(x_{i2}) + \mathrm{ps}(x_{i3})]$	1017.3683
OLLGIG	$μ_{i} = \exp[β_{0} + β_{1} w_{i1} + \mathrm{pb}(x_{i2}) + \mathrm{pb}(x_{i3})]$	941.1109
GIG	$μ_{i} = \exp[β_{0} + β_{1} w_{i1} + \mathrm{pb}(x_{i2}) + \mathrm{pb}(x_{i3})]$	941.7588
IG	$μ_{i} = \exp[β_{0} + β_{1} w_{i1} + \mathrm{pb}(x_{i2}) + \mathrm{pb}(x_{i3})]$	1018.3338
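The three GAIC tables above (climatology, ethanol and air quality data, in that order) can be read as a model-selection exercise: within each data set, the fit with the smallest GAIC is preferred. The sketch below only transcribes the tabulated values and picks the minimum; the dictionary layout and the helper name `best_model` are illustrative assumptions, not part of the original analysis.

```python
# GAIC values copied from the three tables above: (distribution, smoother) -> GAIC
gaic = {
    "climatology": {
        ("OLLGIG", "cs"): 402.3305, ("GIG", "cs"): 407.7017, ("IG", "cs"): 413.6021,
        ("OLLGIG", "ps"): 408.9545, ("GIG", "ps"): 413.9161, ("IG", "ps"): 419.9604,
        ("OLLGIG", "pb"): 401.8478, ("GIG", "pb"): 405.1313, ("IG", "pb"): 410.8665,
    },
    "ethanol": {
        ("OLLGIG", "cs"): 36.2462, ("GIG", "cs"): 39.4854, ("IG", "cs"): 91.4063,
        ("OLLGIG", "ps"): 39.3340, ("GIG", "ps"): 41.0862, ("IG", "ps"): 92.2585,
        ("OLLGIG", "pb"): 35.0047, ("GIG", "pb"): 37.6757, ("IG", "pb"): 91.0999,
    },
    "air quality": {
        ("OLLGIG", "cs"): 936.3173, ("GIG", "cs"): 940.2850, ("IG", "cs"): 1019.6003,
        ("OLLGIG", "ps"): 934.7905, ("GIG", "ps"): 939.3618, ("IG", "ps"): 1017.3683,
        ("OLLGIG", "pb"): 941.1109, ("GIG", "pb"): 941.7588, ("IG", "pb"): 1018.3338,
    },
}


def best_model(table):
    """Return the (distribution, smoother) pair with the smallest GAIC."""
    return min(table, key=table.get)


for data_set, table in gaic.items():
    dist, smoother = best_model(table)
    print(f"{data_set}: {dist} with {smoother}(.) has the smallest GAIC ({table[(dist, smoother)]})")
```

On these values the OLLGIG fit has the smallest GAIC in every data set, with pb(·) for the climatology and ethanol data and ps(·) for the air quality data.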

Table 1. Systematic components for the parameters.

Regression	Penalized smoothers	Systematic components
OLLGIG	cs(·)	$μ_{i} = \exp[β_{1} w_{i1} + β_{2} w_{i2} + \mathrm{cs}(x_{i3})]$
OLLGIG	ps(·)	$μ_{i} = \exp[β_{1} w_{i1} + β_{2} w_{i2} + \mathrm{ps}(x_{i3})]$
OLLGIG	pb(·)	$μ_{i} = \exp[β_{1} w_{i1} + β_{2} w_{i2} + \mathrm{pb}(x_{i3})]$

Table 2. AEs, biases and MSEs for the fitted OLLGIG regression with penalized smoothers under scenarios 1[cs(·)], 2[ps(·)] and 3[pb(·)].

Scenario 1 [cs(·)]
	n = 50			n = 100			n = 250
Parameters	AE	Bias	MSE	AE	Bias	MSE	AE	Bias	MSE
$β_{1}$	0.0104	0.0004	0.0052	0.0085	−0.0015	0.0023	0.0100	0.0000	0.0009
$β_{2}$	−0.9924	0.0076	0.0200	−0.9969	0.0031	0.0087	−1.0033	−0.0033	0.0035
Scenario 2 [ps(·)]
	n = 50			n = 100			n = 250
Parameters	AE	Bias	MSE	AE	Bias	MSE	AE	Bias	MSE
$β_{1}$	0.0102	0.0002	0.0051	0.0085	−0.0015	0.0022	0.0090	−0.0010	0.0009
$β_{2}$	−0.9900	0.0100	0.0199	−0.9972	0.0028	0.0089	−1.0031	−0.0031	0.0034
Scenario 3 [pb(·)]
	n = 50			n = 100			n = 250
Parameters	AE	Bias	MSE	AE	Bias	MSE	AE	Bias	MSE
$β_{1}$	0.0081	−0.0019	0.0053	0.0066	−0.0034	0.0022	0.0095	−0.0005	0.0009
$β_{2}$	−1.0068	−0.0068	0.0194	−0.9974	0.0026	0.0093	−0.9988	0.0012	0.0037
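The simulation summaries above report the average estimate (AE), bias and mean squared error (MSE) for each parameter. The sketch below shows how such summaries are usually computed from replicate estimates, assuming the standard definitions (bias as AE minus the true value, MSE as the mean squared deviation from the true value); this section does not state the formulas, so they are an assumption. The replicate values in `toy` are hypothetical. The last lines check that AE minus bias in the tabulated rows for $β_{1}$ implies a true value of 0.01, which is an inference from the table rather than a stated design value.

```python
from statistics import mean


def summarize(estimates, true_value):
    """Return (AE, bias, MSE) for replicate estimates of a single parameter."""
    ae = mean(estimates)
    bias = ae - true_value
    mse = mean((e - true_value) ** 2 for e in estimates)
    return ae, bias, mse


# Hypothetical replicate estimates, for illustration only (not from the paper)
toy = [0.9, 1.1, 1.0, 1.2, 0.8]
ae, bias, mse = summarize(toy, true_value=1.0)
print(ae, bias, mse)

# (AE, Bias) pairs for beta_1 under scenario 1, copied from the table above
beta1_rows = [(0.0104, 0.0004), (0.0085, -0.0015), (0.0100, 0.0000)]
implied_true = [round(a - b, 4) for a, b in beta1_rows]
print(implied_true)
```

The same check applied to the $β_{2}$ rows gives an implied true value of −1.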

Table 3. MLEs and SEs of the model parameters for climatology data.

Model	$\log(μ)$	$\log(σ)$	ν	τ	AIC	GD
OLLGIG	1.2780	−.19504	27.8990	0.2875	482.7119	474.7119
	(0.0226)	(0.0554)	(9.8170)	(0.0193)
GIG	1.3176	−1.0429	3.3460	1	491.1583	487.1583
	(0.0269)	(0.1295)	(5.3590)	(–)
IG	1.3178	−1.7319	−0.5	1	492.6715	486.6715
	(0.0276)	(0.0569)	(–)	(–)

Table 4. LR tests for climatology data.

Models	Hypotheses	Statistic w	p-value
OLLGIG vs GIG	$H_{0}: τ = 1$ vs $H_{1}: H_{0}$ is false	11.9596	0.0005
OLLGIG vs IG	$H_{0}: τ = 1$ and $ν = -0.5$ vs $H_{1}: H_{0}$ is false	12.4464	0.0019

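Table 6. MLEs, SEs and p-values for the OLLGIG additive regression with pb(·) fitted to climatology data.

Parameter	Estimate	SE	p-value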
$β_{0}$	−0.2576	0.1694	0.1305
$\log (σ)$	0.0169	0.2495
ν	0.4622	0.4266
τ	4.3021	0.6548

Table 7. LR tests for comparing regressions.

Regressions	Hypotheses	Statistic w	p-value
OLLGIG pb(·) vs GIG pb(·)	$H_{0}: τ = 1$ vs $H_{1}: H_{0}$ is false	6.0701	0.0244
OLLGIG pb(·) vs IG pb(·)	$H_{0}: τ = 1$ and $ν = -0.5$ vs $H_{1}: H_{0}$ is false	14.6503	0.0018

Table 8. MLEs and SEs (in parentheses) of the model parameters for ethanol data.

Model	$\log(μ)$	$\log(σ)$	ν	τ	AIC	GD
OLLGIG	0.5341	−.11017	22.8500	0.1951	250.8621	242.8621
	(0.0662)	(0.9342)	(9.8800)	(0.0656)
GIG	0.6717	−0.1787	1.7900	1	262.7169	256.7169
	(0.0678)	(0.2475)	(1.0310)	(–)
IG	0.6716	−0.6436	−0.5	1	265.1399	261.1399
	(0.0783)	(0.0754)	(–)	(–)
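The AIC and global deviance (GD) columns of the ethanol table directly above are linked by the usual relation AIC = GD + 2k, where k is the number of estimated parameters. The sketch below checks this for the three fits, assuming k = 4 for the OLLGIG (μ, σ, ν, τ), k = 3 for the GIG (τ fixed at 1) and k = 2 for the IG (ν fixed at −0.5 and τ at 1), as suggested by the dashes in the SE rows. The function name `gaic` and its `penalty` argument are illustrative and are not the gamlss API.

```python
# (GD, number of free parameters, reported AIC), copied from the ethanol table above
fits = {
    "OLLGIG": (242.8621, 4, 250.8621),
    "GIG": (256.7169, 3, 262.7169),
    "IG": (261.1399, 2, 265.1399),
}


def gaic(global_deviance, n_parameters, penalty=2.0):
    """Generalized AIC: deviance plus a penalty per parameter (penalty=2 gives the AIC)."""
    return global_deviance + penalty * n_parameters


for name, (gd, k, reported_aic) in fits.items():
    computed = gaic(gd, k)
    print(f"{name}: computed AIC = {computed:.4f}, reported AIC = {reported_aic:.4f}")
```

Changing `penalty` gives other members of the GAIC family; for instance, `penalty=math.log(n)` would correspond to the BIC for a sample of size `n`.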

Table 9. LR tests for ethanol data.

Models	Hypotheses	Statistic w	p-value
OLLGIG vs GIG	$H_{0}: τ = 1$ vs $H_{1}: H_{0}$ is false	13.8548	<0.001
OLLGIG vs IG	$H_{0}: τ = 1$ and $ν = -0.5$ vs $H_{1}: H_{0}$ is false	18.2778	<0.001
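The LR statistics for the ethanol data can be recovered as differences of the global deviances (GD) reported for the three fitted distributions, and the p-values follow from a chi-square reference distribution with as many degrees of freedom as parameters fixed under the null hypothesis. The sketch below assumes that chi-square approximation and uses closed-form survival functions for 1 and 2 degrees of freedom; `chi2_sf` is a hypothetical helper written for this example.

```python
import math

# Global deviances copied from the table of ethanol fits
gd = {"OLLGIG": 242.8621, "GIG": 256.7169, "IG": 261.1399}


def chi2_sf(w, df):
    """Upper-tail probability of a chi-square variable, closed forms for df = 1 and df = 2."""
    if df == 1:
        return math.erfc(math.sqrt(w / 2.0))
    if df == 2:
        return math.exp(-w / 2.0)
    raise ValueError("only df = 1 or df = 2 are handled in this sketch")


# OLLGIG vs GIG fixes tau = 1 (1 restriction); OLLGIG vs IG also fixes nu = -0.5 (2 restrictions)
w_gig = gd["GIG"] - gd["OLLGIG"]
w_ig = gd["IG"] - gd["OLLGIG"]

print(round(w_gig, 4), chi2_sf(w_gig, df=1))
print(round(w_ig, 4), chi2_sf(w_ig, df=2))
```

Both differences agree with the tabulated statistics (13.8548 and 18.2778), and both p-values fall below 0.001. For the regressions with penalized smoothers, the reference degrees of freedom may differ from these simple counts, so this shortcut is only meant for the fits without covariates.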

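Table 11. MLEs, SEs and p-values for the OLLGIG additive partial regression with pb(·) fitted to ethanol data.

Parameter	Estimate	SE	p-value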
$β_{0}$	−0.6978	0.5333	0.1938
$β_{11}$	−0.3783	0.1737	0.0318
$β_{12}$	−0.1439	0.1715	0.4033
$β_{13}$	−0.1056	0.1716	0.5395
$β_{14}$	−0.3329	0.1418	0.0209
$\log (σ)$	3.7104	0.3348
ν	0.2315	0.0251
τ	6.9203	0.4931

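Table 14. MLEs, SEs and p-values for the fitted semiparametric OLLGIG regression with ps(·) to the air quality data.

Parameter	Estimate	SE	p-value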
$β_{0}$	−1.3744	0.0791	<0.001
$β_{1}$	0.0261	0.0041	<0.001
$\log (σ)$	20.3111	0.0051
ν	4.4625	0.5352
τ	3.3840	0.2694

Table 12. LR statistics for testing some regressions.

Models	Hypotheses	Statistic w	p-value
OLLGIG pb vs GIG pb	$H_{0}: τ = 1$ vs $H_{1}: H_{0}$ is false	4.2999	0.0281
OLLGIG pb vs IG pb	$H_{0}: τ = 1$ and $ν = -0.5$ vs $H_{1}: H_{0}$ is false	60.7959	<0.001

Table 15. LR tests for some semiparametric regressions.

Models	Hypotheses	Statistic w	p-value
OLLGIG ps vs GIG ps	$H_{0}: τ = 1$ vs $H_{1}: H_{0}$ is false	6.5713	0.0104
OLLGIG ps vs IG ps	$H_{0}: τ = 1$ and $ν = -0.5$ vs $H_{1}: H_{0}$ is false	86.5778	<0.001

ABSTRACT

1. Introduction

2. The OLLGIG semiparametric regression

Table 5. Systematic components of the OLLGIG, GIG and IG additive regressions and goodness-of-fit measures for climatology data.

Table 10. Additive partial regressions and GAIC for some regressions fitted to the ethanol data.

Table 13. Semiparametric regressions and GAIC statistic from the fitted regressions to the air quality data.

2.1. Diagnostic tools and residual analysis

3. Simulation study using different penalized smoothers

Table 1. Systematic components for the parameters.

Table 2. AEs, biases and MSEs for the fitted OLLGIG regression with penalized smoothers under scenarios 1[cs(·)], 2[ps(·)] and 3[pb(·)].

Figure 1.

Figure 2.

4. Applications

4.1. OLLGIG additive regression to climatology data

Table 3. MLEs and SEs of the model parameters for climatology data.

Table 4. LR tests for climatology data.

Figure 3.

Figure 4.

Table 6. MLEs, SEs and p-values for the OLLGIG additive regression with pb(·) fitted to climatology data.

Table 7. LR tests for comparing regressions.

Figure 5.

Figure 6.

4.2. OLLGIG additive partial regression fitted to ethanol data

Table 8. MLEs and SEs (in parentheses) of the model parameters for ethanol data.

Table 9. LR tests for ethanol data.

Figure 7.

Table 11. MLEs, SEs and p-values for the OLLGIG additive partial regression with pb(·) fitted to ethanol data.

Table 12. LR statistics for testing some regressions.

Figure 8.

4.3. OLLGIG semiparametric regression fitted to air quality data

Figure 9.

Table 14. MLEs, SEs and p-values for the fitted semiparametric OLLGIG regression with ps(·) to the air quality data.

Table 15. LR tests for some semiparametric regressions.

Figure 10.

Figure 11.

5. Concluding remarks

Acknowledgments

Funding Statement

Disclosure statement

References
