ABSTRACT
Regression coefficients are usually estimated by solving minimization problems with asymmetric loss functions. In this paper, we instead correct predictions whose errors follow a generalized Gaussian distribution. In our method, we not only minimize the expected value of the asymmetric loss, but also lower the variance of the loss. Predictions inevitably contain errors, and they should therefore be used with these errors taken into account; our approach does exactly that. Furthermore, even if we do not understand the prediction method, which is a possible circumstance in, e.g. deep learning, we can apply our method as long as we know the prediction error distribution and the asymmetric loss function. Our method can be applied to the procurement of electricity from electricity markets.
Keywords: Asymmetric loss function, gamma function, generalized Gaussian distribution, minimizing the expectation value, risk function
2010 Mathematics Subject Classifications: 62A99, 62E99
1. Introduction
In this paper, we treat minimization problems with loss functions as follows: Let $\{(\boldsymbol{x}_i, y_i)\}_{i=1}^{n}$ be a dataset, where $\boldsymbol{x}_i \in \mathbb{R}^{p}$ are explanatory vectors and $y_i \in \mathbb{R}$ are target variables. We assume the following linear regression model:
$$\boldsymbol{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon},$$
where $\boldsymbol{y} = (y_1, \dots, y_n)^{\top}$, $\boldsymbol{\varepsilon} = (\varepsilon_1, \dots, \varepsilon_n)^{\top}$, and $X$ is an $n \times p$ design matrix having $\boldsymbol{x}_i^{\top}$ as the $i$th row. Let $L$ be the loss function for the parameter estimation. The parameter vector $\boldsymbol{\beta}$ can be estimated by
$$\hat{\boldsymbol{\beta}} = \operatorname*{arg\,min}_{\boldsymbol{\beta}} \sum_{i=1}^{n} L\bigl(y_i - \boldsymbol{x}_i^{\top}\boldsymbol{\beta}\bigr).$$
The case of $L(u) = u^{2}$ is known as the quadratic loss function, which is symmetric (see, e.g. Refs. [1,10,12]). For asymmetric loss functions, we refer the reader to, e.g. Refs. [3,5,9,16]. Consider the case in which the prediction accuracy is assessed by a risk function based on an asymmetric loss function. Let $y$ be an observed value and $\hat{y}$ a predicted value of $y$. Suppose that the prediction error follows a generalized Gaussian distribution. Then, the prediction can be corrected by adding a constant $C$, i.e. by using $\hat{y} + C$, where $C$ is determined by minimizing the risk function. We make the following assumptions:
- (I) If there is a mismatch between the observed value $y$ and the (corrected) predicted value, we suffer a loss given by an asymmetric loss function $L$.
- (II) The prediction error follows a generalized Gaussian distribution with probability density function $f$ and mean zero.

Here, $f$ could be the density of a Gaussian distribution or of a Laplace distribution with mean zero (see, e.g. [7, p. 80], [8, p. 164]). That is, writing $a$ for the shape parameter of the generalized Gaussian distribution, $f$ is the probability density function of a Gaussian distribution when $a = 2$, and of a Laplace distribution when $a = 1$.
Under assumptions (I) and (II), we will derive the optimized predicted value $\hat{y} + C$ minimizing the expected value of the loss (the risk). That is, we will solve the following risk minimization problem: choose the correction constant $c$ so as to minimize
$$\mathrm{E}\bigl[L(y, \hat{y} + c)\bigr].$$
In our method, we not only minimize the expected value of the asymmetric loss, but also lower the variance of the loss: the variance of the loss is reduced by correcting the predicted value $\hat{y}$ to the optimized predicted value $\hat{y} + C$.
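As a concrete illustration of this correction step (a minimal sketch, not the paper's exact formulation), the following Python code assumes a piecewise-linear asymmetric loss with slopes `alpha_under` and `alpha_over` and a generalized Gaussian prediction error generated with `scipy.stats.gennorm`, and estimates the correction constant $C$ by minimizing a Monte Carlo estimate of the expected loss.

```python
# A minimal sketch (not the paper's exact formulation): estimate the correction
# constant C by minimizing a Monte Carlo estimate of the expected asymmetric loss.
# The loss slopes alpha_under / alpha_over and the gennorm parametrization are
# illustrative assumptions.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import gennorm

def asymmetric_loss(err, alpha_under=3.0, alpha_over=1.0):
    """Piecewise-linear loss on err = y - (y_hat + c): under-prediction (err > 0)
    is penalized more heavily than over-prediction (err < 0)."""
    return np.where(err > 0, alpha_under * err, alpha_over * (-err))

a, b = 2.0, 1.0                          # shape a (a=2: Gaussian, a=1: Laplace), scale 1/b
rng = np.random.default_rng(0)
eps = gennorm.rvs(beta=a, scale=1.0 / b, size=100_000, random_state=rng)  # errors y - y_hat

def risk(c):
    return asymmetric_loss(eps - c).mean()       # Monte Carlo estimate of E[L(y, y_hat + c)]

C = minimize_scalar(risk, bounds=(-5, 5), method="bounded").x
print(f"estimated C = {C:.3f}; risk at C = {risk(C):.3f}; risk at 0 = {risk(0.0):.3f}")
```

With these illustrative slopes the under-prediction side is the expensive one, so the estimated $C$ is positive, i.e. the corrected prediction is biased upward.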
Our research is motivated as follows: (1) Suppose we have a risk minimization problem based on an asymmetric loss function. Then, we can reduce the risk by correcting the predictions. For example, the paper [15] formulates a method for minimizing the expected value of the procurement cost of electricity in two popular spot markets, day-ahead and intra-day, under the assumption that the expected values of the unit prices and the distributions of the prediction errors for the electricity demand traded in the two markets are known. That paper showed that increasing or decreasing the procurement relative to the prediction can, in some cases, reduce the expected procurement cost. Our method is a generalization of the method in [15].
(2) In recent years, prediction methods have become black boxes owing to big data and machine learning (see, e.g. Ref. [6]). The day will soon come when we must minimize an objective function by using predictions obtained from such black-box methods. In our method, even if we do not know how the prediction $\hat{y}$ is produced, we can determine the parameter $C$ as long as we know the prediction error distribution and the asymmetric loss function $L$.
The remainder of this paper is organized as follows: In Section 2, we introduce the expected value and the variance of the loss and determine the value $c = C$ that gives the minimum value of the expected loss. In addition, we give a geometrical interpretation of the parameter $C$ and derive the minimized expected value of the loss. In Section 3, we prove an inequality for the variance of the loss. In Section 4, we report the results of simulations of our method using actual data.
2. Expected value and variance of the loss
In the following, we assume that (I) and (II) hold. We introduce the expected value and the variance of the loss and determine the value $c = C$ that gives the minimum value of the expected loss. In addition, we give a geometrical interpretation of the parameter $C$ and derive the minimized expected value of the loss.
2.1. Expected value and variance of the loss
Let $y$ be an observed value and $\hat{y}$ a predicted value of $y$. Let $\Gamma(s)$ be the gamma function, and let $\Gamma(s, x)$ and $\gamma(s, x)$ be the upper and the lower incomplete gamma functions, respectively (see, e.g. Refs. [2, p. 197], [4, p. 2], [14, p. 93]), defined by
$$\Gamma(s, x) = \int_{x}^{\infty} t^{s-1} e^{-t}\,dt, \qquad \gamma(s, x) = \int_{0}^{x} t^{s-1} e^{-t}\,dt,$$
where $s > 0$ and $x \ge 0$. Then, the expected value and the variance of the loss are as follows (see Appendix 1 for the proof): For any $c$, we have
(1)
(2)
From this, we have the following:
(3)
(4)
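For numerical work, the gamma and incomplete gamma functions appearing in such expressions can be evaluated with standard routines. The sketch below uses `scipy.special`, whose `gammainc` and `gammaincc` return the regularized lower and upper incomplete gamma functions, so the unregularized versions defined above are recovered by multiplying by $\Gamma(s)$.

```python
# Evaluating Gamma(s), the upper incomplete gamma Gamma(s, x), and the lower
# incomplete gamma gamma(s, x) with scipy.special (which returns the
# *regularized* versions, hence the multiplication by Gamma(s)).
import numpy as np
from scipy.special import erf, gamma, gammainc, gammaincc

def upper_incomplete_gamma(s, x):
    return gamma(s) * gammaincc(s, x)   # Gamma(s, x) = integral_x^inf t^{s-1} e^{-t} dt

def lower_incomplete_gamma(s, x):
    return gamma(s) * gammainc(s, x)    # gamma(s, x) = integral_0^x t^{s-1} e^{-t} dt

s, x = 0.5, 1.3
# The two pieces sum to the complete gamma function.
assert np.isclose(upper_incomplete_gamma(s, x) + lower_incomplete_gamma(s, x), gamma(s))
# For s = 1/2 the lower incomplete gamma reduces to the error function (defined below):
assert np.isclose(lower_incomplete_gamma(0.5, x), np.sqrt(np.pi) * erf(np.sqrt(x)))
```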
Let $\operatorname{erf}$ be the error function (see, e.g. Ref. [2, p. 196]) defined by
$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_{0}^{x} e^{-t^{2}}\,dt$$
for any $x$. We give two examples of the expected value and the variance of the loss. In the case of a Laplace distribution ($a = 1$), closed-form expressions follow from identities for the incomplete gamma functions that are easily obtained from their definitions. In the case of a Gaussian distribution ($a = 2$), the incomplete gamma functions reduce to expressions involving the error function, and closed forms are again obtained. For $b = 1$, we can plot the expected value and the variance of the loss for the Laplace and the Gaussian distributions with respect to the loss parameter as follows:
In both cases, the graph of the expected value is a straight line with positive slope, and the graph of the variance is a convex quadratic curve (Figures 1 and 2).
Figure 1.
Plots for a Laplace distribution. (a) Expected value of the loss. (b) Variance of the loss.
Figure 2.
Plots for a Gaussian distribution. (a) Expected value of the loss. (b) Variance of the loss.
2.2. Parameter value minimizing the expected value
Here, we determine the value $c = C$ that gives the minimum value of the expected loss. Differentiating the expected loss with respect to $c$ gives its first derivative. We denote by $C$ the value of $c$ at which this derivative vanishes. Then, from the first derivative test, we find that the expected loss has a minimum value at $c = C$.
| $c$ | $c < C$ | $c = C$ | $c > C$ |
|---|---|---|---|
| Derivative of the expected loss | Negative | 0 | Positive |
| Expected loss | Strictly decreasing | Minimum | Strictly increasing |
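For concreteness, here is the first-order condition written out under an assumed piecewise-linear asymmetric loss with coefficients $\alpha_1$ (under-prediction) and $\alpha_2$ (over-prediction); this is an illustration under that assumption, not a transcription of the paper's loss function.

```latex
% First-order condition for the correction constant C under an assumed
% piecewise-linear asymmetric loss applied to the error \varepsilon = y - \hat{y}.
\[
  \frac{d}{dc}\,\mathrm{E}\bigl[L(\varepsilon - c)\bigr]
    = -\alpha_1 \Pr(\varepsilon > c) + \alpha_2 \Pr(\varepsilon < c) = 0
  \quad\Longleftrightarrow\quad
  F(C) = \frac{\alpha_1}{\alpha_1 + \alpha_2},
\]
```

where $F$ denotes the cumulative distribution function of the prediction error $\varepsilon$; the sign pattern of this derivative around $C$ is exactly the one recorded in the table above.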
Also, it follows from
(5)
that $C = 0$ only when the two coefficients of the asymmetric loss are equal. Equation (5) also gives a geometrical interpretation of $C$: the ratio of the area under the error density $f$ to the left of $C$ and the area to the right of $C$ is determined by the coefficients of the loss. That is, the point $t = C$ divides the area between $f$ and the $t$-axis in this ratio.
Let $\operatorname{erf}^{-1}$ be the inverse error function. We give two examples of $C$. In the case of a Laplace distribution, since $a = 1$, Equation (5) yields a closed-form expression for $C$.
In the case of a Gaussian distribution, since $a = 2$, Equation (5) yields
(6)
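As a hedged illustration of such closed forms: under the assumed piecewise-linear loss with slopes `alpha_under` and `alpha_over`, $C$ is the quantile of the error distribution at level $\alpha_{\mathrm{under}}/(\alpha_{\mathrm{under}}+\alpha_{\mathrm{over}})$, which for a Gaussian error can be written with the inverse error function and for a Laplace error has a logarithmic form. The code below computes these standard quantile solutions; it is not a transcription of Equation (6).

```python
# Closed-form correction C under an assumed piecewise-linear asymmetric loss:
# C is the tau-quantile of the error distribution, tau = alpha_under / (alpha_under + alpha_over).
import numpy as np
from scipy.special import erfinv
from scipy.stats import laplace, norm

alpha_under, alpha_over = 3.0, 1.0
tau = alpha_under / (alpha_under + alpha_over)

sigma = 1.0                                    # scale of the Gaussian error (assumption)
b = 1.0                                        # scale of the Laplace error (assumption)

C_gauss = norm.ppf(tau, scale=sigma)
C_gauss_via_erfinv = sigma * np.sqrt(2.0) * erfinv(2.0 * tau - 1.0)   # same value via erf^{-1}
C_laplace = laplace.ppf(tau, scale=b)

assert np.isclose(C_gauss, C_gauss_via_erfinv)
print(f"Gaussian: C = {C_gauss:.4f}   Laplace: C = {C_laplace:.4f}")
```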
Figure 3 plots $C$ for the Laplace and Gaussian distributions with respect to the loss parameter for $b = 1$.
Figure 3.
Plots of $C$ for the Laplace and Gaussian distributions.
2.3. Minimized expected value of the loss
Here, we derive the minimum value of the expected loss. Substituting $c = C$ in Equation (A1) and using Equation (5), we obtain the minimum value of the expected loss. From this and Equation (3), we obtain a corresponding closed-form expression.
Figure 4 plots the minimized expected loss for the Laplace and Gaussian distributions with respect to the loss parameter for $b = 1$.
Figure 4.
Plots of the minimized expected loss for the Laplace and Gaussian distributions.
3. Inequality for the variance of the loss
Here, we derive an inequality for the variance of the loss. Let $C$ be the value of $c$ giving the minimum value of the expected loss. Then, the following holds:
Theorem 3.1
The variance of the loss at the optimized predicted value $\hat{y} + C$ does not exceed the variance of the loss at the uncorrected predicted value $\hat{y}$, where equality holds only when $C = 0$, that is, when the two coefficients of the asymmetric loss are equal.
Proof.
It follows from Equation (5) that, substituting $c = C$ in Equation (A2), we obtain the variance of the loss at $c = C$. Combining this with Equation (4) yields the difference between the variance at $c = 0$ and the variance at $c = C$, expressed in terms of an auxiliary function of $a$ and $x$ defined for $a > 0$ and $x \ge 0$. By Lemma A.1 in Appendix 2, this auxiliary function is positive for $a > 0$ and $x > 0$; together with a further inequality valid for all $a > 0$, this yields the claimed inequality, where equality holds only when $C = 0$. Moreover, from Equation (5), we find that $C = 0$ holds only when the two coefficients of the asymmetric loss are equal.
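A quick Monte Carlo check of this statement, under the same illustrative assumptions as before (piecewise-linear loss, generalized Gaussian error via `scipy.stats.gennorm`): after shifting by the estimated $C$, both the sample mean and the sample variance of the loss should be no larger than at the uncorrected prediction.

```python
# Monte Carlo check of the claim: correcting the prediction by C lowers both the
# sample mean and the sample variance of the (assumed) asymmetric loss.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import gennorm

def loss(err, alpha_under=3.0, alpha_over=1.0):
    return np.where(err > 0, alpha_under * err, alpha_over * (-err))

rng = np.random.default_rng(1)
for a in (1.0, 2.0):                                     # Laplace (a=1) and Gaussian (a=2)
    eps = gennorm.rvs(beta=a, size=200_000, random_state=rng)
    C = minimize_scalar(lambda c: loss(eps - c).mean(), bounds=(-5, 5), method="bounded").x
    l0, lC = loss(eps), loss(eps - C)
    print(f"a={a}: mean {l0.mean():.3f} -> {lC.mean():.3f}, "
          f"variance {l0.var():.3f} -> {lC.var():.3f}")
```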
Figure 5 shows the corresponding plots for the Laplace and Gaussian distributions with respect to the loss parameter for $b = 1$.
Figure 5.
Plots for the Laplace and the Gaussian distributions.
4. Simulation
The simulations of our method used actual data, namely the ‘cars’ data from the R datasets package, which consist of speeds and stopping distances of cars.
We separated the cars data into training and test datasets as follows: the odd-numbered observations were selected as the training dataset, and the even-numbered observations were used as the test dataset. Figure 6 shows scatter plots of the training and test data; the horizontal axis represents speed, and the vertical axis represents stopping distance.
Figure 6.
Scatter plots of the training and test data. (a) Training data. (b) Test data.
We used the training dataset to fit the parameters of the regression model and also to find the solution to the minimization problem.
The regression coefficients of $y = ax + b$ were estimated by the least squares method, and the unbiased sample variance of the error was 308.42. Fixing the coefficients of the asymmetric loss, we obtained $C = 16.99$ from Equation (6) and the corresponding estimated solution to the minimization problem. Figure 7 plots (i) the least squares prediction, (ii) the least squares prediction corrected by $C$, and (iii) the prediction obtained by directly minimizing the empirical asymmetric loss.
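A sketch of this simulation setup in Python, assuming the same illustrative loss slopes as in the earlier sketches (the loss coefficients actually used in the paper are not reproduced here) and fetching the R `cars` data through statsmodels' `get_rdataset`: fit least squares on the odd-numbered rows, estimate the error variance, apply a Gaussian-quantile correction in the spirit of Equation (6), and compare the loss of the uncorrected and corrected predictions.

```python
# Sketch of the Section 4 simulation (illustrative loss coefficients assumed):
# odd-numbered rows as training data, even-numbered rows as test data.
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

cars = sm.datasets.get_rdataset("cars", "datasets").data    # columns: speed, dist
train, test = cars.iloc[0::2], cars.iloc[1::2]

X_tr = sm.add_constant(train["speed"])
fit = sm.OLS(train["dist"], X_tr).fit()                     # least squares fit
sigma2 = fit.resid.var(ddof=2)                              # unbiased error variance
print("unbiased sample variance of the error:", round(float(sigma2), 2))

alpha_under, alpha_over = 3.0, 1.0                          # assumed loss slopes
tau = alpha_under / (alpha_under + alpha_over)
C = norm.ppf(tau, scale=np.sqrt(sigma2))                    # Gaussian-quantile correction
print("correction constant C:", round(float(C), 2))

def loss(err):
    return np.where(err > 0, alpha_under * err, alpha_over * (-err))

for name, df in (("train", train), ("test", test)):
    pred = fit.predict(sm.add_constant(df["speed"]))
    for label, corr in (("(i) uncorrected", 0.0), ("(ii) corrected", C)):
        e = df["dist"] - (pred + corr)
        print(name, label, "mean loss", round(loss(e).mean(), 2),
              "loss variance", round(loss(e).var(ddof=1), 2))
```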
Figure 7.
Scatter plots and plots of (i), (ii), and (iii). (a) Scatter plot for the training data and plots of (i), (ii), and (iii). (b) Scatter plot for the test data and plots of (i), (ii), and (iii).
The slope of (iii) is steeper than the slope of (i). Therefore, if the linear model is true, the values of (iii) are far from the observations when $x$ is large; that is, the loss is large when $x$ is large. This implies that using (iii) as a prediction is high risk. On the other hand, the slope of (ii) is equal to the slope of (i). Therefore, using (ii) as a prediction is low risk.
Table 1 lists the sample means and sample variances of the loss for the training and test data under the fixed loss coefficients.
Table 1. Sample means and sample variances of the loss for the training and the test data under the fixed loss coefficients.
| Prediction | Mean (training) | Mean (test) | Variance (training) | Variance (test) |
|---|---|---|---|---|
| (i) | 37.31 | 28.05 | 3337.93 | 896.99 |
| (ii) | 31.31 | 21.86 | 834.35 | 105.40 |
| (iii) | 28.48 | 25.56 | 612.18 | 127.50 |
Figure 8 plots the sample means of the loss for the training and test data for (i), (ii), and (iii) with respect to the loss parameter.
Figure 8.
Plots of sample means for the training and the test data. (a) Sample means for the training data. (b) Sample means for the test data.
For the test data, the plot of the loss of (iii) has the form of steps. This is because the coefficients obtained by directly minimizing the empirical asymmetric loss are discontinuous with respect to the loss parameter. On the other hand, the plot of the loss of (ii) is smooth, because $C$ is continuous with respect to the loss parameter. Therefore, the loss of (ii) is more stable than the loss of (iii). In addition, deriving the estimates in (iii) is troublesome, whereas deriving $C$ is easy because $C$ is given explicitly as a function of the loss coefficients and the parameters of the error distribution.
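A small illustration of this continuity point (again under the assumed piecewise-linear loss): as the asymmetry level $\tau$ varies, the closed-form $C$ for a Gaussian error moves continuously, whereas the minimizer of the empirical loss over a finite sample of errors jumps between sample points.

```python
# Closed-form C is continuous in the asymmetry level tau, while the minimizer
# of the empirical asymmetric loss over a finite sample is a step function
# (it always lands on one of the sample points).
import numpy as np
from scipy.stats import norm

def pinball(err, tau):
    # piecewise-linear (pinball) loss with asymmetry level tau in (0, 1)
    return np.where(err > 0, tau * err, (tau - 1.0) * err)

def empirical_minimizer(eps, tau):
    candidates = np.sort(eps)                       # optimum is attained at a sample point
    risks = [pinball(eps - c, tau).mean() for c in candidates]
    return candidates[int(np.argmin(risks))]

rng = np.random.default_rng(2)
eps = rng.normal(size=30)                           # a small sample of prediction errors

for tau in np.linspace(0.55, 0.95, 9):
    print(f"tau={tau:.2f}  closed-form C={norm.ppf(tau):+.3f}  "
          f"empirical minimizer={empirical_minimizer(eps, tau):+.3f}")
```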
Figure 9 plots the sample variances of the loss for the training and test data for (i), (ii), and (iii) with respect to the loss parameter.
Figure 9.
Plots of sample variances for the training and the test data. (a) Sample variance for the training data. (b) Sample variances for the training data. (c) Sample variance for the test data. (d) Sample variances for the test data.
Clearly, from Figures 8 and 9, the sample means and sample variances of (ii) and (iii) are lower than those of (i) for any value of the loss parameter. This simulation shows that it is best to use (ii) as the prediction. Other simulation results are listed in Appendix 3.
Appendix 1.
Calculation of the expected value and variance of the loss
A.1. Expected value of the loss
Put . Then,
Replace z with bz to get
When $c \ge 0$, we have
When c<0, we have
From the above, for any $c$, we have
Now set to get
Therefore, for any $c$, we have
A.2. Variance of the loss
Put . Then,
Replace z with bz to get
When $c \ge 0$, we have
When c<0, we have
From the above, for any $c$, we have
Now set to get
Therefore, for any $c$, we have
Moreover, from Appendix A.1, we have
Therefore, for any $c$, we have
Appendix 2. Inequalities for the gamma and the incomplete gamma functions
Here, we prove the following lemma used in Theorem 3.1.
Lemma A.1
For a>0 and
To prove Lemma A.1, we need the following three lemmas:
Lemma A.2
For
Lemma A.3
For a>0 and
Lemma A.4
For a>0 and
Proof of Lemma A.2 —
First, we prove
(A1)
Let . Accordingly, we have
Therefore, we obtain
Thus, Equation (A1) is proved.
Next, by using Equation (A1), we prove
(A2)
for a>0. Let
To prove for a>0, we use the following formula [2, p. 13, Theorem 1.2.5]:
where $\gamma$ denotes Euler's constant. Taking the logarithmic derivative and using the above formula, we have
for a>0. Moreover, from Equation (A1), we obtain for a>0. This leads to for a>0. Equation (A2) follows from this and .
Using Equation (A2) and the formula [2, p. 22, Theorem 6.5.1]
we complete the proof of Lemma A.2 as follows:
Proof of Lemma A.3 —
For a>0 and , we define
Then, we have
The lemma follows from this and .
Proof of Lemma A.4 —
When , the statement easily follows from the definition of . When b>0, we use L'Hôpital's rule to obtain
Proof of Lemma A.1 —
For a>0 and , we define
Let us prove (a>0, x>0). For a>0 and , we define
Then, we have
From these relations, we find that the signs of and (i = 1, 2, 3) are the same for a>0 and . Let (i = 2, 3, 4) be the value of x satisfying . It is easily verified that and for a>0. Therefore, from the first derivative test, we obtain Tables A1 and A2. Moreover, using Lemmas A.3 and A.4 and L'Hôpital's rule, we obtain
From these results, Lemma A.2, and the fact that the signs of and (i = 1, 2, 3) are the same for a>0 and x>0, we obtain Tables A3 and A4. From Tables A3 and A4, we can verify that the claimed inequality holds for $a > 0$ and $x > 0$. This completes the proof of the lemma.
Table A1. Sign table for the case $0 < a < 1$.
Table A2. Sign table for the case $a \ge 1$.
Table A3. Sign table for the case $0 < a < 1$.
Table A4. Sign table for the case $a \ge 1$.
Appendix 3. Other simulations
We conducted three other simulations of our method using the same actual data, i.e. the ‘cars’ data of the R datasets package, which include the speeds and stopping distances of automobiles.
We separated the dataset into training and test datasets as follows: the even-numbered observations were selected as the training dataset, and the odd-numbered observations were used as the test dataset. The regression coefficients of $y = ax + b$ were estimated by the least squares method, and the unbiased sample variance of the error was 159.31. Fixing the coefficients of the asymmetric loss as before, we obtained $C = 16.99$ by Equation (6) and the corresponding estimated solution to the minimization problem. Figure A1 plots (i) the least squares prediction, (ii) the least squares prediction corrected by $C$, and (iii) the prediction obtained by directly minimizing the empirical asymmetric loss.
Figure A.1.
Scatter plots and plots of (i), (ii), and (iii). (a) Scatter plot for the training data and plots of (i), (ii), and (iii). (b) Scatter plot for the test data and plots of (i), (ii), and (iii).
Table A5 lists the sample means and sample variances of the loss for the training and test data under the fixed loss coefficients.
Table A5. Sample means and sample variances of the loss for the training and test data under the fixed loss coefficients.
| Prediction | Mean (training) | Mean (test) | Variance (training) | Variance (test) |
|---|---|---|---|---|
| (i) | 31.42 | 42.54 | 1307.61 | 4065.98 |
| (ii) | 20.04 | 31.74 | 216.34 | 1836.93 |
| (iii) | 19.56 | 30.40 | 169.21 | 1540.48 |
Figure A2 plots the sample means of the loss for the training and test data for (i), (ii), and (iii) with respect to the loss parameter.
Figure A.2.
Plots of sample means for the training and the test data. (a) Sample means for the training data. (b) Sample means for the test data.
Moreover, Figure A3 plots the sample variances of the loss for the training and test data for (i), (ii), and (iii) with respect to the loss parameter.
Figure A.3.
Plots of sample variances for the training and the test data. (a) Sample variance for the training data. (b) Sample variances for the training data. (c) Sample variance for the test data. (d) Sample variances for the test data.
Next, we separated the cars dataset into training and test datasets as follows: the 1st to 25th observations were selected as the training dataset (the observations in the cars dataset are arranged in ascending order of speed), and the 26th to 50th observations were used as the test dataset. The regression coefficients of $y = ax + b$ were estimated by the least squares method, and the unbiased sample variance of the error was 178.77. Figure A4 plots the sample means of the loss for the training and test data for (i), (ii), and (iii) with respect to the loss parameter.
Figure A.4.
Plots of sample means for the training and test data. (a) Sample means for the training data. (b) Sample means for the test data.
Moreover, Figure A5 plots the sample variances of the loss for the training and test data for (i), (ii), and (iii) with respect to the loss parameter.
Figure A.5.
Plots of sample variances for the training and test data. (a) Sample variance for the training data. (b) Sample variances for the training data. (c) Sample variance for the test data. (d) Sample variances for the test data.
Finally, we separated the cars data into training and test datasets as follows: the 26th to 50th observations were selected as the training dataset, and the 1st to 25th observations were used as the test dataset. The regression coefficients of $y = ax + b$ were estimated by the least squares method, and the unbiased sample variance of the error was 277.30. Figure A6 plots the sample means of the loss for the training and test data for (i), (ii), and (iii) with respect to the loss parameter.
Figure A.6.
Plots of sample means for the training and the test data. (a) Sample means for the training data. (b) Sample means for the test data.
Moreover, Figure A7 plots the sample variances of the loss for the training and test data for (i), (ii), and (iii) with respect to the loss parameter.
Figure A.7.
Plots of sample variances for the training and test data. (a) Sample variance for the training data. (b) Sample variances for the training data. (c) Sample variance for the test data. (d) Sample variances for the test data.
Disclosure statement
No potential conflict of interest was reported by the author(s).
References
- 1. Aldrich J., Doing least squares: Perspectives from Gauss and Yule, Int. Stat. Rev. 66 (1998), pp. 61–81. Available at https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1751-5823.1998.tb00406.x.
- 2. Andrews G.E., Askey R., and Roy R., Special Functions, Encyclopedia of Mathematics and its Applications, Cambridge University Press, New York, 1999.
- 3. Breckling J. and Chambers R., M-quantiles, Biometrika 75 (1988), pp. 761–771. Available at http://www.jstor.org/stable/2336317. doi: 10.1093/biomet/75.4.761
- 4. Dytso A., Bustin R., Poor H.V., and Shamai S., Analytical properties of generalized Gaussian distributions, J. Stat. Distrib. Appl. 5 (2018), p. 6. doi: 10.1186/s40488-018-0088-5
- 5. Efron B., Regression percentiles using asymmetric squared error loss, Stat. Sin. 1 (1991), pp. 93–125. Available at http://www.jstor.org/stable/24303995.
- 6. Guidotti R., Monreale A., Turini F., Pedreschi D., and Giannotti F., A survey of methods for explaining black box models, ACM Comput. Surv. 51 (2018), pp. 93:1–93:42.
- 7. Johnson N., Kotz S., and Balakrishnan N., Continuous Univariate Distributions, 2nd ed., Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics, Vol. 1, Wiley-Interscience, 1994.
- 8. Johnson N., Kotz S., and Balakrishnan N., Continuous Univariate Distributions, 2nd ed., Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics, Vol. 2, Wiley-Interscience, 1995.
- 9. Koenker R. and Bassett G., Regression quantiles, Econometrica 46 (1978), pp. 33–50. Available at http://www.jstor.org/stable/1913643. doi: 10.2307/1913643
- 10. Legendre A., Nouvelles méthodes pour la détermination des orbites des comètes, Nineteenth Century Collections Online (NCCO): Science, Technology, and Medicine: 1780–1925, F. Didot, 1805. Available at https://books.google.co.jp/books?id=FRcOAAAAQAAJ.
- 11. Nadarajah S., A generalized normal distribution, J. Appl. Stat. 32 (2005), pp. 685–694. doi: 10.1080/02664760500079464
- 12. Stigler S.M., Gauss and the invention of least squares, Ann. Statist. 9 (1981), pp. 465–474. doi: 10.1214/aos/1176345451
- 13. Subbotin T., On the law of frequency of error, Recueil Math. 31 (1923), pp. 296–301.
- 14. Wang Z.X. and Guo D.R., Special Functions, World Scientific, 1989. Available at https://www.worldscientific.com/doi/abs/10.1142/0653.
- 15. Yamaguchi N., Hori M., and Ideguchi Y., Minimising the expectation value of the procurement cost in electricity markets based on the prediction error of energy consumption, Pac. J. Math. Ind. 10 (2018), p. 4. doi: 10.1186/s40736-018-0038-7
- 16. Zellner A., Bayesian estimation and prediction using asymmetric loss functions, J. Am. Stat. Assoc. 81 (1986), pp. 446–451. Available at http://www.jstor.org/stable/2289234. doi: 10.1080/01621459.1986.10478289