Journal of Applied Statistics
2021 Jun 16;49(13):3257–3277. doi: 10.1080/02664763.2021.1939662

Improving logistic regression on the imbalanced data by a novel penalized log-likelihood function

Lili Zhang a, Trent Geisler a, Herman Ray b (corresponding author), Ying Xie c
PMCID: PMC9542776  PMID: 36213775

Abstract

Logistic regression is estimated by maximizing the log-likelihood objective function, which is formulated under the assumption of maximizing the overall accuracy. That assumption does not hold for imbalanced data, and the resulting models tend to be biased towards the majority class (i.e. non-event), which can bring great loss in practice. One strategy for mitigating such bias is to penalize the misclassification costs of observations differently in the log-likelihood function. Existing solutions require either difficult hyperparameter estimation or high computational complexity. We propose a novel penalized log-likelihood function that includes penalty weights as decision variables for observations in the minority class (i.e. event) and learns them from the data along with the model coefficients. In the experiments, the proposed logistic regression model is compared with the existing ones on the statistics of the area under the receiver operating characteristic (ROC) curve from 10 public datasets and 16 simulated datasets, as well as on the training time. A detailed analysis is conducted on an imbalanced credit dataset to examine the estimated probability distributions, additional performance measurements (i.e. type I error and type II error) and model coefficients. The results demonstrate that both the discrimination ability and the computation efficiency of logistic regression models are improved by using the proposed log-likelihood function as the learning objective.

Keywords: Logistic regression, binary classification, imbalanced data, maximum likelihood, penalized log-likelihood function, cost-sensitive

1. Introduction

The imbalanced data present a big challenge in the data-driven world. The minority class (i.e. event) is usually the class of interest and more costly if misclassified, such as fraud in the fraud detection problem [36], malignancy in the breast cancer diagnosis problem [27] and delinquency in the credit scoring problem [5]. By definition, imbalanced data are considered by most practitioners to be data where the number of observations labeled with the majority class (i.e. non-event) is at least twice that of the minority class [21].

The challenge is that most statistical and machine learning methods are biased towards the majority class and cannot predict the minority class accurately, including the logistic regression model, which has been favored for its high interpretability. The bias can bring great loss (money, reputation, etc.). It is caused by the underlying assumption of the optimization objective (e.g. the log-likelihood function), which is to maximize the overall accuracy [21,38]. However, the overall accuracy is not a valid performance measurement for the classification of imbalanced data [28].

Because of the significant and broad applications related to the imbalanced data, researchers have made efforts over the past decades to improve solutions at the levels of input data (e.g. oversampling, undersampling) [4,16,19], features (e.g. feature selection, variable discretization) [33,46,48], algorithms (e.g. cost-sensitive learning, ensembles) [2,10] and output (e.g. thresholding) [22].

In the present work, we focus on improving logistic regression on the imbalanced data from the perspective of cost-sensitive learning, considering that interpretability is often required for prescriptive actions. Logistic regression has been widely used for decision-making systems since its development by Cox and Duncan independently in the 1950s and 1960s for binary classification problems [11,12,43]. It is a linear classification model whose optimal coefficients are estimated by maximizing the log-likelihood function [34]. The advantages of logistic regression are multifold, including high interpretability and low time complexity [23]. The applications of logistic regression cover broad areas, such as bankruptcy prediction [30], credit scoring [2], heart sound segmentation [40], landslide susceptibility prediction [13], urban land spatial expansion analysis [39] and adolescent obesity risk [47].

To mitigate the bias on the imbalanced data, one strategy is to penalize the misclassification costs of observations differently in the log-likelihood objective function that is used to train the logistic regression model for optimal coefficients [42]. However, the existing solutions require either difficult hyperparameter estimation or very high time complexity [14,32]. In the present work, we propose a novel penalized log-likelihood objective function by including penalty weights as decision variables for event observations and learning them from the data along with the model coefficients via the gradient descent method. By using the proposed log-likelihood function as the optimization objective to train logistic regression models, both discrimination ability and computation efficiency are improved, based on our experimental results.

The paper is structured in the following way. In Section 2, the existing penalized log-likelihood functions for the imbalanced data in the literature are reviewed. Section 3 describes the proposed log-likelihood function and how it can be solved. Section 4 illustrates the experiments and results. In Section 5, the estimated probability distribution and the estimated coefficients of resulting models are examined on a credit dataset as a case study. In Section 6, the conclusions are presented. In Section 7, the future work is discussed.

2. Related work

Logistic regression is a linear model for binary classification problems. In logistic regression, the values of the input independent variables (i.e. $x_{i0}, \dots, x_{in}$) are linearly combined, as defined in Equation (2), and then transformed by a sigmoid function, as defined in Equation (1), as shown in Figure 1. The notations can be found in Table 1.

Table 1.

Notations.

Notation Meaning
$m$ Total number of observations in the training data
$n$ Total number of independent variables
$i$ Index of observations, $i = 1, \dots, m$
$j$ Index of independent variables, $j = 0, \dots, n$
$x_{ij}$ Value of the $j$th independent variable in the $i$th observation
$x_i$ Vector of values of independent variables in the $i$th observation
$y_i$ True class label of the $i$th observation
$\beta_j$ Estimated coefficient of the $j$th independent variable
$\beta$ Vector of estimated coefficients of independent variables
$h_i$ Model output for the $i$th observation
$\hat{y}_i$ Estimated class label for the $i$th observation
$$
h_i = \pi(\beta^T x_i) = \frac{1}{1 + e^{-\beta^T x_i}}, \tag{1}
$$

where

$$
\beta^T x_i = \sum_{j=0}^{n} \beta_j x_{ij} = \beta_0 x_{i0} + \beta_1 x_{i1} + \cdots + \beta_n x_{in} \tag{2}
$$

with $x_{i0} = 1$, which makes $\beta_0$ the intercept.
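As a concrete illustration, Equations (1) and (2) can be sketched in a few lines of NumPy. The function names `sigmoid` and `predict_proba` are ours, not from the paper:

```python
import numpy as np

def sigmoid(z):
    # Equation (1): the logistic (sigmoid) function.
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(beta, X):
    # Equation (2): X includes a leading column of 1s so that
    # beta[0] acts as the intercept (x_i0 = 1).
    return sigmoid(X @ beta)
```

Because the sigmoid maps any real input into (0, 1), the output can be read directly as an estimated event probability.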

Figure 1.

Logistic regression.

The sigmoid function in Equation (1) restricts the model output to between 0 and 1. The model output is interpreted as the estimated probability of the event occurrence, considering that the event of interest (e.g. fraud, delinquency, failure, malignancy) is always coded as 1 while the non-event (e.g. non-fraud, non-delinquency, pass, benign) is always coded as 0 [23,31]. Taking the ith observation as an example, the probability of the event occurrence is estimated by Equation (3), and correspondingly the probability of the non-event occurrence is estimated by Equation (4). Mathematically, these two equations can be equivalently combined into the single Equation (5).

$$
P(Y=1 \mid X=x_i) = \pi(\beta^T x_i), \tag{3}
$$
$$
P(Y=0 \mid X=x_i) = 1 - \pi(\beta^T x_i), \tag{4}
$$
$$
P(Y=y_i \mid X=x_i) = \pi(\beta^T x_i)^{y_i}\,(1 - \pi(\beta^T x_i))^{1-y_i}. \tag{5}
$$

Assuming that all observations are independent, the overall likelihood can be expressed by the likelihood function in Equation (6), which is the product of the individual likelihoods over the training data. The problem is to identify the model coefficients $\beta$ that maximize the overall likelihood. To improve the computation efficiency, the likelihood function is transformed into its log form in Equation (7), called the log-likelihood function. To solve this unconstrained optimization problem, the most commonly used algorithm is the gradient descent algorithm [41], in which the partial derivatives are first computed.

$$
L(\beta) = \prod_{i=1}^{m} P(Y=y_i \mid X=x_i) = \prod_{i=1}^{m} \big[\pi(\beta^T x_i)^{y_i}\,(1-\pi(\beta^T x_i))^{1-y_i}\big], \tag{6}
$$
$$
LL(\beta) = \sum_{i=1}^{m} \big[y_i \log(\pi(\beta^T x_i)) + (1-y_i)\log(1-\pi(\beta^T x_i))\big]. \tag{7}
$$

Maximizing the log-likelihood in Equation (7) is equivalent to minimizing the negative log-likelihood in Equation (8), which is referred to as the loss function or cost function of logistic regression. The time complexity for solving Equation (8) by the gradient descent algorithm is $O(n)$ [17].

The loss function in Equation (8) can be interpreted in two parts. The first part, $-y_i\log(\pi(\beta^T x_i))$, is the misclassification cost for event observations (i.e. $y_i = 1$), while the second part, $-(1-y_i)\log(1-\pi(\beta^T x_i))$, is the misclassification cost for non-event observations (i.e. $y_i = 0$), as shown in Equation (9). By assuming that the numbers of events and non-events are equal and that the misclassification costs of events and non-events are equal, this objective function essentially maximizes the overall accuracy.

$$
\min_{\beta} \; -\sum_{i=1}^{m} \big[y_i \log(\pi(\beta^T x_i)) + (1-y_i)\log(1-\pi(\beta^T x_i))\big], \tag{8}
$$
$$
\mathrm{cost}_i = \begin{cases} -y_i \log(\pi(\beta^T x_i)) & \text{if } y_i = 1, \\ -(1-y_i)\log(1-\pi(\beta^T x_i)) & \text{if } y_i = 0. \end{cases} \tag{9}
$$
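The standard loss in Equation (8) is straightforward to evaluate numerically. The sketch below (the function name and the `eps` clipping guard are our choices, not from the paper) computes the negative log-likelihood for a coefficient vector:

```python
import numpy as np

def neg_log_likelihood(beta, X, y, eps=1e-12):
    # Equation (8): negative log-likelihood (loss) of logistic regression.
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    p = np.clip(p, eps, 1.0 - eps)  # guard against log(0)
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```

With beta = 0 every predicted probability is 0.5, so the loss equals m·log 2; any coefficient vector that separates the classes better lowers it.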

However, as Kubat et al. pointed out, the overall accuracy is not a valid and effective performance measurement for the imbalanced data [28]. In the imbalanced data, the number of observations in the majority class (i.e. non-events) is usually at least twice that of the minority class (i.e. events) [21]. By maximizing the overall accuracy, logistic regression tends to be biased towards the majority class and severely misclassifies events as non-events [21,38,44]. For example, in an empirical study on the influence of the event rate on the discrimination ability of bankruptcy prediction models, when the event rate (i.e. the proportion of bankruptcy observations) is 0.12%, the accuracy of the logistic regression model is 99.41%, but its type II error is 95.01%, which indicates that 95.01% of bankruptcy observations are misclassified as non-bankruptcy [45]. This bias can bring great loss in practice, for example, when banks approve loans to organizations with a predicted low but truly high bankruptcy probability. To appropriately measure model performance on the imbalanced data, researchers have suggested providing a comprehensive assessment with both curve-based measurements (e.g. ROC, precision–recall curve) and point-value measurements (e.g. type I error, type II error, F-measure, G-mean) [20,28].

To apply logistic regression to the imbalanced data (i.e. rare event data), King and Zeng penalized the misclassification costs of events and non-events differently by penalty weights $W_1$ and $W_0$ in the log-likelihood function [26], as shown in Equation (10). Penalty weights $W_1$ and $W_0$ are determined by the population proportion of events $\tau$ and the sample proportion of events $\bar{y}$, defined in Equation (11). $W_1$ is the penalty weight for all event observations, while $W_0$ is the penalty weight for all non-event observations. Because they are invariant to the values of the independent variables, they are referred to as global penalty weights in this research context. Because $W_0$ and $W_1$ are pre-defined and plugged into the log-likelihood function as constants, the resulting loss function in Equation (10) can be solved in the same time complexity $O(n)$ as the standard log-likelihood function in Equation (8). The misclassification costs associated with this loss function can be found in Equation (12). One challenge in this method is that it is hard to estimate the population proportion of events $\tau$ accurately [15], which ultimately influences the performance of logistic regression driven by the global penalty weights $W_0$ and $W_1$, as found in an empirical study [46].

$$
\min_{\beta} \; -W_1 \sum_{i=1}^{m} y_i \log(\pi(\beta^T x_i)) - W_0 \sum_{i=1}^{m} (1-y_i)\log(1-\pi(\beta^T x_i)), \tag{10}
$$

where

$$
W_1 = \frac{\tau}{\bar{y}} \quad \text{and} \quad W_0 = \frac{1-\tau}{1-\bar{y}} \tag{11}
$$

with $\tau$ denoting the population fraction of events induced by choice-based sampling and $\bar{y}$ denoting the sample proportion of events.

$$
\mathrm{cost}_i = \begin{cases} -W_1 y_i \log(\pi(\beta^T x_i)) & \text{if } y_i = 1, \\ -W_0 (1-y_i)\log(1-\pi(\beta^T x_i)) & \text{if } y_i = 0. \end{cases} \tag{12}
$$
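Equation (11) is simple enough to sketch directly (the function name `global_weights` is ours). Note that with a balanced target $\tau = 0.5$ and a rare sample event rate $\bar{y}$, $W_1$ far exceeds $W_0$, upweighting the events:

```python
def global_weights(tau, ybar):
    # Equation (11): W1 penalizes event misclassification,
    # W0 penalizes non-event misclassification.
    W1 = tau / ybar
    W0 = (1.0 - tau) / (1.0 - ybar)
    return W1, W0
```

For example, tau = 0.5 with a 10% sample event rate gives W1 = 5 and W0 ≈ 0.56, so each misclassified event costs roughly nine times as much as a misclassified non-event.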

Instead of penalizing misclassification costs based on classes (i.e. event and non-event), Deng proposed to penalize the misclassification cost of each observation differently by a penalty weight $w_i$, where $i$ is the observation index, as shown in Equation (13). $w_i$ is determined by the Gaussian kernel function, defined in Equation (14), where $K_w$ is the kernel width, a hyperparameter to tune. This is called locally weighted logistic regression or kernel logistic regression [8,49]. The resulting loss function in Equation (13) can be solved in time complexity $O(n^3)$ [25,32]. The corresponding misclassification costs of the loss function are in Equation (15). The increase in time complexity is caused by the computation of distance matrices using the Gaussian kernel in Equation (14), which limits its application to large datasets.

$$
\min_{\beta} \; -\sum_{i=1}^{m} w_i \big[y_i \log(\pi(\beta^T x_i)) + (1-y_i)\log(1-\pi(\beta^T x_i))\big], \tag{13}
$$

where

$$
w_i = \exp\left(-\frac{\|x_i - x_q\|^2}{K_w^2}\right) \tag{14}
$$

with $x_q$ denoting the query observation being evaluated:

$$
\mathrm{cost}_i = \begin{cases} -w_i y_i \log(\pi(\beta^T x_i)) & \text{if } y_i = 1, \\ -w_i (1-y_i)\log(1-\pi(\beta^T x_i)) & \text{if } y_i = 0. \end{cases} \tag{15}
$$
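A minimal sketch of the Gaussian kernel weight in Equation (14), assuming the squared Euclidean distance between the observation and the query point (the function name is ours):

```python
import numpy as np

def kernel_weight(x_i, x_q, Kw):
    # Equation (14): Gaussian kernel weight; K_w is the kernel width.
    # Weight decays from 1 toward 0 as x_i moves away from the query x_q.
    return float(np.exp(-np.sum((x_i - x_q) ** 2) / Kw ** 2))
```

Because every observation receives its own distance-based weight per query point, fitting requires full distance matrices, which is the source of the $O(n^3)$ cost noted above.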

By including both the global penalty weights (i.e. $W_0$, $W_1$) and the local penalty weights (i.e. $w_i$) above, along with a regularization term (i.e. $(\alpha/2)\|\beta\|^2$), in the log-likelihood function, Maalouf and Trafalis proposed a rare event weighted kernel logistic regression [31], as shown in Equation (16). This loss function can be solved in time complexity $O(n^3)$. The costs associated with the loss function can be found in Equation (17). Besides its high computational complexity, this method also introduces one more hyperparameter $\alpha$ to tune in the regularization term.

$$
\min_{\beta} \; -W_1 \sum_{i=1}^{m} w_i y_i \log(\pi(\beta^T x_i)) - W_0 \sum_{i=1}^{m} w_i (1-y_i)\log(1-\pi(\beta^T x_i)) + \frac{\alpha}{2}\|\beta\|^2, \tag{16}
$$

where $\alpha$ is the regularization strength, a hyperparameter tuned by users.

$$
\mathrm{cost}_i = \begin{cases} -W_1 w_i y_i \log(\pi(\beta^T x_i)) + \frac{\alpha}{2}\|\beta\|^2 & \text{if } y_i = 1, \\ -W_0 w_i (1-y_i)\log(1-\pi(\beta^T x_i)) + \frac{\alpha}{2}\|\beta\|^2 & \text{if } y_i = 0. \end{cases} \tag{17}
$$

3. A novel penalized log-likelihood function

To address the challenges of difficult parameter estimation (i.e. the population proportion of the event) and high time complexity in the related work, we introduce local penalty weights $\lambda_i$ for event observations as decision variables in the log-likelihood objective function. The loss function is redefined in Equation (18), denoted as $LL(\lambda, \beta)$. The misclassification costs of the event observations are penalized by $\lambda_i$, while the misclassification costs of the non-event observations are not penalized, as shown in Equation (19). Restricting the penalty weights to the event observations reduces the number of decision variables and thus the complexity of the optimization problem, since iteratively updating a large number of decision variables increases the computational cost of the learning process, which will be discussed in the next section.

$$
\min_{\beta,\lambda} \; -\sum_{i=1}^{m} \big[\lambda_i y_i \log(\pi(\beta^T x_i)) + (1-y_i)\log(1-\pi(\beta^T x_i))\big], \tag{18}
$$

where $\lambda_i > 0$, a parameter learned from the data:

$$
\mathrm{cost}_i = \begin{cases} -\lambda_i y_i \log(\pi(\beta^T x_i)) & \text{if } y_i = 1, \\ -(1-y_i)\log(1-\pi(\beta^T x_i)) & \text{if } y_i = 0. \end{cases} \tag{19}
$$

3.1. Learning by gradient descent

The optimization of the proposed log-likelihood function in Equation (18) is a nonlinear programming problem with two sets of decision variables, $\beta$ and $\lambda$, which can be solved by the gradient descent algorithm in time complexity $O(n)$. First, the partial derivatives with respect to $\beta_j$ and $\lambda_i$ are derived in Equations (20) and (21), respectively. They are updated iteratively by the rules in Equations (22) and (23), respectively, where $\alpha_1$ is the learning rate for $\beta_j$ and $\alpha_2$ is the learning rate for $\lambda_i$. The learning rates are tuned by users. The gradient descent-based algorithm is summarized in Algorithm 1, which is guaranteed to converge as proved by other researchers [1,3,9]. In this setting, to ensure that larger penalty weights are given to the event observations, $\lambda$ is initialized to 1 and updated to larger values iteratively. Of note, the update in Equation (23) is in the same direction as the derivative due to a suspected relationship between the $\lambda$ and $\beta$ parameters:

$$
\begin{aligned}
\frac{\partial(-LL(\beta,\lambda))}{\partial \beta_j}
&= -\sum_{i=1}^{m} \left[\lambda_i y_i \frac{\partial \log(\pi(\beta^T x_i))}{\partial \beta_j} + (1-y_i)\frac{\partial \log(1-\pi(\beta^T x_i))}{\partial \beta_j}\right] \\
&= -\sum_{i=1}^{m} \left[\frac{\lambda_i y_i}{\pi(\beta^T x_i)} - \frac{1-y_i}{1-\pi(\beta^T x_i)}\right] \pi(\beta^T x_i)\,(1-\pi(\beta^T x_i))\, x_{ij} \\
&= -\sum_{i=1}^{m} \big[\lambda_i y_i (1-\pi(\beta^T x_i)) - (1-y_i)\pi(\beta^T x_i)\big]\, x_{ij} \\
&= -\sum_{i=1}^{m} (\lambda_i y_i - \lambda_i y_i h_i - h_i + y_i h_i)\, x_{ij},
\end{aligned} \tag{20}
$$
$$
\frac{\partial(-LL(\beta,\lambda))}{\partial \lambda_i} = -y_i \log(\pi(\beta^T x_i)), \tag{21}
$$
$$
\beta_{j,\mathrm{NEW}} = \beta_{j,\mathrm{CURRENT}} - \alpha_1 \frac{\partial(-LL(\beta,\lambda))}{\partial \beta_{j,\mathrm{CURRENT}}}, \tag{22}
$$
$$
\lambda_{i,\mathrm{NEW}} = \lambda_{i,\mathrm{CURRENT}} + \alpha_2 \frac{\partial(-LL(\beta,\lambda))}{\partial \lambda_{i,\mathrm{CURRENT}}}. \tag{23}
$$

Algorithm 1.
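A minimal sketch of the learning procedure in Algorithm 1, assuming a fixed iteration count in place of the paper's validation-AUROC stopping rule and averaging the β-gradient over the observations for step-size stability (both simplifications are ours, as are the function and variable names):

```python
import numpy as np

def fit_learnable(X, y, alpha1=0.1, alpha2=0.01, n_iter=300):
    """Sketch of Algorithm 1: jointly learn beta and event penalty weights lambda."""
    m, n = X.shape
    beta = np.zeros(n)
    lam = np.ones(m)  # lambda_i initialized to 1
    for _ in range(n_iter):
        h = 1.0 / (1.0 + np.exp(-(X @ beta)))          # Equation (1)
        # Equation (20), averaged over m (our choice for a stable step size)
        grad_beta = -X.T @ (lam * y - lam * y * h - h + y * h) / m
        # Equation (21): zero for non-events, nonnegative for events
        grad_lam = -y * np.log(np.clip(h, 1e-12, 1.0))
        beta -= alpha1 * grad_beta                     # Equation (22)
        lam += alpha2 * grad_lam                       # Equation (23)
    return beta, lam
```

Because the λ-gradient is zero for non-events and nonnegative for events, the λ_i grow only for event observations, matching the initialization and the update direction described above.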

3.2. Probability estimation

To interpret the role of the local penalty weights $\lambda$, we invert the log transformation in Equation (18) and trace it back to the likelihood function. As shown in Equation (24), the penalty weights $\lambda$ essentially regularize the process of learning the model coefficients $\beta$ from the training data by weighting the estimated probabilities. The learned $\beta$ are then used to estimate the probability of the event occurrence based on Equation (3) on the validation data and test data. Because $\lambda$ only regularizes the learning process and is not used together with $\beta$ for probability estimation, the interpretability of logistic regression is maintained.

$$
L(\beta) = \prod_{i=1}^{m} P(Y=y_i \mid X=x_i)^{\lambda_i} = \prod_{i=1}^{m} \big[\pi(\beta^T x_i)^{y_i}\,(1-\pi(\beta^T x_i))^{1-y_i}\big]^{\lambda_i}, \tag{24}
$$

where $\lambda_i = 1$ when $y_i = 0$ and $\lambda_i$ are values learned from the data when $y_i = 1$.

3.3. Comparison with other penalized log-likelihood functions

Our proposed penalized log-likelihood function is compared comprehensively with the existing log-likelihood functions in Table 2. As a linear model, it accounts for the imbalance of the data, is much less complicated in terms of time complexity and the number of estimated parameter sets than the nonlinear models with penalty weights determined by the Gaussian kernel, and has no penalty weight-related hyperparameter to tune. Its advantages are demonstrated by the experimental results in Section 4.

Table 2.

Comparison of penalized log-likelihood functions.

Equation Imbalance Linearity Computational complexity Estimated parameter sets Hyperparameters
Equation (8) No Linear $O(n)$ 1 N/A
Equation (10) Yes Linear $O(n)$ 1 $\tau$
Equation (13) Yes Nonlinear $O(n^3)$ Test size $K_w$
Equation (16) Yes Nonlinear $O(n^3)$ Test size $\tau$, $K_w$
Equation (18) Yes Linear $O(n)$ 1 N/A

4. Experiments

Both real-world and simulated datasets are collected and generated to test the performance of the various models. For the real-world data, 10 public imbalanced datasets from multiple domains are collected and used in the experimental study. The basic characteristics of all the datasets can be found in Table 3, including data source, target (i.e. dependent variable), event rate, the number of observations, the number of variables, variable types and domain area. The event rate in the real-world datasets ranges from 0.76% to 10.42%. For the simulated data, 16 datasets with 2000 observations each are generated by varying the event rate (1%, 5%, 10% and 20%) and the number of attributes (i.e. slopes: 1, 3, 4 and 8). All simulated attributes are uncorrelated and drawn from a standard normal distribution, and the target is generated with a linear relationship to the attributes, except in the three-attribute datasets, where the target is generated with a nonlinear relationship.

Table 3.

Basic characteristics of datasets.

Dataset Repository Target Event rate (%) Observations Attributes Domain
abalone_19 UCI 19 0.76 4177 7C,1N Life
arrhythmia UCI 06 5.55 452 206C, 73N Biology
ecoli UCI imU 10.42 336 7C Life
oil UCI minority 4.35 937 49C Environment
ozone_level UCI ozone day 2.86 2536 72C Environment
solar_flare_m0 UCI M-class > 0 5.00 1389 10N Nature
us_crime UCI freq > 0.65 7.69 1994 122C Social
wine_quality UCI score ≤ 4 3.70 4898 11C Business
yeast_me2 UCI ME2 3.44 1484 8C Life
yeast_ml8 LIBSVM 8 7.14 2417 103C Life
Simulated Generated y 1, 5, 10, 20 2000 1, 3, 4 and 8C None

4.1. Experimental methodology

Logistic regression models, trained by the proposed penalized log-likelihood function and the existing ones, are compared comprehensively on each dataset, as listed below. Their performance is evaluated by 100 runs of 10-fold stratified cross validation, which better reflects the model's generalization ability on new data than other validation techniques (e.g. bootstrapping). In each iteration of cross validation, the area under the ROC curve (i.e. AUROC) on the validation data is computed.

  1. Standard: The logistic regression model that is trained by the loss function in Equation (8) with no penalty weights. To fit the model, the logistic regression model function from the Scikit-Learn python package [6,35] is used because its optimizer provides the global optimal solution to Equation (8).

  2. Balanced: The logistic regression model that is trained by the loss function in Equation (10) with balanced global penalty weights, obtained by taking $\tau$ as 0.5 in Equation (11), which adjusts weights inversely proportional to class frequencies [7]. To fit the model, the logistic regression model function with the hyperparameter 'class_weight' set to 'balanced' from the Scikit-Learn python package is used because its optimizer provides the global optimal solution to Equation (10).

  3. Weighted: The logistic regression model that is trained by the loss function in Equation (10) with global penalty weights (i.e. $W_0$, $W_1$) in Equation (11) tuned based on $\tau$ from 0 to 0.5 with step size 0.01 [46]. To fit the model, the logistic regression model function with the hyperparameter 'class_weight' from the Scikit-Learn python package is used because its optimizer provides the global optimal solution to Equation (10).

  4. Kernel: The logistic regression model that is trained by the loss function in Equation (13) with local kernel penalty weights in Equation (14). This model is implemented by a custom-built function. $K_w$ is tuned as a hyperparameter from 0 to 1 with step size 0.1, capturing the nonlinearity of the model [14]. For the simulated data, $K_w$ is also tuned with 3, 10, 20 and 30 as part of the grid.

  5. Learnable: The logistic regression model that is trained by the loss function in Equation (18) with learnable local penalty weights $\lambda_i$ as decision variables. $\lambda_i$ are initialized to 1 and the learning rates (i.e. $\alpha_1$, $\alpha_2$) are tuned for each dataset. This model is implemented by a custom-built function. The learning process is terminated when the AUROC on the validation data ceases to increase, to prevent overfitting [29].

In the experiment comparing the computation time of these five models on the real-world datasets, all models are implemented with the same data structures used in the custom-built function of the learnable model, to eliminate effects caused by the different data structures used in the Scikit-Learn python package and the custom-built functions.
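The 'Balanced' baseline and the stratified cross-validation protocol above can be reproduced in outline with Scikit-Learn (the toy dataset below is illustrative only, not one of the paper's datasets, and a single 10-fold run is shown rather than the paper's 100 repetitions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced dataset with roughly a 10% event rate.
X, y = make_classification(n_samples=500, n_features=5,
                           weights=[0.9, 0.1], random_state=0)

aucs = []
for train_idx, valid_idx in StratifiedKFold(n_splits=10, shuffle=True,
                                            random_state=0).split(X, y):
    # class_weight="balanced" implements the global penalty weights
    # with tau = 0.5, as in the 'Balanced' model above.
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(X[train_idx], y[train_idx])
    scores = clf.predict_proba(X[valid_idx])[:, 1]
    aucs.append(roc_auc_score(y[valid_idx], scores))

mean_auc = float(np.mean(aucs))
```

Stratification keeps the event rate roughly constant across folds, which matters when events are rare.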

4.2. Experimental results

Models on each dataset are compared based on the statistics of the AUROCs from 100 runs of 10-fold stratified cross validation, including the 95% confidence interval, mean and standard deviation. For the real-world datasets, the training time is also computed. Table 4 lists the results from the real-world datasets, and Tables 5 and 6 list the results from the simulated datasets. The insights from the results are given below in Sections 4.2.1 and 4.2.2.

Table 4.

Real-world data results of logistic regression models.

Dataset Model 95% confidence interval Mean Std Training time per run (s)
abalone_19 Standard (0.8168, 0.8279) 0.8224 0.0889 39.1791
abalone_19 Balanced (0.8396, 0.8490) 0.8443 0.0764 45.4924
abalone_19 Weighted (0.8396, 0.8490) 0.8443 0.0764 39.3664
abalone_19 Kernel (0.7018, 0.7139) 0.7078 0.0969 1435.3485
abalone_19 Learnable (0.8595, 0.8681) 0.8638 0.0690 30.5765
arrhythmia Standard (0.8452, 0.8568) 0.8510 0.0935 19.8571
arrhythmia Balanced (0.8582, 0.8695) 0.8639 0.0908 22.4839
arrhythmia Weighted (0.8582, 0.8695) 0.8639 0.0908 22.4354
arrhythmia Kernel (0.5553, 0.5648) 0.5600 0.0771 141.7217
arrhythmia Learnable (0.8765, 0.8867) 0.8816 0.0823 15.3639
ecoli Standard (0.9161, 0.9272) 0.9216 0.0895 12.1102
ecoli Balanced (0.9077, 0.9196) 0.9136 0.0957 14.6242
ecoli Weighted (0.9196, 0.9302) 0.9249 0.0851 14.0671
ecoli Kernel (0.9405, 0.9458) 0.9431 0.0428 27.4736
ecoli Learnable (0.9404, 0.9473) 0.9439 0.0553 7.3705
oil Standard (0.9329, 0.9396) 0.9362 0.0544 19.1043
oil Balanced (0.9167, 0.9263) 0.9215 0.0771 21.9376
oil Weighted (0.9167, 0.9263) 0.9215 0.0771 21.6915
oil Kernel (0.8832, 0.8922) 0.8877 0.0718 170.3713
oil Learnable (0.9472, 0.9518) 0.9495 0.0377 13.9312
ozone_level Standard (0.8936, 0.9007) 0.8971 0.0573 34.8695
ozone_level Balanced (0.8725, 0.8813) 0.8769 0.0708 38.9631
ozone_level Weighted (0.9006, 0.9069) 0.9038 0.0509 37.2606
ozone_level Kernel (0.4838, 0.4967) 0.4903 0.1041 1635.0841
ozone_level Learnable (0.9162, 0.9221) 0.9191 0.0474 23.1777
solar_flare_m0 Standard (0.7701, 0.7818) 0.7759 0.0948 21.5702
solar_flare_m0 Balanced (0.7610, 0.7732) 0.7671 0.0983 24.9768
solar_flare_m0 Weighted (0.7669, 0.7790) 0.7730 0.0970 24.1436
solar_flare_m0 Kernel (0.7341, 0.7447) 0.7394 0.0852 293.0257
solar_flare_m0 Learnable (0.8191, 0.8282) 0.8236 0.0731 11.9129
us_crime Standard (0.9173, 0.9217) 0.9195 0.0351 41.2519
us_crime Balanced (0.9085, 0.9130) 0.9107 0.0363 51.1435
us_crime Weighted (0.9159, 0.9202) 0.9180 0.0351 37.8568
us_crime Kernel (0.7410, 0.7485) 0.7447 0.0608 966.8329
us_crime Learnable (0.9290, 0.9328) 0.9309 0.0306 16.5772
wine_quality Standard (0.7792, 0.7866) 0.7829 0.0600 52.7785
wine_quality Balanced (0.7797, 0.7870) 0.7834 0.0588 57.7737
wine_quality Weighted (0.7807, 0.7880) 0.7843 0.0593 44.4346
wine_quality Kernel (0.8592, 0.8641) 0.8617 0.0395 2013.2618
wine_quality Learnable (0.7861, 0.7933) 0.7897 0.0581 54.7964
yeast_me2 Standard (0.8679, 0.8792) 0.8736 0.0909 24.8167
yeast_me2 Balanced (0.8687, 0.8793) 0.8740 0.0851 27.2143
yeast_me2 Weighted (0.8704, 0.8816) 0.8760 0.0899 23.5347
yeast_me2 Kernel (0.8804, 0.8915) 0.8859 0.0891 259.5875
yeast_me2 Learnable (0.8937, 0.9022) 0.8979 0.0684 19.5705
yeast_ml8 Standard (0.5648, 0.5728) 0.5688 0.0649 58.8315
yeast_ml8 Balanced (0.5548, 0.5630) 0.5589 0.0657 64.3612
yeast_ml8 Weighted (0.5572, 0.5653) 0.5613 0.0656 45.2029
yeast_ml8 Kernel (0.5467, 0.5483) 0.5475 0.0123 167.4668
yeast_ml8 Learnable (0.6279, 0.6350) 0.6315 0.0571 63.2463

Table 6.

Simulation results of logistic regression models (part 2).

Dataset Model 95% confidence interval Mean Std
4 slope(s), event rate: 1% Standard (0.9683, 0.9724) 0.9704 0.0325
4 slope(s), event rate: 1% Balanced (0.9662, 0.9705) 0.9684 0.0350
4 slope(s), event rate: 1% Weighted (0.9676, 0.9719) 0.9698 0.0343
4 slope(s), event rate: 1% Kernel (0.9000, 0.9103) 0.9051 0.0830
4 slope(s), event rate: 1% Learnable (0.9757, 0.9784) 0.9771 0.0212
4 slope(s), event rate: 5% Standard (0.9482, 0.9518) 0.9500 0.0286
4 slope(s), event rate: 5% Balanced (0.9476, 0.9511) 0.9494 0.0285
4 slope(s), event rate: 5% Weighted (0.9481, 0.9517) 0.9499 0.0290
4 slope(s), event rate: 5% Kernel (0.9438, 0.9475) 0.9456 0.0299
4 slope(s), event rate: 5% Learnable (0.9516, 0.9554) 0.9535 0.0308
4 slope(s), event rate: 10% Standard (0.9327, 0.9357) 0.9342 0.0246
4 slope(s), event rate: 10% Balanced (0.9322, 0.9354) 0.9338 0.0257
4 slope(s), event rate: 10% Weighted (0.9326, 0.9357) 0.9342 0.0247
4 slope(s), event rate: 10% Kernel (0.9311, 0.9342) 0.9326 0.0256
4 slope(s), event rate: 10% Learnable (0.9354, 0.9379) 0.9367 0.0196
4 slope(s), event rate: 20% Standard (0.9399, 0.9421) 0.9410 0.0173
4 slope(s), event rate: 20% Balanced (0.9400, 0.9421) 0.9410 0.0169
4 slope(s), event rate: 20% Weighted (0.9401, 0.9422) 0.9412 0.0171
4 slope(s), event rate: 20% Kernel (0.9396, 0.9418) 0.9407 0.0176
4 slope(s), event rate: 20% Learnable (0.9411, 0.9430) 0.9420 0.0158
8 slope(s), event rate: 1% Standard (0.9976, 0.9981) 0.9978 0.0036
8 slope(s), event rate: 1% Balanced (0.9969, 0.9974) 0.9971 0.0044
8 slope(s), event rate: 1% Weighted (0.9975, 0.9980) 0.9977 0.0037
8 slope(s), event rate: 1% Kernel (0.9643, 0.9680) 0.9662 0.0294
8 slope(s), event rate: 1% Learnable (0.9970, 0.9976) 0.9973 0.0046
8 slope(s), event rate: 5% Standard (0.9918, 0.9926) 0.9922 0.0062
8 slope(s), event rate: 5% Balanced (0.9912, 0.9920) 0.9916 0.0061
8 slope(s), event rate: 5% Weighted (0.9915, 0.9923) 0.9919 0.0066
8 slope(s), event rate: 5% Kernel (0.9889, 0.9898) 0.9894 0.0079
8 slope(s), event rate: 5% Learnable (0.9919, 0.9926) 0.9922 0.0060
8 slope(s), event rate: 10% Standard (0.9899, 0.9905) 0.9902 0.0050
8 slope(s), event rate: 10% Balanced (0.9899, 0.9905) 0.9902 0.0048
8 slope(s), event rate: 10% Weighted (0.9898, 0.9905) 0.9901 0.0051
8 slope(s), event rate: 10% Kernel (0.9883, 0.9890) 0.9887 0.0058
8 slope(s), event rate: 10% Learnable (0.9906, 0.9911) 0.9908 0.0040
8 slope(s), event rate: 20% Standard (0.9861, 0.9870) 0.9866 0.0070
8 slope(s), event rate: 20% Balanced (0.9862, 0.9870) 0.9866 0.0068
8 slope(s), event rate: 20% Weighted (0.9861, 0.9870) 0.9866 0.0069
8 slope(s), event rate: 20% Kernel (0.9836, 0.9846) 0.9841 0.0081
8 slope(s), event rate: 20% Learnable (0.9858, 0.9867) 0.9863 0.0068

4.2.1. Real-world data results

  1. On 9 of the 10 datasets, the learnable models produce a higher 95% confidence interval, a higher mean and a smaller standard deviation of AUROCs than the standard, balanced and weighted models, as highlighted in bold in Table 4.

  2. Only on the dataset wine_quality does the kernel model generate a higher 95% confidence interval and higher mean of AUROCs than the other models. The reason is that the kernel model captures the nonlinear relationships in the dataset wine_quality well. The kernel model is nonlinear, with $K_w$ restricted to small values in the experiment setting to ensure the model's nonlinearity, while the other models are linear. If the relationships between the independent variables and the dependent variable are nonlinear, the kernel model captures the patterns better; otherwise, it performs worse. Take the dataset wine_quality as an example: the empirical logit plots in Figure A1 in the Appendix show that most independent variables have a nonlinear relationship with the dependent variable, which leads the kernel model to perform the best.

  3. On the datasets abalone_19, arrhythmia and oil, the balanced models are identical to the weighted models with τ=0.5.

  4. On the datasets abalone_19, arrhythmia and ozone_level, the weighted models have a higher 95% confidence interval and higher mean of AUROCs than standard models. On the datasets oil, solar_flare_m0 and yeast_ml8, the standard models have a higher 95% confidence interval and higher mean of AUROCs than the weighted models. On the datasets ecoli, us_crime, wine_quality and yeast_me2, the standard models and weighted models are similar regarding the 95% confidence interval and mean of AUROCs.

  5. On all datasets, the kernel models take the longest training time. On the datasets abalone_19, arrhythmia, ecoli, oil, ozone_level, solar_flare_m0, us_crime and yeast_me2, the learnable models take the least training time. For the datasets wine_quality and yeast_ml8, the training time of the learnable model is smaller than that of the balanced model, although it is slightly greater than that of the standard model and the weighted model.

4.2.2. Simulation results

  1. On 8 of 16 datasets, the learnable models produce a higher mean of AUROCs than other models, as shown in Tables 5 and 6.

  2. On 12 of 16 datasets, the learnable models produce a lower standard deviation of AUROCs than other models.

  3. On 8 of 16 datasets, the learnable models produce both a higher mean and lower standard deviation of AUROCs than other models.

  4. On 11 of 16 datasets, the learnable model produces a statistically significant higher 95% confidence interval of AUROCs than at least one other model, as highlighted in bold in Tables 5 and 6. The italicized models indicate the model(s) that the learnable model outperforms.

  5. The learnable model outperforms all other models on three of the simulated datasets (i.e. three slopes, event rate: 1%; four slopes, event rate: 1%; eight slopes, event rate: 10%). It is the only model to outperform all others on a given dataset.

  6. On the simulated data with three slopes, generated with a nonlinear relationship, the learnable model outperforms all models except the kernel model at all event rates. Interestingly, the learnable model also outperforms the kernel model on the three-slope data when the event rate is most extreme, at 1%. The kernel model's performance on nonlinear data is consistent with the findings in Section 4.2.1.

  7. All models perform equally well on the datasets generated with only one attribute (i.e. a single slope).
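
A plausible sketch of how a "k slope(s), event rate: p%" dataset can be generated: draw covariates, fix the slopes, and tune the intercept by bisection so the empirical event rate hits the target. The paper's exact simulation design (covariate distributions, the nonlinearity used in the three-slope case) may differ, so treat this as an illustrative assumption:

```python
import numpy as np

def simulate_logistic(n=10000, slopes=(1.0, 1.0, 1.0), event_rate=0.05, seed=0):
    """Draw covariates and binary outcomes from a logistic model, with the
    intercept tuned so the mean event probability matches `event_rate`."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, len(slopes)))
    lin = X @ np.asarray(slopes)
    # Bisection on the intercept: mean probability is increasing in b0.
    lo, hi = -20.0, 20.0
    for _ in range(60):
        b0 = (lo + hi) / 2
        if (1 / (1 + np.exp(-(b0 + lin)))).mean() > event_rate:
            hi = b0
        else:
            lo = b0
    y = (rng.random(n) < 1 / (1 + np.exp(-(b0 + lin)))).astype(int)
    return X, y
```

Datasets produced this way can then be fed through the same cross-validation protocol as the real-world data.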

Table 5.

Simulation results of logistic regression models (part 1).

Dataset Model 95% confidence interval Mean Std
1 slope(s), event rate: 1% Standard (0.6965, 0.7154) 0.7059 0.1523
1 slope(s), event rate: 1% Balanced (0.6983, 0.7175) 0.7079 0.1548
1 slope(s), event rate: 1% Weighted (0.6980, 0.7169) 0.7075 0.1523
1 slope(s), event rate: 1% Kernel (0.6996, 0.7180) 0.7088 0.1481
1 slope(s), event rate: 1% Learnable (0.6954, 0.7166) 0.7060 0.1717
1 slope(s), event rate: 5% Standard (0.7270, 0.7354) 0.7312 0.0679
1 slope(s), event rate: 5% Balanced (0.7266, 0.7353) 0.7309 0.0701
1 slope(s), event rate: 5% Weighted (0.7266, 0.7352) 0.7309 0.0696
1 slope(s), event rate: 5% Kernel (0.7269, 0.7353) 0.7311 0.0678
1 slope(s), event rate: 5% Learnable (0.7268, 0.7340) 0.7304 0.0582
1 slope(s), event rate: 10% Standard (0.7252, 0.7321) 0.7286 0.0551
1 slope(s), event rate: 10% Balanced (0.7255, 0.7320) 0.7287 0.0524
1 slope(s), event rate: 10% Weighted (0.7255, 0.7318) 0.7286 0.0511
1 slope(s), event rate: 10% Kernel (0.7255, 0.7319) 0.7287 0.0518
1 slope(s), event rate: 10% Learnable (0.7266, 0.7318) 0.7292 0.0423
1 slope(s), event rate: 20% Standard (0.7392, 0.7442) 0.7417 0.0409
1 slope(s), event rate: 20% Balanced (0.7389, 0.7441) 0.7415 0.0422
1 slope(s), event rate: 20% Weighted (0.7391, 0.7444) 0.7418 0.0422
1 slope(s), event rate: 20% Kernel (0.7392, 0.7442) 0.7417 0.0400
1 slope(s), event rate: 20% Learnable (0.7392, 0.7441) 0.7417 0.0391
3 slope(s), event rate: 1% Standard (0.5233, 0.5510) 0.5371 0.2239
3 slope(s), event rate: 1% Balanced (0.5033, 0.5311) 0.5172 0.2241
3 slope(s), event rate: 1% Weighted (0.5225, 0.5504) 0.5364 0.2247
3 slope(s), event rate: 1% Kernel (0.6622, 0.6820) 0.6721 0.1593
3 slope(s), event rate: 1% Learnable (0.6908, 0.7065) 0.6987 0.1265
3 slope(s), event rate: 5% Standard (0.5860, 0.5979) 0.5920 0.0963
3 slope(s), event rate: 5% Balanced (0.5811, 0.5929) 0.5870 0.0952
3 slope(s), event rate: 5% Weighted (0.5873, 0.5993) 0.5933 0.0968
3 slope(s), event rate: 5% Kernel (0.6234, 0.6326) 0.6280 0.0736
3 slope(s), event rate: 5% Learnable (0.6106, 0.6199) 0.6153 0.0750
3 slope(s), event rate: 10% Standard (0.5266, 0.5339) 0.5303 0.0588
3 slope(s), event rate: 10% Balanced (0.5288, 0.5360) 0.5324 0.0585
3 slope(s), event rate: 10% Weighted (0.5281, 0.5356) 0.5319 0.0603
3 slope(s), event rate: 10% Kernel (0.5921, 0.5999) 0.5960 0.0624
3 slope(s), event rate: 10% Learnable (0.5645, 0.5704) 0.5675 0.0476
3 slope(s), event rate: 20% Standard (0.5048, 0.5107) 0.5077 0.0477
3 slope(s), event rate: 20% Balanced (0.5008, 0.5066) 0.5037 0.0476
3 slope(s), event rate: 20% Weighted (0.5054, 0.5113) 0.5084 0.0478
3 slope(s), event rate: 20% Kernel (0.5981, 0.6038) 0.6009 0.0463
3 slope(s), event rate: 20% Learnable (0.5517, 0.5561) 0.5539 0.0352

In summary, compared with the balanced and weighted models, the learnable models perform better based on AUROCs as well as the training time in most cases on the real-world datasets. Compared with the standard models, the learnable models perform better based on AUROCs on the real-world datasets, but their training time performance depends on the datasets. For the simulated datasets, the learnable model consistently has the highest AUROC mean and it is the only model to significantly outperform all other models for a given dataset. Compared with the kernel models, the learnable models perform better based on training time in all cases as well as AUROCs in linear cases on the real-world data. For simulated data, the learnable model outperforms the kernel model for all linear cases and the non-linear case when the event rate is 1%. The higher performance of the kernel model for most non-linear cases is consistent with the findings from the real-world datasets. Experimental results are consistent with characteristics of penalized log-likelihood functions summarized in Table 2.

5. Estimated probability distribution and additional performance analysis of resulting models

To further analyze the models in respect to the estimated probability distribution and additional performance measurements including type I error, type II error, and accuracy, a more detailed study is conducted on the dataset GiveMeSomeCredit from Kaggle. The problem is to predict whether a client will experience financial distress in the next two years or not using their biographical and financial information (e.g. monthly income, number of dependents, number of open credit lines and loans, revolving utilization of unsecured lines, number of time 30–59 days past due not worse, number of time 60–89 days past due not worse and number of times 90 days late) [24]. The proportion of delinquency observations is 6.68%.

The detailed information, exploratory analysis and basic data processing can be found in our previously published study [46].

5.1. Estimated probability distribution

Four models are built (standard, balanced, weighted and learnable), as described in Section 4.1. The weighted model is omitted from the following comparison because it is identical to the balanced model on this dataset. The estimated probability distributions of the models on the test data can be found in Figure 2(a,c,e), respectively, where the probabilities estimated for true non-event observations (i.e. Class 0) can be found under "Predicted Probability Distribution of Class 0" and the probabilities estimated for true event observations (i.e. Class 1) can be found under "Predicted Probability Distribution of Class 1". We have the following insights:

  1. For the standard model, the estimated probabilities for most non-event observations fall in the range [0,0.2], which is very similar to the range for most event observations; the two distributions largely overlap.

  2. For the balanced model, the estimated probabilities for most non-event observations fall in the range [0.1,0.9], while the estimated probabilities for most event observations fall in the range [0.2,1], shifted slightly towards 1 compared with the non-event observations.

  3. For the learnable model, the estimated probabilities for most non-event observations fall in the range [0.9,0.97], while the estimated probabilities for most event observations fall in the range [0.95,1], shifted much further towards 1 compared with the non-event observations.
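
Per-class probability summaries of this kind can be reproduced in outline as follows. The synthetic dataset and plain logistic model are stand-ins for illustration, not GiveMeSomeCredit:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Summarize where the predicted probabilities of true non-events
# (class 0) and true events (class 1) fall on held-out data.
X, y = make_classification(n_samples=5000, weights=[0.93], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)
p = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
for cls in (0, 1):
    lo, hi = np.quantile(p[y_te == cls], [0.05, 0.95])
    print(f"class {cls}: 90% of predicted probabilities in [{lo:.3f}, {hi:.3f}]")
```

Histograms of `p` split by true class correspond to the panels of Figure 2(a,c,e).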

Figure 2.

Predicted probabilities and probability cutoff on test data. (a) Standard: predicted probability; (b) standard: probability cutoff; (c) balanced: predicted probability; (d) balanced: probability cutoff; (e) learnable: predicted probability; and (f) learnable: probability cutoff.

It is reasonable and expected that the estimated probabilities from the learnable model are shifted towards 1 overall compared with the standard and balanced models, because larger penalty weights are applied to misclassifications of event observations during the training phase. The shift can be effectively addressed by choosing an appropriate probability cutoff, as illustrated in Section 5.2. The results suggest that the learnable model differentiates true events from true non-events better than the standard model and the balanced model.

5.2. Performance measures under probability cutoff

The probability cutoff, used to transform a probability into a binary decision, is set to the probability at the intersection of the sensitivity and specificity plots [18,37] in order to balance type I error and type II error. The probability cutoffs for the three models (i.e. standard, balanced, learnable) on the test data are 0.0675, 0.4516 and 0.9749, as shown in Figure 2(b,d,f), respectively. Under the corresponding probability cutoffs, the additional performance measurements, including type I error, type II error and accuracy, as well as AUROC, are computed and reported in Table 7. We have the following insights:

  1. Compared with the standard model, the learnable model decreases type I error by 9.01%, decreases type II error by 8.89%, increases accuracy by 9.01% and increases AUROC by 0.0899 on the test data.

  2. Compared with the balanced model, the learnable model decreases type I error by 1.78%, decreases type II error by 1.80%, increases accuracy by 1.79% and increases AUROC by 0.0179 on the test data.
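
The cutoff rule used here (the probability at which sensitivity and specificity intersect) can be sketched with `sklearn.metrics.roc_curve`:

```python
import numpy as np
from sklearn.metrics import roc_curve

def balanced_cutoff(y_true, y_score):
    """Probability cutoff at the intersection of the sensitivity and
    specificity curves, i.e. the ROC threshold minimizing |TPR - (1 - FPR)|."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    i = np.argmin(np.abs(tpr - (1 - fpr)))
    return thresholds[i]
```

Applied to each model's predicted probabilities, this is how cutoffs as different as 0.0675 (standard) and 0.9749 (learnable) arise from the same rule: the rule follows the model's probability scale rather than a fixed 0.5.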

Table 7.

Performance measures on validation data and test data.

Dataset Model Type I error (%) Type II error (%) Accuracy AUROC Probability cutoff
Validation data Standard 36.00 36.03 64.00% 0.6398 0.0672
  Balanced 27.14 27.13 72.86% 0.7287 0.4572
  Learnable 26.12 25.93 73.89% 0.7397 0.9751
Test data Standard 35.75 35.65 64.26% 0.6430 0.0675
  Balanced 28.52 28.47 71.48% 0.7150 0.4516
  Learnable 26.74 26.67 73.27% 0.7329 0.9749

The learnable model performs the best according to all these performance measures on both the validation data and the test data, followed by the balanced model and the standard model.
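
The measures in Table 7 follow directly from the confusion matrix at a given cutoff. A minimal sketch, reading type I error as the false positive rate and type II error as the false negative rate (consistent with the near-equal error pairs in the table):

```python
import numpy as np

def error_rates(y_true, y_score, cutoff):
    """Type I error (non-events flagged as events), type II error
    (events missed) and accuracy at a given probability cutoff."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) >= cutoff).astype(int)
    type1 = (y_pred[y_true == 0] == 1).mean()
    type2 = (y_pred[y_true == 1] == 0).mean()
    accuracy = (y_pred == y_true).mean()
    return type1, type2, accuracy
```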

5.3. Estimated model coefficients

The estimated coefficients of the three models (i.e. standard, balanced, learnable) in Section 5.2 are examined. As shown in Table 8, the models estimate different values for each independent variable, and they disagree on the sign of numberoftime6089dayspastduenotworse, which is negative in the standard model and positive in the balanced and learnable models. Its empirical logit plot in Figure 3 shows a positive relationship. Based on its variance inflation factor, multicollinearity exists with the variables numberoftime3059dayspastduenotworse and numberoftimes90dayslate, which causes the sign change [46]. Because all of their information values are above 0.1, none of them should be dropped. Despite the multicollinearity, the balanced model and the learnable model generate a positive estimate that is consistent with the univariate effect.

Table 8.

Estimated model coefficients.

Variable Standard model Balanced model Learnable model
intercept −2.77 −0.19 3.62
age −0.03 −0.02 −0.02
Revolving utilization of unsecured lines −0.40 −0.40 −0.22
numberoftime3059dayspastduenotworse 1.77 2.05 1.34
numberoftimes90dayslate 1.58 2.30 1.86
numberoftime6089dayspastduenotworse −3.19 0.87 1.19

Figure 3.

Empirical logit plot of numberoftime6089dayspastduenotworse.
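
An empirical logit plot like Figure 3 can be computed by binning the predictor and taking the continuity-corrected log-odds of the event per bin. A minimal sketch (the bin count and continuity correction `eps` are illustrative choices, not values taken from the paper):

```python
import numpy as np
import pandas as pd

def empirical_logit(x, y, n_bins=10, eps=0.5):
    """Continuity-corrected empirical log-odds of the event per quantile
    bin of a predictor; plotting `logit` against `centers` gives the
    empirical logit plot."""
    df = pd.DataFrame({"x": np.asarray(x, float), "y": np.asarray(y, int)})
    df["bin"] = pd.qcut(df["x"], q=n_bins, duplicates="drop")
    g = df.groupby("bin", observed=True)
    centers = g["x"].mean().to_numpy()
    events = g["y"].sum().to_numpy()
    totals = g["y"].count().to_numpy()
    logit = np.log((events + eps) / (totals - events + eps))
    return centers, logit
```

A roughly monotone increasing pattern of points, as here, supports a positive univariate effect.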

5.4. Discussions

One potential issue of this method is overfitting, because local penalty weights are learned from the training data along with the model coefficients. This is a common issue for cost-sensitive learning methods [29], as more parameters are determined from the training data. The strategy we adopt to prevent overfitting is to stop the training process when the AUROC on the validation data ceases to increase, rather than at convergence, although the gradient descent-based algorithm is guaranteed to converge. As shown by the performance measurements on the validation and test data of the credit dataset in Table 7, the generalization ability of the logistic regression model trained with the proposed log-likelihood function is good.

6. Conclusions

To improve classification on imbalanced data, we propose a novel penalized log-likelihood function that includes penalty weights as decision variables for event observations and learns them from data along with the model coefficients. Its advantages in discrimination ability and computational efficiency are demonstrated by training logistic regression models with it and comparing them against logistic regression models trained with other penalized log-likelihood functions on 10 public datasets from multiple domains. Performance is evaluated by the statistics (i.e. 95% confidence interval, mean, standard deviation) of AUROCs over 100 runs of 10-fold stratified cross-validation, as well as by training time. A detailed study is also conducted on an imbalanced credit dataset to examine the distributions of estimated probabilities, the additional performance measurements under the chosen probability cutoff (i.e. type I error, type II error, accuracy) and the estimated model coefficients.

7. Future work

In the future, to further improve the proposed algorithm, we would like to incorporate intelligent tuning of the learning rate into the modified gradient descent algorithm and to use simulated data to study more properties of the estimated parameters. In practice, we plan to apply the proposed penalized log-likelihood function to improve neural networks and deep learning models on classifying imbalanced data, and to extend the method to multi-class classification problems.

Appendix. Empirical logit plots of the dataset wine_quality.

Figure A1.

Empirical logit plots of the dataset wine_quality. (a) var0, (b) var1, (c) var2, (d) var3, (e) var4, (f) var5, (g) var6, (h) var7, (i) var8 and (j) var9.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • 1. Arora S., Cohen N., Golowich N., and Hu W., A convergence analysis of gradient descent for deep linear neural networks, preprint (2018). arXiv:1810.02281.
  • 2. Bahnsen A.C., Aouada D., and Ottersten B., Example-dependent cost-sensitive logistic regression for credit scoring, in 2014 13th International Conference on Machine Learning and Applications (ICMLA), IEEE, Detroit, MI, 2014, pp. 263–269.
  • 3. Baird III L.C. and Moore A.W., Gradient descent for general reinforcement learning, in Advances in Neural Information Processing Systems, MIT Press, Cambridge, MA, 1999, pp. 968–974.
  • 4. Barandela R., Valdovinos R.M., Sánchez J.S., and Ferri F.J., The imbalanced training sample problem: under or over sampling? in Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), Springer, Lisbon, 2004, pp. 806–814.
  • 5. Brown I. and Mues C., An experimental comparison of classification algorithms for imbalanced credit scoring data sets, Expert Syst. Appl. 39 (2012), pp. 3446–3453.
  • 6. Buitinck L., Louppe G., Blondel M., Pedregosa F., Mueller A., Grisel O., Niculae V., Prettenhofer P., Gramfort A., Grobler J., Layton R., VanderPlas J., Joly A., Holt B., and Varoquaux G., API design for machine learning software: experiences from the scikit-learn project, in ECML PKDD Workshop: Languages for Data Mining and Machine Learning, Prague, 2013, pp. 108–122.
  • 7. Buitinck L., Louppe G., Blondel M., Pedregosa F., Mueller A., Grisel O., Niculae V., Prettenhofer P., Gramfort A., Grobler J., Layton R., VanderPlas J., Joly A., Holt B., and Varoquaux G., Scikit-learn documentation on logistic regression (2018). Available at https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html. Accessed 2018-09-30.
  • 8. Canu S. and Smola A., Kernel methods and the exponential family, Neurocomputing 69 (2006), pp. 714–720.
  • 9. Chizat L. and Bach F., On the global convergence of gradient descent for over-parameterized models using optimal transport, in Advances in Neural Information Processing Systems, Montreal, 2018, pp. 3036–3046.
  • 10. Collell G., Prelec D., and Patil K.R., A simple plug-in bagging ensemble based on threshold-moving for classifying binary and multiclass imbalanced data, Neurocomputing 275 (2018), pp. 330–340.
  • 11. Cox D.R., The regression analysis of binary sequences, J. R. Stat. Soc. Ser. B (Methodol.) 20 (1958), pp. 215–242.
  • 12. Cox D.R., Some procedures associated with the logistic qualitative response curve, in Research Papers in Statistics: Festschrift for J. Neyman, John Wiley & Sons, London, 1966, pp. 55–71.
  • 13. Demir G., Aytekin M., and Akgun A., Landslide susceptibility mapping by frequency ratio and logistic regression methods: an example from Niksar–Resadiye (Tokat, Turkey), Arabian J. Geosci. 8 (2015), pp. 1801–1812.
  • 14. Deng K., Omega: on-line memory-based general purpose system classifier, Ph.D. diss., Carnegie Mellon University, 1998.
  • 15. Ding J. and Xiong W., A new estimator for a population proportion using group testing, Commun. Stat. Simul. Comput. 45 (2016), pp. 101–114.
  • 16. Drummond C. and Holte R.C., C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling, in Workshop on Learning from Imbalanced Datasets II, Vol. 11, Citeseer, Washington, DC, 2003, pp. 1–8.
  • 17. Reza Z., Streaming stochastic gradient descent for generalized linear models (2015). Available at https://stanford.edu/∼rezab/classes/cme323/S15/notes/lec11.pdf. Accessed 2018-04-02.
  • 18. Habibzadeh F., Habibzadeh P., and Yadollahie M., On determining the most appropriate test cut-off value: the case of tests with continuous results, Biochem. Med. 26 (2016), pp. 297–307.
  • 19. Han H., Wang W.Y., and Mao B.H., Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning, in International Conference on Intelligent Computing, Springer, Hefei, 2005, pp. 878–887.
  • 20. He H. and Garcia E.A., Learning from imbalanced data, IEEE Trans. Knowl. Data Eng. 21 (2008), pp. 1263–1284.
  • 21. He H. and Ma Y., Imbalanced Learning: Foundations, Algorithms, and Applications, John Wiley & Sons, Hoboken, 2013.
  • 22. Hong C., Ghosh R., and Srinivasan S., Dealing with class imbalance using thresholding, preprint (2016). arXiv:1607.02705.
  • 23. Hosmer Jr D.W., Lemeshow S., and Sturdivant R.X., Applied Logistic Regression, Vol. 398, John Wiley & Sons, Hoboken, 2013.
  • 24. Kaggle, Give me some credit (2011). Available at https://www.kaggle.com/c/GiveMeSomeCredit/data. Accessed 2018-02-01.
  • 25. Karsmakers P., Pelckmans K., and Suykens J.A., Multi-class kernel logistic regression: a fixed-size implementation, in 2007 International Joint Conference on Neural Networks (IJCNN), IEEE, Orlando, 2007, pp. 1756–1761.
  • 26. King G. and Zeng L., Logistic regression in rare events data, Polit. Anal. 9 (2001), pp. 137–163.
  • 27. Krawczyk B., Galar M., Jeleń Ł., and Herrera F., Evolutionary undersampling boosting for imbalanced classification of breast cancer malignancy, Appl. Soft Comput. 38 (2016), pp. 714–726.
  • 28. Kubat M., Holte R.C., and Matwin S., Machine learning for the detection of oil spills in satellite radar images, Mach. Learn. 30 (1998), pp. 195–215.
  • 29. Kukar M. and Kononenko I., Cost-sensitive learning with neural networks, in ECAI, Brighton, 1998, pp. 445–449.
  • 30. Laitinen E.K. and Laitinen T., Bankruptcy prediction: application of the Taylor's expansion in logistic regression, Int. Rev. Financial Anal. 9 (2000), pp. 327–349.
  • 31. Maalouf M. and Trafalis T.B., Robust weighted kernel logistic regression in imbalanced and rare events data, Comput. Stat. Data Anal. 55 (2011), pp. 168–183.
  • 32. Maalouf M., Trafalis T.B., and Adrianto I., Kernel logistic regression using truncated Newton method, Comput. Manage. Sci. 8 (2011), pp. 415–428.
  • 33. Moayedikia A., Ong K.L., Boo Y.L., Yeoh W.G., and Jensen R., Feature selection for high dimensional imbalanced class data using harmony search, Eng. Appl. Artif. Intell. 57 (2017), pp. 38–49.
  • 34. Özöür-akyüz S., Ünay D., and Smola A., Guest editorial: model selection and optimization in machine learning, Mach. Learn. 85 (2011), p. 1.
  • 35. Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M., and Duchesnay E., Scikit-learn: machine learning in Python, J. Mach. Learn. Res. 12 (2011), pp. 2825–2830.
  • 36. Phua C., Alahakoon D., and Lee V., Minority report in fraud detection: classification of skewed data, ACM SIGKDD Explor. Newslett. 6 (2004), pp. 50–59.
  • 37. Pramono L.A., Setiati S., Soewondo P., Subekti I., Adisasmita A., Kodim N., and Sutrisna B., Prevalence and predictors of undiagnosed diabetes mellitus in Indonesia, Age 46 (2010), pp. 100–100.
  • 38. Provost F., Machine learning from imbalanced data sets 101, in Proceedings of the AAAI'2000 Workshop on Imbalanced Data Sets, Austin, 2000, pp. 1–3.
  • 39. Shu B., Zhang H., Li Y., Qu Y., and Chen L., Spatiotemporal variation analysis of driving forces of urban land spatial expansion using logistic regression: a case study of port towns in Taicang City, China, Habitat Int. 43 (2014), pp. 181–190.
  • 40. Springer D.B., Tarassenko L., and Clifford G.D., Logistic regression-HSMM-based heart sound segmentation, IEEE Trans. Biomed. Eng. 63 (2016), pp. 822–832.
  • 41. Sra S., Nowozin S., and Wright S.J., Optimization for Machine Learning, MIT Press, Cambridge, 2012.
  • 42. Wahba G., Gu C., Wang Y., and Campbell R., Soft classification, a.k.a. risk estimation, via penalized log likelihood and smoothing spline analysis of variance, in The Mathematics of Generalization, CRC Press, Boca Raton, 2018, pp. 331–359.
  • 43. Walker S.H. and Duncan D.B., Estimation of the probability of an event as a function of several independent variables, Biometrika 54 (1967), pp. 167–179.
  • 44. Weiss G.M. and Provost F., Learning when training data are costly: the effect of class distribution on tree induction, J. Artif. Intell. Res. 19 (2003), pp. 315–354.
  • 45. Zhang L., Priestley J., and Ni X., Influence of the event rate on discrimination abilities of bankruptcy prediction models, Int. J. Database Manage. Syst. 10 (2018), pp. 1–14.
  • 46. Zhang L., Ray H., Priestley J., and Tan S., A descriptive study of variable discretization and cost-sensitive logistic regression on imbalanced credit data, J. Appl. Stat. 47 (2019), pp. 568–581.
  • 47. Zheng S., Strasser S., Holt N., Quinn M., Liu Y., and Morrell C., Stratified multilevel logistic regression modeling for risk factors of adolescent obesity in Tennessee, Int. J. High Risk Behav. Addict. 7 (2018), p. e58597.
  • 48. Zheng Z., Wu X., and Srihari R., Feature selection for text categorization on imbalanced data, ACM SIGKDD Explor. Newslett. 6 (2004), pp. 80–89.
  • 49. Zhu J. and Hastie T., Kernel logistic regression and the import vector machine, J. Comput. Graph. Stat. 14 (2005), pp. 185–205.
