Journal of Applied Statistics
2021 Jun 16;49(13):3257–3277. doi: 10.1080/02664763.2021.1939662

Improving logistic regression on the imbalanced data by a novel penalized log-likelihood function

Lili Zhang a, Trent Geisler a, Herman Ray b (corresponding author), Ying Xie c
PMCID: PMC9542776  PMID: 36213775

Abstract

Logistic regression is estimated by maximizing the log-likelihood objective function, which is formulated under the assumption of maximizing the overall accuracy. That assumption does not hold for imbalanced data, and the resulting models tend to be biased towards the majority class (i.e. non-event), which can bring great loss in practice. One strategy for mitigating such bias is to penalize the misclassification costs of observations differently in the log-likelihood function. Existing solutions require either difficult hyperparameter estimation or high computational complexity. We propose a novel penalized log-likelihood function that includes penalty weights as decision variables for observations in the minority class (i.e. event) and learns them from the data along with the model coefficients. In the experiments, the proposed logistic regression model is compared with the existing ones on the statistics of the area under the receiver operating characteristic (ROC) curve from 10 public datasets and 16 simulated datasets, as well as on the training time. A detailed analysis is conducted on an imbalanced credit dataset to examine the estimated probability distributions, additional performance measurements (i.e. type I error and type II error) and model coefficients. The results demonstrate that both the discrimination ability and the computation efficiency of logistic regression models are improved by using the proposed log-likelihood function as the learning objective.

Keywords: Logistic regression, binary classification, imbalanced data, maximum likelihood, penalized log-likelihood function, cost-sensitive

1. Introduction

The imbalanced data present a big challenge in the data-driven world. The minority class (i.e. event) is usually the class of interest and more costly if misclassified, such as fraud in the fraud detection problem [36], malignancy in the breast cancer diagnosis problem [27] and delinquency in the credit scoring problem [5]. By definition, imbalanced data are considered by most practitioners to be data where the number of observations labeled with the majority class (i.e. non-event) is at least twice that of the minority class [21].

The challenge is that most statistical and machine learning methods are biased towards the majority class and cannot predict the minority class accurately, including the logistic regression model, which has been favored for its high interpretability. The bias can bring great loss (money, reputation, etc.). It is caused by the underlying assumption of the optimization objective (e.g. the log-likelihood function), which is to maximize the overall accuracy [21,38]. However, the overall accuracy is not a valid performance measurement for the classification of imbalanced data [28].

Because of the significant and broad applications related to the imbalanced data, researchers have made efforts over the past decades to improve solutions at the levels of input data (e.g. oversampling, undersampling) [4,16,19], features (e.g. feature selection, variable discretization) [33,46,48], algorithms (e.g. cost-sensitive learning, ensembles) [2,10] and output (e.g. thresholding) [22].

In the present work, we focus on improving logistic regression on the imbalanced data from the perspective of cost-sensitive learning, considering that interpretability is often required for prescriptive actions. Logistic regression has been widely used for decision-making systems since its development by Cox and Duncan independently in the 1950s and 1960s for binary classification problems [11,12,43]. It is a linear classification model whose optimal coefficients are estimated by maximizing the log-likelihood function [34]. The advantages of logistic regression are multifold, including high interpretability and low time complexity [23]. The applications of logistic regression cover broad areas, such as bankruptcy prediction [30], credit scoring [2], heart sound segmentation [40], landslide susceptibility prediction [13], urban land spatial expansion analysis [39] and adolescent obesity risk [47].

To mitigate the bias on the imbalanced data, one strategy is to penalize the misclassification costs of observations differently in the log-likelihood objective function that is used to train the logistic regression model for optimal coefficients [42]. However, the existing solutions require either difficult hyperparameter estimation or very high time complexity [14,32]. In the present work, we propose a novel penalized log-likelihood objective function by including penalty weights as decision variables for event observations and learning them from the data along with the model coefficients via the gradient descent method. By using the proposed log-likelihood function as the optimization objective to train logistic regression models, both discrimination ability and computation efficiency are improved, based on our experimental results.

The paper is structured in the following way. In Section 2, the existing penalized log-likelihood functions for the imbalanced data in the literature are reviewed. Section 3 describes the proposed log-likelihood function and how it can be solved. Section 4 illustrates the experiments and results. In Section 5, the estimated probability distribution and the estimated coefficients of resulting models are examined on a credit dataset as a case study. In Section 6, the conclusions are presented. In Section 7, the future work is discussed.

2. Related work

Logistic regression is a linear model for binary classification problems. In logistic regression, the values of the input independent variables (i.e. $x_{i0}, \dots, x_{in}$) are linearly combined, as defined in Equation (2), and then transformed by a sigmoid function, as defined in Equation (1), as shown in Figure 1. The notations can be found in Table 1.

Table 1.

Notations.

Notation Meaning
$m$ Total number of observations in the training data
$n$ Total number of independent variables
$i$ Index of observations, $i = 1, \dots, m$
$j$ Index of independent variables, $j = 0, \dots, n$
$x_{ij}$ Value of the $j$th independent variable in the $i$th observation
$x_i$ Vector of values of independent variables in the $i$th observation
$y_i$ True class label of the $i$th observation
$\beta_j$ Estimated coefficient of the $j$th independent variable
$\beta$ Vector of estimated coefficients of independent variables
$h_i$ Model output for the $i$th observation
$\hat{y}_i$ Estimated class label for the $i$th observation
$$
h_i = \pi(\beta^T x_i) = \frac{1}{1 + e^{-\beta^T x_i}}, \tag{1}
$$

where

$$
\beta^T x_i = \sum_{j=0}^{n} \beta_j x_{ij} = \beta_0 x_{i0} + \beta_1 x_{i1} + \cdots + \beta_n x_{in} \tag{2}
$$

with $x_{i0} = 1$, which makes $\beta_0$ the intercept.
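As a concrete illustration, Equations (1) and (2) can be sketched in a few lines of NumPy. The function names `sigmoid` and `predict_proba` are ours, not from the paper:

```python
import numpy as np

def sigmoid(z):
    # Equation (1): the logistic (sigmoid) function.
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(beta, X):
    # Equation (2): X includes a leading column of 1s so that
    # beta[0] acts as the intercept (x_i0 = 1).
    return sigmoid(X @ beta)
```

Because the sigmoid maps any real input into (0, 1), the output can be read directly as an estimated event probability.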

Figure 1.

Logistic regression.

The sigmoid function in Equation (1) restricts the model output to between 0 and 1. The model output is interpreted as the estimated probability of the event occurrence, considering that the event of interest (e.g. fraud, delinquency, failure, malignancy) is always coded as 1 while the non-event (e.g. non-fraud, non-delinquency, pass, benign) is always coded as 0 [23,31]. Taking the ith observation as an example, the probability of the event occurrence is estimated by Equation (3), and correspondingly the probability of the non-event occurrence is estimated by Equation (4). Mathematically, these two equations can be equivalently combined into the single Equation (5).

$$
P(Y=1 \mid X=x_i) = \pi(\beta^T x_i), \tag{3}
$$
$$
P(Y=0 \mid X=x_i) = 1 - \pi(\beta^T x_i), \tag{4}
$$
$$
P(Y=y_i \mid X=x_i) = \pi(\beta^T x_i)^{y_i}\,(1 - \pi(\beta^T x_i))^{1-y_i}. \tag{5}
$$

Assuming that all observations are independent, the overall likelihood can be expressed by the likelihood function in Equation (6), which is the product of the individual likelihoods over the training data. The problem is to identify the model coefficients $\beta$ that maximize the overall likelihood. To improve the computation efficiency, the likelihood function is transformed into its log form in Equation (7), called the log-likelihood function. To solve this unconstrained optimization problem, the most commonly used algorithm is the gradient descent algorithm [41], in which the partial derivatives are first computed.

$$
L(\beta) = \prod_{i=1}^{m} P(Y=y_i \mid X=x_i) = \prod_{i=1}^{m} \big[\pi(\beta^T x_i)^{y_i}\,(1-\pi(\beta^T x_i))^{1-y_i}\big], \tag{6}
$$
$$
LL(\beta) = \sum_{i=1}^{m} \big[y_i \log(\pi(\beta^T x_i)) + (1-y_i)\log(1-\pi(\beta^T x_i))\big]. \tag{7}
$$

Maximizing the log-likelihood in Equation (7) is equivalent to minimizing the negative log-likelihood in Equation (8), which is referred to as the loss function or cost function of logistic regression. The time complexity for solving Equation (8) by the gradient descent algorithm is $O(n)$ [17].

The loss function in Equation (8) can be interpreted in two parts. The first part, $-y_i\log(\pi(\beta^T x_i))$, is the misclassification cost for event observations (i.e. $y_i = 1$), while the second part, $-(1-y_i)\log(1-\pi(\beta^T x_i))$, is the misclassification cost for non-event observations (i.e. $y_i = 0$), as shown in Equation (9). By assuming that the numbers of events and non-events are equal and that the misclassification costs of events and non-events are equal, this objective function essentially maximizes the overall accuracy.

$$
\min_{\beta} \; -\sum_{i=1}^{m} \big[y_i \log(\pi(\beta^T x_i)) + (1-y_i)\log(1-\pi(\beta^T x_i))\big], \tag{8}
$$
$$
\mathrm{cost}_i = \begin{cases} -y_i \log(\pi(\beta^T x_i)) & \text{if } y_i = 1, \\ -(1-y_i)\log(1-\pi(\beta^T x_i)) & \text{if } y_i = 0. \end{cases} \tag{9}
$$
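The standard loss in Equation (8) is straightforward to evaluate numerically. The sketch below (the function name and the `eps` clipping guard are our choices, not from the paper) computes the negative log-likelihood for a coefficient vector:

```python
import numpy as np

def neg_log_likelihood(beta, X, y, eps=1e-12):
    # Equation (8): negative log-likelihood (loss) of logistic regression.
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    p = np.clip(p, eps, 1.0 - eps)  # guard against log(0)
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```

With beta = 0 every predicted probability is 0.5, so the loss equals m·log 2; any coefficient vector that separates the classes better lowers it.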

However, as Kubat et al. pointed out, the overall accuracy is not a valid and effective performance measurement for the imbalanced data [28]. In the imbalanced data, the number of observations in the majority class (i.e. non-events) is usually at least twice that of the minority class (i.e. events) [21]. By maximizing the overall accuracy, logistic regression tends to be biased towards the majority class and severely misclassifies events as non-events [21,38,44]. For example, in an empirical study on the influence of the event rate on the discrimination ability of bankruptcy prediction models, when the event rate (i.e. the proportion of bankruptcy observations) is 0.12%, the accuracy of the logistic regression model is 99.41%, but its type II error is 95.01%, which indicates that 95.01% of bankruptcy observations are misclassified as non-bankruptcy [45]. This bias can bring great loss in practice, for example, when banks approve loans to organizations with a predicted low but truly high bankruptcy probability. To appropriately measure model performance on the imbalanced data, researchers have suggested providing a comprehensive assessment with both curve-based measurements (e.g. ROC, precision–recall curve) and point-value measurements (e.g. type I error, type II error, F-measure, G-mean) [20,28].

To apply logistic regression to the imbalanced data (i.e. rare event data), King and Zeng penalized the misclassification costs of events and non-events differently by penalty weights $W_1$ and $W_0$ in the log-likelihood function [26], as shown in Equation (10). Penalty weights $W_1$ and $W_0$ are determined by the population proportion of events $\tau$ and the sample proportion of events $\bar{y}$, defined in Equation (11). $W_1$ is the penalty weight for all event observations, while $W_0$ is the penalty weight for all non-event observations. Because they are invariant to the values of the independent variables, they are referred to as global penalty weights in this research context. Because $W_0$ and $W_1$ are pre-defined and plugged into the log-likelihood function as constants, the resulting loss function in Equation (10) can be solved in the same time complexity $O(n)$ as the standard log-likelihood function in Equation (8). The misclassification costs associated with this loss function can be found in Equation (12). One challenge in this method is that it is hard to estimate the population proportion of events $\tau$ accurately [15], which ultimately influences the performance of logistic regression driven by the global penalty weights $W_0$ and $W_1$, as found in an empirical study [46].

$$
\min_{\beta} \; -W_1 \sum_{i=1}^{m} y_i \log(\pi(\beta^T x_i)) - W_0 \sum_{i=1}^{m} (1-y_i)\log(1-\pi(\beta^T x_i)), \tag{10}
$$

where

$$
W_1 = \frac{\tau}{\bar{y}} \quad \text{and} \quad W_0 = \frac{1-\tau}{1-\bar{y}} \tag{11}
$$

with $\tau$ denoting the population fraction of events induced by choice-based sampling and $\bar{y}$ denoting the sample proportion of events.

$$
\mathrm{cost}_i = \begin{cases} -W_1 y_i \log(\pi(\beta^T x_i)) & \text{if } y_i = 1, \\ -W_0 (1-y_i)\log(1-\pi(\beta^T x_i)) & \text{if } y_i = 0. \end{cases} \tag{12}
$$
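Equation (11) is simple enough to sketch directly (the function name `global_weights` is ours). Note that with a balanced target $\tau = 0.5$ and a rare sample event rate $\bar{y}$, $W_1$ far exceeds $W_0$, upweighting the events:

```python
def global_weights(tau, ybar):
    # Equation (11): W1 penalizes event misclassification,
    # W0 penalizes non-event misclassification.
    W1 = tau / ybar
    W0 = (1.0 - tau) / (1.0 - ybar)
    return W1, W0
```

For example, tau = 0.5 with a 10% sample event rate gives W1 = 5 and W0 ≈ 0.56, so each misclassified event costs roughly nine times as much as a misclassified non-event.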

Instead of penalizing misclassification costs based on classes (i.e. event and non-event), Deng proposed to penalize the misclassification cost of each observation differently by a penalty weight $w_i$, where $i$ is the observation index, as shown in Equation (13). $w_i$ is determined by the Gaussian kernel function, defined in Equation (14), where $K_w$ is the kernel width, a hyperparameter to tune. This is called locally weighted logistic regression or kernel logistic regression [8,49]. The resulting loss function in Equation (13) can be solved in time complexity $O(n^3)$ [25,32]. The corresponding misclassification costs of the loss function are in Equation (15). The increase in time complexity is caused by the computation of distance matrices using the Gaussian kernel in Equation (14), which limits its application to large datasets.

$$
\min_{\beta} \; -\sum_{i=1}^{m} w_i \big[y_i \log(\pi(\beta^T x_i)) + (1-y_i)\log(1-\pi(\beta^T x_i))\big], \tag{13}
$$

where

$$
w_i = \exp\left(-\frac{\|x_i - x_q\|^2}{K_w^2}\right) \tag{14}
$$

with $x_q$ denoting the query observation being evaluated:

$$
\mathrm{cost}_i = \begin{cases} -w_i y_i \log(\pi(\beta^T x_i)) & \text{if } y_i = 1, \\ -w_i (1-y_i)\log(1-\pi(\beta^T x_i)) & \text{if } y_i = 0. \end{cases} \tag{15}
$$
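A minimal sketch of the Gaussian kernel weight in Equation (14), assuming the squared Euclidean distance between the observation and the query point (the function name is ours):

```python
import numpy as np

def kernel_weight(x_i, x_q, Kw):
    # Equation (14): Gaussian kernel weight; K_w is the kernel width.
    # Weight decays from 1 toward 0 as x_i moves away from the query x_q.
    return float(np.exp(-np.sum((x_i - x_q) ** 2) / Kw ** 2))
```

Because every observation receives its own distance-based weight per query point, fitting requires full distance matrices, which is the source of the $O(n^3)$ cost noted above.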

By including both the global penalty weights (i.e. $W_0$, $W_1$) and the local penalty weights (i.e. $w_i$) above, along with a regularization term (i.e. $(\alpha/2)\|\beta\|^2$), in the log-likelihood function, Maalouf and Trafalis proposed a rare event weighted kernel logistic regression [31], as shown in Equation (16). This loss function can be solved in time complexity $O(n^3)$. The costs associated with the loss function can be found in Equation (17). Besides its high computational complexity, this method also introduces one more hyperparameter $\alpha$ to tune in the regularization term.

$$
\min_{\beta} \; -W_1 \sum_{i=1}^{m} w_i y_i \log(\pi(\beta^T x_i)) - W_0 \sum_{i=1}^{m} w_i (1-y_i)\log(1-\pi(\beta^T x_i)) + \frac{\alpha}{2}\|\beta\|^2, \tag{16}
$$

where $\alpha$ is the regularization strength, a hyperparameter tuned by users.

$$
\mathrm{cost}_i = \begin{cases} -W_1 w_i y_i \log(\pi(\beta^T x_i)) + \frac{\alpha}{2}\|\beta\|^2 & \text{if } y_i = 1, \\ -W_0 w_i (1-y_i)\log(1-\pi(\beta^T x_i)) + \frac{\alpha}{2}\|\beta\|^2 & \text{if } y_i = 0. \end{cases} \tag{17}
$$

3. A novel penalized log-likelihood function

To address the challenges of difficult parameter estimation (i.e. the population proportion of the event) and high time complexity in the related work, we introduce local penalty weights $\lambda_i$ for event observations as decision variables in the log-likelihood objective function. The loss function is redefined in Equation (18), denoted as $LL(\lambda, \beta)$. The misclassification costs of the event observations are penalized by $\lambda_i$, while the misclassification costs of the non-event observations are not penalized, as shown in Equation (19). Restricting the penalty weights to the event observations reduces the number of decision variables and thus the complexity of the optimization problem, since iteratively updating a large number of decision variables increases the computational cost of the learning process, which will be discussed in the next section.

$$
\min_{\beta,\lambda} \; -\sum_{i=1}^{m} \big[\lambda_i y_i \log(\pi(\beta^T x_i)) + (1-y_i)\log(1-\pi(\beta^T x_i))\big], \tag{18}
$$

where $\lambda_i > 0$, a parameter learned from the data:

$$
\mathrm{cost}_i = \begin{cases} -\lambda_i y_i \log(\pi(\beta^T x_i)) & \text{if } y_i = 1, \\ -(1-y_i)\log(1-\pi(\beta^T x_i)) & \text{if } y_i = 0. \end{cases} \tag{19}
$$

3.1. Learning by gradient descent

The optimization of the proposed log-likelihood function in Equation (18) is a nonlinear programming problem with two sets of decision variables, $\beta$ and $\lambda$, which can be solved by the gradient descent algorithm in time complexity $O(n)$. First, the partial derivatives with respect to $\beta_j$ and $\lambda_i$ are derived in Equations (20) and (21), respectively. They are updated iteratively by the rules in Equations (22) and (23), respectively, where $\alpha_1$ is the learning rate for $\beta_j$ and $\alpha_2$ is the learning rate for $\lambda_i$. The learning rates are tuned by users. The gradient descent-based algorithm is summarized in Algorithm 1, which is guaranteed to converge as proved by other researchers [1,3,9]. In this setting, to ensure that larger penalty weights are given to the event observations, $\lambda$ is initialized to 1 and updated to larger values iteratively. Of note, the update in Equation (23) is in the same direction as the derivative due to a suspected relationship between the $\lambda$ and $\beta$ parameters:

$$
\begin{aligned}
\frac{\partial(-LL(\beta,\lambda))}{\partial \beta_j}
&= -\sum_{i=1}^{m} \left[\lambda_i y_i \frac{\partial \log(\pi(\beta^T x_i))}{\partial \beta_j} + (1-y_i)\frac{\partial \log(1-\pi(\beta^T x_i))}{\partial \beta_j}\right] \\
&= -\sum_{i=1}^{m} \left[\frac{\lambda_i y_i}{\pi(\beta^T x_i)} - \frac{1-y_i}{1-\pi(\beta^T x_i)}\right] \pi(\beta^T x_i)\,(1-\pi(\beta^T x_i))\, x_{ij} \\
&= -\sum_{i=1}^{m} \big[\lambda_i y_i (1-\pi(\beta^T x_i)) - (1-y_i)\pi(\beta^T x_i)\big]\, x_{ij} \\
&= -\sum_{i=1}^{m} (\lambda_i y_i - \lambda_i y_i h_i - h_i + y_i h_i)\, x_{ij},
\end{aligned} \tag{20}
$$
$$
\frac{\partial(-LL(\beta,\lambda))}{\partial \lambda_i} = -y_i \log(\pi(\beta^T x_i)), \tag{21}
$$
$$
\beta_{j,\mathrm{NEW}} = \beta_{j,\mathrm{CURRENT}} - \alpha_1 \frac{\partial(-LL(\beta,\lambda))}{\partial \beta_{j,\mathrm{CURRENT}}}, \tag{22}
$$
$$
\lambda_{i,\mathrm{NEW}} = \lambda_{i,\mathrm{CURRENT}} + \alpha_2 \frac{\partial(-LL(\beta,\lambda))}{\partial \lambda_{i,\mathrm{CURRENT}}}. \tag{23}
$$

Algorithm 1.
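A minimal sketch of the learning procedure in Algorithm 1, assuming a fixed iteration count in place of the paper's validation-AUROC stopping rule and averaging the β-gradient over the observations for step-size stability (both simplifications are ours, as are the function and variable names):

```python
import numpy as np

def fit_learnable(X, y, alpha1=0.1, alpha2=0.01, n_iter=300):
    """Sketch of Algorithm 1: jointly learn beta and event penalty weights lambda."""
    m, n = X.shape
    beta = np.zeros(n)
    lam = np.ones(m)  # lambda_i initialized to 1
    for _ in range(n_iter):
        h = 1.0 / (1.0 + np.exp(-(X @ beta)))          # Equation (1)
        # Equation (20), averaged over m (our choice for a stable step size)
        grad_beta = -X.T @ (lam * y - lam * y * h - h + y * h) / m
        # Equation (21): zero for non-events, nonnegative for events
        grad_lam = -y * np.log(np.clip(h, 1e-12, 1.0))
        beta -= alpha1 * grad_beta                     # Equation (22)
        lam += alpha2 * grad_lam                       # Equation (23)
    return beta, lam
```

Because the λ-gradient is zero for non-events and nonnegative for events, the λ_i grow only for event observations, matching the initialization and the update direction described above.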

3.2. Probability estimation

To interpret the role of the local penalty weights $\lambda$, we invert the log transformation in Equation (18) and trace it back to the likelihood function. As shown in Equation (24), the penalty weights $\lambda$ essentially regularize the process of learning the model coefficients $\beta$ from the training data by weighting the estimated probabilities. The learned $\beta$ are then used to estimate the probability of the event occurrence based on Equation (3) on the validation data and test data. Because $\lambda$ only regularizes the learning process and is not used together with $\beta$ for probability estimation, the interpretability of logistic regression is maintained.

$$
L(\beta) = \prod_{i=1}^{m} P(Y=y_i \mid X=x_i)^{\lambda_i} = \prod_{i=1}^{m} \big[\pi(\beta^T x_i)^{y_i}\,(1-\pi(\beta^T x_i))^{1-y_i}\big]^{\lambda_i}, \tag{24}
$$

where $\lambda_i = 1$ when $y_i = 0$ and $\lambda_i$ are values learned from the data when $y_i = 1$.

3.3. Comparison with other penalized log-likelihood functions

Our proposed penalized log-likelihood function is compared comprehensively with the existing log-likelihood functions in Table 2. As a linear model, it accounts for the imbalance of the data, is much less complicated in terms of time complexity and the number of estimated parameter sets than the nonlinear models with penalty weights determined by the Gaussian kernel, and has no penalty weight-related hyperparameter to tune. Its advantages are demonstrated by the experimental results in Section 4.

Table 2.

Comparison of penalized log-likelihood functions.

Equation Imbalance Linearity Computational complexity Estimated parameter sets Hyperparameters
Equation (8) No Linear $O(n)$ 1 N/A
Equation (10) Yes Linear $O(n)$ 1 $\tau$
Equation (13) Yes Nonlinear $O(n^3)$ Test size $K_w$
Equation (16) Yes Nonlinear $O(n^3)$ Test size $\tau$, $K_w$
Equation (18) Yes Linear $O(n)$ 1 N/A

4. Experiments

Both real-world and simulated datasets are collected and generated to test the performance of the various models. For the real-world data, 10 public imbalanced datasets from multiple domains are collected and used in the experimental study. The basic characteristics of all the datasets can be found in Table 3, including data source, target (i.e. dependent variable), event rate, the number of observations, the number of variables, variable types and domain area. The event rate in the real-world datasets ranges from 0.76% to 10.42%. For the simulated data, 16 datasets with 2000 observations each are generated by varying the event rate (1%, 5%, 10% and 20%) and the number of attributes (i.e. slopes: 1, 3, 4 and 8). All simulated attributes are uncorrelated and drawn from a standard normal distribution, and the target is generated with a linear relationship to the attributes, except in the three-attribute datasets, where the target is generated with a nonlinear relationship.

Table 3.

Basic characteristics of datasets.

Dataset Repository Target Event rate (%) Observations Attributes Domain
abalone_19 UCI 19 0.76 4177 7C,1N Life
arrhythmia UCI 06 5.55 452 206C, 73N Biology
ecoli UCI imU 10.42 336 7C Life
oil UCI minority 4.35 937 49C Environment
ozone_level UCI ozone day 2.86 2536 72C Environment
solar_flare_m0 UCI M-class > 0 5.00 1389 10N Nature
us_crime UCI freq > 0.65 7.69 1994 122C Social
wine_quality UCI score ≤ 4 3.70 4898 11C Business
yeast_me2 UCI ME2 3.44 1484 8C Life
yeast_ml8 LIBSVM 8 7.14 2417 103C Life
Simulated Generated y 1, 5, 10, 20 2000 1, 3, 4 and 8C None

4.1. Experimental methodology

Logistic regression models, trained by the proposed penalized log-likelihood function and the existing ones, are compared comprehensively on each dataset, as listed below. Their performance is evaluated by 100 runs of 10-fold stratified cross validation, which better reflects the model's generalization ability on new data than other validation techniques (e.g. bootstrapping). In each iteration of cross validation, the area under the ROC curve (i.e. AUROC) on the validation data is computed.

  1. Standard: The logistic regression model that is trained by the loss function in Equation (8) with no penalty weights. To fit the model, the logistic regression model function from the Scikit-Learn python package [6,35] is used because its optimizer provides the global optimal solution to Equation (8).

  2. Balanced: The logistic regression model that is trained by the loss function in Equation (10) with balanced global penalty weights, obtained by taking $\tau$ as 0.5 in Equation (11), which adjusts weights inversely proportional to class frequencies [7]. To fit the model, the logistic regression model function with the hyperparameter 'class_weight' set to 'balanced' from the Scikit-Learn python package is used because its optimizer provides the global optimal solution to Equation (10).

  3. Weighted: The logistic regression model that is trained by the loss function in Equation (10) with global penalty weights (i.e. $W_0$, $W_1$) in Equation (11) tuned based on $\tau$ from 0 to 0.5 with step size 0.01 [46]. To fit the model, the logistic regression model function with the hyperparameter 'class_weight' from the Scikit-Learn python package is used because its optimizer provides the global optimal solution to Equation (10).

  4. Kernel: The logistic regression model that is trained by the loss function in Equation (13) with local kernel penalty weights in Equation (14). This model is implemented by a custom-built function. $K_w$ is tuned as a hyperparameter from 0 to 1 with step size 0.1, capturing the nonlinearity of the model [14]. For the simulated data, $K_w$ is also tuned with 3, 10, 20 and 30 as part of the grid.

  5. Learnable: The logistic regression model that is trained by the loss function in Equation (18) with learnable local penalty weights $\lambda_i$ as decision variables. $\lambda_i$ are initialized to 1 and the learning rates (i.e. $\alpha_1$, $\alpha_2$) are tuned for each dataset. This model is implemented by a custom-built function. The learning process is terminated when the AUROC on the validation data ceases to increase, to prevent overfitting [29].

In the experiment comparing the computation time of these five models on the real-world datasets, all models are implemented with the same data structures used in the custom-built function of the learnable model, to eliminate effects caused by the different data structures used in the Scikit-Learn python package and the custom-built functions.
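The 'Balanced' baseline and the stratified cross-validation protocol above can be reproduced in outline with Scikit-Learn (the toy dataset below is illustrative only, not one of the paper's datasets, and a single 10-fold run is shown rather than the paper's 100 repetitions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced dataset with roughly a 10% event rate.
X, y = make_classification(n_samples=500, n_features=5,
                           weights=[0.9, 0.1], random_state=0)

aucs = []
for train_idx, valid_idx in StratifiedKFold(n_splits=10, shuffle=True,
                                            random_state=0).split(X, y):
    # class_weight="balanced" implements the global penalty weights
    # with tau = 0.5, as in the 'Balanced' model above.
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(X[train_idx], y[train_idx])
    scores = clf.predict_proba(X[valid_idx])[:, 1]
    aucs.append(roc_auc_score(y[valid_idx], scores))

mean_auc = float(np.mean(aucs))
```

Stratification keeps the event rate roughly constant across folds, which matters when events are rare.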

4.2. Experimental results

Models on each dataset are compared based on the statistics of the AUROCs from 100 runs of 10-fold stratified cross validation, including the 95% confidence interval, mean and standard deviation. For the real-world datasets, the training time is also computed. Table 4 lists the results from the real-world datasets, and Tables 5 and 6 list the results from the simulated datasets. The insights from the results are given below in Sections 4.2.1 and 4.2.2.

Table 4.

Real-world data results of logistic regression models.

Dataset Model 95% confidence interval Mean Std Training time per run (s)
abalone_19 Standard (0.8168, 0.8279) 0.8224 0.0889 39.1791
abalone_19 Balanced (0.8396, 0.8490) 0.8443 0.0764 45.4924
abalone_19 Weighted (0.8396, 0.8490) 0.8443 0.0764 39.3664
abalone_19 Kernel (0.7018, 0.7139) 0.7078 0.0969 1435.3485
abalone_19 Learnable (0.8595, 0.8681) 0.8638 0.0690 30.5765
arrhythmia Standard (0.8452, 0.8568) 0.8510 0.0935 19.8571
arrhythmia Balanced (0.8582, 0.8695) 0.8639 0.0908 22.4839
arrhythmia Weighted (0.8582, 0.8695) 0.8639 0.0908 22.4354
arrhythmia Kernel (0.5553, 0.5648) 0.5600 0.0771 141.7217
arrhythmia Learnable (0.8765, 0.8867) 0.8816 0.0823 15.3639
ecoli Standard (0.9161, 0.9272) 0.9216 0.0895 12.1102
ecoli Balanced (0.9077, 0.9196) 0.9136 0.0957 14.6242
ecoli Weighted (0.9196, 0.9302) 0.9249 0.0851 14.0671
ecoli Kernel (0.9405, 0.9458) 0.9431 0.0428 27.4736
ecoli Learnable (0.9404, 0.9473) 0.9439 0.0553 7.3705
oil Standard (0.9329, 0.9396) 0.9362 0.0544 19.1043
oil Balanced (0.9167, 0.9263) 0.9215 0.0771 21.9376
oil Weighted (0.9167, 0.9263) 0.9215 0.0771 21.6915
oil Kernel (0.8832, 0.8922) 0.8877 0.0718 170.3713
oil Learnable (0.9472, 0.9518) 0.9495 0.0377 13.9312
ozone_level Standard (0.8936, 0.9007) 0.8971 0.0573 34.8695
ozone_level Balanced (0.8725, 0.8813) 0.8769 0.0708 38.9631
ozone_level Weighted (0.9006, 0.9069) 0.9038 0.0509 37.2606
ozone_level Kernel (0.4838, 0.4967) 0.4903 0.1041 1635.0841
ozone_level Learnable (0.9162, 0.9221) 0.9191 0.0474 23.1777
solar_flare_m0 Standard (0.7701, 0.7818) 0.7759 0.0948 21.5702
solar_flare_m0 Balanced (0.7610, 0.7732) 0.7671 0.0983 24.9768
solar_flare_m0 Weighted (0.7669, 0.7790) 0.7730 0.0970 24.1436
solar_flare_m0 Kernel (0.7341, 0.7447) 0.7394 0.0852 293.0257
solar_flare_m0 Learnable (0.8191, 0.8282) 0.8236 0.0731 11.9129
us_crime Standard (0.9173, 0.9217) 0.9195 0.0351 41.2519
us_crime Balanced (0.9085, 0.9130) 0.9107 0.0363 51.1435
us_crime Weighted (0.9159, 0.9202) 0.9180 0.0351 37.8568
us_crime Kernel (0.7410, 0.7485) 0.7447 0.0608 966.8329
us_crime Learnable (0.9290, 0.9328) 0.9309 0.0306 16.5772
wine_quality Standard (0.7792, 0.7866) 0.7829 0.0600 52.7785
wine_quality Balanced (0.7797, 0.7870) 0.7834 0.0588 57.7737
wine_quality Weighted (0.7807, 0.7880) 0.7843 0.0593 44.4346
wine_quality Kernel (0.8592, 0.8641) 0.8617 0.0395 2013.2618
wine_quality Learnable (0.7861, 0.7933) 0.7897 0.0581 54.7964
yeast_me2 Standard (0.8679, 0.8792) 0.8736 0.0909 24.8167
yeast_me2 Balanced (0.8687, 0.8793) 0.8740 0.0851 27.2143
yeast_me2 Weighted (0.8704, 0.8816) 0.8760 0.0899 23.5347
yeast_me2 Kernel (0.8804, 0.8915) 0.8859 0.0891 259.5875
yeast_me2 Learnable (0.8937, 0.9022) 0.8979 0.0684 19.5705
yeast_ml8 Standard (0.5648, 0.5728) 0.5688 0.0649 58.8315
yeast_ml8 Balanced (0.5548, 0.5630) 0.5589 0.0657 64.3612
yeast_ml8 Weighted (0.5572, 0.5653) 0.5613 0.0656 45.2029
yeast_ml8 Kernel (0.5467, 0.5483) 0.5475 0.0123 167.4668
yeast_ml8 Learnable (0.6279, 0.6350) 0.6315 0.0571 63.2463

Table 6.

Simulation results of logistic regression models (part 2).

Dataset Model 95% confidence interval Mean Std
4 slope(s), event rate: 1% Standard (0.9683, 0.9724) 0.9704 0.0325
4 slope(s), event rate: 1% Balanced (0.9662, 0.9705) 0.9684 0.0350
4 slope(s), event rate: 1% Weighted (0.9676, 0.9719) 0.9698 0.0343
4 slope(s), event rate: 1% Kernel (0.9000, 0.9103) 0.9051 0.0830
4 slope(s), event rate: 1% Learnable (0.9757, 0.9784) 0.9771 0.0212
4 slope(s), event rate: 5% Standard (0.9482, 0.9518) 0.9500 0.0286
4 slope(s), event rate: 5% Balanced (0.9476, 0.9511) 0.9494 0.0285
4 slope(s), event rate: 5% Weighted (0.9481, 0.9517) 0.9499 0.0290
4 slope(s), event rate: 5% Kernel (0.9438, 0.9475) 0.9456 0.0299
4 slope(s), event rate: 5% Learnable (0.9516, 0.9554) 0.9535 0.0308
4 slope(s), event rate: 10% Standard (0.9327, 0.9357) 0.9342 0.0246
4 slope(s), event rate: 10% Balanced (0.9322, 0.9354) 0.9338 0.0257
4 slope(s), event rate: 10% Weighted (0.9326, 0.9357) 0.9342 0.0247
4 slope(s), event rate: 10% Kernel (0.9311, 0.9342) 0.9326 0.0256
4 slope(s), event rate: 10% Learnable (0.9354, 0.9379) 0.9367 0.0196
4 slope(s), event rate: 20% Standard (0.9399, 0.9421) 0.9410 0.0173
4 slope(s), event rate: 20% Balanced (0.9400, 0.9421) 0.9410 0.0169
4 slope(s), event rate: 20% Weighted (0.9401, 0.9422) 0.9412 0.0171
4 slope(s), event rate: 20% Kernel (0.9396, 0.9418) 0.9407 0.0176
4 slope(s), event rate: 20% Learnable (0.9411, 0.9430) 0.9420 0.0158
8 slope(s), event rate: 1% Standard (0.9976, 0.9981) 0.9978 0.0036
8 slope(s), event rate: 1% Balanced (0.9969, 0.9974) 0.9971 0.0044
8 slope(s), event rate: 1% Weighted (0.9975, 0.9980) 0.9977 0.0037
8 slope(s), event rate: 1% Kernel (0.9643, 0.9680) 0.9662 0.0294
8 slope(s), event rate: 1% Learnable (0.9970, 0.9976) 0.9973 0.0046
8 slope(s), event rate: 5% Standard (0.9918, 0.9926) 0.9922 0.0062
8 slope(s), event rate: 5% Balanced (0.9912, 0.9920) 0.9916 0.0061
8 slope(s), event rate: 5% Weighted (0.9915, 0.9923) 0.9919 0.0066
8 slope(s), event rate: 5% Kernel (0.9889, 0.9898) 0.9894 0.0079
8 slope(s), event rate: 5% Learnable (0.9919, 0.9926) 0.9922 0.0060
8 slope(s), event rate: 10% Standard (0.9899, 0.9905) 0.9902 0.0050
8 slope(s), event rate: 10% Balanced (0.9899, 0.9905) 0.9902 0.0048
8 slope(s), event rate: 10% Weighted (0.9898, 0.9905) 0.9901 0.0051
8 slope(s), event rate: 10% Kernel (0.9883, 0.9890) 0.9887 0.0058
8 slope(s), event rate: 10% Learnable (0.9906, 0.9911) 0.9908 0.0040
8 slope(s), event rate: 20% Standard (0.9861, 0.9870) 0.9866 0.0070
8 slope(s), event rate: 20% Balanced (0.9862, 0.9870) 0.9866 0.0068
8 slope(s), event rate: 20% Weighted (0.9861, 0.9870) 0.9866 0.0069
8 slope(s), event rate: 20% Kernel (0.9836, 0.9846) 0.9841 0.0081
8 slope(s), event rate: 20% Learnable (0.9858, 0.9867) 0.9863 0.0068

4.2.1. Real-world data results

  1. On 9 of the 10 datasets, the learnable models produce a higher 95% confidence interval, a higher mean and a smaller standard deviation of AUROCs than the standard, balanced and weighted models, as highlighted in bold in Table 4.

  2. Only on the dataset wine_quality does the kernel model generate a higher 95% confidence interval and higher mean of AUROCs than the other models. The reason is that the kernel model captures the nonlinear relationships in the dataset wine_quality well. The kernel model is nonlinear, with $K_w$ restricted to small values in the experiment setting to ensure the model's nonlinearity, while the other models are linear. If the relationships between the independent variables and the dependent variable are nonlinear, the kernel model captures the patterns better; otherwise, it performs worse. Take the dataset wine_quality as an example: the empirical logit plots in Figure A1 in the Appendix show that most independent variables have a nonlinear relationship with the dependent variable, which leads the kernel model to perform the best.

  3. On the datasets abalone_19, arrhythmia and oil, the balanced models are identical to the weighted models with τ=0.5.

  4. On the datasets abalone_19, arrhythmia and ozone_level, the weighted models have a higher 95% confidence interval and higher mean of AUROCs than standard models. On the datasets oil, solar_flare_m0 and yeast_ml8, the standard models have a higher 95% confidence interval and higher mean of AUROCs than the weighted models. On the datasets ecoli, us_crime, wine_quality and yeast_me2, the standard models and weighted models are similar regarding the 95% confidence interval and mean of AUROCs.

  5. On all datasets, the kernel models take the longest training time. On the datasets abalone_19, arrhythmia, ecoli, oil, ozone_level, solar_flare_m0, us_crime and yeast_me2, the learnable models take the least training time. For the datasets wine_quality and yeast_ml8, the training time of the learnable model is smaller than that of the balanced model, although it is slightly greater than that of the standard model and the weighted model.

4.2.2. Simulation results

  1. On 8 of 16 datasets, the learnable models produce a higher mean of AUROCs than other models, as shown in Tables 5 and 6.

  2. On 12 of 16 datasets, the learnable models produce a lower standard deviation of AUROCs than other models.

  3. On 8 of 16 datasets, the learnable models produce both a higher mean and lower standard deviation of AUROCs than other models.

  4. On 11 of 16 datasets, the learnable model produces a statistically significant higher 95% confidence interval of AUROCs than at least one other model, as highlighted in bold in Tables 5 and 6. The italicized models indicate the model(s) that the learnable model outperforms.

  5. The learnable model outperforms all other models on three of the simulated datasets (i.e. three slopes, event rate: 1%; four slopes, event rate: 1%; eight slopes, event rate: 10%). It is the only model to outperform all others on a given dataset.

  6. On the simulated data with three slopes, generated with a nonlinear relationship, the learnable model outperforms all models except the kernel model at all event rates. Interestingly, the learnable model also outperforms the kernel model on the three-slope data when the event rate is most extreme, at 1%. The kernel model's performance on nonlinear data is consistent with the findings in Section 4.2.1.

  7. All models perform equally well on the datasets generated with only one attribute (i.e. a single slope).
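
A plausible sketch of how a "k slope(s), event rate: p%" dataset can be generated: draw covariates, fix the slopes, and tune the intercept by bisection so the empirical event rate hits the target. The paper's exact simulation design (covariate distributions, the nonlinearity used in the three-slope case) may differ, so treat this as an illustrative assumption:

```python
import numpy as np

def simulate_logistic(n=10000, slopes=(1.0, 1.0, 1.0), event_rate=0.05, seed=0):
    """Draw covariates and binary outcomes from a logistic model, with the
    intercept tuned so the mean event probability matches `event_rate`."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, len(slopes)))
    lin = X @ np.asarray(slopes)
    # Bisection on the intercept: mean probability is increasing in b0.
    lo, hi = -20.0, 20.0
    for _ in range(60):
        b0 = (lo + hi) / 2
        if (1 / (1 + np.exp(-(b0 + lin)))).mean() > event_rate:
            hi = b0
        else:
            lo = b0
    y = (rng.random(n) < 1 / (1 + np.exp(-(b0 + lin)))).astype(int)
    return X, y
```

Datasets produced this way can then be fed through the same cross-validation protocol as the real-world data.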

Table 5.

Simulation results of logistic regression models (part 1).

Dataset Model 95% confidence interval Mean Std
1 slope(s), event rate: 1% Standard (0.6965, 0.7154) 0.7059 0.1523
1 slope(s), event rate: 1% Balanced (0.6983, 0.7175) 0.7079 0.1548
1 slope(s), event rate: 1% Weighted (0.6980, 0.7169) 0.7075 0.1523
1 slope(s), event rate: 1% Kernel (0.6996, 0.7180) 0.7088 0.1481
1 slope(s), event rate: 1% Learnable (0.6954, 0.7166) 0.7060 0.1717
1 slope(s), event rate: 5% Standard (0.7270, 0.7354) 0.7312 0.0679
1 slope(s), event rate: 5% Balanced (0.7266, 0.7353) 0.7309 0.0701
1 slope(s), event rate: 5% Weighted (0.7266, 0.7352) 0.7309 0.0696
1 slope(s), event rate: 5% Kernel (0.7269, 0.7353) 0.7311 0.0678
1 slope(s), event rate: 5% Learnable (0.7268, 0.7340) 0.7304 0.0582
1 slope(s), event rate: 10% Standard (0.7252, 0.7321) 0.7286 0.0551
1 slope(s), event rate: 10% Balanced (0.7255, 0.7320) 0.7287 0.0524
1 slope(s), event rate: 10% Weighted (0.7255, 0.7318) 0.7286 0.0511
1 slope(s), event rate: 10% Kernel (0.7255, 0.7319) 0.7287 0.0518
1 slope(s), event rate: 10% Learnable (0.7266, 0.7318) 0.7292 0.0423
1 slope(s), event rate: 20% Standard (0.7392, 0.7442) 0.7417 0.0409
1 slope(s), event rate: 20% Balanced (0.7389, 0.7441) 0.7415 0.0422
1 slope(s), event rate: 20% Weighted (0.7391, 0.7444) 0.7418 0.0422
1 slope(s), event rate: 20% Kernel (0.7392, 0.7442) 0.7417 0.0400
1 slope(s), event rate: 20% Learnable (0.7392, 0.7441) 0.7417 0.0391
3 slope(s), event rate: 1% Standard (0.5233, 0.5510) 0.5371 0.2239
3 slope(s), event rate: 1% Balanced (0.5033, 0.5311) 0.5172 0.2241
3 slope(s), event rate: 1% Weighted (0.5225, 0.5504) 0.5364 0.2247
3 slope(s), event rate: 1% Kernel (0.6622, 0.6820) 0.6721 0.1593
3 slope(s), event rate: 1% Learnable (0.6908, 0.7065) 0.6987 0.1265
3 slope(s), event rate: 5% Standard (0.5860, 0.5979) 0.5920 0.0963
3 slope(s), event rate: 5% Balanced (0.5811, 0.5929) 0.5870 0.0952
3 slope(s), event rate: 5% Weighted (0.5873, 0.5993) 0.5933 0.0968
3 slope(s), event rate: 5% Kernel (0.6234, 0.6326) 0.6280 0.0736
3 slope(s), event rate: 5% Learnable (0.6106, 0.6199) 0.6153 0.0750
3 slope(s), event rate: 10% Standard (0.5266, 0.5339) 0.5303 0.0588
3 slope(s), event rate: 10% Balanced (0.5288, 0.5360) 0.5324 0.0585
3 slope(s), event rate: 10% Weighted (0.5281, 0.5356) 0.5319 0.0603
3 slope(s), event rate: 10% Kernel (0.5921, 0.5999) 0.5960 0.0624
3 slope(s), event rate: 10% Learnable (0.5645, 0.5704) 0.5675 0.0476
3 slope(s), event rate: 20% Standard (0.5048, 0.5107) 0.5077 0.0477
3 slope(s), event rate: 20% Balanced (0.5008, 0.5066) 0.5037 0.0476
3 slope(s), event rate: 20% Weighted (0.5054, 0.5113) 0.5084 0.0478
3 slope(s), event rate: 20% Kernel (0.5981, 0.6038) 0.6009 0.0463
3 slope(s), event rate: 20% Learnable (0.5517, 0.5561) 0.5539 0.0352

In summary, compared with the balanced and weighted models, the learnable models perform better based on AUROCs as well as the training time in most cases on the real-world datasets. Compared with the standard models, the learnable models perform better based on AUROCs on the real-world datasets, but their training time performance depends on the datasets. For the simulated datasets, the learnable model consistently has the highest AUROC mean and it is the only model to significantly outperform all other models for a given dataset. Compared with the kernel models, the learnable models perform better based on training time in all cases as well as AUROCs in linear cases on the real-world data. For simulated data, the learnable model outperforms the kernel model for all linear cases and the non-linear case when the event rate is 1%. The higher performance of the kernel model for most non-linear cases is consistent with the findings from the real-world datasets. Experimental results are consistent with characteristics of penalized log-likelihood functions summarized in Table 2.

5. Estimated probability distribution and additional performance analysis of resulting models

To further analyze the models in respect to the estimated probability distribution and additional performance measurements including type I error, type II error, and accuracy, a more detailed study is conducted on the dataset GiveMeSomeCredit from Kaggle. The problem is to predict whether a client will experience financial distress in the next two years or not using their biographical and financial information (e.g. monthly income, number of dependents, number of open credit lines and loans, revolving utilization of unsecured lines, number of time 30–59 days past due not worse, number of time 60–89 days past due not worse and number of times 90 days late) [24]. The proportion of delinquency observations is 6.68%.

The detailed information, exploratory analysis and basic data processing can be found in our previously published study [46].

5.1. Estimated probability distribution

Four models are built (standard, balanced, weighted and learnable), as described in Section 4.1. The weighted model is omitted from the following comparison because it is identical to the balanced model on this dataset. The estimated probability distributions of the models on the test data can be found in Figure 2(a,c,e), respectively, where the probabilities estimated for true non-event observations (i.e. Class 0) can be found under "Predicted Probability Distribution of Class 0" and the probabilities estimated for true event observations (i.e. Class 1) can be found under "Predicted Probability Distribution of Class 1". We have the following insights:

  1. For the standard model, the estimated probabilities for most non-event observations fall in the range [0,0.2], which is very similar to the range for most event observations; the two distributions largely overlap.

  2. For the balanced model, the estimated probabilities for most non-event observations fall in the range [0.1,0.9], while the estimated probabilities for most event observations fall in the range [0.2,1], shifted slightly towards 1 compared with the non-event observations.

  3. For the learnable model, the estimated probabilities for most non-event observations fall in the range [0.9,0.97], while the estimated probabilities for most event observations fall in the range [0.95,1], shifted much further towards 1 compared with the non-event observations.
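
Per-class probability summaries of this kind can be reproduced in outline as follows. The synthetic dataset and plain logistic model are stand-ins for illustration, not GiveMeSomeCredit:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Summarize where the predicted probabilities of true non-events
# (class 0) and true events (class 1) fall on held-out data.
X, y = make_classification(n_samples=5000, weights=[0.93], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)
p = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
for cls in (0, 1):
    lo, hi = np.quantile(p[y_te == cls], [0.05, 0.95])
    print(f"class {cls}: 90% of predicted probabilities in [{lo:.3f}, {hi:.3f}]")
```

Histograms of `p` split by true class correspond to the panels of Figure 2(a,c,e).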

Figure 2.

Predicted probabilities and probability cutoff on test data. (a) Standard: predicted probability; (b) standard: probability cutoff; (c) balanced: predicted probability; (d) balanced: probability cutoff; (e) learnable: predicted probability; and (f) learnable: probability cutoff.

It is reasonable and expected that the estimated probabilities from the learnable model are shifted towards 1 overall compared with the standard and balanced models, because larger penalty weights are applied to misclassifications of event observations during the training phase. The shift can be effectively addressed by choosing an appropriate probability cutoff, as illustrated in Section 5.2. The results suggest that the learnable model differentiates true events from true non-events better than the standard model and the balanced model.

5.2. Performance measures under probability cutoff

The probability cutoff, used to transform a probability into a binary decision, is set to the probability at the intersection of the sensitivity and specificity plots [18,37] in order to balance type I error and type II error. The probability cutoffs for the three models (i.e. standard, balanced, learnable) on the test data are 0.0675, 0.4516 and 0.9749, as shown in Figure 2(b,d,f), respectively. Under the corresponding probability cutoffs, the additional performance measurements, including type I error, type II error and accuracy, as well as AUROC, are computed and reported in Table 7. We have the following insights:

  1. Compared with the standard model, the learnable model decreases type I error by 9.01%, decreases type II error by 8.89%, increases accuracy by 9.01% and increases AUROC by 0.0899 on the test data.

  2. Compared with the balanced model, the learnable model decreases type I error by 1.78%, decreases type II error by 1.80%, increases accuracy by 1.79% and increases AUROC by 0.0179 on the test data.
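
The cutoff rule used here (the probability at which sensitivity and specificity intersect) can be sketched with `sklearn.metrics.roc_curve`:

```python
import numpy as np
from sklearn.metrics import roc_curve

def balanced_cutoff(y_true, y_score):
    """Probability cutoff at the intersection of the sensitivity and
    specificity curves, i.e. the ROC threshold minimizing |TPR - (1 - FPR)|."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    i = np.argmin(np.abs(tpr - (1 - fpr)))
    return thresholds[i]
```

Applied to each model's predicted probabilities, this is how cutoffs as different as 0.0675 (standard) and 0.9749 (learnable) arise from the same rule: the rule follows the model's probability scale rather than a fixed 0.5.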

Table 7.

Performance measures on validation data and test data.

Dataset Model Type I error (%) Type II error (%) Accuracy AUROC Probability cutoff
Validation data Standard 36.00 36.03 64.00% 0.6398 0.0672
  Balanced 27.14 27.13 72.86% 0.7287 0.4572
  Learnable 26.12 25.93 73.89% 0.7397 0.9751
Test data Standard 35.75 35.65 64.26% 0.6430 0.0675
  Balanced 28.52 28.47 71.48% 0.7150 0.4516
  Learnable 26.74 26.67 73.27% 0.7329 0.9749

The learnable model performs the best according to all these performance measures on both the validation data and the test data, followed by the balanced model and the standard model.
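
The measures in Table 7 follow directly from the confusion matrix at a given cutoff. A minimal sketch, reading type I error as the false positive rate and type II error as the false negative rate (consistent with the near-equal error pairs in the table):

```python
import numpy as np

def error_rates(y_true, y_score, cutoff):
    """Type I error (non-events flagged as events), type II error
    (events missed) and accuracy at a given probability cutoff."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) >= cutoff).astype(int)
    type1 = (y_pred[y_true == 0] == 1).mean()
    type2 = (y_pred[y_true == 1] == 0).mean()
    accuracy = (y_pred == y_true).mean()
    return type1, type2, accuracy
```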

5.3. Estimated model coefficients

The estimated coefficients of the three models (i.e. standard, balanced, learnable) in Section 5.2 are examined. As shown in Table 8, the models estimate different values for each independent variable, and they disagree on the sign of numberoftime6089dayspastduenotworse, which is negative in the standard model and positive in the balanced and learnable models. Its empirical logit plot in Figure 3 shows a positive relationship. Based on its variance inflation factor, multicollinearity exists with the variables numberoftime3059dayspastduenotworse and numberoftimes90dayslate, which causes the sign change [46]. Because all of their information values are above 0.1, none of them should be dropped. Despite the multicollinearity, the balanced model and the learnable model generate a positive estimate that is consistent with the univariate effect.

Table 8.

Estimated model coefficients.

Variable Standard model Balanced model Learnable model
intercept −2.77 −0.19 3.62
age −0.03 −0.02 −0.02
Revolving utilization of unsecured lines −0.40 −0.40 −0.22
numberoftime3059dayspastduenotworse 1.77 2.05 1.34
numberoftimes90dayslate 1.58 2.30 1.86
numberoftime6089dayspastduenotworse −3.19 0.87 1.19

Figure 3.

Empirical logit plot of numberoftime6089dayspastduenotworse.
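
An empirical logit plot like Figure 3 can be computed by binning the predictor and taking the continuity-corrected log-odds of the event per bin. A minimal sketch (the bin count and continuity correction `eps` are illustrative choices, not values taken from the paper):

```python
import numpy as np
import pandas as pd

def empirical_logit(x, y, n_bins=10, eps=0.5):
    """Continuity-corrected empirical log-odds of the event per quantile
    bin of a predictor; plotting `logit` against `centers` gives the
    empirical logit plot."""
    df = pd.DataFrame({"x": np.asarray(x, float), "y": np.asarray(y, int)})
    df["bin"] = pd.qcut(df["x"], q=n_bins, duplicates="drop")
    g = df.groupby("bin", observed=True)
    centers = g["x"].mean().to_numpy()
    events = g["y"].sum().to_numpy()
    totals = g["y"].count().to_numpy()
    logit = np.log((events + eps) / (totals - events + eps))
    return centers, logit
```

A roughly monotone increasing pattern of points, as here, supports a positive univariate effect.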

5.4. Discussions

One potential issue of this method is overfitting, because local penalty weights are learned from the training data along with the model coefficients. This is a common issue for cost-sensitive learning methods [29], as more parameters are determined from the training data. The strategy we adopt to prevent overfitting is to stop the training process when the AUROC on the validation data ceases to increase, rather than at convergence, although the gradient descent-based algorithm is guaranteed to converge. As shown by the performance measurements on the validation and test data of the credit dataset in Table 7, the generalization ability of the logistic regression model trained with the proposed log-likelihood function is good.

6. Conclusions

To improve classification on imbalanced data, we propose a novel penalized log-likelihood function that includes penalty weights as decision variables for event observations and learns them from data along with the model coefficients. Its advantages in discrimination ability and computational efficiency are demonstrated by training logistic regression models with it and comparing them against logistic regression models trained with other penalized log-likelihood functions on 10 public datasets from multiple domains. Performance is evaluated by the statistics (i.e. 95% confidence interval, mean, standard deviation) of AUROCs over 100 runs of 10-fold stratified cross-validation, as well as by training time. A detailed study is also conducted on an imbalanced credit dataset to examine the distributions of estimated probabilities, the additional performance measurements under the chosen probability cutoff (i.e. type I error, type II error, accuracy) and the estimated model coefficients.

7. Future work

In the future, to further improve the proposed algorithm, we would like to incorporate intelligent tuning of the learning rate into the modified gradient descent algorithm and to use simulated data to study more properties of the estimated parameters. In practice, we plan to apply the proposed penalized log-likelihood function to improve neural networks and deep learning models on classifying imbalanced data, and to extend the method to multi-class classification problems.

Appendix. Empirical logit plots of the dataset wine_quality.

Figure A1.

Empirical logit plots of the dataset wine_quality. (a) var0, (b) var1, (c) var2, (d) var3, (e) var4, (f) var5, (g) var6, (h) var7, (i) var8 and (j) var9.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • 1. Arora S., Cohen N., Golowich N., and Hu W., A convergence analysis of gradient descent for deep linear neural networks, preprint (2018). arXiv:1810.02281.
  • 2. Bahnsen A.C., Aouada D., and Ottersten B., Example-dependent cost-sensitive logistic regression for credit scoring, in 2014 13th International Conference on Machine Learning and Applications (ICMLA), IEEE, Detroit, MI, 2014, pp. 263–269.
  • 3. Baird III L.C. and Moore A.W., Gradient descent for general reinforcement learning, in Advances in Neural Information Processing Systems, MIT Press, Cambridge, MA, 1999, pp. 968–974.
  • 4. Barandela R., Valdovinos R.M., Sánchez J.S., and Ferri F.J., The imbalanced training sample problem: under or over sampling? in Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), Springer, Lisbon, 2004, pp. 806–814.
  • 5. Brown I. and Mues C., An experimental comparison of classification algorithms for imbalanced credit scoring data sets, Expert Syst. Appl. 39 (2012), pp. 3446–3453.
  • 6. Buitinck L., Louppe G., Blondel M., Pedregosa F., Mueller A., Grisel O., Niculae V., Prettenhofer P., Gramfort A., Grobler J., Layton R., VanderPlas J., Joly A., Holt B., and Varoquaux G., API design for machine learning software: experiences from the scikit-learn project, in ECML PKDD Workshop: Languages for Data Mining and Machine Learning, Prague, 2013, pp. 108–122.
  • 7. Buitinck L., Louppe G., Blondel M., Pedregosa F., Mueller A., Grisel O., Niculae V., Prettenhofer P., Gramfort A., Grobler J., Layton R., VanderPlas J., Joly A., Holt B., and Varoquaux G., Scikit-learn documentation on logistic regression (2018). Available at https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html. Accessed 2018-09-30.
  • 8. Canu S. and Smola A., Kernel methods and the exponential family, Neurocomputing 69 (2006), pp. 714–720.
  • 9. Chizat L. and Bach F., On the global convergence of gradient descent for over-parameterized models using optimal transport, in Advances in Neural Information Processing Systems, Montreal, 2018, pp. 3036–3046.
  • 10. Collell G., Prelec D., and Patil K.R., A simple plug-in bagging ensemble based on threshold-moving for classifying binary and multiclass imbalanced data, Neurocomputing 275 (2018), pp. 330–340.
  • 11. Cox D.R., The regression analysis of binary sequences, J. R. Stat. Soc. Ser. B (Methodol.) 20 (1958), pp. 215–242.
  • 12. Cox D.R., Some procedures associated with the logistic qualitative response curve, in Research Papers in Statistics: Festschrift for J. Neyman, John Wiley & Sons, London, 1966, pp. 55–71.
  • 13. Demir G., Aytekin M., and Akgun A., Landslide susceptibility mapping by frequency ratio and logistic regression methods: an example from Niksar–Resadiye (Tokat, Turkey), Arabian J. Geosci. 8 (2015), pp. 1801–1812.
  • 14. Deng K., Omega: on-line memory-based general purpose system classifier, Ph.D. diss., Carnegie Mellon University, 1998.
  • 15. Ding J. and Xiong W., A new estimator for a population proportion using group testing, Commun. Stat. Simul. Comput. 45 (2016), pp. 101–114.
  • 16. Drummond C. and Holte R.C., C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling, in Workshop on Learning from Imbalanced Datasets II, Vol. 11, Citeseer, Washington, DC, 2003, pp. 1–8.
  • 17. Reza Z., Streaming stochastic gradient descent for generalized linear models (2015). Available at https://stanford.edu/∼rezab/classes/cme323/S15/notes/lec11.pdf. Accessed 2018-04-02.
  • 18. Habibzadeh F., Habibzadeh P., and Yadollahie M., On determining the most appropriate test cut-off value: the case of tests with continuous results, Biochem. Med. 26 (2016), pp. 297–307.
  • 19. Han H., Wang W.Y., and Mao B.H., Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning, in International Conference on Intelligent Computing, Springer, Hefei, 2005, pp. 878–887.
  • 20. He H. and Garcia E.A., Learning from imbalanced data, IEEE Trans. Knowl. Data Eng. 21 (2008), pp. 1263–1284.
  • 21. He H. and Ma Y., Imbalanced Learning: Foundations, Algorithms, and Applications, John Wiley & Sons, Hoboken, 2013.
  • 22. Hong C., Ghosh R., and Srinivasan S., Dealing with class imbalance using thresholding, preprint (2016). arXiv:1607.02705.
  • 23. Hosmer Jr D.W., Lemeshow S., and Sturdivant R.X., Applied Logistic Regression, Vol. 398, John Wiley & Sons, Hoboken, 2013.
  • 24. Kaggle, Give me some credit (2011). Available at https://www.kaggle.com/c/GiveMeSomeCredit/data. Accessed 2018-02-01.
  • 25. Karsmakers P., Pelckmans K., and Suykens J.A., Multi-class kernel logistic regression: a fixed-size implementation, in 2007 International Joint Conference on Neural Networks (IJCNN), IEEE, Orlando, 2007, pp. 1756–1761.
  • 26. King G. and Zeng L., Logistic regression in rare events data, Polit. Anal. 9 (2001), pp. 137–163.
  • 27. Krawczyk B., Galar M., Jeleń Ł., and Herrera F., Evolutionary undersampling boosting for imbalanced classification of breast cancer malignancy, Appl. Soft Comput. 38 (2016), pp. 714–726.
  • 28. Kubat M., Holte R.C., and Matwin S., Machine learning for the detection of oil spills in satellite radar images, Mach. Learn. 30 (1998), pp. 195–215.
  • 29. Kukar M. and Kononenko I., Cost-sensitive learning with neural networks, in ECAI, Brighton, 1998, pp. 445–449.
  • 30. Laitinen E.K. and Laitinen T., Bankruptcy prediction: application of the Taylor's expansion in logistic regression, Int. Rev. Financial Anal. 9 (2000), pp. 327–349.
  • 31. Maalouf M. and Trafalis T.B., Robust weighted kernel logistic regression in imbalanced and rare events data, Comput. Stat. Data Anal. 55 (2011), pp. 168–183.
  • 32. Maalouf M., Trafalis T.B., and Adrianto I., Kernel logistic regression using truncated Newton method, Comput. Manage. Sci. 8 (2011), pp. 415–428.
  • 33. Moayedikia A., Ong K.L., Boo Y.L., Yeoh W.G., and Jensen R., Feature selection for high dimensional imbalanced class data using harmony search, Eng. Appl. Artif. Intell. 57 (2017), pp. 38–49.
  • 34. Özöür-akyüz S., Ünay D., and Smola A., Guest editorial: model selection and optimization in machine learning, Mach. Learn. 85 (2011), p. 1.
  • 35. Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M., and Duchesnay E., Scikit-learn: machine learning in Python, J. Mach. Learn. Res. 12 (2011), pp. 2825–2830.
  • 36. Phua C., Alahakoon D., and Lee V., Minority report in fraud detection: classification of skewed data, ACM SIGKDD Explor. Newslett. 6 (2004), pp. 50–59.
  • 37. Pramono L.A., Setiati S., Soewondo P., Subekti I., Adisasmita A., Kodim N., and Sutrisna B., Prevalence and predictors of undiagnosed diabetes mellitus in Indonesia, Age 46 (2010), pp. 100–100.
  • 38. Provost F., Machine learning from imbalanced data sets 101, in Proceedings of the AAAI'2000 Workshop on Imbalanced Data Sets, Austin, 2000, pp. 1–3.
  • 39. Shu B., Zhang H., Li Y., Qu Y., and Chen L., Spatiotemporal variation analysis of driving forces of urban land spatial expansion using logistic regression: a case study of port towns in Taicang City, China, Habitat Int. 43 (2014), pp. 181–190.
  • 40. Springer D.B., Tarassenko L., and Clifford G.D., Logistic regression-HSMM-based heart sound segmentation, IEEE Trans. Biomed. Eng. 63 (2016), pp. 822–832.
  • 41. Sra S., Nowozin S., and Wright S.J., Optimization for Machine Learning, MIT Press, Cambridge, 2012.
  • 42. Wahba G., Gu C., Wang Y., and Campbell R., Soft classification, a.k.a. risk estimation, via penalized log likelihood and smoothing spline analysis of variance, in The Mathematics of Generalization, CRC Press, Boca Raton, 2018, pp. 331–359.
  • 43. Walker S.H. and Duncan D.B., Estimation of the probability of an event as a function of several independent variables, Biometrika 54 (1967), pp. 167–179.
  • 44. Weiss G.M. and Provost F., Learning when training data are costly: the effect of class distribution on tree induction, J. Artif. Intell. Res. 19 (2003), pp. 315–354.
  • 45. Zhang L., Priestley J., and Ni X., Influence of the event rate on discrimination abilities of bankruptcy prediction models, Int. J. Database Manage. Syst. 10 (2018), pp. 1–14.
  • 46. Zhang L., Ray H., Priestley J., and Tan S., A descriptive study of variable discretization and cost-sensitive logistic regression on imbalanced credit data, J. Appl. Stat. 47 (2019), pp. 568–581.
  • 47. Zheng S., Strasser S., Holt N., Quinn M., Liu Y., and Morrell C., Stratified multilevel logistic regression modeling for risk factors of adolescent obesity in Tennessee, Int. J. High Risk Behav. Addict. 7 (2018), p. e58597.
  • 48. Zheng Z., Wu X., and Srihari R., Feature selection for text categorization on imbalanced data, ACM SIGKDD Explor. Newslett. 6 (2004), pp. 80–89.
  • 49. Zhu J. and Hastie T., Kernel logistic regression and the import vector machine, J. Comput. Graph. Stat. 14 (2005), pp. 185–205.
