Abstract
An influential observation is any point that has a large effect on the coefficients of a regression line fitted to the data. The presence of such observations in a data set reduces the sensitivity and validity of the statistical analysis. There are many methods in the literature for identifying influential observations. However, many of them are strongly affected by masking and swamping effects and require distributional assumptions, and most are insufficient when the influential observations occur in subsets. This study develops a new diagnostic tool for identifying influential observations using the meta-heuristic binary particle swarm optimization algorithm. The proposed approach does not require any distributional assumptions and is not affected by masking and swamping effects like the known methods. The performance of the proposed method is analyzed via simulations and real data set applications.
Keywords: Influential subsets, binary particle swarm optimization, heuristic algorithms, linear regression, diagnostics
1. Introduction
An influential observation is one that has a large effect on the regression coefficients. Such observations may arise from incorrect measurements, errors in data entry, problems with the measuring device, or rare events. Regression modeling techniques are very sensitive to such observations: a single observation can have a major influence on the results of a regression analysis. It is therefore important to be alert to the possibility of influential observations and to take them into consideration when interpreting the results.
In regression analysis, there are many measures of the influence of observations on the least squares estimate of the parameter vector; several are surveyed in [8,10]. Most of these measures are based on removing a single observation from the data set and then examining the changes in the regression coefficients or the regression fit. The biggest problem of methods based on single case deletion is the swamping and masking effects: a non-influential observation might be identified as influential (swamping), or an influential observation might be hidden by another influential observation in the data (masking). In addition, these methods require many distributional assumptions.
One of the most popular influence measures, Cook's distance, is based on the change in the least squares estimator of the regression coefficients due to removing a single observation from the data set [9]. In fact, this change does not follow any particular distribution, so referring it to a distributional benchmark can distort the assessed influence of an observation on the least squares estimate. Consequently, Cook's distance is likely to underestimate or overestimate the impact of an observation, and its use may lead to misidentification of influential observations, especially in large samples [13].
Another popular measure is DFFITS, introduced by Belsley, Kuh and Welsch in 1980 [5]. Its cutoff values depend on the sample size and the number of explanatory variables, so it is challenging to determine correct cutoff values for all cases. The cutoff values recommended in the literature are only guidelines and are not suitable for all data structures [15].
The Likelihood Displacement (LD) measure is another diagnostic tool, introduced by Cook and Weisberg in 1982 [10]. This measure is assumed to follow a chi-squared distribution. However, this assumption does not hold in large samples; instead, as the sample size increases, the statistic converges to 0 [21]. Another drawback of the LD statistic, shared with the other single case deletion methods, is that it is affected by swamping and masking effects, so its results are often misleading.
In recent years, interest in algorithms inspired by natural phenomena has increased. Algorithms such as Particle Swarm Optimization (PSO), artificial bee colony algorithms, ant colony optimization, tabu search and genetic algorithms are employed to solve various optimization problems. PSO is one of the most commonly used nature-inspired algorithms, developed by modeling the flocking behavior of birds [11]. PSO has advantages over other meta-heuristic algorithms owing to its good run times, simplicity, low parameter requirements, improved convergence, reduced dependency on initial points, sufficient global search and robustness [4,6,17]. Binary Particle Swarm Optimization (BPSO) is a powerful extension that allows PSO to be applied to binary optimization problems [12]. BPSO has been used effectively in many applications such as variable selection, feature selection, scheduling, lot sizing, routing, selection of optimal input subsets for SVM, and the design of dual-band dual-polarized planar antennas [1,3,7,14,16,19,22]. Among recent studies on this method, Sancar and Inan in 2019 [18] used BPSO for determining influential observations in survival data.
The purpose of this study is to develop a new detection method that does not require any distributional assumptions, performs well even in the presence of influential subsets, and is not affected by masking and swamping effects like the other detection methods. In this study, deciding whether each observation is influential or not is treated as a combinatorial problem. Given the success of BPSO in solving such combinatorial problems in the literature, it is thought to be effective in the solution of this problem as well, so BPSO is used as the detection tool. The performance of the proposed method was analyzed with the help of various simulation studies and real data sets.
This paper is arranged as follows. Section 2 presents general background and notation. Section 3 reviews some standard measures of influence. PSO and BPSO are described in Sections 4 and 5. The proposed BPSO-based method is introduced in Section 6. Its performance is compared with various methods on simulated and real data sets in Sections 7 and 8, and a conclusion is presented in Section 9.
2. Background and notation
Consider the matrix model for linear regression $Y = X\beta + \varepsilon$, where $X$ is an $n\times p$ full rank matrix of covariates, $Y$ is an $n\times 1$ vector of observations, $\beta$ is a $p\times 1$ vector of unknown parameters and $\varepsilon$ is an $n\times 1$ vector of random errors such that $E(\varepsilon)=0$ and $\operatorname{Var}(\varepsilon)=\sigma^2 I$. The fitted values of $Y$ are expressed as $\hat{Y} = X\hat{\beta} = HY$.
The hat matrix is defined as $H = X(X^{T}X)^{-1}X^{T}$.
The diagonal elements $h_{ii}$ of $H$ are the leverages of the points; they measure the distance between an observation's $x$ values and the center of the data. A large value of $h_{ii}$ shows that the $i$-th observation is distant from the center. The residual $e_i = y_i - \hat{y}_i$ measures the distance between an observation $y_i$ and its predicted value $\hat{y}_i$. The variances of the residuals are estimated by the diagonal elements of $\hat{\sigma}^2(I-H)$. Dividing each residual by its standard deviation gives a standardized residual, represented by $r_i$,
$$r_i = \frac{e_i}{\hat{\sigma}\sqrt{1-h_{ii}}},$$
where $\hat{\sigma}^2$ is the estimate of the error variance and $I$ is the $n\times n$ identity matrix. The standardized residuals behave much like a Student's $t$ random variable, except that the numerator and denominator of $r_i$ are not independent. To address this problem, Studentized residuals are suggested as
$$t_i = \frac{e_i}{\hat{\sigma}_{(i)}\sqrt{1-h_{ii}}},$$
where $\hat{\sigma}_{(i)}^2$ is computed with the $i$-th observation removed; throughout, a subscript in parentheses denotes a quantity estimated after deleting that observation.
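To make the notation concrete, the hat matrix, leverages and both residual types can be computed in a few lines. The following NumPy sketch is illustrative; the function name and data layout are ours, not the paper's:

```python
import numpy as np

def regression_diagnostics(X, y):
    """Leverages and standardized/Studentized residuals for the linear
    model y = X b + e (X is assumed to include an intercept column)."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix
    h = np.diag(H)                             # leverages h_ii
    beta = np.linalg.solve(X.T @ X, X.T @ y)   # least squares estimate
    e = y - X @ beta                           # ordinary residuals
    s2 = e @ e / (n - p)                       # error variance estimate
    r = e / np.sqrt(s2 * (1 - h))              # standardized residuals
    # leave-one-out variance estimates s2_(i), computed without refitting
    s2_i = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)
    t = e / np.sqrt(s2_i * (1 - h))            # Studentized residuals
    return h, r, t
```

The leave-one-out variances use the standard downdating identity, so no refit is needed per observation.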
3. Review of measures of influence
Some standard measures of influence are discussed in this section: Cook's distance, DFFITS and LD.
3.1. Cook's distance
The Cook's distance statistic, suggested by Cook in 1977 [9], measures the Euclidean distance between $\hat{\beta}$ and $\hat{\beta}_{(i)}$, the parameter estimate when the $i$-th observation is removed from the data set. Cook's distance is defined as
$$D_i = \frac{(\hat{\beta}_{(i)}-\hat{\beta})^{T} X^{T}X\, (\hat{\beta}_{(i)}-\hat{\beta})}{p\,\hat{\sigma}^2}. \tag{1}$$
$D_i$ can be written simply as
$$D_i = \frac{r_i^2}{p}\,\frac{h_{ii}}{1-h_{ii}}. \tag{2}$$
To use $D_i$ diagnostically, Cook [9] suggested using the 50th percentile of the central $F$ distribution with $(p,\, n-p)$ degrees of freedom as a benchmark for identifying influential observations.
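Since Equations (1) and (2) are algebraically equivalent, Cook's distance can be computed from a single full-data fit. A minimal sketch (names are illustrative):

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance via the closed form D_i = r_i^2 h_ii / (p (1 - h_ii)),
    equivalent to refitting with each observation deleted."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    e = y - H @ y                              # residuals
    s2 = e @ e / (n - p)                       # error variance estimate
    r = e / np.sqrt(s2 * (1 - h))              # standardized residuals
    return r**2 * h / (p * (1 - h))
```

The closed form avoids n separate refits while giving exactly the deletion-based value.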
3.2. DFFITS
DFFITS was proposed by Belsley, Kuh and Welsch in 1980 [5] and depends on the difference between $\hat{y}_i$ and $\hat{y}_{(i)}$. It is given as
$$\mathrm{DFFITS}_i = \frac{\hat{y}_i - \hat{y}_{(i)}}{\hat{\sigma}_{(i)}\sqrt{h_{ii}}} \tag{3}$$
$$\mathrm{DFFITS}_i = t_i\,\sqrt{\frac{h_{ii}}{1-h_{ii}}}, \tag{4}$$
where $\hat{y}_{(i)}$ is the fitted value when the $i$-th observation is deleted. DFFITS is highly influenced by both Studentized residuals and high leverage points. Belsley, Kuh and Welsch [5] suggested that absolute values greater than $2\sqrt{p/n}$ can be used to flag influential points.
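A sketch of DFFITS via Equation (4), together with the conventional $2\sqrt{p/n}$ flag; the function names are ours:

```python
import numpy as np

def dffits(X, y):
    """DFFITS_i = t_i * sqrt(h_ii / (1 - h_ii)) with Studentized residuals t_i."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    e = y - H @ y
    s2 = e @ e / (n - p)
    # leave-one-out variance estimates, no refit needed
    s2_i = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)
    t = e / np.sqrt(s2_i * (1 - h))            # Studentized residuals
    return t * np.sqrt(h / (1 - h))

def dffits_cutoff(n, p):
    """Conventional flag: |DFFITS_i| > 2 * sqrt(p / n)."""
    return 2 * np.sqrt(p / n)
```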
3.3. Likelihood displacement
LD was introduced by Cook and Weisberg in 1982 [10] and is widely used to detect influential observations in data analysis. In influence analysis, it compares the full data likelihood at $\hat{\theta} = (\hat{\beta}, \hat{\sigma}^2)$ and at $\hat{\theta}_{(i)} = (\hat{\beta}_{(i)}, \hat{\sigma}_{(i)}^2)$. The LD statistic is defined as
$$LD_i = 2\bigl[L(\hat{\beta}, \hat{\sigma}^2) - L(\hat{\beta}_{(i)}, \hat{\sigma}_{(i)}^2)\bigr]. \tag{5}$$
Cook and Weisberg [10] suggested that LD can be transformed to a more familiar scale by comparing it to percentiles of a $\chi^2$ distribution with $p$ degrees of freedom.
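The LD statistic can be computed with one refit per observation. The sketch below assumes Gaussian maximum likelihood estimates (divisor $n$ for the variance), which is one common convention; names are ours:

```python
import numpy as np

def log_lik(X, y, beta, s2):
    """Gaussian log-likelihood of the linear model evaluated at (beta, s2)."""
    n = len(y)
    resid = y - X @ beta
    return -0.5 * n * np.log(2 * np.pi * s2) - 0.5 * resid @ resid / s2

def likelihood_displacement(X, y, i):
    """LD_i = 2 [L(beta_hat, s2_hat) - L(beta_hat_(i), s2_hat_(i))]."""
    n = len(y)
    b = np.linalg.solve(X.T @ X, X.T @ y)
    s2 = np.sum((y - X @ b)**2) / n            # full-data ML variance
    Xi, yi = np.delete(X, i, axis=0), np.delete(y, i)
    bi = np.linalg.solve(Xi.T @ Xi, Xi.T @ yi) # deleted-data estimates
    s2i = np.sum((yi - Xi @ bi)**2) / (n - 1)
    return 2 * (log_lik(X, y, b, s2) - log_lik(X, y, bi, s2i))
```

Since the full-data estimates maximize the full-data likelihood, $LD_i$ is always non-negative.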
4. Particle swarm optimization (PSO)
PSO was introduced by Eberhart and Kennedy in 1995 [11]. The algorithm was inspired by the social behavior of bird flocks as well as the behavior of individual birds. Each potential solution is called a 'particle' and a population of particles is called a 'swarm'. In PSO, each particle starts the search with a random velocity and position, which are updated at each iteration to find the best solution according to a predefined quality measure (the objective function). At each step, the velocity of each particle is updated based on the particle's inertia as well as its personal experience and the swarm's experience. During the flight, each particle's best experience is kept as its personal best position (Pbest), and the swarm's experience is kept as the global best position (Gbest). A specific property of PSO is that it simultaneously searches different points in different areas of the problem space in order to determine a global optimum solution, which mitigates the risk of being trapped in local optima. The steps of the original PSO approach are as follows.
Step 1. Define the tuning parameters of PSO: $K$, $w$, $c_1$, $c_2$, $V_{\max}$ ($K$: number of particles in the swarm, $w$: inertia parameter, $c_1$: cognitive coefficient, $c_2$: social coefficient, $V_{\max}$: maximum value for the elements of velocity).
Step 2. Randomly generated initial positions and velocities are determined as in Equations (6) and (7), respectively:
$$x_{ij}^{(0)} \sim U(x_{\min}, x_{\max}), \tag{6}$$
$$v_{ij}^{(0)} \sim U(-V_{\max}, V_{\max}), \tag{7}$$
where $x_i = (x_{i1},\ldots,x_{in})$ is the $i$-th particle of the population with $i = 1,\ldots,K$ and $v_i = (v_{i1},\ldots,v_{in})$ is the corresponding velocity vector.
Step 3. The fitness values of all particles in the swarm are computed according to the objective function $f$.
Step 4. Keep the position where each particle has achieved its best fitness value so far:
$$Pbest_i^{(t)} = \begin{cases} x_i^{(t)}, & f\bigl(x_i^{(t)}\bigr) > f\bigl(Pbest_i^{(t-1)}\bigr),\\[2pt] Pbest_i^{(t-1)}, & \text{otherwise}. \end{cases} \tag{8}$$
Step 5. Keep the position which has the global best fitness:
$$Gbest^{(t)} = \underset{Pbest_i^{(t)},\ i=1,\ldots,K}{\arg\max}\ f\bigl(Pbest_i^{(t)}\bigr), \tag{9}$$
where $Pbest$ is the matrix of best positions experienced by the particles and $Gbest$ is the best position of the swarm.
Step 6. Velocities and positions are updated using Equations (10) and (11), respectively:
$$v_{ij}^{(t+1)} = w\,v_{ij}^{(t)} + c_1 r_1 \bigl(Pbest_{ij}^{(t)} - x_{ij}^{(t)}\bigr) + c_2 r_2 \bigl(Gbest_{j}^{(t)} - x_{ij}^{(t)}\bigr), \tag{10}$$
$$x_{ij}^{(t+1)} = x_{ij}^{(t)} + v_{ij}^{(t+1)}, \tag{11}$$
where $t$ represents the current iteration number and $r_1$ and $r_2$ are uniformly distributed random numbers in the interval $[0,1]$; these random numbers give the algorithm its stochastic character. In Equation (10), the new velocity of a particle is computed from its previous velocity, its own best position so far ($Pbest$), and the best position of the swarm so far ($Gbest$). In Equation (11), the position is updated according to the new velocity.
Step 7. Steps 3 to 6 are repeated until the maximum number of iterations is reached.
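Steps 1 to 7 can be sketched as a compact loop; the sketch below minimizes a test function, and the parameter values are illustrative defaults rather than the paper's settings:

```python
import numpy as np

def pso(f, dim, K=30, iters=200, w=0.7, c1=1.5, c2=1.5, vmax=2.0, seed=0):
    """Minimal PSO sketch (minimization) following Steps 1-7."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, (K, dim))            # Step 2: random positions
    v = rng.uniform(-vmax, vmax, (K, dim))      # Step 2: random velocities
    pbest = x.copy()                            # Step 4: personal bests
    pbest_val = np.array([f(p) for p in x])     # Step 3: fitness values
    g = pbest[np.argmin(pbest_val)].copy()      # Step 5: global best
    for _ in range(iters):
        r1 = rng.random((K, dim))
        r2 = rng.random((K, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)   # Eq. (10)
        v = np.clip(v, -vmax, vmax)             # enforce velocity bound
        x = x + v                               # Eq. (11)
        vals = np.array([f(p) for p in x])
        better = vals < pbest_val               # Step 4: update Pbest
        pbest[better] = x[better]
        pbest_val[better] = vals[better]
        g = pbest[np.argmin(pbest_val)].copy()  # Step 5: update Gbest
    return g, float(f(g))
```

For example, `pso(lambda z: np.sum(z**2), dim=3)` drives the sphere function close to its minimum at the origin.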
5. Binary particle swarm optimization (BPSO)
PSO was initially used to solve optimization problems in continuous space, but many optimization problems are discrete (or binary). A binary version of PSO was therefore developed to solve combinatorial optimization problems in binary space: BPSO, introduced by Eberhart and Kennedy in 1997 [12]. In the BPSO approach, the elements $x_{ij}$ of the $i$-th particle take binary values. The main difference between PSO and BPSO lies in the interpretation of the velocities: the velocity of each particle is expressed through the probability of the corresponding element of the particle taking the value 1. Thus, the positions and velocities of each particle in the $t$-th iteration are updated by Equations (12) and (13), respectively:
$$v_{ij}^{(t+1)} = w\,v_{ij}^{(t)} + c_1 r_1 \bigl(Pbest_{ij}^{(t)} - x_{ij}^{(t)}\bigr) + c_2 r_2 \bigl(Gbest_{j}^{(t)} - x_{ij}^{(t)}\bigr), \tag{12}$$
$$x_{ij}^{(t+1)} = \begin{cases} 1, & r_3 < S\bigl(v_{ij}^{(t+1)}\bigr),\\ 0, & \text{otherwise}, \end{cases} \tag{13}$$
where $v_{ij}$ is the $j$-th element of the $i$-th velocity vector, $S(v) = 1/(1+e^{-v})$ is the sigmoid limiting transformation function, $r_1$, $r_2$ and $r_3$ are uniformly distributed random numbers in the interval $[0,1]$, and $t$ indicates the current iteration number.
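One BPSO iteration per Equations (12) and (13) can be sketched as follows; the parameter defaults are illustrative, not the paper's tuned values:

```python
import numpy as np

def sigmoid(v):
    """Sigmoid limiting transformation S(v) = 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + np.exp(-v))

def bpso_step(x, v, pbest, gbest, w=1.0, c1=2.0, c2=2.0, vmax=4.0, rng=None):
    """One BPSO iteration: real-valued velocity update (Eq. 12), then
    stochastic binary position update via the sigmoid (Eq. 13)."""
    if rng is None:
        rng = np.random.default_rng()
    K, n = x.shape
    r1 = rng.random((K, n))
    r2 = rng.random((K, n))
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    v = np.clip(v, -vmax, vmax)                # keep sigmoid away from 0/1
    x = (rng.random((K, n)) < sigmoid(v)).astype(int)
    return x, v
```

Clipping the velocity at `vmax` keeps the bit-flip probabilities bounded away from 0 and 1, preserving exploration.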
The flow chart of BPSO is shown in Figure 1.
Figure 1.
Flow chart of BPSO.
6. Proposed method
As mentioned before, single case deletion methods have many drawbacks and produce misleading results, especially in the presence of influential subsets. If each observation is labeled as influential or non-influential, identification of influential subsets can be viewed as a combinatorial optimization problem. The strength of BPSO lies in its efficiency and its ability to handle a wide variety of combinatorial optimization problems, including some that are challenging for other methods. Given these properties, the BPSO approach is considered appropriate for solving this optimization problem.
6.1. Structure of particles
The position of each particle, $x_i \in \{0,1\}^n$, is described by binary values 0 or 1. The dimension of the search space, $n$, is the number of observations. $x_{ij} = 0$ marks the $j$-th observation as a potential influential observation and $x_{ij} = 1$ as non-influential in the $i$-th particle. In the final $Gbest$, the positions holding 0 are the observations identified as influential.
6.2. Definition of objective function
The most important part of the BPSO approach is the choice of the objective function for simultaneous detection of influential subsets. The objective function must suit the purpose of the optimization problem. In this study, the following LD-based objective function is used for the detection of influential subsets:
$$f(x_i) = \frac{LD_S}{m+1} = \frac{2\bigl[L(\hat{\theta}) - L(\hat{\theta}_{(S)})\bigr]}{m+1}, \tag{14}$$
where $S$ is the set of observations identified as potential influential observations in the particle and $m$ is the number of potential influential observations, i.e. the number of 0s in the particle. $\hat{\theta}$ is the set of parameter estimates obtained from the full data likelihood and $\hat{\theta}_{(S)}$ is the set of parameter estimates obtained after eliminating the $m$ observations identified as potential influential observations. The log-likelihood function of the linear regression parameters is
$$L(\beta,\sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(Y-X\beta)^{T}(Y-X\beta). \tag{15}$$
$LD_i$ can be effective for identifying influential observations in the linear regression model, but, like the other single case deletion methods, it suffers from masking and swamping. It is therefore more suitable to compute LD after eliminating a set of observations rather than one observation at a time. However, since each data point tends to change LD, if the LD statistic computed for a set of observations were not divided by $m+1$, the proposed BPSO-based approach would tend to eliminate all data points and would lose the ability to discriminate influential ones. Dividing LD by $m+1$ thus prevents non-influential observations from being flagged as influential and allows accurate identification of the influential observations.
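A sketch of the fitness in Equation (14), under the assumption that the parameter vector comprises the regression coefficients and the ML error variance; the function and variable names are ours:

```python
import numpy as np

def objective(particle, X, y):
    """Fitness of one binary particle: likelihood displacement of the
    subset S flagged by the particle's zeros, divided by m + 1."""
    n, p = X.shape
    S = np.flatnonzero(particle == 0)          # candidate influential set
    m = len(S)
    if n - m <= p:                             # too few cases left to fit
        return -np.inf
    b = np.linalg.solve(X.T @ X, X.T @ y)      # full-data estimates
    s2 = np.sum((y - X @ b)**2) / n
    keep = np.flatnonzero(particle == 1)
    Xs, ys = X[keep], y[keep]
    bs = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys) # subset-deleted estimates
    s2s = np.sum((ys - Xs @ bs)**2) / (n - m)
    def ll(beta, sig2):                        # full-data log-likelihood
        r = y - X @ beta
        return -0.5 * n * np.log(2 * np.pi * sig2) - 0.5 * r @ r / sig2
    return 2 * (ll(b, s2) - ll(bs, s2s)) / (m + 1)
```

A particle flagging a gross outlier should score higher than the all-ones particle (which flags nothing and scores zero by construction).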
7. Simulations
In this part, the performance of the BPSO-based approach is assessed using simulated data sets. The proposed approach is compared with the traditional diagnostic methods discussed in Section 3 according to two evaluation criteria: masking probability (MP) and swamping probability (SP). Data generation and model fitting were replicated 100 times for each simulation setting, and the mean masking and swamping probabilities are reported for each method. We considered different sample sizes (n), different contamination rates (c.r.), three different error distributions (heavy-tailed t(3), normal, and centered log-normal) and three different regression models. First, a clean data set (without influential observations) was generated. After the number of influential observations was determined from the sample size and contamination rate, a Bernoulli random variable decided the direction of each contamination: if the draw was 1, the corresponding observation was replaced with an extreme value from the right tail; if 0, with an extreme value from the left tail. Continuous explanatory variables and error terms were contaminated in this way, so influential observations could occur in different directions.
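The contamination step described above can be sketched as follows; the extreme-value magnitude `shift` is an illustrative stand-in for the paper's tail values:

```python
import numpy as np

def contaminate(x, rate, shift=10.0, rng=None):
    """Replace a fraction `rate` of the values in x with extreme values;
    a Bernoulli draw sends each selected case to the right (+shift)
    or left (-shift) tail."""
    if rng is None:
        rng = np.random.default_rng()
    x = x.copy()
    n = len(x)
    k = max(1, int(round(rate * n)))           # number of influential cases
    idx = rng.choice(n, size=k, replace=False) # cases to contaminate
    direction = rng.integers(0, 2, size=k)     # Bernoulli(0.5) draws
    x[idx] = np.where(direction == 1, x[idx] + shift, x[idx] - shift)
    return x, idx
```

The same routine can be applied to a continuous explanatory variable or to the error terms before forming the response.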
The tuning parameters of the BPSO method are set as follows:
inertia weight w = 1,
acceleration coefficients
Maximum value of velocity
Simulation results for the three scenarios are given below.
7.1. Simulation 1
The multiple linear regression model is ,
, , , and
The parameter values are set to be
7.2. Simulation 2
The multiple linear regression model is ,
, and ,
The parameter values are set to be
7.3. Simulation 3
The multiple linear regression model is ,
, , , and
The parameter values are set to be
As can be seen from Tables 1 to 3, for all scenarios, when c.r. = 0.02 and n = 50 (the single influential observation case) all methods perform well. As n and c.r. increase, the performance of the traditional methods decreases. In the presence of influential subsets, Cook's distance and LD in particular fail to flag almost any of the influential observations. To observe the pure effect of sample size on the performance of the methods, the single influential observation case at increasing sample sizes was also examined for Simulation 1. The increase in sample size, however, increases the computing time of the proposed method. To address this, the number of times each observation appears (as a 0) in the Gbest particles is recorded, and dividing this count by the number of BPSO iterations gives the degree of influence (d.i.) of each observation. Flagging observations with high d.i. values as influential saves computing time, so proper results can be obtained with fewer particles and iterations, and the d.i. of each observation is obtained as a by-product. The results are presented in Table 4. As expected, the performance of Cook's distance and LD deteriorates considerably with increasing sample size. Although DFFITS is not as strongly affected as the others, it is worse in terms of SP than the proposed method.
Table 2. Simulation 2 results.
| n = 50 | n = 100 | n = 200 | |||||
|---|---|---|---|---|---|---|---|
| c.r. | Influential measures | MP | SP | MP | SP | MP | SP |
| 0.02 | Cook | 0.013 | 0.000 | 0.307 | 0.000 | 0.804 | 0.000 |
| DFFITS | 0.000 | 0.071 | 0.003 | 0.062 | 0.008 | 0.059 | |
| LD | 0.093 | 0.000 | 0.662 | 0.000 | 0.948 | 0.000 | |
| BPSO | 0.000 | 0.044 | 0.001 | 0.003 | 0.002 | 0.015 | |
| 0.04 | Cook | 0.401 | 0.000 | 0.860 | 0.000 | 0.995 | 0.000 |
| DFFITS | 0.032 | 0.071 | 0.047 | 0.065 | 0.062 | 0.062 | |
| LD | 0.754 | 0.000 | 0.972 | 0.000 | 0.999 | 0.000 | |
| BPSO | 0.002 | 0.026 | 0.002 | 0.003 | 0.002 | 0.022 | |
| 0.06 | Cook | 0.770 | 0.000 | 0.982 | 0.000 | 0.999 | 0.000 |
| DFFITS | 0.104 | 0.072 | 0.142 | 0.067 | 0.163 | 0.064 | |
| LD | 0.945 | 0.000 | 0.996 | 0.000 | 0.999 | 0.000 | |
| BPSO | 0.001 | 0.043 | 0.001 | 0.007 | 0.004 | 0.033 | |
Table 4. Single influential observation in different sample sizes.
| Influential | n = 200 and c.r. = 0.05 | n = 500 and c.r. = 0.002 | ||
|---|---|---|---|---|
| Measures | MP | SP | MP | SP |
| Cook | 0.050 | 0.000 | 0.550 | 0.000 |
| DFFITS | 0.000 | 0.063 | 0.000 | 0.067 |
| LD | 0.600 | 0.000 | 1.000 | 0.000 |
| BPSO | 0.000 | 0.003 | 0.000 | 0.004 |
| n = 800 and c.r. = 0.00125 | n = 1000 and c.r. = 0.001 | |||
| Cook | 0.980 | 0.000 | 1.000 | 0.000 |
| DFFITS | 0.000 | 0.069 | 0.000 | 0.070 |
| LD | 1.000 | 0.000 | 1.000 | 0.000 |
| BPSO | 0.000 | 0.004 | 0.000 | 0.004 |
Table 1. Simulation 1 results.
| n = 50 | n = 100 | n = 200 | |||||
|---|---|---|---|---|---|---|---|
| c.r. | Influential measures | MP | SP | MP | SP | MP | SP |
| 0.02 | Cook | 0.000 | 0.000 | 0.283 | 0.000 | 0.962 | 0.000 |
| DFFITS | 0.000 | 0.063 | 0.000 | 0.054 | 0.000 | 0.050 | |
| LD | 0.000 | 0.000 | 0.727 | 0.000 | 0.999 | 0.000 | |
| BPSO | 0.000 | 0.006 | 0.000 | 0.000 | 0.000 | 0.015 | |
| 0.04 | Cook | 0.726 | 0.000 | 0.997 | 0.000 | 1.000 | 0.000 |
| DFFITS | 0.042 | 0.067 | 0.110 | 0.058 | 0.143 | 0.053 | |
| LD | 0.937 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | |
| BPSO | 0.000 | 0.031 | 0.000 | 0.001 | 0.000 | 0.028 | |
| 0.06 | Cook | 0.988 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 |
| DFFITS | 0.311 | 0.070 | 0.409 | 0.061 | 0.444 | 0.056 | |
| LD | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | |
| BPSO | 0.008 | 0.031 | 0.007 | 0.007 | 0.029 | 0.048 | |
Table 3. Simulation 3 results.
| n = 50 | n = 100 | n = 200 | |||||
|---|---|---|---|---|---|---|---|
| c.r. | Influential measures | MP | SP | MP | SP | MP | SP |
| 0.02 | Cook | 0.007 | 0.000 | 1.000 | 0.000 | 0.0847 | 0.000 |
| DFFITS | 0.000 | 0.066 | 0.316 | 0.064 | 0.099 | 0.043 | |
| LD | 0.043 | 0.001 | 1.000 | 0.003 | 0.902 | 0.000 | |
| BPSO | 0.003 | 0.071 | 0.050 | 0.063 | 0.007 | 0.014 | |
| 0.04 | Cook | 0.684 | 0.000 | 0.879 | 0.000 | 0.540 | 0.000 |
| DFFITS | 0.156 | 0.060 | 0.299 | 0.046 | 0.068 | 0.051 | |
| LD | 0.817 | 0.001 | 0.917 | 0.000 | 0.746 | 0.000 | |
| BPSO | 0.003 | 0.090 | 0.024 | 0.030 | 0.007 | 0.019 | |
| 0.06 | Cook | 0.858 | 0.000 | 0.934 | 0.000 | 0.971 | 0.000 |
| DFFITS | 0.366 | 0.554 | 0.485 | 0.041 | 0.566 | 0.034 | |
| LD | 0.900 | 0.000 | 0.942 | 0.000 | 0.968 | 0.000 | |
| BPSO | 0.022 | 0.137 | 0.088 | 0.045 | 0.171 | 0.027 | |
8. Real data set application
In this section, the performance of the proposed BPSO method is demonstrated on the Hills and CYG OB1 real data sets.
8.1. Hills data set
The Scottish Hills Races data is taken from the 1984 fixtures list of the Scottish Hill Runners Association [2] and is available in the MASS package in R. The data consist of record times in hours (the response variable) against the climb in feet and the distance in miles (two explanatory variables) for 35 races in Scotland.
The influential observations determined by the traditional methods and the proposed method are presented in Table 6; observations found to be influential are written in bold. The cutoff values for the traditional methods are given in Table 5, and the last column of Table 6 shows the d.i. values of the proposed method. Cook's distance identified only Bens of Jura Fell (observation 7) as influential, although the Cook's distance values of observations 18 and 33 appear greater than the others. When observation 7 is removed from the data set and Cook's distance is recomputed, the method cannot find any other influential observation. Bens of Jura Fell (7), Lairig Ghru Fun Run (11) and Knock Hill (18) are determined as influential by DFFITS. On the data set with observation 7 removed, DFFITS flags observations 18 and 33 as influential, while observation 11 is no longer flagged; this indicates that observation 33 was masked and observation 11 was swamped. LD detected only Knock Hill (18). When this observation is removed, observation 7 is flagged as influential, and with both 7 and 18 removed LD is able to flag observation 33; this is evidence of LD's deficiency in the influential subset case. The proposed method determines Bens of Jura Fell (7), Knock Hill (18) and Two Breweries (33) as influential with d.i. of 100%; besides these three, no other observation has a remarkably high d.i. Regression results on the data sets obtained without removing any observations, removing the observations flagged as influential by DFFITS, and removing those flagged by the proposed method are given in Table 7. The third model has a higher adjusted $R^2$ value, and its error variance is reduced by 40% compared to the second model.
The results obtained with the classical methods and the comments on this data set in the literature [2] confirm the results of the proposed method. Unlike the others, the proposed method found all of the influential observations simultaneously.
Table 7. Regression models.
| Model | Observations deleted | Regression model | Adjusted $R^2$ | $\hat{\sigma}^2$ |
|---|---|---|---|---|
| 1 | None | 0.914 | 0.0599 | |
| 2 | 7,11,18 | 0.9721 | 0.0105 | |
| 3 | 7, 18, 33 | 0.9855 | 0.0062 |
8.2. Star cluster CYG OB1 data
The Hertzsprung-Russell (H-R) diagram forms the basis of the theory of stellar evolution. The diagram is essentially a scatter plot of the energy output of stars drawn against their surface temperatures. In this study, data from the H-R diagram of the Star Cluster CYG OB1, calibrated according to Vanisma and De Greve in 1972 [20], were used; they are also available in the HSAUR package in R. The data consist of two variables, the logarithm of the effective temperature at the surface of the star (logst) and the logarithm of its light intensity (logli). There are 47 observations in the CYG OB1 data set.
Table 5. Cutoff values.
| Influence measure | Cutoff points | Cutoff values |
|---|---|---|
| Cook's Distance | 0.8057312 | |
| DFFITS | ||
| LD | 0.10 |
Table 6. Influence measure.
| Cook | DFFITS | LD | BPSO | |
|---|---|---|---|---|
| 1 | 8.134912e−04 | 0.048655302 | 9.985856e−01 | 0.006666667 |
| 2 | 4.898145e−03 | −0.119841802 | 9.989890e−01 | 0.003333333 |
| 3 | 1.379338e−03 | −0.063417046 | 9.987327e−01 | 0.320000000 |
| 4 | 6.721553e−05 | −0.013977456 | 9.985182e−01 | 0.483333333 |
| 5 | 1.476689e−02 | −0.209658654 | 9.994299e−01 | 0.010000000 |
| 6 | 4.924224e−05 | 0.011963432 | 9.985164e−01 | 0.793333333 |
| 7 | 1.894671e+00 | 2.700267176 | 1.063450e−01 | 1.000000000 |
| 8 | 4.547035e−05 | −0.011496146 | 9.985173e−01 | 0.003333333 |
| 9 | 9.953011e−05 | −0.017009519 | 9.985261e−01 | 0.010000000 |
| 10 | 1.389065e−03 | 0.063623681 | 9.986821e−01 | 0.256666667 |
| 11 | 2.141871e−01 | 0.792567587 | 9.507970e−01 | 0.313333333 |
| 12 | 1.003818e−04 | −0.017082009 | 9.985244e−01 | 0.503333333 |
| 13 | 4.714510e−03 | −0.117835726 | 9.993219e−01 | 0.003333333 |
| 14 | 9.295217e−03 | 0.165701135 | 9.992633e−01 | 0.710000000 |
| 15 | 4.858681e−03 | −0.119507194 | 9.991816e−01 | 0.166666667 |
| 16 | 1.491993e−02 | −0.211443081 | 9.997143e−01 | 0.000000000 |
| 17 | 2.382076e−03 | −0.083306655 | 9.985838e−01 | 0.196666667 |
| 18 | 4.063631e−01 | 1.837974703 | 6.705799e−05 | 1.000000000 |
| 19 | 1.030265e−02 | −0.175164254 | 9.996485e−01 | 0.000000000 |
| 20 | 3.881560e−05 | 0.010621505 | 9.985153e−01 | 0.006666667 |
| 21 | 8.494582e−04 | 0.049719433 | 9.985835e−01 | 0.000000000 |
| 22 | 1.788072e−04 | −0.022799788 | 9.985314e−01 | 0.000000000 |
| 23 | 1.662892e−03 | −0.069649493 | 9.987626e−01 | 0.000000000 |
| 24 | 2.070977e−03 | 0.077702579 | 9.986730e−01 | 0.003333333 |
| 25 | 4.842645e−04 | 0.037531204 | 9.985620e−01 | 0.006666667 |
| 26 | 1.322339e−02 | −0.198161215 | 9.993759e−01 | 0.000000000 |
| 27 | 5.202016e−04 | −0.038904505 | 9.985889e−01 | 0.000000000 |
| 28 | 1.031178e−03 | −0.054811489 | 9.986844e−01 | 0.39666666 |
| 29 | 3.856609e−06 | −0.003347895 | 9.985113e−01 | 0.000000000 |
| 30 | 3.698622e−03 | −0.104034328 | 9.989000e−01 | 0.000000000 |
| 31 | 6.414452e−02 | −0.441461308 | 9.969861e−01 | 0.006666667 |
| 32 | 5.411661e−04 | −0.039678286 | 9.985749e−01 | 0.000000000 |
| 33 | 3.780872e−02 | 0.334357794 | 9.970176e−01 | 1.000000000 |
| 34 | 6.434007e−06 | −0.004324246 | 9.985113e−01 | 0.023333333 |
| 35 | 5.213423e−02 | −0.393344183 | 9.958046e−01 | 0.756666667 |
The results of the traditional and proposed methods are presented in Table 8, and the cutoff values for the traditional methods are given in Table 10. While Cook's distance and LD could not find any influential observations, DFFITS determined observations 14, 20, 30 and 34 as influential; observation 11, on the other hand, is masked. The proposed method flagged observations 11, 20, 30 and 34, as well as 7, 9 and 14, and is not affected by masking or swamping. As seen from Table 9, the d.i. values of the proposed method are coherent with Figure 2: the d.i. values of observations 11, 20, 30 and 34 are above 95%, the d.i. of observation 7 equals 91%, and the others are below 90%.
Table 8. Influence measure.
| Cook | DFFITS | LD | BPSO | |
|---|---|---|---|---|
| 1 | 2.144696e−03 | 0.064898025 | 0.9972255 | 0.555 |
| 2 | 4.366058e−02 | 0.299794676 | 0.9969756 | 0.010 |
| 3 | 3.796996e−04 | −0.027259555 | 0.9966510 | 0.780 |
| 4 | 4.366058e−02 | 0.299794676 | 0.9969756 | 0.010 |
| 5 | 1.052847e−03 | 0.045423963 | 0.9969009 | 0.745 |
| 6 | 1.165485e−02 | 0.152395064 | 0.9985706 | 0.230 |
| 7 | 4.458315e−02 | −0.298787936 | 0.9933631 | 0.910 |
| 8 | 8.754312e−03 | 0.131479394 | 0.9973763 | 0.060 |
| 9 | 1.037046e−02 | 0.143895305 | 0.9990088 | 0.885 |
| 10 | 6.410336e−04 | 0.035428120 | 0.9967389 | 0.305 |
| 11 | 6.731445e−02 | 0.365092909 | 0.9844358 | 0.960 |
| 12 | 9.791721e−03 | 0.139567767 | 0.9985757 | 0.445 |
| 13 | 1.090914e−02 | 0.147272727 | 0.9983248 | 0.000 |
| 14 | 8.997550e−02 | −0.438768531 | 0.9834203 | 0.880 |
| 15 | 2.024341e−02 | −0.203193189 | 0.9997022 | 0.005 |
| 16 | 6.007649e−03 | −0.108973115 | 0.9980194 | 0.005 |
| 17 | 4.599398e−02 | −0.313893857 | 0.9951976 | 0.000 |
| 18 | 2.486654e−02 | −0.225560155 | 0.9994076 | 0.695 |
| 19 | 2.818711e−02 | −0.241311168 | 0.9993977 | 0.005 |
| 20 | 1.361546e−01 | 0.522608792 | 0.9660431 | 0.995 |
| 21 | 1.435823e−02 | −0.170068134 | 0.9994774 | 0.005 |
| 22 | 2.242922e−02 | −0.214379731 | 0.9997019 | 0.000 |
| 23 | 1.200820e−02 | −0.154903260 | 0.9989067 | 0.505 |
| 24 | 3.795101e−04 | −0.027250022 | 0.9965969 | 0.330 |
| 25 | 5.010681e−05 | 0.009899299 | 0.9965311 | 0.015 |
| 26 | 3.778877e−03 | −0.086254232 | 0.9975413 | 0.025 |
| 27 | 4.547379e−03 | −0.094740111 | 0.9979581 | 0.085 |
| 28 | 2.560500e−04 | −0.022382300 | 0.9966028 | 0.000 |
| 29 | 1.669607e−02 | −0.183563370 | 0.9993900 | 0.180 |
| 30 | 2.336906e−01 | 0.690666047 | 0.9306623 | 1.000 |
| 31 | 1.173372e−02 | −0.153221943 | 0.9990984 | 0.000 |
| 32 | 2.310193e−03 | 0.067303052 | 0.9968098 | 0.380 |
| 33 | 3.073919e−03 | 0.077728775 | 0.9972747 | 0.015 |
| 34 | 4.132486e−01 | 0.935330295 | 0.8410167 | 0.990 |
| 35 | 1.872580e−02 | −0.194849300 | 0.9995194 | 0.055 |
| 36 | 4.291566e−02 | 0.295605066 | 0.9962718 | 0.000 |
| 37 | 1.810543e−03 | 0.059571816 | 0.9968092 | 0.010 |
| 38 | 3.073919e−03 | 0.077728775 | 0.9972747 | 0.000 |
| 39 | 3.793181e−03 | 0.086335328 | 0.9970997 | 0.000 |
| 40 | 1.520594e−02 | 0.174762084 | 0.9991128 | 0.620 |
| 41 | 4.879765e−03 | −0.098149221 | 0.9979386 | 0.005 |
| 42 | 4.866670e−04 | 0.030862059 | 0.9966447 | 0.000 |
| 43 | 8.421488e−03 | 0.129099215 | 0.9978817 | 0.005 |
| 44 | 6.476989e−03 | 0.113148044 | 0.9979419 | 0.025 |
| 45 | 2.394800e−02 | 0.219550106 | 0.9981752 | 0.015 |
| 46 | 2.882967e−05 | 0.007508708 | 0.9965214 | 0.005 |
| 47 | 8.750756e−03 | −0.131844820 | 0.9984920 | 0.350 |
Table 9. Influence measure.
| Number of obs. | BPSO |
|---|---|
| 7 | 0.910 |
| 9 | 0.885 |
| 11 | 0.960 |
| 14 | 0.880 |
| 20 | 0.995 |
| 30 | 1.000 |
| 34 | 0.990 |
Table 10. Cutoff values.
| Influence measure | Cutoff points | Cutoff values |
|---|---|---|
| Cook's Distance | 0.7039344 | |
| DFFITS | ||
| LD | 0.10 |
Figure 2.
Scatter plot of CYG OB1 data.
9. Conclusion
The influential observation detection methods frequently used in the literature do not work effectively, especially in the presence of multiple influential observations and large sample sizes. Distributional assumptions are made to determine the cutoff points of these methods; however, these assumptions depend strongly on the sample size and the number of explanatory variables and can easily be violated. As the simulation and real data set applications also show, these methods do not give reliable results in many cases. Yet reliably determining the influential observations, and their source, is a very important issue in statistical analysis. Considering these shortcomings of the existing methods, a new method was proposed in this study. Unlike the classical methods, it does not require any distributional assumptions. The simulation and real data applications also show that, even in the case of multiple influential observations, the method works properly and is not affected by masking and swamping effects as much as the others. Unlike the other methods, the sample size does not affect the performance of the proposed method, although the computing time increases with sample size. To address this, BPSO's memory (the record of Gbest particles) is used; the computation time is thereby reduced and the d.i. levels of the observations are obtained.
Funding Statement
The authors were supported by Marmara University (Scientific Research Project Unit, Project Number: FEN-C-YLP-170419-0131).
Disclosure statement
No potential conflict of interest was reported by the authors.
References
- 1. Ab Wahab M.N., Nefti-Meziani S., and Atyabi A., A comprehensive review of swarm optimization algorithms, PLoS ONE 10 (2015), p. e0122827. doi: 10.1371/journal.pone.0122827
- 2. Atkinson A., Comment: Aspects of diagnostic regression analysis, Stat. Sci. 1 (1986), pp. 379–416. doi: 10.1214/ss/1177013624
- 3. Ayob M.N., Yusof Z.M., Adam A., Abidin A.F.Z., Ibrahim I., Ibrahim Z., Sudin S., Shaikh-Husin N., and Hani M.K., A particle swarm optimization approach for routing in VLSI, in 2010 2nd International Conference on Computational Intelligence, Communication Systems and Networks, IEEE, 2010, pp. 49–53.
- 4. Beheshti Z. and Shamsuddin S.M.H., A review of population-based meta-heuristic algorithms, Int. J. Adv. Soft Comput. Appl. 5 (2013), pp. 1–35.
- 5. Belsley D.A., Kuh E., and Welsch R.E., Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, Wiley Series in Probability and Mathematical Statistics, Wiley, New York, 1980.
- 6. Calazan R.M., Nedjah N., and Mourelle L.M., A hardware accelerator for particle swarm optimization, Appl. Soft Comput. 14 (2014), pp. 347–356. doi: 10.1016/j.asoc.2012.12.034
- 7. Cervante L., Xue B., Zhang M., and Shang L., Binary particle swarm optimisation for feature selection: A filter based approach, in 2012 IEEE Congress on Evolutionary Computation, IEEE, 2012, pp. 1–8.
- 8. Chatterjee S. and Hadi A.S., Influential observations, high leverage points, and outliers in linear regression, Stat. Sci. 1 (1986), pp. 379–393. doi: 10.1214/ss/1177013622
- 9. Cook R.D., Detection of influential observation in linear regression, Technometrics 19 (1977), pp. 15–18.
- 10. Cook R.D. and Weisberg S., Residuals and Influence in Regression, Chapman and Hall, New York, 1982.
- 11. Eberhart R. and Kennedy J., Particle swarm optimization, in Proceedings of the IEEE International Conference on Neural Networks, Vol. 4, IEEE, 1995, pp. 1942–1948.
- 12. Kennedy J. and Eberhart R.C., A discrete binary version of the particle swarm algorithm, in 1997 IEEE International Conference on Systems, Man, and Cybernetics. Computational Cybernetics and Simulation, Vol. 5, IEEE, 1997, pp. 4104–4108.
- 13. Kim M.G., A cautionary note on the use of Cook's distance, Commun. Stat. Appl. Methods 24 (2017), pp. 317–324.
- 14. Marandi A., Afshinmanesh F., Shahabadi M., and Bahrami F., Boolean particle swarm optimization and its application to the design of a dual-band dual-polarized planar antenna, in 2006 IEEE International Conference on Evolutionary Computation, IEEE, 2006, pp. 3212–3218.
- 15. Montgomery D.C., Peck E.A., and Vining G.G., Introduction to Linear Regression Analysis, Vol. 821, John Wiley & Sons, New York, 2012.
- 16. Pedrasa M.A.A., Spooner T.D., and MacGill I.F., Scheduling of demand side resources using binary particle swarm optimization, IEEE Trans. Power Syst. 24 (2009), pp. 1173–1181. doi: 10.1109/TPWRS.2009.2021219
- 17. Ren Y. and Bai G., Determination of optimal SVM parameters by using GA/PSO, JCP 5 (2010), pp. 1160–1168.
- 18. Sancar N. and Inan D., Identification of influential observations based on binary particle swarm optimization in the Cox PH model, Commun. Stat. Simul. Comput. 49 (2020), pp. 567–590. doi: 10.1080/03610918.2019.1682156
- 19. Shen Q., Jiang J.H., Jiao C.X., Shen G.L., and Yu R.Q., Modified particle swarm optimization algorithm for variable selection in MLR and PLS modeling: QSAR studies of antagonism of angiotensin II antagonists, Eur. J. Pharm. Sci. 22 (2004), pp. 145–152. doi: 10.1016/j.ejps.2004.03.002
- 20. Vanisma F. and De Greve J., Close binary systems before and after mass transfer, Astrophys. Space Sci. 87 (1972), pp. 377–401.
- 21. Wang W., Chow S.C., and Wei W.W., On likelihood distance for outliers detection, J. Biopharm. Stat. 5 (1995), pp. 307–322. doi: 10.1080/10543409508835116
- 22. Zhang C. and Hu H., Using PSO algorithm to evolve an optimum input subset for a SVM in time series forecasting, in 2005 IEEE International Conference on Systems, Man and Cybernetics, Vol. 4, IEEE, 2005, pp. 3793–3796.


