ABSTRACT
Neuroscience is a combination of different scientific disciplines that investigate the nervous system in order to understand its biological basis. Recently, applications of statistical regression models to the diagnosis of neurodegenerative diseases such as Parkinson’s disease have become very promising. However, well-known statistical regression models may give misleading results for the diagnosis of neurodegenerative diseases when the experimental data contain outlier observations, i.e. observations that lie at an abnormal distance from the other observations. The main contribution of this study is a novel, mathematically supported approach that complements statistical regression models by identifying and treating outlier observations without eliminating them directly, addressing a great and emerging challenge for humankind such as neurodegenerative diseases. Within this approach, a new method named CMTMSOM is proposed, which draws on the powerful convex and continuous optimization technique referred to as conic quadratic programing. The method, based on the mean-shift outlier regression model, is developed by combining the robustness of M-estimation with the stability of Tikhonov regularization. We apply our method and other parametric models to the Parkinson telemonitoring dataset, a real-world dataset in Neuroscience, and compare these methods using well-known method-free performance measures. The results indicate that the CMTMSOM method performs better than the current parametric models considered.
KEYWORDS: Neuroscience, regression, mean-shift outliers model, M-estimation, shrinkage, convex optimization
1. Introduction
Neuroscience [18] consists of a combination of different disciplines that work together to understand the structure and function of the normal and abnormal brain. Its research ranges from the molecular biology of nerve cells to the biological basis of normal and disordered behavior, emotion, and cognition, and today it is one of the most rapidly growing areas of science. Neurodegenerative diseases, such as Parkinson’s disease, Alzheimer’s disease and epilepsy, are research areas of Neuroscience that deeply affect the lives of patients and their families. Nowadays, optimization and statistical learning techniques are used to solve problems in many fields of research [14,32]. Therefore, in our study, we try to contribute to Neuroscience with statistical regression and convex optimization methods by considering a dataset related to Parkinson’s disease (PD), which is the second most common neurodegenerative disorder after Alzheimer’s disease and adversely affects millions of people around the world. All studies to date suggest that age is the single most important risk factor for the onset of this disease, whose incidence increases steeply after the age of 50 [8]. Although no cure is available, medication and surgical intervention can hold back the progression of the disease and alleviate some of its symptoms [40,41]. Thus, early diagnosis is very important in order to improve the patient’s quality of life and to prolong it [21].
Overall, to track the progress of symptoms associated with this disease, the unified Parkinson’s disease rating scale (UPDRS) is used for monitoring by trained medical staff. Tsanas et al. [47] dealt with remote replication of UPDRS assessment by using only simple, self-administered, and noninvasive speech tests, since symptom monitoring is costly and logistically inconvenient for patient and clinical staff. They characterized speech with signal processing algorithms, extracting clinically useful features of average Parkinson’s disease progression, suggesting that their assessments are clinically more useful in terms of accuracy than the clinicians’ estimates. Nowadays, since Parkinson’s disease remains in public discussion, even regarded as of growing importance, there is an increased interest in studies on PD that apply various statistical and data mining methods to evaluate data related to PD. Tsanas et al. [47] used linear and nonlinear regression techniques for mapping the selected subset of features to UPDRS, employing least-squares estimation, and nonparametric classification and regression trees. Ene [9] made an application study on a PD dataset by considering some probabilistic neural network variants for classifying between healthy people and those with PD. Sakar et al. [10] developed a generalization of the predictive model for telemonitoring by using well-known machine learning tools based on the PD dataset. Furthermore, several clustering studies were conducted based on datasets including PD and normal human voice. Bhattacharjee and Mukherjee [6] considered a multi-layer perceptron by using best ranked voice feature and associative rule generation for clustering and classification of PD. In their study, they investigated several clustering and classification methods like k-means and filtered clustering.
As the above studies show, regression and classification models applied to PD datasets are of great importance in Neuroscience. These models are constructed based on Least-Squares (LS) or Maximum-Likelihood (ML) estimation of the model parameters under some specific assumption, such as normality of the error distribution [27,33,34]. However, statistical analysis will not yield useful and accurate results if the dataset contains one or more outlier observations. Therefore, it is important to identify these observations and either bound their influence with robust methods [16,17], i.e. methods that are less sensitive to outliers while the fitted model still represents the properties of the data, or remove the outliers from the dataset. In our study, we use an alternative to the standard robust methods that provides stability of the solution and selects the model features which are necessary to fit the PD data. To this end, we combine the robustness of M-estimation, which was introduced by Huber [16], with the stability of Tikhonov regularization [46]. Tikhonov regularization addresses a particular form of a penalized least-squares criterion:
| (1) |
which is known as a Bridge estimator [12]. Therefore, we first construct a Tikhonov regularization problem for the mean-shift outlier model (MSOM) based on M-estimation with a Huber-type function, having the following objective function:
| (2) |
where is the ith component of the residual vector and is a Huber-type function [16,17]. We call the solution of this problem MTMSOM. Secondly, we treat our problem by continuous optimization, regarded as a complementary technology and an alternative solution technique; the resulting method is called CMTMSOM. A problem class that is closely related to quadratic programing is the second-order cone program (SOCP), which was used in some of our previous works [42,44,45,52]. In these studies, we saw that SOCP performs better than, or is very competitive with, the other methods.
This paper is organized as follows: in Section 2, we give information about the Parkinson’s disease dataset, and we review linear regression and some direct outlier identification methods for it. In Section 3, we consider the mean-shift outlier model and construct the Tikhonov regularization problem based on M-estimation for it. Then, in Section 4, we propose to solve the generated estimation problem by one of the continuous optimization methods. Finally, in Section 5, we give an application study that provides a numerical comparison of the performance of the linear model (LM), MSOM and our new method on our PD telemonitoring dataset.
2. Materials and methods for outlier identification
2.1. Data description
The dataset underlying our study was taken from the UCI Machine Learning Repository; it was created in 2009 by Athanasios Tsanas of the University of Oxford, in collaboration with 10 medical centers in the US and Intel Corporation, which developed the telemonitoring device to record speech signals. The data were collected by Tsanas et al. [47] by using the Intel AHTD, a telemonitoring system designed to facilitate remote, internet-enabled measurement of a variety of PD-related motor impairment symptoms. This dataset has 16 features and 5875 observations, and it is composed of a wide range of biomedical voice measurements from 42 people with early-stage Parkinson’s disease, recruited to a six-month trial of a telemonitoring device for remote symptom progression monitoring. The recordings were automatically captured in the patients’ homes. The attributes involved in the dataset are as follows: age, gender (0: male, 1: female), time interval from baseline recruitment date, clinician’s motor UPDRS score (motor UPDRS), clinician’s total UPDRS score (total UPDRS), and 16 biomedical voice measures. The voice measures can be listed as follows: several measures of variation in fundamental frequency, namely Jitter (%), Jitter (Absolute), Jitter: RAP, Jitter: PPQ5 and Jitter: DDP; several measures of variation in amplitude, namely Shimmer, Shimmer (dB), Shimmer: APQ3, Shimmer: APQ5, Shimmer: APQ11 and Shimmer: DDA; two measures of the ratio of noise to tonal components in the voice, NHR and HNR; a nonlinear dynamical complexity measure named RPDE; the signal fractal scaling exponent DFA; and finally a nonlinear measure of fundamental frequency variation named PPE.
2.2. Methods for outlier identification for PD dataset
Our main task is to construct a predictive model that yields maximally accurate predictions of UPDRS. There are several methods to obtain predictive models by using features related to UPDRS. The linear regression model is the best known of them. The standard linear model, LM [33,34], with $n$ observations of the output variable for UPDRS and $p$ independent or input variables for the PD dataset, is given by
| $y = X\beta + \varepsilon,$ | (3) |
where $y$ is an $n$-vector of data for the response variable, $X$ is a full-rank $(n \times p)$ data matrix for the independent variables with $p$ column vectors, $\beta$ is a $p$-vector of unknown parameters, and $\varepsilon$ is an $n$-vector of independent, identically distributed random errors. Here, the purpose is to estimate the $p$-vector of unknown parameters from the given input and output values so that the error is minimized, in order to determine the most accurate predictive model for UPDRS over the entire dataset. However, it is not possible to obtain a predictive model that provides maximally accurate predictions of UPDRS if the dataset contains one or more outlier observations. Therefore, we should identify the outlier observations for UPDRS and either bound their influence through methods which are less sensitive to outliers or remove them from the dataset.
For LM, there are different approaches for outlier detection [2,31]. These approaches fall into two categories: direct approaches and indirect approaches, the latter involving residuals from a robust fit. Hadi and Simonoff [13] presented a method of the direct type. They partitioned the data into a set of ‘clean’ data points of size n − L and a set of L points that contain the potential outliers, where L is the number of observations that are the most likely candidate outliers. The potential outliers are then tested to see how extreme they are relative to the clean subset, using an appropriate diagnostic measure such as the internally studentized residual [38] or Cook’s distance [7]. According to Ruppert [39], the success of the procedure depends on the initial clean subset of the data. The procedure works well for low-leverage outliers, but it may fail when the sample contains a set of several high-leverage outliers.
An indirect approach to outlier identification works through a robust regression estimate. The aim of robust regression is to provide resistant (stable) results in the presence of outliers. In order to achieve this stability, robust regression limits the influence of outliers. The most widely used robust methods for multivariate data analysis are M-estimates (ML type estimators) introduced by Huber [16,17], LMS-estimates (Least Median of Squares) [36], LTS estimates (Least Trimmed Squares) [37] and S-estimates [7,39], which are a generalized form of the LMS and LTS estimates. The common feature of all of these methods is that they fit the model without deleting outlying observations from the dataset. The mean-shift outlier model [20] is one of these robust methods, for which we proposed a new solution method based on Tikhonov regularization by CQP [45]. In the current study, we aim to obtain Huber’s M-estimator, combined with Tikhonov regularization, by solving a CQP problem.
2.2.1. Mean-shift outlier model (MSOM)
The MSOM is a very important outlier-detection method, and it can be considered for the detection of outlier observations in the PD dataset since it provides the same residual sum of squares as the model fitted after omitting the relevant observations. Therefore, this approach is very convenient for studying the regression model in the presence of outliers in datasets related not only to PD, but also to all scientific and technological fields.
The MSOM is given by
| (4) |
where is the ith unit vector, i.e. with 1 at the ith position. In this model, it is assumed that the ith observation deviates systematically by some value from the model. Then, the ith observation would contain a different bias term from the rest of the observations, and would hence be considered an outlier. For checking this fact, the hypothesis
| (5) |
is tested against the alternative hypothesis
| (6) |
using the likelihood-ratio test statistic:
| (7) |
where is the residual sum of squares in the model containing all the n observations, and is the residual sum of squares in the model [33].
3. Parameter estimation for MSOM
3.1. MSOM revisited by M-estimation
In this section, we will use the M-estimation method to estimate the parameters in the MSOM for the PD dataset. Let us assume that m observations, where m < n, have been found to deviate systematically from the corresponding model by one of the direct methods, such as that of Hadi and Simonoff [13], the test statistic given by Equation (7), Cook’s distance [7] or Studentized residuals [4]; that is, these observations are outliers. Then, MSOM can be written as
| (8) |
where is a full rank -matrix of independent variables, is an -matrix with indicator variables, and is an vector of the regression coefficients (intercepts) of the indicator variables. Then, MSOM can be stated as
| (9) |
where is an -block matrix constructed by matrices and and is a -vector constructed by vectors and . The M-estimator for MSOM is obtained by solving the minimization problem
| (10) |
| (11) |
In Equation (11), is the ith residual given as , and is a nonconstant convex function that gives the contribution of each residual to the objective function and has the following properties:
| (i) |
To determine the M-estimator for MSOM, several nonconstant convex functions can be considered, such as Tukey’s Bisquare function [3] or the following Huber-type function [16,17], which will be used in this study:
| (12) |
where the tuning constant is selected to achieve high efficiency. For Huber-type functions, a tuning constant equal to 1.345 times the standard deviation of the errors produces 95% efficiency when the errors are normally distributed, while still providing protection against outliers. Preferably, a robust measure of spread is used in place of the standard deviation of the residuals. For this measure, generally, the value
| (13) |
is chosen, where Med denotes the median (middle value) [11].
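As an illustration of the quantities discussed above, the following minimal Python sketch implements a Huber-type loss, the corresponding IRLS weight, and a robust scale estimate. It is not the authors’ code: the function names, the threshold symbol `k`, and the particular scale estimate (median absolute deviation from the median, divided by 0.6745) are assumptions chosen for illustration, and Equation (13) may use a slightly different variant.

```python
import statistics


def huber_rho(r, k):
    """Huber-type loss: quadratic for |r| <= k, linear beyond the threshold k."""
    a = abs(r)
    return 0.5 * r * r if a <= k else k * a - 0.5 * k * k


def huber_weight(r, k):
    """Weight psi(r)/r used in reweighted fits: 1 inside the threshold, k/|r| outside."""
    a = abs(r)
    return 1.0 if a <= k else k / a


def mad_scale(residuals):
    """Robust scale (assumed variant): median absolute deviation from the median / 0.6745."""
    med = statistics.median(residuals)
    return statistics.median(abs(r - med) for r in residuals) / 0.6745


# Hypothetical residuals with one gross outlier.
residuals = [-1.0, -0.5, 0.0, 0.5, 1.0, 12.0]
k = 1.345 * mad_scale(residuals)          # threshold for roughly 95% efficiency
weights = [huber_weight(r, k) for r in residuals]
```

With this choice, residuals inside the threshold keep full weight, while the gross outlier is downweighted rather than deleted.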
The basic algorithm for computing the M-estimator for regression is iteratively reweighted least squares (IRLS) [15]. A weighted least-squares (WLS) fit is carried out inside an iteration loop. For each iteration, a set of weights for the observations is used in the least-squares fit. The weights are constructed by applying a weight function to the current residuals. Initial weights are based on residuals from an initial fit. To solve the problem in Equation (10), two cases are considered:
- (i) Let . Then, the problem

is obtained, where is the Euclidean norm. This problem is solved by the least-squares estimation method [33]. When , the M-estimator of the parameter vector also is the LS estimator, and this estimator is the vector which is the solution of n equations, as shown below depending on the function :

| (14) |

Then, the least-squares estimate of is chosen as the value which minimizes the residual sum of squares (RSS):

| (15) |

This optimal vector of value for and the estimate of are obtained as follows:

| (16) |
| (17) |

- (ii) Let . Then, the problem in Equation (10) for Huber-type M-estimation turns into the problem

| (18) |

This problem can be solved by the IRLS method. Hence, we get and . This problem can be written as , where is the weight matrix with diagonal elements . Hence, we find the estimate of as

| (19) |
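The IRLS procedure described in this subsection can be sketched as follows for the simplest setting of one predictor plus an intercept. This is an illustrative sketch only, not the authors’ MATLAB implementation: the function names, the default constant `c = 1.345`, the scale estimate and the stopping rule are all assumptions.

```python
import statistics


def weighted_line_fit(x, y, w):
    """Weighted least-squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    sw = sum(w)
    xm = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ym = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xm) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xm) * (yi - ym) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return ym - b1 * xm, b1


def irls_huber(x, y, c=1.345, max_iter=50, tol=1e-10):
    """Huber M-estimate of a line via iteratively reweighted least squares."""
    w = [1.0] * len(x)
    b0, b1 = weighted_line_fit(x, y, w)            # initial (ordinary LS) fit
    for _ in range(max_iter):
        res = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
        med = statistics.median(res)
        scale = statistics.median(abs(r - med) for r in res) / 0.6745
        if scale == 0:                             # degenerate case: stop reweighting
            break
        k = c * scale
        # Huber weights: 1 for small residuals, k/|r| for large ones.
        w = [1.0 if abs(r) <= k else k / abs(r) for r in res]
        new_b0, new_b1 = weighted_line_fit(x, y, w)
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1, w


# Hypothetical data: an exact line y = 2 + 3x with one contaminated observation.
x = list(range(10))
y = [2 + 3 * xi for xi in x]
y[9] += 50
ls_fit = weighted_line_fit(x, y, [1.0] * len(x))
robust_b0, robust_b1, final_weights = irls_huber(x, y)
```

In this toy example the contaminated observation stays in the dataset but receives a weight below one, so the robust slope is pulled less toward the outlier than the ordinary LS slope.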
3.2. Tikhonov regularization problem for M-estimator of MSOM (MTMSOM)
The purpose of robust regression is to produce solutions not affected by outliers. M-estimator is one of the most important robust methods, as mentioned in Subsection 3.1. However, one may encounter an ill-posed problem (irregular or unstable) while trying to find the M-estimators for MSOM. Therefore, Tikhonov regularization problem [1,3] is considered to make an ill-posed problem well-posed, and get a more powerful solution against outliers, while searching M-estimators of the parameters in the MSOM regression model based on the related dataset. Furthermore, we should note that a Tikhonov regularization problem considers model features which are necessary to fit the data, whereas model features, which are unnecessary to fit the data, will conversely be removed by the regularization. This feature of Tikhonov regularization allows us to obtain easily interpretable models.
To obtain M-estimator for MSOM by using Huber-type function, one of the options that can be chosen to construct the Tikhonov regularization for the case is the damped least-squares problem given subsequently:
| (20) |
where is considered to be a penalty parameter that establishes a trade-off between both accuracy, i.e. a small sum of error squares, and not too high a complexity. There are different criteria to obtain the value just if possible: (a) The discrepancy principle [24] between and , (b) the plotting of versus on a log–log scale where the curve of optimal values of and often takes on a pronounced L-shape [1], and (c) L-curve criterion, in which the value of that gives the solution closest to the corner of the L-curve, is selected.
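To make the role of the penalty parameter concrete, the following sketch solves a damped least-squares problem for two predictors in closed form. It assumes the simplest case in which the regularization matrix is the identity, which may differ from the matrix used in Equation (20); the function name and the data are hypothetical.

```python
import math


def ridge_2d(X, y, lam):
    """Minimize ||y - X b||^2 + lam * ||b||^2 for two predictors (identity penalty matrix).

    Solves the regularized normal equations (X^T X + lam * I) b = X^T y by Cramer's rule.
    """
    a11 = sum(r[0] * r[0] for r in X) + lam
    a12 = sum(r[0] * r[1] for r in X)
    a22 = sum(r[1] * r[1] for r in X) + lam
    c1 = sum(r[0] * yi for r, yi in zip(X, y))
    c2 = sum(r[1] * yi for r, yi in zip(X, y))
    det = a11 * a22 - a12 * a12
    return ((a22 * c1 - a12 * c2) / det, (a11 * c2 - a12 * c1) / det)


# Hypothetical, nearly collinear predictors: the unpenalized problem is ill-conditioned.
X = [[1.0, 1.01], [2.0, 1.99], [3.0, 3.02]]
y = [2.1, 3.9, 6.2]
for lam in (1e-6, 1e-2, 1.0):
    b = ridge_2d(X, y, lam)
    print(lam, b, math.hypot(*b))   # the solution norm shrinks as lam grows
```

A larger penalty parameter trades a slightly larger residual for a smaller, more stable coefficient vector, which is the trade-off that the selection criteria (a) to (c) try to balance.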
As mentioned previously, the purpose of robust regression is to provide results limiting the impact of outliers or resistant (stable) to outliers. Therefore, we shall consider Tikhonov regularization problem to obtain M-estimation for MSOM; then we will solve them by CQP. Tikhonov regularization problem is generated in both cases when
In the case : we may obtain as
| (21) |
In the case : we may obtain as
| (22) |
If < 0, then Equation (22) appears as follows:
| (23) |
where and are given as
| (24) |
respectively. Let be a matrix with ith diagonal elements ; then we obtain objective function as follows:
| (25) |
where is an -vector with ith element connected to , expressed as Equation (24), and an -vector and an -matrix, respectively. Then, a Tikhonov regularization problem to get an M-estimator for MSOM can be set as
| (26) |
We regard as a penalty parameter mentioned above.
4. On CQP and its application to M-estimation for MSOM (CMTMSOM)
4.1. Conic quadratic programing
Convex programing deals with problems that consist of minimizing a convex function over a convex set. Such programs arise frequently in many different application fields and have many important properties, such as strong duality theory and the fact that any local minimum is a global minimum. These programs are not only computationally tractable but also allow for theoretically efficient solution methods. Convex programing consists of several important specially structured classes of problems, such as semidefinite programing, conic quadratic programing, and geometric programing. Let us give some information about conic quadratic programing by benefiting from [5].
Geometrically, a convex program can be represented in the conic form as
| (27) |
where is a cone (closed, pointed, convex and with a nonempty interior), and , is a linear embedding. A conic quadratic problem is a conic problem,
| (28) |
for which the cone K is a direct product of several so-called ice-cream cones, also known as second-order cones or Lorentz cones, given as
| (29) |
where is an -dimensional second-order cone defined by
| (30) |
The optimization problem, based on the cone, can be solved by primal-dual interior point methods.
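As a small illustration of the cone just defined, the sketch below tests whether a vector belongs to a second-order (Lorentz) cone, using the standard convention that the last component must dominate the Euclidean norm of the remaining ones. The function name and the tolerance are assumptions made for this example.

```python
import math


def in_second_order_cone(x, tol=1e-12):
    """Return True if x = (x_1, ..., x_n) satisfies sqrt(x_1^2 + ... + x_{n-1}^2) <= x_n."""
    *head, last = x
    return math.sqrt(sum(v * v for v in head)) <= last + tol


# A norm bound ||r||_2 <= t is the same as (r, t) lying in the cone.
r = [3.0, 4.0]
print(in_second_order_cone(r + [5.0]))   # boundary point of the cone
print(in_second_order_cone(r + [4.9]))   # violates the norm bound
```

This is why constraints that bound a Euclidean norm by a scalar variable can be written as conic quadratic constraints.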
Equation (29) shows that a conic quadratic program is an optimization problem with a linear objective function and finitely many ice-cream constraints
| (31) |
where
| (32) |
is the partition of the data matrix corresponding to the partition of in Equation (26). Thus, conic quadratic program can be written as
| (33) |
Partitioning the data matrix by
| (34) |
with being of the type , the problem can be written as
| (35) |
This is a most explicit form of the conic problem and the one which we will use. In this form, are matrices of the same row dimension as . Furthermore, the lengths of the column vectors are the column dimensions of the matrices , and are column vectors of the same dimension as ; finally, are reals [5]
Consequently, the problem dual to Equation (28) is
| (36) |
If we write with -dimensional blocks , then the dual problem can be stated as follows:
| (37) |
If it is taken with a scalar component , it can be shown that the following form is the problem dual to Equation (35):
| (38) |
The design variables in Equation (38) are column vectors , having the same dimensions as the vectors , and reals (i = 1,2, … , k). The problems in Equations (35) and (38) are standard forms of a conic quadratic problem and of its dual.
4.2. Application of CQP to M-Estimation to MSOM (CMTMSOM)
In this subsection, we solve the problem in Equation (26) by using Conic Quadratic Problems with their Interior Point Methods, IPMs, as an alternative to Exterior or Penalty Methods which can be applied to the formulation of Tikhonov Regularization. CQP and IPMs [25,35] are very well understood and numerically very much developed now, since they possess the form of well-structured convex optimization problems [5] with their advanced theory and powerful methods. Furthermore, our more mathematical approach is a model-based one which is an alternative to more model-free approaches from statistics which, e.g. employ generalized cross-validation. In fact, we are on a way toward a unified treatment of our data mining problem with the help of modern mathematical optimization.
Let us deal with the Tikhonov regularization problem in Equation (26) through CQP. We can easily formulate Equation (26) as a CQP problem. Indeed, based on an appropriate choice of a bound S, we state the following optimization problem:
| (39) |
Let us underline that this choice of S should be the outcome of a careful learning process, with the help of model-free or model-based methods [14]. In our study, this bound is selected by a trial and error approach as an initial treatment. In future research, S might also be regarded as a decision variable. In Equation (39), we have the least-squares objective function, which is a nonlinear function, and an inequality constraint function, which is requested to be nonnegative for feasibility. We can pass from the optimization problem in Equation (39) to an equivalent one with a linear objective. To that aim, it is enough to add a new design or height variable, say, t, and rewrite the problem of Equation (39) equivalently as follows:
| (40) |
or, equivalently again,
| (41) |
In fact, the problem in Equation (41) is a CQP problem as in Equation (35) with
In order to state optimality conditions for this problem, we firstly reformulate the problem of Equation (41) as follows:
| (42) |
where are and dimensional second-order (or Lorentz) cones. The dual problem to our primal problem Equation (41) is given by
| (43) |
Moreover, is a primal-dual optimal solution if and only if [5,22,52]:
| (44) |
IPMs, which were introduced by Karmarkar [19], can be used to solve linear programing (LP) problems and well-structured classes of convex optimization problems, such as CQP [26,35]. These methods have been applied effectively since 1984 [19], and they are very efficient in solving CQP, semidefinite and geometric optimization problems. IPMs exhibit and exploit different properties of LP and CQP that make them very attractive for large-scale optimization. These algorithms have the advantage of employing the structure of the program, of allowing better complexity bounds and of displaying a much better practical performance.
5. Numerical application
The numerical application of the method is conducted by using Parkinson Telemonitoring Dataset from UCI: Machine Learning Repository (available at https://archive.ics.uci.edu/ml/datasets/Parkinsons+Telemonitoring). The attribute lists of this dataset are shown in Table 1, with their names used in the model.
Table 1.
The attributes of Parkinson’s telemonitoring dataset.
| Name for CMTMSOM | Variable | Name for CMTMSOM | Variable |
|---|---|---|---|
| Y1 | mUPDRS | X8 | Shimmer APQ3 |
| Y2 | tUPDRS | X9 | Shimmer APQ5 |
| X1 | Jitter (%) | X10 | Shimmer APQ11 |
| X2 | Jitter (Abs) | X11 | Shimmer DDA |
| X3 | Jitter RAP | X12 | NHR |
| X4 | Jitter PPQ5 | X13 | HNR |
| X5 | Jitter DDP | X14 | RPDE |
| X6 | Shimmer | X15 | DFA |
| X7 | Shimmer (dB) | X16 | PPE |
Table 2 reports several descriptive statistics for the response (output) and predictor (input) variables. The results show that the attributes have different ranges of values. For that reason, all attributes are normalized in order to make them comparable for modeling.
Table 2.
Descriptive statistics of the selected variables.
| Variable | Mean | Standard Deviation | Median | Skewness | Kurtosis |
|---|---|---|---|---|---|
| Y1 | 21.3 | 8.13 | 20.87 | 0.08 | −0.94 |
| Y2 | 29.02 | 10.7 | 27.58 | 0.27 | −0.36 |
| X1 | 0.01 | 0.01 | 0 | 6.45 | 67.41 |
| X2 | 0 | 0 | 0 | 3.28 | 18.13 |
| X3 | 0 | 0 | 0 | 6.94 | 78.44 |
| X4 | 0 | 0 | 0 | 7.58 | 81.47 |
| X5 | 0.01 | 0.01 | 0.01 | 6.94 | 78.44 |
| X6 | 0.03 | 0.03 | 0.03 | 3.31 | 15.22 |
| X7 | 0.31 | 0.23 | 0.25 | 3.1 | 13.07 |
| X8 | 0.02 | 0.01 | 0.01 | 3.1 | 14.7 |
| X9 | 0.02 | 0.02 | 0.02 | 3.7 | 19.22 |
| X10 | 0.03 | 0.02 | 0.02 | 3.41 | 19.14 |
| X11 | 0.05 | 0.04 | 0.04 | 3.1 | 14.71 |
| X12 | 0.03 | 0.06 | 0.02 | 6.55 | 52.54 |
| X13 | 21.68 | 4.29 | 21.92 | −0.81 | 2.5 |
| X14 | 0.54 | 0.1 | 0.54 | −0.04 | −0.07 |
| X15 | 0.65 | 0.07 | 0.64 | 0.28 | −0.88 |
In order to investigate the relation between the variables, correlation matrix is obtained for the given dataset. The results show that seven variables are highly correlated with other variables. All these variables are significantly linked with each other, since they belong to the measures of variation in fundamental frequency and in amplitude. Moreover, the correlation between and is very high (0.94), and they are positively correlated with each other. Therefore, we eliminate and by conducting a multicollinearity test in this study. After removing the variables that are correlated to each other by more than 0.9, the variables selected to model this dataset are determined, and their correlation matrix is introduced in Table 3. By this table, acceptable results have been obtained according to the statistical modeling theory.
Table 3.
The correlation among the variables selected into the CMTMSOM model.
| | Y1 | X2 | X3 | X10 | X11 | X12 | X13 | X14 | X15 | X16 |
|---|---|---|---|---|---|---|---|---|---|---|
| Y1 | 1 | | | | | | | | | |
| X2 | 0.0508 | 1 | | | | | | | | |
| X3 | 0.0726 | 0.8446 | 1 | | | | | | | |
| X10 | 0.1365 | 0.5899 | 0.6030 | 1 | | | | | | |
| X11 | 0.0842 | 0.6238 | 0.6502 | 0.8856 | 1 | | | | | |
| X12 | 0.0749 | 0.6999 | 0.7923 | 0.7115 | 0.7327 | 1 | | | | |
| X13 | −0.1570 | −0.7064 | −0.6414 | −0.7779 | −0.7806 | −0.6844 | 1 | | | |
| X14 | 0.1286 | 0.5470 | 0.3828 | 0.4807 | 0.4368 | 0.4166 | 0.6590 | 1 | | |
| X15 | −0.1162 | 0.3522 | 0.2148 | 0.1796 | 0.1307 | 0.0220 | −0.2905 | −0.1920 | 1 | |
| X16 | 0.1624 | 0.7878 | 0.6706 | 0.6234 | 0.5767 | 0.5646 | 0.7587 | −0.5660 | 0.3946 | 1 |
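The correlation-based variable screening described before Table 3 could be sketched as follows. The text does not state which member of a highly correlated pair is dropped, so the greedy, order-dependent rule below is an assumption, as are the function names and the toy data.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def drop_correlated(columns, threshold=0.9):
    """Greedy filter: keep a column only if |r| <= threshold with every column kept so far."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical columns: "b" is almost a multiple of "a", "c" is only weakly related to "a".
columns = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.5],
    "c": [5, 1, 4, 2, 3],
}
selected = drop_correlated(columns, threshold=0.9)
```

Because the rule is order-dependent, the order in which candidate variables are listed determines which of two highly correlated variables survives.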
In this study, some of the variables, such as age and gender, will not be used in modeling; instead, they will be employed for interpretation of the final model. In fact, the number of people and the age ranges differ according to gender. In this dataset, there are 28 males, and 4008 observations out of the total of 5875 belong to males. The rest are obtained from females (14 people with 1867 observations). A pie chart of the given dataset, according to gender, is provided in Figure 1. While the age of the males ranges between 49 and 78 years, the age of the females ranges between 36 and 85. The corresponding age and gender patterns are given in Figure 2.
Figure 1.
Pie chart of Parkinson’s dataset according to gender.
Figure 2.
Age and gender patterns for Parkinson’s data set.
In order to find the outliers in this dataset for males and females, the outlier detection procedure demonstrated in Figure 3 is applied [13,23]. First of all, a multivariate linear model is constructed and the residuals are obtained. Then the outlier detection measures, namely studentized residuals, leverage and measures of influence such as Cook’s distance, as well as model performance measures, are calculated. Detailed information on these measures can be found in Montgomery and Peck [23]. After that, in order to find the actual outliers, the potential outliers are removed from the dataset, and the corresponding model fit is checked. This procedure is repeated until all of the outlier observations in the dataset are detected.
Figure 3.
Flowchart for the outlier detection procedure.
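The diagnostic measures used in this procedure can be illustrated with the sketch below, which computes leverage, internally studentized residuals and Cook’s distance for a regression with a single predictor and an intercept. The study itself uses a multivariate linear model; the single-predictor restriction, the function name and the toy data are simplifying assumptions.

```python
import math


def regression_diagnostics(x, y):
    """Return a list of (leverage, studentized residual, Cook's distance) per observation."""
    n, p = len(x), 2                       # p counts the intercept and the slope
    xm, ym = sum(x) / n, sum(y) / n
    sxx = sum((xi - xm) ** 2 for xi in x)
    slope = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y)) / sxx
    intercept = ym - slope * xm
    res = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in res) / (n - p)  # residual variance estimate
    out = []
    for xi, e in zip(x, res):
        h = 1.0 / n + (xi - xm) ** 2 / sxx  # leverage of the observation
        t = e / math.sqrt(s2 * (1.0 - h))   # internally studentized residual
        cook = t * t * h / (p * (1.0 - h))  # Cook's distance
        out.append((h, t, cook))
    return out


# Hypothetical data: an exact line with one shifted observation at index 4.
x = list(range(10))
y = [2 + 3 * xi for xi in x]
y[4] += 10
diagnostics = regression_diagnostics(x, y)
suspect = max(range(len(x)), key=lambda i: abs(diagnostics[i][1]))
```

Observations with a large studentized residual or Cook’s distance are the candidates that the flowchart in Figure 3 removes and re-checks.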
For each gender, we obtained the number of potential outliers, as presented in Figure 4. Although the number of females is much lower than the number of males, most of the outlier observations belong to the group of females. After the detection of outliers within the given dataset, we constructed three models, LM, MSOM and CMTMSOM, on our normalized dataset. In this study, the applications are performed in MATLAB; the model parameters are obtained by MATLAB code run together with the optimization software MOSEK™.
Figure 4.
Number of outlier observations according to gender.
To evaluate the model efficiency, model-free performance measures, namely the Mean Absolute Error (MAE), Mean Squared Error (MSE), Correlation Coefficient (r), Multiple Coefficient of Determination (R²), Adjusted R², and Mean Absolute Percentage Error (MAPE), are calculated. The results are presented as percentage improvements in model performance, as stated in Table 4.
Table 4.
Performances of the models based on improvements in percentage.
| Measure | LM → MSOM | LM → CMTMSOM | MSOM → CMTMSOM |
|---|---|---|---|
| MAE | 2% | 4% | 2% |
| MAPE | 2% | 4% | 2% |
| MSE | 3% | 5% | 2% |
| r | 10% | 18% | 6% |
| R2 | 22% | 39% | 13% |
| Adj-R2 | 11% | 28% | 15% |
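For reference, the performance measures compared in Table 4 can be computed as in the following sketch, which uses their usual textbook definitions. The function name, the dictionary keys and the convention that `n_params` counts predictors excluding the intercept are assumptions; the exact definitions used in the study may differ in detail.

```python
import math


def performance_measures(y, yhat, n_params):
    """Compute MAE, MSE, MAPE (in percent), r, R^2 and adjusted R^2.

    y: observed values (must be nonzero for MAPE); yhat: predictions;
    n_params: number of predictors, excluding the intercept.
    """
    n = len(y)
    err = [a - b for a, b in zip(y, yhat)]
    ym, pm = sum(y) / n, sum(yhat) / n
    sst = sum((a - ym) ** 2 for a in y)            # total sum of squares
    sse = sum(e * e for e in err)                  # residual sum of squares
    spp = sum((b - pm) ** 2 for b in yhat)
    r2 = 1.0 - sse / sst
    return {
        "MAE": sum(abs(e) for e in err) / n,
        "MSE": sse / n,
        "MAPE": 100.0 * sum(abs(e / a) for e, a in zip(err, y)) / n,
        "r": sum((a - ym) * (b - pm) for a, b in zip(y, yhat)) / math.sqrt(sst * spp),
        "R2": r2,
        "AdjR2": 1.0 - (1.0 - r2) * (n - 1) / (n - n_params - 1),
    }


# Hypothetical observed and predicted values.
measures = performance_measures([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0], n_params=1)
```

Lower values are better for MAE, MSE and MAPE, whereas higher values are better for r, R² and adjusted R².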
The results show that the developed model performs fairly well for the given dataset according to all performance measures. This is due to the fact that our new model combines the power of MSOM and M-estimation in modeling the data with the effective estimation of CQP. The second best model is MSOM, which can be considered to be in an intermediate position with regard to the method-free performance measures. The worst model is LM, when compared to CMTMSOM and MSOM, since it does not handle the effects of outlier observations. It should be noted that ‘↓’ and ‘↑’ represent decreasing and increasing behavior of the measure values, respectively. In addition, the symbol ‘→’ shows the transition based on improvements between both models. For example, LM → MSOM and MAE ↓ mean that MSOM displays a better performance than LM based on the MAE measure; a decrease in the values of MAE, MSE and MAPE indicates a better result. On the other hand, a transition from LM combined with ‘↑’ for r, R² or adjusted R² means that the other model reveals a better performance than LM based on that measure; an increase in the values of r, R² and adjusted R² indicates a better result. Moreover, Figure 5 displays the relation between the number of parameters in the LM, MSOM and CMTMSOM models and some of the performance measures, such as MAE and RMSE. Although the number of parameters for MSOM and CMTMSOM is much larger than for LM, MSOM and CMTMSOM perform much better than LM.
Figure 5.
Comparison of the number of parameters of the LM, MSOM, and CMTMSOM models in terms of MAE and RMSE.
6. Concluding remarks
This paper contributes to outlier detection for regression in Neuroscience by means of mean-shift outlier regression, combining modern methods of continuous optimization, especially CQP, with M-estimation, which is a robust method. First, we represented the mean-shift outlier regression problem as a CQP problem based on M-estimates. Second, we employed the mean-shift outlier model as a predictive model for the diagnosis of PD by means of UPDRS, on the basis of a PD dataset. In this way, we built a bridge between methods of statistical learning, inverse problems, and the powerful tools available for well-structured convex optimization.
In this paper, we studied a relatively high-dimensional dataset for the setting […]. Since CMTMSOM works effectively on such complex and sensitive datasets, the results for more complete and higher-dimensional Neuroscience datasets […] are expected to indicate a better performance compared with other models that do not take outliers into account [48,49,50,51].
We plan to conduct a comparative study that considers our method, the generalized partial linear model (GPLM) [25], and the conic generalized partial linear model (CGPLM) [43] on the basis of certain datasets, including ones from other dementia-related diseases, such as Alzheimer’s disease. Furthermore, we intend to pursue further robustification by employing RCMARS [28], RMARS [29], and RCGPLM [30].
With the developed method, we have sought to make progress in science, especially through mathematics and statistics, by providing novel and powerful elements and modules that allow for better early warning systems in the field of Neuroscience. We use these notions with the highest care, well aware of the complexity of neurodegenerative diseases and of any subject with very pronounced human factors. The method can gradually become part of clinical and medical computer programs. Moreover, our codes, graphics, and tables can be developed further into elements of a Graphical User Interface (GUI) that any practitioner could use conveniently and meaningfully. Eventually, the whole methodology, its modules, and future GUIs can be employed for other emerging inverse problems, especially in medicine, as well as in the biosciences, earth sciences, and environmental sciences.
Disclosure statement
No potential conflict of interest was reported by the author(s).
References
- 1. Aster R., Borchers B., and Thurber C., Parameter Estimation and Inverse Problems, Academic Press, Oxford, 2004.
- 2. Barnett V., and Lewis T., Outliers in Statistical Data, Wiley, Great Britain, 1994.
- 3. Beaton A.E., and Tukey J.W., The fitting of power series, meaning polynomials, illustrated on band-spectroscopic data. Technometrics 16 (1974), pp. 147–185.
- 4. Beckman R.J., and Trussell H.J., The distribution of an arbitrary studentized residual and effects of updating in multiple regression. J. Am. Stat. Assoc. 69 (1974), pp. 199–201.
- 5. Ben-Tal A., and Nemirovski A., Lectures on Modern Convex Optimization: Analysis, Algorithms and Engineering Applications, MPS-SIAM Series on Optimization, SIAM, Philadelphia, 2001.
- 6. Bhattacharjee A.K., and Mukherjee S., Parkinson’s disease clustering and classification with multi layer perception using best ranked voice feature and associative rule generation. IJCSI Int. J. Comput. Sci. Issues 14 (2017), pp. 1694–0784.
- 7. Cook R.D., Influential observations in linear regression. J. Am. Stat. Assoc. 74 (1979), pp. 169–174.
- 8. Elbaz A., Bower J.H., Maraganore D.M., McDonnell S.K., Peterson B.J., Ahlskog J.E., Schaid D.J., and Rocca W.A., Risk tables for Parkinsonism and Parkinson’s disease. J. Clin. Epidemiol. 55 (2002), pp. 25–31.
- 9. Ene M., Neural network-based approach to discriminate healthy people from those with Parkinson’s disease. An. Univ. Craiova Ser. Mat. Inform. 35 (2008), pp. 112–116.
- 10. Erdogdu-Sakar B., Isenkul M., Sakar C.O., Sertbas A., Gurgen F., Delil S., Apaydin H., and Kursun O., Collection and analysis of a Parkinson speech dataset with multiple types of sound recordings. IEEE J. Biomed. Health Inform. 17 (2013), pp. 828–834.
- 11. Fox J., Robust Regression, Appendix to An R and S-PLUS Companion to Applied Regression, 2002.
- 12. Frank I.E., and Friedman J.H., A statistical view of chemometrics tools. Technometrics 35 (1993), pp. 109–148.
- 13. Hadi A.S., and Simonoff J.S., Procedures for the identification of multiple outliers in linear models. J. Am. Stat. Assoc. 88 (1993), pp. 1264–1272.
- 14. Hastie T., Tibshirani R., and Friedman J.H., The Elements of Statistical Learning, Springer, New York, 2001.
- 15. Holland P.W., and Welsch R.E., Robust regression using iteratively reweighted least-squares. Comm. Statist. Theory Methods 6 (1977), pp. 813–827.
- 16. Huber P.J., Robust regression: Asymptotics, conjectures and Monte Carlo. Ann. Statist. 1 (1973), pp. 799–821.
- 17. Huber P.J., Robust Statistics, Wiley, New York, 1981.
- 18. Kandel E.R., and Squire L.R., Neuroscience: Breaking down scientific barriers to the study of brain and mind. Science 290 (2000), pp. 1113–1120.
- 19. Karmarkar N., A new polynomial-time algorithm for linear programming. Combinatorica 4 (1984), pp. 373–395.
- 20. Kim S.S., Park S.H., and Krzanowski W.J., Simultaneous variable selection and outlier identification in linear regression using the mean-shift outlier model. J. Appl. Stat. 35 (2008), pp. 283–291.
- 21. King J., Ramig L., Lemke J.H., and Horii Y., Parkinson’s disease: longitudinal changes in acoustic parameters of phonation. J. Med. Speech Lang. Pathol. 2 (1994), pp. 29–42.
- 22. Lobo M.S., Vandenberghe L., Boyd S., and Lebret H., Applications of second-order cone programming. Linear Algebra Appl. 284 (1998), pp. 193–228.
- 23. Montgomery D.C., and Peck E.A., Introduction to Linear Regression Analysis, Wiley Interscience, New York, 1992.
- 24. Morozov V.A., The principle of discrepancy in the solution of inconsistent equations by Tikhonov regularization method. Comput. Math. Phys. 13 (1973), pp. 5.
- 25. Muller M., Estimation and testing in generalized partial linear models: A comparative study. Stat. Comput. 11 (2001), pp. 299–309.
- 26. Nesterov Y.E., and Nemirovski A.S., Interior Point Methods in Convex Programming: Theory and Applications, MPS-SIAM Series on Optimization, SIAM, Philadelphia, 1994.
- 27. Neter J., Kutner M., Wasserman W., and Nachtsheim C., Applied Linear Statistical Models, WCB McGraw Hill, Boston, 1996.
- 28. Özmen A., Robust Optimization of Spline Models and Complex Regulatory Networks: Theory and Applications, Springer, Switzerland, 2016.
- 29. Özmen A., and Weber G.W., RMARS: robustification of multivariate adaptive regression spline under polyhedral uncertainty. J. Comput. Appl. Math. 259 (2014), pp. 914–924.
- 30. Özmen A., Weber G.W., Çavuşoğlu Z., and Defterli O., The new robust conic GPLM method with an application to finance: Prediction of credit default. J. Global Optim. 56 (2013), pp. 233–249.
- 31. Pena D., and Yohai V.J., The detection of influential subsets in linear regression by using an influential matrix. J. R. Stat. Soc. Ser. B. Stat. Methodol. 57 (1995), pp. 145–156.
- 32. Pukkala T., Optimized cellular automaton for stand delineation. J. For. Res. 30 (2019), pp. 107–119.
- 33. Rao C.R., Toutenburg H., and Fieger A., Linear Models: Least-Squares and Alternatives, Springer, Berlin, 1999.
- 34. Rencher A.C., Linear Models in Statistics, Wiley, New York, 2000.
- 35. Renegar J., A Mathematical View of Interior-Point Methods in Convex Programming, MOS-SIAM Series on Optimization, SIAM, Philadelphia, 2001.
- 36. Rousseeuw P.J., Least median of squares regression. J. Am. Stat. Assoc. 79 (1984), pp. 871–880.
- 37. Rousseeuw P.J., and Van Driessen K., Computing LTS regression for large datasets. Data Min. Knowl. Discov. 12 (2006), pp. 29–45.
- 38. Rousseeuw P.J., and Yohai V., Robust regression by means of S estimators, in Robust and Nonlinear Time Series Analysis, Lect. Notes Stat. 26, Springer, New York, 1984.
- 39. Ruppert D., Computing S estimators for regression and multivariate location dispersion. J. Comput. Graph. Statist. 1 (1992), pp. 253–270.
- 40. Sapir S., Spielman J., Ramig L., Story B., and Fox C., Effects of intensive voice treatment (LSVT) on vowel articulation in dysarthric individuals with idiopathic Parkinson disease: Acoustic and perceptual findings. J. Speech Lang. Hear. Res. 50 (2007), pp. 899–912.
- 41. Singh N., Pillay V., and Choonara Y.E., Advances in the treatment of Parkinson’s disease. Prog. Neurobiol. 81 (2007), pp. 29–44.
- 42. Taylan P., Weber G.W., and Beck A., New approaches to regression by generalized additive models and continuous optimization for modern applications in finance science and technology. Optimization 56 (2007), pp. 1–24.
- 43. Taylan P., Weber G.W., Lian L., and Yerlikaya-Ozkurt F., On the foundations of parameter estimation for generalized partial linear models with B-splines and continuous optimization. Comput. Math. Appl. 60 (2010), pp. 134–143.
- 44. Taylan P., Weber G.W., and Yerlikaya F., A new approach to multivariate adaptive regression spline by using Tikhonov regularization and continuous optimization. TOP 18 (2010), pp. 377–395.
- 45. Taylan P., Yerlikaya-Ozkurt F., and Weber G.W., An approach to the mean shift outlier model by Tikhonov regularization and conic programming. Intell. Data Anal. 18 (2014), pp. 79–94.
- 46. Tikhonov A.N., and Arsenin V.Y., Solution of Ill-Posed Problems, V. H. Winston, Washington, DC, 1977.
- 47. Tsanas A., Little M.A., McSharry P.E., and Ramig L.O., Accurate telemonitoring of Parkinson’s disease progression by noninvasive speech tests. IEEE Trans. Biomed. Eng. 57 (2010), pp. 884–893.
- 48. Wang T., Li Q., Chen B., and Li Z., Multiple outlier’s detection in sparse high-dimensional regression. J. Stat. Comput. Simul. 88 (2018), pp. 89–107.
- 49. Wang T., Li Q., Zang Q., and Li Z., A multiple-case deletion approach for detecting influential points in high-dimensional regression. Comm. Statist. Simulation Comput. 48 (2019), pp. 2065–2082.
- 50. Wang T., and Li Z., Outlier detection in high-dimensional regression model. Comm. Statist. Theory Methods 46 (2017), pp. 6947–6958.
- 51. Wang T., Zheng L., Li Z., and Liu H., A robust variable screening method for high-dimensional data. J. Appl. Stat. 44 (2017), pp. 1839–1855.
- 52. Weber G.W., Batmaz I., Köksal G., Taylan P., and Yerlikaya-Özkurt F., CMARS: A new contribution to nonparametric regression with multivariate adaptive regression splines supported by continuous optimisation. Inverse Probl. Sci. Eng. 20 (2012), pp. 371–400.





