Abstract
Correctly predicting up and down trends for stock prices is of immense importance in the financial market. To further improve prediction performance, in this paper we introduce five penalties (ridge, least absolute shrinkage and selection operator, elastic net, smoothly clipped absolute deviation and minimax concave penalty) into logistic regressions with 19 technical indicators, and propose five penalized logistic regressions to predict up and down trends for stock prices. Firstly, we translate the five penalized logistic log-likelihood functions into five penalized weighted least squares functions and combine them with the tenfold cross-validation method to calculate the solution paths of the parameter estimators. Secondly, we combine the binomial deviance with the cross-validation error as a risk measure to choose an appropriate tuning parameter for the penalty functions, and apply the training set and the coordinate descent algorithm to obtain parameter estimators and probability estimators. Thirdly, we employ the testing set and the chosen optimal thresholds to construct two-class confusion matrices and receiver operating characteristic curves to assess the prediction performances of the five regressions. Finally, we compare the proposed five penalized logistic regressions with logistic regression, support vector machine and artificial neural network, and find that the minimax concave penalty logistic regression performs best in predicting up and down trends for Google's stock prices. Therefore, in this paper we propose five new prediction methods that improve the prediction accuracy of stock returns and bring economic benefits to investors.
Keywords: Penalized logistic regressions, Up and down trends, Coordinate descent algorithm, Support vector machine, Artificial neural network
Introduction
The stock market exhibits some inherent characteristics such as model uncertainty, parameter instability and noise accumulation, which make stock market prediction more complex. Different viewpoints have sprung up in economics and finance. For example, both the efficient market hypothesis and random walk theory assume that the stock market is unpredictable, whereas Dow theory and Murphy (1999) assume that financial markets are predictable. In particular, Murphy (1999) proposed many technical indicators and developed technical analysis methods for financial markets, whereas Elliott et al. (2013) systematically summarized economic forecasting problems, emphasized the challenges of stock price forecasting and provided strategies to improve forecasting performance. In recent years, machine learning methods have been proposed to predict the stock market. For example, Wang and Zhu (2010) developed support vector regression and a two-step kernel learning method for financial time series prediction. Nair et al. (2011) proposed an adaptive artificial neural network (ANN) to predict the second-day closing price of a stock market index. Cavalcante et al. (2016) systematically reviewed progress on artificial intelligence, neural networks and support vector machines (SVM) in predicting changes in stock price or direction. Zhang et al. (2018) proposed a novel stock price trend prediction system that could predict both stock price movement and its interval of growth (or decline) rate within predefined prediction durations. Wen et al. (2019) introduced a new method to simplify noise-filled financial temporal series via sequence reconstruction by leveraging motifs (frequent patterns), and then utilized a convolutional neural network to predict up and down trends for stock prices. Nabipour et al. (2020) applied machine learning and deep learning algorithms to significantly reduce the risk of trend prediction.
Shen and Shafiq (2020) proposed a comprehensive customization of feature engineering and deep learning-based model to predict price trends for China’s stock markets.
It is well known that public sentiment is closely linked to financial markets, and in recent years the impact of investor sentiment on stock returns has been investigated. For example, Joshi et al. (2016) predicted future stock movements through news sentiment classification. Li et al. (2017) proposed a long short-term memory neural network that combines investor sentiment with market factors to improve prediction performance. Xing et al. (2019) proposed a novel sentiment-aware volatility forecasting model that produces more accurate estimates of the temporal variances of asset returns by capturing the bi-directional interaction between asset price movements and market sentiment. Khan et al. (2020) proposed machine learning methods with sentiment and situational features to predict future movements of stocks. Li et al. (2021) constructed the return distributions for the Shanghai Security Composite Index by adding sentiment-aware variables. In addition, market sentiment perspectives and public sentiment-driven portfolio or asset allocation have also been analyzed. For example, Malandri et al. (2018) discussed how public sentiment affects portfolio management. Xing et al. (2018) investigated the role of market sentiment in an asset allocation problem. Xing et al. (2018) proposed to formalize public sentiment as market views and integrated it into modern portfolio theory. Picasso et al. (2019) combined technical analysis with sentiment analysis of news and constructed a portfolio return forecasting model by machine learning.
Predicting up and down trends for stock prices is an important puzzle in the financial field; even very small improvements in prediction performance can be very profitable. For example, Hu and Jiang (2021) proposed a logistic regression with 6 technical indicators to predict up and down trends for Google's stock prices and obtained higher prediction accuracy. In this paper we introduce five penalties: ridge, least absolute shrinkage and selection operator (LASSO), elastic net, smoothly clipped absolute deviation (SCAD) and minimax concave penalty (MCP), into logistic regressions with 19 technical indicators, and propose five penalized logistic regressions to further improve the prediction performance for stock returns. Firstly, we combine the iteratively reweighted least squares algorithm with the tenfold cross-validation method, calculate the overall solution path of the model parameters and select a specific solution path from the overall solution path. Secondly, we combine the binomial deviance with the cross-validation error as a risk measure to choose an appropriate tuning parameter, and apply the training set and the coordinate descent algorithm to obtain parameter estimators and probability estimators. Thirdly, we employ the testing set and the chosen optimal thresholds to construct two-class confusion matrices and receiver operating characteristic (ROC) curves to assess the prediction performances of the five regressions. Finally, we compare the proposed five penalized logistic regressions with logistic regression, SVM and ANN, and find that the MCP logistic regression performs best in predicting stock returns. We therefore recommend that investors employ the MCP logistic regression to predict up and down trends for stock prices and gain richer economic benefits.
The rest of this paper is organized as follows. In Sect. 2, we establish the five penalized logistic regressions with technical indicators. In Sect. 3, we apply the training set to learn the five penalized logistic regressions and obtain parameter estimators and probability estimators. In Sect. 4, we adopt the testing set to obtain two-class confusion matrices and ROC curves for the five regressions to assess their prediction performances. In Sect. 5, we compare the proposed five prediction methods with logistic regression, SVM and ANN.
Penalized logistic regressions
Let $P_t$ be the closing price of a given stock at the end of the $t$-th trading day and $r_t$ be the stock excess return. Then

$$Y_t = I(r_t > 0) = \begin{cases} 1, & r_t > 0, \\ 0, & r_t \le 0 \end{cases} \qquad (1)$$

represents the direction indicator function, where $Y_t = 1$ represents up trends and $Y_t = 0$ represents down trends. The main goal of this paper is to predict up and down trends for stock prices. In the following we apply a training set to learn up and down trends for stock prices and construct a two-category classification rule that may be hidden deeply in the raw dataset, where the sample is drawn from a predictor vector whose distribution is usually unknown. It is well known that logistic regression is a powerful two-category classification method. In this paper we combine logistic regression with the technical analysis developed by Murphy (1999) and propose the following logistic regression with 19 technical indicators:
$$p(x_t) = P(Y_t = 1 \mid X_t = x_t) = \frac{\exp(\beta_0 + x_t^{\mathrm{T}}\beta)}{1 + \exp(\beta_0 + x_t^{\mathrm{T}}\beta)}, \qquad (2)$$

$$1 - p(x_t) = P(Y_t = 0 \mid X_t = x_t) = \frac{1}{1 + \exp(\beta_0 + x_t^{\mathrm{T}}\beta)}, \qquad (3)$$
where $\beta_0$ is an unknown intercept term, $\beta = (\beta_1, \ldots, \beta_{19})^{\mathrm{T}}$ is an unknown parameter vector, and $x_t$ is the predictor vector composed of the 19 technical indicators listed in Table 1. To avoid multicollinearity and over-fitting, we introduce the five penalties for logistic regression to remove technical indicators that are irrelevant to up and down trends for stock prices, and construct the five penalized logistic regressions to predict up and down trends for stock prices. Let $\{x_t\}$ and $\{y_t\}$ be the observation samples for $X_t$ and $Y_t$, respectively. Given the training set $\{(x_t, y_t)\}_{t=1}^{n_1}$, we obtain the following negative log-likelihood
$$\ell(\beta_0, \beta) = -\frac{1}{n_1} \sum_{t=1}^{n_1} \left[ y_t (\beta_0 + x_t^{\mathrm{T}}\beta) - \log\left(1 + \exp(\beta_0 + x_t^{\mathrm{T}}\beta)\right) \right] \qquad (4)$$
and the penalized negative log-likelihood function
$$Q(\beta_0, \beta) = \ell(\beta_0, \beta) + \sum_{j=1}^{19} p_{\lambda}(|\beta_j|), \qquad (5)$$
where $p_{\lambda}(\cdot)$ is a penalty function of the coefficients indexed by a tuning parameter $\lambda$ that controls the trade-off between the loss function and the penalty, and that may also be shaped by one or more regularization parameters $\gamma$. In this paper we choose the five penalty functions listed in Table 2.
Table 1.
Nineteen technical indicators and their formulae
| Indicators | Descriptions |
|---|---|
| WMA | Weighted moving average |
| DEMA | Double exponential moving average |
| ADX | Average directional movement index, which measures the strength of a trend |
| MACD | Moving average convergence divergence, which compares a fast exponential moving average with a slow exponential moving average |
| CCI | Commodity channel index, which measures the current price relative to an average price |
| MO | Momentum, which provides the difference of a series over two observations |
| RSI | Relative strength index, which measures the velocity and magnitude of directional price movements |
| ATR | Average true range |
| CLV | Close location value, a metric utilized in technical analysis to assess where the closing price of a security falls relative to its day's high and low prices |
| CMF | Chaikin money flow, which compares the whole volume with regard to the close, high and low prices |
| CMO | Chande momentum oscillator |
| EMV | Ease of movement value |
| MFI | Money flow index, which uses price and volume data to identify overbought or oversold signals in an asset |
| ROC | Rate of change |
| VHF | Vertical horizontal filter, which can distinguish the type of market |
| SAR | Parabolic stop-and-reverse, which is used to determine the direction of a trend and the potential reversal of a price |
| TRIX | Triple smoothed exponential oscillator, which filters price noise and insignificant price movements |
| WPR | Williams' %R, a dynamic technical indicator that determines whether the market is overbought or oversold |
| SNR | Signal-to-noise ratio, which can show the trend direction of the stock |
Table 2.
Penalized functions
| Penalties | Formulae |
|---|---|
| Ridge | $p_{\lambda}(\beta) = \lambda \sum_{j=1}^{19} \beta_j^2$ |
| LASSO | $p_{\lambda}(\beta) = \lambda \sum_{j=1}^{19} \lvert\beta_j\rvert$ |
| ENet | $p_{\lambda}(\beta) = \lambda \sum_{j=1}^{19} \left[ \frac{1-\alpha}{2}\beta_j^2 + \alpha\lvert\beta_j\rvert \right]$, $0 < \alpha < 1$ |
| MCP | $p_{\lambda}(\beta_j) = \lambda\lvert\beta_j\rvert - \frac{\beta_j^2}{2\gamma}$ if $\lvert\beta_j\rvert \le \gamma\lambda$; $p_{\lambda}(\beta_j) = \frac{\gamma\lambda^2}{2}$ if $\lvert\beta_j\rvert > \gamma\lambda$, with $\gamma > 1$ |
| SCAD | $p_{\lambda}(\beta_j) = \lambda\lvert\beta_j\rvert$ if $\lvert\beta_j\rvert \le \lambda$; $p_{\lambda}(\beta_j) = \frac{2\gamma\lambda\lvert\beta_j\rvert - \beta_j^2 - \lambda^2}{2(\gamma-1)}$ if $\lambda < \lvert\beta_j\rvert \le \gamma\lambda$; $p_{\lambda}(\beta_j) = \frac{\lambda^2(\gamma+1)}{2}$ if $\lvert\beta_j\rvert > \gamma\lambda$, with $\gamma > 2$ |
ENet represents elastic net
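To make the penalty definitions in Table 2 concrete, the five penalties can be sketched as scalar functions in Python. This is our own illustrative code; the function names, default $\gamma$ values and evaluation points are assumptions for illustration, not from the paper:

```python
import numpy as np

def ridge(beta, lam):
    # Ridge: quadratic shrinkage; coefficients are shrunk but never exactly zero
    return lam * beta**2

def lasso(beta, lam):
    # LASSO: absolute-value penalty; induces exact zeros (variable selection)
    return lam * np.abs(beta)

def enet(beta, lam, alpha=0.5):
    # Elastic net: convex mixture of the ridge and LASSO penalties
    return lam * ((1 - alpha) * 0.5 * beta**2 + alpha * np.abs(beta))

def mcp(beta, lam, gamma=3.0):
    # MCP: LASSO-like near zero, then flattens to a constant beyond gamma*lam,
    # so large coefficients are left (nearly) unpenalized
    b = np.abs(beta)
    return np.where(b <= gamma * lam,
                    lam * b - b**2 / (2 * gamma),
                    0.5 * gamma * lam**2)

def scad(beta, lam, gamma=3.7):
    # SCAD: linear near zero, quadratic transition, constant beyond gamma*lam
    b = np.abs(beta)
    p1 = lam * b
    p2 = (2 * gamma * lam * b - b**2 - lam**2) / (2 * (gamma - 1))
    p3 = lam**2 * (gamma + 1) / 2
    return np.where(b <= lam, p1, np.where(b <= gamma * lam, p2, p3))
```

Both MCP and SCAD are continuous at their breakpoints; for example, at $\lvert\beta\rvert = \gamma\lambda$ the MCP value $\lambda\gamma\lambda - (\gamma\lambda)^2/(2\gamma)$ equals the plateau $\gamma\lambda^2/2$.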
Parameter estimators and probability estimators
Minimizing negative log-likelihood function (4) admits no closed-form solution. Hence, if the current estimates of the parameters are $(\tilde\beta_0, \tilde\beta)$, we transform (4) into a weighted least-squares function that forms a quadratic approximation to the negative log-likelihood (4):
$$\ell_Q(\beta_0, \beta) = \frac{1}{2n_1} \sum_{t=1}^{n_1} w_t \left( z_t - \beta_0 - x_t^{\mathrm{T}}\beta \right)^2 + C, \qquad (6)$$

where

$$z_t = \tilde\beta_0 + x_t^{\mathrm{T}}\tilde\beta + \frac{y_t - \tilde p(x_t)}{\tilde p(x_t)\left(1 - \tilde p(x_t)\right)}, \qquad w_t = \tilde p(x_t)\left(1 - \tilde p(x_t)\right)$$

are the working responses and the weights evaluated at the current estimates; the estimator of $\beta$ and the estimator of the intercept $\beta_0$ minimize the weighted least-squares function as follows:

$$(\hat\beta_0, \hat\beta) = \mathop{\arg\min}_{\beta_0, \beta} \ \frac{1}{2n_1} \sum_{t=1}^{n_1} w_t \left( z_t - \beta_0 - x_t^{\mathrm{T}}\beta \right)^2, \qquad (7)$$

and $C$ is a constant. The penalized negative log-likelihood function (5), however, is not differentiable at zero for the absolute-value-type penalties. Therefore, we replace the negative log-likelihood in (5) by the weighted least-squares function $\ell_Q(\beta_0, \beta)$ and run the coordinate descent algorithm to obtain the parameter estimator
$$(\hat\beta_0, \hat\beta) = \mathop{\arg\min}_{\beta_0, \beta} \left\{ \frac{1}{2n_1} \sum_{t=1}^{n_1} w_t \left( z_t - \beta_0 - x_t^{\mathrm{T}}\beta \right)^2 + \sum_{j=1}^{19} p_{\lambda}(|\beta_j|) \right\}, \qquad (8)$$
where the intercept term is not penalized. For more details on the coordinate descent algorithm for penalized logistic regressions, refer to Breheny and Huang (2011). Table 3 lists three specific parameter estimators.
Table 3.
Penalized functions and parameter estimators for penalized logistic regressions
| Penalties | Estimators |
|---|---|
| LASSO | $\hat\beta_j = S(z_j, \lambda)$ |
| MCP | $\hat\beta_j = \frac{S(z_j, \lambda)}{1 - 1/\gamma}$ if $\lvert z_j \rvert \le \gamma\lambda$; $\hat\beta_j = z_j$ if $\lvert z_j \rvert > \gamma\lambda$ |
| SCAD | $\hat\beta_j = S(z_j, \lambda)$ if $\lvert z_j \rvert \le 2\lambda$; $\hat\beta_j = \frac{S(z_j, \gamma\lambda/(\gamma-1))}{1 - 1/(\gamma-1)}$ if $2\lambda < \lvert z_j \rvert \le \gamma\lambda$; $\hat\beta_j = z_j$ if $\lvert z_j \rvert > \gamma\lambda$ |
| Symbols | $z_j$ denotes the univariate (weighted) least-squares solution for the $j$-th coordinate, and $S(z, \lambda) = \mathrm{sign}(z)(\lvert z \rvert - \lambda)_+$ is the soft-thresholding operator |
For $j$ in $\{1, \ldots, 19\}$, the coordinate descent algorithm partially optimizes the target function with respect to a single parameter $\beta_j$, with the remaining parameters fixed at their most recently updated values, then cycles iteratively through all the parameters until convergence or a maximum iteration number $M$ is reached; this process is repeated over a grid of values for $\lambda$ to produce a solution path. Usually, we are interested in obtaining $\hat\beta$ not just for a single value of $\lambda$, but for a range of values extending from a maximum value $\lambda_{\max}$, for which all penalized coefficients are 0, down to a minimum value $\lambda_{\min}$ at which the model becomes excessively large or ceases to be identifiable. Thus, by starting at $\lambda_{\max}$ with $\hat\beta = 0$ and proceeding toward $\lambda_{\min}$, we can ensure that the initial values will never be far from the solution. For $\lambda_{\min}$, we generally take a small fraction of $\lambda_{\max}$. We tried different values of the regularization parameter $\gamma$ and identified suitable choices for MCP and for SCAD. Algorithm 1 provides the specific pseudocode on how to apply the coordinate descent algorithm to calculate the parameter estimators for the MCP logistic regression. The coordinate descent algorithms for the parameter estimators of the other four penalized logistic regressions are similar to Algorithm 1; we do not list them here for lack of space.
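The coordinate-wise update described above can be sketched for the MCP penalty on a least-squares-type loss. The following numpy code is a minimal illustration assuming standardized predictors; for brevity it omits the intercept, the observation weights $w_t$ and the $\lambda$ path, and all names, defaults and the toy usage are our own assumptions rather than the paper's implementation:

```python
import numpy as np

def soft(z, lam):
    # Soft-thresholding operator S(z, lam) from Table 3
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def mcp_threshold(z, lam, gamma=3.0):
    # Closed-form coordinate update for the MCP penalty (standardized x_j)
    if np.abs(z) <= gamma * lam:
        return soft(z, lam) / (1.0 - 1.0 / gamma)
    return z  # large coefficients are left unpenalized

def cd_mcp(X, y, lam, gamma=3.0, n_iter=200):
    # Coordinate descent for the least-squares loss with an MCP penalty.
    # Assumes the columns of X are standardized (mean 0, variance 1).
    n, p = X.shape
    beta = np.zeros(p)
    r = y - X @ beta                        # current residuals
    for _ in range(n_iter):
        for j in range(p):
            zj = X[:, j] @ r / n + beta[j]  # univariate solution z_j
            bj = mcp_threshold(zj, lam, gamma)
            r -= X[:, j] * (bj - beta[j])   # update residuals in place
            beta[j] = bj
    return beta
```

On toy data where only a few predictors carry signal, the irrelevant coefficients are driven exactly to zero while the relevant ones remain nearly unbiased, which is the behavior the MCP penalty is designed to deliver.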
In this paper we apply the coordinate descent algorithm to the five penalized logistic regressions to obtain the final parameter estimators $\hat\beta_0$ and $\hat\beta$, and then compute the probability estimators
$$\hat p(x_t) = \frac{\exp(\hat\beta_0 + x_t^{\mathrm{T}}\hat\beta)}{1 + \exp(\hat\beta_0 + x_t^{\mathrm{T}}\hat\beta)}, \qquad (9)$$

$$1 - \hat p(x_t) = \frac{1}{1 + \exp(\hat\beta_0 + x_t^{\mathrm{T}}\hat\beta)}. \qquad (10)$$
Remark
Compared with local linear/quadratic approximation algorithms, the coordinate descent algorithm has the following advantages: (1) the optimization over each single parameter has a closed-form solution; (2) each update can be computed very rapidly; (3) the initial values are never far from the solutions, so only a few iterations are required.
Two-class prediction performance
A two-class confusion matrix is a contingency table of the true class and the predicted class that describes two-class classification results; see Table 4. From it we define

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}, \qquad (11)$$

which is the simplest index to evaluate the prediction performance. However, it cannot reflect the losses from the two types of errors. Therefore, a ROC curve is introduced to evaluate the prediction performance. Suppose that $TPR(c)$ represents the true positive rate at the threshold $c$, and $FPR(c)$ represents the false positive rate at the threshold $c$. By setting different thresholds $c$, we calculate the pairs $(FPR(c), TPR(c))$, i.e., (1-Specificity, Sensitivity), to draw a ROC curve, where

$$\text{Sensitivity} = TPR(c) = \frac{TP}{TP + FN}, \qquad (12)$$

$$1 - \text{Specificity} = FPR(c) = \frac{FP}{FP + TN}. \qquad (13)$$
In Sect. 5 we adopt the R package pROC to draw ROC curves and compute the AUC (the area under the ROC curve, a summary indicator of classification performance). For more details on ROC, refer to Chapter 7 in Hu and Liu (2020).
Table 4.
Two-class confusion matrix
| True class 1() | True class 2 () | |
|---|---|---|
| Predicted class 1() | TP | FP |
| Predicted class 2() | FN | TN |
TP True positive, FP False positive, TN True negative, FN False negative
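These quantities can be sketched in a small self-contained Python example (the helper names are ours, not the paper's R workflow): it builds the confusion matrix of Table 4 at a threshold $c$, computes sensitivity and specificity, and obtains the AUC via its rank-statistic (Mann-Whitney) characterization rather than by integrating the curve:

```python
def confusion(y_true, p_hat, c=0.5):
    # Two-class confusion matrix at threshold c (class 1 = "up")
    tp = sum(1 for y, p in zip(y_true, p_hat) if y == 1 and p >= c)
    fp = sum(1 for y, p in zip(y_true, p_hat) if y == 0 and p >= c)
    fn = sum(1 for y, p in zip(y_true, p_hat) if y == 1 and p < c)
    tn = sum(1 for y, p in zip(y_true, p_hat) if y == 0 and p < c)
    return tp, fp, fn, tn

def sens_spec(tp, fp, fn, tn):
    # Sensitivity = TPR, Specificity = TNR, as in (12)-(13)
    return tp / (tp + fn), tn / (tn + fp)

def auc(y_true, p_hat):
    # AUC equals the probability that a random positive scores above a
    # random negative (ties count one half): the Mann-Whitney U statistic.
    pos = [p for y, p in zip(y_true, p_hat) if y == 1]
    neg = [p for y, p in zip(y_true, p_hat) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))
```

Sweeping $c$ over the sorted predicted probabilities and collecting the (FPR, TPR) pairs from `confusion` traces out the ROC curve itself.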
Real data analysis
Technical indicators and variance inflation factors
The stock market fluctuated greatly after December 2019 because of the novel coronavirus pandemic. Therefore, we select Google's stock prices from January 2010 to November 2019 as the observation data, choose the earlier observations as the training set to learn up and down trends for stock prices, and choose the remaining observations as the testing set to predict up and down trends. We apply the R function getSymbols with the Yahoo Finance portal to obtain the opening price, highest price, lowest price, closing price, volume and adjusted price for the Google corporation, and then adopt the R package TTR to calculate the 19 technical indicators: WMA, DEMA, ADX, MACD, CCI, MO, RSI, ATR, CLV, CMF, CMO, EMV, MFI, ROC, VHF, SAR, TRIX, WPR, SNR. We take the direction indicator $Y_t$ as the response variable and the 19 technical indicators as the predictor vector to construct the aforementioned five penalized logistic regressions for predicting up and down trends for Google's stock prices. Table 5 lists five summary statistics for the 19 technical indicators and their variance inflation factors (VIF) based on the training set, where the summary statistics show the characteristics of the data and the VIF shows the collinearity relations among the 19 technical indicators.
The two indicators WMA and DEMA represent moving averages of stock prices and mainly show the fluctuation range and dispersion degree of stock prices. From Table 5, we observe that the minimum, maximum, median, mean and standard deviation of WMA, DEMA and SAR are larger than those of the other indicators. The mean value of ADX indicates that the average degree of trend change of Google's stock is 40.1045. The remaining indicators have smaller ranges, means and standard deviations. The mean value of the momentum line MO at 1.9375 reflects the overall upward trend of Google's stock price. The mean value of RSI is 54.1628; its maximum value of 98.7890 is greater than 80 and corresponds to a selling period, whereas its minimum value of 5.5085 is less than 10 and corresponds to a buying period. Through the analysis of the medians and means of the 19 indicators, we find that they are evenly distributed. However, the indicators have different degrees of variation, and the values of some indicators differ greatly. Therefore, in order to eliminate the influence of scale variations, we standardize the data before modeling. To check whether collinearity exists among the 19 indicators, we introduce the VIF. It can be observed from Table 5 that the VIF values for WMA, DEMA and SAR are far greater than 10, and the VIF values for MO, RSI, CMO, ROC and WPR are also greater than 10. This indicates that collinearity exists among the 19 indicators. Thus, it is statistically meaningful to introduce the penalty functions for logistic regression to reduce collinearity and avoid over-fitting.
Table 5.
Summary statistics and VIF
| Indicators | Min | Max | Median | Mean | SD | VIF |
|---|---|---|---|---|---|---|
| WMA | 218.8191 | 1033.3873 | 519.5333 | 512.2369 | 216.9976 | 58264.2178 |
| DEMA | 215.8488 | 1039.1186 | 517.3452 | 512.7533 | 217.4342 | 57089.3227 |
| ADX | 10.4160 | 85.4022 | 37.9686 | 40.1045 | 14.1180 | 1.2691 |
| MACD | −4.4066 | 5.6060 | 0.3523 | 0.4140 | 1.5782 | 2.7680 |
| CCI | −5.0000 | 5.0000 | 0.8121 | 0.3048 | 2.8384 | 9.3459 |
| MO | −86.5400 | 142.8000 | 1.8690 | 1.9375 | 16.8550 | 16.5401 |
| RSI | 5.5085 | 98.7890 | 54.7101 | 54.1628 | 20.5363 | 20.6991 |
| ATR | 2.6340 | 31.2840 | 7.6639 | 9.0489 | 4.4469 | 1.8849 |
| CLV | −1.0000 | 1.0000 | 0.0780 | 0.0445 | 0.5999 | 2.4134 |
| CMF | −0.9697 | 0.7999 | 0.0381 | 0.0397 | 0.2842 | 2.5886 |
| CMO | −100.0000 | 100.0000 | 10.9761 | 8.3479 | 56.9043 | 14.4237 |
| EMV | −168.6625 | 57.4554 | 0.0038 | −0.0362 | 4.0696 | 1.0212 |
| MFI | 0.0000 | 100.0000 | 53.8371 | 52.5334 | 26.6619 | 4.5240 |
| ROC | −0.1384 | 0.2385 | 0.0042 | 0.0034 | 0.0340 | 11.4282 |
| VHF | 0.1232 | 0.9994 | 0.5736 | 0.5849 | 0.1898 | 1.2715 |
| SAR | 216.0054 | 998.7722 | 506.7088 | 509.2556 | 215.7863 | 289.9360 |
| TRIX | −1.4159 | 2.5168 | 0.0721 | 0.0691 | 0.4112 | 8.1197 |
| WPR | 0.0000 | 1.0000 | 0.4061 | 0.4475 | 0.3113 | 14.0201 |
| SNR | 0.0000 | 4.9967 | 1.1242 | 1.3006 | 0.9341 | 1.5045 |
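The VIF diagnostic used here can be sketched with numpy alone: regress each column of the design matrix on the remaining columns by least squares and set $VIF_j = 1/(1 - R_j^2)$. The code and the toy data below are our own illustration, not the paper's R computation:

```python
import numpy as np

def vif(X):
    # X: (n, p) design matrix; returns one variance inflation factor per column.
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        Z = np.delete(X, j, axis=1)
        Z1 = np.column_stack([np.ones(n), Z])          # add an intercept
        coef, *_ = np.linalg.lstsq(Z1, y, rcond=None)  # regress x_j on the rest
        resid = y - Z1 @ coef
        tss = (y - y.mean()) @ (y - y.mean())
        r2 = 1.0 - resid @ resid / tss
        out[j] = 1.0 / (1.0 - r2)                      # VIF_j = 1/(1 - R_j^2)
    return out
```

A column that is (nearly) a linear combination of the others, such as WMA and DEMA here, produces a VIF far above the usual rule-of-thumb cutoff of 10, while an independent column stays near 1.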
Tuning parameter selection
For the ridge, LASSO and elastic net penalties, variable selection is determined by the tuning parameter $\lambda$. In order to select an appropriate $\lambda$, we apply a tenfold cross-validation method to calculate the full solution path of the model parameters, select a specific solution path from the full solution path and take the binomial deviance as the risk measure. We then obtain the mean cross-validation error curve and the one-standard-deviation band; see Fig. 1. The parameter estimators for the MCP and SCAD penalized logistic regressions depend on both the tuning parameter $\lambda$ and the regularization parameter $\gamma$.
Fig. 1.
The relationships between binomial deviance/cross-validation error and $\lambda$
In this section we combine the binomial deviance with the tenfold cross-validation method to choose an appropriate tuning parameter $\lambda$. Figure 1a, b, c, respectively, represent the binomial deviance curves for ridge, LASSO and elastic net, drawn by the R function cv.glmnet, whereas Fig. 1d, e, respectively, represent the cross-validation error curves for SCAD and MCP, drawn by the R function plot.cv.ncvreg. In Fig. 1, the numbers above each graph indicate the numbers of selected variables. The left vertical line corresponds to the $\lambda$ at which the minimum mean cross-validation error occurs, the right vertical line marks the largest $\lambda$ whose error lies within one standard error of that minimum, and the $\lambda$ values between the two vertical lines all have errors within one standard error of the minimum (the "one-standard-error" rule). We often use this rule to select a relatively optimal model. From Fig. 1 we observe the "one-standard-error" ranges for ridge, LASSO and elastic net between the corresponding pairs of vertical lines in Fig. 1a, b, c. However, for MCP and SCAD there is only one vertical line, which corresponds to the $\lambda$ at which the average minimum error occurs; see Fig. 1d, e. We evaluate the prediction performance at each $\lambda$ and $\gamma$ value, select the relatively optimal models for MCP and for SCAD and obtain the final five penalized regressions. Comparing the five penalized regressions with logistic regression, we find that ridge logistic regression preserves all 19 variables without removing any, which is similar to logistic regression, whereas the other four penalized logistic regressions choose different variables; for more details see Table 6.
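The "one-standard-error" rule described above can be written down compactly: take $\lambda_{\min}$ at the minimum mean cross-validation error, and $\lambda_{1se}$ as the largest $\lambda$ (the sparsest model) whose mean error stays within one standard error of that minimum. A small sketch with illustrative numbers (the data are made up, not from Fig. 1):

```python
def lambda_min_and_1se(lambdas, cv_mean, cv_se):
    # lambdas assumed sorted in decreasing order (glmnet-style path)
    i_min = min(range(len(cv_mean)), key=cv_mean.__getitem__)
    bound = cv_mean[i_min] + cv_se[i_min]
    # largest lambda whose mean CV error is within one SE of the minimum
    i_1se = next(i for i in range(len(lambdas)) if cv_mean[i] <= bound)
    return lambdas[i_min], lambdas[i_1se]
```

Because larger $\lambda$ values zero out more coefficients, preferring $\lambda_{1se}$ over $\lambda_{\min}$ trades a statistically negligible increase in cross-validation error for a sparser, more interpretable model.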
Table 6.
Parameter estimators for logistic regression and five penalized logistic regressions
| Coefficient | LR | Ridge | LASSO | ENet | MCP | SCAD |
|---|---|---|---|---|---|---|
| Intercept | 0.4918 | −0.1008 | −0.1049 | −0.1050 | −0.1137 | −0.1520 |
| WMA | 0.2401 | 0.0159 | 0.0311 | | | |
| DEMA | −0.2343 | 0.0087 | | | | |
| ADX | 0.0016 | 0.0031 | | | | |
| MACD | −0.0192 | −0.0383 | | | | |
| CCI | −0.1959 | −0.3103 | −0.5788 | −0.5451 | −0.5068 | −0.4751 |
| MO | 0.0373 | 0.0869 | | | | |
| RSI | −0.0313 | −0.1191 | −0.1234 | −1.0464 | −1.0114 | |
| ATR | 0.0214 | 0.0485 | 0.0260 | 0.050 | 0.0015 | 0.0956 |
| CLV | −0.2859 | −0.1568 | −0.1622 | −0.1761 | −0.0992 | |
| CMF | 0.3466 | 0.1977 | 0.0917 | 0.1443 | 0.1333 | |
| CMO | 0.0208 | 0.4683 | 0.8422 | 0.8287 | 1.4888 | 1.6237 |
| EMV | 0.0507 | 0.0258 | 0.0488 | | | |
| MFI | 0.0145 | 0.3582 | 0.3496 | 0.3981 | 0.4470 | 0.3876 |
| ROC | −9.7227 | 0.1061 | −0.2134 | | | |
| VHF | 0.6910 | 0.0715 | 0.0101 | 0.0361 | 0.0122 | 0.0975 |
| SAR | −0.0057 | 0.0033 | | | | |
| TRIX | 1.3665 | 0.1115 | 0.0368 | 0.4117 | 0.4375 | |
| WPR | −0.9247 | 0.1081 | | | | |
| SNR | −0.0983 | −0.0250 | −0.0918 | | | |
LR represents logistic regression
For the five penalized logistic regressions, we calculate the VIF values; see Table 7. From Table 5, we found that the VIF values of WMA, DEMA and SAR are 58264.2178, 57089.3227 and 289.9360, respectively, whereas the VIF values of MO, RSI, CMO, ROC and WPR are also greater than 10, which indicates that strong multicollinearity relations exist among these indicators. From Table 7, we observe that the VIF values of the indicators remaining after the LASSO penalty are all less than 10; after the elastic net, MCP and SCAD penalties, only one remaining indicator has VIF values greater than 10, namely 14.1372, 11.7272 and 15.1485, respectively. Therefore, penalized logistic regressions can greatly weaken or eliminate collinearity relations among technical indicators.
Table 7.
VIF for the remaining variables
| Variables | VIF (LASSO) | VIF (ENet) | VIF (MCP) | VIF (SCAD) |
|---|---|---|---|---|
| 1.7888 | ||||
| 2.5307 | 4.4068 | 4.3680 | 4.6557 | |
| 14.1372 | 11.7272 | 15.1485 | ||
| 1.0041 | 1.0078 | 1.7588 | 1.0141 | |
| 1.3427 | 1.6880 | 1.7260 | ||
| 2.2507 | 2.3057 | 2.3101 | ||
| 6.1593 | 7.9904 | 7.3606 | 10.8813 | |
| 1.0134 | ||||
| 4.3145 | 4.4449 | 4.2970 | 4.4678 | |
| 5.2576 | ||||
| 1.0081 | 1.0082 | 1.0106 | 1.2629 | |
| 4.1359 | 3.4627 | 5.3538 | ||
| 1.3270 |
The prediction performance
We take advantage of the training set to learn up and down trends for Google's stock prices, and apply the testing set and the ROC curve to evaluate the prediction performance. According to the predicted classes from the fitted model and the actual classes in the testing set, we establish the following two-class confusion matrix; see Table 8.
Table 8.
Two-class confusion matrix
| Actual 1() | Actual 2 () | |
|---|---|---|
| Predicted 1() | 191 | 84 |
| Predicted 2() | 51 | 164 |
From Table 8 we calculate accuracy, sensitivity and specificity for logistic regression as follows:

$$\text{Accuracy} = \frac{191 + 164}{490} = 0.724, \quad \text{Sensitivity} = \frac{191}{191 + 51} = 0.789, \quad \text{Specificity} = \frac{164}{164 + 84} = 0.661.$$
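The logistic regression column of Table 9 can be reproduced directly from the counts in Table 8; a quick check:

```python
# Counts from Table 8 (logistic regression on the testing set)
tp, fp, fn, tn = 191, 84, 51, 164

accuracy = (tp + tn) / (tp + fp + fn + tn)   # (191 + 164) / 490
sensitivity = tp / (tp + fn)                 # 191 / 242
specificity = tn / (tn + fp)                 # 164 / 248

print(round(accuracy, 3), round(sensitivity, 3), round(specificity, 3))
# -> 0.724 0.789 0.661, matching the LR column of Table 9
```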
Similarly, we calculate accuracy, sensitivity and specificity for the five penalized logistic regressions. Their specific values are listed in Table 9.
Table 9.
The prediction performances for the six methods
| LR | Ridge | LASSO | ENet | MCP | SCAD | |
|---|---|---|---|---|---|---|
| Sensitivity | 0.789 | 0.625 | 0.681 | 0.749 | 0.781 | 0.773 |
| Specificity | 0.661 | 0.766 | 0.720 | 0.678 | 0.678 | 0.686 |
| Accuracy | 0.724 | 0.694 | 0.705 | 0.712 | 0.732 | 0.731 |
From Table 9 we observe the following facts: (1) for elastic net and LASSO, accuracy is higher than that of ridge but lower than that of logistic regression; (2) accuracy for MCP is higher than that of SCAD, whereas accuracy for SCAD is higher than that of elastic net and logistic regression. However, accuracy is the simplest index for evaluating the prediction, and it cannot fully reflect the losses from the two kinds of errors. Therefore, in the following we first compute the sensitivity and specificity corresponding to different thresholds for the six methods and then apply them to draw the ROC curves to evaluate the prediction performance; see Fig. 2.
Fig. 2.
The ROC curves for the six models
In Fig. 2, the AUC values corresponding to logistic regression, ridge, LASSO, elastic net, MCP and SCAD are 0.776, 0.752, 0.757, 0.760, 0.778 and 0.777, respectively. Combined with the accuracy values listed in Table 9, it can be concluded that, among the six methods, the MCP logistic regression with technical indicators performs best in terms of accuracy. In order to further demonstrate the superiority of the MCP logistic regression in predicting stock price trend movements, we compare its prediction performance with those of SVM and ANN; see Table 10.
Table 10.
Sensitivity, specificity, accuracy and AUC for MCP, SVM and ANN
| MCP | SVM | ANN | |
|---|---|---|---|
| Sensitivity | 0.781 | 0.705 | 0.725 |
| Specificity | 0.678 | 0.653 | 0.732 |
| Accuracy | 0.732 | 0.686 | 0.729 |
| AUC | 0.778 | 0.679 | 0.759 |
From Table 10, we can observe that, among the aforementioned three methods, MCP performs best in terms of sensitivity, accuracy and AUC. The reason that SVM performs the worst may be that the Gaussian kernel function is a typical local kernel function: it only affects the data points in a small area near the test point, so it has strong learning ability but weak generalization performance. In addition, ANN is unstable, so we choose the average of 10 predicted results as the final values, and they are still worse than those of MCP. Obviously, the MCP logistic regression performs best in predicting the trend of stock price ups and downs. Therefore, we recommend the MCP logistic regression for predicting stock price trend movements.
Discussion
Methodologically, we introduce the five penalty functions into logistic regression with 19 technical indicators and propose the five penalized logistic regressions to predict up and down trends for Google's stock prices. These prediction methods not only provide classification probability estimates and class index information, but also improve prediction accuracy by shrinking regression coefficients and avoiding multicollinearity and over-fitting. Computationally, we combine iteratively reweighted least squares, the coordinate descent algorithm and the tenfold cross-validation method for the five penalized logistic regressions to obtain their parameter estimates and probability estimates. According to the VIF analysis in Table 5, collinearity exists among the different technical indicators; thus, it is statistically meaningful to introduce the different penalty functions to reduce collinearity in logistic regression with 19 technical indicators. Therefore, we propose five efficient penalized logistic regressions to predict stock price trend movements. Wen et al. (2019) and Khan et al. (2020) predicted Google stock trend movements with accuracies of 0.636 and 0.641, respectively. From Table 9 we observe that the prediction accuracies of the five penalized logistic regressions are all higher than 0.693. In particular, the prediction accuracies of MCP and SCAD are 0.732 and 0.731, and their AUCs are 0.778 and 0.777, respectively. Obviously, the MCP and SCAD penalized logistic regressions outperform logistic regression in terms of prediction performance. Furthermore, comparing MCP and SCAD with SVM and ANN, we find that the proposed MCP and SCAD penalized logistic regressions perform better than SVM and ANN. Therefore, in this paper we provide new methods to predict stock market trend movements.
Moreover, the proposed methods help investors to better understand the internal mechanism of stock market trends movement.
Conclusion
Based on Murphy's technical analysis method, we combine technical indicators with five penalties and propose the five penalized logistic regressions to predict the up and down trends of Google's stock price. The prediction results show that the MCP logistic regression with technical indicators is superior to logistic regression, the other four penalized logistic regressions, SVM and ANN. Therefore, in this paper we combine technical indicators with the MCP logistic regression and provide a new, effective prediction method to further improve the prediction performance for stock returns. For other stock price trend prediction problems, we can also apply statistical charts, data analysis, empirical knowledge and the penalized method to extract important technical indicators that may affect stock price trend movements, establish penalized logistic regressions with different technical indicators to predict up and down trends for stock prices, and apply the two-class confusion matrices and ROC curves to assess their prediction performances.
Author Contributions
XH provided the basic idea and improved the writing of the manuscript. HJ collected the data, provided the figures and tables, and finished the basic writing. HJ improved the program.
Funding
This research was supported by the Fifth Batch of Excellent Talent Support Program of Chongqing Colleges and University (68021900601), the Natural Science Foundation of CQ CSTC (2018jcyjA2073), Science and Technology Research Program of Chongqing Education Commission (KJZD-M202100801), the Program for the Chongqing Statistics Postgraduate Supervisor Team (yds183002), Chongqing Social Science Plan Project (2019WT59,2020YBTJ102), Open Project from Chongqing Key Laboratory of Social Economy and Applied Statistics (KFJJ2018066) and Mathematic and Statistics Team from Chongqing Technology and Business University (ZDPTTD201906).
Data availability
The datasets analyzed during the current study are available from Yahoo Finance (uk.finance.yahoo.com).
Declarations
Conflict of interest
The authors declare that they have no relevant financial or non-financial interests to disclose.
Ethical approval
This article does not contain any studies with human participants or animals performed by the authors.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Huifeng Jiang, Email: jianghuifeng0221@163.com.
Xuemei Hu, Email: huxuem@163.com.
Hong Jia, Email: jh9829@ctbu.edu.cn.
References
- Breheny P, Huang J. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Annals of Applied Statistics. 2011;5(1):232–253. doi: 10.1214/10-AOAS388. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cavalcante RC, Brasileiro RC, Souza VLF, Nobrega JP, Oliveira ALI. Computational intelligence and financial markets: a survey and future directions. Expert Systems with Applications. 2016;55(15):194–211. doi: 10.1016/j.eswa.2016.02.006. [DOI] [Google Scholar]
- Elliott G, Granger C, Timmermann A (2013) Handbook of economic forecasting. North Holland/Elsevier, Amsterdam
- Hu XM, Jiang HF. Logistic regression model with technical indicators predicts ups and downs for google stock prices. System Science and Mathematics. 2021;41(3):1–22. [Google Scholar]
- Hu XM, Liu F. Estimation theory and model recognition for high-dimensional statistical models. Beijing: Higher Education Press; 2020. [Google Scholar]
- Joshi K, Bharathi HN, Rao J. Stock trend prediction using news sentiment analysis. International Journal of Computer Science and Information Technology. 2016;8(3):67–76. doi: 10.5121/ijcsit.2016.8306. [DOI] [Google Scholar]
- Khan W, Malik U, Ghazanfar MA, Azam MA, Alyoubi K, Alfakeeh A. Predicting stock market trends using machine learning algorithms via public sentiment and political situation analysis. Soft Computing. 2020;24(15):11019–11043. doi: 10.1007/s00500-019-04347-y. [DOI] [Google Scholar]
- Li JH, Bu H, Wu JJ. Sentiment-aware stock market prediction: a deep learning method. International Conference on Service Systems and Service Management. 2017;202:1–6. [Google Scholar]
- Li S, Ning K, Zhang T (2021) Sentiment-aware jump forecasting. Knowledge-Based Systems 228:107292
- Malandri L, Xing FZ, Orsenigo C, Vercellis C (2018) Public mood-driven asset allocation: the importance of financial sentiment in portfolio management. Cognitive Computation 10:1167–1176
- Murphy JJ (1999) Technical analysis of the financial markets. Prentice Hall Press, New York
- Nabipour M, Nayyeri P, Jabani H, Shahab S, Mosavi A (2020) Predicting stock market trends using machine learning and deep learning algorithms via continuous and binary data; a comparative analysis on the Tehran stock exchange. IEEE Access 99(8):150199–150212
- Nair B, Sai SG, Naveen AN, Lakshmi A, Venkatesh GS, Mohandas V. A ga-artificial neural network hybrid system for financial time series forecasting. Information Technology and Mobile Communication. 2011;147(2):499–506. doi: 10.1007/978-3-642-20573-6_91. [DOI] [Google Scholar]
- Picasso A, Merello S, Ma YK, Oneto L, Cambria E. Technical analysis and sentiment embeddings for market trend prediction. Expert Systems with Applications. 2019;135:60–70. doi: 10.1016/j.eswa.2019.06.014. [DOI] [Google Scholar]
- Shen JY, Shafiq MO. Short-term stock market price trend prediction using a comprehensive deep learning system. Journal Of Big Data. 2020;7(1):66–98. doi: 10.1186/s40537-020-00333-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wang L, Zhu J. Financial market forecasting using a two-step kernel learning method for the support vector regression. Annals of Operations Research. 2010;174(2):103–120. doi: 10.1007/s10479-008-0357-7. [DOI] [Google Scholar]
- Wen M, Li P, Zhang LF, Chen Y. Stock market trend prediction using high-order information of time series. IEEE Access. 2019;7:28299–28308. doi: 10.1109/ACCESS.2019.2901842. [DOI] [Google Scholar]
- Xing FZ, Cambria E, Malandri L, Vercellis C (2018) Discovering Bayesian market views for intelligent asset allocation. Machine Learning and Knowledge Discovery in Databases 9(2):120–135
- Xing FZ, Cambria E, Welsch RE. Intelligent asset allocation via market sentiment views. IEEE Computational Intelligence Magazine. 2018;13(4):25–34. doi: 10.1109/MCI.2018.2866727. [DOI] [Google Scholar]
- Xing FZ, Cambria E, Zhang Y (2019) Sentiment-aware volatility forecasting. Knowledge-Based Systems 176:68–76
- Zhang J, Cui SC, Xu Y. A novel data-driven stock price trend prediction system. Expert Systems with Applications. 2018;97(1):60–69. doi: 10.1016/j.eswa.2017.12.026. [DOI] [Google Scholar]