Table 8.
A synthesis of the 10 most-cited articles in our dataset.
| Title | Method/model | Data used | Algorithms/techniques | Main findings |
|---|---|---|---|---|
| Toward Detection of Phishing Websites on Client-Side Using Machine Learning Based Approach (Jain and Gupta, 2018b) | ML on multiple datasets | PhishTank, OpenPhish, Alexa, payment gateways, banks | RF, SVM, Neural Nets, Logistic Regression, Naive Bayes | Improved accuracy using client-side data extraction |
| Detection of Phishing Websites Using an Efficient Feature-Based Machine Learning Framework (Rao and Pais, 2019) | Feature extraction from URL + source code + 3rd parties | Diverse datasets | 8 ML algorithms | Outperforms CANTINA/CANTINA+, detects zero-day phishing |
| Phishing Website Detection Based on Multidimensional Features Driven by Deep Learning (Yang et al., 2019) | CNN for phishing detection | ~2M URLs (1,021,758 phishing + 989,021 legitimate) | CNN | High performance and fast processing speed |
| Machine Learning Based Phishing Detection from URLs (Sahingoz et al., 2019) | Custom dataset + NLP | 73,575 URLs (36,400 legitimate, 37,175 phishing) | DT, AdaBoost, K-star, kNN, RF, SMO, Naive Bayes | Scalable, real-time, detects new phishing attempts |
| PhishStorm: Detecting Phishing with Streaming Analytics (Marchal et al., 2014) | PhishStorm – real-time detection | PhishTank, DMOZ: URLs + search engine queries | Classical ML on URL components | 94.91% accuracy, 1.44% false positives (FP) |
| A Machine Learning Based Approach for Phishing Detection Using Hyperlinks Information (Jain and Gupta, 2019) | HTML hyperlinks analysis | PhishTank, OpenPhish, Alexa: Hyperlinks from source code | Logistic Regression + 12 hyperlink features | Achieved 98.4% accuracy, language-independent |
| A New Hybrid Ensemble Feature Selection Framework for Machine Learning-Based Phishing Detection System (Chiew et al., 2019) | HEFS + CDF-g for optimal feature selection | Multiple sources | Ensemble framework | Improves accuracy through optimal feature selection |
| A Stacking Model Using URL and HTML Features for Phishing Webpage Detection (Li et al., 2019) | Stacking model on URL + HTML features | PhishTank (2k webpages) + Alexa (49,947 webpages) | Combined SVM, NN, DT, RF | High accuracy, stacking outperforms individual models |
| CANTINA+: A Feature-Rich Machine Learning Framework for Detecting Phishing Web Sites (Xiang et al., 2011) | Extraction of 15 high-level webpage characteristics from URLs, HTML DOM, 3rd party services, search engines | Diverse Web resources | SVM, Logistic Regression, Bayesian Network, J48, Random Forest, AdaBoost | Good TP/FP rate, competitive solution |
| A Comprehensive Survey of AI-Enabled Phishing Attacks Detection Techniques (Basit et al., 2021) | Review on phishing | Diverse datasets | RF, SVM, kNN | ML and DL reach up to 99% accuracy, outperforming heuristic and data mining approaches |
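
Several of the works in Table 8 derive features directly from the URL before applying a classifier. As an illustration only, the following minimal Python sketch shows what such lexical URL feature extraction might look like. The function name `url_features` and the specific features are our own hypothetical choices and do not reproduce the feature set of any cited article.

```python
import re
from urllib.parse import urlparse

# Matches a host written as a dotted-quad IPv4 address (illustrative check only;
# it does not validate that each octet is <= 255).
IPV4_HOST = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def url_features(url: str) -> dict:
    """Return a few simple lexical features of a URL (hypothetical feature set)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "host_dots": host.count("."),        # many dots may indicate deep subdomains
        "host_hyphens": host.count("-"),
        "has_at_symbol": "@" in url,         # '@' can hide the real host
        "host_is_ipv4": bool(IPV4_HOST.match(host)),
        "uses_https": parsed.scheme == "https",
        "path_depth": len([p for p in parsed.path.split("/") if p]),
    }


if __name__ == "__main__":
    for example in ("http://192.168.0.1/login/verify",
                    "https://secure-pay.example.com/"):
        print(example, url_features(example))
```

A feature dictionary of this kind would typically be converted to a numeric vector and passed to one of the classifiers listed in the table (e.g., RF or SVM); that step is omitted here.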