The internet has transformed human interaction and business operations around the world, yet this evolution has also brought security problems. Phishing attacks are among the most serious threats to internet users and lead to financial loss and identity theft. Because machine learning and ensemble learning models can process large datasets, capture complex relationships, and learn from data, they have made it easier to detect phishing websites, which have become a major concern in modern security research. This study presents a comprehensive analysis of ensemble techniques, focusing on Random Forest, Gradient Boosting, and AdaBoost, alongside traditional classification techniques such as Logistic Regression, Decision Trees, and Support Vector Machines (SVM). The models are evaluated on a benchmark dataset containing phishing and legitimate website samples, using accuracy, precision, recall, F1-score, and AUC-ROC as evaluation metrics. Particular attention is given to how the Random Forest and Gradient Boosting ensemble models perform compared with their single-classifier counterparts. The findings show that the ensemble techniques perform better in terms of true positive rate, false positive rate, and overall performance. The results therefore support the view that ensemble learning methods offer robustness, flexibility, and efficiency under practical conditions. However, there is still room for improvement in developing and applying more advanced algorithms.
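As a rough illustration of the evaluation metrics named above, the following minimal sketch computes accuracy, precision, recall, F1-score, false positive rate, and AUC-ROC for a binary phishing classifier. The label convention (1 = phishing, 0 = legitimate), the toy labels and scores, the 0.5 threshold, and the function names are all assumptions made for illustration; they are not the study's data or implementation.

```python
def classification_metrics(y_true, y_pred):
    """Threshold-based metrics for binary labels (1 = phishing, 0 = legitimate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # phishing caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate passed
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # phishing missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    denom = precision + recall
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / denom if denom else 0.0,
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,  # false positive rate
    }


def auc_roc(y_true, scores):
    """AUC-ROC as the probability that a random phishing sample is scored
    above a random legitimate one (ties count as half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: true labels and a model's phishing scores.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
auc = auc_roc(y_true, scores)
print(metrics, auc)
```

In a comparison such as the one described, the same functions would be applied to the predictions of each model on a held-out test set, so that single classifiers and ensembles are compared on identical terms.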
Cite this paper
Budoen, A. T., Zhang, M. and Jr., L. Z. E. (2025). A Comparative Study of Ensemble Learning Techniques and Classification Models to Identify Phishing Websites. Open Access Library Journal, 12, e3566. doi: http://dx.doi.org/10.4236/oalib.1113566.