
OALib Journal
ISSN: 2333-9721


A Comparative Study of Ensemble Learning Techniques and Classification Models to Identify Phishing Websites

DOI: 10.4236/oalib.1113566, PP. 1-22

Subject Areas: Network Modeling and Simulation

Keywords: Ensemble Learning, Phishing Detection, Classification Models, Cybersecurity, Website Security


Abstract

The internet has significantly changed human interaction and business operations around the world, but this evolution has also brought security problems. Phishing attacks are among the most serious of these for internet users, leading to financial loss and identity theft. Because machine learning and ensemble learning models can process large datasets, capture complex relationships, and learn from data, they have made it easier to detect phishing websites, which have become a major concern in modern security. This study presents a comprehensive analysis of ensemble techniques, focusing on Random Forest, Gradient Boosting, and AdaBoost, alongside the traditional classification techniques Logistic Regression, Decision Trees, and Support Vector Machines (SVM). The models are evaluated on a benchmark dataset containing phishing and legitimate website samples, using accuracy, precision, recall, F1-score, and AUC-ROC. Particular attention is given to the performance of the Random Forest and Gradient Boosting ensemble models relative to the single classifiers. The findings show that the ensemble techniques perform better in terms of true positive rate, false positive rate, and overall performance. The research therefore reinforces that ensemble learning methods can provide robustness, flexibility, and efficiency under practical conditions. There remains room for improvement, however, in developing and applying more advanced algorithms.
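The evaluation metrics named in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' code: the function names, the 0.5 decision threshold, and the eight-sample toy data are all assumptions made for illustration, with label 1 standing for a phishing site and label 0 for a legitimate one.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn), treating label 1 (phishing) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def safe_div(num: float, den: float) -> float:
    """Divide, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall (true positive rate), F1-score, and false positive rate."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)  # recall is the true positive rate
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "fpr": safe_div(fp, fp + tn),
    }


def auc_roc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC-ROC as the probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is quadratic in the sample
    count, which is fine for a small illustration.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 4 phishing (1) and 4 legitimate (0) sites with model scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
auc = auc_roc(y_true, scores)
print(metrics)
print("AUC-ROC:", auc)
```

The threshold-based metrics depend on the chosen cut-off, whereas AUC-ROC summarizes ranking quality across all thresholds, which is one reason comparative studies tend to report both kinds.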

Cite this paper

Budoen, A. T., Zhang, M. and Jr., L. Z. E. (2025). A Comparative Study of Ensemble Learning Techniques and Classification Models to Identify Phishing Websites. Open Access Library Journal, 12, e3566. doi: http://dx.doi.org/10.4236/oalib.1113566.

