1. LogiTriBlend: A Novel Hybrid Stacking Approach for Enhanced Phishing Email Detection Using ML Models and Vectorization Approach
- Author
-
Aqsa Khalid, Maria Hanif, Abdul Hameed, Zeeshan Ashraf, Mrim M. Alnfiai, and Salma M. Mohsen Alnefaie
- Subjects
TF-IDF ,Word2Vec ,Doc2Vec ,LogiTriBlend ,SVM ,XGBoost ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
Email phishing remains a prevalent and sophisticated cyber threat, targeting individuals and organizations by disguising malicious intent in seemingly legitimate communications. Effective classification of phishing and legitimate emails is crucial for cybersecurity. In this study, we investigated various text vectorization techniques and machine learning models to address the challenge of email classification. We utilized three vectorization techniques: TF-IDF, Word2Vec, and Doc2Vec. These techniques were applied to traditional machine learning algorithms, and their performance was evaluated against a proposed stacking model, LogiTriBlend. The dataset comprised 501 phishing and 4090 legitimate emails, undergoing preprocessing steps like stemming, lemmatization, and noise removal. To handle the dataset’s imbalance, Synthetic Minority Over-sampling Technique (SMOTE) was employed. The model combines multiple base learners, including Support Vector Machine (SVM), Logistic Regression, Random Forest, and XGBoost, with a Logistic Regression meta-learner. The experimental results indicated that the LogiTriBlend model achieved an accuracy of 99.34% using Doc2Vec, outperforming Word2Vec and TF-IDF feature extraction methods, which obtained accuracies of 99.12% and 98.80%, respectively. The Doc2Vec method resulting in superior email classification performance. Among the models tested, the proposed stacking model, LogiTriBlend, demonstrated robust results; however, the highest accuracy was consistently achieved using Doc2Vec.
- Published
- 2024
- Full Text
- View/download PDF