Phishing detection algorithm based on attention and feature fusion

Authors :: Sirui ZHANG, Zhiwei YAN, Kejun DONG, Xuebiao YUCHI
Source :: 网络与信息安全学报, Vol 10, Iss 4, Pp 123-131 (2024)
Publication Year :: 2024
Publisher :: POSTS&TELECOM PRESS Co., LTD, 2024.
Abstract: Phishing remains the primary means by which attackers conduct cyber fraud. As national anti-fraud efforts intensify, the technical contest between phishing and its detection has also escalated, putting significant pressure on detection work. For instance, current phishing attacks often use images in place of text and apply small shifts or rotations to high-weight website logo images, evading traditional detection algorithms that rely on text or image features. To counter these increasingly adversarial techniques, a phishing detection algorithm based on the attention mechanism and feature fusion was proposed, and a hierarchical classification model was established. The model comprised two stages of fusion over domain names, web page structure, web page text, and web page icons, and was able to counter the various evasion strategies employed by attackers. In the first stage, the algorithm exploited the low cost of a lightweight machine learning model to pre-recall a subset of suspicious domain names from a large pool of domain names, by fusing features of domain names, text, and web page structure. In the second stage, operating only on that candidate subset, the attention mechanism was introduced to strengthen the extraction of global text association features between each sample and the object it counterfeits. The contrast features between the icons of the sample and those of the counterfeited object were also strengthened, and a deep classification model fusing text and image features was established. The effectiveness of the algorithm was verified. Because image data is extracted only for the recalled candidates rather than for every domain name under inspection, this hierarchical method significantly improves detection efficiency while maintaining detection accuracy.
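The two-stage flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, thresholds, brand list, and both toy scoring functions are hypothetical stand-ins, chosen only to show how a cheap first-stage filter limits the number of samples that reach the expensive text-and-icon model.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical brand keywords; the abstract does not specify any.
BRANDS = ("paypal", "examplebank")


def toy_cheap_score(domain: str) -> float:
    """Stand-in for the lightweight stage-1 model (assumed heuristic)."""
    name = domain.lower()
    score = 0.0
    if any(brand in name for brand in BRANDS):
        score += 0.5  # brand keyword appears in the domain name
    if "-" in name:
        score += 0.2  # hyphenated names are treated as more suspicious
    return score


def toy_deep_score(domain: str) -> float:
    """Stand-in for the stage-2 attention-based text + icon model.

    A lookup table replaces the real model so the sketch stays runnable.
    """
    table = {
        "paypal-login-secure.example": 0.9,
        "examplebank.example": 0.1,
    }
    return table.get(domain, 0.0)


def cascade_detect(
    domains: Iterable[str],
    cheap_score: Callable[[str], float],
    deep_score: Callable[[str], float],
    recall_threshold: float = 0.3,  # assumed value, tuned for high recall
    final_threshold: float = 0.5,   # assumed value
) -> Tuple[List[str], int]:
    """Return (flagged domains, number of stage-2 evaluations)."""
    # Stage 1: pre-recall a suspicious subset using cheap features only.
    candidates = [d for d in domains if cheap_score(d) >= recall_threshold]
    # Stage 2: run the expensive model on the candidates only.
    flagged = [d for d in candidates if deep_score(d) >= final_threshold]
    return flagged, len(candidates)


if __name__ == "__main__":
    pool = [
        "news.example",
        "paypal-login-secure.example",
        "examplebank.example",
        "shop.example",
    ]
    flagged, deep_calls = cascade_detect(pool, toy_cheap_score, toy_deep_score)
    print(flagged, deep_calls)
```

In this toy run only two of the four domains reach the second stage, which is the efficiency argument the abstract makes: image and deep text features are computed only for the pre-recalled candidates.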