1. Information Extraction from Security-Related Datasets
- Author
- Seljan, Sanja; Tolj, Nevenka; Dunđer, Ivan; Skala, Karolj
- Subjects
- information extraction, machine learning, corpus analysis, security datasets, information security, information and communication sciences
- Abstract
- Security breaches in electronic communication environments are now widespread and can be carried out in many ways; phishing is among the most common. The vast majority of phishing attacks are initiated through electronic messages, which attackers use to direct users to harmful or fake websites, to infect computers, or to obtain personal or sensitive data for malicious purposes. Identifying phishing messages is therefore necessary to protect users adequately. Numerous studies have applied machine learning algorithms and other artificial intelligence techniques, which depend predominantly on language-specific datasets and on the characteristics of phishing messages, and which have proven effective for extracting critical information and for data-driven decision making. However, phishing datasets exist mainly for English. The aim of this paper is to present an information extraction pipeline for Croatian phishing messages that comprises corpus pre-processing and the generation of phishing predictions with selected machine learning algorithms, along with a basic analysis, confusion matrices and evaluation scores. This kind of key information can be used for teaching in higher education, e.g. in security-related courses or in subjects that deal with artificial intelligence, machine learning, big data analysis, computational linguistics, etc. This is essential, as it can provide deeper insights into phishing attack strategies and potential countermeasures.
- Published
- 2023
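
The abstract mentions confusion matrices and evaluation scores as outputs of the pipeline. The following is a minimal sketch of how such scores are typically derived from predicted and actual labels; it is not the authors' implementation. The function names, the label strings "phishing" and "legitimate", and the six toy labels are all assumptions made up for illustration and do not come from the paper or its dataset.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive="phishing"):
    """Count true/false positives and negatives for a binary task."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def evaluation_scores(counts):
    """Derive accuracy, precision, recall and F1 from confusion counts."""
    tp, fp, fn, tn = (counts[k] for k in ("tp", "fp", "fn", "tn"))
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels, invented for illustration only (not data from the paper).
y_true = ["phishing", "phishing", "legitimate",
          "legitimate", "phishing", "legitimate"]
y_pred = ["phishing", "legitimate", "legitimate",
          "phishing", "phishing", "legitimate"]

counts = confusion_counts(y_true, y_pred)
scores = evaluation_scores(counts)
print(dict(counts))
print(scores)
```

In practice, the predicted labels would come from a trained classifier applied to pre-processed messages; the sketch covers only the evaluation step.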