DISET: a distance based semi-supervised self-training for automated users' agent activity detection from web access log.
- Source :
- Multimedia Tools & Applications; May 2023, Vol. 82, Issue 13, p19853-19876, 24p
- Publication Year :
- 2023
Abstract
- Detecting automated users' agent activities at any web application through users' web access logs is a challenging issue. Many machine learning-based automated solutions exist to address this issue. However, the existing supervised learning methods depend heavily on fully labeled data, and the scarcity of labeled access log data and the cost of labeling keep the problem challenging. Unsupervised learning-based solutions have also been proposed, but their accuracy is questionable. Semi-supervised self-training works with a small set of labeled data but lacks both a suitable metric for selecting high-confidence predictions and a reliable base learner. In this paper, we propose a new semi-supervised self-training method using probability-based selection criteria with the Mahalanobis distance, named DIstance-based SElf-Training (DISET), for detecting automated users' agent activities. DISET uses the probability-based selection criteria with the Mahalanobis distance to achieve high-confidence subset selection. The DISET framework works in four steps. First, data preprocessing performs data cleaning, session identification, feature extraction, and session labeling. Second, the data are segmented into labeled and unlabeled datasets. Third, model self-training performs the subset selection, using each of six supervised base learners independently. Fourth, the performance of the trained model is tested. The performance of DISET is evaluated on the NASA95 and E-commerce web log datasets using three-fold cross-validation for training and testing. The datasets are also divided into different ratios of labeled and unlabeled instances for the experiments. Performance is recorded in terms of accuracy, precision, recall, F1 score, and the Matthews Correlation Coefficient (MCC), and the model's performance is compared across the six base classifiers. ROC and PR curves are also plotted to confirm and compare the performance of the different base learners with the DISET method. Out of the six base learners, XGBoost performed best on both datasets at the 30:70 labeled-to-unlabeled segmentation ratio. The results show that DISET achieves a minimum improvement of 1.91% in accuracy, 2.70% in precision, 3.65% in sensitivity, and 1.00% in F1 score with large unlabeled datasets. [ABSTRACT FROM AUTHOR]
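The self-training step summarized in the abstract (probability-based selection combined with the Mahalanobis distance, applied over a chosen base learner) can be pictured with a minimal sketch. The Python code below is not the authors' implementation; it assumes NumPy arrays as inputs, scikit-learn-style estimators, a fixed probability threshold (`prob_thresh`), a distance quantile cutoff (`dist_quantile`), and the helper names `mahalanobis_scores` and `self_train`, all of which are illustrative choices rather than details taken from the paper.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis
from sklearn.ensemble import GradientBoostingClassifier  # stand-in base learner


def mahalanobis_scores(X, X_ref):
    """Mahalanobis distance of each row of X to the mean of the reference set X_ref."""
    mu = X_ref.mean(axis=0)
    # Regularize the covariance so the inverse exists even for few reference samples.
    cov = np.cov(X_ref, rowvar=False) + 1e-6 * np.eye(X_ref.shape[1])
    vi = np.linalg.inv(cov)
    return np.array([mahalanobis(x, mu, vi) for x in X])


def self_train(X_lab, y_lab, X_unl, prob_thresh=0.9, dist_quantile=0.5, max_iter=10):
    """Pseudo-label unlabeled sessions that the base learner predicts with high
    probability AND that lie close (in Mahalanobis distance) to the labeled
    samples of the predicted class, then retrain and repeat."""
    clf = GradientBoostingClassifier()
    for _ in range(max_iter):
        clf.fit(X_lab, y_lab)
        if len(X_unl) == 0:
            break
        proba = clf.predict_proba(X_unl)
        pred = clf.classes_[proba.argmax(axis=1)]
        conf = proba.max(axis=1)
        keep = np.zeros(len(X_unl), dtype=bool)
        for c in clf.classes_:
            mask = (pred == c) & (conf >= prob_thresh)
            if not mask.any():
                continue
            d = mahalanobis_scores(X_unl[mask], X_lab[y_lab == c])
            close = d <= np.quantile(d, dist_quantile)  # keep the nearer fraction
            keep[np.flatnonzero(mask)[close]] = True
        if not keep.any():
            break  # no confident, close predictions left to add
        # Move accepted pseudo-labeled sessions into the labeled pool and retrain.
        X_lab = np.vstack([X_lab, X_unl[keep]])
        y_lab = np.concatenate([y_lab, pred[keep]])
        X_unl = X_unl[~keep]
    return clf
```

In this sketch the distance filter keeps only pseudo-labeled sessions that lie close to the labeled samples of their predicted class, which is one plausible reading of "probability-based selection criteria with Mahalanobis distance"; the scikit-learn classifier is a placeholder, and an XGBoost model (the best performer in the paper) or any of the other base learners could be substituted directly.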
Details
- Language :
- English
- ISSN :
- 1380-7501
- Volume :
- 82
- Issue :
- 13
- Database :
- Complementary Index
- Journal :
- Multimedia Tools & Applications
- Publication Type :
- Academic Journal
- Accession number :
- 163255958
- Full Text :
- https://doi.org/10.1007/s11042-022-14258-0