Weakly supervised learning for an effective focused web crawler.
- Author
- Joe Dhanith, P.R., Saeed, Khalid, Rohith, G., and Raja, S.P.
- Subjects
- *HYPERLINKS, *WEBSITES, *AMBIGUITY, *SEARCH engines, *SCALABILITY
- Abstract
A focused crawler traverses the Web to collect only pages relevant to a particular topic, and is increasingly considered a way around the scalability issues of current general-purpose search engines. But the diversity of data on the Web confronts these crawlers with three significant problems: (i) inconsistency, (ii) ubiquity, and (iii) ambiguity, which misguide the crawl. To handle these issues, this paper proposes a weakly supervised Gated Recurrent Unit (GRU) mechanism for an adaptive focused web crawler framework that matches semantically relevant topics and web page content. The weakly supervised GRU model accepts the vector forms of the topic and the fetched web page as input, produces meaningful semantic vectors, and applies the Manhattan distance rule to compute the topical relevance of the web page. The proposed mechanism guides the focused crawler to download more relevant web pages by following the hyperlinks relevant to the topic and omitting the irrelevant ones. It also helps the crawler semantically find, arrange, and index web pages in a relatively narrow segment of the Web, addressing the inconsistency, ubiquity, and ambiguity problems of focused crawlers. The experimental results indicate that the proposed technique outperforms state-of-the-art approaches in terms of harvest rate, precision, recall, harmonic mean, and irrelevance ratio. In summary, the strategy described here works well and is important for focused crawlers. [ABSTRACT FROM AUTHOR]
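The relevance-scoring idea the abstract describes, encoding the topic and the fetched page with a GRU and scoring them with a Manhattan distance rule, can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: the siamese (shared-weight) encoder, the `exp(-L1)` similarity, all dimensions, and all variable names are assumptions made here for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 8   # hidden-state size (illustrative assumption)
EMBED = 6    # token-embedding size (illustrative assumption)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUEncoder:
    """Minimal GRU that encodes a sequence of token embeddings into one vector.

    One shared encoder is used for both the topic and the page, i.e. a
    siamese arrangement (an assumption, not confirmed by the abstract).
    """
    def __init__(self, embed, hidden):
        s = 1.0 / np.sqrt(hidden)
        # Update-gate, reset-gate, and candidate weights over [input; hidden].
        self.Wz = rng.uniform(-s, s, (hidden, embed + hidden))
        self.Wr = rng.uniform(-s, s, (hidden, embed + hidden))
        self.Wh = rng.uniform(-s, s, (hidden, embed + hidden))

    def encode(self, xs):
        h = np.zeros(HIDDEN)
        for x in xs:
            xh = np.concatenate([x, h])
            z = sigmoid(self.Wz @ xh)                              # update gate
            r = sigmoid(self.Wr @ xh)                              # reset gate
            h_tilde = np.tanh(self.Wh @ np.concatenate([x, r * h]))
            h = (1 - z) * h + z * h_tilde                          # interpolate
        return h

def manhattan_relevance(u, v):
    # exp(-L1 distance): 1.0 for identical vectors, decaying toward 0
    # as the semantic vectors diverge.
    return float(np.exp(-np.sum(np.abs(u - v))))

encoder = GRUEncoder(EMBED, HIDDEN)
topic = [rng.standard_normal(EMBED) for _ in range(3)]  # toy "topic" tokens
page = [rng.standard_normal(EMBED) for _ in range(5)]   # toy "page" tokens

score = manhattan_relevance(encoder.encode(topic), encoder.encode(page))
print(round(score, 4))
```

A crawler built on this scheme would follow a hyperlink only when the score of its surrounding context or target page exceeds some relevance threshold, discarding the irrelevant links.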
- Published
- 2024