Design and analyses of web scraping on burstable virtual machines.

Authors :
Drummond, Lúcia Maria A.
Andrade, Luciano
Muniz, Pedro de Brito
Pereira, Matheus Marotti
Silva, Thiago do Prado
Teylo, Luan
Source :
Concurrency & Computation: Practice & Experience; 4/25/2024, Vol. 36 Issue 9, p1-13, 13p
Publication Year :
2024

Abstract

Summary: Web scraping is a widely used technique for decision‐making, collecting, and structuring public data from the internet. As the volume of data continues to grow, the need for more efficient methods of data extraction becomes crucial. This article introduces a novel web scraping framework that utilizes Burstable virtual machines (VMs) on Amazon Web Services with the objective of reducing the monetary cost of execution while ensuring compliance with service level agreements (SLAs). To achieve this, the framework utilizes a combination of fixed and temporary Burstable VMs in a mixed cluster, which can be elastically scaled up to fulfill the SLA and scaled down to minimize monetary costs. Two strategies for handling VM allocation are proposed and evaluated: (i) a queue and SLA‐based strategy that employs queue size information and SLA criteria to determine the required number of VMs for the current scraping requests, and (ii) a credit‐based strategy that incorporates information about Burstable VM credits to effectively manage instance creation and termination. Experimental tests show that the proposed framework meets the defined SLA while achieving cost reductions of up to 74% compared to an approach that executes on fixed‐size clusters of Burstable instances. [ABSTRACT FROM AUTHOR]
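As an illustration only, the queue and SLA-based strategy in (i) could be sketched as below. The abstract does not give the sizing rule, so everything here is an assumption rather than the authors' method: the function name `required_vms`, its parameters, and the rule itself (spread the pending work evenly across VMs so the queue drains within the SLA window, never dropping below the fixed VMs or exceeding a cap).

```python
import math


def required_vms(queue_size, avg_seconds_per_request, sla_seconds,
                 fixed_vms, max_vms):
    """Estimate how many VMs are needed to drain the queue within the SLA.

    Hypothetical sizing rule, not taken from the article: total pending
    work is divided by the SLA window, and the result is clamped between
    the always-on fixed VMs and an upper limit on the cluster size.
    """
    if sla_seconds <= 0:
        raise ValueError("sla_seconds must be positive")
    if fixed_vms > max_vms:
        raise ValueError("fixed_vms cannot exceed max_vms")

    # Total sequential work (in seconds) currently waiting in the queue.
    pending_work = queue_size * avg_seconds_per_request

    # VMs needed if the work is spread evenly across them.
    needed = math.ceil(pending_work / sla_seconds)

    # Scale up to meet the SLA, scale down no further than the fixed VMs.
    return max(fixed_vms, min(needed, max_vms))
```

Under these assumptions, 120 queued requests at 2 seconds each with a 60-second SLA call for 4 VMs; an empty queue falls back to the fixed VMs, and any temporary VMs above that number would be candidates for termination. A credit-based variant, as in (ii), would additionally consult each Burstable VM's remaining CPU credits before creating or terminating instances.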

Details

Language :
English
ISSN :
1532-0626
Volume :
36
Issue :
9
Database :
Complementary Index
Journal :
Concurrency & Computation: Practice & Experience
Publication Type :
Academic Journal
Accession number :
176213993
Full Text :
https://doi.org/10.1002/cpe.7999