Optimizing content freshness of relations extracted from the web using keyword search

Authors :: Mohan Yang
Lipyeow Lim
Min Wang
Haixun Wang
Source :: SIGMOD Conference
Publication Year :: 2010
Publisher :: ACM, 2010.
Abstract: An increasing number of applications operate on data obtained from the Web. These applications typically maintain local copies of the web data to avoid network latency in data accesses. As the data on the Web evolves, it is critical that the local copy be kept up-to-date. Data freshness is one of the most important data quality issues, and has been extensively studied for various applications including web crawling. However, web crawling is focused on obtaining as many raw web pages as possible. Our applications, on the other hand, are interested in specific content from specific data sources. Knowing the content or the semantics of the data enables us to differentiate data items based on their importance and volatility, which are key factors that impact the design of the data synchronization strategy. In this work, we formulate the concept of content freshness, and present a novel approach that maintains content freshness with the least amount of web communication. Specifically, we assume data is accessible through a general keyword search interface, and we form keyword queries based on their selectivity, as well as their contribution to content freshness of the local copy. Experiments show the effectiveness of our approach compared with several naive methods for keeping data fresh.

Subjects :: Web analytics
Information retrieval
Database
Computer science
World Wide Web
Keyword density
Data quality
Web page
Distributed web crawling
Mashup
Web service
Web crawler
Data Web

Database :: OpenAIRE
Journal :: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Accession number :: edsair.doi...........dd73d6f097bec8eda1327f0170105a6f
Full Text :: https://doi.org/10.1145/1807167.1807256
