98 results for "B. Barla Cambazoglu"
Search Results
2. A review of public datasets in question answering research
- Author
-
Bruce Croft, Falk Scholer, B. Barla Cambazoglu, and Mark Sanderson
- Subjects
Hardware and Architecture, Computer science, Question answering, Data science, Management Information Systems
- Abstract
Recent years have seen an increase in the number of publicly available datasets that are released to foster research in question answering systems. In this work, we survey the available datasets and also provide a simple, multi-faceted classification of those datasets. We further survey the most recent evaluation results that form the current state of the art in question answering research by exploring related research challenges and associated online leaderboards. Finally, we provide a discussion around the existing online challenges and provide a wishlist of datasets whose release could benefit question answering research in the future.
- Published
- 2020
3. Impact of response latency on sponsored search
- Author
-
Xiao Bai and B. Barla Cambazoglu
- Subjects
Computer science, Usability, Library and Information Sciences, Management Science and Operations Research, Computer Science Applications, World Wide Web, Search engine, Media Technology, Revenue, Ad serving, Latency, Mobile device, Information Systems
- Abstract
Recent research in the human computer interaction and information retrieval areas has revealed that search response latency exhibits a clear impact on the user behavior in web search. Such impact is reflected both in users' subjective perception of the usability of a search engine and in their interaction with the search engine in terms of the number of search results they engage with. However, a similar impact analysis has been missing so far in the context of sponsored search. Since the predominant business model for commercial search engines is advertising via sponsored search results (i.e., search advertisements), understanding how response latency influences the user interaction with the advertisements displayed on the search engine result pages is crucial to increase the revenue of a commercial search engine. To this end, we conduct a large-scale analysis using query logs obtained from a commercial web search engine. We analyze the short-term and long-term impact of search response latency on the querying and clicking behaviors of users using desktop and mobile devices to access the search engine, as well as the corresponding impact on the revenue of the search engine. This analysis demonstrates the importance of serving sponsored search results with low latency and provides insight into the ad serving policy of commercial search engines to ensure long-term user engagement and search revenue.
- Published
- 2019
4. Quantifying Human-Perceived Answer Utility in Non-factoid Question Answering
- Author
-
Valeriia Baranova, Bruce Croft, Mark Sanderson, Leila Tavakoli, Falk Scholer, and B. Barla Cambazoglu
- Subjects
Correctness, Computer science, Factoid, Completeness, Question answering, Relevance (information retrieval), Quality, Artificial intelligence, Natural language processing
- Abstract
Taking a user-centric approach, we study the features that render an answer to a non-factoid question useful in the eyes of the person who asked that question. An editorial study, where participants assess the usefulness of the answers they received in response to their questions, as well as 12 different aspects associated with the answers, indicates considerable correlation between certain aspects, such as relevance, correctness, and completeness, and the user-perceived usefulness of answers. Moreover, we investigate the effectiveness of some commonly used answer quality measures, such as ROUGE, BLEU, METEOR, and BERTScore, demonstrating that these measures are limited in their ability to capture the aspects of usefulness and have room for improvement. The question answering dataset created in our work has been made publicly available.
- Published
- 2021
5. An Intent Taxonomy for Questions Asked in Web Search
- Author
-
B. Barla Cambazoglu, Mark Sanderson, Bruce Croft, Leila Tavakoli, and Falk Scholer
- Subjects
Interrogative word, Information retrieval, Computer science, Ambiguity, Search engine, Taxonomy, Web search engine, Granularity
- Abstract
We present a new, multi-faceted taxonomy to classify questions asked in web search engines based on the question intent, types of entities mentioned, types of question words, and granularity of the expected answer. Built based on the inspection of 1,000 real-life questions issued to a web search engine, the taxonomy reflects the recent search behavior of users and enables deep understanding of user intents, goals, and expected answers. This taxonomy is more fine-grained than previous query taxonomies, and is designed with the ultimate goal of reducing the inherent ambiguity in determining the intent of questions. In addition, we describe the formal procedure for conducting an editorial study of the taxonomy including its evaluation. The adopted procedure aims to increase assessor agreement without incurring too much overhead. Our results demonstrate that, despite being more fine-grained, the proposed intent categories result in higher agreement between assessors compared to an existing, commonly used taxonomy.
- Published
- 2021
6. Providing Direct Answers in Search Results: A Study of User Behavior
- Author
-
W. Bruce Croft, B. Barla Cambazoglu, Falk Scholer, Zhijing Wu, and Mark Sanderson
- Subjects
Information retrieval, Computer science, Factoid, Reading, Question answering, Eye tracking, Search engine results page
- Abstract
To study the impact of providing direct answers in search results on user behavior, we conducted a controlled user study to analyze factors including reading time, eye-tracked attention, and the influence of the quality of answer module content. We also studied a more advanced answer interface, where multiple answers are shown on the search engine results page (SERP). Our results show that users focus more extensively than normal on the top items in the result list when answers are provided. The existence of the answer module helps to improve user engagement on SERPs, reduces user effort, and promotes user satisfaction during the search process. Furthermore, we investigate how the question type -- factoid or non-factoid -- affects user interaction patterns. This work provides insight into the design of SERPs that includes direct answers to queries, including when answers should be shown.
- Published
- 2020
7. Feature Extraction for Large-Scale Text Collections
- Author
-
Luke Gallagher, Antonio Mallia, J. Shane Culpepper, Torsten Suel, and B. Barla Cambazoglu
- Subjects
Feature engineering, Information retrieval, Computer science, Feature extraction, Recommender system, Pipeline (software), Software, Learning to rank
- Abstract
Feature engineering is a fundamental but poorly documented component in Learning-to-Rank (LTR) search engines. Such features are commonly used to construct learning models for web and product search engines, recommender systems, and question-answering tasks. In each of these domains, there is a growing interest in the creation of open-access test collections that promote reproducible research. However, there are still few open-source software packages capable of extracting high-quality machine learning features from large text collections. Instead, most feature-based LTR research relies on "canned" test collections, which often do not expose critical details about the underlying collection or implementation details of the extracted features. Both of these are crucial to collection creation and deployment of a search engine into production. As a result, the experiments are rarely reproducible with new features or collections, and are of little help to companies wishing to deploy LTR systems. In this paper, we introduce Fxt, an open-source framework to perform efficient and scalable feature extraction. Fxt can easily be integrated into complex, high-performance software applications to help solve a wide variety of text-based machine learning problems. To demonstrate the software's utility, we build and document a reproducible feature extraction pipeline and show how to recreate several common LTR experiments using the ClueWeb09B collection. Researchers and practitioners can benefit from Fxt to extend their machine learning pipelines for various text-based retrieval tasks, and learn how some static document features and query-specific features are implemented.
- Published
- 2020
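To illustrate the kind of query-document features an extraction framework like Fxt computes, here is a toy sketch. This is not Fxt's API; the function name and the choice of features are hypothetical, but the features themselves (document length, summed term frequency, idf-weighted term frequency) are standard LTR fare:

```python
import math
from collections import Counter

def extract_features(query, docs):
    """Compute a tiny LTR-style feature vector per document for one
    query: document length, summed term frequency of query terms, and
    an idf-weighted term-frequency sum."""
    n = len(docs)
    tokenized = {d: Counter(text.lower().split()) for d, text in docs.items()}
    terms = query.lower().split()
    # Document frequency of each query term across the collection.
    df = {t: sum(1 for tf in tokenized.values() if tf[t] > 0) for t in set(terms)}
    features = {}
    for d, tf in tokenized.items():
        doc_len = sum(tf.values())
        tf_sum = sum(tf[t] for t in terms)
        idf_tf = sum(tf[t] * math.log((n + 1) / (df[t] + 1)) for t in terms)
        features[d] = (doc_len, tf_sum, round(idf_tf, 4))
    return features

docs = {1: "fast web search", 2: "web crawling and indexing"}
feats = extract_features("web search", docs)
```

A production extractor would compute dozens of such features per query-document pair, over an inverted index rather than raw text.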
8. Pre-indexing Pruning Strategies
- Author
-
Soner Altin, Ricardo Baeza-Yates, and B. Barla Cambazoglu
- Subjects
Search engine indexing, Inverted index, Index pruning, Search efficiency, Web search, Query log
- Abstract
Paper presented at SPIRE 2020: International Symposium on String Processing and Information Retrieval, held October 13-15, 2020, in Orlando, United States. We explore different techniques for pruning an inverted index in advance, that is, without building the full index. These techniques provide interesting trade-offs between index size, answer quality, and query coverage. We experimentally analyze them in a large public web collection with two different query logs. The trade-offs that we find range from an index of 4% of the full size with 35% precision@10 to an index of 46% of the full size with 90% precision@10, with respect to the full index case. In both cases we cover almost 97% of the query volume. We also perform a relative relevance analysis with a smaller private web collection and query log, finding that some of our techniques allow a reduction of almost 40% in index size while losing less than 2% in NDCG@10.
- Published
- 2020
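The abstract above is about deciding which postings to discard before the full index is ever materialized. A minimal illustrative sketch, where the pruning rule (keep a posting only if the term's in-document frequency meets a threshold) is a hypothetical stand-in and not necessarily one of the paper's strategies:

```python
from collections import Counter, defaultdict

def build_pruned_index(docs, min_tf=2):
    """Build an inverted index directly in pruned form: a posting
    (doc_id, tf) is emitted only if tf >= min_tf, so the full index
    is never built and then pruned; pruning happens at indexing time."""
    index = defaultdict(list)  # term -> list of (doc_id, tf)
    for doc_id, text in docs.items():
        tf = Counter(text.lower().split())
        for term, freq in tf.items():
            if freq >= min_tf:
                index[term].append((doc_id, freq))
    return dict(index)

docs = {
    1: "web search web search engines",
    2: "search engine caching caching",
}
index = build_pruned_index(docs, min_tf=2)
```

Raising `min_tf` shrinks the index at the cost of answer quality and query coverage, which is exactly the trade-off the paper measures.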
9. Improving News Personalization Through Search Logs
- Author
-
Xiao Bai, Amin Mantrach, Francesco Gullo, Fabrizio Silvestri, and B. Barla Cambazoglu
- Subjects
Information retrieval, Computer science, Content personalization, Search activity, News articles, News personalization, User activity, Personalization, News service
- Abstract
Content personalization is a long-standing problem for online news services. In most personalization approaches users are represented by topical interest profiles that are matched with news articles in order to properly decide which articles are to be recommended. When constructing user profiles, existing personalization methods exploit the user activity observed within the news service itself without incorporating information from other sources.
- Published
- 2020
10. Scalability Challenges in Web Search Engines
- Author
-
B. Barla Cambazoglu and Ricardo Baeza-Yates
- Abstract
In this book, we aim to provide a fairly comprehensive overview of the scalability and efficiency challenges in large-scale web search engines. More specifically, we cover the issues involved in the design of three separate systems that are commonly available in every web-scale search engine: web crawling, indexing, and query processing systems. We present the performance challenges encountered in these systems and review a wide range of design alternatives employed as solutions to these challenges, specifically focusing on algorithmic and architectural optimizations. We discuss the available optimizations at different computational granularities, ranging from a single computer node to a collection of data centers. We provide some hints to both the practitioners and theoreticians involved in the field about the way large-scale web search engines operate and the adopted design choices. Moreover, we survey the efficiency literature, providing pointers to a large number of relatively important research papers. Finally, we discuss some open research problems in the context of search engine efficiency.
- Published
- 2022
11. Characterizing, predicting, and handling web search queries that match very few or no results
- Author
-
Roi Blanco, Rifat Ozcan, B. Barla Cambazoglu, Erdem Sarigil, Özgür Ulusoy, Ismail Sengor Altingovde, Ulusoy, Özgür, and Sarıgil, Erdem
- Subjects
Information Systems and Management, Information retrieval, Web search query, Computer Networks and Communications, Computer science, Library and Information Sciences, Query expansion, Search engine, Web query classification, Data mining, Information Systems
- Abstract
A non-negligible fraction of user queries end up with very few or even no matching results in leading commercial web search engines. In this work, we provide a detailed characterization of such queries and show that search engines try to improve such queries by showing the results of related queries. Through a user study, we show that these query suggestions are usually perceived as relevant. Also, through a query log analysis, we show that the users are dissatisfied after submitting a query that matches no results at least 88.5% of the time. As a first step towards solving these no-answer queries, we devised a large number of features that can be used to identify such queries and built machine-learning models. These models can be useful for scenarios such as mobile or meta-search, where identifying a query that will retrieve no results at the client device (i.e., even before submitting it to the search engine) may yield gains in terms of bandwidth usage, power consumption, and/or monetary costs. Experiments over query logs indicate that, despite the heavy skew in class sizes, our models achieve good prediction quality, with accuracy (in terms of area under the curve) up to 0.95.
- Published
- 2017
12. A machine learning approach for result caching in web search engines
- Author
-
B. Barla Cambazoglu, Tayfun Kucukyilmaz, Ricardo Baeza-Yates, Cevdet Aykanat, and Aykanat, Cevdet
- Subjects
Artificial intelligence, Search engine performance, Computer science, Library and Information Sciences, Management Science and Operations Research, Machine learning, Static caching, Static-dynamic caching, Query result caching, Search engines, Hit rate, Cache, Information Systems
- Abstract
To the best of our knowledge, our work is the first in the literature to apply machine learning techniques to the result caching problem in search engines, covering static, dynamic, and state-of-the-art static-dynamic cache organizations. We evaluate a large set of features and illustrate that they can be exploited to increase the hit rate of result caches. We evaluate various oracle caching strategies to illustrate the potential room for improvement in the result caching problem. We show that the proposed machine learning framework can improve the hit rate of result caches, potentially reducing the energy consumption in search engines. A commonly used technique for improving search engine performance is result caching. In result caching, precomputed results (e.g., URLs and snippets of best matching pages) of certain queries are stored in a fast-access storage. The future occurrences of a query whose results are already stored in the cache can be directly served by the result cache, eliminating the need to process the query using costly computing resources. Although other performance metrics are possible, the main performance metric for evaluating the success of a result cache is hit rate. In this work, we present a machine learning approach to improve the hit rate of a result cache by leveraging a large number of features extracted from search engine query logs. We then apply the proposed machine learning approach to static, dynamic, and static-dynamic caching. Compared to the previous methods in the literature, the proposed approach improves the hit rate of the result cache by up to 0.66%, which corresponds to 9.60% of the potential room for improvement.
- Published
- 2017
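The idea of filling a static result cache with a feature-based score, rather than raw frequency alone, can be sketched as follows. The scorer here is a hand-written linear stand-in for a learned model; the two features (past frequency, recency of last occurrence) and their weights are illustrative, not the paper's:

```python
def select_static_cache(query_log, capacity, weights=(1.0, 0.5)):
    """Pick which queries to admit to a fixed-size static result cache
    by scoring each query with a simple linear model over two log
    features. A trained model would replace the scorer."""
    freq, last_seen = {}, {}
    for t, q in enumerate(query_log):
        freq[q] = freq.get(q, 0) + 1
        last_seen[q] = t
    n = len(query_log)
    w_f, w_r = weights

    def score(q):
        recency = (last_seen[q] + 1) / n  # in (0, 1], higher = more recent
        return w_f * freq[q] + w_r * recency

    ranked = sorted(freq, key=score, reverse=True)
    return set(ranked[:capacity])

log = ["a", "b", "a", "c", "a", "b", "d"]
cache = select_static_cache(log, capacity=2)
hits = sum(q in cache for q in log)  # hit rate proxy over the same log
```

In practice the model would be trained on one slice of the query log and the cache evaluated on a later slice, since hit rate on future traffic is what matters.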
13. Exploiting search history of users for news personalization
- Author
-
Amin Mantrach, Fabrizio Silvestri, B. Barla Cambazoglu, Francesco Gullo, and Xiao Bai
- Subjects
Information Systems and Management, Information retrieval, Computer science, Computer Science Applications, Personalization, World Wide Web, Artificial Intelligence, Web search engine, Search history, Software
- Abstract
Content personalization is a long-standing problem for online news services. In most personalization approaches users of a news service are represented by topical interest profiles that are matched with news articles in order to properly decide which articles are to be recommended. When constructing user profiles, existing personalization methods exploit the user activity observed within the news service itself without incorporating additional information that can be obtained from other sources. In this paper we study the problem of news personalization by leveraging usage information that is external to the news service. We propose a novel approach that relies on the concept of search profiles, which are user profiles that are built based on the past interactions of the user with a web search engine. We extensively test our proposal on real-world datasets obtained from Yahoo. We explore various dimensions and granularities at which search profiles can be built. Experimental results show that, compared to a basic strategy that does not exploit the search activity of users, our approach is able to boost the clicks on news articles shown at the top positions of a ranked result list.
- Published
- 2017
14. On the feasibility of predicting popular news at cold start
- Author
-
Mounia Lalmas, Ioannis Arapakis, and B. Barla Cambazoglu
- Subjects
Web analytics, Information Systems and Management, Computer Networks and Communications, Computer science, Library and Information Sciences, Page view, Popularity, World Wide Web, Cold start, Social media, Information Systems
- Abstract
Prominent news sites on the web provide hundreds of news articles daily. The abundance of news content competing to attract online attention, coupled with the manual effort involved in article selection, necessitates the timely prediction of future popularity of these news articles. The future popularity of a news article can be estimated using signals indicating the article's penetration in social media (e.g., number of tweets) in addition to traditional web analytics (e.g., number of page views). In practice, it is important to make such estimations as early as possible, preferably before the article is made available on the news site (i.e., at cold start). In this paper we perform a study on cold-start news popularity prediction using a collection of 13,319 news articles obtained from Yahoo News, a major news provider. We characterize the popularity of news articles through a set of online metrics and try to predict their values across time using machine learning techniques on a large collection of features obtained from various sources. Our findings indicate that predicting news popularity at cold start is a difficult task, contrary to the findings of a prior work on the same topic. Most articles' popularity may not be accurately anticipated solely on the basis of content features, without having the early-stage popularity values.
- Published
- 2016
15. Optimal Web Page Download Scheduling Policies for Green Web Crawling
- Author
-
Iordanis Koutsopoulos, B. Barla Cambazoglu, and Vassiliki Hatzi
- Subjects
Web analytics, Web server, Computer Networks and Communications, Computer science, Download, Crawling, Upload, World Wide Web, Web page, Distributed web crawling, Web service, Web crawler, Site map
- Abstract
A web crawler is responsible for discovering and downloading new pages on the Web as well as refreshing previously downloaded pages. During these operations, the crawler issues a large number of HTTP requests to web servers. These requests increase the energy consumption and carbon footprint of the web servers since computational resources are used while serving the requests. In this work, we introduce the problem of green web crawling, where the objective is to devise a page refresh policy that minimizes the total staleness of pages in the repository of a web crawler, subject to a constraint on the amount of carbon emissions due to the processing on web servers. For the case of one web server and one crawling thread, the optimal policy turns out to be a greedy one. At each iteration, the page to be refreshed is selected based on a metric that considers the page’s staleness, its size, and the greenness of the energy consumed at the web server premises. We then extend the optimal policy to the cases of 1) many servers; 2) multiple threads; and 3) pages with variable freshness requirements. We conduct simulations on a real data set that involves a large web server collection hosting around two billion pages. We present experimental results for the optimal page refresh policy as well as for various heuristics, in an effort to study the effect of different factors on performance.
- Published
- 2016
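The greedy single-server policy described above refreshes, at each step, the page with the best trade-off between staleness and the carbon cost of re-downloading it. A toy rendition follows; the gain metric here (staleness divided by size, discounted by the green-energy fraction of the server) is a simplified stand-in, and the paper's exact metric may differ:

```python
def next_page_to_refresh(pages, green_fraction):
    """Greedy selection: refresh the page with the highest ratio of
    staleness to carbon cost. Cost is proportional to page size and
    shrinks as more of the server's energy comes from green sources."""
    def gain(page):
        carbon_cost = page["size"] * (1.0 - green_fraction)
        return page["staleness"] / max(carbon_cost, 1e-9)
    return max(pages, key=gain)

pages = [
    {"url": "/a", "staleness": 10.0, "size": 100},
    {"url": "/b", "staleness": 4.0, "size": 10},
    {"url": "/c", "staleness": 1.0, "size": 1},
]
# With half the server's energy green, the small cheap page wins even
# though it is the least stale: its per-unit-carbon payoff is highest.
chosen = next_page_to_refresh(pages, green_fraction=0.5)
```

Iterating this selection, and resetting a page's staleness to zero after each refresh, yields the greedy schedule for the one-server, one-thread case.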
16. Scalability Challenges in Web Search Engines
- Author
-
B. Barla Cambazoglu and Ricardo Baeza-Yates
- Subjects
Information Systems and Management, Information retrieval, Computer Networks and Communications, Computer science, Search engine indexing, Library and Information Sciences, Search engine, Open research, Scalability, Web crawler, Information Systems
- Abstract
In this book, we aim to provide a fairly comprehensive overview of the scalability and efficiency challenges in large-scale web search engines. More specifically, we cover the issues involved in the design of three separate systems that are commonly available in every web-scale search engine: web crawling, indexing, and query processing systems. We present the performance challenges encountered in these systems and review a wide range of design alternatives employed as solutions to these challenges, specifically focusing on algorithmic and architectural optimizations. We discuss the available optimizations at different computational granularities, ranging from a single computer node to a collection of data centers. We provide some hints to both the practitioners and theoreticians involved in the field about the way large-scale web search engines operate and the adopted design choices. Moreover, we survey the efficiency literature, providing pointers to a large number of relatively important research papers. Finally, we discuss some open research problems in the context of search engine efficiency.
- Published
- 2015
17. Sentiment-Focused Web Crawling
- Author
-
Pinar Senkul, B. Barla Cambazoglu, and A. Gural Vural
- Subjects
Web analytics, Information retrieval, Computer Networks and Communications, Computer science, Sentiment analysis, World Wide Web, Web mining, Web page, Distributed web crawling, Web content, Web crawler, Web intelligence
- Abstract
Sentiments and opinions expressed in Web pages towards objects, entities, and products constitute an important portion of the textual content available in the Web. In the last decade, the analysis of such content has gained importance due to its high potential for monetization. Despite the vast interest in sentiment analysis, somewhat surprisingly, the discovery of sentimental or opinionated Web content is mostly ignored. This work aims to fill this gap and addresses the problem of quickly discovering and fetching the sentimental content present in the Web. To this end, we design a sentiment-focused Web crawling framework. In particular, we propose different sentiment-focused Web crawling strategies that prioritize discovered URLs based on their predicted sentiment scores. Through simulations, these strategies are shown to achieve considerable performance improvement over general-purpose Web crawling strategies in discovery of sentimental Web content.
- Published
- 2014
18. User engagement in online News: Under the scope of sentiment, interest, affect, and gaze
- Author
-
Ioannis Arapakis, Mari Carmen Marcos, Mounia Lalmas, B. Barla Cambazoglu, and Joemon M. Jose
- Subjects
Information Systems and Management, Computer Networks and Communications, Sentimentality, Library and Information Sciences, Gaze, World Wide Web, Curiosity, Social media, Affect, Psychology, Information Systems
- Abstract
Online content providers, such as news portals and social media platforms, constantly seek new ways to attract large shares of online attention by keeping their users engaged. A common challenge is to identify which aspects of online interaction influence user engagement the most. In this article, through an analysis of a news article collection obtained from Yahoo News US, we demonstrate that news articles exhibit considerable variation in terms of the sentimentality and polarity of their content, depending on factors such as news provider and genre. Moreover, through a laboratory study, we observe the effect of sentimentality and polarity of news and comments on a set of subjective and objective measures of engagement. In particular, we show that attention, affect, and gaze differ across news of varying interestingness. As part of our study, we also explore methods that exploit the sentiments expressed in user comments to reorder the lists of comments displayed in news pages. Our results indicate that user engagement can be predicted if we account for the sentimentality and polarity of the content as well as other factors that drive attention and inspire human curiosity.
- Published
- 2014
19. Scalability Challenges in Web Search Engines
- Author
-
B. Barla Cambazoglu and Ricardo Baeza-Yates
- Subjects
- Web search engines, Computer networks--Scalability
- Abstract
In this book, we aim to provide a fairly comprehensive overview of the scalability and efficiency challenges in large-scale web search engines. More specifically, we cover the issues involved in the design of three separate systems that are commonly available in every web-scale search engine: web crawling, indexing, and query processing systems. We present the performance challenges encountered in these systems and review a wide range of design alternatives employed as solutions to these challenges, specifically focusing on algorithmic and architectural optimizations. We discuss the available optimizations at different computational granularities, ranging from a single computer node to a collection of data centers. We provide some hints to both the practitioners and theoreticians involved in the field about the way large-scale web search engines operate and the adopted design choices. Moreover, we survey the efficiency literature, providing pointers to a large number of relatively important research papers. Finally, we discuss some open research problems in the context of search engine efficiency.
- Published
- 2016
20. ECIR 2012
- Author
-
Mari Carmen Marcos, David E. Losada, Mounia Lalmas, Álvaro Barreiro, B. Barla Cambazoglu, Ricardo Baeza-Yates, Hugo Zaragoza, Arjen P. de Vries, Fabrizio Silvestri, Vanessa Murdock, and Ronny Lempel
- Subjects
Information retrieval, Hardware and Architecture, Computer science, Management Information Systems
- Abstract
The British Computer Society's Information Retrieval Specialist Group's European Conference on Information Retrieval (ECIR) is the main European forum for the presentation of new research results in the field of information retrieval. The conference has been running in various forms since 1979. The most recent editions of the conference were held in Rome, Italy (2007); Glasgow, UK (2008); Toulouse, France (2009); Milton Keynes, UK (2010); and Dublin, Ireland (2011). The 34th European Conference on Information Retrieval (ECIR 2012) was held at Pompeu Fabra University in Barcelona, Spain, from April 1 to April 5, 2012, chaired by Ricardo Baeza-Yates.
- Published
- 2012
21. Improved Caching Techniques for Large-Scale Image Hosting Services
- Author
-
Xiao Bai, B. Barla Cambazoglu, and Archie Russell
- Subjects
Database, CPU cache, Computer science, Data access, Cache, Image retrieval
- Abstract
Commercial image serving systems, such as Flickr and Facebook, rely on large image caches to avoid, as much as possible, retrieving requested images from the costly backend image store. Such systems serve the same image in different resolutions and, thus, in different sizes to different clients, depending on the properties of the clients' devices. The requested resolutions of images can be cached individually, as in traditional caches, reducing the backend workload. However, a potentially better approach is to store relatively high-resolution images in the cache and resize them at retrieval time to obtain lower-resolution images. Having this kind of on-the-fly image resizing capability enables image serving systems to deploy more sophisticated caching policies and further improve their serving performance. In this paper, we formalize the static caching problem in image serving systems that provide on-the-fly image resizing functionality in their edge or regional caches. We propose two gain-based caching policies that construct a static, fixed-capacity cache to reduce the average serving time of images. The basic idea in the proposed policies is to identify the best resolution(s) of images to be cached so that the average serving time for future image retrieval requests is reduced. We conduct extensive experiments using real-life data access logs obtained from Flickr. We show that one of the proposed caching policies reduces the average response time of the service by up to 4.2% with respect to the best-performing baseline, which mainly relies on access frequency information to make its caching decisions. This improvement implies about a 25% reduction in cache size under similar serving time constraints.
- Published
- 2016
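The gain-based policy described in the abstract above can be pictured with a small, hypothetical sketch: each cacheable image resolution gets a gain per cached byte (here, request frequency times serving time saved per request), and a fixed-capacity cache is filled greedily. The field names, gain formula, and numbers are illustrative assumptions, not the paper's actual policy.

```python
# Hypothetical sketch of a gain-based static cache fill for an image store
# that can resize a cached high resolution down to lower ones on the fly.

def fill_cache(images, capacity):
    """images: list of dicts with 'id', 'size' (bytes if cached),
    'freq' (expected requests), and 'saving' (serving time saved per
    request if cached). Greedily pick items with the best gain per byte."""
    ranked = sorted(images,
                    key=lambda im: im["freq"] * im["saving"] / im["size"],
                    reverse=True)
    cached, used = [], 0
    for im in ranked:
        if used + im["size"] <= capacity:
            cached.append(im["id"])
            used += im["size"]
    return cached

images = [
    {"id": "a_hi", "size": 6, "freq": 10, "saving": 3.0},
    {"id": "b_hi", "size": 8, "freq": 2,  "saving": 1.0},
    {"id": "c_lo", "size": 2, "freq": 9,  "saving": 1.5},
]
chosen = fill_cache(images, capacity=8)
```

With these numbers, the small-but-hot low resolution and the dense high-resolution item fit; the rarely requested item is left out.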
22. Introduction
- Author
-
B. Barla Cambazoglu and Ricardo Baeza-Yates
- Published
- 2016
23. Concluding Remarks
- Author
-
B. Barla Cambazoglu and Ricardo Baeza-Yates
- Published
- 2016
24. The Web Crawling System
- Author
-
B. Barla Cambazoglu and Ricardo Baeza-Yates
- Published
- 2016
25. The Query Processing System
- Author
-
B. Barla Cambazoglu and Ricardo Baeza-Yates
- Published
- 2016
26. The Indexing System
- Author
-
B. Barla Cambazoglu and Ricardo Baeza-Yates
- Published
- 2016
27. A five-level static cache architecture for web search engines
- Author
-
I. Sengor Altingovde, Rifat Ozcan, Flavio Junqueira, B. Barla Cambazoglu, Özgür Ulusoy, and Ulusoy, Özgür
- Subjects
Query processing ,Heuristic (computer science) ,Computer science ,Computational costs ,Library and Information Sciences ,Management Science and Operations Research ,computer.software_genre ,Static caching ,Search engine ,Query expansion ,Component (UML) ,Media Technology ,Greedy algorithm ,Caching decisions ,Web search query ,Information retrieval ,Database ,Cache architecture ,Cache-only memory architecture ,Greedy heuristics ,Computer Science Applications ,Access frequency ,Inter - dependencies ,Cache ,Web search engines ,computer ,Information Systems - Abstract
Caching is a crucial performance component of large-scale web search engines, as it greatly helps reduce average query response times and query processing workloads on backend search clusters. In this paper, we describe a multi-level static cache architecture that stores five different item types: query results, precomputed scores, posting lists, precomputed intersections of posting lists, and documents. Moreover, we propose a greedy heuristic to prioritize items for caching, based on gains computed by using items' past access frequencies, estimated computational costs, and storage overheads. This heuristic takes into account the inter-dependency between individual items when making its caching decisions, i.e., after a particular item is cached, the gains of all items that are affected by this decision are updated. Our simulations under realistic assumptions reveal that the proposed heuristic performs better than dividing the entire cache space among particular item types at fixed proportions.
- Published
- 2012
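The greedy heuristic with inter-dependent gains described above can be sketched roughly as follows: repeatedly pick the item with the best gain per byte, then lower the gains of the items it affects (e.g., caching a query's result removes that query's accesses to its posting lists). The item names, gain values, and dependency structure below are hypothetical illustrations, not the paper's data.

```python
# Illustrative sketch of a gain-based greedy cache fill where caching one
# item reduces the gains of dependent items.

def greedy_fill(items, deps, capacity):
    """items: {name: {'gain': float, 'size': int}}
    deps: {name: [(dependent_name, gain_reduction), ...]}"""
    items = {k: dict(v) for k, v in items.items()}  # work on a copy
    cached, used = [], 0
    while True:
        candidates = [(v["gain"] / v["size"], k) for k, v in items.items()
                      if k not in cached and used + v["size"] <= capacity]
        if not candidates:
            return cached
        _, best = max(candidates)         # highest gain per byte
        cached.append(best)
        used += items[best]["size"]
        for dep, cut in deps.get(best, []):   # update affected items
            items[dep]["gain"] = max(0.0, items[dep]["gain"] - cut)

items = {"result:q1": {"gain": 10.0, "size": 2},
         "list:t1":   {"gain": 8.0,  "size": 4},
         "doc:d1":    {"gain": 3.0,  "size": 2}}
deps = {"result:q1": [("list:t1", 6.0)]}   # q1's result covers t1's accesses
chosen = greedy_fill(items, deps, capacity=6)
```

Note how caching the query result demotes its posting list, so the document wins the remaining space.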
28. Session details: Afternoon Session
- Author
-
B. Barla Cambazoglu
- Subjects
Medical education ,Session (computer science) ,Psychology - Published
- 2015
29. Proceedings of the 2015 Workshop on Large-Scale and Distributed System for Information Retrieval
- Author
-
Nicola Tonellotto, B. Barla Cambazoglu, and Ismail Sengor Altingovde
- Subjects
Information retrieval ,Scale (ratio) ,Computer science - Published
- 2015
30. Know Your Onions
- Author
-
Luis A. Leiva, Ioannis Arapakis, and B. Barla Cambazoglu
- Subjects
World Wide Web ,Search engine ,Web search query ,User experience design ,Computer science ,business.industry ,Search analytics ,Web design ,Semantic search ,Web search engine ,Usability ,business ,Crowdsourcing - Abstract
The increasing availability of large volumes of human-curated content is shifting web search towards a paradigm that seamlessly introduces more semantic information into search engine result pages. This trend has resulted in the design of a new element known as the knowledge module (KM), where certain facts about named entities, obtained from various knowledge bases, are shown to users. So far, little has been done to uncover the role that this module plays in user experience in web search and whether it is perceived by users as a useful aid for their search tasks. Our work is an early attempt to bridge this gap. To this end, we conducted a crowdsourcing study aimed at understanding the effect of the KM on users' search experience and its overall utility. In particular, our study is the first to provide insights about the noticeability and usefulness of the KM in web search, together with comprehensive analyses of usability and workload.
- Published
- 2015
31. Site-Based Partitioning and Repartitioning Techniques for Parallel PageRank Computation
- Author
-
Ali Cevahir, B. Barla Cambazoglu, Cevdet Aykanat, Ata Turk, and Aykanat, Cevdet
- Subjects
Sparse Matrix Partitioning ,Computer science ,Sparse Matrix-vector Multiplication ,Parallelization ,Graph partition ,Sparse matrix-vector multiplication ,Sparse matrix - vector multiplication ,Parallel computing ,Supercomputer ,Matrix multiplication ,law.invention ,Matrix (mathematics) ,Graph Partitioning ,Computational Theory and Mathematics ,PageRank ,Hardware and Architecture ,law ,Signal Processing ,Pagerank ,Web Search ,Overhead (computing) ,Hypergraph Partitioning ,Repartitioning ,Cluster analysis ,Sparse matrix - Abstract
The PageRank algorithm is an important component in effective web search. At the core of this algorithm are repeated sparse matrix-vector multiplications, where the involved web matrices grow in parallel with the growth of the Web and are stored in a distributed manner due to space limitations. Hence, the PageRank computation, which is frequently repeated, must be performed in parallel with high efficiency and low preprocessing overhead while considering the initially distributed nature of the web matrices. Our contributions in this work are twofold. We first investigate the application of state-of-the-art sparse matrix partitioning models in order to attain high efficiency in parallel PageRank computations, with a particular focus on reducing the preprocessing overhead they introduce. For this purpose, we evaluate two different compression schemes on the web matrix using the site information inherently available in links. Second, we consider the more realistic scenario of starting with initially distributed data and extend our algorithms to cover the repartitioning of such data for efficient PageRank computation. We report performance results using our parallelization of a state-of-the-art PageRank algorithm on two different PC clusters with 40 and 64 processors. Experiments show that the proposed techniques achieve considerably high speedups while incurring a preprocessing overhead of only a few iterations (for some instances, even less than a single iteration) of the underlying sequential PageRank algorithm.
- Published
- 2011
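The kernel being parallelized in the work above is the repeated sparse matrix-vector product of the PageRank power iteration. A minimal sequential sketch follows; in the distributed setting, the pages (matrix rows) would be partitioned across processors. The tiny link graph is illustrative.

```python
# Sequential power-iteration sketch of the PageRank core; each iteration
# amounts to a sparse matrix-vector product over the link structure.

def pagerank(links, d=0.85, iters=50):
    """links: {page: [outgoing link targets]}; returns a score per page."""
    pages = list(links)
    n = len(pages)
    r = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        nxt = {p: (1.0 - d) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = d * r[p] / len(outs)
                for q in outs:
                    nxt[q] += share
            else:  # dangling page: spread its mass uniformly
                for q in pages:
                    nxt[q] += d * r[p] / n
        r = nxt
    return r

scores = pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]})
```

Scores form a probability distribution over pages; well-linked pages (here `a`) accumulate the most mass.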
32. The 8th workshop on large-scale distributed systems for information retrieval (LSDS-IR'10)
- Author
-
Roi Blanco, B. Barla Cambazoglu, and Claudio Lucchese
- Subjects
Information retrieval ,Scale (ratio) ,Hardware and Architecture ,Computer science ,Management Information Systems - Abstract
The size of the Web as well as the user bases of search systems continue to grow exponentially. Consequently, providing subsecond query response times and high query throughput becomes quite challenging for large-scale information retrieval systems. Distributing different aspects of search (e.g., crawling, indexing, and query processing) is essential to achieve scalability in large-scale information retrieval systems. The 8th Workshop on Large-Scale Distributed Systems for Information Retrieval (LSDS-IR'10) has provided a venue to discuss the current research challenges and identify new directions for distributed information retrieval. The workshop contained two industry talks as well as six research paper presentations. The hot topics in this year's workshop were collection selection architectures, application of MapReduce to information retrieval problems, similarity search, geographically distributed web search, and optimization techniques for search efficiency.
- Published
- 2011
33. 7th workshop on large-scale distributed systems for information retrieval (LSDS-IR'09)
- Author
-
Roi Blanco, Claudio Lucchese, and B. Barla Cambazoglu
- Subjects
Information retrieval ,Web search query ,Concept search ,Computer science ,Nearest neighbor search ,Distributed computing ,Search engine indexing ,Query language ,Management Information Systems ,Adversarial information retrieval ,Search engine ,Query expansion ,Hardware and Architecture ,Human–computer information retrieval ,Relevance (information retrieval) ,Document retrieval - Abstract
Due to the dramatically increasing amount of available data, effective and scalable solutions for data organization and search are essential. Distributed solutions naturally provide promising alternatives to standard centralized approaches. With the computational power of thousands or millions of computers in clusters or peer-to-peer systems, the challenges that arise are manifold, ranging from efficient resource discovery to issues in load balancing and distributed query processing. The 2009 edition of the Workshop on Large-Scale Distributed Systems for Information Retrieval (LSDS-IR'09) provided a forum for researchers to discuss these problems and to define new directions in research on Distributed Information Retrieval. The Workshop program featured research contributions in the areas of collection selection, similarity search, index compression, distributed indexing, query processing, distributed computing, and network formation. In addition, there were two industry talks on large scale Web search and distributed computing.
- Published
- 2009
34. Chat mining
- Author
-
Cevdet Aykanat, B. Barla Cambazoglu, Tayfun Kucukyilmaz, Fazli Can, Department of Technology and Operations Management, and Aykanat, Cevdet
- Subjects
Vocabulary ,Focus (computing) ,Information retrieval ,Computer science ,media_common.quotation_subject ,Context (language use) ,Chat Mining ,Library and Information Sciences ,Management Science and Operations Research ,Computer Science Applications ,Term (time) ,World Wide Web ,Writing style ,Machine Learning ,Authorship Analysis ,Content analysis ,Media Technology ,Identity (object-oriented programming) ,Computer-mediated Communication ,Stylistics ,Computer-mediated communication ,Text Classification ,Information Systems ,media_common - Abstract
The focus of this paper is to investigate the possibility of predicting several user and message attributes in text-based, real-time, online messaging services. For this purpose, a large collection of chat messages is examined. The applicability of various supervised classification techniques for extracting information from the chat messages is evaluated. Two competing models are used for defining the chat mining problem. A term-based approach is used to investigate the user and message attributes in the context of vocabulary use, while a style-based approach is used to examine the chat messages according to the variations in the authors' writing styles. Among 100 authors, the identity of an author is correctly predicted with 99.7% accuracy. Moreover, the reverse problem is explored, and the effect of author attributes on computer-mediated communications is discussed.
- Published
- 2008
35. Sharing Data and Analytical Resources Securely in a Biomedical Research Grid Environment
- Author
-
Tahsin Kurc, B. Barla Cambazoglu, Stephen Langella, Ashish Sharma, Justin Permar, Tony Pan, Joel H. Saltz, Scott Oster, Shannon Hastings, and David E. Ervin
- Subjects
Biomedical Research ,Data grid ,Computer science ,business.industry ,Computational Biology ,Health Informatics ,Provisioning ,Access control ,Grid ,Computer security ,computer.software_genre ,World Wide Web ,DRMAA ,Computer Communication Networks ,Semantic grid ,Grid computing ,Database Management Systems ,Model Formulation ,Grid Security Infrastructure ,business ,computer ,Computer Security - Abstract
Objectives: To develop a security infrastructure to support controlled and secure access to data and analytical resources in a biomedical research Grid environment, while facilitating resource sharing among collaborators. Design: A Grid security infrastructure, called Grid Authentication and Authorization with Reliably Distributed Services (GAARDS), is developed as a key architecture component of the NCI-funded cancer Biomedical Informatics Grid (caBIG™). The GAARDS is designed to support, in a distributed environment, 1) efficient provisioning and federation of user identities and credentials; 2) group-based access control support with which resource providers can enforce policies based on community-accepted groups and local groups; and 3) management of a trust fabric so that policies can be enforced based on required levels of assurance. Measurements: GAARDS is implemented as a suite of Grid services and administrative tools. It provides three core services: Dorian for management and federation of user identities, Grid Trust Service for maintaining and provisioning a federated trust fabric within the Grid environment, and Grid Grouper for enforcing authorization policies based on both local and Grid-level groups. Results: The GAARDS infrastructure is available as a stand-alone system and as a component of the caGrid infrastructure. More information about GAARDS can be accessed at . Conclusions: GAARDS provides a comprehensive system to address the security challenges associated with environments in which resources may be located at different sites, requests to access the resources may cross institutional boundaries, and user credentials are created, managed, and revoked dynamically in a de-centralized manner.
- Published
- 2008
36. Clustering spatial networks for aggregate query processing: A hypergraph approach
- Author
-
B. Barla Cambazoglu, Cevdet Aykanat, Engin Demir, and Aykanat, Cevdet
- Subjects
Optimization ,Hypergraph ,Query processing ,Theoretical computer science ,Clustering algorithms ,Computer science ,Correlation clustering ,computer.software_genre ,Network operations center ,Clustering ,Bipartitioning ,Resource allocation ,Cluster analysis ,Mathematical models ,Hypergraph partitioning ,Aggregate (data warehouse) ,Spatial networks ,Record-to-page allocation ,Graph theory ,Range (mathematics) ,Hardware and Architecture ,Data mining ,computer ,Software ,Recursive functions ,Information Systems - Abstract
In spatial networks, clustering adjacent data to disk pages is highly likely to reduce the number of disk page accesses made by the aggregate network operations during query processing. For this purpose, different techniques based on the clustering graph model are proposed in the literature. In this work, we show that the state-of-the-art clustering graph model is not able to correctly capture the disk access costs of aggregate network operations. Moreover, we propose a novel clustering hypergraph model that correctly captures the disk access costs of these operations. The proposed model aims to minimize the total number of disk page accesses in aggregate network operations. Based on this model, we further propose two adaptive recursive bipartitioning schemes to reduce the number of allocated disk pages while trying to minimize the number of disk page accesses. We evaluate our clustering hypergraph model and recursive bipartitioning schemes on a wide range of road network datasets. The results of the conducted experiments show that the proposed model is quite effective in reducing the number of disk accesses incurred by the network operations.
- Published
- 2008
37. Adaptive decomposition and remapping algorithms for object-space-parallel direct volume rendering of unstructured grids
- Author
-
Cevdet Aykanat, Tahsin Kurc, Ferit Findik, B. Barla Cambazoglu, and Aykanat, Cevdet
- Subjects
Data structures ,Computer Networks and Communications ,Computer science ,Parallel algorithm ,Parallel computing ,Direct Volume Rendering ,Adaptive systems ,Theoretical Computer Science ,Remapping ,Artificial Intelligence ,Overhead (computing) ,Unstructured Grids ,Problem solving ,Storage allocation (computer) ,Parallel rendering ,Adaptive algorithm ,Parallel processing systems ,Graph partition ,Volume rendering ,Graph theory ,Grid ,Distributed computer systems ,Visualization ,Graph Partitioning ,Hardware and Architecture ,Benchmark (computing) ,Object Space Parallelization ,Algorithm ,Software ,Adaptive Decomposition - Abstract
Object space (OS) parallelization of an efficient direct volume rendering algorithm for unstructured grids on distributed-memory architectures is investigated. The adaptive OS decomposition problem is modeled as a graph partitioning (GP) problem using an efficient and highly accurate estimation scheme for view-dependent node and edge weighting. In the proposed model, minimizing the cutsize corresponds to minimizing the parallelization overhead due to data communication and redundant computation/storage, while maintaining the GP balance constraint corresponds to maintaining the computational load balance in parallel rendering. A GP-based, view-independent cell clustering scheme is introduced to induce more tractable view-dependent computational graphs for successive visualizations. As another contribution, a graph-theoretical remapping model is proposed as a solution to the general remapping problem and is used in minimizing the cell-data migration overhead. The remapping tool RM-MeTiS is developed by modifying the GP tool MeTiS and is used in partitioning the remapping graphs. Experiments are conducted using benchmark datasets on a 28-node PC cluster to evaluate the performance of the proposed models.
- Published
- 2007
38. Unconscious Physiological Effects of Search Latency on Users and Their Click Behaviour
- Author
-
Miguel Barreda-Ángeles, Alexandre Pereda-Baños, B. Barla Cambazoglu, Ioannis Arapakis, and Xiao Bai
- Subjects
World Wide Web ,Web search query ,Unconscious mind ,Computer science ,Human–computer interaction ,Web search engine ,Latency (engineering) - Abstract
Understanding the impact of a search system's response latency on its users' searching behaviour has recently been an active research topic in the information retrieval and human-computer interaction areas. Along the same line, this paper focuses on the user impact of search latency and makes the following two contributions. First, through a controlled experiment, we reveal the physiological effects of response latency on users and show that these effects are present even at small increases in response latency. We compare these effects with the information gathered from self-reports and show that they capture the nuanced attentional and emotional reactions to latency much better. Second, we carry out a large-scale analysis using a web search query log obtained from Yahoo to understand the change in the way users engage with a web search engine under varying levels of increasing response latency. In particular, we analyse the change in the click behaviour of users when they are subject to increasing response latency and reveal significant behavioural differences.
- Published
- 2015
39. A Random Walk Model for Optimization of Search Impact in Web Frontier Ranking
- Author
-
B. Barla Cambazoglu, Ata Turk, Giang Binh Tran, and Wolfgang Nejdl
- Subjects
Web analytics ,Information retrieval ,Web search query ,Computer science ,Download ,business.industry ,Focused crawler ,computer.software_genre ,Search engine ,Rewrite engine ,Web search engine ,Web content ,Data mining ,Web crawler ,business ,computer - Abstract
Large-scale web search engines need to crawl the Web continuously to discover and download newly created web content. The speed at which the new content is discovered and the quality of the discovered content can have a big impact on the coverage and quality of the results provided by the search engine. In this paper, we propose a search-centric solution to the problem of prioritizing the pages in the frontier of a crawler for download. Our approach essentially orders the web pages in the frontier through a random walk model that takes into account the pages' potential impact on user-perceived search quality. In addition, we propose a link graph enrichment technique that extends this solution. Finally, we explore a machine learning approach that combines different frontier prioritization approaches. We conduct experiments using two very large, real-life web datasets to observe various search quality metrics. Comparisons with several baseline techniques indicate that the proposed approaches have the potential to improve the user-perceived quality of web search results considerably.
- Published
- 2015
40. Propagating Expiration Decisions in a Search Engine Result Cache
- Author
-
Rifat Ozcan, Özgür Ulusoy, B. Barla Cambazoglu, Fethi Burak Sazoglu, and Ismail Sengor Altingovde
- Subjects
Hardware_MEMORYSTRUCTURES ,business.industry ,Computer science ,InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL ,Real-time computing ,InformationSystems_DATABASEMANAGEMENT ,Time to live ,Search engine ,Cache invalidation ,Expiration ,Cache ,business ,Cache algorithms ,Computer network - Abstract
Detecting stale queries in a search engine result cache is an important problem. In this work, we propose a mechanism that propagates the expiration decision for a query to similar queries in the cache to re-adjust their time-to-live values.
- Published
- 2015
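A toy sketch of the propagation mechanism described above: when a cached query is detected to be stale, the time-to-live values of similar cached queries are shrunk as well. Here similarity is simply sharing a query term and the halving factor is arbitrary; both are illustrative assumptions, not the paper's actual mechanism.

```python
# Toy result cache that propagates an expiration decision to similar queries.

class ResultCache:
    def __init__(self):
        self.ttl = {}  # query -> remaining time-to-live (seconds)

    def put(self, query, ttl):
        self.ttl[query] = ttl

    def expire(self, query, factor=0.5):
        """Evict a stale query and shrink the TTL of similar ones."""
        terms = set(query.split())
        self.ttl.pop(query, None)
        for other in self.ttl:
            if terms & set(other.split()):  # shares a term: propagate
                self.ttl[other] *= factor

cache = ResultCache()
cache.put("world cup", 600)
cache.put("cup final", 600)
cache.put("weather", 600)
cache.expire("world cup")
```

After the expiration, the overlapping query's TTL is halved while the unrelated query is untouched.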
41. Task Allocation In Volunteer Computing Networks Under Monetary Budget Constraints
- Author
-
B. Barla Cambazoglu, Oznur Ozkasap, and Huseyin Guler
- Subjects
Operations research ,Computer Networks and Communications ,Computer science ,business.industry ,Variation (game tree) ,Task (project management) ,Electricity market ,Electricity ,Set (psychology) ,business ,Heuristics ,Baseline (configuration management) ,Software ,Simulation ,Budget constraint - Abstract
In volunteer computing networks, the peers contribute to the solution of a computationally intensive problem by freely providing their computational resources, i.e., without seeking any immediate financial benefit. In such networks, although the peers can set certain bounds on how much their resources can be exploited by the network, the monetary cost that the network brings to the peers is unclear. In this work, we propose a volunteer computing network where the peers can set monetary budgets, limiting the financial burden incurred on them due to the usage of their computational resources. Under the assumption that the price of the electricity consumed by the peers has temporal variation, we show that our approach leads to an interesting task allocation problem, where the goal is to maximize the amount of work done by the peers without violating the monetary budget constraints set by them. We propose various heuristics as solutions to the problem, which is NP-hard. Our extensive simulations using realistic data traces and real-life electricity prices demonstrate that the proposed techniques considerably increase the amount of useful work done by the peers, compared to a baseline technique.
- Published
- 2015
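A minimal greedy sketch in the spirit of the problem above: in a given time slot, load the peers whose current electricity price lets them complete the most tasks within their remaining monetary budget. The one-cost-unit-per-task energy model, prices, and budgets are illustrative assumptions, not the paper's heuristics.

```python
# Hypothetical greedy allocation of tasks to budget-constrained peers,
# cheapest current electricity price first.

def allocate(tasks, peers):
    """peers: {name: {'budget': money remaining, 'price': cost per task now}}.
    Returns {name: number of tasks assigned}."""
    done = {p: 0 for p in peers}
    for p in sorted(peers, key=lambda name: peers[name]["price"]):
        can_do = int(peers[p]["budget"] // peers[p]["price"])  # budget cap
        take = min(tasks, can_do)
        done[p] = take
        tasks -= take
        if tasks == 0:
            break
    return done

plan = allocate(10, {"p1": {"budget": 2.0, "price": 0.5},
                     "p2": {"budget": 3.0, "price": 1.0}})
```

The cheap peer is saturated first; unassignable tasks would wait for a slot with lower prices or replenished budgets.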
42. LSDS-IR'15: 2015 Workshop on large-scale and distributed systems for information retrieval
- Author
-
Ismail Sengor Altingovde, B. Barla Cambazoglu, and Nicola Tonellotto
- Subjects
efficiency ,large-scale information retrieval ,Web search ,performance ,distributed information retrieval - Abstract
The growth of the Web and other Big Data sources leads to important performance problems for large-scale and distributed information retrieval systems. The scalability and efficiency of such information retrieval systems have an impact on their effectiveness, eventually affecting the experience of their users and their monetization as well. The LSDS-IR'15 workshop will provide space for researchers to discuss the existing performance problems in the context of large-scale and distributed information retrieval systems and define new research directions in the modern Big Data era. The workshop expects to bring together information retrieval practitioners from the industry, as well as academic researchers concerned with any aspect of large-scale and distributed information retrieval systems.
- Published
- 2015
43. Web page download scheduling policies for green web crawling
- Author
-
B. Barla Cambazoglu, Iordanis Koutsopoulos, and Vassiliki Hatzi
- Subjects
World Wide Web ,Web server ,Database ,Computer science ,Server ,Web page ,Distributed web crawling ,Static web page ,Focused crawler ,computer.software_genre ,Web crawler ,computer ,Web API - Abstract
A web crawler is responsible for discovering new web pages on the Web as well as for refreshing the content of already downloaded pages. During these operations, it can issue a huge number of page download requests to the servers in the Web. These requests, in turn, increase the energy consumption of the servers, as hardware resources are used when serving the requested pages. This has the side-effect of increasing the carbon footprint of the servers. In this work, we introduce the problem of green web crawling from a set of remote web servers, where the goal is to reduce the carbon footprint incurred by a large-scale web crawler. We consider a scenario where both the freshness of downloaded pages and the carbon emissions at remote servers need to be taken into account. We present various heuristics for prioritizing the page download requests as a means to study the relative importance of different parameters. We conduct experiments on a real dataset involving a large collection of servers hosting two billion pages. The results indicate that the carbon footprint generated by a crawler during its external operations can be considerably reduced without compromising the freshness of pages. Our work provides design guidelines for large-scale commercial search engine companies, which need to comply with certain greenness regulations.
- Published
- 2014
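One way to picture the request-prioritization heuristics discussed above is a score that favors stale pages but discounts servers that are currently carbon-intensive. The scoring formula, weighting parameter, and data below are hypothetical, not the paper's actual heuristics.

```python
# Illustrative download-request prioritization that trades off page
# staleness against the current carbon intensity of the hosting server.

def prioritize(requests, alpha=1.0):
    """requests: list of dicts with 'url', 'staleness' (hours since last
    crawl), and 'carbon' (relative carbon intensity of the server now).
    Returns URLs ordered from most to least attractive to download."""
    def score(r):
        return r["staleness"] / (1.0 + alpha * r["carbon"])
    return [r["url"] for r in sorted(requests, key=score, reverse=True)]

reqs = [
    {"url": "a.com/1", "staleness": 10, "carbon": 4.0},  # stale but dirty
    {"url": "b.com/1", "staleness": 6,  "carbon": 0.5},  # fairly stale, clean
    {"url": "c.com/1", "staleness": 2,  "carbon": 0.1},  # fresh
]
order = prioritize(reqs)
```

The clean server's fairly stale page jumps ahead of the stalest page on the carbon-intensive server; raising `alpha` strengthens the greenness bias.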
44. Impact of response latency on user behavior in web search
- Author
-
B. Barla Cambazoglu, Ioannis Arapakis, and Xiao Bai
- Subjects
World Wide Web ,Search engine ,Human–computer interaction ,Computer science ,Order (business) ,Web search engine ,Latency (engineering) - Abstract
Traditionally, the efficiency and effectiveness of search systems have both been of great interest to the information retrieval community. However, an in-depth analysis on the interplay between the response latency of web search systems and users' search experience has been missing so far. In order to fill this gap, we conduct two separate studies aiming to reveal how response latency affects the user behavior in web search. First, we conduct a controlled user study trying to understand how users perceive the response latency of a search system and how sensitive they are to increasing delays in response. This study reveals that, when artificial delays are introduced into the response, the users of a fast search system are more likely to notice these delays than the users of a slow search system. The introduced delays become noticeable by the users once they exceed a certain threshold value. Second, we perform an analysis using a large-scale query log obtained from Yahoo web search to observe the potential impact of increasing response latency on the click behavior of users. This analysis demonstrates that latency has an impact on the click behavior of users to some extent. In particular, given two content-wise identical search result pages, we show that the users are more likely to perform clicks on the result page that is served with lower latency.
- Published
- 2014
45. Scalability and efficiency challenges in large-scale web search engines
- Author
-
Ricardo Baeza-Yates and B. Barla Cambazoglu
- Subjects
medicine.medical_specialty ,Computer science ,02 engineering and technology ,Crawling ,computer.software_genre ,World Wide Web ,Search engine ,Query expansion ,Web query classification ,020204 information systems ,medicine ,0202 electrical engineering, electronic engineering, information engineering ,Distributed web crawling ,Web search query ,Information retrieval ,Database ,business.industry ,Search analytics ,Search engine indexing ,Semantic search ,Web search engine ,020201 artificial intelligence & image processing ,Web crawler ,Metasearch engine ,business ,Web modeling ,computer - Abstract
Commercial web search engines need to process thousands of queries every second and provide responses to user queries within a few hundred milliseconds. As a consequence of these tight performance constraints, search engines construct and maintain very large computing infrastructures for crawling the Web, indexing discovered pages, and processing user queries. The scalability and efficiency of these infrastructures require careful performance optimizations in every major component of the search engine. This tutorial aims to provide a fairly comprehensive overview of the scalability and efficiency challenges in large-scale web search engines. In particular, the tutorial provides an in-depth architectural overview of a web search engine, mainly focusing on the web crawling, indexing, and query processing components. The scalability and efficiency issues encountered in the above-mentioned components are presented at four different granularities: at the level of a single computer, a cluster of computers, a single data center, and a multi-center search engine. The tutorial also points at the open research problems and provides recommendations to researchers who are new to the field.
- Published
- 2014
46. Improving the performance of independent task assignment heuristics MinMin, MaxMin and Sufferage
- Author
-
B. Barla Cambazoglu, Cevdet Aykanat, E. Kartal Tabak, and Aykanat, Cevdet
- Subjects
Mathematical optimization ,Computer science ,Symmetric multiprocessor system ,Maxmin, Sufferage ,Load balancing (computing) ,Hybrid algorithm ,Load Balancing ,Independent Task Assignment ,Computational Theory and Mathematics ,Hardware and Architecture ,Signal Processing ,Heterogeneous Systems ,Parallel Processors ,Constructive Heuristics ,Heuristics ,MaxMin ,Sufferage ,Minmin - Abstract
MinMin, MaxMin, and Sufferage are constructive heuristics that are widely and successfully used in assigning independent tasks to processors in heterogeneous computing systems. All three heuristics are known to run in O(KN^2) time in assigning N tasks to K processors. In this paper, we propose an algorithmic improvement that asymptotically decreases the running time complexity of MinMin to O(KN log N) without affecting its solution quality. Furthermore, we combine the newly proposed MinMin algorithm with MaxMin as well as Sufferage, obtaining two hybrid algorithms. The motivation behind the former hybrid algorithm is to address the drawback of MaxMin in solving problem instances with highly skewed cost distributions while also improving the running time performance of MaxMin. The latter hybrid algorithm improves the running time performance of Sufferage without degrading its solution quality. The proposed algorithms are easy to implement, and we illustrate them through detailed pseudocode. The experimental results over a large number of real-life datasets show that the proposed fast MinMin algorithm and the proposed hybrid algorithms perform significantly better than their traditional counterparts as well as more recent state-of-the-art assignment heuristics. For the large datasets used in the experiments, MinMin, MaxMin, and Sufferage, as well as recent state-of-the-art heuristics, require days, weeks, or even months to produce a solution, whereas all of the proposed algorithms produce solutions within only two or three minutes.
- Published
- 2014
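For reference, the classic O(KN²) MinMin heuristic that this paper accelerates can be sketched as below. This is the textbook baseline, not the paper's O(KN log N) variant; the cost-matrix name `etc` (expected time to compute) is illustrative.

```python
def minmin(etc):
    """Classic O(K*N^2) MinMin heuristic.

    etc[i][j]: expected execution time of task i on processor j.
    Returns a list mapping each task index to its assigned processor.
    """
    n, k = len(etc), len(etc[0])
    ready = [0.0] * k            # current finish time of each processor
    assignment = [None] * n
    unassigned = set(range(n))
    while unassigned:
        # For every unassigned task, find its minimum completion time
        # over all processors; keep the overall minimum pair.
        best_task, best_proc, best_ct = None, None, float("inf")
        for i in unassigned:
            for j in range(k):
                ct = ready[j] + etc[i][j]
                if ct < best_ct:
                    best_task, best_proc, best_ct = i, j, ct
        # Assign the task with the overall minimum completion time.
        assignment[best_task] = best_proc
        ready[best_proc] = best_ct
        unassigned.remove(best_task)
    return assignment
```

Each of the N iterations scans all remaining tasks against all K processors, which is where the O(KN²) bound comes from; the paper's contribution is removing the redundant rescanning.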
47. Workshop on large-scale and distributed systems for information retrieval (LSDS-IR 2014)
- Author
- Craig Macdonald, Ismail Sengor Altingovde, Nicola Tonellotto, and B. Barla Cambazoglu
- Subjects
World Wide Web ,Information retrieval ,Computer science ,Scale (chemistry) ,Distributed computing ,Scalability ,Data science - Abstract
The LSDS-IR'14 workshop aims to bring together information retrieval practitioners from industry and researchers from academia concerned with efficient and distributed IR systems. The workshop also welcomes contributions that propose different ways of leveraging the diversity and multiplicity of resources available in distributed systems. The main goal of the workshop is to attract people from industry and academia to present and discuss ideas, problems, and results related to the efficiency of large-scale and distributed information retrieval systems.
- Published
- 2014
48. Improving the efficiency of multi-site web search engines
- Author
- Xiao Bai, Guillem Francès, Ricardo Baeza-Yates, and B. Barla Cambazoglu
- Subjects
Information retrieval ,Web search query ,Web query classification ,business.industry ,Computer science ,Search analytics ,Web page ,Search engine indexing ,Semantic search ,Web search engine ,Distributed web crawling ,business - Abstract
A multi-site web search engine is composed of a number of search sites geographically distributed around the world. Each search site is typically responsible for crawling and indexing the web pages that are in its geographical neighborhood. A query is selectively processed on a subset of search sites that are predicted to return the best-matching results. The scalability and efficiency of multi-site web search engines have attracted a lot of research attention in recent years. In particular, research has focused on replicating important web pages across sites, forwarding queries to relevant sites, and caching results of previous queries. Yet, these problems have so far only been studied in isolation; no prior work has properly investigated the interplay between them. In this paper, we take up this challenge and conduct what we believe is the first comprehensive analysis of a full stack of techniques for efficient multi-site web search. Specifically, we propose a document replication technique that improves the query locality of the state-of-the-art approaches with various replication budget distribution strategies. We devise a machine learning approach to decide the query forwarding patterns, achieving a significantly lower false positive ratio than a state-of-the-art thresholding approach with little negative impact on search result quality. We propose three result caching strategies that reduce the number of forwarded queries and analyze the trade-off they introduce in terms of storage and network overheads. Finally, we show that the combination of the best-of-the-class techniques yields very promising search efficiency, rendering multi-site, geographically distributed web search engines an attractive alternative to centralized web search engines.
- Published
- 2014
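The caching idea in the abstract above can be illustrated with a minimal sketch: a local site keeps an LRU cache of query results and forwards a query to remote sites only on a miss. This is a generic illustration of the mechanism, not one of the paper's three caching strategies; the class and method names are hypothetical.

```python
from collections import OrderedDict

class LocalResultCache:
    """Illustrative LRU result cache for one search site: queries answered
    from the cache need not be forwarded to remote sites."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()
        self.hits = self.forwards = 0

    def lookup(self, query, forward_fn):
        if query in self.cache:
            self.cache.move_to_end(query)   # mark as most recently used
            self.hits += 1
            return self.cache[query]
        self.forwards += 1
        results = forward_fn(query)         # fetch from remote sites
        self.cache[query] = results
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return results
```

The storage/network trade-off the abstract mentions shows up directly here: a larger `capacity` reduces `forwards` (network) at the cost of local storage.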
49. On the Feasibility of Predicting News Popularity at Cold Start
- Author
- Ioannis Arapakis, Mounia Lalmas, and B. Barla Cambazoglu
- Subjects
World Wide Web ,Cold start ,Work (electrical) ,Computer science ,InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL ,Popularity ,Task (project management) - Abstract
We perform a study on cold-start news popularity prediction using a collection of 13,319 news articles obtained from Yahoo News. We characterise the online popularity of news articles by two different metrics and try to predict them using machine learning techniques. Contrary to prior work on the same topic, our findings indicate that predicting news popularity at cold start is a difficult task and that the previously published results may be superficial.
- Published
- 2014
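As a toy illustration of the cold-start setup, a predictor may use only features available before publication. The single-feature least-squares baseline below maps one hypothetical pre-publication feature (say, headline length) to a popularity metric; it is not the paper's model or feature set.

```python
def fit_linear(xs, ys):
    """Ordinary least squares for a single feature: y ~ a*x + b.

    xs: feature values known at publication time (e.g. headline length).
    ys: observed popularity metric values for training articles.
    Returns the fitted slope a and intercept b.
    """
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    b = my - a * mx
    return a, b
```

The paper's point is precisely that such pre-publication signals carry little predictive power, so a baseline like this would be expected to fit poorly on real cold-start data.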
50. Review of 'Search Engines: Information Retrieval in Practice' by Croft, Metzler and Strohman
- Author
- B. Barla Cambazoglu
- Subjects
Computer science ,Search analytics ,Search engine indexing ,Library and Information Sciences ,Management Science and Operations Research ,Computer Science Applications ,World Wide Web ,Query expansion ,Search engine ,Text processing ,Public use ,Human–computer information retrieval ,Media Technology ,Speculation ,Information Systems - Abstract
Despite the common public use of Web search engines, their internal design details mostly remain a black art. The speculation is that there is a significant knowledge gap between what is published by academia and what is guarded behind the doors of large-scale search companies. ''Search Engines: Information Retrieval in Practice'' is one of the few books that attempt to cover the issues involved in search engine design and is probably the most comprehensive book published so far on this topic. Unfortunately, the book fails to be a complete search engine guide as its content is dominated by topics from information retrieval, text processing, and statistics. More precisely, the focus of the book is biased towards the ''search'' rather than the ''engines'' as, in most places, discussions on effectiveness dominate those on efficiency by a great margin. However, the book stands as a very solid IR book.
- Published
- 2010