9 results on '"Benczúr, András"'
Search Results
2. Robust Decentralized Low-Rank Matrix Decomposition
- Author
-
Hegedűs, István, Berta, Árpád, Kocsis, Levente, Benczúr, András A., and Jelasity, Márk
- Abstract
Low-rank matrix approximation is an important tool in data mining with a wide range of applications, including recommender systems, clustering, and identifying topics in documents. When the matrix to be approximated originates from a large distributed system, such as a network of mobile phones or smart meters, a challenging problem arises due to the strongly conflicting yet essential requirements of efficiency, robustness, and privacy preservation. We argue that although collecting sensitive data in a centralized fashion may be efficient, it is not an option when considering privacy and efficiency at the same time. Thus, we do not allow any sensitive data to leave the nodes of the network. The local information at each node (personal attributes, documents, media ratings, etc.) defines one row in the matrix. This means that all computations have to be performed at the edge of the network. Known parallel methods that respect the locality constraint, such as synchronized parallel gradient search or distributed iterative methods, require synchronized rounds or have inherent issues with load balancing, and thus they are not robust to failure. Our distributed stochastic gradient descent algorithm overcomes these limitations. During the execution, any sensitive information remains local, whereas the global features (e.g., the factor model of movies) converge to the correct value at all nodes. We present a theoretical derivation and a thorough experimental evaluation of our algorithm. We demonstrate that the convergence speed of our method is competitive while not relying on synchronization and being robust to extreme and realistic failure scenarios. To demonstrate the feasibility of our approach, we present trace-based simulations, real smartphone user behavior analysis, and tests over real movie recommender system data.
- Published
- 2016
- Full Text
- View/download PDF
3. Towards a Normal Form and a Query Language for Extended Relations Defined by Regular Expressions
- Author
-
Benczúr, András and Szabó, Gyula
- Abstract
This paper introduces a generalized data base concept that unites relational and semi structured data models. As an important theoretical result we could find a quadratic decision algorithm for the implication problem of functional and join dependencies defined on the united data model. As practical contribution we presented a normal form for the new data model as a tool for data base design. With our novel representations of regular expressions, a more effective searching method could be developed. XML elements are described by XML schema languages such as a DTD or an XML Schema definition. The instances of these elements are semi-structured tuples. A semi-structured tuple is an ordered list of (attribute: value) pairs. We may think of a semi-structured tuple as a sentence of a formal language, where the values are the terminal symbols and the attribute names are the non-terminal symbols. In the authors' former work (Szabó and Benczúr, 2015) they introduced the notion of the extended tuple as a sentence from a regular language generated by a grammar where the non-terminal symbols of the grammar are the attribute names of the tuple. Sets of extended tuples are the extended relations. The authors then introduced the dual language, which generates the tuple types allowed to occur in extended relations. They defined functional dependencies (regular FD - RFD) over extended relations. In this paper they rephrase the RFD concept by directly using regular expressions over attribute names to define extended tuples. By the help of a special vertex labeled graph associated to regular expressions the specification of substring selection for the projection operation can be defined. The normalization for regular schemas is more complex than it is in the relational model, because the schema of an extended relation can contain an infinite number of tuple types. However, the authors can define selection, projection and join operations on extended relations too, so a lossless-join decomposition can be performed. They extended their previous model to deal with XML schema indicators too, e.g., with numerical constraints. They added line and set constructors too, in order to extend their model with more general projection and selection operators. This model establishes a query language with table join functionality for collected XML element data.
- Published
- 2016
- Full Text
- View/download PDF
4. Temporal influence over the Last.fm social network
- Author
-
Pálovics, Róbert and Benczúr, András
- Abstract
In a previous result, we showed that the influence of social contacts spreads information about new artists through the Last.fm social network. We successfully decomposed influence from effects of trends, global popularity, and homophily or shared environment of friends. In this paper, we present our new experiments that use a mathematically sound formula for defining and measuring the influence in the network. We provide new baseline and influence models and evaluation measures, both batch and online, for real-time recommendations with very strong temporal aspects. Our experiments are carried over the 2-year “scrobble” history of 70,000 Last.fm users. In our results, we formally define and distil the effect of social influence. In addition, we provide new models and evaluation measures for real-time recommendations with very strong temporal aspects.
- Published
- 2015
- Full Text
- View/download PDF
5. Modeling information systems from the viewpoint of active documents
- Author
-
Molnár, Bálint and Benczúr, András
- Abstract
The development of document handling by organizations at the level of business processes and business information systems leads to the phenomenon that the majority of documents and their contents remain in semi-structured format and definite minorities of documents are directly mapped onto structured databases. The rapidly evolving technology on the database field provides the opportunity to manage directly the semi-structured documents in line with the requirements of business processes. A continuum of possible document formats may exist in business environments. The documents can be categorized by the organization of the underlying data collections and according to the necessity and capability of data included in documents as whether wholly or partially is to be structured. The most modern database technology yields tools for handling and retrieving data making use of semi-structured and unstructured data. Our proposed approach (1) on one hand provides a theoretical frameworkfor modeling IS to shape into some structure, (2) on the other hand yields guidance for design methodto employ it for practical application with the extension of specific elementary models. The proposed modeling method that places the emphasis onto the documentsand their symbiotic life with processeshelps in understanding the behavior of most modern information systems. As a result, the combination of document-centric modeling and the enterprise architecture approach gives an opportunity for a unified modelingapproach that keeps an eye on Conway’s thesis that states that software architecture congruent with the structure of development team, and this statement may be paraphrased that the overall document structure may reflect the structure of the specific organization.
- Published
- 2015
- Full Text
- View/download PDF
6. The Classification Power of Web Features
- Author
-
Erdélyi, Miklós, Benczúr, András A., Daróczy, Bálint, Garzó, András, Kiss, Tamás, and Siklósi, Dávid
- Abstract
AbstractIn this article we give a comprehensive overview of features devised for web spam detection and investigate how much various classes, some requiring very high computational effort, add to the classification accuracy.We collect and handle a large number of features based on recent advances in web spam filtering, including temporal ones; in particular, we analyze the strength and sensitivity of linkage change.We propose new, temporal link-similarity-based features and show how to compute them efficiently on large graphs.We show that machine learning techniques, including ensemble selection, LogitBoost, and random forest significantly improve accuracy.We conclude that, with appropriate learning techniques, a simple and computationally inexpensive feature subset outperforms all previous results published so far on our dataset and can be further improved only slightly by computationally expensive features.We test our method on three major publicly available datasets: the Web Spam Challenge 2008 dataset WEBSPAM-UK2007, the ECML/PKDD Discovery Challenge dataset DC2010, and the Waterloo Spam Rankings for ClueWeb09.Our classifier ensemble sets the strongest classification benchmark compared to participants of the Web Spam and ECML/PKDD Discovery Challenges as well as the TREC Web track.To foster research in the area, we make several feature sets and source codes public,1https://datamining.sztaki.hu/en/download/web-spam-resourcesincluding the temporal features of eight .ukcrawl snapshots that include WEBSPAM-UK2007 as well as the Web Spam Challenge features for the labeled part of ClueWeb09.
- Published
- 2014
- Full Text
- View/download PDF
7. Facet of Modeling Web Information Systems from a Document-Centric View
- Author
-
Molnár, Bálint and Benczúr, András
- Abstract
The modeling of Information Systems in general, and Web Information Systems (WIS) especially, is a permanent issue so that there have been already several attempts and proposals for representing various facets of WIS. In our proposed approach, we focus on the organizational and business activity modeling and we concentrate on documents that represent the information of enterprises in the form of unstructured and semi-structured documents. The compilation of documents mirrors implicitly or explicitly the structure of enterprises, the interrelationship of business processes, and activities and tasks within processes. The documents represent, at the same time, the system roles along with tasks and activities. Our modeling approach concentrates on the co-existence and co-operation of documents and activities of business. The Story Algebra, or more generally the process algebra approach provides a formal framework that promises a formal describing method for modeling precisely the event triggered processes coupled with data in document format within an Enterprise Architecture Framework.
- Published
- 2013
- Full Text
- View/download PDF
8. KDD Cup 2007 task 1 winner report
- Author
-
Kurucz, Miklós, Benczúr, András, Kiss, Tamás, Nagy, István, Szabó, Adrienn, and Torma, Balázs
- Abstract
KDD Cup 2007 focuses on predicting aspects of movie rating behavior. We present our prediction method for Task 1 "Who Rated What in 2006" where the goal is to predict which users rated which movies in 2006. We use the combination of the following methods, listed in the order of their accuracy: • The predicted number of ratings for each movie based on time series analysis, also using movie and DVD release dates and movie series detection by the edit distance of the titles. • The predicted number of ratings by each user by using the fact that ratings were sampled proportional to the margin. • The low rank approximation of the 0-1 matrix of known user-movie pairs with rating. • Prediction by using the movie-movie similarity matrix. • Association rules obtained by frequent sequence mining of user ratings considered as ordered itemsets. By combining the predictions by linear regression we obtained a prediction with root mean squared error 0.256. The first runner up result was 0.263 while a pure all zeroes prediction already gives 0.279, indicating the hardness of the task.
- Published
- 2007
- Full Text
- View/download PDF
9. Augmenting Undirected Edge Connectivity in Õ(n2) Time
- Author
-
Benczúr, András A and Karger, David R
- Abstract
We give improved randomized (Monte Carlo) algorithms for undirected edge splitting and edge connectivity augmentation problems. Our algorithms run in time Õ(n2) on n-vertex graphs, making them an Ω̃(m/n) factor faster than the best known deterministic ones on m-edge graphs.
- Published
- 2000
- Full Text
- View/download PDF
Catalog
Discovery Service for Jio Institute Digital Library
For full access to our library's resources, please sign in.