39 results on '"Xiao, Xiaokui"'
Search Results
2. PANE: scalable and effective attributed network embedding
- Author
-
Yang, Renchi, Shi, Jieming, Xiao, Xiaokui, Yang, Yin, Bhowmick, Sourav S., and Liu, Juncheng
- Published
- 2023
- Full Text
- View/download PDF
3. Improving the utility of locally differentially private protocols for longitudinal and multidimensional frequency estimates
- Author
-
Arcolezi, Héber H., Couchot, Jean-François, Al Bouna, Bechara, and Xiao, Xiaokui
- Published
- 2024
- Full Text
- View/download PDF
4. Differentially private multivariate time series forecasting of aggregated human mobility with deep learning: Input or gradient perturbation?
- Author
-
Arcolezi, Héber Hwang, Couchot, Jean-François, Renaud, Denis, Al Bouna, Bechara, and Xiao, Xiaokui
- Published
- 2022
- Full Text
- View/download PDF
5. ERGCN: Data enhancement-based robust graph convolutional network against adversarial attacks
- Author
-
Wu, Tao, Yang, Nan, Chen, Long, Xiao, Xiaokui, Xian, Xingping, Liu, Jun, Qiao, Shaojie, and Cui, Canyixing
- Published
- 2022
- Full Text
- View/download PDF
6. Forecasting the number of firefighter interventions per region with local-differential-privacy-based data
- Author
-
Arcolezi, Héber H., Couchot, Jean-François, Cerna, Selene, Guyeux, Christophe, Royer, Guillaume, Bouna, Béchara Al, and Xiao, Xiaokui
- Published
- 2020
- Full Text
- View/download PDF
7. Efficient approximation algorithms for adaptive influence maximization
- Author
-
Huang, Keke, Tang, Jing, Han, Kai, Xiao, Xiaokui, Chen, Wei, Sun, Aixin, Tang, Xueyan, and Lim, Andrew
- Published
- 2020
- Full Text
- View/download PDF
8. Finding skyline communities in multi-valued networks
- Author
-
Li, Rong-Hua, Qin, Lu, Ye, Fanghua, Wang, Guoren, Yu, Jeffrey Xu, Xiao, Xiaokui, Xiao, Nong, and Zheng, Zibin
- Published
- 2020
- Full Text
- View/download PDF
9. FERRARI: an efficient framework for visual exploratory subgraph search in graph databases
- Author
-
Wang, Chaohui, Xie, Miao, Bhowmick, Sourav S., Choi, Byron, Xiao, Xiaokui, and Zhou, Shuigeng
- Published
- 2020
- Full Text
- View/download PDF
10. Optimal algorithms for selecting top-k combinations of attributes: theory and applications
- Author
-
Lin, Chunbin, Lu, Jiaheng, Wei, Zhewei, Wang, Jianguo, and Xiao, Xiaokui
- Published
- 2017
- Full Text
- View/download PDF
11. Efficient nonparametric and asymptotic Bayesian model selection methods for attributed graph clustering
- Author
-
Xu, Zhiqiang, Cheng, James, Xiao, Xiaokui, Fujimaki, Ryohei, and Muraoka, Yusuke
- Published
- 2017
- Full Text
- View/download PDF
12. Software process evaluation: a machine learning framework with application to defect management process
- Author
-
Chen, Ning, Hoi, Steven C. H., and Xiao, Xiaokui
- Published
- 2014
- Full Text
- View/download PDF
13. Dynamic monitoring of optimal locations in road network databases
- Author
-
Yao, Bin, Xiao, Xiaokui, Li, Feifei, and Wu, Yifan
- Published
- 2014
- Full Text
- View/download PDF
14. Differentially private histogram publication
- Author
-
Xu, Jia, Zhang, Zhenjie, Xiao, Xiaokui, Yang, Yin, Yu, Ge, and Winslett, Marianne
- Published
- 2013
- Full Text
- View/download PDF
15. LF-GDPR: A Framework for Estimating Graph Metrics With Local Differential Privacy.
- Author
-
Ye, Qingqing, Hu, Haibo, Au, Man Ho, Meng, Xiaofeng, and Xiao, Xiaokui
- Subjects
PRIVACY ,TASK analysis ,TRUST ,ACQUISITION of data ,TANNER graphs - Abstract
Local differential privacy (LDP) is an emerging technique for privacy-preserving data collection without a trusted collector. Despite its strong privacy guarantee, LDP cannot be easily applied to real-world graph analysis tasks such as community detection and centrality analysis due to its high implementation complexity and low data utility. In this paper, we address these two issues by presenting LF-GDPR, the first LDP-enabled graph metric estimation framework for graph analysis. It collects two atomic graph metrics—the adjacency bit vector and node degree—from each node locally. LF-GDPR simplifies the job of implementing LDP-related steps (e.g., local perturbation, aggregation and calibration) for a graph metric estimation task by providing either a complete or a parameterized algorithm for each step. To address low data utility of LDP, it optimally allocates privacy budget between the two atomic metrics during data collection. To demonstrate the usage of LF-GDPR, we show use cases on two common graph analysis tasks, namely, clustering coefficient estimation and community detection. The privacy and utility achieved by LF-GDPR are verified through theoretical analysis and extensive experimental results. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
16. Transparent anonymization: thwarting adversaries who know the algorithm
- Author
-
Xiao, Xiaokui, Tao, Yufei, and Koudas, Nick
- Subjects
Algorithm ,Privacy issue ,Company business management ,Algorithms -- Usage ,Database administration -- Methods ,Database administration -- Standards ,Statistical methods -- Usage ,Privacy -- Management - Abstract
Numerous generalization techniques have been proposed for privacy-preserving data publishing. Most existing techniques, however, implicitly assume that the adversary knows little about the anonymization algorithm adopted by the data publisher. Consequently, they cannot guard against privacy attacks that exploit various characteristics of the anonymization mechanism. This article provides a practical solution to this problem. First, we propose an analytical model for evaluating disclosure risks, when an adversary knows everything in the anonymization process, except the sensitive values. Based on this model, we develop a privacy principle, transparent l-diversity, which ensures privacy protection against such powerful adversaries. We identify three algorithms that achieve transparent l-diversity, and verify their effectiveness and efficiency through extensive experiments with real data. Categories and Subject Descriptors: H.2.8 [Database Management]: Database Applications-Statistical databases General Terms: Theory, Algorithms, Experimentation Additional Key Words and Phrases: Privacy-preserving data publishing, generalization, l-diversity DOI = 10.1145/1735886.1735887
- Published
- 2010
17. ANGEL: enhancing the utility of generalization for privacy preserving publication
- Author
-
Tao, Yufei, Chen, Hekang, Xiao, Xiaokui, Zhou, Shuigeng, and Zhang, Donghui
- Subjects
Privacy -- Analysis ,Data security -- Methods ,Data security -- Reports ,Privacy issue ,Data security issue ,Business ,Computers ,Electronics ,Electronics and electrical industries - Abstract
Generalization is a well-known method for privacy preserving data publication. Despite its vast popularity, it has several drawbacks such as heavy information loss, difficulty of supporting marginal publication, and so on. To overcome these drawbacks, we develop ANGEL, (1) a new anonymization technique that is as effective as generalization in privacy protection, but is able to retain significantly more information in the microdata. ANGEL is applicable to any monotonic principles (e.g., l-diversity, t-closeness, etc.), with its superiority (in correlation preservation) especially obvious when tight privacy control must be enforced. We show that ANGEL lends itself elegantly to the hard problem of marginal publication. In particular, unlike generalization that can release only restricted marginals, our technique can be easily used to publish any marginals with strong privacy guarantees. Index Terms--Privacy, generalization, ANGEL.
- Published
- 2009
18. Primal or dual: which promises faster spatiotemporal search?
- Author
-
Tao, Yufei and Xiao, Xiaokui
- Published
- 2008
- Full Text
- View/download PDF
19. Efficient temporal counting with bounded error
- Author
-
Tao, Yufei and Xiao, Xiaokui
- Published
- 2008
- Full Text
- View/download PDF
20. I/O-Efficient Algorithms for Degeneracy Computation on Massive Networks.
- Author
-
Li, Rong-Hua, Song, Qiushuo, Xiao, Xiaokui, Qin, Lu, Wang, Guoren, Yu, Jeffrey Xu, and Mao, Rui
- Subjects
ALGORITHMS ,HEURISTIC algorithms - Abstract
Degeneracy is an important concept to measure the sparsity of a graph which has been widely used in many network analysis applications. Many network analysis algorithms, such as clique enumeration and truss decomposition, perform very well in graphs having small degeneracies. In this paper, we propose an I/O-efficient algorithm to compute the degeneracy of the massive graph that cannot be fully kept in the main memory. The proposed algorithm only uses $O(n)$ O (n) memory, where $n$ n denotes the number of nodes of the graph. We also develop an I/O-efficient algorithm to incrementally maintain the degeneracy on dynamic graphs. Extensive experiments show that our algorithms significantly outperform the state-of-the-art degeneracy computation algorithms in terms of both running time and I/O costs. The results also demonstrate high scalability of the proposed algorithms. For example, in a real-world web graph with 930 million nodes and 13.3 billion edges, the proposed algorithm takes only 633 seconds and uses less than 4.5GB memory to compute the degeneracy. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
21. Multidimensional reverse kNN search
- Author
-
Tao, Yufei, Papadias, Dimitris, Lian, Xiang, and Xiao, Xiaokui
- Published
- 2007
- Full Text
- View/download PDF
22. Range search on multidimensional uncertain data
- Author
-
Tao, Yufei, Xiao, Xiaokui, and Cheng, Reynold
- Subjects
Information accessibility ,Information management -- Methods ,Query processing -- Methods ,Information storage and retrieval -- Methods - Abstract
In an uncertain database, every object a is associated with a probability density function, which describes the likelihood that a appears at each position in a multidimensional workspace. This article studies two types of range retrieval fundamental to many analytical tasks. Specifically, a nonfuzzy query returns all the objects that appear in a search region [r.sub.q] with at least a certain probability [t.sub.q]. On the other hand, given an uncertain object q, fuzzy search retrieves the set of objects that are within distance [[epsilon].sub.q] from q with no less than probability [t.sub.q]. The core of our methodology is a novel concept of "probabilistically constrained rectangle", which permits effective pruning/validation of nonqualifying/qualifying data. We develop a new index structure called the U-tree for minimizing the query overhead. Our algorithmic findings are accompanied with a thorough theoretical analysis, which reveals valuable insight into the problem characteristics, and mathematically confirms the efficiency of our solutions. We verify the effectiveness of the proposed techniques with extensive experiments. Categories and Subject Descriptors: H.2.2 [Database Management]: Physical Design--Access Methods; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval General Terms: Algorithms, Experimentation Additional Key Words and Phrases: Uncertain databases, range search
- Published
- 2007
23. Efficient skyline and top-k retrieval in subspaces
- Author
-
Tao, Yufei, Xiao, Xiaokui, and Pei, Jian
- Subjects
Algorithms -- Usage ,Algorithms -- Analysis ,Information storage and retrieval -- Analysis ,Trees (Graph theory) -- Usage ,Trees (Graph theory) -- Analysis ,Algorithm ,Business ,Computers ,Electronics ,Electronics and electrical industries - Abstract
Skyline and top-k queries are two popular operations for preference retrieval. In practice, applications that require these operations usually provide numerous candidate attributes, whereas, depending on their interests, users may issue queries regarding different subsets of the dimensions. The existing algorithms are inadequate for subspace skyline/top-k search because they have at least one of the following defects: 1) They require scanning the entire database at least once, 2) they are optimized for one subspace but incur significant overhead for other subspaces, or 3) they demand expensive maintenance cost or space consumption. In this paper, we propose a technique SUBSKY, which settles both types of queries by using purely relational technologies. The core of SUBSKY is a transformation that converts multidimensional data to one-dimensional (1D) values. These values are indexed by a simple B-tree, which allows us to answer subspace queries by accessing a fraction of the database. SUBSKY entails low maintenance overhead, which equals the cost of updating a traditional B-tree. Extensive experiments with real data confirm that our technique outperforms alternative solutions significantly in both efficiency and scalability. Index Terms--Skyline, top-k, subspace, B-tree.
- Published
- 2007
24. Signed Clique Search in Signed Networks: Concepts and Algorithms.
- Author
-
Li, Rong-Hua, Dai, Qiangqiang, Qin, Lu, Wang, Guoren, Xiao, Xiaokui, Yu, Jeffrey Xu, and Qiao, Shaojie
- Subjects
BRANCH & bound algorithms ,ALGORITHMS ,SUBGRAPHS ,SCALABILITY ,SOCIAL networks - Abstract
Mining cohesive subgraphs from a network is a fundamental problem in network analysis. Most existing cohesive subgraph models are mainly tailored to unsigned networks. In this paper, we study the problem of seeking cohesive subgraphs in a signed network, in which each edge can be positive or negative, denoting friendship or conflict, respectively. We propose a novel model, called maximal (α,k)-clique, that represents a cohesive subgraph in signed networks. Specifically, a maximal (α,k)-clique is a clique in which every node has at most k negative neighbors and at least ⌈αk⌉ positive neighbors (α ≥ 1). We show that the problem of enumerating all maximal (α,k)-cliques in a signed network is NP-hard. To enumerate all maximal (α,k)-cliques efficiently, we first develop an elegant signed network reduction technique to significantly prune the signed network. Then, we present an efficient branch and bound enumeration algorithm with several carefully-designed pruning rules to enumerate all maximal (α,k)--cliques in the reduced signed network. In addition, we also propose an efficient algorithm with three novel upper-bounding techniques to find the maximum (α,k)-clique in a signed network. The results of extensive experiments on five large real-life datasets demonstrate the efficiency, scalability, and effectiveness of our algorithms. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
25. Best Bang for the Buck: Cost-Effective Seed Selection for Online Social Networks.
- Author
-
Han, Kai, He, Yuntian, Huang, Keke, Xiao, Xiaokui, Tang, Shaojie, Xu, Jingxin, and Huang, Liusheng
- Subjects
ONLINE social networks ,SELECTION (Plant breeding) ,POLYNOMIAL time algorithms ,VIRAL marketing ,SOCIAL networks ,APPROXIMATION algorithms - Abstract
We study the min-cost seed selection problem in online social networks for viral marketing, where the goal is to select a set of seed nodes with the minimum total cost such that the expected number of influenced nodes in the network exceeds a predefined threshold. We propose several algorithms that outperform the previous studies both on the theoretical approximation ratio and on the experimental performance. In the case where the nodes have heterogeneous costs, our algorithms are the first bi-criteria approximation algorithms with polynomial running time and provable approximation ratio. In the case where the users have uniform costs, our algorithms achieve logarithmic approximation ratio and provable time complexity which is smaller than that of the existing algorithms in orders of magnitude. We conduct extensive experiments using real social networks. The experimental results show that, our algorithms significantly outperform the existing algorithms both on the total cost and on the running time, and also scale well to billion-scale networks. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
26. BATON: Batch One-Hop Personalized PageRanks with Efficiency and Accuracy.
- Author
-
Luo, Siqiang, Xiao, Xiaokui, Lin, Wenqing, and Kao, Ben
- Subjects
- *
MAGNITUDE (Mathematics) , *ALGORITHMS , *BATCH processing , *GRAPH algorithms - Abstract
Personalized PageRank (PPR) is a classic measure of the relevance among different nodes in a graph, and has been applied in numerous systems, such as Twitter's Who-To-Follow and Pinterest's Related Pins. Existing work on PPR has mainly focused on three general types of queries, namely, single-pair PPR, single-source PPR, and all-pair PPR. However, we observe that there are applications that rely on a new query type (referred to as batch one-hop PPR), which takes as input a set $S$ S of source nodes and, for each node $s \in S$ s ∈ S and each of $s$ s 's neighbor $v$ v , asks for the PPR value of $v$ v with respect to $s$ s . None of the existing PPR algorithms is able to efficiently process batch one-hop queries, due to the inherent differences between batch one-hop PPR and the three general query types. To address the limitations of existing algorithms, this paper presents Baton, an algorithm for batch one-hop PPR that offers both strong theoretical guarantees and practical efficiency. Baton leverages the characteristics of one-hop PPR to avoid unnecessary computation, and it incorporates advanced mechanisms to improve the cost-effectiveness of PPR derivations. Extensive experiments on benchmark datasets show that Baton is up to three orders of magnitude faster than the state of the art, while offering the same accuracy. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
27. ROAM: A Fundamental Routing Query on Road Networks with Efficiency.
- Author
-
Luo, Siqiang, Cheng, Reynold, Kao, Ben, Xiao, Xiaokui, Zhou, Shuigeng, and Hu, Jiafeng
- Subjects
ALGORITHMS ,TRAVEL costs ,COST shifting ,PUBLIC transit ,GRAPH algorithms - Abstract
Novel road-network applications often recommend a moving object (e.g., a vehicle) about interesting services or tasks on its way to a destination. A taxi-sharing system, for instance, suggests a new passenger to a taxi while it is serving another one. The traveling cost is then shared among these passengers. A fundamental query is: given two nodes $s$ s and $t$ t , and an area $\mathcal {A}$ A on road network graph $\mathcal {G}$ G , is there a “good” route (e.g., short enough path) $P$ P from $s$ s to $t$ t that crosses $\mathcal {A}$ A in $\mathcal {G}$ G ? In a taxi-sharing system, $s$ s and $t$ t can be a taxi's current and destined locations, and $\mathcal {A}$ A contains all the places to which a person waiting for a taxi is willing to walk. Answering this Route and Area Matching (ROAM) Query allows the application involved to recommend appropriate services to users efficiently. In this paper, we examine efficient ROAM query algorithms. Particularly, we develop solutions for finding a $\rho$ ρ -route, which is an $s$ s - $t$ t path that passes $\mathcal {A}$ A , with a length of at most $(1+\rho)$ (1 + ρ) times the shortest distance between $s$ s and $t$ t . The existence of a $\rho$ ρ -route implies that a service or task located at $\mathcal {A}$ A can be found for a given moving object $m$ m , and that $m$ m only deviates slightly from its current route. We present comprehensive studies on index-free and index-based algorithms for answering ROAM queries. Comprehensive experiments show that our algorithm runs up to 30 times faster than baseline algorithms. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
28. The Disruptions of 5G on Data-Driven Technologies and Applications.
- Author
-
Loghin, Dumitrel, Cai, Shaofeng, Chen, Gang, Dinh, Tien Tuan Anh, Fan, Feiyi, Lin, Qian, Ng, Janice, Ooi, Beng Chin, Sun, Xutao, Ta, Quang-Trung, Wang, Wei, Xiao, Xiaokui, Yang, Yang, Zhang, Meihui, and Zhang, Zhonghua
- Subjects
SMART cities ,TECHNOLOGICAL innovations ,TECHNOLOGY ,DATA management ,DATABASES ,LANDSCAPE assessment ,LANDSCAPE design - Abstract
With 5G on the verge of being adopted as the next mobile network, there is a need to analyze its impact on the landscape of computing and data management. In this paper, we analyze the impact of 5G on both traditional and emerging technologies and project our view on future research challenges and opportunities. With a predicted increase of 10-100x in bandwidth and 5-10x decrease in latency, 5G is expected to be the main enabler for smart cities, smart IoT and efficient healthcare, where machine learning is conducted at the edge. In this context, we investigate how 5G can help the development of federated learning. Network slicing, another key feature of 5G, allows running multiple isolated networks on the same physical infrastructure. However, security remains the main concern in the context of virtualization, multi-tenancy and high device density. Formal verification of 5G networks can be applied to detect security issues in massive virtualized environments. In summary, 5G will make the world even more densely and closely connected. What we have experienced in 4G connectivity will pale in comparison to the vast amounts of possibilities engendered by 5G. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
29. Organizing an Influential Social Event Under a Budget Constraint.
- Author
-
Han, Kai, He, Yuntian, Xiao, Xiaokui, Tang, Shaojie, Gui, Fei, Xu, Chaoting, and Luo, Jun
- Subjects
SUBMODULAR functions ,NP-hard problems ,SOCIAL services ,SOCIAL networks ,NETWORK performance ,POLYNOMIAL time algorithms - Abstract
Recently, the proliferation of event-based social services has made it possible for organizing personalized offline events through the users’ information shared online. In this paper, we study the budget-constrained influential social event organization problem, where the goal is to select a group of influential users with required features to organize a social event under a budget $B$ B . We show that our problem is NP-hard and can be formulated as a submodular maximization problem with mixed packing and covering constraints. We then propose several polynomial time algorithms for our problem with provable approximation ratios, which adopt a novel “surrogate optimization” approach and the method of reverse-reachable set sampling. Moreover, we also consider the case where the influence spread function is unknown and can be arbitrarily selected from a set of candidate submodular functions, and extend our algorithms to address a “robust influential event organization” problem under this case. Finally, we conduct extensive experiments using real social networks to test the performance of our algorithms, and the experimental results demonstrate that our algorithms significantly outperform the prior studies both on the running time and on the influence spread. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
30. Millionaire: a hint-guided approach for crowdsourcing.
- Author
-
Han, Bo, Yao, Quanming, Pan, Yuangang, Tsang, Ivor W., Xiao, Xiaokui, Yang, Qiang, and Sugiyama, Masashi
- Subjects
CROWDSOURCING ,MACHINE learning ,GAME theory ,QUALITY control ,ENVIRONMENTAL psychology - Abstract
Modern machine learning is migrating to the era of complex models, which requires a plethora of well-annotated data. While crowdsourcing is a promising tool to achieve this goal, existing crowdsourcing approaches barely acquire a sufficient amount of high-quality labels. In this paper, motivated by the "Guess-with-Hints" answer strategy from the Millionaire game show, we introduce the hint-guided approach into crowdsourcing to deal with this challenge. Our approach encourages workers to get help from hints when they are unsure of questions. Specifically, we propose a hybrid-stage setting, consisting of the main stage and the hint stage. When workers face any uncertain question on the main stage, they are allowed to enter the hint stage and look up hints before making any answer. A unique payment mechanism that meets two important design principles for crowdsourcing is developed. Besides, the proposed mechanism further encourages high-quality workers less using hints, which helps identify and assigns larger possible payment to them. Experiments are performed on Amazon Mechanical Turk, which show that our approach ensures a sufficient number of high-quality labels with low expenditure and detects high-quality workers. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
31. Privacy Enhanced Matrix Factorization for Recommendation with Local Differential Privacy.
- Author
-
Shin, Hyejin, Kim, Sungwook, Shin, Junbum, and Xiao, Xiaokui
- Subjects
FACTORIZATION ,DIFFERENTIAL forms ,MATRIX functions ,CONJUGATE gradient methods ,DATA security - Abstract
Recommender systems are collecting and analyzing user data to provide better user experience. However, several privacy concerns have been raised when a recommender knows user's set of items or their ratings. A number of solutions have been suggested to improve privacy of legacy recommender systems, but the existing solutions in the literature can protect either items or ratings only. In this paper, we propose a recommender system that protects both user's items and ratings. For this, we develop novel matrix factorization algorithms under local differential privacy (LDP). In a recommender system with LDP, individual users randomize their data themselves to satisfy differential privacy and send the perturbed data to the recommender. Then, the recommender computes aggregates of the perturbed data. This framework ensures that both user's items and ratings remain private from the recommender. However, applying LDP to matrix factorization typically raises utility issues with i) high dimensionality due to a large number of items and ii) iterative estimation algorithms. To tackle these technical challenges, we adopt dimensionality reduction technique and a novel binary mechanism based on sampling. We additionally introduce a factor that stabilizes the perturbed gradients. With MovieLens and LibimSeTi datasets, we evaluate recommendation accuracy of our recommender system and demonstrate that our algorithm performs better than the existing differentially private gradient descent algorithm for matrix factorization under stronger privacy requirements. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
32. Optimal algorithms for selecting top-<italic>k</italic> combinations of attributes: theory and applications.
- Author
-
Lin, Chunbin, Lu, Jiaheng, Wei, Zhewei, Wang, Jianguo, and Xiao, Xiaokui
- Abstract
Traditional top-
k algorithms, e.g., TA and NRA, have been successfully applied in many areas such as information retrieval, data mining and databases. They are designed to discoverk objects, e.g., top-k restaurants, with highest overall scores aggregated from different attributes, e.g., price and location. However, new emerging applications like query recommendation require providing the best combinations ofattributes , instead of objects. The straightforward extension based on the existing top-k algorithms is prohibitively expensive to answer top-k combinations because they need to enumerate all the possible combinations, which is exponential to the number of attributes. In this article, we formalize a novel type of top-k query, called top-k ,m , which aims to find top-k combinations of attributes based on the overall scores of the top-m objects within each combination, wherem is the number of objects forming a combination. We propose a family of efficient top-k ,m algorithms with different data access methods, i.e., sorted accesses and random accesses and different query certainties, i.e., exact query processing and approximate query processing. Theoretically, we prove that our algorithms are instance optimal and analyze the bound of the depth of accesses. We further develop optimizations for efficient query evaluation to reduce the computational and the memory costs and the number of accesses. We provide a case study on the real applications of top-k ,m queries for an online biomedical search engine. Finally, we perform comprehensive experiments to demonstrate the scalability and efficiency of top-k ,m algorithms on multiple real-life datasets. [ABSTRACT FROM AUTHOR]- Published
- 2018
- Full Text
- View/download PDF
33. Network Motif Discovery: A GPU Approach.
- Author
-
Lin, Wenqing, Xiao, Xiaokui, Xie, Xing, and Li, Xiao-Li
- Subjects
- *
DATA mining , *GRAPH theory , *ALGORITHMS , *DIGITAL electronics , *GRAPHICS processing units - Abstract
The identification of network motifs has important applications in numerous domains, such as pattern detection in biological networks and graph analysis in digital circuits. However, mining network motifs is computationally challenging, as it requires enumerating subgraphs from a real-life graph, and computing the frequency of each subgraph in a large number of random graphs. In particular, existing solutions often require days to derive network motifs from biological networks with only a few thousand vertices. To address this problem, this paper presents a novel study on network motif discovery using Graphical Processing Units (GPUs). The basic idea is to employ GPUs to parallelize a large number of subgraph matching tasks in computing subgraph frequencies from random graphs, so as to reduce the overall computation time of network motif discovery. We explore the design space of GPU-based subgraph matching algorithms, with careful analysis of several crucial factors (such as branch divergences and memory coalescing) that affect the performance of GPU programs. Based on our analysis, we develop a GPU-based solution that (i) considerably differs from existing CPU-based methods in how it enumerates subgraphs, and (ii) exploits the strengths of GPUs in terms of parallelism while mitigating their limitations in terms of the computation power per GPU core. With extensive experiments on a variety of biological networks, we show that our solution is up to two orders of magnitude faster than the best CPU-based approach, and is around $20$
times more cost-effective than the latter, when taking into account the monetary costs of the CPU and GPUs used. [ABSTRACT FROM PUBLISHER]- Published
- 2017
- Full Text
- View/download PDF
34. Leveraging Machine Learning Techniques and Engineering of Multi-Nature Features for National Daily Regional Ambulance Demand Prediction.
- Author
-
Lin, Adrian Xi, Ho, Andrew Fu Wah, Cheong, Kang Hao, Li, Zengxiang, Cai, Wentong, Chee, Marcel Lucas, Ng, Yih Yng, Xiao, Xiaokui, and Ong, Marcus Eng Hock
- Published
- 2020
- Full Text
- View/download PDF
35. Publishing Search Logs—A Comparative Study of Privacy Guarantees.
- Author
-
Gotz, Michaela, Machanavajjhala, Ashwin, Wang, Guozhang, Xiao, Xiaokui, and Gehrke, Johannes
- Subjects
ANONYMITY ,DATABASE management ,INFORMATION technology ,WEB search engines ,CONFIDENTIAL communications - Abstract
Search engine companies collect the “database of intentions,” the histories of their users' search queries. These search logs are a gold mine for researchers. Search engine companies, however, are wary of publishing search logs in order not to disclose sensitive information. In this paper, we analyze algorithms for publishing frequent keywords, queries, and clicks of a search log. We first show how methods that achieve variants of k-anonymity are vulnerable to active attacks. We then demonstrate that the stronger guarantee ensured by ε-differential privacy unfortunately does not provide any utility for this problem. We then propose an algorithm ZEALOUS and show how to set its parameters to achieve (ε,δ )-probabilistic privacy. We also contrast our analysis of ZEALOUS with an analysis by Korolova et al. [17] that achieves (ε′,δ′)-indistinguishability. Our paper concludes with a large experimental study using real applications where we compare ZEALOUS and previous work that achieves k-anonymity in search log publishing. Our results show that ZEALOUS yields comparable utility to k-anonymity while at the same time achieving much stronger privacy guarantees. [ABSTRACT FROM AUTHOR]
- Published
- 2012
- Full Text
- View/download PDF
36. Differential Privacy via Wavelet Transforms.
- Author
-
Xiao, Xiaokui, Wang, Guozhang, and Gehrke, Johannes
- Subjects
- *
DATABASE industry , *DATA security , *SIGNAL processing , *WAVELETS (Mathematics) , *MATHEMATICAL transformations , *QUERYING (Computer science) , *SENSITIVITY analysis - Abstract
Privacy-preserving data publishing has attracted considerable research interest in recent years. Among the existing solutions, \epsilon-differential privacy provides the strongest privacy guarantee. Existing data publishing methods that achieve \epsilon-differential privacy, however, offer little data utility. In particular, if the output data set is used to answer count queries, the noise in the query answers can be proportional to the number of tuples in the data, which renders the results useless. In this paper, we develop a data publishing technique that ensures \epsilon-differential privacy while providing accurate answers for range-count queries, i.e., count queries where the predicate on each attribute is a range. The core of our solution is a framework that applies wavelet transforms on the data before adding noise to it. We present instantiations of the proposed framework for both ordinal and nominal data, and we provide a theoretical analysis on their privacy and utility guarantees. In an extensive experimental study on both real and synthetic data, we show the effectiveness and efficiency of our solution. [ABSTRACT FROM AUTHOR]
- Published
- 2011
- Full Text
- View/download PDF
37. Disease gene classification with metagraph representations.
- Author
-
Kircali Ata, Sezin, Fang, Yuan, Wu, Min, Li, Xiao-Li, and Xiao, Xiaokui
- Subjects
- *
PROTEIN-protein interactions , *MOLECULAR biology , *METAGRAPHS , *PREDICTION models , *GENETICS of breast cancer , *SUPERVISED learning - Abstract
Protein-protein interaction (PPI) networks play an important role in studying the functional roles of proteins, including their association with diseases. However, protein interaction networks are not sufficient without the support of additional biological knowledge for proteins such as their molecular functions and biological processes. To complement and enrich PPI networks, we propose to exploit biological properties of individual proteins. More specifically, we integrate keywords describing protein properties into the PPI network, and construct a novel PPI-Keywords (PPIK) network consisting of both proteins and keywords as two different types of nodes. As disease proteins tend to have a similar topological characteristics on the PPIK network, we further propose to represent proteins with metagraphs . Different from a traditional network motif or subgraph, a metagraph can capture a particular topological arrangement involving the interactions/associations between both proteins and keywords . Based on the novel metagraph representations for proteins, we further build classifiers for disease protein classification through supervised learning. Our experiments on three different PPI databases demonstrate that the proposed method consistently improves disease protein prediction across various classifiers, by 15.3% in AUC on average. It outperforms the baselines including the diffusion-based methods (e.g., RWR) and the module-based methods by 13.8–32.9% for overall disease protein prediction. For predicting breast cancer genes, it outperforms RWR, PRINCE and the module-based baselines by 6.6–14.2%. Finally, our predictions also turn out to have better correlations with literature findings from PubMed. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
38. Disease Gene Classification with Metagraph Representations.
- Author
-
Ata SK, Fang Y, Wu M, Li XL, and Xiao X
- Subjects
- Data Mining, Humans, Protein Interaction Mapping, Publications, Algorithms, Computational Biology methods, Disease genetics
- Abstract
This chapter is based on exploiting the network-based representations of proteins, metagraphs, in protein-protein interaction network to identify candidate disease-causing proteins. Protein-protein interaction (PPI) networks are effective tools in studying the functional roles of proteins in the development of various diseases. However, they are insufficient without the support of additional biological knowledge for proteins such as their molecular functions and biological processes. To enhance PPI networks, we utilize biological properties of individual proteins as well. More specifically, we integrate keywords from UniProt database describing protein properties into the PPI network and construct a novel heterogeneous PPI-Keyword (PPIK) network consisting of both proteins and keywords. As proteins with similar functional duties or involving in the same metabolic pathway tend to have similar topological characteristics, we propose to represent them with metagraphs. Compared to the traditional network motif or subgraph, a metagraph can capture the topological arrangements through not only the protein-protein interactions but also protein-keyword associations. We feed those novel metagraph representations into classifiers for disease protein prediction and conduct our experiments on three different PPI databases. They show that the proposed method consistently increases disease protein prediction performance across various classifiers, by 15.3% in AUC on average. It outperforms the diffusion-based (e.g., RWR) and the module-based baselines by 13.8-32.9% in overall disease protein prediction. Breast cancer protein prediction outperforms RWR, PRINCE, and the module-based baselines by 6.6-14.2%. Finally, our predictions also exhibit better correlations with literature findings from PubMed database.
- Published
- 2018
- Full Text
- View/download PDF
39. Deterministic identification of specific individuals from GWAS results.
- Author
-
Cai R, Hao Z, Winslett M, Xiao X, Yang Y, Zhang Z, and Zhou S
- Subjects
- Genome, Human, Genotype, Humans, Polymorphism, Single Nucleotide, Algorithms, Genetic Privacy, Genome-Wide Association Study
- Abstract
Motivation: Genome-wide association studies (GWASs) are commonly applied on human genomic data to understand the causal gene combinations statistically connected to certain diseases. Patients involved in these GWASs could be re-identified when the studies release statistical information on a large number of single-nucleotide polymorphisms. Subsequent work, however, found that such privacy attacks are theoretically possible but unsuccessful and unconvincing in real settings., Results: We derive the first practical privacy attack that can successfully identify specific individuals from limited published associations from the Wellcome Trust Case Control Consortium (WTCCC) dataset. For GWAS results computed over 25 randomly selected loci, our algorithm always pinpoints at least one patient from the WTCCC dataset. Moreover, the number of re-identified patients grows rapidly with the number of published genotypes. Finally, we discuss prevention methods to disable the attack, thus providing a solution for enhancing patient privacy., Availability and Implementation: Proofs of the theorems and additional experimental results are available in the support online documents. The attack algorithm codes are publicly available at https://sites.google.com/site/zhangzhenjie/GWAS_attack.zip. The genomic dataset used in the experiments is available at http://www.wtccc.org.uk/ on request., (© The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.)
- Published
- 2015
- Full Text
- View/download PDF
Catalog
Discovery Service for Jio Institute Digital Library
For full access to our library's resources, please sign in.