Author: "Xiao, Xiaokui" / Publication Type: Academic Journals - Searchworks@Jio Institute Digital Library Search Results

1. NeurDB: an AI-powered autonomous data system

Author: Ooi, Beng Chin, Cai, Shaofeng, Chen, Gang, Shen, Yanyan, Tan, Kian-Lee, Wu, Yuncheng, Xiao, Xiaokui, Xing, Naili, Yue, Cong, Zeng, Lingze, Zhang, Meihui, and Zhao, Zhanhao
Published: 2024
Full Text: View/download PDF

2. PANE: scalable and effective attributed network embedding

Author: Yang, Renchi, Shi, Jieming, Xiao, Xiaokui, Yang, Yin, Bhowmick, Sourav S., and Liu, Juncheng
Published: 2023
Full Text: View/download PDF

3. Improving the utility of locally differentially private protocols for longitudinal and multidimensional frequency estimates

Author: Arcolezi, Héber H., Couchot, Jean-François, Al Bouna, Bechara, and Xiao, Xiaokui
Published: 2024
Full Text: View/download PDF

4. Differentially private multivariate time series forecasting of aggregated human mobility with deep learning: Input or gradient perturbation?

Author: Arcolezi, Héber Hwang, Couchot, Jean-François, Renaud, Denis, Al Bouna, Bechara, and Xiao, Xiaokui
Published: 2022
Full Text: View/download PDF

5. ERGCN: Data enhancement-based robust graph convolutional network against adversarial attacks

Author: Wu, Tao, Yang, Nan, Chen, Long, Xiao, Xiaokui, Xian, Xingping, Liu, Jun, Qiao, Shaojie, and Cui, Canyixing
Published: 2022
Full Text: View/download PDF

6. Forecasting the number of firefighter interventions per region with local-differential-privacy-based data

Author: Arcolezi, Héber H., Couchot, Jean-François, Cerna, Selene, Guyeux, Christophe, Royer, Guillaume, Bouna, Béchara Al, and Xiao, Xiaokui
Published: 2020
Full Text: View/download PDF

7. Efficient approximation algorithms for adaptive influence maximization

Author: Huang, Keke, Tang, Jing, Han, Kai, Xiao, Xiaokui, Chen, Wei, Sun, Aixin, Tang, Xueyan, and Lim, Andrew
Published: 2020
Full Text: View/download PDF

8. Finding skyline communities in multi-valued networks

Author: Li, Rong-Hua, Qin, Lu, Ye, Fanghua, Wang, Guoren, Yu, Jeffrey Xu, Xiao, Xiaokui, Xiao, Nong, and Zheng, Zibin
Published: 2020
Full Text: View/download PDF

9. FERRARI: an efficient framework for visual exploratory subgraph search in graph databases

Author: Wang, Chaohui, Xie, Miao, Bhowmick, Sourav S., Choi, Byron, Xiao, Xiaokui, and Zhou, Shuigeng
Published: 2020
Full Text: View/download PDF

10. Optimal algorithms for selecting top-k combinations of attributes: theory and applications

Author: Lin, Chunbin, Lu, Jiaheng, Wei, Zhewei, Wang, Jianguo, and Xiao, Xiaokui
Published: 2017
Full Text: View/download PDF

11. Efficient nonparametric and asymptotic Bayesian model selection methods for attributed graph clustering

Author: Xu, Zhiqiang, Cheng, James, Xiao, Xiaokui, Fujimaki, Ryohei, and Muraoka, Yusuke
Published: 2017
Full Text: View/download PDF

12. Software process evaluation: a machine learning framework with application to defect management process

Author: Chen, Ning, Hoi, Steven C. H., and Xiao, Xiaokui
Published: 2014
Full Text: View/download PDF

13. Dynamic monitoring of optimal locations in road network databases

Author: Yao, Bin, Xiao, Xiaokui, Li, Feifei, and Wu, Yifan
Published: 2014
Full Text: View/download PDF

14. Differentially private histogram publication

Author: Xu, Jia, Zhang, Zhenjie, Xiao, Xiaokui, Yang, Yin, Yu, Ge, and Winslett, Marianne
Published: 2013
Full Text: View/download PDF

15. LF-GDPR: A Framework for Estimating Graph Metrics With Local Differential Privacy.

Author: Ye, Qingqing, Hu, Haibo, Au, Man Ho, Meng, Xiaofeng, and Xiao, Xiaokui
Subjects: PRIVACY, TASK analysis, TRUST, ACQUISITION of data, TANNER graphs
Abstract: Local differential privacy (LDP) is an emerging technique for privacy-preserving data collection without a trusted collector. Despite its strong privacy guarantee, LDP cannot be easily applied to real-world graph analysis tasks such as community detection and centrality analysis due to its high implementation complexity and low data utility. In this paper, we address these two issues by presenting LF-GDPR, the first LDP-enabled graph metric estimation framework for graph analysis. It collects two atomic graph metrics—the adjacency bit vector and node degree—from each node locally. LF-GDPR simplifies the job of implementing LDP-related steps (e.g., local perturbation, aggregation and calibration) for a graph metric estimation task by providing either a complete or a parameterized algorithm for each step. To address low data utility of LDP, it optimally allocates privacy budget between the two atomic metrics during data collection. To demonstrate the usage of LF-GDPR, we show use cases on two common graph analysis tasks, namely, clustering coefficient estimation and community detection. The privacy and utility achieved by LF-GDPR are verified through theoretical analysis and extensive experimental results. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

16. Transparent anonymization: thwarting adversaries who know the algorithm

Author: Xiao, Xiaokui, Tao, Yufei, and Koudas, Nick
Subjects: Algorithm, Privacy issue, Company business management, Algorithms -- Usage, Database administration -- Methods, Database administration -- Standards, Statistical methods -- Usage, Privacy -- Management
Abstract: Numerous generalization techniques have been proposed for privacy-preserving data publishing. Most existing techniques, however, implicitly assume that the adversary knows little about the anonymization algorithm adopted by the data publisher. Consequently, they cannot guard against privacy attacks that exploit various characteristics of the anonymization mechanism. This article provides a practical solution to this problem. First, we propose an analytical model for evaluating disclosure risks, when an adversary knows everything in the anonymization process, except the sensitive values. Based on this model, we develop a privacy principle, transparent l-diversity, which ensures privacy protection against such powerful adversaries. We identify three algorithms that achieve transparent l-diversity, and verify their effectiveness and efficiency through extensive experiments with real data. Categories and Subject Descriptors: H.2.8 [Database Management]: Database Applications-Statistical databases General Terms: Theory, Algorithms, Experimentation Additional Key Words and Phrases: Privacy-preserving data publishing, generalization, l-diversity DOI = 10.1145/1735886.1735887
Published: 2010

17. ANGEL: enhancing the utility of generalization for privacy preserving publication

Author: Tao, Yufei, Chen, Hekang, Xiao, Xiaokui, Zhou, Shuigeng, and Zhang, Donghui
Subjects: Privacy -- Analysis, Data security -- Methods, Data security -- Reports, Privacy issue, Data security issue, Business, Computers, Electronics, Electronics and electrical industries
Abstract: Generalization is a well-known method for privacy preserving data publication. Despite its vast popularity, it has several drawbacks such as heavy information loss, difficulty of supporting marginal publication, and so on. To overcome these drawbacks, we develop ANGEL, (1) a new anonymization technique that is as effective as generalization in privacy protection, but is able to retain significantly more information in the microdata. ANGEL is applicable to any monotonic principles (e.g., l-diversity, t-closeness, etc.), with its superiority (in correlation preservation) especially obvious when tight privacy control must be enforced. We show that ANGEL lends itself elegantly to the hard problem of marginal publication. In particular, unlike generalization that can release only restricted marginals, our technique can be easily used to publish any marginals with strong privacy guarantees. Index Terms--Privacy, generalization, ANGEL.
Published: 2009

18. Primal or dual: which promises faster spatiotemporal search?

Author: Tao, Yufei and Xiao, Xiaokui
Published: 2008
Full Text: View/download PDF

19. Efficient temporal counting with bounded error

Author: Tao, Yufei and Xiao, Xiaokui
Published: 2008
Full Text: View/download PDF

20. I/O-Efficient Algorithms for Degeneracy Computation on Massive Networks.

Author: Li, Rong-Hua, Song, Qiushuo, Xiao, Xiaokui, Qin, Lu, Wang, Guoren, Yu, Jeffrey Xu, and Mao, Rui
Subjects: ALGORITHMS, HEURISTIC algorithms
Abstract: Degeneracy is an important concept to measure the sparsity of a graph which has been widely used in many network analysis applications. Many network analysis algorithms, such as clique enumeration and truss decomposition, perform very well in graphs having small degeneracies. In this paper, we propose an I/O-efficient algorithm to compute the degeneracy of the massive graph that cannot be fully kept in the main memory. The proposed algorithm only uses $O(n)$ O (n) memory, where $n$ n denotes the number of nodes of the graph. We also develop an I/O-efficient algorithm to incrementally maintain the degeneracy on dynamic graphs. Extensive experiments show that our algorithms significantly outperform the state-of-the-art degeneracy computation algorithms in terms of both running time and I/O costs. The results also demonstrate high scalability of the proposed algorithms. For example, in a real-world web graph with 930 million nodes and 13.3 billion edges, the proposed algorithm takes only 633 seconds and uses less than 4.5GB memory to compute the degeneracy. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

21. Multidimensional reverse kNN search

Author: Tao, Yufei, Papadias, Dimitris, Lian, Xiang, and Xiao, Xiaokui
Published: 2007
Full Text: View/download PDF

22. Range search on multidimensional uncertain data

Author: Tao, Yufei, Xiao, Xiaokui, and Cheng, Reynold
Subjects: Information accessibility, Information management -- Methods, Query processing -- Methods, Information storage and retrieval -- Methods
Abstract: In an uncertain database, every object a is associated with a probability density function, which describes the likelihood that a appears at each position in a multidimensional workspace. This article studies two types of range retrieval fundamental to many analytical tasks. Specifically, a nonfuzzy query returns all the objects that appear in a search region [r.sub.q] with at least a certain probability [t.sub.q]. On the other hand, given an uncertain object q, fuzzy search retrieves the set of objects that are within distance [[epsilon].sub.q] from q with no less than probability [t.sub.q]. The core of our methodology is a novel concept of "probabilistically constrained rectangle", which permits effective pruning/validation of nonqualifying/qualifying data. We develop a new index structure called the U-tree for minimizing the query overhead. Our algorithmic findings are accompanied with a thorough theoretical analysis, which reveals valuable insight into the problem characteristics, and mathematically confirms the efficiency of our solutions. We verify the effectiveness of the proposed techniques with extensive experiments. Categories and Subject Descriptors: H.2.2 [Database Management]: Physical Design--Access Methods; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval General Terms: Algorithms, Experimentation Additional Key Words and Phrases: Uncertain databases, range search
Published: 2007

23. Efficient skyline and top-k retrieval in subspaces

Author: Tao, Yufei, Xiao, Xiaokui, and Pei, Jian
Subjects: Algorithms -- Usage, Algorithms -- Analysis, Information storage and retrieval -- Analysis, Trees (Graph theory) -- Usage, Trees (Graph theory) -- Analysis, Algorithm, Business, Computers, Electronics, Electronics and electrical industries
Abstract: Skyline and top-k queries are two popular operations for preference retrieval. In practice, applications that require these operations usually provide numerous candidate attributes, whereas, depending on their interests, users may issue queries regarding different subsets of the dimensions. The existing algorithms are inadequate for subspace skyline/top-k search because they have at least one of the following defects: 1) They require scanning the entire database at least once, 2) they are optimized for one subspace but incur significant overhead for other subspaces, or 3) they demand expensive maintenance cost or space consumption. In this paper, we propose a technique SUBSKY, which settles both types of queries by using purely relational technologies. The core of SUBSKY is a transformation that converts multidimensional data to one-dimensional (1D) values. These values are indexed by a simple B-tree, which allows us to answer subspace queries by accessing a fraction of the database. SUBSKY entails low maintenance overhead, which equals the cost of updating a traditional B-tree. Extensive experiments with real data confirm that our technique outperforms alternative solutions significantly in both efficiency and scalability. Index Terms--Skyline, top-k, subspace, B-tree.
Published: 2007

24. Signed Clique Search in Signed Networks: Concepts and Algorithms.

Author: Li, Rong-Hua, Dai, Qiangqiang, Qin, Lu, Wang, Guoren, Xiao, Xiaokui, Yu, Jeffrey Xu, and Qiao, Shaojie
Subjects: BRANCH & bound algorithms, ALGORITHMS, SUBGRAPHS, SCALABILITY, SOCIAL networks
Abstract: Mining cohesive subgraphs from a network is a fundamental problem in network analysis. Most existing cohesive subgraph models are mainly tailored to unsigned networks. In this paper, we study the problem of seeking cohesive subgraphs in a signed network, in which each edge can be positive or negative, denoting friendship or conflict, respectively. We propose a novel model, called maximal (α,k)-clique, that represents a cohesive subgraph in signed networks. Specifically, a maximal (α,k)-clique is a clique in which every node has at most k negative neighbors and at least ⌈αk⌉ positive neighbors (α ≥ 1). We show that the problem of enumerating all maximal (α,k)-cliques in a signed network is NP-hard. To enumerate all maximal (α,k)-cliques efficiently, we first develop an elegant signed network reduction technique to significantly prune the signed network. Then, we present an efficient branch and bound enumeration algorithm with several carefully-designed pruning rules to enumerate all maximal (α,k)--cliques in the reduced signed network. In addition, we also propose an efficient algorithm with three novel upper-bounding techniques to find the maximum (α,k)-clique in a signed network. The results of extensive experiments on five large real-life datasets demonstrate the efficiency, scalability, and effectiveness of our algorithms. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

25. Best Bang for the Buck: Cost-Effective Seed Selection for Online Social Networks.

Author: Han, Kai, He, Yuntian, Huang, Keke, Xiao, Xiaokui, Tang, Shaojie, Xu, Jingxin, and Huang, Liusheng
Subjects: ONLINE social networks, SELECTION (Plant breeding), POLYNOMIAL time algorithms, VIRAL marketing, SOCIAL networks, APPROXIMATION algorithms
Abstract: We study the min-cost seed selection problem in online social networks for viral marketing, where the goal is to select a set of seed nodes with the minimum total cost such that the expected number of influenced nodes in the network exceeds a predefined threshold. We propose several algorithms that outperform the previous studies both on the theoretical approximation ratio and on the experimental performance. In the case where the nodes have heterogeneous costs, our algorithms are the first bi-criteria approximation algorithms with polynomial running time and provable approximation ratio. In the case where the users have uniform costs, our algorithms achieve logarithmic approximation ratio and provable time complexity which is smaller than that of the existing algorithms in orders of magnitude. We conduct extensive experiments using real social networks. The experimental results show that, our algorithms significantly outperform the existing algorithms both on the total cost and on the running time, and also scale well to billion-scale networks. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

26. BATON: Batch One-Hop Personalized PageRanks with Efficiency and Accuracy.

Author: Luo, Siqiang, Xiao, Xiaokui, Lin, Wenqing, and Kao, Ben
Subjects: *MAGNITUDE (Mathematics), *ALGORITHMS, *BATCH processing, *GRAPH algorithms
Abstract: Personalized PageRank (PPR) is a classic measure of the relevance among different nodes in a graph, and has been applied in numerous systems, such as Twitter's Who-To-Follow and Pinterest's Related Pins. Existing work on PPR has mainly focused on three general types of queries, namely, single-pair PPR, single-source PPR, and all-pair PPR. However, we observe that there are applications that rely on a new query type (referred to as batch one-hop PPR), which takes as input a set $S$ S of source nodes and, for each node $s \in S$ s ∈ S and each of $s$ s 's neighbor $v$ v , asks for the PPR value of $v$ v with respect to $s$ s . None of the existing PPR algorithms is able to efficiently process batch one-hop queries, due to the inherent differences between batch one-hop PPR and the three general query types. To address the limitations of existing algorithms, this paper presents Baton, an algorithm for batch one-hop PPR that offers both strong theoretical guarantees and practical efficiency. Baton leverages the characteristics of one-hop PPR to avoid unnecessary computation, and it incorporates advanced mechanisms to improve the cost-effectiveness of PPR derivations. Extensive experiments on benchmark datasets show that Baton is up to three orders of magnitude faster than the state of the art, while offering the same accuracy. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

27. ROAM: A Fundamental Routing Query on Road Networks with Efficiency.

Author: Luo, Siqiang, Cheng, Reynold, Kao, Ben, Xiao, Xiaokui, Zhou, Shuigeng, and Hu, Jiafeng
Subjects: ALGORITHMS, TRAVEL costs, COST shifting, PUBLIC transit, GRAPH algorithms
Abstract: Novel road-network applications often recommend a moving object (e.g., a vehicle) about interesting services or tasks on its way to a destination. A taxi-sharing system, for instance, suggests a new passenger to a taxi while it is serving another one. The traveling cost is then shared among these passengers. A fundamental query is: given two nodes $s$ s and $t$ t , and an area $\mathcal {A}$ A on road network graph $\mathcal {G}$ G , is there a “good” route (e.g., short enough path) $P$ P from $s$ s to $t$ t that crosses $\mathcal {A}$ A in $\mathcal {G}$ G ? In a taxi-sharing system, $s$ s and $t$ t can be a taxi's current and destined locations, and $\mathcal {A}$ A contains all the places to which a person waiting for a taxi is willing to walk. Answering this Route and Area Matching (ROAM) Query allows the application involved to recommend appropriate services to users efficiently. In this paper, we examine efficient ROAM query algorithms. Particularly, we develop solutions for finding a $\rho$ ρ -route, which is an $s$ s - $t$ t path that passes $\mathcal {A}$ A , with a length of at most $(1+\rho)$ (1 + ρ) times the shortest distance between $s$ s and $t$ t . The existence of a $\rho$ ρ -route implies that a service or task located at $\mathcal {A}$ A can be found for a given moving object $m$ m , and that $m$ m only deviates slightly from its current route. We present comprehensive studies on index-free and index-based algorithms for answering ROAM queries. Comprehensive experiments show that our algorithm runs up to 30 times faster than baseline algorithms. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

28. The Disruptions of 5G on Data-Driven Technologies and Applications.

Author: Loghin, Dumitrel, Cai, Shaofeng, Chen, Gang, Dinh, Tien Tuan Anh, Fan, Feiyi, Lin, Qian, Ng, Janice, Ooi, Beng Chin, Sun, Xutao, Ta, Quang-Trung, Wang, Wei, Xiao, Xiaokui, Yang, Yang, Zhang, Meihui, and Zhang, Zhonghua
Subjects: SMART cities, TECHNOLOGICAL innovations, TECHNOLOGY, DATA management, DATABASES, LANDSCAPE assessment, LANDSCAPE design
Abstract: With 5G on the verge of being adopted as the next mobile network, there is a need to analyze its impact on the landscape of computing and data management. In this paper, we analyze the impact of 5G on both traditional and emerging technologies and project our view on future research challenges and opportunities. With a predicted increase of 10-100x in bandwidth and 5-10x decrease in latency, 5G is expected to be the main enabler for smart cities, smart IoT and efficient healthcare, where machine learning is conducted at the edge. In this context, we investigate how 5G can help the development of federated learning. Network slicing, another key feature of 5G, allows running multiple isolated networks on the same physical infrastructure. However, security remains the main concern in the context of virtualization, multi-tenancy and high device density. Formal verification of 5G networks can be applied to detect security issues in massive virtualized environments. In summary, 5G will make the world even more densely and closely connected. What we have experienced in 4G connectivity will pale in comparison to the vast amounts of possibilities engendered by 5G. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

29. Organizing an Influential Social Event Under a Budget Constraint.

Author: Han, Kai, He, Yuntian, Xiao, Xiaokui, Tang, Shaojie, Gui, Fei, Xu, Chaoting, and Luo, Jun
Subjects: SUBMODULAR functions, NP-hard problems, SOCIAL services, SOCIAL networks, NETWORK performance, POLYNOMIAL time algorithms
Abstract: Recently, the proliferation of event-based social services has made it possible for organizing personalized offline events through the users’ information shared online. In this paper, we study the budget-constrained influential social event organization problem, where the goal is to select a group of influential users with required features to organize a social event under a budget $B$ B . We show that our problem is NP-hard and can be formulated as a submodular maximization problem with mixed packing and covering constraints. We then propose several polynomial time algorithms for our problem with provable approximation ratios, which adopt a novel “surrogate optimization” approach and the method of reverse-reachable set sampling. Moreover, we also consider the case where the influence spread function is unknown and can be arbitrarily selected from a set of candidate submodular functions, and extend our algorithms to address a “robust influential event organization” problem under this case. Finally, we conduct extensive experiments using real social networks to test the performance of our algorithms, and the experimental results demonstrate that our algorithms significantly outperform the prior studies both on the running time and on the influence spread. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

30. Millionaire: a hint-guided approach for crowdsourcing.

Author: Han, Bo, Yao, Quanming, Pan, Yuangang, Tsang, Ivor W., Xiao, Xiaokui, Yang, Qiang, and Sugiyama, Masashi
Subjects: CROWDSOURCING, MACHINE learning, GAME theory, QUALITY control, ENVIRONMENTAL psychology
Abstract: Modern machine learning is migrating to the era of complex models, which requires a plethora of well-annotated data. While crowdsourcing is a promising tool to achieve this goal, existing crowdsourcing approaches barely acquire a sufficient amount of high-quality labels. In this paper, motivated by the "Guess-with-Hints" answer strategy from the Millionaire game show, we introduce the hint-guided approach into crowdsourcing to deal with this challenge. Our approach encourages workers to get help from hints when they are unsure of questions. Specifically, we propose a hybrid-stage setting, consisting of the main stage and the hint stage. When workers face any uncertain question on the main stage, they are allowed to enter the hint stage and look up hints before making any answer. A unique payment mechanism that meets two important design principles for crowdsourcing is developed. Besides, the proposed mechanism further encourages high-quality workers less using hints, which helps identify and assigns larger possible payment to them. Experiments are performed on Amazon Mechanical Turk, which show that our approach ensures a sufficient number of high-quality labels with low expenditure and detects high-quality workers. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

31. Privacy Enhanced Matrix Factorization for Recommendation with Local Differential Privacy.

Author: Shin, Hyejin, Kim, Sungwook, Shin, Junbum, and Xiao, Xiaokui
Subjects: FACTORIZATION, DIFFERENTIAL forms, MATRIX functions, CONJUGATE gradient methods, DATA security
Abstract: Recommender systems are collecting and analyzing user data to provide better user experience. However, several privacy concerns have been raised when a recommender knows user's set of items or their ratings. A number of solutions have been suggested to improve privacy of legacy recommender systems, but the existing solutions in the literature can protect either items or ratings only. In this paper, we propose a recommender system that protects both user's items and ratings. For this, we develop novel matrix factorization algorithms under local differential privacy (LDP). In a recommender system with LDP, individual users randomize their data themselves to satisfy differential privacy and send the perturbed data to the recommender. Then, the recommender computes aggregates of the perturbed data. This framework ensures that both user's items and ratings remain private from the recommender. However, applying LDP to matrix factorization typically raises utility issues with i) high dimensionality due to a large number of items and ii) iterative estimation algorithms. To tackle these technical challenges, we adopt dimensionality reduction technique and a novel binary mechanism based on sampling. We additionally introduce a factor that stabilizes the perturbed gradients. With MovieLens and LibimSeTi datasets, we evaluate recommendation accuracy of our recommender system and demonstrate that our algorithm performs better than the existing differentially private gradient descent algorithm for matrix factorization under stronger privacy requirements. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

32. Optimal algorithms for selecting top-<italic>k</italic> combinations of attributes: theory and applications.

Author: Lin, Chunbin, Lu, Jiaheng, Wei, Zhewei, Wang, Jianguo, and Xiao, Xiaokui
Abstract: Traditional top-k algorithms, e.g., TA and NRA, have been successfully applied in many areas such as information retrieval, data mining and databases. They are designed to discover k objects, e.g., top-k restaurants, with highest overall scores aggregated from different attributes, e.g., price and location. However, new emerging applications like query recommendation require providing the best combinations of attributes, instead of objects. The straightforward extension based on the existing top-k algorithms is prohibitively expensive to answer top-k combinations because they need to enumerate all the possible combinations, which is exponential to the number of attributes. In this article, we formalize a novel type of top-k query, called top-k, m, which aims to find top-k combinations of attributes based on the overall scores of the top-m objects within each combination, where m is the number of objects forming a combination. We propose a family of efficient top-k, m algorithms with different data access methods, i.e., sorted accesses and random accesses and different query certainties, i.e., exact query processing and approximate query processing. Theoretically, we prove that our algorithms are instance optimal and analyze the bound of the depth of accesses. We further develop optimizations for efficient query evaluation to reduce the computational and the memory costs and the number of accesses. We provide a case study on the real applications of top-k, m queries for an online biomedical search engine. Finally, we perform comprehensive experiments to demonstrate the scalability and efficiency of top-k, m algorithms on multiple real-life datasets. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

33. Network Motif Discovery: A GPU Approach.

Author: Lin, Wenqing, Xiao, Xiaokui, Xie, Xing, and Li, Xiao-Li
Subjects: *DATA mining, *GRAPH theory, *ALGORITHMS, *DIGITAL electronics, *GRAPHICS processing units
Abstract: The identification of network motifs has important applications in numerous domains, such as pattern detection in biological networks and graph analysis in digital circuits. However, mining network motifs is computationally challenging, as it requires enumerating subgraphs from a real-life graph, and computing the frequency of each subgraph in a large number of random graphs. In particular, existing solutions often require days to derive network motifs from biological networks with only a few thousand vertices. To address this problem, this paper presents a novel study on network motif discovery using Graphical Processing Units (GPUs). The basic idea is to employ GPUs to parallelize a large number of subgraph matching tasks in computing subgraph frequencies from random graphs, so as to reduce the overall computation time of network motif discovery. We explore the design space of GPU-based subgraph matching algorithms, with careful analysis of several crucial factors (such as branch divergences and memory coalescing) that affect the performance of GPU programs. Based on our analysis, we develop a GPU-based solution that (i) considerably differs from existing CPU-based methods in how it enumerates subgraphs, and (ii) exploits the strengths of GPUs in terms of parallelism while mitigating their limitations in terms of the computation power per GPU core. With extensive experiments on a variety of biological networks, we show that our solution is up to two orders of magnitude faster than the best CPU-based approach, and is around $20$ times more cost-effective than the latter, when taking into account the monetary costs of the CPU and GPUs used. [ABSTRACT FROM PUBLISHER]
Published: 2017
Full Text: View/download PDF

34. Leveraging Machine Learning Techniques and Engineering of Multi-Nature Features for National Daily Regional Ambulance Demand Prediction.

Author: Lin, Adrian Xi, Ho, Andrew Fu Wah, Cheong, Kang Hao, Li, Zengxiang, Cai, Wentong, Chee, Marcel Lucas, Ng, Yih Yng, Xiao, Xiaokui, and Ong, Marcus Eng Hock
Published: 2020
Full Text: View/download PDF

35. Publishing Search Logs—A Comparative Study of Privacy Guarantees.

Author: Gotz, Michaela, Machanavajjhala, Ashwin, Wang, Guozhang, Xiao, Xiaokui, and Gehrke, Johannes
Subjects: ANONYMITY, DATABASE management, INFORMATION technology, WEB search engines, CONFIDENTIAL communications
Abstract: Search engine companies collect the “database of intentions,” the histories of their users' search queries. These search logs are a gold mine for researchers. Search engine companies, however, are wary of publishing search logs in order not to disclose sensitive information. In this paper, we analyze algorithms for publishing frequent keywords, queries, and clicks of a search log. We first show how methods that achieve variants of k-anonymity are vulnerable to active attacks. We then demonstrate that the stronger guarantee ensured by ε-differential privacy unfortunately does not provide any utility for this problem. We then propose an algorithm ZEALOUS and show how to set its parameters to achieve (ε,δ )-probabilistic privacy. We also contrast our analysis of ZEALOUS with an analysis by Korolova et al. [17] that achieves (ε′,δ′)-indistinguishability. Our paper concludes with a large experimental study using real applications where we compare ZEALOUS and previous work that achieves k-anonymity in search log publishing. Our results show that ZEALOUS yields comparable utility to k-anonymity while at the same time achieving much stronger privacy guarantees. [ABSTRACT FROM AUTHOR]
Published: 2012
Full Text: View/download PDF

36. Differential Privacy via Wavelet Transforms.

Author: Xiao, Xiaokui, Wang, Guozhang, and Gehrke, Johannes
Subjects: *DATABASE industry, *DATA security, *SIGNAL processing, *WAVELETS (Mathematics), *MATHEMATICAL transformations, *QUERYING (Computer science), *SENSITIVITY analysis
Abstract: Privacy-preserving data publishing has attracted considerable research interest in recent years. Among the existing solutions, \epsilon-differential privacy provides the strongest privacy guarantee. Existing data publishing methods that achieve \epsilon-differential privacy, however, offer little data utility. In particular, if the output data set is used to answer count queries, the noise in the query answers can be proportional to the number of tuples in the data, which renders the results useless. In this paper, we develop a data publishing technique that ensures \epsilon-differential privacy while providing accurate answers for range-count queries, i.e., count queries where the predicate on each attribute is a range. The core of our solution is a framework that applies wavelet transforms on the data before adding noise to it. We present instantiations of the proposed framework for both ordinal and nominal data, and we provide a theoretical analysis on their privacy and utility guarantees. In an extensive experimental study on both real and synthetic data, we show the effectiveness and efficiency of our solution. [ABSTRACT FROM AUTHOR]
Published: 2011
Full Text: View/download PDF

37. Disease gene classification with metagraph representations.

Author: Kircali Ata, Sezin, Fang, Yuan, Wu, Min, Li, Xiao-Li, and Xiao, Xiaokui
Subjects: *PROTEIN-protein interactions, *MOLECULAR biology, *METAGRAPHS, *PREDICTION models, *GENETICS of breast cancer, *SUPERVISED learning
Abstract: Protein-protein interaction (PPI) networks play an important role in studying the functional roles of proteins, including their association with diseases. However, protein interaction networks are not sufficient without the support of additional biological knowledge for proteins such as their molecular functions and biological processes. To complement and enrich PPI networks, we propose to exploit biological properties of individual proteins. More specifically, we integrate keywords describing protein properties into the PPI network, and construct a novel PPI-Keywords (PPIK) network consisting of both proteins and keywords as two different types of nodes. As disease proteins tend to have a similar topological characteristics on the PPIK network, we further propose to represent proteins with metagraphs . Different from a traditional network motif or subgraph, a metagraph can capture a particular topological arrangement involving the interactions/associations between both proteins and keywords . Based on the novel metagraph representations for proteins, we further build classifiers for disease protein classification through supervised learning. Our experiments on three different PPI databases demonstrate that the proposed method consistently improves disease protein prediction across various classifiers, by 15.3% in AUC on average. It outperforms the baselines including the diffusion-based methods (e.g., RWR) and the module-based methods by 13.8–32.9% for overall disease protein prediction. For predicting breast cancer genes, it outperforms RWR, PRINCE and the module-based baselines by 6.6–14.2%. Finally, our predictions also turn out to have better correlations with literature findings from PubMed. [ABSTRACT FROM AUTHOR]
Published: 2017
Full Text: View/download PDF

38. Disease Gene Classification with Metagraph Representations.

Author: Ata SK, Fang Y, Wu M, Li XL, and Xiao X
Subjects: Data Mining, Humans, Protein Interaction Mapping, Publications, Algorithms, Computational Biology methods, Disease genetics
Abstract: This chapter is based on exploiting the network-based representations of proteins, metagraphs, in protein-protein interaction network to identify candidate disease-causing proteins. Protein-protein interaction (PPI) networks are effective tools in studying the functional roles of proteins in the development of various diseases. However, they are insufficient without the support of additional biological knowledge for proteins such as their molecular functions and biological processes. To enhance PPI networks, we utilize biological properties of individual proteins as well. More specifically, we integrate keywords from UniProt database describing protein properties into the PPI network and construct a novel heterogeneous PPI-Keyword (PPIK) network consisting of both proteins and keywords. As proteins with similar functional duties or involving in the same metabolic pathway tend to have similar topological characteristics, we propose to represent them with metagraphs. Compared to the traditional network motif or subgraph, a metagraph can capture the topological arrangements through not only the protein-protein interactions but also protein-keyword associations. We feed those novel metagraph representations into classifiers for disease protein prediction and conduct our experiments on three different PPI databases. They show that the proposed method consistently increases disease protein prediction performance across various classifiers, by 15.3% in AUC on average. It outperforms the diffusion-based (e.g., RWR) and the module-based baselines by 13.8-32.9% in overall disease protein prediction. Breast cancer protein prediction outperforms RWR, PRINCE, and the module-based baselines by 6.6-14.2%. Finally, our predictions also exhibit better correlations with literature findings from PubMed database.
Published: 2018
Full Text: View/download PDF

39. Deterministic identification of specific individuals from GWAS results.

Author: Cai R, Hao Z, Winslett M, Xiao X, Yang Y, Zhang Z, and Zhou S
Subjects: Genome, Human, Genotype, Humans, Polymorphism, Single Nucleotide, Algorithms, Genetic Privacy, Genome-Wide Association Study
Abstract: Motivation: Genome-wide association studies (GWASs) are commonly applied on human genomic data to understand the causal gene combinations statistically connected to certain diseases. Patients involved in these GWASs could be re-identified when the studies release statistical information on a large number of single-nucleotide polymorphisms. Subsequent work, however, found that such privacy attacks are theoretically possible but unsuccessful and unconvincing in real settings., Results: We derive the first practical privacy attack that can successfully identify specific individuals from limited published associations from the Wellcome Trust Case Control Consortium (WTCCC) dataset. For GWAS results computed over 25 randomly selected loci, our algorithm always pinpoints at least one patient from the WTCCC dataset. Moreover, the number of re-identified patients grows rapidly with the number of published genotypes. Finally, we discuss prevention methods to disable the attack, thus providing a solution for enhancing patient privacy., Availability and Implementation: Proofs of the theorems and additional experimental results are available in the support online documents. The attack algorithm codes are publicly available at https://sites.google.com/site/zhangzhenjie/GWAS_attack.zip. The genomic dataset used in the experiments is available at http://www.wtccc.org.uk/ on request., (© The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.)
Published: 2015
Full Text: View/download PDF

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Journal

Database

Publisher

39 results on '"Xiao, Xiaokui"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources