Back to Search Start Over

Differentially Private Frequent Sequence Mining.

Authors :
Xu, Shengzhi
Cheng, Xiang
Su, Sen
Xiao, Ke
Xiong, Li
Source :
IEEE Transactions on Knowledge & Data Engineering; Nov2016, Vol. 28 Issue 11, p2910-2926, 17p
Publication Year :
2016

Abstract

In this paper, we study the problem of mining frequent sequences under the rigorous differential privacy model. We explore the possibility of designing a differentially private frequent sequence mining (FSM) algorithm which can achieve both high data utility and a high degree of privacy. We found, in differentially private FSM, the amount of required noise is proportionate to the number of candidate sequences. If we could effectively prune those unpromising candidate sequences, the utility and privacy tradeoff can be significantly improved. To this end, by leveraging a sampling-based candidate pruning technique, we propose PFS $^2$<alternatives><inline-graphic xlink:type="simple" xlink:href="xu-ieq1-2601106.gif"/></alternatives>, a novel differentially private FSM algorithm. It is the first algorithm that supports the general gap-constrained FSM in the context of differential privacy. The gap constraints in FSM can be used to limit the mining results to a controlled set of frequent sequences. In our PFS$^2$ <alternatives><inline-graphic xlink:type="simple" xlink:href="xu-ieq2-2601106.gif"/></alternatives> algorithm, the core is to utilize sample databases to prune the candidate sequences generated based on the downward closure property. In particular, we use the noisy local support of candidate sequences in the sample databases to estimate which candidate sequences are potentially frequent. To improve the accuracy of such private estimations, a gap-aware sequence shrinking method is proposed to enforce the length constraint on the sample databases. Moreover, to calibrate the amount of noise required by differential privacy, a gap-aware sensitivity computation method is proposed to obtain the sensitivity of the local support computations with different gap constraints. Furthermore, to decrease the probability of misestimating frequent sequences as infrequent, a threshold relaxation method is proposed to relax the user-specified threshold for the sample databases. Through formal privacy analysis, we show that our PFS $^2$<alternatives> <inline-graphic xlink:type="simple" xlink:href="xu-ieq3-2601106.gif"/></alternatives> algorithm is $\epsilon$<alternatives><inline-graphic xlink:type="simple" xlink:href="xu-ieq4-2601106.gif"/> </alternatives>-differentially private. Extensive experiments on real datasets illustrate that our PFS$^2$<alternatives> <inline-graphic xlink:type="simple" xlink:href="xu-ieq5-2601106.gif"/></alternatives> algorithm can privately find frequent sequences with high accuracy. [ABSTRACT FROM PUBLISHER]

Details

Language :
English
ISSN :
10414347
Volume :
28
Issue :
11
Database :
Complementary Index
Journal :
IEEE Transactions on Knowledge & Data Engineering
Publication Type :
Academic Journal
Accession number :
118673725
Full Text :
https://doi.org/10.1109/TKDE.2016.2601106