1. Simultaneous discrimination prevention and privacy protection in data publishing and mining
- Author
-
Hajian, Sara, Domingo-Ferrer, Josep, 1965, Pedreschi, Dino, Universitat Rovira i Virgili. Departament d'Enginyeria Informàtica i Matemàtiques, Departament d'Enginyeria Informàtica i Matemàtiques, and Universitat Rovira i Virgili.
- Subjects
ComputingMethodologies_PATTERNRECOGNITION ,004 - Informàtica ,Simultaneous Discrimination Prevention and Privacy Protection in Data Publishing and Mining - Abstract
Data mining is an increasingly important technology for extracting useful knowledge hidden in large collections of data. There are, however, negative social perceptions about data mining, among which potential privacy violation and potential discrimination. The former is an unintentional or deliberate disclosure of a user pro le or activity data as part of the output of a data mining algorithm or as a result of data sharing. For this reason, privacy preserving data mining has been introduced to trade o the utility of the resulting data/models for protecting individual privacy. The latter consists of treating people unfairly on the basis of their belonging to a speci c group. Automated data collection and data mining techniques such as classi cation have paved the way to making automated decisions, like loan granting/denial, insurance premium computation, etc. If the training datasets are biased in what regards discriminatory attributes like gender, race, religion, etc., discriminatory decisions may ensue. For this reason, anti-discrimination techniques including discrimination discovery and prevention have been introduced in data mining. Discrimination can be either direct or indirect. Direct discrimination occurs when decisions are made based on discriminatory attributes. Indirect discrimination occurs when decisions are made based on non-discriminatory attributes which are strongly correlated with biased discriminatory ones. In the rst part of this thesis, we tackle discrimination prevention in data mining and propose new techniques applicable for direct or indirect discrimination prevention individually or both at the same time. We discuss how to clean training datasets and outsourced datasets in such a way that direct and/or indirect discriminatory decision rules are converted to legitimate (non-discriminatory) classi cation rules. The experimental evaluations demonstrate that the proposed techniques are e ective at removing direct and/or indirect discrimination biases in the original dataset while preserving data quality. In the second part of this thesis, by presenting samples of privacy violation and potential discrimination in data mining, we argue that privacy and discrimination risks should be tackled together. We explore the relationship between privacy preserving data mining and discrimination prevention in data mining to design holistic approaches capable of addressing both threats simultaneously during the knowledge discovery process. As part of this e ort, we have investigated for the rst time the problem of discrimination and privacy aware frequent pattern discovery, i.e. the sanitization of the collection of patterns mined from a transaction database in such a way that neither privacy-violating nor discriminatory inferences can be inferred on the released patterns. Moreover, we investigate the problem of discrimination and privacy aware data publishing, i.e. transforming the data, instead of patterns, in order to simultaneously ful ll privacy preservation and discrimination prevention. In the above cases, it turns out that the impact of our transformation on the quality of data or patterns is the same or only slightly higher than the impact of achieving just privacy preservation.
- Published
- 2013