Measuring re-identification risk using a synthetic estimator to enable data sharing.
- Source :
- PloS one [PLoS One] 2022 Jun 17; Vol. 17 (6), pp. e0269097. Date of Electronic Publication: 2022 Jun 17 (Print Publication: 2022).
- Publication Year :
- 2022
Abstract
- Background: One common way to share health data for secondary analysis while meeting increasingly strict privacy regulations is to de-identify it. To demonstrate that the risk of re-identification is acceptably low, re-identification risk metrics are used. There is a dearth of good risk estimators modeling the attack scenario where an adversary selects a record from the microdata sample and attempts to match it with individuals in the population.
Objectives: Develop an accurate risk estimator for the sample-to-population attack.
Methods: A type of estimator based on creating a synthetic variant of a population dataset was developed to estimate the re-identification risk for an adversary performing a sample-to-population attack. The accuracy of the estimator was evaluated through a simulation on four different datasets in terms of estimation error. Two estimators were considered, a Gaussian copula and a d-vine copula. They were compared against three other estimators proposed in the literature.
Results: Taking the average of the two copula estimates consistently had a median error below 0.05 across all sampling fractions and true risk values. This was significantly more accurate than existing methods. A sensitivity analysis of the estimator accuracy based on variation in input parameter accuracy provides further application guidance. The estimator was then used to assess re-identification risk and de-identify a large Ontario COVID-19 behavioral survey dataset.
Conclusions: The average of two copula estimators consistently provides the most accurate re-identification risk estimate and can serve as a good basis for managing privacy risks when data are de-identified and shared.
Competing Interests: LM is an employee of Replica Analytics Ltd. She participated in the design of the study and provided expertise on data synthesis methods for the execution of the project. KEE leads and has equity in Replica Analytics Ltd. Replica Analytics is a commercialization spin-off from the University of Ottawa. This does not alter our adherence to PLOS ONE policies on sharing data and materials.
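The sample-to-population attack named in the abstract can be illustrated with a minimal sketch. It is not the paper's copula-based estimator. It assumes the common convention that a sample record's match probability is 1/F, where F is the number of population records sharing its quasi-identifier values, and that the overall risk is the average of these probabilities over the sample. The function name `sample_to_population_risk`, the record layout (dictionaries), and the worst-case handling of unmatched records are all hypothetical choices made for illustration. In the approach the abstract describes, the `population` argument would be a synthetic variant of the population rather than the true one.

```python
from collections import Counter


def sample_to_population_risk(sample, population, quasi_identifiers):
    """Average match probability 1/F over the sample records.

    F is the size of the population equivalence class that shares the
    record's quasi-identifier values (an assumed, commonly used definition).
    """
    if not sample:
        raise ValueError("sample must contain at least one record")

    def key(record):
        # Project a record onto its quasi-identifier values.
        return tuple(record[q] for q in quasi_identifiers)

    # Count how many population records fall into each equivalence class.
    class_sizes = Counter(key(r) for r in population)

    total = 0.0
    for record in sample:
        size = class_sizes.get(key(record), 0)
        # Assumption: a sample record with no population match is treated
        # as a class of size 1, i.e. the worst case for that record.
        total += 1.0 / max(size, 1)
    return total / len(sample)
```

For example, with a population containing two records for (30, "M") and one for (40, "F"), a sample holding one record of each type gives match probabilities of 0.5 and 1.0, so the averaged risk is 0.75.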
- Subjects :
- Humans
Information Dissemination
Privacy
Probability
Risk
COVID-19 epidemiology
Details
- Language :
- English
- ISSN :
- 1932-6203
- Volume :
- 17
- Issue :
- 6
- Database :
- MEDLINE
- Journal :
- PloS one
- Publication Type :
- Academic Journal
- Accession number :
- 35714132
- Full Text :
- https://doi.org/10.1371/journal.pone.0269097