1. Preliminary Evaluation of Fine-Tuning the OpenDeID Deidentification Pipeline Across Multi-Center Corpora.
- Author
- Gupta S, Liu J, Wong ZS, and Jonnagaddala J
- Subjects
- Humans, Confidentiality, Data Anonymization, Deep Learning, United States, Australia, Natural Language Processing, Electronic Health Records
- Abstract
- Automatic deidentification of Electronic Health Records (EHR) is a crucial step in their secondary use for biomedical research. This study evaluates a hybrid deidentification strategy intended to protect patient privacy in secondary use of EHR. Specifically, it assesses automatic deidentification with the OpenDeID pipeline across diverse corpora. Three corpora were used: the OpenDeID v2 corpus containing pathology reports from Australian hospitals, the 2014 i2b2/UTHealth deidentification corpus with clinical narratives from the USA, and the 2016 CEGS N-GRID deidentification corpus comprising psychiatric notes. The OpenDeID pipeline employs a hybrid approach based on deep learning and contextual rules. Pre-processing involved harmonizing the corpora and addressing encoding and format issues. Precision, recall, and F-measure were used to assess performance. Among the models evaluated, the Discharge Summary BioBERT model performed best. Trained on the three corpora with a total of 4,038 reports, it achieved micro-averaged F1-scores of 0.9248 and 0.9692 in the strict and relaxed settings, respectively. These results offer insight into the model's efficacy and its potential role in safeguarding patient privacy in secondary use of EHR.
- Published
- 2024
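
The abstract reports micro-averaged precision, recall, and F1 under strict and relaxed settings but does not define them. The sketch below shows one common way such scores can be computed over entity spans. It is an illustration only, not the OpenDeID evaluation code: the `Span` type, the `micro_prf` function, the toy data, and the matching rules (strict means identical offsets and label; relaxed means any character overlap with the same label) are all assumptions made here, and the paper's exact criteria may differ.

```python
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    """A hypothetical annotated entity: character offsets plus a PHI label."""
    doc_id: str
    start: int
    end: int  # exclusive
    label: str


def _matches(pred: Span, gold: Span, relaxed: bool) -> bool:
    # Both settings require the same document and the same label.
    if pred.doc_id != gold.doc_id or pred.label != gold.label:
        return False
    if relaxed:
        # Assumed relaxed rule: the two spans share at least one character.
        return pred.start < gold.end and gold.start < pred.end
    # Assumed strict rule: offsets must be identical.
    return pred.start == gold.start and pred.end == gold.end


def micro_prf(gold: List[Span], pred: List[Span],
              relaxed: bool = False) -> Tuple[float, float, float]:
    """Pool counts over all documents and labels (micro-averaging)."""
    unmatched = list(gold)
    tp = 0
    for p in pred:
        for i, g in enumerate(unmatched):
            if _matches(p, g, relaxed):
                tp += 1
                del unmatched[i]  # each gold span can be matched only once
                break
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data, invented for illustration.
gold = [Span("d1", 0, 10, "NAME"), Span("d1", 20, 30, "DATE"),
        Span("d2", 5, 12, "LOCATION")]
pred = [Span("d1", 0, 10, "NAME"), Span("d1", 22, 30, "DATE"),
        Span("d2", 40, 45, "AGE")]

strict_scores = micro_prf(gold, pred, relaxed=False)
relaxed_scores = micro_prf(gold, pred, relaxed=True)
print("strict  P/R/F1:", strict_scores)
print("relaxed P/R/F1:", relaxed_scores)
```

In this toy data the DATE prediction overlaps the gold span without matching its start offset, so it counts as a hit only in the relaxed setting. That is why a relaxed score is generally at least as high as the strict one, as with the 0.9692 and 0.9248 reported in the abstract.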