Automatic de-identification of electronic medical records using token-level and character-level conditional random fields.
- Source :
- Journal of biomedical informatics [J Biomed Inform] 2015 Dec; Vol. 58 Suppl, pp. S47-S52. Date of Electronic Publication: 2015 Jun 26.
- Publication Year :
- 2015
Abstract
- De-identification, the process of identifying and removing all protected health information (PHI) from clinical data such as electronic medical records (EMRs), is a critical step in making clinical data publicly available. The 2014 i2b2 (Center of Informatics for Integrating Biology and Bedside) clinical natural language processing (NLP) challenge included a track for de-identification (track 1). In this study, we propose a hybrid system for the de-identification track that combines machine learning and rule-based approaches. In our system, PHI instances are first identified by two conditional random fields (CRFs), one token-level and one character-level, and by a rule-based classifier; the results are then merged by a set of rules. Experiments conducted on the i2b2 corpus show that our system submitted for the challenge achieves the highest micro F-scores of 94.64%, 91.24% and 91.63% under the "token", "strict" and "relaxed" criteria respectively, placing it among the top-ranked systems of the 2014 i2b2 challenge. After integrating some refined localization dictionaries, our system improves further, to F-scores of 94.83%, 91.57% and 91.95% under the "token", "strict" and "relaxed" criteria respectively. (Copyright © 2015 Elsevier Inc. All rights reserved.)
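The micro F-scores reported in the abstract pool true positives, false positives and false negatives over all records before computing precision and recall. The following is a minimal sketch of that calculation, not the challenge's official scorer: it assumes that a prediction counts as correct only when its start offset, end offset and PHI category all match a gold annotation exactly, and the function name `micro_f1` and the `(start, end, category)` tuple layout are hypothetical choices made for illustration.

```python
def micro_f1(gold_by_doc, pred_by_doc):
    """Micro-averaged precision, recall and F1 over PHI spans.

    Each argument maps a document id to an iterable of
    (start, end, category) tuples. Exact matching of all three
    fields is an assumption of this sketch.
    """
    tp = fp = fn = 0
    # Pool the counts over every document before computing the ratios;
    # this pooling is what makes the score "micro"-averaged.
    for doc_id in set(gold_by_doc) | set(pred_by_doc):
        gold = set(gold_by_doc.get(doc_id, ()))
        pred = set(pred_by_doc.get(doc_id, ()))
        tp += len(gold & pred)   # spans found in both sets
        fp += len(pred - gold)   # predicted spans with no gold match
        fn += len(gold - pred)   # gold spans that were missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data, invented purely for illustration.
gold = {
    "d1": [(0, 4, "NAME"), (10, 20, "DATE")],
    "d2": [(5, 9, "AGE")],
}
pred = {
    "d1": [(0, 4, "NAME"), (10, 19, "DATE")],  # second span ends one character early
    "d2": [(5, 9, "AGE")],
}
precision, recall, f1 = micro_f1(gold, pred)
```

In the toy data the truncated DATE span counts as both a false positive and a false negative under exact matching, so two of three predictions are correct and two of three gold spans are recovered. A more lenient criterion would change only the matching test, not the pooling.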
Details
- Language :
- English
- ISSN :
- 1532-0480
- Volume :
- 58 Suppl
- Database :
- MEDLINE
- Journal :
- Journal of biomedical informatics
- Publication Type :
- Academic Journal
- Accession number :
- 26122526
- Full Text :
- https://doi.org/10.1016/j.jbi.2015.06.009