Generating high-quality data abstractions from scanned clinical records: text-mining-assisted extraction of endometrial carcinoma pathology features as proof of principle
- Source :
- BMJ Open, Vol 10, Iss 6 (2020), BMJ Open
- Publication Year :
- 2020
- Publisher :
- BMJ Publishing Group, 2020.
Abstract
- Objective: Medical research studies often rely on the manual collection of data from scanned typewritten clinical records, which can be laborious, time-consuming and error-prone because each clinical record must be reviewed individually. We aimed to use text mining to assist with the extraction of clinical features from complex text-based scanned pathology records for medical research studies. Design: Text mining performance was measured by extracting and annotating three distinct pathological features from scanned photocopies of endometrial carcinoma clinical pathology reports, and comparing the results with manually abstracted terms. Inclusion and exclusion keyword trigger terms to capture leiomyomas, endometriosis and adenomyosis were provided based on expert knowledge. Terms were expanded with character variations based on common optical character recognition (OCR) error patterns, as well as with negation phrases found in sample reports. The approach was evaluated on an unseen test set of 1293 scanned pathology reports originating from laboratories across Australia. Setting: Scanned typewritten pathology reports for women aged 18–79 years with newly diagnosed endometrial cancer (2005–2007) in Australia. Results: High concordance with final abstracted codes was observed for identifying the presence of the three pathology features (94%–98% F-measure). The approach was more consistent and reliable than manual abstraction, identifying 3%–14% additional feature instances. Conclusion: Keyword trigger-based automation with OCR error correction and negation handling proved not only rapid and convenient, but also provided consistent and reliable data abstractions from scanned clinical records. In conjunction with manual review, it can assist in the generation of high-quality data abstractions for medical research studies.
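The keyword-trigger approach described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the trigger terms, the OCR confusion table, the negation cues and all function names below are hypothetical stand-ins chosen for illustration, since the record does not list the actual terms or error patterns used in the study. The sketch expands each trigger into an OCR-tolerant regular expression, skips mentions that sit next to a negation cue in the same sentence, and includes the standard F-measure formula for comparing results against manual abstraction.

```python
import re

# Hypothetical OCR confusion table (assumption, not from the study):
# each character maps to the strings an OCR engine might emit for it.
OCR_CONFUSIONS = {
    "o": ["o", "0"],
    "i": ["i", "l", "1"],
    "l": ["l", "1", "i"],
    "m": ["m", "rn"],
}

# Hypothetical inclusion triggers for the three features named in the abstract.
TRIGGERS = {
    "leiomyoma": ["leiomyoma", "fibroid"],
    "endometriosis": ["endometriosis"],
    "adenomyosis": ["adenomyosis"],
}

# Hypothetical negation cues appearing before or after a trigger.
PRE_NEGATION = re.compile(r"\b(?:no|without|negative for)\b", re.IGNORECASE)
POST_NEGATION = re.compile(r"\bnot (?:seen|identified|present)\b", re.IGNORECASE)


def ocr_tolerant_pattern(term):
    """Build a regex that also matches common OCR misreadings of `term`."""
    parts = []
    for ch in term.lower():
        alternatives = OCR_CONFUSIONS.get(ch)
        if alternatives:
            parts.append("(?:" + "|".join(re.escape(a) for a in alternatives) + ")")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


PATTERNS = {
    feature: [ocr_tolerant_pattern(t) for t in terms]
    for feature, terms in TRIGGERS.items()
}


def detect_features(report):
    """Return {feature: bool} for non-negated trigger mentions in `report`."""
    found = {feature: False for feature in PATTERNS}
    # Crude sentence split; negation is only checked within one sentence.
    for sentence in re.split(r"[.;]", report):
        for feature, patterns in PATTERNS.items():
            for pattern in patterns:
                for match in pattern.finditer(sentence):
                    before = sentence[:match.start()]
                    after = sentence[match.end():]
                    if PRE_NEGATION.search(before) or POST_NEGATION.search(after):
                        continue  # negated mention, do not count it
                    found[feature] = True
    return found


def f_measure(tp, fp, fn):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


if __name__ == "__main__":
    text = ("Sections show a 1eiomyoma in the myometrium. "
            "No evidence of aden0myosis. Endometriosis is not seen.")
    print(detect_features(text))
```

In this toy report, the misread "1eiomyoma" is still matched, while the two other features are mentioned only in negated form and are therefore not flagged. A real pipeline would need a curated confusion table, OCR-tolerant negation cues and manual review of borderline cases, as the abstract notes.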
- Subjects :
- Adult
Pathology
medicine.medical_specialty
Adolescent
lcsh:Medicine
Health Informatics
computer.software_genre
Health informatics
Proof of Concept Study
Text mining
information technology
Medicine
Data Mining
Electronic Health Records
Humans
Aged
Clinical pathology
business.industry
lcsh:R
General Medicine
Optical character recognition
Middle Aged
Automation
Endometrial Neoplasms
Feature (computer vision)
Data quality
Test set
oncology
pathology
Female
business
computer
Details
- Language :
- English
- ISSN :
- 20446055
- Volume :
- 10
- Issue :
- 6
- Database :
- OpenAIRE
- Journal :
- BMJ Open
- Accession number :
- edsair.doi.dedup.....0e6a403e92e91ab505044ae9ea9ea639