
Data quality assessment and anomaly detection via map/reduce and linked data: a case study in the medical domain

Authors:
Andrew Stephen McGough
Georgios Theodoropoulos
Laura Moss
Ibad Kureshi
John Brennan
David Corsar
Stephen Bonner
Grigoris Antoniou
Source:
Ho, Howard; Ooi, Beng Chin; Zaki, Mohammed J.; Hu, Xiaohua; Haas, Laura; Kumar, Vipin; Rachuri, Sudarsan; Yu, Shipeng; Hui-I Hsiao, Morris; Li, Jian; Luo, Feng; Pyne, Saumyadipta; Ogan, Kemafor (Eds.). (2015). Proceedings, 2015 IEEE International Conference on Big Data (IEEE BigData), Oct 29 - Nov 01, 2015, Santa Clara, CA, USA. IEEE, pp. 737-746.
Publication Year:
2015
Publisher:
IEEE, 2015.

Abstract

Recent technological advances in modern healthcare have led to the ability to collect a vast wealth of patient monitoring data. This data can be utilised for patient diagnosis, but it also holds potential for use within medical research. However, these datasets often contain errors which limit their value to medical research, with one study finding error rates ranging from 2.3% to 26.9% in a selection of medical databases. Previous methods for automatically assessing data quality normally rely on threshold rules, which are often unable to correctly identify errors, as further complex domain knowledge is required. To combat this, a semantic web based framework has previously been developed to assess the quality of medical data. However, early work, based solely on traditional semantic web technologies, revealed that these technologies either fail to scale, or scale inefficiently, to the vast volumes of medical data. In this paper we present a new method for storing and querying medical RDF datasets using Hadoop Map/Reduce. This approach exploits the inherent parallelism found within RDF datasets and queries, allowing us to scale with both dataset and system size. Unlike previous solutions, this framework uses highly optimised SPARQL joining strategies, intelligent data caching and a super-query, enabling eight distinct SPARQL lookups, comprising over eighty distinct joins, to be completed in only two Map/Reduce iterations. Results are presented comparing against both Jena and a previous Hadoop implementation, demonstrating the superior performance of the new methodology: the new method is shown to be five times faster than Jena and twice as fast as the previous approach.
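The abstract does not spell out the joining strategies, so the following is only a minimal, self-contained Python sketch of the general idea it describes: evaluating several SPARQL triple patterns over RDF triples in a single map/reduce pass by keying the map output on a shared join variable. The triple patterns, sample data and the ?reading join variable are all hypothetical assumptions for illustration; the paper's optimised strategies, caching and super-query are not reproduced here.

from collections import defaultdict

# Two triple patterns from a hypothetical SPARQL basic graph pattern:
#   ?reading :hasValue ?v .    ?reading :recordedFor ?patient .
PATTERNS = [
    ("?reading", ":hasValue", "?v"),
    ("?reading", ":recordedFor", "?patient"),
]

def match(pattern, triple):
    """Return variable bindings if the triple matches the pattern, else None."""
    bindings = {}
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            bindings[p] = t
        elif p != t:
            return None
    return bindings

def mapper(triple):
    """Emit (join-key, (pattern-id, bindings)) for each pattern the triple matches.
    Keying on the shared ?reading variable lets one reduce step perform the join."""
    for pid, pattern in enumerate(PATTERNS):
        b = match(pattern, triple)
        if b is not None:
            yield b["?reading"], (pid, b)

def reducer(key, values):
    """Join bindings from different patterns that share the same ?reading."""
    by_pattern = defaultdict(list)
    for pid, b in values:
        by_pattern[pid].append(b)
    # Only readings matched by every pattern produce a join result.
    if len(by_pattern) == len(PATTERNS):
        for b0 in by_pattern[0]:
            for b1 in by_pattern[1]:
                yield {**b0, **b1}

# Drive the map and reduce phases locally over a toy N-Triples-like dataset.
triples = [
    (":r1", ":hasValue", "98.6"),
    (":r1", ":recordedFor", ":patient42"),
    (":r2", ":hasValue", "-1.0"),   # orphaned reading: never joined
]

shuffled = defaultdict(list)
for t in triples:
    for key, value in mapper(t):
        shuffled[key].append(value)

for key, values in shuffled.items():
    for row in reducer(key, values):
        print(row)   # {'?reading': ':r1', '?v': '98.6', '?patient': ':patient42'}

Because every triple is mapped independently and each reduce key is processed independently, both phases parallelise across the dataset, which is the scaling property the abstract attributes to RDF data and queries; a real deployment would run the mapper and reducer under Hadoop rather than the local driver shown here.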

Details

Database:
OpenAIRE
Journal:
Proceedings, 2015 IEEE International Conference on Big Data (IEEE BigData), Santa Clara, CA, USA, IEEE, pp. 737-746
Accession number:
edsair.doi.dedup.....f9bc23f832ca324a3ecbb5f68a6ccf36