1. Incorporating Data Context to Cost-Effectively Automate End-to-End Data Wrangling
- Author
Alvaro A. A. Fernandes, Norman W. Paton, Martin Koehler, Alex Bogatu, Edward Abel, Leonid Libkin, John A. Keane, Nikolaos Konstantinou, Lacramioara Mazilu, and Cristina Civili
- Subjects
Information management; Complex data type; Data, context and interaction; mapping generation; Information Systems and Management; business.industry; Computer science; Big data; 02 engineering and technology; Automation; Data profiling; data transformation; End-to-end principle; 020204 information systems; Scalability; 0202 electrical engineering, electronic engineering, information engineering; data matching; 020201 artificial intelligence & image processing; source selection; data wrangling; business; Software engineering; data cleaning; Information Systems
- Abstract
The process of preparing potentially large and complex data sets for further analysis or manual examination is often called data wrangling. In classical warehousing environments, the steps in such a process are carried out using Extract-Transform-Load platforms, with significant manual involvement in specifying, configuring or tuning many of them. In typical big data applications, we need to ensure that all wrangling steps, including web extraction, selection, integration and cleaning, benefit from automation wherever possible. Towards this goal, in this paper we: (i) introduce a notion of data context, which associates portions of a target schema with extensional data of types that are commonly available; (ii) define a scalable methodology to bootstrap an end-to-end data wrangling process based on data profiling; (iii) describe how data context is used to inform automation in several steps within wrangling, specifically matching, value format transformation, data repair, and mapping generation and selection, to optimise the accuracy, consistency and relevance of the result; and (iv) evaluate the approach with real estate data and financial data, showing substantial improvements in the results of automated wrangling.
- Published
- 2021