1. Frequent patterns in ETL workflows: An empirical approach
- Author
- Alberto Abelló, Maik Thiele, Wolfgang Lehner, Vasileios Theodorou; Universitat Politècnica de Catalunya. Departament d'Enginyeria de Serveis i Sistemes d'Informació; Universitat Politècnica de Catalunya. MPI - Modelització i Processament de la Informació; Universitat Politècnica de Catalunya. inSSIDE - integrated Software, Service, Information and Data Engineering
- Subjects
Information Systems and Management, Expert systems (Computer science), Computer science, Computer science::Information systems [UPC subject areas], Maintainability, 02 engineering and technology, Knowledge representation (Information theory), Empirical, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Quality (business), Graph matching, Representation (mathematics), Patterns, Abstraction (linguistics), InformationSystems_DATABASEMANAGEMENT, Automation, ETL, Identification (information), Workflow, TheoryofComputation_MATHEMATICALLOGICANDFORMALLANGUAGES, Business intelligence, 020201 artificial intelligence & image processing, Data mining
- Abstract
The complexity of Business Intelligence activities has driven the proposal of several approaches for the effective modeling of Extract-Transform-Load (ETL) processes, based on the conceptual abstraction of their operations. Apart from fostering automation and maintainability, such modeling also provides the building blocks to identify and represent frequently recurring patterns. Despite some existing work on classifying ETL components and functionality archetypes, the systematic mining of such patterns, and their connection to quality attributes such as performance, has not yet been addressed. In this work, we propose a methodology for the identification of ETL structural patterns. We logically model ETL workflows as labeled graphs and employ graph algorithms to identify candidate patterns and to recognize them in different workflows. We showcase our approach through a use case applied to ETL processes implemented from the TPC-DI specification, and we present the mined ETL patterns. By decomposing ETL processes into the identified patterns, our approach provides a stepping stone for the automatic translation of ETL logical models to their conceptual representation and for the generation of fine-grained cost models at the granularity of patterns.
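The idea of modeling workflows as labeled graphs and mining recurring structures can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a deliberately simplified notion of pattern (a chain of operation labels along a directed path) rather than general subgraph matching, and the workflows, operation labels, and function names (`make_workflow`, `label_paths`, `frequent_patterns`) are hypothetical.

```python
from collections import Counter


def make_workflow(labels, edges):
    """Build a labeled directed graph: node id -> operation label, plus successor lists."""
    succ = {node: [] for node in labels}
    for src, dst in edges:
        succ[src].append(dst)
    return {"labels": labels, "succ": succ}


def label_paths(workflow, length):
    """Return the set of label sequences found along directed paths of `length` nodes."""
    labels, succ = workflow["labels"], workflow["succ"]
    found = set()

    def extend(node, path):
        path = path + (labels[node],)
        if len(path) == length:
            found.add(path)
            return
        for nxt in succ[node]:
            extend(nxt, path)

    for start in labels:
        extend(start, ())
    return found


def frequent_patterns(workflows, length, min_support):
    """Count in how many workflows each label path occurs and keep the frequent ones."""
    support = Counter()
    for wf in workflows:
        # Each workflow contributes at most once per pattern (set semantics).
        support.update(label_paths(wf, length))
    return {p: c for p, c in support.items() if c >= min_support}


# Two made-up ETL workflows (illustrative only, not taken from TPC-DI).
wf_a = make_workflow(
    {1: "Extract", 2: "Filter", 3: "Join", 4: "Load"},
    [(1, 2), (2, 3), (3, 4)],
)
wf_b = make_workflow(
    {1: "Extract", 2: "Filter", 3: "Aggregate", 4: "Load", 5: "Extract"},
    [(1, 2), (2, 3), (3, 4), (5, 3)],
)

patterns = frequent_patterns([wf_a, wf_b], length=2, min_support=2)
print(patterns)
```

In this toy input only the `Extract` to `Filter` chain appears in both workflows, so it is the only pattern that meets the support threshold. A fuller treatment would replace the path enumeration with subgraph isomorphism over the labeled graphs.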
- Published
- 2017