Incrementally updating unary inclusion dependencies in dynamic data
- Source :
- Distributed and Parallel Databases. 37:133-176
- Publication Year :
- 2018
- Publisher :
- Springer Science and Business Media LLC, 2018.
Abstract
- Inclusion dependencies form one of the most fundamental classes of integrity constraints. Their importance in classical data management is reinforced by modern applications like data profiling, data cleaning, entity resolution, and schema matching. Their discovery in an unknown dataset is at the core of any data-analysis effort. Therefore, several research approaches have focused on their efficient discovery in a given, static dataset. However, none of these approaches are appropriate for application on dynamic datasets. In these cases, discovery techniques should be able to efficiently update the inclusion dependencies after an update in the dataset, without reprocessing the entire dataset. We present the first approach for incrementally updating the unary inclusion dependencies. In particular, our approach is based on the concept of attribute clustering, from which the unary inclusion dependencies are efficiently derivable. We incrementally update the clusters after each update of the dataset. An update of the clusters does not need access to the dataset because of special data structures designed to efficiently support the updating process. We performed an exhaustive analysis of our approach by applying it to large datasets with several hundred attributes and more than 116.2 million tuples. The results showed that the incremental discovery significantly reduces the runtime needed by the static discovery. This reduction in the runtime is up to 99.9996% for both the insertion and the deletion.
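The abstract's core idea, deriving unary inclusion dependencies from attribute clusters that are maintained incrementally, can be illustrated with a minimal sketch. It is not the authors' algorithm or their data structures, which the abstract does not specify. The class name `UnaryIndIndex`, its methods, and the per-value counter are all assumptions made for illustration. Here a cluster is taken to be the set of attributes that share a given value, and A ⊆ B is reported when every cluster containing A also contains B.

```python
from collections import defaultdict


class UnaryIndIndex:
    """Hypothetical illustration only; not the data structures from the paper."""

    def __init__(self, attributes):
        self.attributes = set(attributes)
        # value -> {attribute: number of tuples holding that value in that attribute}
        self.counts = defaultdict(lambda: defaultdict(int))

    def insert(self, row):
        """Register one tuple, given as a dict mapping attribute -> value."""
        for attr, value in row.items():
            self.counts[value][attr] += 1

    def delete(self, row):
        """Unregister one tuple; assumes the tuple was inserted earlier."""
        for attr, value in row.items():
            per_attr = self.counts[value]
            per_attr[attr] -= 1
            if per_attr[attr] == 0:
                del per_attr[attr]      # value no longer occurs in this attribute
            if not per_attr:
                del self.counts[value]  # value no longer occurs anywhere

    def clusters(self):
        """Distinct attribute sets that share at least one value."""
        return {frozenset(per_attr) for per_attr in self.counts.values()}

    def inclusion_dependencies(self):
        """Return pairs (a, b) meaning a is included in b, read off the clusters."""
        # Start from all candidates and keep only attributes present in
        # every cluster that contains the left-hand side.
        rhs = {a: self.attributes - {a} for a in self.attributes}
        for cluster in self.clusters():
            for a in cluster:
                rhs[a] &= cluster
        return {(a, b) for a, bs in rhs.items() for b in bs}


index = UnaryIndIndex(["A", "B"])
index.insert({"A": 1, "B": 1})
index.insert({"A": 2, "B": 3})
print(index.inclusion_dependencies())
index.delete({"A": 2, "B": 3})
print(index.inclusion_dependencies())
```

In this sketch the occurrence counts let an insertion or deletion adjust the clusters without rescanning the stored tuples, which mirrors the property the abstract claims for its own, unspecified, data structures. An attribute with no values is reported as included in every other attribute, which follows from the usual subset semantics.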
- Subjects :
- Information Systems and Management
Unary operation
Computer science
Dynamic data
0102 computer and information sciences
02 engineering and technology
Data structure
01 natural sciences
Schema matching
Data profiling
010201 computation theory & mathematics
Hardware and Architecture
020204 information systems
Data integrity
0202 electrical engineering, electronic engineering, information engineering
Data mining
Tuple
Cluster analysis
Software
Information Systems
Details
- ISSN :
- 1573-7578 and 0926-8782
- Volume :
- 37
- Database :
- OpenAIRE
- Journal :
- Distributed and Parallel Databases
- Accession number :
- edsair.doi...........e04486aa73ffb8871346b0d349b45766
- Full Text :
- https://doi.org/10.1007/s10619-018-7233-5