Self-supervised data lakes discovery through unsupervised metadata-driven weighted similarity.

Authors :
Putrama, I Made
Martinek, Peter
Source :
Information Sciences. Mar 2024, Vol. 662, pN.PAG-N.PAG. 1p.
Publication Year :
2024

Abstract

Data engineers invest significant effort in the early stages of data analysis, including identifying relevant datasets in large and complex data lakes. Data retrieval efficiency becomes an urgent concern as data volumes continue to increase with structures that vary in size and complexity. This paper presents a novel strategy for accelerating data discovery in data lakes. Our approach integrates self-supervised techniques and weighted similarity estimation for efficient dataset classification, facilitating faster search and retrieval. By extracting meta-feature characteristics, our approach improves data traceability in data lakes through clustering, resulting in significant improvements in modularity, showing an efficiency gain of 69.8%. Regarding dataset search through classification, it consistently achieves AUC-ROC scores exceeding 0.85, indicating strong performance in class differentiation. Finally, our method reduces the overall execution time more than twofold and shows promising applications for addressing real-world challenges in various data lake domains. [ABSTRACT FROM AUTHOR]

Details

Language :
English
ISSN :
0020-0255
Volume :
662
Database :
Academic Search Index
Journal :
Information Sciences
Publication Type :
Periodical
Accession number :
175456719
Full Text :
https://doi.org/10.1016/j.ins.2024.120242