1. HInT: Hybrid and Incremental Type Discovery for Large RDF Data Sources
- Author
Georgia Troullinou, Kenza Kellou-Menouer, Zoubida Kedad, Dimitris Plexousakis, Nikolaos Kardoulakis, Haridimos Kondylakis. Affiliations listed: Données et algorithmes pour une ville intelligente et durable (DAVID), Université de Versailles Saint-Quentin-en-Yvelines (UVSQ). Funding: work reported in this paper has been partially supported by the Hellenic Foundation for Research and Innovation (H.F.R.I., ΕΛ.ΙΔ.Ε.Κ) under the '2nd Call for H.F.R.I. Research Projects to support Post-Doctoral Researchers' (iQARuS Project No 1147).
- Subjects
Exploit, LSH, Process (engineering), Computer science, media_common.quotation_subject, 02 engineering and technology, Machine learning, computer.software_genre, RDF, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Quality (business), [INFO]Computer Science [cs], media_common, Complement (set theory), business.industry, Hybrid type discovery, Linked data, computer.file_format, Automatic summarization, Schema (genetic algorithms), 020201 artificial intelligence & image processing, Incrementality, Artificial intelligence, business, computer
- Abstract
The rapid explosion of linked data has resulted in many weakly structured and incomplete data sources, where typing information might be missing. Yet type information is essential for a number of tasks such as query answering, integration, summarization and partitioning. Existing approaches for type discovery either completely ignore the type declarations available in the dataset (implicit type discovery approaches) or rely only on existing types in order to complement them (explicit type enrichment approaches). Implicit type discovery approaches are based on instance grouping, which requires an exhaustive comparison between the instances; this process is expensive and not incremental. Explicit type enrichment approaches, on the other hand, cannot identify new types and cannot process data sources that have little or no schema information. In this paper, we present HInT, the first incremental and hybrid type discovery system for RDF datasets, enabling type discovery in datasets where type declarations are missing. To achieve this goal, we incrementally identify the patterns of the various instances, then index and group them to identify the types. During the processing of an instance, our approach exploits its type information, if available, to improve the quality of the discovered types by guiding the classification of the new instance into the correct group and by refining the groups already built. We show analytically and experimentally that our approach is more efficient than competitors from both worlds, implicit type discovery and explicit type enrichment, while outperforming them in quality in most cases.
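The incremental, hybrid grouping described in the abstract can be illustrated with a minimal sketch. This is not HInT's published algorithm: the class `IncrementalTypeDiscovery`, the MinHash banding scheme, the Jaccard threshold, and the rule that a declared type routes an instance directly to a group are all assumptions made for illustration. Each instance is reduced to its pattern (the set of properties it uses), indexed with a simple LSH scheme, and assigned to a group without comparing it against every earlier instance.

```python
import hashlib


def _hash(seed, value):
    # Deterministic 64-bit hash of (seed, value); stands in for one MinHash function.
    data = f"{seed}:{value}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(pattern, num_hashes):
    # MinHash signature of a non-empty set of property names.
    return tuple(min(_hash(seed, p) for p in pattern) for seed in range(num_hashes))


def jaccard(a, b):
    return len(a & b) / len(a | b)


class IncrementalTypeDiscovery:
    """Hypothetical sketch: one pass over instances, one group per discovered type."""

    def __init__(self, num_hashes=16, bands=8, threshold=0.5):
        self.num_hashes = num_hashes
        self.bands = bands
        self.rows = num_hashes // bands
        self.threshold = threshold
        self.groups = []      # each group: representative pattern, members, declared types
        self.buckets = {}     # (band index, band signature) -> set of group ids
        self.type_index = {}  # declared type -> group id (assumed "hybrid" shortcut)

    def _keys(self, pattern):
        sig = minhash(pattern, self.num_hashes)
        return [(i, sig[i * self.rows:(i + 1) * self.rows]) for i in range(self.bands)]

    def add(self, instance, properties, declared_types=()):
        pattern = frozenset(properties)
        if not pattern:
            raise ValueError("an instance needs at least one property")
        declared = set(declared_types)

        # Hybrid step (assumption): a known declared type decides the group directly.
        gid = next((self.type_index[t] for t in sorted(declared) if t in self.type_index), None)

        if gid is None:
            # Implicit step: only groups sharing an LSH bucket are compared.
            keys = self._keys(pattern)
            candidates = set()
            for key in keys:
                candidates |= self.buckets.get(key, set())
            best_sim = 0.0
            for cand in sorted(candidates):
                group = self.groups[cand]
                if declared and group["types"] and not (declared & group["types"]):
                    continue  # conflicting declarations: do not merge
                sim = jaccard(pattern, group["pattern"])
                if sim >= self.threshold and sim > best_sim:
                    gid, best_sim = cand, sim
            if gid is None:
                # No similar group found: open a new one and index its pattern.
                gid = len(self.groups)
                self.groups.append({"pattern": pattern, "members": [], "types": set()})
                for key in keys:
                    self.buckets.setdefault(key, set()).add(gid)

        group = self.groups[gid]
        group["members"].append(instance)
        group["types"] |= declared
        for t in declared:
            self.type_index.setdefault(t, gid)
        return gid
```

A short usage example, with made-up instance and property names:

```python
discovery = IncrementalTypeDiscovery()
discovery.add("ex:alice", {"foaf:name", "foaf:age"}, {"foaf:Person"})
discovery.add("ex:bob", {"foaf:name", "foaf:age"})        # untyped, grouped by pattern
discovery.add("ex:book1", {"dc:title", "ex:isbn"})        # disjoint pattern, new group
discovery.add("ex:carol", {"foaf:mbox"}, {"foaf:Person"}) # grouped by declared type
```

The sketch keeps the first pattern seen as each group's representative; a fuller design would also refine that representative as members arrive.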
- Published
- 2021