Learning interesting attributes for automated data categorization

Authors :
Koninika Pal
Sebastian Michel
Source :
SSDBM
Publication Year :
2018
Publisher :
ACM, 2018.

Abstract

This work proposes and evaluates a novel approach to determining interesting attributes in order to categorize entities accordingly. Once identified, such categories are valuable for constraining (filtering) a user's current view to subsets of entities. We show how to train a classifier that can tell whether or not a categorical attribute can act as a constraint, in the sense of human-perceived interestingness. The training data is drawn from Wikipedia tables, treating the presence or absence of a table as an indication of whether the attribute used as a filter constraint is reasonable. For learning the classification model, we review four well-known statistical measures (features) for categorical attributes: entropy, unalikeability, peculiarity, and coverage. We additionally propose three new statistical measures to capture the distribution of data, tailored to our main objective. The learned model is evaluated by relevance assessments obtained through a user study, which reflect the applicability of the approach as a whole and, further, demonstrate the superiority of the proposed diversity measures over existing measures like information entropy.
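
To make two of the reviewed measures concrete, the following is a minimal sketch of how entropy and unalikeability might be computed for a single categorical attribute. It assumes the textbook definitions (Shannon entropy in bits, and unalikeability as one minus the sum of squared category proportions); the abstract does not state the exact formulations used in the paper, and the function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of a categorical attribute.

    Assumption: standard definition H = -sum(p_i * log2(p_i)); the
    paper may normalise or define it differently.
    """
    if not values:
        raise ValueError("values must be non-empty")
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def unalikeability(values):
    """Coefficient of unalikeability: 1 - sum(p_i ** 2).

    Assumption: the common proportion-based form, i.e. the chance that
    two observations drawn with replacement fall in different categories.
    """
    if not values:
        raise ValueError("values must be non-empty")
    n = len(values)
    counts = Counter(values)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


# Hypothetical attribute with two equally frequent categories.
genre = ["drama", "drama", "comedy", "comedy"]
print(shannon_entropy(genre), unalikeability(genre))
```

For the two-category example above, both measures reach their two-category maximum (1 bit of entropy, unalikeability 0.5), while an attribute with a single repeated value scores 0 on both. Under the approach described in the abstract, values like these would serve as features for the classifier rather than as a final interestingness score.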

Details

Database :
OpenAIRE
Journal :
Proceedings of the 30th International Conference on Scientific and Statistical Database Management
Accession number :
edsair.doi...........500de45d744ae040c61757f9a3a6c83e
Full Text :
https://doi.org/10.1145/3221269.3223035