Learning interesting attributes for automated data categorization

Authors :: Koninika Pal
Sebastian Michel
Source :: SSDBM
Publication Year :: 2018
Publisher :: ACM
Abstract: This work proposes and evaluates a novel approach to determining interesting attributes in order to categorize entities accordingly. Once identified, such categories are of immense value because they allow a user's current view to be constrained (filtered) to subsets of entities. We show how to train a classifier that can tell whether or not a categorical attribute can act as a constraint, in the sense of human-perceived interestingness. The training data is harvested from Wikipedia tables, treating the presence or absence of a table as an indication of whether the attribute used as a filter constraint is reasonable. For learning the classification model, we review four well-known statistical measures (features) for categorical attributes: entropy, unalikeability, peculiarity, and coverage. We additionally propose three new statistical measures that capture the distribution of data and are tailored to our main objective. The learned model is evaluated against relevance assessments obtained through a user study; the results reflect the applicability of the approach as a whole and demonstrate the superiority of the proposed diversity measures over existing measures such as information entropy.

Subjects :: Training set
Computer science
02 engineering and technology
Filter (signal processing)
Machine learning
Automated data
Categorization
020204 information systems
0202 electrical engineering, electronic engineering, information engineering
020201 artificial intelligence & image processing
Artificial intelligence
Categorical variable
Classifier (UML)

Database :: OpenAIRE
Journal :: Proceedings of the 30th International Conference on Scientific and Statistical Database Management
Accession number :: edsair.doi...........500de45d744ae040c61757f9a3a6c83e
Full Text :: https://doi.org/10.1145/3221269.3223035