Learning interesting attributes for automated data categorization
- Source :
- SSDBM
- Publication Year :
- 2018
- Publisher :
- ACM, 2018.
- Abstract :
- This work proposes and evaluates a novel approach to determining interesting attributes by which entities can be categorized. Once identified, such categories are valuable because they allow a user's current view to be constrained (filtered) to subsets of entities. We show how to train a classifier that can tell whether a categorical attribute can act as a constraint, in the sense of human-perceived interestingness. The training data is harvested from Wikipedia tables: the presence or absence of a table is treated as an indication of whether the attribute used as a filter constraint is reasonable. For learning the classification model, we review four well-known statistical measures (features) for categorical attributes: entropy, unalikeability, peculiarity, and coverage. We additionally propose three new statistical measures that capture the distribution of the data and are tailored to our main objective. The learned model is evaluated against relevance assessments obtained through a user study. The evaluation reflects the applicability of the approach as a whole and, further, demonstrates the superiority of the proposed diversity measures over existing measures such as information entropy.
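As an illustration of two of the measures named in the abstract, the following minimal sketch computes Shannon entropy and the coefficient of unalikeability for the values of one categorical attribute. It uses the common textbook definitions (entropy in bits, unalikeability as one minus the sum of squared value proportions), which may differ in detail from the formulations used in the paper. The function names and the sample `genres` column are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the value distribution of one attribute."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def unalikeability(values):
    """Coefficient of unalikeability: 1 - sum of squared value proportions.

    Assumption: this is the common 'with replacement' form; the paper may
    use a different variant.
    """
    counts = Counter(values)
    n = len(values)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


# Hypothetical attribute column: proportions are 0.5, 0.25, 0.25.
genres = ["rock", "rock", "jazz", "pop"]
print(shannon_entropy(genres))   # 1.5
print(unalikeability(genres))    # 0.625
```

Both measures are 0 when every entity shares the same value and grow as values become more evenly spread, which is why they are natural candidate features for judging whether an attribute splits entities into useful categories.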
- Subjects :
- Training set
Computer science
02 engineering and technology
Filter (signal processing)
Machine learning
Automated data
Categorization
020204 information systems
0202 electrical engineering, electronic engineering, information engineering
020201 artificial intelligence & image processing
Artificial intelligence
Categorical variable
Classifier (UML)
Details
- Database :
- OpenAIRE
- Journal :
- Proceedings of the 30th International Conference on Scientific and Statistical Database Management
- Accession number :
- edsair.doi...........500de45d744ae040c61757f9a3a6c83e
- Full Text :
- https://doi.org/10.1145/3221269.3223035