Fast and reliable inference of semantic clusters
- Source :
- Knowledge-Based Systems, Elsevier, 2016, 111, pp. 133-143. ⟨10.1016/j.knosys.2016.08.008⟩
- Publication Year :
- 2016
- Publisher :
- Elsevier BV, 2016.
Abstract
- AGAP: GE2pop team. Document indexing includes, but is not limited to, summarizing document contents with a small set of keywords or concepts from a knowledge base. Such a compact representation of document contents eases their use in numerous processes such as content-based information retrieval, corpus mining and classification. An important effort has been devoted in recent years to (partly) automating semantic indexing, i.e. associating concepts with documents, leading to the availability of large corpora of semantically indexed documents. In this paper we introduce a method that hierarchically clusters documents based on their semantic indices while providing the proposed clusters with semantic labels. Our approach follows a neighbor joining strategy. Starting from a distance matrix reflecting the semantic similarity of documents, it iteratively selects the two closest clusters and merges them into a larger one. The similarity matrix is then updated. This is usually done by combining the similarities of the two merged clusters, e.g. using the average similarity. We propose in this paper an alternative approach in which the new cluster is first semantically annotated and the similarity matrix is then updated using the semantic similarity of this new annotation with those of the remaining clusters. The hierarchical clustering so obtained is a binary tree with branch lengths that convey the semantic distances between clusters. It is then post-processed by using the branch lengths to keep only the most relevant clusters. Such a tool has numerous practical applications, as it automates the organization of documents into meaningful clusters (e.g. papers indexed by MeSH terms, bookmarks or pictures indexed by WordNet), a tedious everyday task for many people. We assess the quality of the proposed methods using a specific benchmark of annotated clusters of bookmarks that were built manually. Each dataset of this benchmark has been clustered independently by several users. Remarkably, the clusters automatically built by our method are congruent with the clusters proposed by experts. All resources of this work, including source code, jar file, benchmark files and results, are available at this address: http://sc.nicolasfiorini.info .
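The merge-annotate-update loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy taxonomy `PARENT`, the annotation rule (keep the concepts shared by both merged clusters, ancestors included) and the Jaccard-based `distance` are all assumptions chosen for brevity, and the post-processing step that prunes clusters by branch length is left out.

```python
from itertools import combinations

# Hypothetical toy taxonomy (child -> parent); "root" has no parent.
PARENT = {
    "animal": "root", "vehicle": "root",
    "dog": "animal", "cat": "animal",
    "car": "vehicle", "bike": "vehicle",
}

def closure(concepts):
    """Return the given concepts together with all of their ancestors."""
    out = set()
    for concept in concepts:
        while concept is not None:
            out.add(concept)
            concept = PARENT.get(concept)
    return frozenset(out)

def distance(a, b):
    """Jaccard distance between two ancestor-closed annotations (an assumption)."""
    return 1.0 - len(a & b) / len(a | b)

def cluster(docs):
    """Merge the two closest clusters until one remains.

    Returns the final tree as nested tuples and a log of
    (merged pair, distance at merge time, annotation of the new cluster).
    """
    # Map each active cluster (a document name or a nested tuple) to its annotation.
    active = {name: closure(concepts) for name, concepts in docs.items()}
    merges = []
    while len(active) > 1:
        # Select the two closest clusters according to their annotations.
        x, y = min(combinations(active, 2),
                   key=lambda pair: distance(active[pair[0]], active[pair[1]]))
        d = distance(active[x], active[y])
        # Annotate the new cluster first (assumed rule: shared concepts only).
        label = active[x] & active[y]
        del active[x], active[y]
        # Later distances to this cluster are computed from its annotation,
        # not by averaging the distances of the two merged clusters.
        active[(x, y)] = label
        merges.append(((x, y), d, label))
    (tree,) = active
    return tree, merges

docs = {"d1": {"dog"}, "d2": {"cat"}, "d3": {"car"}, "d4": {"bike"}}
tree, merges = cluster(docs)
for pair, d, label in merges:
    print(pair, round(d, 3), sorted(label))
```

In this toy run the two animal documents and the two vehicle documents are grouped before the final merge, and every internal node carries a label taken from the taxonomy. A real setting would replace `distance` and the annotation rule with measures suited to the knowledge base in use.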
- Subjects :
- 0301 basic medicine
Information Systems and Management
semantic indexing
Computer science
indexing method
WordNet
Semantic data model
complexity analysis
[MATH.MATH-GR]Mathematics [math]/Group Theory [math.GR]
Management Information Systems
03 medical and health sciences
Annotation
artificial intelligence
Semantic similarity
Artificial Intelligence
knowledge base
Cluster analysis
automation
cluster labeling
Information retrieval
computer data
business.industry
Search engine indexing
Similarity matrix
semantic data
Hierarchical clustering
030104 developmental biology
Knowledge base
Distance matrix
Cluster labeling
knowledge base
automation
business
Software
clustering
neighbor joining
semantic data
Details
- ISSN :
- 0950-7051 and 1872-7409
- Volume :
- 111
- Database :
- OpenAIRE
- Journal :
- Knowledge-Based Systems
- Accession number :
- edsair.doi.dedup.....731f205b09a41828dbce74be39f02f64
- Full Text :
- https://doi.org/10.1016/j.knosys.2016.08.008