
Fast and reliable inference of semantic clusters

Authors :
Sébastien Harispe
Sylvie Ranwez
Vincent Ranwez
Nicolas Fiorini
Jacky Montmain
Laboratoire de Génie Informatique et Ingénierie de Production (LGI2P)
IMT - MINES ALES (IMT - MINES ALES)
Institut Mines-Télécom [Paris] (IMT)-Institut Mines-Télécom [Paris] (IMT)
Amélioration génétique et adaptation des plantes méditerranéennes et tropicales (UMR AGAP)
Institut national d’études supérieures agronomiques de Montpellier (Montpellier SupAgro)-Institut National de la Recherche Agronomique (INRA)-Centre de Coopération Internationale en Recherche Agronomique pour le Développement (Cirad)-Centre international d'études supérieures en sciences agronomiques (Montpellier SupAgro)
Institut national d’études supérieures agronomiques de Montpellier (Montpellier SupAgro)
AVieSan national program (French Alliance nationale pour les sciences de la Vie et de la Santé)
French Agence Nationale de la Recherche : ANR-10-BINF-01 Ancestrome
Centre de Coopération Internationale en Recherche Agronomique pour le Développement (Cirad)-Institut National de la Recherche Agronomique (INRA)-Centre international d'études supérieures en sciences agronomiques (Montpellier SupAgro)-Institut national d’études supérieures agronomiques de Montpellier (Montpellier SupAgro)
Institut national d'enseignement supérieur pour l'agriculture, l'alimentation et l'environnement (Institut Agro)-Institut national d'enseignement supérieur pour l'agriculture, l'alimentation et l'environnement (Institut Agro)
Institut national d'enseignement supérieur pour l'agriculture, l'alimentation et l'environnement (Institut Agro)
ANR-10-BINF-0001,ANCESTROME,Approche de phylogénie intégrative pour la reconstruction de génomes ancestraux(2010)
Source :
Knowledge-Based Systems, Elsevier, 2016, 111, pp. 133-143. ⟨10.1016/j.knosys.2016.08.008⟩
Publication Year :
2016
Publisher :
Elsevier BV, 2016.

Abstract

AGAP: GE2pop team.
Document indexing includes, but is not limited to, summarizing document contents with a small set of keywords or concepts from a knowledge base. Such a compact representation of document contents eases their use in numerous processes such as content-based information retrieval, corpus mining and classification. Considerable effort has been devoted in recent years to (partly) automating semantic indexing, i.e. associating concepts with documents, leading to the availability of large corpora of semantically indexed documents. In this paper we introduce a method that hierarchically clusters documents based on their semantic indices while providing the proposed clusters with semantic labels. Our approach follows a neighbor joining strategy. Starting from a distance matrix reflecting the semantic similarity of documents, it iteratively selects the two closest clusters and merges them into a larger one. The similarity matrix is then updated. This is usually done by combining the similarities of the two merged clusters, e.g. using the average similarity. We propose in this paper an alternative approach where the new cluster is first semantically annotated and the similarity matrix is then updated using the semantic similarity of this new annotation with those of the remaining clusters. The hierarchical clustering so obtained is a binary tree with branch lengths that convey the semantic distances between clusters. It is then post-processed by using the branch lengths to keep only the most relevant clusters. Such a tool has numerous practical applications, as it automates the organization of documents into meaningful clusters (e.g. papers indexed by MeSH terms, bookmarks or pictures indexed by WordNet), which is a tedious everyday task for many people. We assess the quality of the proposed methods using a specific benchmark of annotated clusters of bookmarks that were built manually. Each dataset of this benchmark has been clustered independently by several users. Remarkably, the clusters automatically built by our method are congruent with the clusters proposed by experts. All resources of this work, including source code, jar file, benchmark files and results, are available at this address: http://sc.nicolasfiorini.info.
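
The merge-then-annotate update described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `jaccard`, `annotate` and `cluster` are hypothetical, Jaccard overlap of concept sets stands in for a real semantic similarity measure, the labelling rule is a deliberately naive placeholder, and the pair selection is a plain greedy closest-pair choice rather than a full neighbor joining criterion. Similarities are recomputed from the current annotations at each step instead of being stored in a matrix, which has the same effect for this toy case.

```python
from itertools import combinations


def jaccard(a, b):
    """Stand-in similarity between two concept sets (assumption, not the paper's measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def annotate(a, b):
    """Hypothetical labelling rule: keep shared concepts, or all concepts if none are shared."""
    shared = a & b
    return shared if shared else a | b


def cluster(docs):
    """Greedy agglomerative clustering in which every merged cluster gets its own annotation.

    docs maps a document id to a set of concept labels.
    Returns a nested tuple tree: (left, right, annotation, similarity_at_merge).
    """
    # Active clusters: tree node (or document id) -> current annotation.
    active = {doc_id: frozenset(concepts) for doc_id, concepts in docs.items()}
    while len(active) > 1:
        # Pick the two clusters whose annotations are most similar.
        left, right = max(
            combinations(active, 2),
            key=lambda pair: jaccard(active[pair[0]], active[pair[1]]),
        )
        ann_left, ann_right = active.pop(left), active.pop(right)
        similarity = jaccard(ann_left, ann_right)
        # Annotate the new cluster first; later comparisons use this annotation
        # rather than an average of the two merged clusters' similarities.
        annotation = annotate(ann_left, ann_right)
        active[(left, right, annotation, similarity)] = annotation
    return next(iter(active))


docs = {
    "d1": {"cat", "pet"},
    "d2": {"cat", "pet", "food"},
    "d3": {"python", "code"},
    "d4": {"python", "code", "test"},
}
tree = cluster(docs)
```

In this toy input, `d1` and `d2` are merged first and labelled with their shared concepts, as are `d3` and `d4`; the two resulting clusters share no concept and are joined last at similarity 0. The merge similarity stored in each node plays the role of a branch length that a post-processing step could threshold to keep only the most relevant clusters.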

Details

ISSN :
0950-7051 and 1872-7409
Volume :
111
Database :
OpenAIRE
Journal :
Knowledge-Based Systems
Accession number :
edsair.doi.dedup.....731f205b09a41828dbce74be39f02f64
Full Text :
https://doi.org/10.1016/j.knosys.2016.08.008