1. Unsupervised Extra Trees: a stochastic approach to compute similarities in heterogeneous data
- Author
- Kevin Dalleau, Miguel Couceiro, Malika Smaïl-Tabbone
- Affiliations: Computational Algorithms for Protein Structures and Interactions (CAPSID); Knowledge representation, reasoning (ORPAILLEUR); Inria Nancy - Grand Est; Institut National de Recherche en Informatique et en Automatique (Inria); Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA); Department of Complex Systems, Artificial Intelligence & Robotics (LORIA - AIS); Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD); Centre National de la Recherche Scientifique (CNRS); Université de Lorraine (UL)
- Funding: Kevin Dalleau's PhD is funded by the RHU FIGHT-HF (ANR-15-RHUS-0004, "Combattre l'insuffisance cardiaque", 2015) and the Region Grand Est (France).
- Subjects
0301 basic medicine, Hierarchical agglomerative clustering, Unsupervised classification, Computer science, Decision tree, Monotonic function, Similarity measure, Clustering, 03 medical and health sciences, 0302 clinical medicine, [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], Robustness (computer science), Cluster (physics), Preprocessor, Cluster analysis, business.industry, Applied Mathematics, Pattern recognition, Computer Science Applications, 030104 developmental biology, Computational Theory and Mathematics, 030220 oncology & carcinogenesis, Modeling and Simulation, Artificial intelligence, Extremely randomized trees, business, Information Systems
- Abstract
In this paper we present a method to compute similarities on unlabeled data, based on extremely randomized trees. The main idea of our method, Unsupervised Extremely Randomized Trees (UET), is to randomly split the data in an iterative fashion until a stopping criterion is met, and to compute a similarity based on the co-occurrence of samples in the leaves of each generated tree. Using a tree-based approach to compute similarities is interesting, as the inherent [...] We evaluate our method on synthetic and real-world datasets by comparing the mean similarities between samples with the same label and the mean similarities between samples with different labels. These metrics are similar to intracluster and intercluster similarities, and are used to assess the computed similarities instead of a clustering algorithm's results. Our empirical study shows that the method effectively gives distinct similarity values between samples belonging to different clusters, and gives indiscernible values when there is no cluster structure. We also assess some interesting properties such as invariance under monotone transformations of variables and robustness to correlated variables and noise. Finally, we performed hierarchical agglomerative clustering on synthetic and real-world homogeneous and heterogeneous datasets using UET versus standard similarity measures. Our experiments show that the algorithm outperforms existing methods in some cases, and can reduce the amount of preprocessing needed with many real-world datasets.
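The idea described in the abstract (random splits until a stopping criterion, then similarity from leaf co-occurrence) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `n_min` stopping rule (stop when a node holds fewer than `n_min` samples), the uniform choice of cut point, and the restriction to numeric features are all assumptions made for illustration. The last helper mirrors the evaluation described above by comparing the mean similarity of same-label pairs with that of different-label pairs.

```python
import random
from itertools import combinations

def build_leaves(X, indices, n_min, rng):
    """Split `indices` at random, recursively; return the leaves as lists of sample indices."""
    if len(indices) < n_min:  # assumed stopping criterion: node is small enough
        return [indices]
    features = list(range(len(X[0])))
    rng.shuffle(features)  # try features in random order
    for f in features:
        values = [X[i][f] for i in indices]
        lo, hi = min(values), max(values)
        if lo < hi:  # skip features that are constant in this node
            cut = rng.uniform(lo, hi)  # random cut point, no label used
            left = [i for i in indices if X[i][f] < cut]
            right = [i for i in indices if X[i][f] >= cut]
            if left and right:
                return (build_leaves(X, left, n_min, rng)
                        + build_leaves(X, right, n_min, rng))
    return [indices]  # no usable split found: keep the node as a leaf

def uet_similarity(X, n_trees=50, n_min=5, seed=0):
    """Similarity of i and j = fraction of trees in which they share a leaf."""
    rng = random.Random(seed)
    n = len(X)
    counts = [[0] * n for _ in range(n)]
    for _ in range(n_trees):
        for leaf in build_leaves(X, list(range(n)), n_min, rng):
            for i in leaf:
                for j in leaf:
                    counts[i][j] += 1
    return [[c / n_trees for c in row] for row in counts]

def mean_within_between(S, labels):
    """Mean similarity over same-label pairs and over different-label pairs."""
    within, between = [], []
    for i, j in combinations(range(len(labels)), 2):
        (within if labels[i] == labels[j] else between).append(S[i][j])
    return sum(within) / len(within), sum(between) / len(between)

# Toy data (made up for this sketch): two well-separated 2-D groups of 10 points.
data_rng = random.Random(1)
X = ([[data_rng.gauss(0, 0.1), data_rng.gauss(0, 0.1)] for _ in range(10)]
     + [[data_rng.gauss(5, 0.1), data_rng.gauss(5, 0.1)] for _ in range(10)])
labels = [0] * 10 + [1] * 10

S = uet_similarity(X, n_trees=50, n_min=5)
within, between = mean_within_between(S, labels)
print(f"mean within-label similarity:  {within:.3f}")
print(f"mean between-label similarity: {between:.3f}")
```

Because each cut compares values only by their order within a feature, a sketch of this kind is unaffected by the scale of individual features, which is in the spirit of the invariance property the abstract mentions. The resulting matrix `S` could then be passed to a clustering routine that accepts precomputed similarities or dissimilarities; how the paper does this is not stated in the abstract.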
- Published
- 2020