Descriptor: "Cluster labeling" / Topic: cluster analysis - Searchworks@Jio Institute Digital Library Search Results

Your search for "Cluster labeling" returned 81 results

1. MULTITOPIC TEXT CLUSTERING AND CLUSTER LABELING USING CONTEXTUALIZED WORD EMBEDDINGS

Author: Z. V. Ostapiuk and T. O. Korotyeyeva
Subjects: Word embedding, Computer science, business.industry, Keyword extraction, 02 engineering and technology, General Medicine, 010501 environmental sciences, Document clustering, computer.software_genre, 01 natural sciences, Semantic similarity, Cluster labeling, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Relevance (information retrieval), Artificial intelligence, Cluster analysis, business, computer, Natural language processing, 0105 earth and related environmental sciences, Interpretability
Abstract: Context. In the current information era, the problem of analyzing large volumes of unlabeled textual data and grouping it by the semantic similarity between texts is emerging. This raises the need for robust text analysis algorithms, namely clustering and extraction of key data from texts. Despite recent progress in the field of natural language processing, new neural methods lack interpretability when used for unsupervised tasks, whereas traditional distributed semantics and word counting techniques tend to disregard contextual information. Objective. The objective of the study is to develop interpretable text clustering and cluster labeling methods, based on semantic similarity, that require no additional training on the user’s dataset. Method. To approach the task of text clustering, we incorporate deep contextualized word embeddings and analyze their evolution through the layers of pretrained transformer models. Given word embeddings, we look for similar tokens across the whole corpus and form topics that are present in multiple sentences. We merge topics so that sentences that share many topics are assigned to one cluster. Because one sentence can contain several topics, it can be present in more than one cluster simultaneously. Similarly, to generate labels for an existing cluster, we use token embeddings to order tokens by how descriptive they are of the cluster. To do so, we propose a novel metric, the token rank measure, and evaluate two other metrics. Results. A new unsupervised text clustering approach was described and implemented. It is capable of assigning a text to different clusters based on semantic similarity to other texts in the group. A keyword extraction approach was developed and applied in both text clustering and cluster labeling tasks. The obtained clusters are annotated and can be interpreted through the terms that formed the clusters. Conclusions. Evaluation on different datasets demonstrated the applicability, relevance, and interpretability of the obtained results. The advantages of and possible improvements to the proposed methods were described. Recommendations for using the methods were provided, as well as possible modifications.
Published: 2020

2. Exploiting Gaussian Mixture Model Clustering for Full-Duplex Transceiver Design

Author: Lin Zhang, Ying-Chang Liang, and Jie Chen
Subjects: Computer science, Maximum likelihood, Detector, Estimator, 020302 automobile design & engineering, 020206 networking & telecommunications, 02 engineering and technology, Mixture model, Least squares, Least mean squares filter, 0203 mechanical engineering, Single antenna interference cancellation, Cluster labeling, 0202 electrical engineering, electronic engineering, information engineering, Bit error rate, Electrical and Electronic Engineering, Cluster analysis, Algorithm, Communication channel, Data transmission
Abstract: In conventional full-duplex communications, dedicated symbols are transmitted to estimate both the self-interference channel and the desired signal channel in order to perform self-interference cancellation (SIC) and to coherently detect the desired signal. However, inaccurate channel estimation will produce residual self-interference and degrade the detection performance. In this paper, we exploit Gaussian mixture model (GMM) clustering to design a full-duplex transceiver (FDT) that is able to detect the desired signal without requiring digital-domain channel estimation and SIC. The frame structure of the designed FDT contains two successive phases: a labeling phase and a data transmission phase. In particular, the designed FDT performs cluster labeling in the labeling phase and performs GMM clustering based on an expectation-maximization (EM) algorithm in the data transmission phase. Furthermore, the detection performance, computational complexity, and convergence performance of the designed FDT are analyzed theoretically. Finally, simulation results show that the bit error rate (BER) of the designed FDT is close to that of an FDT with a maximum likelihood (ML) detector and perfect channel knowledge, and is superior to the BER of an FDT with an ML detector and a least squares (LS) or least mean squares (LMS) channel estimator.
Published: 2019
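
Illustrative sketch (not from the paper): the abstract describes a labeling phase followed by GMM clustering with expectation-maximization. The Python below shows generic one-dimensional EM for a Gaussian mixture, seeded with initial means that stand in for whatever the labeling phase would supply. The function names, the scalar samples, and the variance floor are assumptions made for illustration; the paper's transceiver works on received signal samples and its exact procedure is not reproduced here.

```python
import math

def gaussian(x, mean, var):
    """Density of a 1-D normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(samples, initial_means, iterations=20, var_floor=1e-6):
    """Fit a 1-D Gaussian mixture by EM, starting from the given means."""
    k = len(initial_means)
    means = list(initial_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in samples:
            p = [weights[j] * gaussian(x, means[j], variances[j]) for j in range(k)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weight, mean, and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, samples)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, samples)) / nj
            variances[j] = max(var_floor, spread)
            weights[j] = nj / len(samples)
    # Hard decision: pick the most responsible component for each sample.
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return means, labels

# Hypothetical received samples around two symbol values, -1 and +1.
samples = [-1.1, -0.9, -1.0, 1.0, 0.9, 1.1]
means, labels = em_gmm_1d(samples, initial_means=[-1.0, 1.0])
```

In this toy example the seeded means decide which component corresponds to which symbol, so the cluster indices keep their meaning after EM converges.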

3. Topic Models and Fusion Methods: a Union to Improve Text Clustering and Cluster Labeling

Author: Hosna Omidvarborna, Mohsen Pourvali, and Salvatore Orlando
Subjects: Statistics and Probability, Topic model, Topic structure, Computer Networks and Communications, Computer science, Text Mining, media_common.quotation_subject, Cluster Labeling, text mining, 02 engineering and technology, lcsh:Technology, document clustering, Text mining, Artificial Intelligence, Document Clustering, 0202 electrical engineering, electronic engineering, information engineering, Quality (business), Cluster analysis, media_common, Document Enriching, cluster labeling, Information retrieval, Settore INF/01 - Informatica, business.industry, lcsh:T, 05 social sciences, IJIMAI, 020207 software engineering, document enriching, Document clustering, Sensor fusion, Computer Science Applications, Signal Processing, Cluster labeling, ComputingMethodologies_DOCUMENTANDTEXTPROCESSING, Computer Vision and Pattern Recognition, 0509 other social sciences, 050904 information & library sciences, business
Abstract: Topic modeling algorithms are statistical methods that aim to discover the topics running through text documents. Topic models are popular in machine learning and text mining because they can infer the latent topic structure of a corpus. In this paper, we present a document enriching approach, using state-of-the-art topic models and data fusion methods, to enrich the documents of a collection with the aim of improving the quality of text clustering and cluster labeling. We propose a bi-vector space model in which every document of the corpus is represented by two vectors: one is generated by the fusion-based topic modeling approach, and the other is simply the traditional vector model. Our experiments on various datasets show that using a combination of topic modeling and fusion methods to create the documents’ vectors can significantly improve the quality of the document clustering results.
Published: 2019

4. Using Regression Error Analysis and Feature Selection to Automatic Cluster Labeling

Author: Bruno Vicente Alves de Lima, Rodrigo Veras, Sidiney Souza Araujo, Lúcia Emília Soares Silva, and Vinicius Ponte Machado
Subjects: Computer science, business.industry, Dimensionality reduction, Feature selection, Pattern recognition, Regression analysis, Fuzzy logic, Regression, ComputingMethodologies_PATTERNRECOGNITION, Cluster labeling, Unsupervised learning, Artificial intelligence, Cluster analysis, business
Abstract: Cluster Labeling Models apply Artificial Intelligence techniques to extract the key features of clustered data to provide a tool for clustering interpretation. For this purpose, we applied different techniques such as Classification, Regression, Fuzzy Logic, and Data Discretization to identify the attributes essential for cluster formation and the ranges of values associated with them. This paper presents an improvement to the Regression-based Cluster Labeling Model that adds an attribute selection step, based on the coefficient of determination obtained by regression models, in order to make the model applicable to large datasets. The model was tested on the literature datasets Iris, Breast Cancer, and Parkinson’s Disease to evaluate labeling performance at different dimensionalities. The experimental results showed that the model is sound, providing specific labels for each cluster that represent between 99% and 100% of the elements of the clusters for the datasets used.
Published: 2021

5. Research on Clustering Identification Method Based on Path Sampling in Support Vector Clustering

Author: Cai Jiliang, Zeng Huiyong, Bai Juan, Zong Binfeng, Wang Shiqiang, and Gao Caiyun
Subjects: Computer science, business.industry, Complete graph, Pattern recognition, Support vector machine, ComputingMethodologies_PATTERNRECOGNITION, Similarity (network science), Path (graph theory), Cluster labeling, Graph (abstract data type), Artificial intelligence, Gradient descent, Cluster analysis, business
Abstract: Support vector clustering is a nonparametric, unsupervised clustering algorithm based on nonlinear mapping. The algorithm obtains the set of isolines containing the data through mapping and reflection operations. To complete the clustering, this set of isolines must then be processed. This paper studies cluster labeling (CL) methods based on path sampling, including the complete graph (CG) method, the support vector graph (SVG) method, the similarity graph (PG) method, and the gradient descent (GD) method.
Published: 2020

6. Frequent Itemsets Methods for Text Clustering

Author: Chama El Saili, Soukaina Fatimi, and Larbi Alaoui
Subjects: Identification (information), ComputingMethodologies_PATTERNRECOGNITION, Information retrieval, law, Computer science, Dimensionality reduction, Cluster labeling, k-means clustering, Hypertext, Document clustering, Cluster analysis, law.invention, Task (project management)
Abstract: Text clustering is a crucial application of data mining. It can be used to structure hypertext documents or large sets of text. Many research works have explored document clustering as a technique for improving search, information retrieval, document browsing, and automatic topic identification, as well as for the basic task of clustering itself. Researchers face major challenges, especially when working with large-scale datasets, such as very high dimensionality and cluster labeling. To tackle these challenges, a number of techniques using frequent itemset mining methods in text clustering have been proposed. In this paper, we review such techniques while highlighting their strengths and limitations. Based on the analysis of the associated methodologies, we also propose a general framework for the task of text clustering using frequent itemset mining algorithms.
Published: 2020
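
Illustrative sketch (not from the paper): frequent itemset methods treat each document as a set of terms and look for term sets shared by many documents. The Python below counts such sets by brute-force enumeration, which is only practical for tiny inputs; it is not Apriori, FP-growth, or any specific technique reviewed in the paper. The function name, the `min_support` and `max_size` parameters, and the sample documents are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(docs, min_support, max_size=2):
    """Return term sets that occur in at least min_support documents."""
    counts = Counter()
    for doc in docs:
        terms = sorted(set(doc.lower().split()))
        # Count every term combination up to max_size once per document.
        for size in range(1, max_size + 1):
            counts.update(combinations(terms, size))
    return {items: n for items, n in counts.items() if n >= min_support}

docs = [
    "cluster labeling text",
    "text cluster analysis",
    "cluster labeling survey",
]
frequent = frequent_itemsets(docs, min_support=2)
```

Documents that share a frequent term set such as ("cluster", "labeling") are natural candidates for the same cluster, and the term set itself can double as a candidate label.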

7. MVStream: Multiview Data Stream Clustering

Author: Ling Huang, Hongyang Chao, Chang-Dong Wang, and Philip S. Yu
Subjects: Data stream, Concept drift, Computer Networks and Communications, Computer science, 02 engineering and technology, computer.software_genre, Computer Science Applications, Data modeling, Support vector machine, Data stream clustering, Artificial Intelligence, Cluster labeling, 0202 electrical engineering, electronic engineering, information engineering, Cluster (physics), 020201 artificial intelligence & image processing, Data mining, Cluster analysis, computer, Software
Abstract: This article studies a new problem of data stream clustering, namely, multiview data stream (MVStream) clustering. Although many data stream clustering algorithms have been developed, they are restricted to single-view streaming data, and clustering MVStreams remains largely unsolved. In addition to the many issues encountered by conventional single-view data stream clustering, such as capturing cluster evolution and discovering clusters of arbitrary shapes under limited computational resources, the main challenge of MVStream clustering lies in integrating information from multiple views in a streaming manner while simultaneously abstracting summary statistics from the integrated features. In this article, we propose, for the first time, an MVStream clustering algorithm. The main idea is to design a multiview support vector domain description (MVSVDD) model, by which the information from multiple insufficient views can be integrated, and the output support vectors (SVs) are used to abstract the summary statistics of the historical multiview data objects. Based on the MVSVDD model, a new multiview cluster labeling method is designed, whereby clusters of arbitrary shapes can be discovered for each view. By tracking the cluster labels of SVs in each view, the cluster evolution associated with concept drift can be captured. Since the SVs make up only a small portion of the data objects, the proposed MVStream algorithm is quite efficient with limited computational resources. Extensive experiments are conducted to demonstrate the effectiveness and efficiency of the proposed method.
Published: 2019

8. Enhancing Software Requirements Cluster Labeling Using Wikipedia

Author: Sandeep Reddivari
Subjects: Information retrieval, Requirements engineering, Computer science, business.industry, 020207 software engineering, Context (language use), 02 engineering and technology, Tracing, ComputingMethodologies_PATTERNRECOGNITION, Software, Requirement prioritization, Cluster labeling, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Software requirements, business, Cluster analysis
Abstract: Clustering plays an important role in reusable requirements retrieval from the ever-growing software project repositories. The literature on requirements cluster labeling is still emerging. Researchers have investigated clustering to support various software engineering activities such as requirements prioritization, feature identification, automated tracing, and code navigation. The primary task in analyzing the clustering results is to "label" the clusters by means of some representative words to summarize and comprehend the requirements data. Despite the development of automatic cluster labeling techniques for software requirements, very little is understood about enhancing the cluster labels using external knowledge sources such as Wikipedia. In this paper, we review the literature on enhancing cluster labeling, present a framework for requirements cluster labeling and conduct an experiment to evaluate how the Wikipedia-based enhancement performs in labeling requirements clusters. The results show that Wikipedia-based labeling outperforms traditional Information Retrieval (IR) techniques. Our work sheds light on improving automated ways to support information reuse and management in the context of requirements engineering (RE).
Published: 2019

9. A Distributed and Approximated Nearest Neighbors Algorithm for an Efficient Large Scale Mean Shift Clustering

Author: Hanane Azzag, Mustapha Lebbah, Gaël Beck, Tarn Duong, and Christophe Cérin
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Networks and Communications, Computer science, Machine Learning (stat.ML), 02 engineering and technology, Theoretical Computer Science, Locality-sensitive hashing, Machine Learning (cs.LG), Statistics - Machine Learning, Artificial Intelligence, Spark (mathematics), 0202 electrical engineering, electronic engineering, information engineering, Mean-shift, Cluster analysis, Time complexity, 020206 networking & telecommunications, ComputingMethodologies_PATTERNRECOGNITION, Artificial Intelligence (cs.AI), Computer Science - Distributed, Parallel, and Cluster Computing, Hardware and Architecture, Cluster labeling, Unsupervised learning, 020201 artificial intelligence & image processing, Distributed, Parallel, and Cluster Computing (cs.DC), Algorithm, Software
Abstract: In this paper we target the class of modal clustering methods, where clusters are defined in terms of the local modes of the probability density function which generates the data. The most well-known modal clustering method is k-means clustering. Mean Shift clustering is a generalization of k-means clustering which computes arbitrarily shaped clusters, defined as the basins of attraction to the local modes created by the density gradient ascent paths. Despite its potential, the Mean Shift approach is a computationally expensive method for unsupervised learning. Thus, we introduce two contributions aiming to provide clustering algorithms with linear time complexity, as opposed to the quadratic time complexity of exact Mean Shift clustering. First, we propose a scalable procedure to approximate the density gradient ascent. Second, we present a scalable cluster labeling technique. Both propositions are based on Locality Sensitive Hashing (LSH) to approximate nearest neighbors. These two techniques may be used for moderate sized datasets. Furthermore, we show that using our proposed approximations of the density gradient ascent as a pre-processing step in other clustering methods can also improve dedicated classification metrics. For the latter, a distributed implementation, written for the Spark/Scala ecosystem, is proposed. For all these considered clustering methods, we present experimental results illustrating their labeling accuracy and their potential to solve concrete problems. Comment: Algorithms are available at https://github.com/Clustering4Ever/Clustering4Ever
Published: 2019
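
Illustrative sketch (not from the paper): the two stages named in the abstract, gradient ascent toward a density mode and cluster labeling, can be shown with exact Mean Shift on one-dimensional data. The Python below uses a flat kernel and compares every point with every other point, so it has the quadratic cost the paper sets out to avoid; it does not use Locality Sensitive Hashing. The function names, `bandwidth`, and `tolerance` are assumptions made for illustration.

```python
def mean_shift_1d(points, bandwidth, iterations=50):
    """Move each point uphill to the mean of the data within bandwidth."""
    modes = list(points)
    for _ in range(iterations):
        shifted = []
        for m in modes:
            # Flat kernel: every original point within bandwidth counts equally.
            window = [p for p in points if abs(p - m) <= bandwidth]
            shifted.append(sum(window) / len(window))
        modes = shifted
    return modes

def label_by_mode(modes, tolerance):
    """Give the same cluster label to points whose modes nearly coincide."""
    centers, labels = [], []
    for m in modes:
        for index, center in enumerate(centers):
            if abs(m - center) <= tolerance:
                labels.append(index)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return labels

points = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9]
modes = mean_shift_1d(points, bandwidth=1.0)
labels = label_by_mode(modes, tolerance=0.1)
```

Each point ends at the mode of its basin of attraction, and the labeling step only has to group points by the mode they reached.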

10. Unsupervised method of clustering and labeling of the online product based on reviews

Author: Masum Billah, Md. Akterujjaman, and Mohammad Nuruzzaman Bhuiyan
Subjects: ComputingMethodologies_PATTERNRECOGNITION, Computer science, Modeling and Simulation, Product (mathematics), Cluster labeling, Cluster (physics), Sentence clustering, Data mining, Cluster analysis, computer.software_genre, computer, Computer Science Applications
Abstract: This paper presents an unsupervised approach that clusters product reviews collected from Amazon and then generates labels for each cluster. Instead of using complete reviews, this paper splits each review into sentences and uses all sentences from the reviews as inputs for clustering. Hierarchical Agglomerative Clustering (HAC) is used to cluster the sentences. The cluster labeling approaches are also unsupervised. For labeling, three different methods are used to find a limited number of essential words for each cluster. The extracted essential words are used to construct phrases, and the constructed phrases are used as labels for each cluster. This paper compares the results of the labeling methods with baseline labeling. In the evaluation, all the labeling methods outperform the baseline method. The aim of this research is cluster labeling that produces a set of labels describing a cluster's content and distinguishing it from other clusters' labels.
Published: 2021
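
Illustrative sketch (not from the paper): Hierarchical Agglomerative Clustering starts with one cluster per sentence and repeatedly merges the two most similar clusters. The Python below uses Jaccard similarity over word sets with average linkage and stops at a similarity threshold. The similarity measure, the linkage, the threshold, and the sample sentences are all assumptions made for illustration; the abstract does not state which ones the authors used.

```python
def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    return len(a & b) / len(a | b)

def hac(sentences, threshold):
    """Average-link agglomerative clustering over sentence indices."""
    token_sets = [set(s.lower().split()) for s in sentences]
    clusters = [[i] for i in range(len(sentences))]

    def average_link(c1, c2):
        total = sum(jaccard(token_sets[i], token_sets[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        best, i, j = max(
            (average_link(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best < threshold:
            break  # remaining clusters are too dissimilar to merge
        clusters[i] = clusters[i] + clusters.pop(j)
    return clusters

sentences = [
    "battery life is great",
    "great battery life",
    "screen is too dim",
    "the screen is dim",
]
clusters = hac(sentences, threshold=0.5)
```

With these four hypothetical review sentences, the battery sentences and the screen sentences end up in separate clusters, which a labeling step could then summarize with words such as "battery" or "screen".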

11. Fast and reliable inference of semantic clusters

Author: Sébastien Harispe, Sylvie Ranwez, Vincent Ranwez, Nicolas Fiorini, and Jacky Montmain
Affiliations: Laboratoire de Génie Informatique et Ingénierie de Production (LGI2P), IMT - MINES ALES, Institut Mines-Télécom [Paris] (IMT); Amélioration génétique et adaptation des plantes méditerranéennes et tropicales (UMR AGAP), Institut national d’études supérieures agronomiques de Montpellier (Montpellier SupAgro), Institut National de la Recherche Agronomique (INRA), Centre de Coopération Internationale en Recherche Agronomique pour le Développement (Cirad), Centre international d'études supérieures en sciences agronomiques (Montpellier SupAgro); Institut national d'enseignement supérieur pour l'agriculture, l'alimentation et l'environnement (Institut Agro)
Funding: AVieSan national program (French Alliance nationale pour les sciences de la Vie et de la Santé); French Agence Nationale de la Recherche, ANR-10-BINF-01 Ancestrome (ANR-10-BINF-0001, ANCESTROME, Approche de phylogénie intégrative pour la reconstruction de génomes ancestraux, 2010)
Subjects: 0301 basic medicine, Information Systems and Management, semantic indexing, Computer science, méthode d'indexation, WordNet, Semantic data model, complexity analysis, [MATH.MATH-GR]Mathematics [math]/Group Theory [math.GR], Management Information Systems, 03 medical and health sciences, Annotation, intelligence artificielle, Semantic similarity, Artificial Intelligence, base de connaissances, Cluster analysis, automation, cluster labeling, Information retrieval, donnée informatique, business.industry, Search engine indexing, Similarity matrix, donnée sémantique, Hierarchical clustering, 030104 developmental biology, Knowledge base, Distance matrix, Cluster labeling, knowledge base, automatisation, business, Software, clustering, neighbor joining, semantic data
Abstract: Document indexing includes, but is not limited to, summarizing document contents with a small set of keywords or concepts of a knowledge base. Such a compact representation of document contents eases their use in numerous processes such as content-based information retrieval, corpus mining, and classification. An important effort has been devoted in recent years to (partly) automating semantic indexing, i.e. associating concepts with documents, leading to the availability of large corpora of semantically indexed documents. In this paper we introduce a method that hierarchically clusters documents based on their semantic indices while providing the proposed clusters with semantic labels. Our approach follows a neighbor joining strategy. Starting from a distance matrix reflecting the semantic similarity of documents, it iteratively selects the two closest clusters and merges them into a larger one. The similarity matrix is then updated. This is usually done by combining the similarities of the two merged clusters, e.g. using the average similarity. We propose in this paper an alternative approach where the new cluster is first semantically annotated and the similarity matrix is then updated using the semantic similarity of this new annotation with those of the remaining clusters. The hierarchical clustering so obtained is a binary tree with branch lengths that convey the semantic distances of clusters. It is then post-processed by using the branch lengths to keep only the most relevant clusters. Such a tool has numerous practical applications, as it automates the organization of documents into meaningful clusters (e.g. papers indexed by MeSH terms, bookmarks or pictures indexed by WordNet), which is a tedious everyday task for many people. We assess the quality of the proposed methods using a specific benchmark of annotated clusters of bookmarks that were built manually. Each dataset of this benchmark has been clustered independently by several users. Remarkably, the clusters automatically built by our method are congruent with the clusters proposed by experts. All resources of this work, including source code, jar file, benchmark files and results, are available at this address: http://sc.nicolasfiorini.info .
Published: 2016

12. Unsupervised Bug Report Categorization Using Clustering and Labeling Algorithm

Author: Hideaki Hata, Akito Monden, Nachai Limsettho, and Kenichi Matsumoto
Subjects: Topic model, Information retrieval, Computer Networks and Communications, Process (engineering), Computer science, Supervised learning, 020207 software engineering, 02 engineering and technology, 01 natural sciences, Computer Graphics and Computer-Aided Design, 010104 statistics & probability, ComputingMethodologies_PATTERNRECOGNITION, Categorization, Artificial Intelligence, Cluster labeling, Chunking (psychology), 0202 electrical engineering, electronic engineering, information engineering, 0101 mathematics, Cluster analysis, Algorithm, Software, Natural language
Abstract: Bug reports are one of the most crucial information sources for software engineering, offering answers to many questions. Yet, getting these answers is not always easy; the information in bug reports is often implicit and some processing is required to extract the meaning of these reports. Most research in this area employs a supervised learning approach to classify bug reports so that the required types of reports can be identified. However, this approach often requires an immense amount of time and effort, resources that are already scarce in many projects. We aim to develop an automated framework that can categorize bug reports according to their grammatical structure, without the need for labeled data. Our framework categorizes bug reports according to their text similarity using topic modeling and a clustering algorithm. Each group of bug reports is labeled with our new cluster labeling algorithm, specifically made for clusters in the topic space. Our framework is highly customizable, with a modular approach and options to incorporate available background knowledge to improve its performance, while our cluster labeling approach makes use of natural language processing (NLP) chunking to create the representative labels. Our experimental results demonstrate that the performance of our unsupervised framework is comparable to that of a supervised learning one. We also show that our labeling process is capable of labeling each cluster with phrases that are representative of that cluster's characteristics. Our framework can be used to automatically categorize incoming bug reports without any prior knowledge, as an automated labeling suggestion system, or as a tool for obtaining knowledge about the structure of the bug report repository.
Published: 2016

13. Improving spherical k-means for document clustering: Fast initialization, sparse centroid projection, and efficient cluster labeling

Author: Hyun-Joong Kim, Sungzoon Cho, and Han Kyul Kim
Subjects: 0209 industrial biotechnology, Computer science, business.industry, General Engineering, k-means clustering, Initialization, Centroid, Pattern recognition, 02 engineering and technology, Document clustering, Computer Science Applications, ComputingMethodologies_PATTERNRECOGNITION, 020901 industrial engineering & automation, Artificial Intelligence, Cluster labeling, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, Projection (set theory), Cluster analysis, business
Abstract: Due to its simplicity and intuitive interpretability, spherical k-means is often used for clustering a large number of documents. However, it has a number of drawbacks that need to be addressed for more effective document clustering. Without well-dispersed initial points, spherical k-means fails to converge quickly, which is critical when clustering a large number of documents. Furthermore, its dense centroid vectors needlessly incorporate the impact of infrequent and less-informative words, thereby distorting the distance calculation between the document vectors. In this paper, we propose practical improvements on spherical k-means to overcome these issues during document clustering. Our proposed initialization method not only guarantees dispersed initial points, but is also up to 1000 times faster than previously well-known initialization methods such as k-means++. Furthermore, we enforce sparsity on the centroid vectors by using a data-driven threshold that is capable of dynamically adjusting its value depending on the clusters. Additionally, we propose an unsupervised cluster labeling method that effectively extracts meaningful keywords to describe each cluster. We have tested our improvements on seven different text datasets that include both new and publicly available datasets. Based on our experiments on these datasets, we have found that our proposed improvements successfully overcome the drawbacks of spherical k-means in significantly reduced computation time. Furthermore, we have qualitatively verified the performance of the proposed cluster labeling method by extracting descriptive keywords of the clusters from these datasets.
Published: 2020
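
Illustrative sketch (not from the paper): spherical k-means clusters unit-length document vectors by cosine similarity and renormalizes each centroid after every update. The Python below shows the plain algorithm with sparse dictionary vectors, fixed seed documents, and a naive labeling step that picks the highest-weighted centroid terms. It does not implement the paper's fast initialization, sparse centroid projection, or labeling method; the function names, the `seeds` parameter, and the sample term counts are assumptions made for illustration.

```python
import math
from collections import defaultdict

def normalize(vec):
    """Scale a sparse vector (dict of term -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()}

def dot(a, b):
    """Dot product of sparse vectors; the cosine similarity for unit vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def spherical_kmeans(docs, seeds, iterations=10):
    """Cluster unit vectors by cosine similarity; seeds index the initial centroids."""
    centroids = [docs[i] for i in seeds]
    assignment = []
    for _ in range(iterations):
        # Assign each document to the centroid with the highest cosine similarity.
        assignment = [
            max(range(len(centroids)), key=lambda c: dot(d, centroids[c]))
            for d in docs
        ]
        # Recompute each centroid as the normalized sum of its members.
        for c in range(len(centroids)):
            total = defaultdict(float)
            for d, a in zip(docs, assignment):
                if a == c:
                    for t, w in d.items():
                        total[t] += w
            if total:  # keep the old centroid if the cluster is empty
                centroids[c] = normalize(total)
    return assignment, centroids

def top_terms(centroid, n=2):
    """Label a cluster with its n highest-weighted centroid terms."""
    return sorted(centroid, key=lambda t: (-centroid[t], t))[:n]

# Hypothetical term counts for four short documents.
raw = [
    {"goal": 2, "match": 1},
    {"goal": 1, "match": 2, "team": 1},
    {"stock": 2, "market": 1},
    {"stock": 1, "market": 2, "bank": 1},
]
docs = [normalize(v) for v in raw]
assignment, centroids = spherical_kmeans(docs, seeds=[0, 2])
labels = [top_terms(c) for c in centroids]
```

Low-weight terms such as "team" and "bank" still sit in the dense centroids here, which is the kind of noise the paper's sparsity threshold is meant to remove.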

14. Bringing a Feature Selection Metric from Machine Learning to Complex Networks

Author: Jean-Charles Lamirel, Anthony Perez, and Nicolas Dugué
Affiliations: Laboratoire d'Informatique de l'Université du Mans (LIUM), Le Mans Université (UM); Natural Language Processing : representations, inference and semantics (SYNALP), Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Centre National de la Recherche Scientifique (CNRS), Université de Lorraine (UL), Institut National de Recherche en Informatique et en Automatique (Inria); Laboratoire d'Informatique Fondamentale d'Orléans (LIFO), Institut National des Sciences Appliquées - Centre Val de Loire (INSA CVL), Institut National des Sciences Appliquées (INSA), Université d'Orléans (UO); Graphes, Algorithmes et Modèles de Calcul (GAMoC)
Subjects: Computer science, business.industry, Node (networking), [INFO.INFO-DS]Computer Science [cs]/Data Structures and Algorithms [cs.DS], Feature selection, 02 engineering and technology, Complex network, [INFO.INFO-NE]Computer Science [cs]/Neural and Evolutionary Computing [cs.NE], Machine learning, computer.software_genre, [INFO.INFO-SI]Computer Science [cs]/Social and Information Networks [cs.SI], [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], 020204 information systems, Cluster labeling, Metric (mathematics), 0202 electrical engineering, electronic engineering, information engineering, Feature (machine learning), 020201 artificial intelligence & image processing, Artificial intelligence, business, Centrality, Cluster analysis, computer, ComputingMilieux_MISCELLANEOUS
Abstract: Introduced in the context of machine learning, the Feature F-measure is a parameter-free statistical feature selection metric that describes classes through a set of salient features. It has been shown to be efficient for classification, cluster labeling, and clustering model quality measurement. In this paper, we introduce the Node F-measure, its transposition to networks, where it can by analogy be applied to detect salient nodes in communities. This approach benefits from the parameter-free design of the Feature F-measure, its low computational complexity, and its well-evaluated performance. Interestingly, we show that in addition to these properties, the Node F-measure is correlated with certain centrality measures and with measures designed to characterize the community roles of nodes. We also observe that the usual community role measures depend strongly on the size of the communities, whereas the ones we propose are by definition linked to the density of the community. This makes their results comparable from one network to another. Finally, the parameter-free selection process applied to nodes allows for a universal system, in contrast to the thresholds previously defined empirically for establishing community roles. These results may have applications regarding leadership in scientific communities or the temporal monitoring of communities.
Published: 2018
Full Text: View/download PDF
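The abstract above does not restate how the Feature F-measure is computed. The following is a minimal sketch, assuming the formulation usually given for it: feature recall is the share of a feature's total weight that falls in a cluster, feature precision is the share of the cluster's total weight carried by that feature, and the F-measure is their harmonic mean. The function names `feature_f_measure` and `top_label` and the toy data are hypothetical, and the sketch ranks features rather than reproducing the parameter-free selection step.

```python
from collections import defaultdict

def feature_f_measure(docs, labels):
    """docs: list of {feature: weight} dicts; labels: one cluster id per doc.
    Returns {cluster: {feature: F}} under the assumed definition above."""
    # Sum of each feature's weight inside each cluster.
    in_cluster = defaultdict(lambda: defaultdict(float))
    for doc, c in zip(docs, labels):
        for f, w in doc.items():
            if w > 0:  # ignore absent or non-positive weights
                in_cluster[c][f] += w
    # Sum of each feature's weight over all clusters.
    feature_total = defaultdict(float)
    for feats in in_cluster.values():
        for f, w in feats.items():
            feature_total[f] += w
    scores = {}
    for c, feats in in_cluster.items():
        cluster_total = sum(feats.values())
        scores[c] = {}
        for f, w in feats.items():
            recall = w / feature_total[f]   # how specific f is to cluster c
            precision = w / cluster_total   # how much of c is explained by f
            scores[c][f] = 2 * recall * precision / (recall + precision)
    return scores

def top_label(scores, cluster):
    """Feature with the highest F-measure in the cluster (ties broken by name)."""
    return min(scores[cluster], key=lambda f: (-scores[cluster][f], f))

# Toy data: three documents in two clusters.
docs = [{"goal": 2, "team": 1}, {"goal": 1, "vote": 1}, {"vote": 3, "team": 1}]
labels = [0, 0, 1]
scores = feature_f_measure(docs, labels)
```

In the toy data, "goal" occurs only in cluster 0 and "vote" mostly in cluster 1, so those features come out on top for their respective clusters.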

15. Automatic Cluster Labeling Based on Phylogram Analysis

Author: Francisco N. C. Araújo, Rodrigo M. S. de Veras, Antonio Helson Mineiro Soares, and Vinicius Ponte Machado
Subjects: business.industry, Computer science, Process (engineering), Pattern recognition, 02 engineering and technology, Data set, ComputingMethodologies_PATTERNRECOGNITION, Similarity (network science), 020204 information systems, Cluster labeling, Metric (mathematics), Pattern recognition (psychology), 0202 electrical engineering, electronic engineering, information engineering, Task analysis, 020201 artificial intelligence & image processing, Artificial intelligence, Cluster analysis, business
Abstract: Clustering is one of the main pattern recognition techniques. It consists of organizing elements into groups (clusters) according to some metric that determines their similarity. Data sets often describe their elements through attributes that can take values of various types, which requires efficient methods for detecting correlations between complex (or mixed) type data. However, the clustering process does not provide clear information from which the characteristics of each cluster can be inferred; that is, the result of clustering does not make the meaning of the clusters easy to understand. Data labeling aims at identifying these characteristics and thereby allowing a full understanding of the resulting clusters. This paper proposes the joint use of unsupervised and supervised Machine Learning methods for the data clustering and labeling tasks, respectively. The labeling task consists of identifying the clusters through their most relevant characteristics. The algorithms used are known to be efficient, and they obtain satisfactory results in defining the clusters formed in the experiments presented here.
Published: 2018
Full Text: View/download PDF

16. Multi-focus cluster labeling

Author: Tor-Kristian Jenssen, Marit Holden, and Line Eikvil
Subjects: Information retrieval, Computer science, MEDLINE, Pie chart, Health Informatics, Documentation, Domain (software engineering), Visualization, law.invention, Pattern Recognition, Automated, Computer Science Applications, Set (abstract data type), Machine Learning, User-Computer Interface, ComputingMethodologies_PATTERNRECOGNITION, Vocabulary, Controlled, law, Cluster labeling, Cluster (physics), Selection (linguistics), Data Mining, Cluster analysis, Natural Language Processing
Abstract: We introduce the concept of multi-focus cluster labeling. Multi-focus labeling produces several sets of labels from different viewpoints. Multi-focus labels can be produced without re-clustering or heavy computations. Word clouds and pie charts present the results in a way that is familiar to users. The visualization allows for a compact presentation that is also suitable for smaller devices. Document collections resulting from searches in the biomedical literature, for instance in PubMed, are often so large that some organization of the returned information is necessary. Clustering is an efficient tool for organizing search results. To help the user decide how to continue the search for relevant documents, the content of each cluster can be characterized by a set of representative keywords or cluster labels. As different users may have different interests, it can be desirable to have solutions that produce labels from a selection of different topical categories. We therefore introduce the concept of multi-focus cluster labeling to give users an overview of the contents through labels from multiple viewpoints. The concept of multi-focus cluster labeling has been established and demonstrated on three different document collections. We illustrate that multi-focus visualizations can give an overview of clusters along axes that general labels are not able to convey. The approach is generic and should be applicable to any biomedical (or other) domain with any selection of foci where appropriate focus vocabularies can be established. A user evaluation also indicates that such a multi-focus concept is useful.
Published: 2015
Full Text: View/download PDF

17. Search Result Clustering Based on Query Context

Author: Michal Meina and Hung Son Nguyen
Subjects: Algebra and Number Theory, Fuzzy clustering, Information retrieval, Computer science, Correlation clustering, Constrained clustering, computer.software_genre, Theoretical Computer Science, Query expansion, Data stream clustering, Computational Theory and Mathematics, Cluster labeling, Canopy clustering algorithm, Data mining, Cluster analysis, computer, Information Systems
Abstract: This paper introduces a novel, interactive, and exploratory approach to information retrieval search engines based on clustering. The presented method allows users to change the clustering structure by applying a free-text clustering context query that is treated as a criterion for document-to-cluster allocation. Exploration mechanisms are delivered by redefining the interaction scenario so that the user can interact with data at the level of topic discovery or cluster labeling. In this paper, the presented idea is realized by a graph structure called the Query-Summarize Graph. This data structure is useful in defining the similarity measure between snippets as well as in the snippet clustering algorithm. Experiments on real-world data show that the proposed solution has many interesting properties and can be an alternative approach to interactive information retrieval.
Published: 2015
Full Text: View/download PDF

18. An improved cluster labeling algorithm based on vector similarity in radar signal sorting

Author: Yang Sheng, Changbo Hou, and Weijian Si
Subjects: Support vector machine, Computational complexity theory, Similarity (network science), Computer science, law, Cluster labeling, Sorting, sort, Radar, Cluster analysis, Algorithm, law.invention
Abstract: Radar signal sorting is the process of detecting and identifying the target radar emitter. This paper presents an improved cluster labeling algorithm based on vector similarity for radar signal sorting, which reduces the computational complexity without loss of accuracy. The basic idea of the algorithm is to make full use of the features of support vectors to search for the stable equilibrium points, and to calculate the similarity of vectors to assign the remaining data points. Simulation results show the effectiveness of the proposed method.
Published: 2017
Full Text: View/download PDF

19. Automatic assessment of medication states of patients with Parkinson's disease using wearable sensors

Author: Behnaz Ghoraani, Michelle A. Burack, and Murtadha D. Hssayeni
Subjects: 0301 basic medicine, Fuzzy classification, Parkinson's disease, Computer science, Feature extraction, Monitoring, Ambulatory, Disease, Fuzzy logic, Clothing, 03 medical and health sciences, 0302 clinical medicine, Fuzzy Logic, medicine, Humans, Cluster analysis, Dyskinesias, business.industry, Pattern recognition, Parkinson Disease, Signal Processing, Computer-Assisted, medicine.disease, Trunk, 030104 developmental biology, Dyskinesia, Cluster labeling, Artificial intelligence, medicine.symptom, Drug Monitoring, business, 030217 neurology & neurosurgery, Algorithms
Abstract: Motor fluctuations are a major focus of clinical management in patients with mid-stage and advanced Parkinson's disease (PD). In this paper, we develop a new patient-specific algorithm that can classify those fluctuations during a variety of activities. We extract a set of temporal and spectral features from the ambulatory signals and then introduce a semi-supervised classification algorithm based on K-means and self-organizing tree map clustering methods. Two different types of cluster labeling are introduced: hard and fuzzy labeling. The developed algorithm is evaluated on a dataset from triaxial gyroscope sensors for 12 PD patients. The average result of using K-means and fuzzy labeling on the readings of the trunk sensor and the sensor on the more affected leg was 75.96%, 70.57%, and 86.93% for accuracy, sensitivity, and specificity, respectively. The accuracy for individual patients varied from 99.95% to 42.53%, which was correlated with dyskinesia severity and the improvement of the PD symptoms with medication.
Published: 2017
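The abstract above names hard and fuzzy cluster labeling without describing either scheme. One plausible reading, offered only as a sketch under that assumption, is that a hard label gives each cluster the majority class among its annotated samples, while a fuzzy label keeps the class proportions as membership degrees. The function `label_clusters`, the "ON"/"OFF" class names, and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict

def label_clusters(cluster_ids, known_labels):
    """cluster_ids: cluster id per sample (e.g. from K-means).
    known_labels: class per sample, or None where no annotation exists.
    Returns (hard, fuzzy): majority class and class proportions per cluster."""
    counts = defaultdict(Counter)
    for c, y in zip(cluster_ids, known_labels):
        if y is not None:  # only annotated samples vote
            counts[c][y] += 1
    # Hard labeling: one class per cluster, chosen by majority vote.
    hard = {c: cnt.most_common(1)[0][0] for c, cnt in counts.items()}
    # Fuzzy labeling: membership degree of the cluster in each class.
    fuzzy = {
        c: {y: n / sum(cnt.values()) for y, n in cnt.items()}
        for c, cnt in counts.items()
    }
    return hard, fuzzy

# Toy data: six samples in two clusters, one sample unannotated.
cluster_ids = [0, 0, 0, 1, 1, 1]
known_labels = ["ON", "ON", "OFF", None, "OFF", "OFF"]
hard, fuzzy = label_clusters(cluster_ids, known_labels)
```

Under this reading, a new sample inherits either the single hard label of its cluster or the cluster's full membership vector, which preserves uncertainty for mixed clusters such as cluster 0 in the toy data.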

20. Automatic Summarization of Online Debates

Author: Kalina Bontcheva, Ahmet Aker, and Nattapong Sanchan
Subjects: FOS: Computer and information sciences, Information retrieval, Computer Science - Computation and Language, Bar chart, Computer science, Computer Science - Artificial Intelligence, Mutual information, Pipeline (software), Automatic summarization, law.invention, Visualization, Computer Science - Information Retrieval, Artificial Intelligence (cs.AI), law, Cluster labeling, Cluster analysis, First impression (psychology), Computation and Language (cs.CL), Information Retrieval (cs.IR)
Abstract: Debate summarization is one of the novel and challenging research areas in automatic text summarization and has been largely unexplored. In this paper, we develop a debate summarization pipeline to summarize key topics that are discussed or argued on the two opposing sides of online debates. We take the view that debate summaries can be generated by clustering, cluster labeling, and visualization. In our work, we investigate two different clustering approaches for generating the summaries. In the first approach, we generate the summaries by applying purely term-based clustering and cluster labeling. The second approach makes use of X-means for clustering and Mutual Information for labeling the clusters. Both approaches are driven by ontologies. We visualize the results using bar charts. We think that our results are a smooth entry point for users who want a first impression of what is discussed within a debate topic containing a vast number of arguments. Comment: Accepted and to be published in Natural Language Processing and Information Retrieval workshop, Recent Advances in Natural Language Processing 2017 (RANLP 2017)
Published: 2017
Full Text: View/download PDF
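Labeling clusters with Mutual Information, as mentioned in the abstract above, can be illustrated with the standard textbook formula over a 2x2 contingency table of term presence against cluster membership. This is a minimal sketch of that general technique, not the authors' pipeline: the ontology-driven steps and X-means are left out, and the names `mutual_information` and `label_cluster` and the toy documents are hypothetical.

```python
import math

def mutual_information(docs, labels, term, cluster):
    """MI (in bits) between 'term occurs in doc' and 'doc is in cluster'.
    docs: list of sets of terms; labels: one cluster id per doc."""
    n = len(docs)
    # counts[t][c]: t = 1 if the term is present, c = 1 if the doc is in the cluster.
    counts = [[0, 0], [0, 0]]
    for doc, lab in zip(docs, labels):
        counts[int(term in doc)][int(lab == cluster)] += 1
    mi = 0.0
    for t in (0, 1):
        for c in (0, 1):
            n_tc = counts[t][c]
            if n_tc == 0:
                continue  # an empty cell contributes nothing
            n_t = counts[t][0] + counts[t][1]  # row total
            n_c = counts[0][c] + counts[1][c]  # column total
            mi += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return mi

def label_cluster(docs, labels, cluster, k=3):
    """Top-k terms of the cluster by MI. Candidates are restricted to terms that
    occur inside the cluster, because MI alone also rewards terms whose absence
    is informative."""
    members = [d for d, lab in zip(docs, labels) if lab == cluster]
    candidates = set().union(*members)
    ranked = sorted(
        candidates,
        key=lambda term: (-mutual_information(docs, labels, term, cluster), term),
    )
    return ranked[:k]

# Toy data: four short "arguments" in two clusters.
docs = [{"tax", "cut"}, {"tax", "vote"}, {"school", "vote"}, {"school", "fund"}]
labels = [0, 0, 1, 1]
```

In the toy data, "tax" appears in every document of cluster 0 and nowhere else, so it carries one full bit of information about membership, whereas "vote" is spread evenly across both clusters and carries none.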

21. Unsupervised Anomaly Detection for Network Flow Using Immune Network Based K-means Clustering

Author: Xiaoning Peng, Yu Zhang, Yuanquan Shi, and Renfa Li
Subjects: Computer science, business.industry, k-means clustering, Pattern recognition, Flow network, ComputingMethodologies_PATTERNRECOGNITION, Flow (mathematics), Cluster labeling, Cluster (physics), Anomaly detection, Artificial intelligence, Anomaly (physics), Cluster analysis, business
Abstract: To effectively detect unknown anomalous attack behaviors in network traffic, an Unsupervised Anomaly Detection approach for network flow using Immune Network based K-means clustering (UADINK) is proposed. In UADINK, an artificial immune network based K-means clustering algorithm (aiNet_KMC) is introduced to cluster network flows: the aiNet model extracts abstract internal images from the network flows and obtains an optimized parameter K for K-means, and the network flows are then clustered by the K-means algorithm. The cluster labeling algorithm (clusLA) and the network flow anomaly detection algorithm (NFAD) are introduced to detect anomalous attack behaviors in network flows: the clusLA algorithm labels whether each cluster is malicious, and the labeled clusters are used as detectors by NFAD to identify anomalous network flows. To evaluate the effectiveness of UADINK, the ISCX 2012 IDS dataset is used as the simulation dataset. Compared with the NDM based K-means anomaly detection approach, the results show that UADINK is a radical anomaly detection approach for detecting anomalies in network flows.
Published: 2017
Full Text: View/download PDF

22. Clustering of web search results based on the cuckoo search algorithm and Balanced Bayesian Information Criterion

Author: Elizabeth León, Martha Mendoza, Carlos Cobos, Enrique Herrera-Viedma, Henry Muñoz-Collazos, and Richar Urbano-Muñoz
Subjects: Clustering high-dimensional data, Information Systems and Management, Fuzzy clustering, Computer science, Correlation clustering, Population, Suffix tree clustering, computer.software_genre, Machine learning, Theoretical Computer Science, Biclustering, Artificial Intelligence, CURE data clustering algorithm, Local search (optimization), Cuckoo search, Cluster analysis, education, k-medians clustering, education.field_of_study, Fitness function, business.industry, Constrained clustering, k-means clustering, Document clustering, Computer Science Applications, Data set, Determining the number of clusters in a data set, Data stream clustering, Control and Systems Engineering, Cluster labeling, Canopy clustering algorithm, FLAME clustering, Affinity propagation, Artificial intelligence, Data mining, business, computer, Algorithm, Software
Abstract: The clustering of web search results - or web document clustering - has become a very interesting research area among academic and scientific communities involved in information retrieval. Web search result clustering systems, also called Web Clustering Engines, seek to increase the coverage of documents presented for the user to review, while reducing the time spent reviewing them. Several algorithms for clustering web results already exist, but results show room for more to be done. This paper introduces a new description-centric algorithm for the clustering of web results, called WDC-CSK, which is based on the cuckoo search meta-heuristic algorithm, k-means algorithm, Balanced Bayesian Information Criterion, split and merge methods on clusters, and frequent phrases approach for cluster labeling. The cuckoo search meta-heuristic provides a combined global and local search strategy in the solution space. Split and merge methods replace the original Lévy flights operation and try to improve existing solutions (nests), so they can be considered as local search methods. WDC-CSK includes an abandon operation that provides diversity and prevents the population nests from converging too quickly. Balanced Bayesian Information Criterion is used as a fitness function and allows defining the number of clusters automatically. WDC-CSK was tested with four data sets (DMOZ-50, AMBIENT, MORESQUE and ODP-239) over 447 queries. The algorithm was also compared against other established web document clustering algorithms, including Suffix Tree Clustering (STC), Lingo, and Bisecting k-means. The results show a considerable improvement upon the other algorithms as measured by recall, F-measure, fall-out, accuracy and SSL_k.
Published: 2014
Full Text: View/download PDF

23. Analyzing topic evolution in bioinformatics: investigation of dynamics of the field with conference data in DBLP

Author: Min Song, Suyeon Kim, and Go Eun Heo
Subjects: Markov random field, Computer science, General Social Sciences, Library and Information Sciences, Bioinformatics, Data science, Field (computer science), Computer Science Applications, Dynamics (music), Cluster labeling, Similarity (psychology), tf–idf, Cluster analysis, Period (music)
Abstract: In this paper we analyze topic evolution over time within bioinformatics to uncover the underlying dynamics of that field, focusing on the recent developments in the 2000s. We select 33 bioinformatics related conferences indexed in DBLP from 2000 to 2011. The major reason for choosing DBLP as the data source instead of PubMed is that DBLP retains most bioinformatics related conferences, and to study dynamics of the field, conference papers are more suitable than journal papers. We divide a period of a dozen years into four periods: period 1 (2000–2002), period 2 (2003–2005), period 3 (2006–2008) and period 4 (2009–2011). To conduct topic evolution analysis, we employ three major procedures, and for each procedure, we develop the following novel technique: the Markov Random Field-based topic clustering, automatic cluster labeling, and topic similarity based on Within-Period Cluster Similarity and Between-Period Cluster Similarity. The experimental results show that there are distinct topic transition patterns between different time periods. From period 1 to period 3, new topics seem to have emerged and expanded, whereas from period 3 to period 4, topics are merged and display more rigorous interaction with each other. This trend is confirmed by the collaboration pattern over time.
Published: 2014
Full Text: View/download PDF

24. Fast and scalable support vector clustering for large-scale data analysis

Author: Yun Feng Chang, Zhili Zhang, Ying Jie Tian, Yi Xian Yang, Yuan Ping, and Yajian Zhou
Subjects: Basis (linear algebra), Support function, Hypersphere, computer.software_genre, Human-Computer Interaction, ComputingMethodologies_PATTERNRECOGNITION, Data point, Artificial Intelligence, Hardware and Architecture, Scalability, Cluster labeling, Data mining, Cluster analysis, Algorithm, computer, Software, k-medians clustering, Information Systems, Mathematics
Abstract: As an important boundary-based clustering algorithm, support vector clustering (SVC) benefits multiple applications through its ability to handle arbitrary cluster shapes. However, its popularity is limited by both its computationally expensive training and its poor labeling performance, which are due, respectively, to the redundant kernel function matrix required for estimating a support function and to inefficient checking of segments between all pairs of data points. To address these two problems, a fast and scalable SVC (FSSVC) method is proposed in this paper that achieves a significant improvement in efficiency while guaranteeing accuracy comparable to state-of-the-art methods. The heart of our approach includes (1) constructing the hypersphere and support function from cluster boundaries, which prunes unnecessary computation and storage of kernel functions, and (2) presenting an adaptive labeling strategy that decomposes clusters into convex hulls and then employs a convex-decomposition-based cluster labeling algorithm or a cone cluster labeling algorithm depending on whether the radius of the hypersphere is greater than 1. Both theoretical analysis and experimental results (e.g., the first rank of a nonparametric statistical test) show the superiority of our method over the others, especially for large-scale data analysis under limited memory requirements.
Published: 2014
Full Text: View/download PDF
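The pairwise segment check that the abstract above identifies as the labeling bottleneck of classical SVC can be sketched as follows: two points are placed in the same cluster when every sampled point on the straight segment between them stays inside the cluster boundary, and clusters are the connected components of that relation. This is an illustration of the baseline technique only, not of FSSVC. The predicate `inside` stands in for the trained support function test, and the two-disc boundary, the sample count, and the function name `label_by_segments` are hypothetical.

```python
import math

def label_by_segments(points, inside, samples=10):
    """Assign cluster ids by checking the segment between every pair of points.
    inside(p) must return True when p lies within the cluster boundary."""
    parent = list(range(len(points)))  # union-find forest over point indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # O(n^2 * samples) checks: the cost that faster labeling schemes try to avoid.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            a, b = points[i], points[j]
            connected = all(
                inside(tuple(x + (y - x) * s / samples for x, y in zip(a, b)))
                for s in range(samples + 1)
            )
            if connected:
                parent[find(i)] = find(j)

    # Renumber the component roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(points))]

# Hypothetical boundary: two unit discs standing in for a trained support function.
centres = [(0.0, 0.0), (5.0, 0.0)]

def inside(p):
    return min(math.dist(p, c) for c in centres) <= 1.0

points = [(0.0, 0.0), (0.5, 0.0), (5.0, 0.0), (5.5, 0.2)]
cluster_ids = label_by_segments(points, inside)
```

Because every pair of points triggers several boundary evaluations, the cost grows quadratically with the number of points, which is the inefficiency that convex-decomposition and cone-based labeling aim to reduce.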

25. Beyond cluster labeling: Semantic interpretation of clusters’ contents using a graph representation

Author: Mohamed Nadif and François Role
Subjects: Information Systems and Management, Theoretical computer science, Computer science, Semantic interpretation, Feature selection, Graph, Management Information Systems, Exploratory data analysis, Artificial Intelligence, Cluster labeling, Cluster (physics), Graph (abstract data type), Cluster analysis, Software
Abstract: Efficient clustering algorithms have been developed to automatically group documents into subgroups (clusters). Once clustering has been performed, an important additional step is to help users make sense of the obtained clusters. Existing methods address this issue by assigning to each cluster a flat list of descriptive terms (labels) that are extracted from the documents, most often using statistical techniques borrowed from the field of feature selection or reduction. A limitation of these unstructured descriptions of clusters' contents is that they do not account for the meaningful relationships between the terms. In contrast, we propose a graph representation, which makes the clusters easier to interpret by putting the descriptive terms in context, and by performing some simple network analysis. Our experiments reveal that the proposed method allows for a deeper level of interpretation, both when the clusters under study are homogeneous and when they are heterogeneous. In addition, evaluation procedures presented in the paper show that the graph-based representation of each cluster, while being very synthetic, still quite faithfully reflects the original content of the cluster.
Published: 2014
Full Text: View/download PDF

26. Multiple-Parameter Radar Signal Sorting Using Support Vector Clustering and Similitude Entropy Index

Author: Dengfu Zhang, Duyan Bi, Shiqiang Wang, and Zhanling Wang
Subjects: Mathematical optimization, Applied Mathematics, Feature vector, Similitude, law.invention, Generalized entropy index, law, Signal Processing, Cluster labeling, Entropy (information theory), Radar, Cluster analysis, Algorithm, Time complexity, Mathematics
Abstract: The radar signal sorting method based on the traditional support vector clustering (SVC) algorithm has high time complexity, and the traditional validity index cannot efficiently indicate the best sorting result. To solve these problems, we study a new sorting method based on the cone cluster labeling (CCL) method. The CCL method relies on the theory of approximate coverings in both feature space and data space. A new cluster validity index, similitude entropy (SE), is also proposed. It can be used to evaluate the compactness and separation of clusters using information entropy theory. Simulations, including a performance comparison between the proposed method and conventional methods, are presented. Results show that the proposed method effectively reduces the computational complexity of sorting the signals while maintaining sorting accuracy.
Published: 2013
Full Text: View/download PDF

27. Efficient Cluster Labeling for Support Vector Clustering

Author: M. Shengrui Wang, M. Ernest Monga, V. D'Orangeville, and M. Andre Mayers
Subjects: Clustering high-dimensional data, Fuzzy clustering, business.industry, Computer science, Correlation clustering, Single-linkage clustering, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Pattern recognition, Computer Science Applications, Support vector machine, ComputingMethodologies_PATTERNRECOGNITION, Data stream clustering, Computational Theory and Mathematics, CURE data clustering algorithm, Cluster labeling, Canopy clustering algorithm, Affinity propagation, Artificial intelligence, Cluster analysis, business, k-medians clustering, Information Systems
Abstract: We propose a new efficient algorithm for solving the cluster labeling problem in support vector clustering (SVC). The proposed algorithm analyzes the topology of the function describing the SVC cluster contours and explores interconnection paths between critical points separating distinct cluster contours. This process allows distinguishing disjoint clusters and associating each point to its respective one. The proposed algorithm implements a new fast method for detecting and classifying critical points while analyzing the interconnection patterns between them. Experiments indicate that the proposed algorithm significantly improves the accuracy of the SVC labeling process in the presence of clusters of complex shape, while reducing the processing time required by existing SVC labeling algorithms by orders of magnitude.
Published: 2013
Full Text: View/download PDF

28. Automated analysis of flow cytometry data: a systematic review of recent methods

Author: Mawal Mohammed, Taher Ahmed Ghaleb, and Emad Ramadan
Subjects: education.field_of_study, business.industry, Computer science, Population, computer.software_genre, Automation, Field (computer science), Identification (information), ComputingMethodologies_PATTERNRECOGNITION, Cluster labeling, Data analysis, Data mining, business, Cluster analysis, education, Research question, computer
Abstract: Flow cytometry (FCM) is a very well-known method that is broadly used in clinical and research laboratories. Both clinical and research laboratories have been the target domains of FCM applications. The key research question in this particular field is “how to effectively automate FCM data analysis?”. To answer this question, this paper systematically reviews current advances in the automation of FCM data analysis. All recent techniques have been studied so that readers can recognize current trends, challenges, limitations, and future directions. For future research, we have identified three main avenues. First, identifying the number of clusters prior to starting cell population identification is still a challenging process. Second, the process of cluster labeling still requires improvement before it can be fully automated. Last, benchmark datasets are essential for researchers to be able to comparatively evaluate different techniques of FCM data analysis under fixed conditions. We conclude this paper with a discussion of how flow cytometry data analysis techniques and datasets are correlated with open source technology.
Published: 2016
Full Text: View/download PDF

29. Towards Ontology Reasoning for Topological Cluster Labeling

Author: Isabelle Mougenot, Laure Berti-Equille, Younès Bennani, Hatim Chahdi, Nistor Grozavu, Laboratoire d'Informatique de Paris-Nord (LIPN), Université Sorbonne Paris Cité (USPC)-Institut Galilée-Université Paris 13 (UP13)-Centre National de la Recherche Scientifique (CNRS), UMR 228 Espace-Dev, Espace pour le développement, Institut de Recherche pour le Développement (IRD)-Université de Perpignan Via Domitia (UPVD)-Avignon Université (AU)-Université de La Réunion (UR)-Université de Montpellier (UM)-Université de Guyane (UG)-Université des Antilles (UA), Qatar Computing Research Institute [Doha, Qatar] (QCRI), and ANR-12-MONU-0001,Coclico,COllaboration, CLassification, Incrémentalité et COnnaissances(2012)
Subjects: Knowledge representation and reasoning, Computer science, Reasoning & Knowledge Representation, Ontologie, 02 engineering and technology, Ontology (information science), Topology, Machine learning, computer.software_genre, Semantics, Logique de Description, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], Semantic Analysis, Description logic, [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], [STAT.ML]Statistics [stat]/Machine Learning [stat.ML], 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Cluster labelling, Cluster analysis, Reasoning system, business.industry, Ontology, Description logic reasoning, [INFO.INFO-LO]Computer Science [cs]/Logic in Computer Science [cs.LO], [INFO.INFO-CV]Computer Science [cs]/Computer Vision and Pattern Recognition [cs.CV], Self organising maps SOM, ComputingMethodologies_PATTERNRECOGNITION, Raisonnement, [INFO.INFO-TI]Computer Science [cs]/Image Processing [eess.IV], Cluster labeling, Unsupervised learning, 020201 artificial intelligence & image processing, Artificial intelligence, Data mining, business, computer
Abstract: In this paper, we present a new approach combining topological unsupervised learning with ontology-based reasoning to achieve both (i) automatic interpretation of clustering and (ii) scaling of ontology reasoning over large datasets. The interest of such an approach lies in the use of expert knowledge to automate cluster labeling and to give the clusters high-level semantics that meet the user's interest. The proposed approach is based on two steps. The first step performs topographic unsupervised learning based on the SOM (Self-Organizing Maps) algorithm. The second step integrates expert knowledge into the map using ontology reasoning over the prototypes and provides an automatic interpretation of the clusters. We apply our approach to the real problem of satellite image classification. The experiments highlight the capacity of our approach to obtain a semantically labeled topographic map, and the obtained results show very promising performance.
Published: 2016
Full Text: View/download PDF

30. Exploring the Similarity/Dissimilarity measures for unsupervised IDS

Author: R. Kiran Kumar, M. Sailaja, and P. Sita Rama Murty
Subjects: business.industry, Anomaly-based intrusion detection system, Pattern recognition, Intrusion detection system, computer.software_genre, Hierarchical clustering, Euclidean distance, ComputingMethodologies_PATTERNRECOGNITION, Similarity (network science), Cluster labeling, Unsupervised learning, Artificial intelligence, Data mining, Cluster analysis, business, computer, Mathematics
Abstract: This paper investigates various similarity/dissimilarity measures for the intrusion detection problem. We implemented an offline anomaly-based IDS using agglomerative and partition-based clustering algorithms with selected similarity/dissimilarity measures. In unsupervised learning, labeling the clusters is an important task. This paper employs two cluster labeling algorithms: the SNC labeling algorithm and “labeling clusters using class representative objects”. The work is evaluated using the KDDCup 99 dataset.
Published: 2016
Full Text: View/download PDF

31. Keyqueries for Clustering and Labeling

Author: Benno Stein, Matthias Busse, Tim Gollub, and Matthias Hagen
Subjects: Information retrieval, Web search query, Computer science, Perspective (graphical), 02 engineering and technology, Document clustering, Consistency (database systems), Search engine, ComputingMethodologies_PATTERNRECOGNITION, 020204 information systems, Cluster labeling, 0202 electrical engineering, electronic engineering, information engineering, Vector space model, 020201 artificial intelligence & image processing, Cluster analysis
Abstract: In this paper we revisit the document clustering problem from an information retrieval perspective. The idea is to use queries as features in the clustering process that finally also serve as descriptive cluster labels “for free.” Our novel perspective includes query constraints for clustering and cluster labeling that ensure consistency with a keyword-based reference search engine.
Published: 2016
Full Text: View/download PDF

32. A clustering technique for news articles using WordNet

Author: Vassilis Tsogkas and Christos Bouras
Subjects: DBSCAN, Clustering high-dimensional data, Information Systems and Management, Fuzzy clustering, Computer science, Correlation clustering, Conceptual clustering, computer.software_genre, Management Information Systems, Biclustering, Artificial Intelligence, CURE data clustering algorithm, Consensus clustering, Cluster analysis, Information retrieval, Brown clustering, k-means clustering, Document clustering, ComputingMethodologies_PATTERNRECOGNITION, Data stream clustering, Bag-of-words model, Cluster labeling, Canopy clustering algorithm, FLAME clustering, Data mining, computer, Software
Abstract: The Web is overcrowded with news articles, an overwhelming information source in both amount and diversity. Document clustering is a powerful technique that has been widely used for organizing data into smaller and manageable information kernels. Several approaches have been proposed which, however, suffer from problems like synonymy, ambiguity and lack of a descriptive content marking of the generated clusters. In this work, we investigate the application of a wide spectrum of clustering algorithms, as well as similarity measures, to news articles that originate from the Web. We also propose enhancing the standard k-means algorithm with external knowledge from WordNet hypernyms in a twofold manner: enriching the "bag of words" used prior to the clustering process and assisting the label generation procedure following it. Furthermore, we examine the effect that text preprocessing has on clustering. Operating on a corpus of news articles derived from major news portals, our comparison of the existing clustering methodologies revealed that k-means gives better aggregate results when it comes to efficiency. This is amplified when the algorithm is accompanied by preliminary steps for data cleaning and normalizing, despite its simple nature. Moreover, the proposed WordNet-enabled W-k means clustering algorithm significantly improves on standard k-means, also generating useful and high quality cluster tags by using the presented cluster labeling process.
Published: 2012
Full Text: View/download PDF

33. Convex Decomposition Based Cluster Labeling Method for Support Vector Clustering

Author: Yajian Zhou, Ying-Jie Tian, Yuan Ping, and Yixian Yang
Subjects: Convex hull, Mathematical optimization, Computer science, Convex set, Proper convex function, Support function, Theoretical Computer Science, Convex polytope, Convex combination, Adjacency matrix, Convex conjugate, Cluster analysis, Time complexity, Connected component, Linear matrix inequality, Radon's theorem, Computer Science Applications, Computational Theory and Mathematics, Effective domain, Hardware and Architecture, Cluster labeling, Convex optimization, Convex cone, Algorithm, Software, Conic optimization
Abstract: Support vector clustering (SVC) is an important boundary-based clustering algorithm in multiple applications for its capability of handling arbitrary cluster shapes. However, SVC's popularity is limited by its high time complexity and poor labeling performance. To overcome these problems, we present a novel, efficient and robust convex decomposition based cluster labeling (CDCL) method based on the topological properties of the dataset. CDCL decomposes the implicit cluster into convex hulls, each of which comprises a subset of support vectors (SVs). According to a robust algorithm applied to the nearest neighboring convex hulls, the adjacency matrix of convex hulls is built for finding the connected components; the remaining data points are then assigned the label of the nearest convex hull. The approach's validity is guaranteed by geometric proofs. Time complexity analysis and comparative experiments suggest that CDCL improves both the efficiency and clustering quality significantly.
Published: 2012
Full Text: View/download PDF

34. Multi-parameter Radar Signal Sorting Method Based on Fast Support Vector Clustering and Similitude Entropy

Author: Bi Du-Yan, Zhang Dengfu, Yong Xiaoju, and Wang Shi-Qiang
Subjects: business.industry, Feature vector, Pattern recognition, Similitude, law.invention, law, Cluster labeling, Entropy (information theory), Artificial intelligence, Adjacency matrix, Electrical and Electronic Engineering, Radar, business, Cluster analysis, Time complexity, Mathematics
Abstract: Radar signal sorting methods based on traditional clustering algorithms have high time complexity and poor accuracy. To address this issue, a new sorting method is investigated based on the Cone Cluster Labeling (CCL) method for the Support Vector Clustering (SVC) algorithm. The CCL method labels clusters in data space, and therefore avoids the high complexity caused by the calculation of the adjacency matrix in feature space. This method is introduced into radar signal sorting and is modified for lower complexity and higher accuracy by handling the outliers. In addition, a new cluster validity index, the Similitude Entropy (SE) index, is proposed, which assesses the compactness and separation of clusters using information entropy theory. Experimental results show that the strategy can improve efficiency without sacrificing sorting accuracy.
Published: 2011
Full Text: View/download PDF

35. Wikipedia-based topic clustering for microblogs

Author: Tan Xu and Douglas W. Oard
Subjects: Primary channel, Structure (mathematical logic), Information retrieval, Semantic similarity, Computer science, Microblogging, Similarity (psychology), Cluster labeling, Leverage (statistics), Social media, Library and Information Sciences, Cluster analysis, Information Systems
Abstract: Microblogging has become a primary channel by which people not only share information, but also search for information. However, microblog search results are most often displayed by simple criteria such as creation time or author. A review of the literature suggests that clustering by topic may be useful, but short posts offer limited scope for clustering using lexical evidence alone. This paper therefore presents an approach to topical clustering based on augmenting lexical evidence with the use of Wikipedia as an external source of evidence for topical similarity. The main idea is to link terms in microblog posts to Wikipedia pages and then to leverage Wikipedia's link structure to estimate semantic similarity. Results show statistically significant relative improvements of about 3% in cluster purity using a relatively small (7500-post, 5-topic) Twitter test collection. Linking terms in microblog posts to Wikipedia pages is also shown to offer a useful basis for cluster labeling.
Published: 2011
Full Text: View/download PDF
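
The abstract above reports its gains in terms of cluster purity. As a minimal sketch of that measure (the standard definition: the fraction of items that share the majority topic of their cluster), assuming one cluster id and one reference topic per post; the function name `cluster_purity` and the toy data are illustrative and not taken from the paper:

```python
from collections import Counter


def cluster_purity(cluster_ids, true_topics):
    """Return the fraction of items whose topic is the majority topic of their cluster."""
    if len(cluster_ids) != len(true_topics):
        raise ValueError("cluster_ids and true_topics must have the same length")
    if not cluster_ids:
        raise ValueError("purity is undefined for an empty collection")

    # Count how often each reference topic occurs inside each cluster.
    per_cluster = {}
    for cluster, topic in zip(cluster_ids, true_topics):
        per_cluster.setdefault(cluster, Counter())[topic] += 1

    # Each cluster contributes the size of its largest single-topic group.
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(cluster_ids)


if __name__ == "__main__":
    clusters = [0, 0, 0, 1, 1, 1]
    topics = ["sports", "sports", "politics", "politics", "politics", "sports"]
    print(cluster_purity(clusters, topics))
```

Purity rewards many small clusters (singleton clusters always score 1.0), so it is usually reported for a fixed number of clusters, as in the 5-topic collection mentioned above.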

36. Fast support-based clustering method for large-scale problems

Author: Kyu-Hwan Jung, Jaewook Lee, and Daewon Lee
Subjects: Clustering high-dimensional data, Correlation clustering, Constrained clustering, computer.software_genre, Kernel method, Data stream clustering, Artificial Intelligence, CURE data clustering algorithm, Signal Processing, Cluster labeling, Computer Vision and Pattern Recognition, Data mining, Cluster analysis, Algorithm, computer, Software, Mathematics
Abstract: In many support vector-based clustering algorithms, a key computational bottleneck is the cluster labeling time of each data point which restricts the scalability of the method. In this paper, we review a general framework of support vector-based clustering using dynamical system and propose a novel method to speed up labeling time which is log-linear to the size of data. We also give theoretical background of the proposed method. Various large-scale benchmark results are provided to show the effectiveness and efficiency of the proposed method.
Published: 2010
Full Text: View/download PDF

37. Path-percolation modeling of electrical property variations with statistical procedures in spatially disordered inhomogeneous media

Author: Sukkee Um, Hye-Mi Jung, and Wongyu Choi
Subjects: Reverse engineering, Resistive touchscreen, Materials science, Conductive materials, General Physics and Astronomy, Nanotechnology, computer.software_genre, Electrical resistance and conductance, Lattice (order), Cluster labeling, Statistical physics, Thin film, Cluster analysis, computer
Abstract: A current-path percolation model has been developed to simulate the electrical property variations in spatially-disordered inhomogeneous media by establishing a computational path determination scheme, i.e., a cluster labeling process. This scheme eliminates the necessity to estimate the bond resistance at lattice edges. Subsequently, an active clustering process provides more accurate effective electrical resistance values than both the effective medium approximation (EMA) and the Kirkpatrick algorithm. We apply the present model to a solid thin film mixture of temperature-dependent resistive materials. The computational results agree well with experimental data for the effective resistance of pure phases of VO2 in the literature. Results show that the electrical resistance of VO2 thin films is rather strongly affected by the micro- or the nano-structural configurations of the conductive materials of the thin films. It is expected that current-path percolation modeling combined with a cluster labeling process can be applied to investigating the effective electrical properties of conductive materials and can be utilized as a reverse engineering tool to tailor the micro- or the nano-structural properties of conductive materials.
Published: 2010
Full Text: View/download PDF

38. Performance of a Finite-State Machine Implementation of Iterative Cluster Labeling on Desktop and Mobile Computing Platforms

Author: Matthew Aldridge and Michael W. Berry
Subjects: Finite-state machine, Computer science, Iterative method, Mobile computing, Application software, computer.software_genre, Data type, Computer Science Applications, Computational Theory and Mathematics, Computer engineering, Cluster labeling, Algorithm design, Data mining, Cluster analysis, computer, Information Systems
Abstract: In this paper, we present an efficient finite-state machine implementation of the Hoshen-Kopelman cluster identification algorithm using the nearest-eight neighborhood rule suitable to applications such as computer modeling for landscape ecology. The implementation presented in this study was tested using both actual land cover maps, as well as randomly generated data similar to those in the original presentation of the Hoshen-Kopelman algorithm for percolation analysis. The finite-state machine implementation clearly outperformed a straightforward adaptation of the original Hoshen-Kopelman algorithm on either data type. Research was also conducted to explore the finite-state machine's performance on a palm mobile computing device, and while it was competitive, it did not exceed the performance of the straightforward Hoshen-Kopelman implementation. However, a discussion of why this was the case is provided along with a possible remedy for future hardware designs.
Published: 2009
Full Text: View/download PDF
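
For readers unfamiliar with the algorithm named in the abstract above, the following is a minimal sketch of Hoshen-Kopelman style labeling with the nearest-eight neighborhood rule: one raster scan assigns provisional labels and records equivalences in a union-find structure, and a second pass resolves them. It is the straightforward variant, not the finite-state machine implementation the paper evaluates, and the function name `label_clusters` is an assumption made for illustration:

```python
def label_clusters(grid):
    """Label 8-connected clusters of truthy cells; returns a grid of ints (0 = empty)."""
    rows = len(grid)
    cols = len(grid[0]) if rows else 0
    labels = [[0] * cols for _ in range(rows)]
    parent = [0]  # parent[i] is the union-find parent of provisional label i; index 0 is unused

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    for r in range(rows):
        for c in range(cols):
            if not grid[r][c]:
                continue
            # Neighbors already visited in raster order: W, NW, N, NE.
            seen = []
            for dr, dc in ((0, -1), (-1, -1), (-1, 0), (-1, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and labels[nr][nc]:
                    seen.append(labels[nr][nc])
            if not seen:
                parent.append(len(parent))  # start a new provisional label
                labels[r][c] = len(parent) - 1
            else:
                labels[r][c] = seen[0]
                for other in seen[1:]:
                    union(seen[0], other)  # record that these labels are one cluster

    # Second pass: resolve provisional labels and renumber clusters 1..k.
    final = {}
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                root = find(labels[r][c])
                labels[r][c] = final.setdefault(root, len(final) + 1)
    return labels


if __name__ == "__main__":
    land_cover = [
        [1, 0, 0, 1],
        [0, 1, 0, 1],
        [0, 0, 0, 0],
        [1, 0, 1, 0],
    ]
    for row in label_clusters(land_cover):
        print(row)
```

Only the four already-scanned neighbors need checking at each cell, which is what makes a single forward scan sufficient; the diagonal entries are what distinguish the nearest-eight rule from the four-neighbor rule.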

39. Physically based inversion modeling for unsupervised cluster labeling, independent forest classification, and LAI estimation using MFM-5-Scale

Author: Ryan L Johnson, Jing M. Chen, Josef Cihlar, Derek R. Peddle, Sylvain G. Leblanc, and Forrest G. Hall
Subjects: Geography, Training set, Forest classification, Cluster labeling, General Earth and Planetary Sciences, A priori and a posteriori, Inversion (meteorology), Land cover, Leaf area index, Cluster analysis, Remote sensing
Abstract: Unsupervised clustering is important for regional- to national-scale forest inventories where supervised training data are impractical or unavailable. However, labeling clusters in terms of land-cover classes can be labour intensive and problematic, and clustering and related methods do not provide biophysical-structural information (BSI). Canopy reflectance models such as 5-Scale are powerful forest remote sensing tools; however, 5-Scale can only be run in forward mode and is not invertible to obtain the required forest information. This problem was solved using multiple-forward-mode (MFM) coupled with 5-Scale to enable MFM-5-Scale inversion of land cover and BSI using a look-up table (MFM-LUT) approach that matches satellite image reflectance values with modeled reflectance values that have associated land cover and BSI, such as density, leaf area index (LAI), and crown dimensions, as well as subpixel-scale component fractions. MFM requires no training data or a priori BSI and can optionally be stratifi...
Published: 2007
Full Text: View/download PDF

40. Automated software clustering: An insight using cluster labels

Author: Haroon A. Babri and Onaiza Maqbool
Subjects: Clustering high-dimensional data, Fuzzy clustering, Computer science, Single-linkage clustering, Correlation clustering, Constrained clustering, computer.software_genre, ComputingMethodologies_PATTERNRECOGNITION, Data stream clustering, Ranking, Hardware and Architecture, CURE data clustering algorithm, Cluster labeling, Canopy clustering algorithm, FLAME clustering, Data mining, Cluster analysis, computer, Software, Information Systems
Abstract: Clustering techniques have shown promising results for the architecture recovery and re-modularization of legacy software systems. Clusters that are obtained as a result of the clustering process may not be easy to interpret until they are assigned appropriate labels. Automatic labeling of clusters reduces the time required to understand them and can also be used to evaluate the effectiveness of the clustering process, if the assigned labels are meaningful and convey the purpose of each cluster effectively. In this paper, we present a labeling scheme based on identifiers of an entity. As the clustering process proceeds, keywords within identifiers are ranked using two ranking schemes: frequency and inverse frequency. We present experimental results to demonstrate the effectiveness of our labeling approach. A comparison between the ranking schemes reveals the inverse frequency scheme to form more meaningful labels, especially for small clusters. A comparison of clustering results of the complete and weighted combined algorithms based on labels of the clusters produced by them during clustering shows that the latter produces a more understandable cluster hierarchy with easily identifiable software sub-systems.
Published: 2006
Full Text: View/download PDF
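
The abstract above ranks keywords within identifiers by frequency and by inverse frequency but does not give the formulas. The sketch below is therefore an assumption-laden illustration rather than the authors' scheme: identifiers are split on camelCase and underscores, "frequency" is the raw keyword count within a cluster, and "inverse" weights that count by log(number of clusters / number of clusters containing the keyword), in the style of TF-IDF. The names `keywords` and `rank_labels` are hypothetical.

```python
import math
import re
from collections import Counter


def keywords(identifier):
    """Split an identifier such as 'readConfigFile' or 'read_config' into lowercase keywords."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [p.lower() for p in parts]


def rank_labels(clusters, scheme="frequency", top=3):
    """Map each cluster name to its `top` highest-ranked keywords.

    clusters: dict of cluster name -> list of identifiers in that cluster.
    scheme:   "frequency" (raw count) or "inverse" (count weighted by rarity across clusters).
    """
    counts = {
        name: Counter(k for ident in idents for k in keywords(ident))
        for name, idents in clusters.items()
    }
    n_clusters = len(counts)
    # Number of clusters in which each keyword appears at least once.
    spread = Counter(k for c in counts.values() for k in c)

    ranked = {}
    for name, c in counts.items():
        if scheme == "frequency":
            score = dict(c)
        elif scheme == "inverse":
            score = {k: f * math.log(n_clusters / spread[k]) for k, f in c.items()}
        else:
            raise ValueError(f"unknown scheme: {scheme!r}")
        # Highest score first; ties are broken alphabetically for determinism.
        ranked[name] = sorted(score, key=lambda k: (-score[k], k))[:top]
    return ranked


if __name__ == "__main__":
    example = {
        "io": ["readFile", "writeFile", "openFile"],
        "net": ["openSocket", "closeSocket", "readSocket"],
    }
    print(rank_labels(example, "frequency", top=2))
    print(rank_labels(example, "inverse", top=2))
```

In this toy input, keywords shared by both clusters (such as `read` and `open`) score zero under the inverse weighting, which illustrates why such a scheme can favor more distinctive labels.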

41. An improved cluster labeling method for support vector clustering

Author: Jaewook Lee and Daewon Lee
Subjects: business.industry, Computer science, Applied Mathematics, Information Storage and Retrieval, Reproducibility of Results, Support vector clustering, Numerical Analysis, Computer-Assisted, Pattern recognition, Sensitivity and Specificity, Pattern Recognition, Automated, Support vector machine, Kernel (linear algebra), Computational Theory and Mathematics, Artificial Intelligence, Robustness (computer science), Cluster labeling, Cluster Analysis, Unsupervised learning, Computer Vision and Pattern Recognition, Artificial intelligence, business, Cluster analysis, Algorithms, Software
Abstract: The support vector clustering (SVC) algorithm is a recently emerged unsupervised learning method inspired by support vector machines. One key step involved in the SVC algorithm is the cluster assignment of each data point. A new cluster labeling method for SVC is developed based on some invariant topological properties of a trained kernel radius function. Benchmark results show that the proposed method outperforms previously reported labeling techniques.
Published: 2005
Full Text: View/download PDF

42. An Automatic Labeling of K-means Clusters based on Chi-Square Value

Author: R Kusumaningrum and Farikhin
Subjects: History, Computer science, business.industry, k-means clustering, Pattern recognition, Document clustering, Computer Science Applications, Education, ComputingMethodologies_PATTERNRECOGNITION, Cluster labeling, Numeric data, Chi-square test, Cluster (physics), Artificial intelligence, Cluster analysis, business
Abstract: Automatic labeling methods in text clustering are widely implemented. However, there are limited studies in automatic cluster labeling for numeric data points. Therefore, the aim of this study is to develop a novel automatic cluster labeling of numeric data points that utilize analysis of Chi-Square test as its cluster label. We performed K-means clustering as a clustering method and disparity of Health Human Resources as a case study. The result shows that the accuracy of cluster labeling is about 89.14%.
Published: 2017
Full Text: View/download PDF

43. Cluster Labeling for the Blogosphere

Author: Claus Steuer, Patrick Hennig, Christia Wuerz, Christoph Meinel, and Philipp Berger
Subjects: Focus (computing), Information retrieval, Computer science, business.industry, Blogosphere, computer.software_genre, Hierarchical clustering, ComputingMethodologies_PATTERNRECOGNITION, Cluster labeling, Encyclopedia, Hierarchical organization, The Internet, Data mining, Cluster analysis, business, computer
Abstract: Hierarchical cluster labeling helps users to quickly understand and analyze hierarchical clusters. It may be used to enhance search engine results or interactive browsing, as in the Blog Intelligence application. The hierarchical organization of data helps to represent different levels of detail. Hierarchical clustering may be quite common, but there are few good solutions for labeling those clusters. We decided to focus this work on labeling binary hierarchical clusters. Current approaches focus either on statistical features of the clustered documents or on external sources like Wikipedia. We combined those ideas to profit from both advantages and created an algorithm that can handle clustered documents as well as terms.
Published: 2014
Full Text: View/download PDF

44. Automatic Unsupervised Bug Report Categorization

Author: Nachai Limsettho, Hideaki Hata, Akito Monden, and Kenichi Matsumoto
Subjects: Topic model, Equations, string matching, Computer science, Process (engineering), topic modeling, pattern clustering, Logistics, unsupervised learning, Machine learning, computer.software_genre, Mathematical model, Labeling, automatic unsupervised bug report categorization, Cluster analysis, Accuracy, Structure (mathematical logic), cluster labeling, labeling process, business.industry, Supervised learning, text analysis, program debugging, Vectors, Categorization, Cluster labeling, Unsupervised learning, Artificial intelligence, business, computer, clustering, textual similarity, automated bug report categorization
Abstract: Background: Information in bug reports is implicit and therefore difficult to comprehend. To extract its meaning, some processing is required. Categorizing bug reports is a technique that can help in this regard. It can be used to help in bug report management or to understand the underlying structure of the desired project. However, most research in this area focuses on a supervised learning approach that still requires a lot of human effort to prepare training data. Aims: Our aim is to develop an automated framework that can categorize bug reports according to their hidden characteristics and structures, without the need for training data. Method: We solve this problem using clustering, an unsupervised learning approach. It can automatically group bug reports together based on their textual similarity. We also propose a novel method to label each group with meaningful and representative names. Results: Experimental results show that our framework can achieve performance comparable to the supervised learning approaches. We also show that our labeling process can label each cluster with representative names according to its characteristics. Conclusion: Our framework could be used as an automated categorization system that can be applied without prior knowledge or as an automated labeling suggestion system. 2014 6th International Workshop on Empirical Software Engineering in Practice, 12-13 Nov. 2014, Osaka, Japan
Published: 2014
Full Text: View/download PDF

45. Trajectory clustering for motion pattern extraction in aerial videos

Author: Bernhard Rinner, Tahir Nawaz, and Andrea Cavallaro
Subjects: Discrete wavelet transform, Computer science, business.industry, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Pattern recognition, Aerial video, Object (computer science), Motion (physics), ComputingMethodologies_PATTERNRECOGNITION, Cluster labeling, Computer vision, Artificial intelligence, business, Cluster analysis
Abstract: We present an end-to-end approach for trajectory clustering from aerial videos that enables the extraction of motion patterns in urban scenes. Camera motion is first compensated by mapping object trajectories on a reference plane. Then clustering is performed based on statistics from the Discrete Wavelet Transform coefficients extracted from the trajectories. Finally, motion patterns are identified by distance minimization from the centroids of the trajectory clusters. The experimental validation on four datasets shows the effectiveness of the proposed approach in extracting trajectory clusters. We also make available two new real-world aerial video datasets together with the estimated object trajectories and ground-truth cluster labeling.
Published: 2014
Full Text: View/download PDF

46. Automatic cluster labeling through Artificial Neural Networks

Author: Ricardo A. L. Rabelo, Lucas Araújo Lopes, and Vinicius Ponte Machado
Subjects: Artificial neural network, Discretization, Time delay neural network, Computer science, business.industry, Competitive learning, Deep learning, Semi-supervised learning, Machine learning, computer.software_genre, ComputingMethodologies_PATTERNRECOGNITION, Cluster labeling, Unsupervised learning, Artificial intelligence, Data mining, Types of artificial neural networks, Cluster analysis, Intelligent control, business, computer
Abstract: The clustering problem has been considered one of the most important problems in the research area of unsupervised learning (a Machine Learning subarea). Although many researchers have focused on the development and improvement of algorithms that deal with this problem, the main goal remains undefined: the understanding of generated clusters. Understanding the meaning of clusters is as important as identifying them. A good cluster definition means a relevant understanding and can help the specialist to study or interpret data. Facing the problem of comprehending clusters - in other words, creating labels - this paper presents a methodology for automatically labeling clusters based on techniques involving supervised and unsupervised learning plus a discretization model. Considering the problem from its inception, the problem of understanding clusters is treated like a real problem, starting from clustering data. For this, an unsupervised learning technique is applied and then a supervised learning algorithm detects which attributes are relevant for defining a specific cluster. Additionally, some strategies are used to create a methodology that presents a label (based on attributes and their values) for each cluster provided. Finally, this methodology is applied to four distinct databases, presenting good results with an average above 88.79% of elements correctly labeled.
Published: 2014
Full Text: View/download PDF

47. A hybrid approach to improving clustering accuracy using SVM

Author: Abdun Naser Mahmood, Zubair Shah, and Abdul K. Mustafa
Subjects: Clustering high-dimensional data, business.industry, Correlation clustering, k-means clustering, Pattern recognition, computer.software_genre, ComputingMethodologies_PATTERNRECOGNITION, CURE data clustering algorithm, Cluster labeling, Canopy clustering algorithm, Sequential minimal optimization, Data mining, Artificial intelligence, Cluster analysis, business, computer, Mathematics
Abstract: Support Vector Machines (SVMs) have been used in many areas such as regression, classification and novelty detection due to their accuracy and generalizability. Recently, SVMs have been proposed for clustering analysis as well. Support Vector Clustering (SVC) works by finding the minimum enclosing sphere of data points using SVM training. SVC is a boundary-based clustering method, where the support information is used to construct cluster boundaries. In support vector-based clustering algorithms, the main computational bottleneck is the high cluster labeling time for each data point. In addition, in many cases labeled data is not available for use with SVC. This tends to restrict the scalability of the method and results in decreased efficiency. It also decreases the applicability of the SVC method to real-life datasets, most of which do not have any class labels. In this paper we present a technique that uses SVM to improve the accuracy of clustering without the need for a labeled dataset. We use the K-Means clustering algorithm to generate initial labels from the data, and in the next step we train a Sequential Minimal Optimization (SMO) classifier on these labels. The original data set is then tested using the trained SMO classifier to improve classification accuracy. This process is continued iteratively and stops when further improvement is not possible. The proposed approach is compared against the popular Stephen Winters-Hilt [1] approach and achieves 94% accuracy when applied to benchmark datasets.
Published: 2013
Full Text: View/download PDF

48. Computational methods for evaluation of cell-based data assessment—Bioconductor

Author: Nolwenn Le Meur, Physiopathologie et pharmacologie cellulaires et moléculaires, and Université de Nantes (UN)-IFR26-Institut National de la Santé et de la Recherche Médicale (INSERM)
Subjects: Quality Control, Standardization, Computer science, Data management, Statistics as Topic, Biomedical Engineering, Bioengineering, Bioinformatics, computer.software_genre, Bioconductor, 03 medical and health sciences, Automation, 0302 clinical medicine, Artificial Intelligence, Cluster Analysis, Cluster analysis, 030304 developmental biology, 0303 health sciences, business.industry, Computational Biology, Reproducibility of Results, Flow Cytometry, [SDV.BIBS]Life Sciences [q-bio]/Quantitative Methods [q-bio.QM], High-Throughput Screening Assays, Workflow, Cluster labeling, Anomaly detection, Data mining, business, computer, 030215 immunology, Biotechnology
Abstract: Recent advances in miniaturization and automation of technologies have enabled cell-based assay high-throughput screening, bringing along new challenges in data analysis. Automation, standardization, and reproducibility have become requirements for qualitative research. The Bioconductor community has worked in that direction, proposing several R packages to handle high-throughput data, including flow cytometry (FCM) experiments. Altogether, these packages cover the main steps of an FCM analysis workflow, that is, data management, quality assessment, normalization, outlier detection, automated gating, cluster labeling, and feature extraction. Additionally, the open-source philosophy of R and Bioconductor, which offers room for new development, continuously drives research and improvement of these analysis methods, especially in the field of clustering and data mining. This review presents the principal FCM packages currently available in R and Bioconductor, their advantages and their limits.
Published: 2013
Full Text: View/download PDF

49. A New Perspective of Support Vector Clustering with Boundary Patterns

Author: Huina Li, Yuan Ping, Yong Zhang, and Zhili Zhang
Subjects: Support vector machine, Matrix (mathematics), Optimization problem, business.industry, Computer science, Correlation clustering, Cluster labeling, Boundary (topology), Pattern recognition, Artificial intelligence, Function (mathematics), business, Cluster analysis
Abstract: To overcome the costly computation required by a redundant kernel function matrix and poor labeling performance, we present, from a novel perspective, support vector clustering with boundary patterns (abbreviated BPSVC) for efficiency. In the first phase, the conventional method of estimating the support vector function with the whole data is replaced by one that uses only essential boundary patterns. Hence, BPSVC only needs to solve a much simpler optimization problem. In the second phase, cluster labeling, both the convex decomposition and the cone cluster labeling methods are employed in an ensemble labeling strategy for further improvements in accuracy and efficiency. Both theoretical analysis and experimental results show its superiority in comparison with state-of-the-art methods, especially for large-scale data analysis.
Published: 2013
Full Text: View/download PDF

50. Text Document Topical Recursive Clustering and Automatic Labeling of a Hierarchy of Document Clusters

Author: Jiyang Chen, Xiaoxiao Li, and Osmar R. Zaïane
Subjects: Hierarchy, Measure (data warehouse), ComputingMethodologies_PATTERNRECOGNITION, Information retrieval, Betweenness centrality, Computer science, Cluster labeling, Text document, Document clustering, Cluster analysis, Social network analysis
Abstract: The overwhelming amount of textual documents available nowadays highlights the need for information organization and discovery. Effectively organizing documents into a hierarchy of topics and subtopics makes it easier for users to browse the documents. This paper borrows community mining from social network analysis to generate a hierarchy of topically coherent document clusters. It focuses on giving the document clusters descriptive labels. We propose to use the betweenness centrality measure in networks of co-occurring terms to label the clusters. We also incorporate keyphrase extraction and automatic titling in cluster labeling. The results show that the cluster labeling method utilizing KEA to extract keyphrases from the documents generates the best labels overall compared to other methods and baselines.
Published: 2013
Full Text: View/download PDF
