
Öbek analizi algoritmaları (Cluster analysis algorithms)

Authors :
Altun, Muhammet
Ercengiz, Ali
Mühendislik Bilimleri Ana Bilim Dalı (Department of Engineering Sciences)
Publication Year :
1998
Publisher :
Fen Bilimleri Enstitüsü (Graduate School of Natural and Applied Sciences), 1998.

Abstract

ÖZET (translated from Turkish)

Cluster analysis ("öbek analizi") is a generic name for a wide range of procedures that can be used to construct a classification. These procedures empirically form clusters, or groups, of highly similar objects. More specifically, a cluster analysis method is a multivariate statistical procedure that starts from a data set containing information about a sample of objects and attempts to reorganize those objects into relatively homogeneous groups. Many studies have been carried out and many algorithms developed for cluster analysis. The goal of these algorithms is to obtain the clusters of a given set X = {x1, x2, ..., xn}. Their basic inputs are the set X and a relation among its elements (a distance or similarity measure). Their outputs should, in principle, be the number of clusters and the clusters themselves, although this does not hold for every algorithm in the literature. The algorithms fall into two categories: in the first, the number of clusters is supplied as an input; in the second, the algorithm itself determines the number of clusters. In this thesis, the cluster analysis algorithms in the literature are surveyed, and the most important and most frequently used ones are presented. Computer programs were also written for these algorithms, and the results they produce are given as screen outputs.

SUMMARY

CLUSTER ANALYSIS ALGORITHMS

The practice of classifying objects according to perceived similarities is the basis for much of science. Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. Cluster analysis is the formal study of algorithms and methods for grouping, or classifying, objects.
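To make the inputs described above concrete, here is a minimal sketch (not from the thesis; the function names and the choice of Euclidean distance are illustrative assumptions) of the basic input of most clustering algorithms: a data set X and a pairwise distance relation on its elements.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def distance_matrix(X):
    """n x n matrix of pairwise distances: the basic input, together
    with X itself, of most clustering algorithms."""
    n = len(X)
    return [[euclidean(X[i], X[j]) for j in range(n)] for i in range(n)]

X = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
D = distance_matrix(X)
```

The matrix is symmetric with a zero diagonal; a similarity measure could be used in its place, as the abstract notes.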
An object is described either by a set of measurements or by its relationships to other objects. Cluster analysis does not use category labels that tag objects with prior identifiers. This absence of category labels is what distinguishes cluster analysis from discriminant analysis (and from pattern recognition and decision analysis). The objective of cluster analysis is simply to find a convenient and valid organization of the data, not to establish rules for separating future data into categories. Clustering algorithms are geared toward finding structure in the data. A cluster is comprised of a number of similar objects collected or grouped together. Everitt (1974) documents the following definitions of a cluster:

1. "A cluster is a set of entities which are alike, and entities from different clusters are not alike."
2. "A cluster is an aggregation of points in the test space such that the distance between any two points in the cluster is less than the distance between any point in the cluster and any point not in it."
3. "Clusters may be described as connected regions of a multi-dimensional space containing a relatively high density of points, separated from other such regions by a region containing a relatively low density of points."

The last two definitions assume that the objects to be clustered are represented as points in the measurement space. We recognize a cluster when we see it in the plane, although it is not clear how we do so. While it is easy to give a functional definition of a cluster, it is very difficult to give an operational definition of one. This is because objects can be grouped into clusters with different purposes in mind. Data can reveal clusters of differing shapes and sizes. To compound the problem further, cluster membership can change over time, as is the case with star clusters (Dewdney, 1986), and the number of clusters often depends on the resolution (fine versus coarse) with which we view the data.
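Everitt's second definition is operational enough to check directly. The sketch below (illustrative code, not from the thesis; the names are assumptions) tests whether a candidate group satisfies it: every within-group distance must be smaller than every distance from a group member to an outside point.

```python
import math

def satisfies_definition_2(cluster, outside):
    """True if every within-cluster distance is smaller than every
    distance from a cluster point to a point outside the cluster
    (Everitt's second definition of a cluster)."""
    within = [math.dist(p, q)
              for i, p in enumerate(cluster) for q in cluster[i + 1:]]
    between = [math.dist(p, q) for p in cluster for q in outside]
    if not within or not between:
        return True  # trivially satisfied for singletons or empty outside
    return max(within) < min(between)
```

A tight pair far from everything else satisfies the definition; a group that mixes points from different regions does not.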
Figure 1 illustrates some of these concepts for two-dimensional point clusters. How many clusters are there in Figure 1? At the global level, or a higher similarity threshold, we perceive four clusters in these data, but at the local level, or a lower similarity threshold, we perceive twelve clusters. Which answer is correct? Looking at the data at multiple scales may actually help in analyzing its structure. Thus the crucial problem in identifying clusters in data is to specify what proximity is and how to measure it. As is to be expected, the notion of proximity is problem dependent.

Figure 1: Clusters of point patterns in two dimensions.

Clustering techniques offer several advantages over a manual grouping process. First, a clustering program can apply a specified objective criterion consistently to form the groups. Human beings are excellent cluster seekers in two, and often in three, dimensions, but different individuals do not always identify the same clusters in data. The proximity measure defining similarity among objects depends on an individual's educational and cultural background. Thus it is quite common for different human subjects to form different groups in the same data, especially when the groups are not well separated. Second, a clustering algorithm can form the groups in a fraction of the time required by a manual grouping, particularly if a long list of descriptors or features is associated with each object. The speed, reliability, and consistency with which a clustering algorithm organizes data together constitute an overwhelming reason to use one. A clustering algorithm relieves a scientist or data analyst of the treacherous job of "looking" at a pattern matrix or a similarity matrix to detect clusters. A data analyst's time is better spent analyzing or interpreting the results provided by a clustering algorithm.
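The resolution dependence discussed above can be sketched with a toy single-linkage-style grouping (an illustrative sketch, not code from the thesis): points closer than a threshold t are connected, connected components are the clusters, and so the cluster count changes with t, just as the perceived number of clusters changes with the similarity threshold.

```python
import math

def clusters_at_threshold(points, t):
    """Count connected components when points closer than t are linked
    (union-find over all pairs). A coarse t merges groups; a fine t
    splits them, mirroring the global-versus-local reading of the data."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) < t:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(n)})
```

On two tight pairs, a fine threshold yields four singleton clusters, an intermediate one yields two clusters, and a coarse one merges everything into one.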
Clustering is also useful in implementing the "divide and conquer" strategy to reduce the computational complexity of various decision-making algorithms in pattern recognition. For example, the nearest-neighbor decision rule is a popular technique in pattern recognition (Duda and Hart, 1973). However, finding the nearest neighbor of a test pattern can be very time consuming if the number of training patterns or prototypes is large. Fukunaga and Narendra (1975) used the well-known partitional clustering algorithm ISODATA to decompose the patterns, and then, in conjunction with the branch-and-bound method, obtained an efficient algorithm to compute nearest neighbors. Similarly, Fukunaga and Short (1978) used clustering for problem localization, whereby a simple decision rule can be implemented in local regions, or clusters, of the pattern space. The applications of clustering continue to grow. Consider the problem of grouping various colleges and universities in the United States to illustrate the factors in clustering problems. Schools can be clustered based on their geographical location, size of the student body, size of the campus, tuition fees, or the offering of various professional graduate programs. The factors depend on the goal of the analysis. The shapes and sizes of the clusters formed will depend on which particular attribute is used in defining the similarity between colleges. Interesting and challenging clustering problems arise when several attributes are taken together to construct clusters. One cluster could represent private, midwestern, primarily liberal arts colleges with fewer than 1000 students, and another could represent large state universities. The features or attributes mentioned so far can easily be measured. What about attributes such as quality of education, quality of faculty, and quality of campus life, which cannot be measured easily?
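The speedup principle attributed to Fukunaga and Narendra can be sketched as follows. This is an illustrative reconstruction of the pruning idea only, not their ISODATA-plus-branch-and-bound algorithm, and all names are assumptions: summarize each cluster of prototypes by a centroid and radius, then skip any cluster that provably cannot contain a closer neighbor than the best found so far.

```python
import math

def nearest_neighbor_pruned(query, clusters):
    """Nearest-neighbor search over a clustered prototype set.
    A cluster whose centroid distance minus its radius already exceeds
    the best distance found cannot contain the nearest neighbor, so it
    is skipped without examining its members."""
    best, best_d = None, float("inf")
    # Summarize each cluster by (centroid, radius, members).
    summaries = []
    for pts in clusters:
        cx = [sum(col) / len(pts) for col in zip(*pts)]
        r = max(math.dist(cx, p) for p in pts)
        summaries.append((cx, r, pts))
    # Visit nearest centroids first so pruning triggers early.
    summaries.sort(key=lambda s: math.dist(query, s[0]))
    for cx, r, pts in summaries:
        if math.dist(query, cx) - r >= best_d:
            continue  # no member of this cluster can beat the current best
        for p in pts:
            d = math.dist(query, p)
            if d < best_d:
                best, best_d = p, d
    return best, best_d
```

With many clusters, most prototype comparisons are avoided; the answer is identical to a brute-force scan.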
One can poll alumni or a panel of experts to get a numerical score (on a scale of, say, 1 to 10) for these factors; the scores or similarities must then be averaged over all respondents, because individual opinions differ. One can also measure subjective attributes indirectly. For example, faculty excellence in a graduate program can be estimated from the number of professional papers written and the number of Ph.D. degrees awarded. The example above illustrates the difference between decision making and clustering. Suppose that we want to partition computer science graduate programs in the United States into two categories based on such attributes as size of faculty, computing resources, external research support, and faculty publications. In the decision-making paradigm, an "expert" must first define the two categories by identifying some computer science programs from each category (these are the training samples, in pattern recognition terminology). The attributes of these training samples are used to construct decision boundaries (or simply thresholds on attribute values) that separate the two types of programs. Once the decision boundary is available, the remaining computer science programs (those that were not labeled by the expert) are assigned to one of the two categories. In the clustering paradigm, no expert is available to define the categories. The objective is to determine whether a two-category partition of the data, based on the given attributes, is reasonable, and if so, to determine the memberships of the two clusters. This can be achieved by forming similarities between all pairs of computer science graduate programs based on the given attributes, and then constructing groups such that the within-group similarities are larger than the between-group similarities.
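The within-group versus between-group criterion in the last sentence can be sketched directly (illustrative code, not from the thesis; distances here stand in for inverse similarities, so the inequality is reversed relative to similarities).

```python
import math

def partition_is_reasonable(group_a, group_b):
    """Accept a two-cluster partition when the average within-group
    distance is smaller than the average between-group distance
    (equivalently: within-group similarities exceed between-group ones)."""
    def pairs(pts):
        return [(p, q) for i, p in enumerate(pts) for q in pts[i + 1:]]

    within = [math.dist(p, q) for g in (group_a, group_b) for p, q in pairs(g)]
    between = [math.dist(p, q) for p in group_a for q in group_b]
    mean = lambda xs: sum(xs) / len(xs)
    return mean(within) < mean(between)
```

A partition that respects the two natural groups passes the check; swapping members across groups makes it fail.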
Cluster analysis is one component of exploratory data analysis, which means sifting through data to make sense of measurements by whatever means are available. The information gained about a set of data from a cluster analysis should prod one's creativity, suggest new experiments, and provide fresh insight into the subject matter. The modern digital computer makes all this possible. Cluster analysis is a child of the computer revolution and frees the analyst from time-honored statistical models and procedures conceived when the human brain was aided only by pencil and paper. The development of clustering methodology has been truly interdisciplinary: researchers in almost every area of science have contributed, including statisticians, social scientists, and engineers. I. J. Good (1977) has suggested the new name "botryology" for the discipline of cluster analysis, from the Greek word for a cluster of grapes.

In this thesis, the algorithms of cluster analysis and their mathematical bases are presented. The algorithms are examined in two steps. In the first step, we examine the algorithms for which the number of clusters is given as an input. These algorithms can be further classified into hierarchical and nonhierarchical algorithms; the nonhierarchical algorithms find the clusters by optimizing a given objective function. Among these, the "hard c-means" and "fuzzy c-means" methods are the ones most used by researchers. The chart below describes the first-step algorithms.

Figure 2: A classification of the algorithms. ALGORITHMS branches into Hierarchical Algorithms (Single Linkage, Complete Linkage, Average Linkage) and Nonhierarchical Algorithms (Hard C-Means, Fuzzy C-Means).

In the second phase of the research, algorithms for finding the number of clusters are examined. These algorithms can also be classified into hierarchical and nonhierarchical algorithms. As before, the hierarchical algorithms are examined through three methods: "Single Linkage", "Complete Linkage", and "Average Linkage".
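As a concrete instance of the first-step nonhierarchical algorithms, here is a minimal hard c-means (k-means) sketch. It is an illustrative implementation of the standard method, not the thesis's program; the Euclidean distance and the random initialization from data points are assumptions.

```python
import math
import random

def hard_c_means(X, c, iters=50, seed=0):
    """Hard c-means: alternate between assigning each point to its
    nearest center and recomputing each center as the mean of its
    members. The number of clusters c is an input, as in the
    first category of algorithms described above."""
    rng = random.Random(seed)
    centers = rng.sample(X, c)  # initialize centers at c distinct data points
    groups = [[] for _ in range(c)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        groups = [[] for _ in range(c)]
        for x in X:
            k = min(range(c), key=lambda i: math.dist(x, centers[i]))
            groups[k].append(x)
        # Update step: each center becomes the mean of its group.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers, groups
```

On two well-separated point clouds, the algorithm recovers the natural partition from essentially any initialization.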
On the other hand, the nonhierarchical algorithms are based on the optimization of validity functionals. To obtain executable algorithms, "fuzzy c-means" is used as the first-step algorithm. In this thesis, fuzzy sets and fuzzy partition spaces are studied as well. After execution of all the algorithms, a fuzzy partition matrix is produced as the result.
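The fuzzy partition matrix mentioned above can be illustrated with a minimal fuzzy c-means sketch (Bezdek-style alternating updates; this is an illustrative implementation under assumed Euclidean distance and fuzzifier m = 2, not the thesis's program). Each column of U sums to 1, expressing a point's graded membership across all clusters rather than a hard assignment.

```python
import math
import random

def fuzzy_c_means(X, c, m=2.0, iters=100, seed=0):
    """Fuzzy c-means. Returns cluster centers and the fuzzy partition
    matrix U, where U[i][k] is the membership of point k in cluster i.
    Membership update: u_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1)).
    Center update: v_i = weighted mean of X with weights u_ik^m."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(X, c)]
    U = [[0.0] * len(X) for _ in range(c)]
    for _ in range(iters):
        # Membership update (columns of U sum to 1 by construction).
        for k, x in enumerate(X):
            d = [max(math.dist(x, v), 1e-12) for v in centers]
            for i in range(c):
                U[i][k] = 1.0 / sum((d[i] / d[j]) ** (2.0 / (m - 1))
                                    for j in range(c))
        # Center update: means weighted by memberships raised to m.
        for i in range(c):
            w = [U[i][k] ** m for k in range(len(X))]
            centers[i] = [sum(wk * x[dim] for wk, x in zip(w, X)) / sum(w)
                          for dim in range(len(X[0]))]
    return centers, U
```

Unlike hard c-means, borderline points receive intermediate memberships instead of being forced into a single cluster.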

Details

Language :
Turkish
Database :
OpenAIRE
Accession number :
edsair.od.....10208..afa133d3458b4ae5e003289cebb681c9