Quality-based distance measures and applications to clustering

Authors :: Edward R. Dougherty
Darin Taverna
Yi Chen
Marcel Brun
Source :: 2006 IEEE/NLM Life Science Systems and Applications Workshop.
Publication Year :: 2006
Publisher :: IEEE, 2006.
Abstract: When analyzing biological data sets, a common approach is to partition the data into clusters. Examples include finding a subset of genes with co-regulated expression across experiments, grouping similar disease phenotypes, and implicating regions of genetic variation in disease. The ability to separate the data into subsets depends on the structure of the distribution of points and on the choice of clustering algorithm. Furthermore, the biological relevance of the clustering results is biased by the variation among the data points themselves. We introduce a mathematical quality-based distance metric that allows all data points, regardless of their error, to be included in the analysis without introducing a cutoff. This removes the need to exclude points or to change the dimensionality. The advantage of this approach is shown by clustering simulated data with added noise.
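
The abstract does not give the form of the metric, so the following is only a minimal sketch of the general idea, not the authors' definition. It assumes that every measurement carries a quality score between 0 and 1, and that a coordinate contributes to the distance in proportion to the product of the two scores. The function name `quality_distance`, the quality scores, and the weighting rule are all hypothetical.

```python
import math


def quality_distance(x, y, qx, qy):
    """Quality-weighted Euclidean distance (illustrative assumption only)."""
    if not (len(x) == len(y) == len(qx) == len(qy)):
        raise ValueError("all inputs must have the same length")
    # Hypothetical weighting: a coordinate counts only as much as
    # both of its measurements are trusted.
    weights = [a * b for a, b in zip(qx, qy)]
    total = sum(weights)
    if total == 0:
        raise ValueError("no coordinate has nonzero quality in both points")
    # Normalize by the total weight so low-quality coordinates are
    # down-weighted rather than dropped by a hard cutoff.
    weighted_sq = sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y))
    return math.sqrt(weighted_sq / total)


# Two points that agree except in a third coordinate.
x = [1.0, 2.0, 100.0]
y = [1.0, 2.0, 0.0]

# Every measurement fully trusted.
trusted = quality_distance(x, y, [1, 1, 1], [1, 1, 1])
# The outlying third measurement of x is marked as low quality.
noisy = quality_distance(x, y, [1, 1, 0.01], [1, 1, 1])
```

Under these assumptions the low-quality coordinate stays in the computation but pulls the two points apart far less, so a clustering algorithm that uses this distance needs neither to discard the point nor to drop the dimension.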