Back to Search Start Over

On the Design and Applicability of Distance Functions in High-Dimensional Data Space.

Authors :
Chih-Ming Hsu
Ming-Syan Chen
Source :
IEEE Transactions on Knowledge & Data Engineering. Apr2009, Vol. 21 Issue 4, p523-536. 14p. 7 Charts, 5 Graphs.
Publication Year :
2009

Abstract

Effective distance functions in high-dimensional data space, such as those used for clustering, nearest neighbor search, and indexing, are very important in solutions for many data mining problems. Recent research has shown that if the Pearson variation of the distance distribution converges to zero with increasing dimensionality, the distance function will become unstable (or meaningless) in high-dimensional space, even with the commonly used Lp, metric in the euclidean space. However, the necessary condition for unstability of a distance function, which is required for function design, remains unknown. In this paper, we shall prove that the following conditions are equivalent to unstability: 1) the Pearson variation of the distance distribution for any given query converges to 0 with increasing dimensionality, 2) the ratio of the distances of any two points to th query converges in probability to 1 with increasing dimensionality, and 3) the second moment coefficient converges to 1 with increasing dimensionality. With these results, we have the necessary and the sufficient conditions for unstability, whose negatives imply the sufficient and necessary conditions for stability. Based on these theoretical results, we employ some effective and valid indices for testing the stability of a distance function. In addition, this theoretical analysis inspires us that unstable phenomena are rooted in variation of the distance distribution. To demonstrate the theoretical results, we design a meaningful distance function, called the Shrinkage-Divergence Proximity (SDP), based on a given distance function. It is shown empirically that the SDP significantly outperforms other measures in terms of stability in high-dimensional data space, and is thus more suitable for distance-based applications. [ABSTRACT FROM AUTHOR]

Details

Language :
English
ISSN :
10414347
Volume :
21
Issue :
4
Database :
Academic Search Index
Journal :
IEEE Transactions on Knowledge & Data Engineering
Publication Type :
Academic Journal
Accession number :
37146030
Full Text :
https://doi.org/10.1109/TKDE.2008.178