1. Randomized Near Neighbor Graphs, Giant Components, and Applications in Data Science
- Author
Linderman, George C., Mishne, Gal, Kluger, Yuval, and Steinerberger, Stefan
- Subjects
FOS: Computer and information sciences, Discrete Mathematics (cs.DM), Statistics - Machine Learning, Computer Science - Data Structures and Algorithms, Probability (math.PR), FOS: Mathematics, Mathematics - Combinatorics, Data Structures and Algorithms (cs.DS), Machine Learning (stat.ML), Combinatorics (math.CO), Mathematics - Probability, Computer Science - Discrete Mathematics
- Abstract
If we pick $n$ random points uniformly in $[0,1]^d$ and connect each point to its $k$ nearest neighbors, then it is well known that there exists a giant connected component with high probability. We prove that in $[0,1]^d$ it suffices to connect every point to $c_{d,1} \log{\log{n}}$ points chosen randomly among its $c_{d,2} \log{n}$ nearest neighbors to ensure a giant component of size $n - o(n)$ with high probability. This construction yields a much sparser random graph, with $\sim n \log\log{n}$ instead of $\sim n \log{n}$ edges, that has comparable connectivity properties. This result has nontrivial implications for problems in data science where an affinity matrix is constructed: instead of picking the $k$ nearest neighbors, one can often pick $k' \ll k$ random points out of the $k$ nearest neighbors without sacrificing efficiency. This can massively simplify and accelerate computation; we illustrate this with several numerical examples.
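The construction described in the abstract can be sketched numerically as follows. This is a minimal illustration, not the authors' code: the function names `random_near_neighbor_graph` and `largest_component_fraction` are hypothetical, and because the abstract does not give values for the constants $c_{d,1}$ and $c_{d,2}$, the value 2 used for both in the demo is purely an assumption. The sketch samples uniform points in $[0,1]^d$, links each point to $k'$ neighbors drawn at random from its $k$ nearest neighbors, and measures the share of points in the largest connected component.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components


def random_near_neighbor_graph(points, k, k_prime, rng):
    """Link each point to k_prime random picks among its k nearest neighbors.

    Returns a sparse (n, n) matrix with one nonzero entry per directed edge.
    """
    n = points.shape[0]
    tree = cKDTree(points)
    # Ask for k + 1 neighbors: the closest hit is the query point itself
    # (assuming no duplicate points, which holds almost surely here).
    _, idx = tree.query(points, k=k + 1)
    neighbors = idx[:, 1:]
    # Per row, a random permutation of the k columns; keeping the first
    # k_prime gives a sample without replacement.
    cols = np.argsort(rng.random((n, k)), axis=1)[:, :k_prime]
    chosen = np.take_along_axis(neighbors, cols, axis=1)
    rows = np.repeat(np.arange(n), k_prime)
    data = np.ones(n * k_prime, dtype=np.int8)
    return coo_matrix((data, (rows, chosen.ravel())), shape=(n, n)).tocsr()


def largest_component_fraction(graph):
    """Fraction of vertices in the largest component (edges read as undirected)."""
    _, labels = connected_components(graph, directed=False)
    return np.bincount(labels).max() / graph.shape[0]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 5000, 2
    points = rng.random((n, d))
    k = int(np.ceil(2 * np.log(n)))                 # assumed c_{d,2} = 2
    k_prime = int(np.ceil(2 * np.log(np.log(n))))   # assumed c_{d,1} = 2
    sparse_graph = random_near_neighbor_graph(points, k, k_prime, rng)
    full_graph = random_near_neighbor_graph(points, k, k, rng)
    print("edges (sparse, full):", sparse_graph.nnz, full_graph.nnz)
    print("largest component fraction (sparse):",
          largest_component_fraction(sparse_graph))
```

Setting `k_prime = k` recovers the ordinary $k$-nearest-neighbor graph, so the two graphs can be compared directly on edge count and on the size of the largest component.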
- Published
- 2017