The importance of data mining and management grows as vast amounts of data are accumulated. However, traditional data mining algorithms, such as clustering and locality-sensitive hashing, do not scale well to large and complex data sets. This is primarily because a custom similarity function, known as the “kernel function”, is used to compute distances between data points with complex data types and non-linear relationships. To use kernelization, the original data mining algorithm must be re-formulated so that it accesses the data only through inner products between the given data points. Unfortunately, such a re-formulation comes at the cost of expensive operations such as an eigen-decomposition of a large similarity matrix. In this work, we show how uniform sampling among the given data points can be used to address the high computational complexity of kernelized data mining algorithms. In particular, we focus on three major algorithms used in data mining and retrieval: kernel k-means, kernel principal component analysis (KPCA), and locality-sensitive hashing (LSH).
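For concreteness, here is a minimal sketch (our illustration, not part of the paper's algorithms) of the inner-product reformulation described above, with an RBF kernel as an assumed choice; materializing the full n × n Gram matrix is the bottleneck that motivates sampling.

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel: an inner product in an implicit feature space."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def kernel_matrix(X, kernel=rbf_kernel):
    """Full n x n similarity (Gram) matrix -- the O(n^2) object that
    kernelized algorithms naively build and decompose."""
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = kernel(X[i], X[j])
    return K
```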
For kernel k-means clustering, we use uniform sampling to compute a (1 + n^{-δ})-approximation of the per-iteration cost with complexity O(n^{1+δ}), for any δ ∈ (0, 1).
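A plausible sketch of where uniform sampling enters (our illustration; the paper's exact estimator may differ): the kernel distance from a point to a cluster mean expands into averages of kernel evaluations over cluster members, and each average can be estimated from s uniformly sampled members, with s on the order of n^δ.

```python
import numpy as np

def assign_points(X, clusters, kernel, s, rng=np.random.default_rng()):
    """One approximate kernel k-means assignment step.
    ||phi(x) - mu_C||^2 = K(x,x) - 2 E[K(x,Y)] + E[K(Y,Z)], Y,Z ~ Uniform(C);
    both expectations are estimated from s sampled members (a sketch; s
    would scale like n**delta to trade accuracy for speed)."""
    samples, pair_terms = [], []
    for C in clusters:  # C: array of points in one cluster
        idx = rng.choice(len(C), size=min(s, len(C)), replace=False)
        S = C[idx]
        samples.append(S)
        # E[K(Y, Z)] is shared by every query point: estimate it once per cluster
        pair_terms.append(np.mean([kernel(y, z) for y in S for z in S]))
    labels = []
    for x in X:
        costs = [kernel(x, x) - 2 * np.mean([kernel(x, y) for y in S]) + p
                 for S, p in zip(samples, pair_terms)]
        labels.append(int(np.argmin(costs)))
    return labels
```

Each assignment then costs O(s) kernel evaluations per point instead of a full pass over the cluster, which is the intuition behind the O(n^{1+δ}) per-iteration bound.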
For KPCA, we reduce the complexity to O(kn^{1+δ} + k^3), where k is the number of principal components, and prove that the reduced-size problem we solve is spectrally equivalent to the original KPCA problem.
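One standard way to realize such a sampling-based, reduced-size eigenproblem is a Nystrom-style construction; the sketch below assumes uniform landmark sampling and is illustrative rather than the paper's exact scheme.

```python
import numpy as np

def nystrom_kpca(X, kernel, m, k, rng=np.random.default_rng()):
    """Nystrom-style sketch of sampling-based KPCA: eigen-decompose an
    m x m landmark matrix instead of the full n x n kernel matrix."""
    n = len(X)
    idx = rng.choice(n, size=m, replace=False)  # uniform landmark sample
    L = X[idx]
    K_nm = np.array([[kernel(x, l) for l in L] for x in X])  # n x m
    K_mm = K_nm[idx]                                         # m x m
    w, V = np.linalg.eigh(K_mm)                 # small eigenproblem only
    w, V = w[::-1][:k], V[:, ::-1][:, :k]       # top-k eigenpairs
    # approximate projections of all n points onto the k principal directions
    return K_nm @ V / np.sqrt(np.maximum(w, 1e-12))
```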
For the LSH algorithm we present, we address the additional issue of distribution-sensitivity, where query time and accuracy vary depending on the underlying distribution of the data. We show that Voronoi-partitioning the data set around centers chosen uniformly at random yields stable and fast query times while maintaining high accuracy with fewer hash tables. We also show that our algorithm satisfies the locality-sensitive property.
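A minimal sketch of the Voronoi-partitioning idea under an assumed Euclidean distance (function and parameter names are ours): index points by their nearest uniformly sampled center and, at query time, scan only the few nearest cells.

```python
import numpy as np
from collections import defaultdict

def build_voronoi_index(X, num_centers, rng=np.random.default_rng()):
    """Partition the data into Voronoi cells around uniformly sampled centers."""
    centers = X[rng.choice(len(X), size=num_centers, replace=False)]
    buckets = defaultdict(list)
    for i, x in enumerate(X):
        buckets[int(np.argmin(np.linalg.norm(centers - x, axis=1)))].append(i)
    return centers, buckets

def query(q, X, centers, buckets, probes=2):
    """Scan only the cells of the `probes` nearest centers."""
    cells = np.argsort(np.linalg.norm(centers - q, axis=1))[:probes]
    candidates = [i for c in cells for i in buckets[int(c)]]
    return min(candidates, key=lambda i: np.linalg.norm(X[i] - q))
```

Since centers are drawn from the data itself, dense regions receive more cells, which gives an intuition for why query time can remain stable across data distributions.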
Through extensive experiments, we confirm that our algorithms are ve...