"High-dimensional Computational Geometry"
Piotr Indyk
Stanford University
Abstract
Consider the following problem:
given a database of points in a multidimensional space, construct a data structure
which, given any query point, finds the database point(s) closest to it.
This problem,
called nearest neighbor search,
has been of great importance to several areas of computer science, including
pattern recognition, databases, vector compression, computational statistics, and data mining.
Many of the above applications involve data sets whose size and dimensionality are very large.
Therefore, it is crucial to design algorithms which scale well with the database size as well as with the dimension.
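For concreteness, the trivial solution is a linear scan over the database, which costs O(nd) time per query for n points in d dimensions; the minimal Python sketch below (function name ours) is exactly this baseline, which the data structures studied in this thesis improve upon.

    import math

    def nearest_neighbor(database, q):
        # Brute-force linear scan: O(n*d) time per query for n points in
        # d dimensions; sublinear-time data structures aim to beat this.
        return min(database, key=lambda p: math.dist(p, q))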
The nearest neighbor problem is an example of a large class of proximity problems.
Apart from the nearest neighbor, this class contains problems such as closest pair, diameter (or furthest pair), minimum spanning tree, and clustering problems.
In the latter case, the goal is to partition the points into k clusters so as to minimize a certain objective function.
Example objectives are: the sum of the distances from each point to its nearest cluster representative (this problem is called k-median), the maximum such distance (k-center), the sum of all distances between points in the same cluster (k-clustering), etc.
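In symbols (notation ours: P denotes the point set, C a set of k cluster representatives, P_1, ..., P_k a partition of P into k clusters, and d(.,.) the underlying metric), these objectives are:

    k-median:      minimize sum_{p in P} min_{c in C} d(p,c)  over all C with |C| = k
    k-center:      minimize max_{p in P} min_{c in C} d(p,c)  over all C with |C| = k
    k-clustering:  minimize sum_j sum_{p,q in P_j} d(p,q)     over all partitions P_1, ..., P_k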
Since these problems are ubiquitous, they have been investigated in computer science for a long time (e.g., in computational geometry).
As a result of this research effort, many efficient solutions have been discovered for the case when the points lie in a space of small dimension.
Unfortunately, the running time of these solutions grows exponentially with the dimension.
In this thesis we show that it is in fact possible to obtain efficient algorithms for the aforementioned problems, provided we are satisfied with answers that are approximate.
The running time of our algorithms has only
polynomial dependence on the dimension, and sublinear (for the nearest
neighbor problem) or subquadratic (for the closest pair, minimum spanning tree, clustering, etc.) dependence on the number of input points.
The highlights of the results include:
- Several algorithms for the (1+epsilon)-approximate near neighbor problem, where the goal is to find a database point within a pre-specified distance from the query point.
In particular, for the Hamming metric, we give an algorithm with O(d n^{1/(1+epsilon)}) time per query/insertion/deletion (where we use n to denote the number of data points and d to denote the dimensionality), as well as an algorithm with
O(d log n/epsilon^2) query time but large storage space; a schematic sketch of the hashing scheme behind the first bound appears after this list.
For the maximum (i.e., l_infty) norm, we give an algorithm which achieves an
O(log_{1+delta} log d)-approximation for any delta>0, with
O(d log n) query time and O(d n^{delta}) update time; in particular, this gives
a 3-approximate data structure using slightly superpolynomial storage
of size O(n^{log d+1}).
- Efficient reductions to dynamic approximate near neighbor from a variety of problems, including nearest neighbor, furthest neighbor, closest pair, minimum spanning tree, facility location, and bottleneck matching.
In particular, the sublinear-time algorithms for the near neighbor problem described above, together with the reductions, yield sublinear-time algorithms for nearest and furthest neighbor, as well as subquadratic-time algorithms for the remaining problems.
- Efficient embeddings into the Hamming metric and l_infty from other metrics, including the Euclidean and Manhattan norms and Hausdorff metrics.
This allows us to extend the aforementioned results to these metrics as well; a sketch of one such embedding appears after this list.
- Applications of the techniques used to obtain the above results, yielding efficient algorithms for problems outside the scope of ``high-dimensional computational geometry''.
These problems include: computing statistics for a stream of data using small space, finding spanning trees crossing few hyperplanes in low dimensions, and approximate pattern matching.
- Subquadratic-time approximation algorithms for certain problems defined over general metrics, including 1-median, k-median, and 2-clustering.
For example, the algorithm for k-median runs in (roughly) O(nk) time.
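
As mentioned in the first highlight above, the Hamming-metric bound is obtained via locality-sensitive hashing.
The following Python sketch (class, method, and parameter names are ours, and the parameter settings are indicative rather than tuned) shows the bit-sampling form of the idea: each of L hash tables indexes every point by k randomly chosen bit positions, so that points close in Hamming distance collide far more often than distant ones.

    import random

    class HammingLSH:
        # Schematic bit-sampling LSH for d-bit points (tuples of 0s and 1s).
        # With k on the order of log n sampled bits per table and L on the
        # order of n^{1/(1+epsilon)} tables, a database point within
        # distance r of the query collides with it in some table with
        # constant probability.

        def __init__(self, d, k, L, seed=0):
            rng = random.Random(seed)
            # Each hash function projects a point onto k random bit positions.
            self.projections = [rng.sample(range(d), k) for _ in range(L)]
            self.tables = [{} for _ in range(L)]

        def _key(self, point, proj):
            return tuple(point[i] for i in proj)

        def insert(self, point):
            for table, proj in zip(self.tables, self.projections):
                table.setdefault(self._key(point, proj), []).append(point)

        def query(self, q, r, eps):
            # Inspect only a bounded number of colliding candidates so the
            # query stays sublinear; accept any candidate within (1+eps)*r.
            inspected = 0
            for table, proj in zip(self.tables, self.projections):
                for p in table.get(self._key(q, proj), []):
                    if sum(a != b for a, b in zip(p, q)) <= (1 + eps) * r:
                        return p
                    inspected += 1
                    if inspected >= 3 * len(self.tables):
                        return None
            return None

Deletion (omitted here for brevity) simply removes the point from each of the L tables, which is what makes the structure dynamic.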
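As for the embeddings highlight: the simplest instance, shown below for illustration (the function name and the assumption of integer coordinates in {0, ..., M} are ours), maps the Manhattan (l_1) metric isometrically into the Hamming cube by writing each coordinate in unary; the Euclidean case is less direct and first requires a randomized low-distortion embedding into l_1.

    def unary_embed(point, M):
        # Coordinate x in {0, ..., M} becomes x ones followed by M - x zeros.
        # Two unary codes 1^x 0^(M-x) and 1^y 0^(M-y) differ in exactly
        # |x - y| positions, so l_1 distance equals Hamming distance.
        bits = []
        for x in point:
            bits.extend([1] * x + [0] * (M - x))
        return tuple(bits)

For example, unary_embed((2, 0), 3) = (1, 1, 0, 0, 0, 0) and unary_embed((0, 1), 3) = (0, 0, 0, 1, 0, 0), whose Hamming distance is 3 = |2-0| + |0-1|.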