Random Features for Large Scale Machine Learning

Ali Rahimi ()
with Benjamin Recht and Joseph Bradley
(last modified 3 Oct, 2009)


What are Random Features?

Random Features are a trick to speed up supervised learning algorithms so they can operate on big (millions of data points) data sets. Traditional learning algorithms work too hard because they optimize parameters that don't don't actually need to be optimized. Instead, one can randomize over some of the parameters and quickly optimize over the rest. Thanks to the concentration of measure phenomenon, this still provides excellent accuracy. This recipe makes for accurate learning algorithms that run extremely fast and are very easy to implement.

 % Training
w = randn(D, size(X,1));
Z = exp(i*w*X);
alpha = (eye(size(X,2)*lambda+Z*Z')\(Z*y);

 % testing
ztest = alpha(:)'*exp(i*w*xtest(:));
It's easy to implement learning algorithms that use Random Feature. Training with Random Feature learner that approximates the Gaussian kernel takes three lines of MATLAB code. Evaluating the resulting predictor takes one line of MATLAB code.

Random Features as an Alternative to Boosting

In prediction problems, we are given training data (inputs and their corresponding outputs ), and are asked to construct a function that returns a guess for the that corresponds to its argument . Many learning algorithms represent their predictor as a weighted sums of simpler functions:

.

The parameters of this model are the weights and the function parameters . Kernel Machines (like the Kernelized Support Vector Machine), Neural Networks, and Adaboost all returns model of this form. Let's focus on Adaboost, a greedy algorithm that approximately finds good settings for these parameters. After iterations, Adaboost produces a function consisting of terms that is guaranteed to get to within of the optimal solution (which may have infinitely many terms):

It takes a lot of effort for Adaboost to compute : for each of its iterations it has to touch the entire dataset (variants that touch only a subset of the dataset also exist. Some comparisons appear below). Furthermore, its solution is not necessarily optimal (finding the best parameters in general is actually NP-hard an inapproximable to within any constant).

The idea behind Random Features is to pick randomly, in batch, and to solve for the exactly via a simple batch convex optimization. This is much simpler to code up, it runs about 1000 times faster than Adaboost, and bounds on its convergence rate match that of Adaboost.

What Sampling Distribution and What Features Should I Use?

There are many ways to pick good random features (that is, the functions and a distribution from which to sample the parameters ). I like deriving random features that approximate the popular positive definite kernels that are used in Kernel Machines. This lets us efficiently leverage the performance of state of the art Kernel Machines on very large data sets. The following section explains how.

Random Features as an Alternative to the Kernel Trick

Kernel machines (such as the Kernelized Support Vector Machine) are state-of-the-art architectures for classifying data, but training them is slow. They can rarely be trained with more than ~50,000 training examples in practice. To scale training to millions of examples, we utilize a linear version of the machine, which can easily scale to millions of examples. To do this, we pass the data through random features, and train a linear machine on the featurized data. The random features are designed to ensure that the classifier generated in this way is approximately the same as the classifier returned by the original (slow-to-train) kernel machine.

While the kernel trick lets us do linear things in infinite-dimensional space without an infinite amount of work, it still incurs a high computational cost. Random Features provide a finite-dimensional alternative to the Kernel Trick by instead mapping the data to an equivalent randomized feature space. For a given shift-invariant kernel we can find a -dimensional random function so that

Some Experiments

Comparison of testing error and training time between ridge regression with random features, Core Vector Machine, and various state-of-the-art exact methods reported in the literature. For classification tasks, the percent of testing points incorrectly predicted is reported. For regression tasks, the RMS error normalized by the norm of the ground truth is reported.

Comparisons between Random Kitchen Sinks and Adaboosted decision stumps on adult (first row), activity (second row), and KDDCUP99 (third row). The first column plots test error of each classifier as a function of T, the number of random features. The accuracy of Random Kitchen Sinks catches up to that of Adaboost as T grows. The second column plots the total training and testing time as a function of T. For a given T, Random Kitchen Sinks is between two and three orders of magnitude faster than Adaboost. The third column combines the previous two columns. It plots testing+training time required to achieve a desired error rate. For a given error rate, Random Kitchen Sinks is between one and three orders of magnitude faster than Adaboost.
Comparison between Random Kitchen sinks with Fourier features on an activity recognition experiment. We attached a lot of sensors to "Dave" (accelerometers, gyros, barometers, light sensors, etc), and built a classifier to determine whether he's walking.
This experiment is due to Joseph Bradley, who was kind enough to spend a summer at Intel. Comparison between Random Kitchen sinks and boosting by filtering on MNIST. These are handwritten digits 0-9, with 60,000 training, and 10,000 test examples.
This experiment is due to Joseph Bradley, who was kind enough to spend a summer at Intel. Comparison between Random Kitchen sinks and boosting by filtering on a database of faces. Joseph assembled 20,514 face images (24x24 pixels) from 3 different face databases. It also contained 550 larger background images (896x592 pixels) for testing.

Links

Random Kitchen Sinks: Replacing Optimization with Randomization in Learning, Ali Rahimi, Ben Recht,
in Neural Information Processing Systems (NIPS) 2008. (pdf)

Uniform Approximation of Functions with Random Bases, Ali Rahimi, Ben Recht,
in Allerton 2008. (pdf)

Random Features for Large-Scale Kernel Machines, Ali Rahimi, Ben Recht,
in Neural Information Processing Systems (NIPS) 2007. (pdf)

Here is MATLAB code for Random Features. And some datasets and sample runs that appear in the above paper. There may be bugs in the code. If you have trouble getting state-of-the-art classification/regression results with this code, please drop me a note and I'll help debug.