)
Random Features are a trick to speed up supervised learning algorithms so they can operate on big (millions of data points) data sets. Traditional learning algorithms work too hard because they optimize parameters that don't don't actually need to be optimized. Instead, one can randomize over some of the parameters and quickly optimize over the rest. Thanks to the concentration of measure phenomenon, this still provides excellent accuracy. This recipe makes for accurate learning algorithms that run extremely fast and are very easy to implement.
% Training w = randn(D, size(X,1)); Z = exp(i*w*X); alpha = (eye(size(X,2)*lambda+Z*Z')\(Z*y); % testing ztest = alpha(:)'*exp(i*w*xtest(:)); |
In prediction problems, we are given training data (inputs
and their corresponding outputs
), and are asked to construct a function
that
returns a guess for the
that corresponds to its argument
. Many learning algorithms represent their predictor as a
weighted sums of simpler functions:
.
The parameters of this model are the weights
and
the function parameters
. Kernel Machines (like the
Kernelized Support Vector Machine), Neural Networks, and Adaboost all
returns model of this form. Let's focus on Adaboost, a greedy
algorithm that approximately finds good settings for these
parameters. After
iterations, Adaboost produces a function
consisting of
terms that is guaranteed to get
to within
of the optimal
solution (which may have infinitely many terms):

It takes a lot of effort for Adaboost to compute
: for
each of its
iterations it has to touch the entire dataset
(variants that touch only a subset of the dataset also exist. Some
comparisons appear below). Furthermore, its solution is not
necessarily optimal (finding the best parameters in general is
actually NP-hard an inapproximable to within any constant).
The idea behind Random Features is to pick
randomly, in batch, and to solve for the
exactly via a simple batch convex optimization.
This is much simpler to code up, it runs about 1000 times
faster than Adaboost, and bounds on its convergence rate match that of
Adaboost.
There are many ways to pick good random features (that is, the
functions
and a distribution from which to sample the
parameters
). I like deriving random features that
approximate the popular positive definite kernels that are used in
Kernel Machines. This lets us efficiently leverage the performance
of state of the art Kernel Machines on very large data sets. The
following section explains how.
Kernel machines (such as the Kernelized Support Vector Machine) are state-of-the-art architectures for classifying data, but training them is slow. They can rarely be trained with more than ~50,000 training examples in practice. To scale training to millions of examples, we utilize a linear version of the machine, which can easily scale to millions of examples. To do this, we pass the data through random features, and train a linear machine on the featurized data. The random features are designed to ensure that the classifier generated in this way is approximately the same as the classifier returned by the original (slow-to-train) kernel machine.
![]() |
While the kernel trick lets us do linear things in
infinite-dimensional space without an infinite amount of work, it
still incurs a high computational cost. Random Features provide a
finite-dimensional alternative to the Kernel Trick by instead
mapping the data to an equivalent randomized feature space. For a
given shift-invariant kernel
we can find a
-dimensional random function
so that
|
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
![]() |
|
|
|
Random Kitchen Sinks: Replacing Optimization with
Randomization in Learning,
Ali Rahimi, Ben Recht,
in Neural Information Processing Systems (NIPS) 2008.
(pdf)
Uniform Approximation of Functions with Random Bases,
Ali Rahimi, Ben Recht,
in Allerton 2008.
(pdf)
Random Features for Large-Scale Kernel Machines,
Ali Rahimi, Ben Recht,
in Neural Information Processing Systems (NIPS) 2007.
(pdf)
Here is MATLAB code for Random Features. And some datasets and sample runs that appear in the above paper. There may be bugs in the code. If you have trouble getting state-of-the-art classification/regression results with this code, please drop me a note and I'll help debug.