CSE 250B LECTURE NOTES

October 30, 2006


ANNOUNCEMENT

The second project is due today.  Look for the third project description on the web site.  The new project will ask you to apply the perceptron algorithm to a subset of the Netflix contest data.  I'll talk more about the contest and its data at the end of today's lecture.

THE KERNEL TRICK

[Adapted from Wikipedia.]  The kernel trick is a method for converting a linear classifier learning algorithm into a non-linear one, by mapping the original observations into a higher-dimensional non-linear space so that linear classification in the new space is equivalent to non-linear classification in the original space.

The kernel trick transforms any algorithm that solely depends on dot products between pairs of vectors.  Wherever a dot product is used, it is replaced with a so-called kernel function K(x, y)  that is the dot product in a transformed space φ(x) :  K(x,y) = \varphi(x)\cdot\varphi(y).

The linear algorithm is then operating in the range space of φ.  However, because kernels are used, the φ function is never explicitly computed.  This is desirable, because the high-dimensional space may be infinite-dimensional.

The kernel trick was first published in 1964 but not explored until the 1990s with support vector machines.


KERNELIZING THE VOTED PERCEPTRON

Remember the algorithm:
n <- 1
w_1 <- 0
c_1 <- 0
Repeat for T epochs:
       for i = 1 to m                 (this is one epoch)
                if <x_i, y_i> is classified correctly then increment c_n
                else
                        increment n
                        w_n  <-  w_(n-1) + y_i*x_i
                        c_n   <-  1

Each vector w_n can be represented as a sum of data points   w_n = SUM_j y_j x_j   for j in J some subset.   So w_n.x = SUM_j y_j x_j.x.
If we have a kernel function K, we just write   K(w_n, x)  =  SUM_j y_j K(x_j, x).

Given m training points and one pass through the training data (T=1), how many kernel calculations do we need to do for a test point?  There could be up to m different vectors w_n, each of which could be the sum of a subset of up to m terms y_j*x_j!  But, each K(w_n,x) = K(w_n-1, x) + delta.  So, we only need to at most m kernel evaluations.

Suppose we use more than one epoch of training.  There are still only m different possible values for K(x_j, x).


THE POLYNOMIAL KERNEL

Suppose data points x live in d-dimensional Euclidean space.  Suppose we want to re-represent them with every quadratic combination of features, i.e.
phi(x)  =  < 1, c*x_1, ..., c*x_d, ... x_1^2, ..., x_d^2, c*x_1*x_2, ..., c*x_d-1*x_d >   where the constant c = sqrt(2)
This will let us learn separating surfaces that are quadratic.

What is the kernel function K for this phi?  Let's do the algebra for d=2.
phi(x)  =  < 1, cx_1, cx_2, x_1^2, x_2^2, cx_1x_2 >
phi(x) . phi(z)  =  1 + 2x_1z_1 + 2x_2z_2 + (x_1*z_1)^2 + (x_2*z_2)^2 + 2 ...)  =  (1 + x.z)^2

The same method works if we want to re-represent x using all products of length 3, 4, etc.


STRING KERNELS

We can define kernels for data points that are not real-valued vectors, for example sequences of varying lengths.  As long as phi(x) is a fixed-length vector, x need not be.

Let x be a string and let phi_n(x) be the number of times string number n appears as a substring in x.  This is well-defined given any total ordering of all strings in the alphabet.

How do we compute K(x,y)?  For each substring of x, count how often it appears in y.  Add up these counts.  We can do this in O(|x|*|y|) time using dynamic programming, where |y| is the length of string y.


MERCER'S THEOREM

Intuitively, a kernel is a similarity function.  Can we just use any similarity function and know it is a genuine kernel?  Not quite.  Mathematically, Mercer's theorem says when a function K is a kernel.

Theorem:  [James Mercer, 1909] Any function K(x, y) that is symmetric non-negative definite can be expressed as a dot product in a high-dimensional space.  Formally, if
\sum_{i,j} K(x_i,x_j) c_i c_j \ge 0

for every finite subset {x1, ..., xn} of X and every subset {c1, ..., cn} of real numbers, then there exists a function φ(x) such that   K(x,y) = \varphi(x)\cdot\varphi(y).

Note:  A symmetric positive definite matrix has positive eigenvalues.


GAUSSIAN KERNELS

Define K(x,y) = exp [- || x - y ||^2/s^2 ].   This is called a "radial basis function" with parameter s.  It is a variant of a Gaussian.

This function K is positive semi-definite so we can use it as a kernel.


CLOSURE PROPERTIES

Given two kernels we can make new ones by adding and multiplying them with each other, and by adding/multiplying scalar constants.
Also raising to a positive integer power, and exponentiating because it is the limit of a series of polynomials.

Also normalized kernels K(x,y)/sqrt( K(x,x) K(y,y) )

What is a good kernel?  Partial answers:  A bad kernel gives a diagonal kernel matrix.  A good kernel divides the data into disjoint subspaces, each of which has the same label.


SUPPORT VECTORS

See Sanjoy Dasgupta's kernel lecture notes.  Consider the "perceptron in action" picture.  There are 10 training points but only two actually enter into the final classifier:  w = x_1 - x_7.  These two training points can be called "support vectors.'"

Now see the picture on page 2 titled "Quadratic kernel."  Only three training points determine the decision surface (i.e. the ellipse, which is a quadratic curve).  Again these can be called the "support vectors."

If we use a Gaussian kernel, what are the support vectors?  Answer:  They are the centers of the radial basis functions.  The training algorithm automatiically chooses how many centers to use!


DIGRESSION: THE NETFLIX CONTEST

See http://www.netflixprize.com/

Here, each training example is a customer.  The data for each customer is his/her rating of each movie, so the dimensionality is the number of different movies.

One approach to the Netflix task is to view predicting ratings for each movie as a separate binary classification task (thumbs up or down).  For each movie, the relevant training data is different.  It is all customers who have provided a rating for this movie in the past.  Although the total number of Netflix training examples (i.e. customers) is very large, the number for each movie is only how many people have rated that movie.  This number may be small, i.e. smaller than the dimensionality.

The perceptron algorithm seems suitable for this task precisely because the proof of its convergence is independent of the dimensionality of the data.