CSE 250B LECTURE NOTES
October 30, 2006
ANNOUNCEMENT
The second project is due today. Look for the third project
description on the web site. The new project will ask you to
apply the perceptron algorithm to a subset of the Netflix contest data.
I'll talk more about the contest and its data at the end of today's lecture.
THE KERNEL TRICK
[Adapted from Wikipedia.] The kernel trick is a method for converting a linear classifier
learning algorithm into a non-linear one, by mapping the original observations
into a higher-dimensional non-linear space so that linear
classification in the new space is equivalent to non-linear
classification in the original space.
The kernel trick transforms any algorithm that solely depends
on dot products between pairs of vectors. Wherever a dot
product is used, it is
replaced with a so-called kernel function K(x, y) that is the dot product in a transformed space φ(x) : 
The linear algorithm is then operating in the range space of
φ. However, because kernels are used, the φ function is never
explicitly computed. This is desirable, because the high-dimensional
space may be infinite-dimensional.
The kernel trick was first published in 1964 but not explored until the 1990s with support vector machines.
KERNELIZING THE VOTED PERCEPTRON
Remember the algorithm:
n <- 1
w_1 <- 0
c_1 <- 0
Repeat for T epochs:
for i = 1 to m
(this is one
epoch)
if <x_i, y_i> is classified correctly then
increment c_n
else
increment n
w_n
<- w_(n-1) + y_i*x_i
c_n
<- 1
Each vector w_n can be represented as a sum of data points w_n =
SUM_j y_j x_j for j in J some subset. So w_n.x = SUM_j
y_j x_j.x.
If we have a kernel function K, we just write K(w_n, x) = SUM_j y_j K(x_j, x).
Given m training points and one pass through the training data (T=1),
how many kernel calculations do we need to do for a test point?
There could be up to m different vectors w_n, each of which
could be the sum of a subset of up to m terms y_j*x_j! But,
each K(w_n,x) = K(w_n-1, x) + delta. So, we only need to at
most m kernel evaluations.
Suppose we use more than one epoch of training. There are still only m different possible values for K(x_j, x).
THE POLYNOMIAL KERNEL
Suppose data points x live in d-dimensional Euclidean space.
Suppose we want to re-represent them with every quadratic
combination of features, i.e.
phi(x) = < 1, c*x_1,
..., c*x_d, ... x_1^2, ..., x_d^2, c*x_1*x_2, ..., c*x_d-1*x_d >
where the constant c = sqrt(2)
This will let us learn separating surfaces that are quadratic.
What is the kernel function K for this phi? Let's do the algebra for d=2.
phi(x) = < 1, cx_1, cx_2, x_1^2, x_2^2, cx_1x_2 >
phi(x) . phi(z) = 1 + 2x_1z_1 + 2x_2z_2 + (x_1*z_1)^2 + (x_2*z_2)^2 + 2 ...) = (1 + x.z)^2
The same method works if we want to re-represent x using all products of length 3, 4, etc.
STRING KERNELS
We can define kernels for data points that are not real-valued vectors,
for example sequences of varying lengths. As long as phi(x) is a
fixed-length vector, x need not be.
Let x be a string and let phi_n(x) be the number of times string
number n appears as a substring in x. This is well-defined given
any total ordering of all strings in the alphabet.
How do we compute K(x,y)? For each substring of x, count how
often it appears in y. Add up these counts. We can do this
in O(|x|*|y|) time using dynamic programming, where |y| is the length
of string y.
MERCER'S THEOREM
Intuitively, a kernel is a similarity function. Can we just use
any similarity function and know it is a genuine kernel? Not
quite. Mathematically, Mercer's theorem says when a function K is
a kernel.
Theorem: [James Mercer, 1909] Any function K(x, y) that is symmetric non-negative definite can be expressed as a dot product in a high-dimensional space. Formally, if

for every finite subset {x1, ..., xn} of X and every subset {c1, ..., cn} of real numbers, then there exists a function φ(x) such that 
Note: A symmetric positive definite matrix has positive eigenvalues.
GAUSSIAN KERNELS
Define K(x,y) = exp [- || x - y ||^2/s^2 ]. This is called a
"radial basis function" with parameter s. It is a variant of a
Gaussian.
This function K is positive semi-definite so we can use it as a kernel.
CLOSURE PROPERTIES
Given two kernels we can make new ones by adding and multiplying them
with each other, and by adding/multiplying scalar constants.
Also raising to a positive integer power, and exponentiating because it is the limit of a series of polynomials.
Also normalized kernels K(x,y)/sqrt( K(x,x) K(y,y) )
What is a good kernel? Partial answers: A bad kernel gives
a diagonal kernel matrix. A good kernel divides the data into
disjoint subspaces, each of which has the same label.
SUPPORT VECTORS
See Sanjoy Dasgupta's kernel lecture notes.
Consider the "perceptron in action" picture. There are 10
training points but only two actually enter into the final classifier:
w = x_1 - x_7. These two training points can be called
"support vectors.'"
Now see the picture on page 2 titled "Quadratic kernel." Only
three training points determine the decision surface (i.e. the ellipse,
which is a quadratic curve). Again these can be called the
"support vectors."
If we use a Gaussian kernel, what are the support vectors?
Answer: They are the centers of the radial basis functions.
The training algorithm automatiically chooses how many centers to
use!
DIGRESSION: THE NETFLIX CONTEST
See http://www.netflixprize.com/
Here, each training example is a customer. The data for each
customer is his/her rating of each movie, so the dimensionality is the
number of different movies.
One approach to the Netflix task is to view predicting ratings for each
movie as a separate binary classification task (thumbs up or
down). For each movie, the relevant training data is
different. It is all customers who have provided a rating for
this movie in the past. Although the total number of
Netflix training examples (i.e. customers) is very large, the
number for each movie is only how many people have rated that movie.
This number may be small, i.e. smaller than the dimensionality.
The perceptron algorithm seems suitable for this task precisely because
the proof of its convergence is independent of the dimensionality of
the data.