# Modernization Hub

## Lecture 14 – Support Vector Machines

ANNOUNCER: The following program is brought to you by Caltech.

YASER ABU-MOSTAFA: Welcome back. Last time, we talked about validation,
which is a very important technique in machine learning for estimating the
out-of-sample performance. And the idea is that we start from
the data set that is given to us, that has N points. We set aside K points for validation,
for just estimation, and we train with the remaining points, N
minus K. Because we are training with a subset, we end up with a hypothesis
that we are going to label g minus, instead of g. And it is on this g minus,
that we are going to get an estimate of the out-of-sample
performance by the validation error. And then there is a leap of faith, when
we put back all the examples in the pot in order to come up with the best
possible hypothesis– to work with the most training examples. We are going to get g, and we are using
the validation error we had on the reduced hypothesis, if you will,
to estimate the out-of-sample performance on the hypothesis
we are actually delivering. And there is a question of how
accurate an estimate this would be for E_out. And we found out that K cannot be too
small, and cannot be too big, in order for this estimate to be reliable. And we ended up with a rule of thumb
that about 20% of the data set goes to validation. That will give you
a reasonable estimate. Now this was an unbiased estimate of E_out. So E_val can come out better than E_out or worse
than E_out in general, as far as E_val is used to estimate the performance
of g minus. On the other hand, once you use the
validation error for model selection, which is the main utility for
validation, you end up with a little bit of an optimistic bias, because you
chose a model that performs well on that validation error. Therefore, the validation error is not
going to necessarily be an unbiased estimate of the out-of-sample error. It will have a slight positive,
or optimistic, bias. And we showed an experiment using
very few examples, in order to exaggerate the effect, where we can see the impact of that. The blue
curve is the validation error, and the red curve is the out-of-sample
error on the same hypothesis, just to pin down the bias. And we realize that, as we increase
the number of examples, the bias goes down. The difference between the
two curves goes down. And indeed, if you have a reasonable-size
validation set, you can afford to estimate a couple of parameters for
sure, without contaminating the data too much. So you can assume that the measurement
you’re getting from the validation set is a reliable estimate. Then, because the number of examples
turned out to be an issue, we introduced cross-validation, which
is, by and large, the method of validation you’re going to be using
in a practical situation. Because it gets you the
best of both worlds. So here we are illustrating a case where we have
10-fold cross-validation. So you divide the data
set into 10 parts. You train on nine, and validate
on the tenth, and keep that estimate of the error. And you keep repeating as you
choose the validation subset to be one of those. So you have 10 runs. And each of them gives you an estimate
on a small number of examples, 1/10 of the examples. And then by the time you average all of
these estimates, that will give you a general estimate of what the out-of-sample
error would be on 9/10 of the data, in spite of the fact that they
are different 9/10 each time. And in that case, the advantage of it
is that the 9/10 is very close to 1, so the estimate you are
getting is very close. And furthermore, the number of examples
taken into consideration in getting an estimate of the validation
error is really N. You got all of them, albeit in different runs. So this is really the way to go
in cross-validation. And invariably, in any learning
situation, you will need to choose a model, a parameter, something–
to make a decision. And validation is the method of choice
in that case, in order to make that decision.

OK. So we move on to today’s lecture, which
is Support Vector Machines. Support vector machines are arguably
the most successful classification method in machine learning. And they are very nice, because
there is a principled derivation for the method. There is a very nice optimization
package that you can use in order to get the solution. And the solution also has a very
intuitive interpretation. So it’s a very, very neat piece
of work for machine learning. So the outline will be the following. We are going to introduce the notion
of the margin, which is the main notion in support vector machines. And we’ll ask a question of maximizing
the margin– getting the best possible margin. And after formulating the problem, we
are going to go and get the solution. And we’re going to do
that analytically. It will be a constrained
optimization problem. And we faced one before in
regularization, where we gave a geometrical solution, if you will. This time we are going to do it
analytically, because the formulation is simply too complicated to have
an intuitive geometric solution for. And finally, we are going to expand from
the linear case to the nonlinear case in the usual way, thus expanding
all of the machinery to a case where you can deal with nonlinear surfaces,
instead of just a line in a separable case, which is the main case
we are going to handle. So now let’s talk about
linear separation. Let’s say I have a linearly
separable data set. Just take four points, for example. There are lines that will separate
the red from the blue. Now, when you apply perceptron,
you will get a line. When you apply any algorithm, you will
get a line that separates the points, so you get 0 training error. And everything is fine. And now there is a curious point
when you ask yourself: I can get different lines. Is there any advantage of choosing
one of the lines over the other? That is the new addition
to the problem. So let’s look at it. Here is a line. So I chose this line to
separate the two. You may not think that this
is the best line. And we’ll try to take our intuition
and understand why this is not the best line. So I’m going to think of a margin,
that is, if this line moves a little bit, when is it
going to cross over? When is it going to start
making an error? So in this case, let’s put it as
a yellow region around it. That’s the margin you have. So if you choose this line, this
is the margin of error. Sort of informal notion. Now you can look at this line. And it does seem to have
a better margin. And you can now look at the problem
closely and say: let me try to get the best possible margin. And then you get this line, which has
this margin, that is exactly right for the blue and red points. Now, let us ask ourselves
the following question. Which is the best line
for classification? As far as the in-sample error is
concerned, all of them give in-sample error 0. As far as generalization questions are
concerned, as far as our previous analysis goes, all of them
are dealing with a linear model with four points. So generalization, as an estimate,
will be the same. Nonetheless, I think you will agree with
me that if you had your choice, you will choose the fat margin. Somehow it’s intuitive. So let’s ask two questions. The first one is: why is
a bigger margin better? Second one. If we are convinced that a bigger
margin is better, then you ask yourself: can I solve for w
that maximizes the margin? Now it is quite intuitive that the
bigger margin is better, because think of a process that is generating
the data. And let’s say that there
is noise in it. If you have the bigger margin, the
chances are the new point will still be on the correct side of the line. Whereas, if I use this one, there’s
a chance that the next red point will be here, and it will be misclassified. Again, I’m not giving any proofs. I’m just giving you an intuition here. So it stands to logic that indeed,
the bigger margin is better. And now we’re going to argue that the
bigger margin is better for a reason that relates to our VC
analysis before. So anybody remember the growth
function from ages ago? What was that? So we take the dichotomies of the
line on points in the plane. And let’s say, we take three points. So on three points, you can get all
possible dichotomies by a line. The blue versus not-blue region. And you can see that by varying where
the line is, I can get all possible 2 to the 3 equals 8 dichotomies here. So you know that the growth
function is big. And we know that the growth function
being big is bad news for generalization. That was our take-home lesson. So now let’s see if this is
affected by the margin. So now we are taking dichotomies, not only
the line, but also requiring that the dichotomies have a fat margin. Let’s look at dichotomies,
and their margin. Now in this case, I’m putting
the same three points. And I’m putting a line that has the
biggest possible margin for the constellation of points I have. So you can see here. I put
it. It sandwiched them. Every time, it touches all the points. It cannot extend any further because
it will get beyond the points. And when you look at it, this
is a thin margin for this particular dichotomy. This is an intermediate one. This is a fat one. And this is a hugely fat one,
but that’s the constant one. That’s not a big deal. Now let’s say that I told you that you
are allowed to use a classifier, but you have to have at least that
margin for me to accept it. So now I’m requiring the margin
to be at least something. All of a sudden, these guys that used to
be legitimate dichotomies using my model, are no longer allowed. So effectively by requiring the margin
to be at least something, I’m putting a restriction on the growth function. Fat margins imply fewer
dichotomies possible. And therefore, if we manage to separate
the points with a fat dichotomy, we can say that fat
dichotomies have a smaller VC dimension, smaller growth function than
if I didn’t restrict them at all. And, although this is all informal, we
will come at the end of the lecture to a result that estimates the
out-of-sample error based on the margin. And we will find out that indeed, when
you have a bigger margin, you will be able to achieve better out-of-sample
performance. So now that I completely and irrevocably
convinced you that the fat margins are good, let us
try to solve for them. That is, find the w that not only
classifies the points correctly, but does so with the biggest
possible margin. So how are we going to do that? Well the margin is just the distance
from the plane to a point. So I’m going to take from the data set
the point x_n, which happens to be the nearest data point to the line, that we
have used in the previous example. And the line is given by the
linear equation w transposed x equals 0. And since we’re going to use a higher
dimensional thing, I’m not going to refer to it as a line. I’m going to refer to it as a plane– hyperplane really– but
just plane for short. So we’re talking about d-dimensional
space and a hyperplane that separates the points. So we would like to estimate that. And we ask ourselves: if I give you w
and the x’s, can you plug them into a formula and give me the distance
between that plane, that is described by w, and the point x_n? I’m now taking the nearest point,
because then that distance will be the margin that I’m talking about. Now there are two preliminary
technicalities that I’m going to invoke here. And they will simplify the
analysis later on. So here is the first one. The first one is to normalize
w. What do I mean by that? For all the points in the data set,
near and far, when you take w transposed times x_n, you will get
a number that is different from 0. And indeed, it will agree with the
label y_n, because the points are linearly separable. So I can take the absolute value of this,
and claim that it’s greater than 0 for every point. Now I would like to relate w to the
margin, or to the distance. But I realize that here, there is a minor
technicality that is annoying. Let’s say that I multiply the
vector w by a million. Does the plane that I’m
talking about change? No. This is the equation of it. I can multiply by any positive number,
and I get the same plane. So the consequence of that is that any
formula that takes w and produces the margin will have to have, built
in it, scale invariance. We’ll be dividing by something that
takes out that factor that does not affect which plane I’m talking about. So I’m going to do it now, in order
to simplify the analysis later. I’m going to consider all
representations of the same plane. And I’m going to pick one where this is
normalized, by requiring that for the nearest point, the absolute value of w transposed x_n is 1. I can always do that. I can scale w up and down until
I get the closest point to have this equal to 1. There’s obviously no
loss in generality. Because in this case, this is a plane. And I have not missed any
planes by doing that. Now the quantity w transposed x_n, which is
the signal, as we talked about it, is a pretty interesting thing. So let’s look at it. I have the plane. So the plane has the signal equals 0. And it doesn’t touch any points. The points are linearly separable. Now when you get the signal
to be positive, you are moving in one direction. You hit the closest point. And then you hit more points, the
interior points, so to speak. And when you go in the other direction
and it’s negative, you hit the other points, the nearest point on the
negative side, and then the interior points which are further out. So indeed that signal actually relates
to the distance, but it’s not the Euclidean distance. It just has an order of the points,
according to which is nearest and which is furthest. But what I’d like to do, I would
like to actually get the Euclidean distance. Because I’m not comparing the
performance of this plane on different points. I’m comparing the performance of
different planes on the same point. So I have to have the same yardstick. And the yardstick I’m going to
use is the Euclidean distance. So I’m going to take this
as a constraint. And when I solve for it, I will find out
that the problem I’m now solving for is much easier to solve. And then I can get the plane. And the plane will be general
under this normalization. The second one is pure technicality. Remember that we had x being in
Euclidean space R to the d. And then we added this artificial
coordinate x_0 in order to take care of w_0 that was the threshold, if you
think of it as comparing with a number, or a bias if you think
of it as adding a number. And that was convenient just to have
the nice vector and matrix representation and so on. Now it turns out that, when you solve for
the margin, the w_1 up to w_d will play a completely different role
from the role w_0 is playing. So it is no longer convenient to
have them as the same vector. So for the analysis of support vector
machines, we’re going to pull w_0 out. So the vector w now is the
old vector w_1 up to w_d. And you take out w_0. And in order not to confuse things by calling
it w_0, because it has a different role, we are going to call
it here b, for bias. OK? So now the equation for the plane is w,
our new w, times x plus b equals 0. And there is no x_0. x_0 used to be multiplied
by b, also known as w_0. So every w you will see in this
lecture will belong to this convention. And now if you look at the normalization, this
will be the absolute value of w transposed x_n plus b, equals 1. And the plane will be w transposed
x plus b equals 0. Just a convention that will make
our math much more friendly. So these are the technicalities that
I wanted to get out of the way. Now, big box, because it’s
an important thing. It will stay with us. And then we go for computing
the distance. So now, we would like to get
the distance between x_n– we took x_n to be the nearest point, and therefore the distance
will be the margin. And we want to get the distance
from the plane. So let’s look at the geometry
of the situation. I have this as the equation
for the plane. And I have the conditions
that I talked about. This is the geometry. I have a plane. And I have a point x_n. And I’d like to estimate the distance. First statement. The vector w is perpendicular
to the plane. That should be easy enough if you have
seen any geometry before, but it’s not very difficult to argue. But remember now that the vector
w is in the X space. I’m not talking about
the weight space. I’m talking about w as you plug in
the values and you get a vector. And I’m looking at that vector
in the input space X. And I’m saying it’s perpendicular
to the plane. Why is that? Because let’s say that you
pick any two points– call them x dash and x double
dash– on the plane proper. So they are lying there. What do I know about these two points? Well, they are on the plane, so
they had better satisfy the equation of the plane. Right? So I can conclude that it must be that,
when I plug in x dash in that equation, I will get 0. And when I plug in x double
dash, I will get 0. Conclusion: If I take the difference between these
two equations, I will get w transposed times x dash minus
x double dash, equals 0. And now you can see that
good old b dropped out. And this is the reason why it has
a different treatment here. The other guys actually mattered. But the b plays a different role. So when you see an equation like
that, your conclusion is what? Your conclusion is that w, as a vector,
must be orthogonal to x dash minus x double dash, as a vector. So when you look at the plane, here is
the vector x dash minus x double dash. Let me magnify it. And this must be orthogonal
to the vector w. So the interesting thing is that we
didn’t make any restrictions on x dash and x double dash. These could be any two points
on the plane, right? So now the conclusion is that w, which is
the same w– the vector w that defines the plane, is orthogonal to
every vector on the plane. Right? Therefore, it is orthogonal
to the plane. So we got that much. We know that now w has
an interpretation. Now we can get the distance. Once you know w is orthogonal
to the plane, you probably can get the distance. Because what do we have? The distance between x_n and the plane,
and we put them here, is what? Can be computed as follows. Pick any point, one point,
on the plane. We just call it generic x. And then you take the projection of the
vector going from here to here. You project it on the direction which
is orthogonal to the plane. And that will be your distance. Right? So we just need to put the mathematics
that goes with that. So here’s the vector. And here is the other vector, which we
know is orthogonal to the plane. Now if you project this fellow on this
direction, that length will give you the distance. Now in order to get the projection,
what do you do? You get the unit vector
in the direction. So you take w, which is this vector–
could be of any length– and you normalize it by its norm. And you get a unit vector under
which the projection would be simply a dot product. So now the w hat is a shorter w,
if the norm of w happens to be bigger than 1. And what you get– you get the distance being
simply the inner product. You take the unit vector, dot that. And that is your distance. Except for one minor issue. This could be positive or negative
depending on whether w is facing x_n or facing the other direction. So in order
to get the distance proper, you need the absolute value. So we have a solution for it. Now we can write the distance as the absolute value of w hat transposed times x_n minus x. I know what the formula for w hat is, w over the norm of w, so I write it down. And now I have it in this form: 1 over the norm of w, times the absolute value of w^T x_n minus w^T x. Now this can be simplified if I add
the missing term, plus b minus b. Why is that? Can someone tell me what is w^T x plus
b, which is this quantity being subtracted here? This is the value of the equation of
the plane, for a point on the plane. So this will happen to be 0. How about this quantity, w^T x_n
plus b, for my point x_n. Well, that was the quantity
that we insisted on being 1. Remember when we normalized the w,
because w’s could go up and down. And we scaled them such that the
absolute value of this quantity is 1. So all of a sudden, this
thing is just 1. And you end up with the formula for the
distance, given that normalization, being simply 1 over the norm. That’s a pretty easy thing to do. So if you take the plane and insist on
a canonical representation of w by making this part 1 for the nearest
point, then your margin will simply be 1 over the norm of w you used. This I can use, in order now to choose
what combination of w’s will give me the best possible margin,
which is the next one. So let’s now formulate the problem. Here is the optimization
problem that resulted. We are maximizing the margin. The margin happens to
be 1 over the norm. So that is what we are maximizing. Subject to what? Subject to the fact that for the nearest
point, which happens to have the smallest value of those guys– so
the minimum over all points in the training set. I took the quantity here
and scaled w up or down in order to make that quantity 1. So I take this as a constraint. When you constrain yourself this way,
then you are maximizing 1 over the norm of w. And that is what you get. So what do we do with this? Well, this is not a friendly
optimization problem. Because if the constraints have
a minimum in them, that’s bad news. Minimum is not a nice
function to have. So what we are going to do now, we are
going to try to find an equivalent problem that is more friendly. Completely equivalent, by very
simple observations. So the first observation is that I
want to get rid of the minimum, not to mention the absolute value. That’s my biggest concern. So the first thing I notice is that the absolute value of w^T x_n plus b
happens to be equal to y_n times (w^T x_n plus b). Why is that? Well, every point is
classified correctly. I’m only considering the planes that
separate the data set correctly. And I’m choosing between them, for the
one that maximizes the margin. Because they are classifying the points
correctly, it has to be that the signal agrees with the label. Therefore when you multiply, the label is
just +1 or -1, and therefore it takes care of the absolute value part. So now I can use this instead
of the absolute value. I still haven’t gotten
rid of the minimum. And I don’t particularly like dividing
1 over the norm, which has a square root in it. But that is very easily handled. Instead of maximizing 1 over the norm,
I’m going to minimize this friendly quantity, a quadratic one: one half w transposed w. I’m minimizing now. So maximizing 1 over the norm is the same as
minimizing that. Everybody sees that it’s equivalent. So now we can see. Does anybody see quadratic programming
coming up on the horizon? There’s our quadratic formula. The only thing I need to do is just have
the constraints being friendly constraints, not a minimum
and absolute value. Just inequality constraints
that are linear in nature. And I claim that you can do this by
simply taking: subject to y_n times (w^T x_n plus b) greater than or equal to 1, for all n. The y_n doesn’t bother me, because I
already established that it deals with the absolute value. But here, I’m taking greater than
or equal to 1 for all points. I can see that if the minimum
is 1, then this is true. But it is conceivable that I do this
optimization, and I end up with a quantity for which all of these guys happen
to be strictly greater than 1. That is a feasible point, according
to the constraints. And if this by any chance gives me the
minimum, then that is the minimum I’m going to get. And the problem with that is that this
is a different statement from the statement I made here. That’s the only difference. Well, is it possible that the minimum
will be achieved at a point where this is greater than 1 for all of them? A simple observation tells you: no,
this is impossible. Because let’s say that you
got that solution. You tell me: this is the minimum I
can get for w transposed w, right? And I got it for values where this
is strictly greater than 1. Then what I’m going to do, I’m going
to ask you: give me your solution. And I’m going to give you
a better solution. What am I going to do? I’m going to scale w and b
proportionately down until they touch the 1. You have a slack, right? So I can just pull all of them, just
slightly, until one of them touches 1. Now under those conditions, definitely,
if the original constraints were satisfied, the new
constraints will be satisfied. All of them are just proportional. I can pull out the factor, which
is a positive factor. And indeed, if this is the case, this
will be the case for the other one. And the point is that the w I got is
smaller than yours because I scaled them down, right? So it must be that my solution
is better than yours. Conclusion: When you solve this, the w that you will
get necessarily satisfies these with at least one of those
guys with equality. Which means that the minimum is 1. And therefore, this problem is
equivalent to this problem. This is really very nice. So we started from a concept, and geometry,
and simplification, and now we end up with this very friendly
statement that we are going to solve. And when you solve it, you’re going to
get the separating plane with the best possible margin. So let’s look at the solution. Formally speaking, let’s put it in
a constrained optimization question. The constrained optimization here– you
minimize this objective function subject to these constraints. We have seen those. And the domain you’re working on,
w happens to be in the Euclidean space R to the d. b happens to be a scalar, belongs
to the real numbers. That is the statement. Now when you have a constrained
optimization– we have a bunch of constraints here. And we will need to go an analytic
route in order to solve it. Geometry won’t help us very much. So what we’re going to do here, we are
going to ask ourselves: oh, constrained optimization. I heard of Lagrange. You form a Lagrangian, and then all of
a sudden the constrained problem becomes unconstrained, and you solve it, and
you get the multipliers lambda. Lambda is pretty much what we got
in regularization before. We did it geometrically. We didn’t do it explicitly
with Lagrange. But that’s what you get. Now the problem here is that the
constraints you have are inequality constraints, not equality constraints. That changes the game a little
bit, but just a little bit. Because what people did is simply look
at these and realize that there is a slack here. If I call the slack s squared,
I can make this equality. And then I can solve the old Lagrangian,
with equality. I can comment on that in the
Q&A session, because it’s a very nice approach. And that approach was derived
independently by two sets of people, Karush, which is the first K, and
Kuhn-Tucker, which is the KT. And the Lagrangian under the inequality
constraint is referred to as KKT. So now, let us try to solve this. And I’d like, before I actually go
through the mathematics of it, to remind you that we actually saw this
before in the constrained optimization we solved under inequality constraints,
which was regularization. And it is good to look at that picture,
because it will put the analysis here in perspective. So in that case, you don’t
have to go through the details. We were minimizing something– you don’t have to worry about the
formula exactly– under a constraint. And the constraint is an inequality
constraint that resulted in weight decay, if you remember. And we had a picture
that went with it. And what we did was, we looked
at the picture and found a condition for the solution. And the condition for the solution
showed that the gradient of your objective function, of the thing you are
trying to minimize, becomes something that is related to
the constraint itself. In this case, it is normal to the constraint surface. The most important aspect to realize is
that, when you solve the constrained problem here, the end result was
that the gradient is not 0. It would have been 0 if the
problem was unconstrained. If I asked you to minimize this, you
just go for gradient equals 0, and solve. So now, because of the constraint, the
constraint kicks in, and you have the gradient being something related
to the constraint. And that’s what will happen
exactly when we have the Lagrangian in this case. But one of the benefits of reminding you of regularization is that there’s
a conceptual dichotomy, no pun intended, between the
regularization and the SVM. SVM is what we’re doing here,
maximizing the margin, and regularization is what we did before. So let’s look at both cases and ask
ourselves: what are we optimizing, and what is the constraint? If you remember in regularization,
we already have the equation. What we are minimizing is
the in-sample error. So we are optimizing E_in, under the
constraints that are related to w transposed w, the size of the weights. That was weight decay. If you look at the equation we just
found out in order to maximize the margin, what we are actually optimizing
is w transposed w. That is what you’re trying
to minimize. Right? And your constraint is that you’re
getting all the points right. So your constraint is that E_in is 0. So it’s the other way around. But again, because both of them will
blend in the Lagrangian, and you will end up doing something that is
a compromise, it’s conceptually not a big shock that we are reversing roles
here, and minimizing what is in our mind a constraint, and constraining
what is in our mind an objective function to be minimized. Back to the formulation. So now, let’s look at the
Lagrange formulation. And I would like you to pay
attention to this slide. Because once you get the formulation,
we’re not going to do much beyond getting a clean version of the
Lagrangian, and then passing it on to a package of quadratic programming
to give us a solution. But at least, arriving
there is important. So let’s look at it. We are minimizing one half w transposed w, which is our objective function,
subject to constraints of the form y_n times (w^T x_n plus b) greater than or equal to 1. First step, take the inequality
constraints and put them in the 0 form. So what do I mean by that? Instead of saying that’s greater than or
equal to 1, you put it as y_n times (w^T x_n plus b), minus 1, and
then require that this is greater
than or equal to 0. And now you see, it got multiplied
by a Lagrange multiplier. So think of this, since this should be
greater than or equal to 0, this is the slack. So the Lagrange multipliers get
multiplied by the slack. And then you add them up. And they become part of the objective. And they come out as a minus, simply
because the inequalities here are in the direction greater than or equal to. That’s what goes with the minus here. I’m not proving any of that. I’m just motivating for you that this
formula makes sense, but there’s mathematics that actually
pins it down exactly. And you’re minimizing this. So now let’s give it a name. It’s a Lagrangian. It is dependent on the variables
that I used to minimize with respect to, w and b. And now I have a bunch of new variables
which are the Lagrange multipliers, the vector alpha, which
is called lambda in other cases. Here it’s standard, alpha. And there are N of them. There’s a Lagrange multiplier
for every point in the set. We are minimizing this
with respect to what? With respect to w and b. So that was the original thing. The interesting part, which you should
pay attention to, is that you’re actually maximizing with
respect to alpha. Again, I’m not making a mathematical
proof that this method holds. But this is what you do. And it’s interesting because when we
had equality, we didn’t worry about maximization versus minimization. Because all you did, you get
the gradient equals 0. So that applies for both
maximum and minimum. So we didn’t necessarily
pay attention to it. Here you have to pay attention to it,
because you are maximizing with respect to alphas, but the alphas
have to be non-negative. Once you restrict the domain, you can’t
just get the gradient to be 0, because the function– if the function was all over and this
way, you get the minimum. And minimum has gradient 0. But if I tell you to stop here, the
function could be going this way. And this is the point you’re
going to pick. And the gradient here
is definitely not 0. So the question of maximizing versus
minimizing, you need to pay attention here. We are not going to pay too much
attention to it, because we’ll just tell the quadratic programming
guy, please maximize. And it will give us the solution. But that is the problem
we are solving. So now we do at least
the unconstrained part. With respect to w and b, you
are just minimizing this. So let’s do it. We’re going to take the gradient
of the Lagrangian with respect to w. So I’m getting partial by partial
for every weight that appears. And I get the equation here. How do I get that? I can differentiate. So I’m going to differentiate this. I get a w. The squared goes with the half. When I get this, I ask myself: what
is the coefficient of w? I get alpha_n, y_n, and x_n. Right? That one gets multiplied by w for
every n equals 1 to N. So I get that. And I have a minus sign
here, that comes here. Everything else drops out. So this is the formula. And what do I want the gradient to be? I want it to be the vector 0. So that’s a condition. What is the other one? I now get the derivative with
respect to b. b is a scalar. That’s the remaining parameter. And when I look at it,
can we do this? What gets multiplied by b? Oh I guess it’s just the alphas. Everything else drops out. So– oh, not just alphas! It’s y_n. So here’s the b. It gets multiplied
by y_n and alpha_n. And that’s what I get. And you get this to be equal
to the scalar 0. So optimizing this with respect to
w and b resulted in these two conditions. Now what I’m going to do, I’m going
to go back and substitute with these conditions in the original
Lagrangian, such that the maximization with respect to alpha– which is the tricky part, because alpha
has a range– will become free of w and b. And that formulation is referred to as
the dual formulation of the problem. So let’s substitute. Here are what I got from
the last slide. This one I got from the gradient
with respect to w equals 0. So w has to be this. And this one from the partial
by partial b, equals 0. I get those. And now I’m going to substitute
them in the Lagrangian. And the Lagrangian has that form. Now let’s do this carefully, because
things drop out nicely. And I get a very nice formula at the
end, which is a function of alpha only. So this equals– first part, I get the summation
of the Lagrange multipliers. Where did I get that? I got that because I
have -1 here. It gets multiplied by alpha_n
for all of those. Canceled with this minus, so
I get summation over that. So this part I got. So let me kill the part
that I already used. So I kill the -1. So that part I got. Next. I look at this and say: I have
+b here, right? So when I take +b, it gets
multiplied by y_n alpha_n, summed up from n equals 1 to N. Now, I look at
this and say: oh, the summation of alpha_n y_n from n
equals 1 to N is 0. So the guys that get multiplied
by b, will get to 0. And therefore, I can kill +b. Now when I have it down to this,
it’s very easy to see. Because you look at the form for w, when
you have w transposed w, you are going to get a quadratic version of this. You get some double summation,
alpha alpha y y x x, right? With the proper name of the dummy
variable, to get it right. And when you have here, well, you have
already alpha_n y_n and x_n, and now when you substitute w by this, you’re
going to get exactly the same thing. You’re going to get another alpha,
another y, another x. So this will be exactly the same as
this, except that this one has a factor half, this has
a factor -1. So you add them up. And you end up with this. So we look at this: what
happened to w? What happened to b? All gone. We are now just a function of
the Lagrange multipliers. And therefore, we can call
this L of alpha. Now this is a very nice quantity to
have, because this is a very simple quadratic form in the vector alpha. Alpha here appears as a linear guy. Here appears as a quadratic guy. That’s all. Now I need to put the constraints. I put back the things I took out. And let’s look at the maximization
with respect to alpha, subject to non-negative ones. This is a KKT condition. I have to look for solutions
under these conditions. And I also have to consider the
conditions that I inherited from the first stage. So I have to satisfy this, and I
have to satisfy this, for the solution to be valid. So this one is a constraint over the
alphas, and therefore I have to take it as a constraint here. But I don’t have to take the constraint
here, because that is vacuous as far as alphas are concerned. This does no constraint over
alphas whatsoever. You do your thing. You come up with alphas. And you call whatever that formula
is, the resulting w. Since w doesn’t appear in
optimization, I don’t worry about it at all. So I end up with this thing. Now if I didn’t have those annoying
constraints, I would be basically done. Because I look at this,
that’s pretty easy. I can express one of the alphas in
terms of the rest of the alphas. Right? Factor it out. Substitute for that alpha here. And all of a sudden, I have a purely
unconstrained optimization for a quadratic one. I solve it. I get something, maybe a pseudo inverse
or something, and I’m done. But I cannot do that simply because
I’m restricted to those choices. And therefore, I have to work with
a constrained optimization, albeit a very minor constrained optimization. Now let’s look at the solution. The solution goes with quadratic
programming. So the purpose of the slide here is
to translate the objective and the constraints we had into the coefficients
that you’re going to pass on to a package called quadratic
programming. So this is a practical slide. First, what we are doing is maximizing
with respect to alpha this quantity that we found, subject to
a bunch of constraints. Quadratic programming packages come
usually with minimization. So we need to translate this
into minimization. How are we going to do that? We’re just going to get
the minus of that. So this would become this minus that. So let’s do that. We got the minus, minimum of this. So now it’s ready to go. Now the next step will
be pretty scary. Because what I’m going to do, I’m going
to expand this, isolating the coefficients from the alphas. The alphas are the parameters. You’re not passing alphas to
quadratic programming. Quadratic programming works with a vector of variables that you called alpha. What you are passing are the
coefficients of your particular problems that are decided by these
numbers, that the quadratic programming will take, and then will
be able to give you the alphas that would minimize this quantity. So this is what it looks like. I have a quadratic term,
alpha transposed alpha. And these are the coefficients
in the double summation. These are numbers that you read
off your training data. You give me x_1 and y_1. I’m going to compute these numbers
for all of these combinations. And I end up with a matrix. That matrix gets passed to
quadratic programming as the quadratic term. And it asks you for the linear term. The linear term, just to be
formal, since we are just taking minus alpha, happens to be -1 transposed
alpha, which is minus the sum of those guys. So this is the bunch of linear
coefficients that you pass. And then the constraints– you put the constraints again
in the same way, subject to. So there’s a part which asks
you for constraints. And here again, the constraints– you
care about the coefficients of the constraints. So this is a linear equality
constraint. So we are going to pass the y
transposed, which are the coefficients here, as a vector. And it will ask you for, finally, the
range of alphas that you need. And the range of alphas that you need
happens to be between 0, so that would be the vector 0, and infinity. And then quadratic programming
gives you back an alpha. And if you’re completely discouraged by
this, let me remind you that all of this is just to give you what
to pass to the package. This actually looks exactly like this. That’s all you’re doing. A very simple quadratic function,
with a linear term. You’re minimizing it, subject to linear
equality constraint, plus a bunch of range constraints. And when you expand it, in terms of
numbers, this is what you get. And that’s what we’re going to use. So now we are done. We have done the analysis. We knew what to optimize. It fit one of the standard
optimization tools. It happens to be a convex function in
this case, so that the quadratic programming will be very successful. And then we pass it, and
we get a number back. Just a word of warning
before we go there. You look at the size of this matrix. And it’s N by N. Right? So the dimension of the matrix depends
on the number of examples. Well, if you have a hundred
examples, no sweat. If you have 1000 examples, no sweat. If you have a million examples,
this is really trouble. Because this is really a dense matrix. These numbers could come
up with anything. So all the entries matter. And if you end up with a huge matrix,
quadratic programming will have a pretty hard time finding the solution. To the level where there are tons of
heuristics to solve this problem when the number of examples is big. It’s a practical consideration, but
it’s an important consideration. But basically, if you’re working with
problems– the typical machine learning problem, where you have, let’s
say not more than 10,000, then it’s not formidable. 10,000 is flirting with danger,
but that’s what it is. So pay attention to the fact that, in
spite of the fact that there’s a standard way of solving it, and the
fact that it’s convex, so it’s friendly, it is not that easy when you
get a huge number of examples. And people have hierarchical methods
and whatnot, in order to deal with that case. So let’s say we succeeded. We gave the matrix and the vectors
to quadratic programming. Back comes what? Back comes alpha. This is your solution. So now we want to take this solution,
and solve our original problem. What is w, what is b, what is the
surface, what is the margin? You answer the questions that
all of this formalization was meant to tackle. So the solution is a vector of alphas. And the first thing is that it is very
easy to get the w because, luckily, the formula for w being this was one of
the constraints we got from solving the original one. When we got the gradient with respect to
w, we found out this is the thing. So you get the alphas, you plug them
in, and then you’ll get the w. So you get the vector
of weights you want. Now I would like to tell you a condition
which is very important. And it will be the key to defining
support vectors in this case, which is another KKT condition that will
be satisfied at the minimum, which is the following. Quadratic programming hands you alpha. Let’s say that– alpha is the same length
as the number of examples– let’s say you have 1000 examples. So it gives you a vector of 1000 guys. You look at the vector,
and to your surprise– you don’t know yet whether it’s a pleasant
or unpleasant surprise– a whole bunch of the alphas are just 0. The alphas are restricted
to be non-negative. They all have to be greater
than or equal to 0. If you find any one of them negative,
then you say quadratic programming made a mistake. But it won’t make a mistake. It will give you numbers
that are non-negative. But the remarkable part, out of the
1000, more than 900 are 0’s. So you say: something is wrong? Is there a bug in my
thing or something? No. Because of the following. The following condition holds. It looks like a big condition. But let’s read it. This is the constraint in the 0 form. So this is greater than or equal to 1. So minus 1 would be greater
than or equal to 0. This is what we called the slack. So the condition that is guaranteed to
be satisfied, for the point you’re going to get, is that either the slack is
0, or the Lagrange multiplier is 0. The product of them will
definitely be 0. So if there’s a positive slack, that
means that you are talking about an interior point. Remember that I have a plane,
and I have a margin. And the margin touches
on the nearest points. And that is what defines the margin. At those points, the value is exactly 1,
so the slack is 0. Then there are interior points, where
the value is bigger than 1, so the slack
will be positive. So for all the interior points, you’re
guaranteed that the corresponding Lagrange multiplier will be 0. OK? I claim that we saw this before, again
in the regularization case. Remember this fellow? We had a constraint which is to
be within the red circle. And we’re trying to optimize a function
that has equi-potentials around this. So this is the absolute minimum. And it grows and grows and grows. And because we are in the constraint,
we couldn’t get the absolute minimum when we went there. When we had the constraint being
vacuous, that is, the constraint doesn’t really constrain us, and the
absolute optimal is inside, we ended up with no need for regularization,
if you remember? And the lambda for regularization
in that case was 0. That is the case, where you have
an interior point, and the multiplier is 0. And then when you got a genuine guy that
you have to actually compromise, you ended up with a condition that
requires lambda to be positive. So these are the guys where
the constraint is active. And therefore you get a positive lambda,
while this guy is by itself 0. So now we come to an interesting
definition. So alpha is largely 0’s,
interior points. The most important points in the game
are the points that actually define the plane and the margin. And these are the ones for which
alpha_n’s are positive. And these are called support vectors. So I have N points. And I classify them, and I
got the maximum margin. And because it’s a maximum margin, it
touched on some of the +1 and some of the -1 points. Those points support the
plane, so to speak. And they’re called support vectors. And the other guys are
interior points. And the mathematics of it tells us that
we can identify those, because we can go for lambdas that happen to be
positive, the alphas in this case. And the alpha greater than 0 will
identify a support vector. Again, when I put a box, it’s
an important thing. So this is an important notion. So let’s talk about support vectors. I have a bunch of points
here to classify. And I go through the entire machinery. I formulate the problem. I get the matrix. I pass it to quadratic programming. I get the alpha back. I compute the w. All of the above. And this is what I get. So where are the support
vectors in this picture? They are the closest ones
to the plane, where the margin region touched. And they happen to be these three. This one, this one, this one. So all of these guys that are here, and
all of these guys that are here, will just contribute nothing to the solution. They will get alpha equals
0 in this case. And the support vectors achieve
the margin exactly. They are the critical points. The other guys– their margin, if you will,
is bigger or much bigger. And for the support vectors, you
satisfy this with equal 1. So all of this is fine. Now, we used to compute w in terms of
the summation of alpha_n y_n x_n. Because we said that this is the
quantity we got, when we got the gradient with respect to w equals 0. So this is one of the equations. And this is our way to get the alphas
back, which is the currency we get back from quadratic programming, and
plug it in, in order to get the w. This goes from n equals 1 to N.
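
The steps so far (compute the matrix of coefficients from the data, hand it to quadratic programming, get alpha back, plug it in to get w) can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the four-point data set is a made-up toy example and not from the lecture, SciPy's general-purpose SLSQP routine stands in for a dedicated quadratic programming package, and the 1e-4 threshold used to decide that an alpha is positive is an arbitrary choice.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical toy data (an assumption, not from the lecture):
# four linearly separable points in two dimensions.
X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0], [3.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
N = len(y)

# Quadratic coefficients read off the training data:
# Q[n, m] = y_n * y_m * (x_n . x_m), an N-by-N matrix.
Q = np.outer(y, y) * (X @ X.T)

def objective(alpha):
    # Minimization form of the dual: (1/2) alpha^T Q alpha - 1^T alpha.
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

def gradient(alpha):
    return Q @ alpha - np.ones(N)

result = minimize(
    objective,
    x0=np.zeros(N),
    jac=gradient,
    method="SLSQP",                 # stand-in for a quadratic programming package
    bounds=[(0.0, None)] * N,       # range constraint: 0 <= alpha_n < infinity
    constraints=[{"type": "eq",     # equality constraint: y^T alpha = 0
                  "fun": lambda a: y @ a,
                  "jac": lambda a: y}],
    options={"ftol": 1e-10, "maxiter": 200},
)
alpha = result.x

# Plug the alphas in: w = sum over n of alpha_n * y_n * x_n.
w = (alpha * y) @ X

# Points whose alpha is positive (above an arbitrary small threshold).
support = np.flatnonzero(alpha > 1e-4)

print("alpha:", np.round(alpha, 4))
print("w:", np.round(w, 4))
print("indices with positive alpha:", support)
```

For this toy set the answer can be checked by hand: w = (1, -1) satisfies all four constraints, with equality for the first three points and with slack for the fourth, so a converged run should report three positive alphas and one alpha at 0.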
Now that I notice that many of the alphas are 0, and alpha is only positive
for support vectors, then I can say that I can sum this up over
only the support vectors. It looks like a minor technicality. So the other terms happen to
be 0, so you excluded them. You just made the notation
more clumsy in this case. But there’s a very important point. Think of alphas now as the
parameters of your model. When they’re 0s, they don’t count. Just expect almost all
of them to be 0. What counts is the actual values of
the parameters that will be some number bigger than 0. So now, your weight vector– it’s a d-dimensional vector– is
expressed in terms of the constants which are your data set, x_n
and their labels. Plus a few parameters, hopefully few
parameters, and their number is just the number of support vectors. So if you have three support
vectors, then this– let’s say you’re working in a
20-dimensional space. So I’m looking at a 20-dimensional space. I’m getting a weight vector. Well, it’s a 20-dimensional
space in disguise. Because of the constraint you put, you
got something that is effectively three-dimensional. And now you can realize
why there might be a generalization dividend here. Because I end up with fewer parameters
than the express parameters that are in the value I get. So, we can also– now that we have it– solve for the b. Because you want w and b– b is the bias,
or corresponding to the threshold term, if you will. And it’s very easy to do. Because all you need to do is take any
support vector, any one of them, and for any of them you know that
this equation holds. You already solved for w, by that. So you plug this in. And the only unknown in this
equation would be b. And as a check for you, take any
support vector and plug it in. And you have to find the
same b coming out. That was your check that everything
in the math went through. You take any of them,
and you solve for b. And now you have w and b, and you are ready
with the classification line or hyperplane that you have. Now let me close with the nonlinear
transforms, which will be a very short presentation that has
an enormous impact. We are talking about
a linear boundary. And we are talking about the linearly
separable case, at least in this lecture. In the next lecture, I’m going to
go to the non-separable case. But a non-separable case could be
handled here in the same way we handled the non-separable case
with the perceptrons. Instead of working in the X space,
we went to the Z space. And I’d like to see what happens to the
problem of support vector machines, as we stated it and solved it, when
you actually move to the higher dimensional space. Is the problem becoming
more difficult? Does it hold? Et cetera. So let’s look at it. So we’re going to work
with z instead of x. And we’re going to work in the Z space
instead of the X space. So let’s first put what we are doing. Analytically, after doing all of the
stuff, and I even forgot what the details are, all I care about is that:
would you please maximize this with respect to alpha, subject to a couple
of sets of constraints. So you look at here. And you can see, when I transform
from x to z, nothing happens to the y’s. The labels are the same. And these are the guys that probably
will be changed, because now I’m working in a new space. So I’m putting them in
a different color. So if I work in the X space, that’s
what I’m working with. And these are the guys that I’m going
to multiply in order to get the matrix that I pass on to
quadratic programming. Now let’s take the usual
nonlinear transform. So this is your X space. And in X space, I give you this data. Well, this data is not separable, not
linearly separable, and definitely not nearly linearly separable. This is the case where you need
a nonlinear transformation. And we did this nonlinear
transformation before. Let’s say you take just
x1 squared and x2 squared. And then you get this, and this
one is linearly separable. So all you’re doing now is
working in the Z space. And instead of getting just a generic
separator, you’re getting the best separator, according to SVM, and then
mapping it back, hoping that it will have dividends in terms
of the generalization. So you look at this. I’m moving from X to Z. So when I go back to here,
what do you do? All you need to do is replace
the x’s with z’s. And then you forget that there
was ever an X space. I have vector z. I do the inner product in order
to get these numbers. These numbers I’m going to pass
on to quadratic programming. And when I get the solution back, I have
the separating plane or line in the Z space. And then when I want to know what
the surface is in the X space, I map it back. I get the pre-image of it. And that’s what I get. The most important aspect to observe
here is that– OK, the solution is easy. Let’s say I move from two-dimensional
to two-dimensional here. Nothing happened. Let’s say I move from two-dimensional
to a million-dimensional. Let’s see how much more difficult
the problem became. What do I do? Now I have a million-dimensional vector,
inner product with a million-dimensional vector. That doesn’t faze me at all. Just an inner product. I get a number. But when I’m done, how many
alphas do I have? This is the dimensionality of the
problem that I’m passing to quadratic programming. Exactly the same thing. It’s the number of data points. Has nothing to do with the
dimensionality of the space you’re working in. So you can go to an enormous space,
without paying the price for it in terms of the optimization
you’re going to do. You’re going to get
a plane in that space. You can’t even imagine, because
it’s million-dimensional. It has a margin. The margin will look very
interesting in this case. And supposedly it has good
generalization property. And then you map it back here. But the difficulty of solving
the problem is identical. The only thing that is different is
just getting those coefficients. You’ll be multiplying longer vectors. But that is the least of our concerns. The other one is that you’re going
to get the full matrix of this. And quadratic programming will have
to manipulate the matrix. And that’s where the price is paid. So that price is constant, as long
as you give it this number. It doesn’t care whether it was inner
product of 2 by 2, or inner product of a million by million. it will just hand you the alphas. And then you interpret the alphas in
the space that you created it from. So the w will belong to the Z space. Now let’s look at, if I do the nonlinear
transformation, do I have support vectors? Yes, you have support vectors
for sure in the Z space. Because you’re working exclusively in
the Z space, you get the plane there. You get the margin. The margin will touch some points. These are your support vectors
by definition. And you can identify them even without
looking geometrically at the Z space, because what are the support vectors? Oh, I look at the alphas I get. And the alphas that are positive, these
correspond to support vectors. So without even imagining what the Z
space is like, I can identify which guys happen to have the critical margin
in the Z space, just by looking at the alphas. So support vectors live in the space
you are doing the process in, in this case, the Z space. In the X space, there is
an interpretation. So let’s look at the X space here. If I have these guys, not linearly
separable, and you decided to go to a high-dimensional Z space. I’m not going to tell you what. And you solved the support
vector machine. You got the alphas. You got the line, or the hyperplane
in that space. And then you are putting the boundary
here that corresponds to this guy. And this is what the boundary
looks like. Now, we have alarm bells– overfitting, overfitting! Whenever you see something like
that, you say wait. That’s the big advantage you
get out of support vectors. So I get this surface. This surface is simply what the line in the
Z space with the best margin got. That’s all. So if I look at what the support vectors
are in the Z space, they happen to correspond to points here. They are just data points. Right? So let me identify them here, as
pre-images of support vectors. People will say they are
support vectors. But you need to be careful,
because the formal definition is in the Z space. So they may look like this. So let’s look at it. This is one. This is another. This is another. This is another. And usually they are when you turn. You would think that in the Z space,
this is being sandwiched. So this is what it’s likely to be. Now the interesting aspect here is that
if this is true, then one, two, three, four– I have only four support vectors. So I have only four parameters, really,
expressing w in the Z space. Because that’s what we did. We said that w equals summation, over
the support vectors, of the alphas. Now that is remarkable, because I just
went to a million-dimensional space. w is a million-dimensional vector. And when I did the solution,
and if I get four– only four, which is very lucky
if you are using a million-dimensional space, but just
for illustration. If I get four support vectors, then
effectively, in spite of the fact that I used the glory of the million-dimensional
space, I actually have four parameters. And the generalization behavior will
go with the four parameters. So this looks like a sophisticated
surface, but it’s a sophisticated surface in disguise. It was so carefully chosen that–
there are lots of snakes that can go around and mess up
the generalization. This one will be the best of them. And you have a handle on how good the
generalization is, just by counting the number of support vectors. And that will get us– Yeah, this is a good point
I forgot to mention. So the distance between the support
vectors and the surface here is not the margin. The margins are in the linear
space, et cetera. They’re likely, these guys, to
be close to the surface. But the distance wouldn’t be the same. And there are perhaps other points that
look like they should be support vectors, and they aren’t. What makes them support vectors or
not is that they achieve the margin in the Z space. This is just an illustrative
version of it. And now we come to the generalization
result that makes this fly. And here is the deal. Generalization result: E out is less
than or equal to something. So you’re doing classification. And you are using the classification
error, the binary error. So this is the probability of error in
classifying an out-of-sample point. The statement here is very
much what you expect. You have the number of support vectors,
which happens to be the number of effective parameters– the
alphas that survived. This is your guy. You divide it by N, well, N
minus 1 in this case. And that will give you
an upper bound on E_out. Now I wish this was exactly
the result. The result is very close to this. In order to get the correct result, you
need to run several versions and get an average in order
to guarantee this. So the real result has to do with
expected values of those guys. So for several runs,
the expected value. But if the expected value lives up to
its name, and you expect the expected value, then in that case, the E_out you
will get in a particular situation will be bounded above by this, which
is a very familiar type of a bound, number of parameters, degrees of
freedom, VC dimension, dot dot dot, divided by the number of examples. We have seen this before. And again, the most important aspect
the nature of the Z space. Could be million-dimensional. And that didn’t figure out in the
computational difficulty. And it doesn’t figure in the
generalization bound either. You asked me, after you were done with
this entire machinery, how many support vectors did you get? If you have 1000 data points, and you
get 10 support vectors, you’re in pretty good shape regardless of
the dimensionality of the space that you visited. Because, then, 10 over 1000– that’s
a pretty good bound on E_out. On the other hand, it doesn’t say that
now I can go to any dimensional space and things would be fine. Because you still are dependent on
the number of support vectors. If you go through this machinery, and
then the number of support vectors out of 1000 is 500, you know
you are in trouble. And trouble is understood in this
case, because that snake will be really a snake– going around every point, going
around every point. So just trying to fit the data
hopelessly, getting so many support vectors that the generalization
question now becomes useless. But this is the main theoretical result
that makes people use support vectors, and support vectors with
the nonlinear transformation. You don’t pay for the computation of
going to the higher dimension. And you don’t get to pay for the
generalization that goes with that. And then when we go to kernel methods next time,
which is a modification of this, you’re not even going to pay for
the simple computational price of getting the inner product. Remember when I told you take an inner
product between a million-vector and itself, and that was minor,
even if it’s minor, we’re going to get away without it. And when we get away without it, we will
be able to do something rather interesting. The Z space we’re going to visit– we
are now going to take Z spaces that happen to be infinite-dimensional. Something completely unthought
of when we dealt with generalization in the old way. Because obviously, in an infinite-dimensional
space, I’m not going to be able to actually computationally
get the inner product. Thank you. So there has to be another way. And the other way will be the kernel. But that will open another set of
possibilities of working in a set of spaces we never imagined touching, and
still getting not only the computation being the same, but also the
generalization being dependent on something that we can measure, which
is the number of support vectors. I will stop here and take questions
after a short break. Let’s start the Q&A. MODERATOR: OK. Can you please first explain
again why you can normalize w transposed x plus b to be 1? PROFESSOR: OK. We would like to solve for
the margin given w. That has dependency on the combination
of w’s you get, which is like the angle that is the relevant one. And also w has an inherent
scale in it. So the problem is that the scale has
nothing to do with which plane you’re talking about. When I take w, the full w and b, and
take 10 times that, they look like different vectors as far as the analysis
is concerned, but they are talking about the same plane. So if I’m going to solve without the
normalization, I will get a solution. But the solution, whatever I’m
optimizing, will invariably have in its denominator something that takes
out the scale, so that the thing is scale-invariant. I cannot possibly solve and have it tell me that w has to be
this, when in fact any positive multiple of it will serve
the same plane. So all I’m doing is simplifying
my life in the optimization. I want the optimization to be
as simple as possible. I don’t want it to be something
over something. Because then I will have trouble
actually getting the solution. Therefore, I started by putting
a condition that does not result in loss of generality. Because I restrict myself
in terms of w’s, not in terms of planes. All planes are admitted. But every plane is represented
by an infinite number of w’s. And I’m picking one particular w
to represent them that happens to have that form. When I do that and put it as
a constraint, what I end up with, the thing that I’m optimizing happens to
be a friendly guy that goes with quadratic programming and
I get the solution. I could definitely have started
by not putting this condition. Except that I would run into
mathematical trouble later on. That’s all there is to it. Similarly, I could have left w_0 in. And then all of a sudden, every time I
put something, I would have to tell you: take the norm of only the d guys, w_1 up
to w_d, and forget the first one. So all of this was just pure technical
preparation that does not alter the problem at all, that makes the
solution friendly later on. MODERATOR: Many people are curious. What happens when the points
are not linearly separable? PROFESSOR: There are two cases. One of them: they are horribly not
linearly separable, like that. And in this case, you
go to a nonlinear transformation, as we have seen. And then there is the slightly
not linearly separable case, as we’ve seen before. And in that case, you will see that the
method I described today is called hard-margin SVM. Hard-margin because the margin
is satisfied strictly. And then you’re going to get another
version of it, which is called soft-margin, that allows for a few errors
and penalizes for them. And that will be covered next. But basically, it’s very much in
parallel with the perceptron. Perceptron means linearly separable. If there are few, then
you apply something. Let’s say like the pocket
in that case. But if it’s terribly not linearly
separable, then you go to a nonlinear transformation. And nonlinear transformation here is very
attractive because of the particular positive properties that we discussed. But in general, you actually use
a nonlinear transformation together with the soft version, because you don’t want
the snake to go out of its way just to take care of an outlier. So we are better off just making
an error on the outlier, and making the snake a little bit less wiggly. And we will talk about that
when we get to the details. MODERATOR: Could you explain once again
why in this case, just the number of support vectors gives an approximation
of the VC dimension, while in other cases the transform– PROFESSOR: The explanation
I gave was intuitive. It’s not a proof. There is a proof for these terms
that I didn’t even touch on. And the idea is the following. We have come to the conclusion that the
number of parameters, independent parameters or effective parameters, is
the VC dimension in many cases. So to the extent that you can actually
accept that as a rule of thumb, then you look at the alphas. I have as
many alphas as data points. So if these were actually my parameters,
I’d be in deep trouble. Because I have as many parameters as
points, so I’m basically memorizing the points. But the particulars of the problem
result in the fact that, in almost all the cases, the vast majority of the
parameters will be identically 0. So in spite of the fact that they were
open to be non-zero, the fact that the expectation is that almost all of them
will be 0, means that, more or less, the effective parameters are the
ones that end up being non-zero. Again, this is not an accurate
statement, but it’s a very reasonable statement. So the number of non-zero parameters,
which corresponds to the VC dimension, also happens to be the
number of the support vectors by definition. Because support vectors are the ones
that correspond to the non-zero Lagrange multipliers. And therefore, we get a rule, which
either counts the number of support vectors or the number of surviving
parameters, if you will. And this is the rule that we had at
the end, that I said that I didn’t prove, but actually gives
you a bound on E_out. MODERATOR: Is there any advantage in
considering the margin, but using a different norm? PROFESSOR: So there
are variations of this. And indeed, some of the aggregation
methods, like boosting, have a margin of their own. And then you can compare that. It’s really a question of the
ease of solving the problem. And if you have a reason for
using one norm or another, for a practical problem. For example, if I see that loss goes
with squared, or loss goes with the absolute value or whatever, and then I
design my margin accordingly, then we go back to the idea of a principled
error measure, in this case margin measure. On the other hand, in most of the cases,
there is really no preference. And it is the analytic considerations
that make me choose one margin or another. But different measures for the margin,
with 1-norm, 2-norm, and other things, have been applied. And there is really no compelling reason
to prefer one over the other in terms of performance. So it really is the analytic
properties that usually dictate the choice. MODERATOR: Is there any pruning method
that can maybe get rid of some of the support vectors, or not really? PROFESSOR: So you’re not happy
with even reducing it to support vectors? You want to get rid of some of them.
Well– Offhand, I cannot think of a method
that I can directly translate into– as if it’s getting rid of
some of the support vectors. What happens for computational reasons
is that when you solve a problem that is huge in data set, you
cannot solve it all. So sometimes what happens is that
you take subsets, and you get the support vectors. And then you take the support vectors as
a union and get the support vectors of the support vectors,
and stuff like that. So these are really computational
considerations. But basically, the support vectors are
there to support the separating plane. So if you let one of them
go, the thing will fall! Obviously, I’m only half-joking. But really, they are the ones
that dictate the margin, so their existence really tells you
that the margin is valid. And that’s really why they are there. MODERATOR: Some people are worried that
a noisy data set would completely ruin the performance of the SVM. So how does it deal with this? PROFESSOR: It will ruin as much
as it will ruin any other method. It’s not particularly susceptible
to noise. Except obviously when you have noise,
the chance of getting cleanly linearly separable data is not there. And therefore, you’re using
the other methods. And if you’re using strictly
nonlinear transformation, but with hard margin, then I can see
the point of ruining. Because now the snake is
going around noise. And obviously that’s not good, because
you’re fitting the noise. But in those cases, and in almost all of
the cases, you use the soft version of this, which is remarkably similar. It’s different assumptions. But the solution is remarkably
similar. And therefore in that case, you will be
as vulnerable or not vulnerable to noise as you would be
using other methods. MODERATOR: All right. I think
that’s it. PROFESSOR: Very good. We will see you next week.
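
The normalization argument from the first answer in this Q&A can be checked numerically. What follows is a minimal sketch and not part of the lecture: the three data points and the starting (w, b) are made-up values, and `canonicalize` is a hypothetical helper name. It illustrates that multiplying (w, b) by 10 leaves the plane and the geometric distances unchanged, and that rescaling so the nearest point satisfies |w^T x + b| = 1 makes the margin equal to 1/||w||.

```python
import math

def signal(w, b, x):
    """Return w^T x + b for plain Python sequences."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def norm(w):
    """Euclidean norm of w."""
    return math.sqrt(sum(wi * wi for wi in w))

def distance(w, b, x):
    """Euclidean distance from x to the plane w^T x + b = 0."""
    return abs(signal(w, b, x)) / norm(w)

def canonicalize(w, b, points):
    """Rescale (w, b) so that the nearest point has |w^T x + b| = 1."""
    scale = min(abs(signal(w, b, x)) for x in points)
    return [wi / scale for wi in w], b / scale

# Illustrative data (assumed, not from the lecture).
points = [(2.0, 2.0), (4.0, 0.0), (0.0, 0.0)]
w, b = [1.0, 1.0], -1.0          # one representation of a plane
w10, b10 = [10.0, 10.0], -10.0   # 10 times that: the same plane

# The geometric margin does not depend on the scale of (w, b).
margin = min(distance(w, b, x) for x in points)
margin10 = min(distance(w10, b10, x) for x in points)

# After normalization the margin can be read off as 1 / ||w||.
wc, bc = canonicalize(w10, b10, points)
canonical_margin = 1.0 / norm(wc)
print(margin, margin10, canonical_margin)
```

The three printed values agree, which is the sense in which the normalization costs no generality: it picks one representative w per plane without excluding any plane.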

### 85 comments on “Lecture 14 – Support Vector Machines”

1. Nestor Hernandez says:

the luxury car!

2. Mortiffer says:

Support Vector Machine lecture starts at 4:14

3. Jiunjiun Ma says:

Wow, this is brilliant.

4. Rafael Reis says:

There's no god about it! Even so, congratulations!

5. kernel says:

Thank you very much for the best lecture on SVM in the world. Probably, Vapnik himself would be able to teach/deliver the SVM clearly as you do.

6. Ankit Arya says:

Thanks a lot, very well explained!

7. ChakarView says:

seriously dude this is awesome.. after many attempts finally I understand the SVM..

8. LegatoDi says:

I haven't seen previous lectures and I wonder why he call vector "w" as a "signal"?

9. Darshan Hegde says:

at 34:29, Observe closely, When Prof. Yaser is explaining the constrained optimization, there is an background music as his hand moves. "Boshooom"… ! It just sounds so natural, as if Prof. Did it !

10. Hyunguk Choi says:

Thanks a lot !! 🙂

11. junchen feng says:

The intuition is GREAT! Thx!

12. Johannes Stoll says:

awesome

13. Lynn Z. says:

Well explained! Thanks a lot!

14. CreativeWarlock says:

it's his hand touching the microphone

15. Amar C says:

Thanks a lot !

16. CyberneticOrganism01 says:

The kernel trick (part 3) is not explained in much detail…
I'm still looking for a clear and easy-to-understand explanation of it =)

17. Arinze Akutekwe says:

This is really very nice and helpful in my research work. I would have love to know more about the heuristics you talked about for handling large dataset with SVM

18. Zeinab Dastgheib says:

Thank you very much, very helpful !

19. kernel says:

I meant, Vapnik himself would not be able to teach the subject as clearly as you do.

20. Farhan Rahman says:

This lecture is sooo good! One of the cool things is that people here don't assume that you know everything unlike so many other places where they expect that you know about the basic concepts of optimisation and machine learning!

21. Anshuman Biswal says:

really nice video…understood SVM at last 🙂

22. Subhabrata Banerjee says:

All ML teachers are so boring. Good Mathematicians tell stories on numbers.

23. Omar Trejo says:

This is a very well produced lecture. Thank you for sharing. 🙂

24. Deniz Zorlu says:

haven't got there yet but kernel methods is the next lecture..

25. mnfchen says:

I don't quite understand KKT conditions; what foundations do I need to do so?

26. Ethan_AI says:

Watched a video on Lagrange Multipliers and now Im back again.

27. Abdullah Amer says:

your lecture cannot say about it less than amazing…Thank you so much…

28. DaJaguar says:

I love his accent! 🙂

29. Graig G says:

SVMs kick ass!

30. ABC2007YT says:

I rewinded this a number of times and i finally got it. really well explained!!

31. K Yudhanta says:

Can I used SVM for sentiment analysis classification?

32. TheHarperad says:

In sovjet rashiya, machine vector supports you.

33. Varga Robert says:

Nice, clean presentation.
"I can kill +b"

34. Abdalrahman Shlash says:

Very helpful !.. thanks a lot

35. Ali Ebrahimi says:

such a gentle man and inteligent Professor.

36. Mohamed El Ansari says:

Very nice presentation.
Thank you a lot

37. Sastry aditya says:

Warning: If your IQ is below 160 move on to some other video or Andrew NG's video's

38. Indra Firmansyah says:

Thank you for the lecture Professor!

39. Ali Mizan says:

Just wondering: at 43:26, is that -1 supposed to be an identity matrix times scalar -1? That's what I assumed at first, but when I look at LAML, the Java quadratic programming library that I'm using, it specifies that C needs to be an n x 1 matrix. So I guess c is just a column of N rows, with each entry being a -1?
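
On this question: in the dual problem as the lecture states it (minimize ½ α^T Q α + (−1)^T α subject to y^T α = 0 and α ≥ 0), the −1 in the linear term is a vector of N entries, each equal to −1, and not an identity matrix. A minimal sketch of assembling those inputs follows. The toy data and the function name `build_dual_qp` are assumptions for illustration, and the argument layout that a particular solver such as LAML expects should be checked against its own documentation.

```python
def build_dual_qp(xs, ys):
    """Assemble the pieces of the hard-margin SVM dual as plain lists.

    Q[n][m] = y_n * y_m * (x_n . x_m)   (N x N quadratic coefficients)
    c[n]    = -1                         (N linear coefficients)
    The equality constraint is y^T alpha = 0 and the bounds are alpha >= 0.
    """
    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))

    n_points = len(xs)
    Q = [[ys[n] * ys[m] * dot(xs[n], xs[m]) for m in range(n_points)]
         for n in range(n_points)]
    c = [-1.0] * n_points
    return Q, c

# Toy data (assumed for illustration): two +1 points and one -1 point.
xs = [(1.0, 2.0), (3.0, 1.0), (-1.0, -1.0)]
ys = [1.0, 1.0, -1.0]
Q, c = build_dual_qp(xs, ys)
print(Q)
print(c)
```

So `c` has one entry per data point, which matches the guess of a column of N rows.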

40. Jules Wombat says:

Interesting and Inspiring. A great video, alongside other videos, to help comprehend a basic understanding of the SVM subject.

Still worried (my naïve intuition )that if it really comes down to being a calculation against those margin points, then surely more susceptible to noisy data and overfitting because I would have thought the noisy overfitting errors are what are on the margins.
So I guess I should look at how 'soft' SVMs help.

41. Anand R says:

Thank you Professor for the very informative lecture..!
Can someone here tell me what lecture he covers VC dimensions in ?
Highly appreciate ur replies

42. Vedhas Pandit says:

Why at 33:43, the professor says alpha's are non-negative, all of a sudden????
Disclaimer: I haven't watched earlier lectures, in case that is relevant.

43. Vedhas Pandit says:

Summarized question: Why are we maximizing L w.r.t. alpha at 39:25?

Slide13 at 36:06: At extrema of L(w,b,alpha), dL/db=dL/dw=0, giving us w =sum(an*yn*xn) and sum(an*yn)=0. These substitutions make L(w,b,a)=L(alpha) in the slide 14 = extrema of L. Then why are we maximizing this w.r.t alpha?? He said something about that in slide 13 at 33:40, but I could not understand. Can anybody care to explain?

44. Abhijeet Kalpande says:

Really helpful explanation..got what SVM is..Thank you so much professor!

45. WahranRai says:

from 12:15
It means that you extended the features X with 1 and weights W with b as in perceptron.
And these extensions are removed from X and W after normalization.

46. yaseen alwesabi says:

so good

47. Dea Shehu says:

I loved loved loved all the lectures , you are an amazing professor !!!!

48. link mipha says:

A wonderful word…

49. Rob Romijnders says:

What a charming prof. Like his teaching style. Thank you Caltech for sharing this

50. JAEYEON LEE says:

can anyone tell me the lecture where he teaches "generalization"??

51. mohammad kamal says:

Thanks Dr Yasser ,you are honor for every Egyptian

52. Maja Garbulinska says:

people like you save my life 🙂

53. D. Refaeli says:

I don't understand why we put constraints on alpha's to be greater than 0… If we take a simple example, say of 3 data points, 2 of positive class (yi=1): (1,2) (3,1) and one negative (yi=-1): (-1,-1) – and we calculate using Lagrange multipliers, we will get a perfect w (0.25,0.5) and b = -0.25, but one of our alphas was negative (a1 = 6/32, a2 = -1/32, a3 = 5/32). So why is this a problem?
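
The numbers in this comment can be checked directly. The sketch below is an illustration added here, not something from the lecture, and the helper names are hypothetical. It suggests why the sign matters: the solution with a negative alpha forces all three points exactly onto the margin, whereas the candidate α = (2/13, 0, 2/13), with no negative entries, also satisfies every constraint and has a smaller ||w||, hence a larger margin.

```python
import math

xs = [(1.0, 2.0), (3.0, 1.0), (-1.0, -1.0)]
ys = [1.0, 1.0, -1.0]

def weights_from_alphas(alphas):
    """w = sum_n alpha_n * y_n * x_n, from setting dL/dw = 0."""
    return [sum(a * y * x[i] for a, y, x in zip(alphas, ys, xs))
            for i in range(2)]

def functional_margins(w, b):
    """y_n * (w^T x_n + b) for every point; each must be >= 1."""
    return [y * (w[0] * x[0] + w[1] * x[1] + b) for x, y in zip(xs, ys)]

# Alphas quoted in the comment (one is negative), with b = -0.25.
w_eq = weights_from_alphas([6 / 32, -1 / 32, 5 / 32])
m_eq = functional_margins(w_eq, -0.25)

# Candidate with non-negative alphas: only points 1 and 3 are active.
w_svm = weights_from_alphas([2 / 13, 0.0, 2 / 13])
m_svm = functional_margins(w_svm, -3 / 13)

# Geometric margin is 1 / ||w|| once the nearest point has signal 1.
margin_eq = 1 / math.hypot(*w_eq)
margin_svm = 1 / math.hypot(*w_svm)
print(w_eq, m_eq, margin_eq)
print(w_svm, m_svm, margin_svm)
```

In the second candidate the point (3, 1) sits strictly outside the margin and its alpha is zero, so it is not a support vector. Requiring α ≥ 0 is what lets a constraint be inactive like this instead of being forced to hold with equality.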

54. Atif Faridi (M.Tech CS 2015) says:

One of the best machine learning lecture. I would like to know.
How to solve quadratric programming analytically. So that the whole process of getting hyperplane can be done analytically.

55. Aditya Gaykar says:

I bow to your teaching /. Thank you.

56. Rahul Singh Yadav says:

Wow! this is the best explanation to SVM's by far I've come across, with right mathematical rigor, lucid concepts and structured analytical thinking put's up a good framework to understanding this complex model in a fun and intuitive way.

57. Sergio A. Serrano says:

Bravo Dr. Yaser, excellent explanation! Now looking forward Kernel Methods lecture 🙂

58. Patrick Agostini says:

Is that an ashtray in front of the professor?

59. Muhammad Anwar Hussain says:

In the constraint condition of |w^T.xn +b| >=1 how is it guaranteed that for the nearest xn, the |w^T.xn +b| will be 1 ?

60. sakcee says:

I salute you Sir!. What a great way of teaching! I think, I understood most by just one viewing of these lectures.

Do you teach any other courses? Can you put them on youtube also?

61. ScottReflex92 says:

writing my bachelors thesis about SVMs atm. it's a great introduction and very helpful for understanding the main issues in a short time. Thankyou!

62. Yusuf Ahmed says:

Excellent lecture

63. 7justfun says:

Amazing how you unravel it , like a movie , the element of suspense , a preview and a resolution.

64. June Yang says:

10/10 would listen again

65. Philip Robinson says:

why is their preference between minimizing and maximizing for optimization?

66. Chiran Koirala says:

Never seen lecture like this. Thank you!

67. Biswa G Singh says:

How simply you explain things. Wonder I can explain complex things like you do.

68. Diego Cerda says:

Best explanation ever! thank you

69. Alexander Zubov says:

Thank you very much for sharing these wonderful lectures! I have some thoughts about the margin. It seems, that start of the PLA with weights defining the hyperplane placed between the two centers of mass of data points is better to achieve the maximum margin, than the start with all-zero weights. Let R1 and R2 be the centers of mass of data points of the "+1" and "-1" categories, respectively. Then the normal vector of the hyperplane is equal to R1 – R2 (direction is important) and the bias vector is equal to (R1 + R2)/2. Thereby, the vector part of the weights is initialized as w = R1-R2 and the scalar part as wo = -(R1-R2, R1+R2)/2 (the inner product of the normal and bias vector multiplied by -1).

70. Bhaskar Dhariyal says:

What does first preliminary technicality(12:43)mean |wTx|=1? How is it same as |wTx| >0?

71. Cr4y7 says:

this one was complicated

72. Amr Del says:

good courses have you got lecture on ADABOOST and its uses with svm or other weak learners

73. YC says:

FUCKING BRILLIANT!! Thanks 😀

74. Fikriansyah Adzaka says:

I have some questions:

1. in slide 6 at 13:53, I still don't understand the reason behind changing the inequation into equal 1. the professor just said so that we can restrict the way we choose w and the math will become friendly. but is there any other reason behind this? like, can we actually choose any number other than one, maybe equal 2 or equal 0.5? seems both of them will also restrict the way we choose w
2. in slide 9 at 24:56, why maximize 1/||w|| is equivalent to minimize 1/2 wt w? any math derivation behind this? because I think I don't get it at all
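
Both questions above come down to a few lines of algebra. The following is a sketch of the standard argument, added for illustration; the constant $c$ is notation introduced only for this note. Any positive constant would work in place of 1, because rescaling $(\mathbf{w}, b)$ does not change the plane, and 1 is simply the most convenient choice. With that choice the margin is $1/\|\mathbf{w}\|$, and maximizing it is the same as minimizing $\tfrac{1}{2}\mathbf{w}^{\mathsf T}\mathbf{w}$:

$$
\begin{aligned}
\min_n \,|\mathbf{w}^{\mathsf T}\mathbf{x}_n + b| = c > 0
  &\;\Longrightarrow\; \text{margin} = \frac{c}{\|\mathbf{w}\|}
  \quad \text{(dividing } \mathbf{w}, b \text{ by } c \text{ gives the same plane with } c = 1\text{)}, \\
\arg\max_{\mathbf{w},\, b} \frac{1}{\|\mathbf{w}\|}
  &= \arg\min_{\mathbf{w},\, b} \|\mathbf{w}\|
  = \arg\min_{\mathbf{w},\, b} \|\mathbf{w}\|^2
  = \arg\min_{\mathbf{w},\, b} \tfrac{1}{2}\,\mathbf{w}^{\mathsf T}\mathbf{w},
\end{aligned}
$$

all subject to the same constraints. The steps use only that $t \mapsto 1/t$ is decreasing and $t \mapsto t^2$ is increasing for $t > 0$, and that a constant factor of $\tfrac{1}{2}$ does not move the minimizer. The squared form is preferred because it is a quadratic function of $\mathbf{w}$, which is what quadratic programming handles.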

75. Андрей Серебро says:

Mm, why are we taking expected value of Eout on the last slide when Eout is already the expected out of sample error? What is this value with respect to which we marginalize Eout? I just didn't catch it quite well. Is it about averaging over different transformations?

76. ECFMULTIMEDIA says:

The probability that a boy passes in front of the camera between 2' and 5' is too high 🙂

77. EST says:

This is the best (most geometrically intuitive) SVM lecture I have found so far. Thank you!

78. Solstice two says:

This explanation is really great. However, much more intuitive and better developed is the one in the Machine Learning course by Columbia University NY in EdX.org. It worthy to revise it.

79. khalid khalifa says:

Thank you sir! BTW, I would have applauded at this moment of the lecture: 22:37

80. sepide tari says:

I did not understand what was explained about W, how it can be three dimension after replacing all x_n with X_n in SV, at minute 52.

81. Anoubhav Agarwaal says:

what does VC stand for?

82. ikhee shin says:

46:26 whole bunch of alphas are just zero

83. surflaweb says:

bla blab bla and the end you will Python with sklearn 🙁

84. klaudyul says:

I am still a bit confused on the minute 22:36 he talks about the distance of the point to the plane being set to 1 ( as wx+b=1 ), and still the distance is 1/|w|. What am I missing?
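
On this question: the value 1 is not the distance. It is the value of the signal $|\mathbf{w}^{\mathsf T}\mathbf{x}_n + b|$ at the nearest point after normalization. A sketch of the step that converts one into the other follows, with $\mathbf{x}$ denoting any point on the plane (notation assumed for this note). The distance is the length of the projection of $\mathbf{x}_n - \mathbf{x}$ onto the unit normal $\mathbf{w}/\|\mathbf{w}\|$:

$$
\text{distance}(\mathbf{x}_n, \text{plane})
= \left| \frac{\mathbf{w}^{\mathsf T}}{\|\mathbf{w}\|} (\mathbf{x}_n - \mathbf{x}) \right|
= \frac{\left| (\mathbf{w}^{\mathsf T}\mathbf{x}_n + b) - (\mathbf{w}^{\mathsf T}\mathbf{x} + b) \right|}{\|\mathbf{w}\|}
= \frac{|\mathbf{w}^{\mathsf T}\mathbf{x}_n + b|}{\|\mathbf{w}\|}
= \frac{1}{\|\mathbf{w}\|}.
$$

The third equality uses $\mathbf{w}^{\mathsf T}\mathbf{x} + b = 0$ for a point on the plane, and the last uses the normalization. So the distance equals 1 only when $\|\mathbf{w}\|$ happens to be 1.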

85. ultimateabhi says:

30:36 what was the pun ?