## Lecture 14 – Support Vector Machines

ANNOUNCER: The following program is brought to you by Caltech.

YASER ABU-MOSTAFA: Welcome back. Last time, we talked about validation,

which is a very important technique in machine learning for estimating the

out-of-sample performance. And the idea is that we start from

the data set that is given to us, that has N points. We set aside K points for validation,

for just estimation, and we train with the remaining points, N

minus K. Because we are training with a subset, we end up with a hypothesis

that we are going to label g minus, instead of g. And it is on this g minus,

that we are going to get an estimate of the out-of-sample

performance by the validation error. And then there is a leap of faith, when

we put back all the examples in the pot in order to come up with the best

possible hypothesis– to work with the most training examples. We are going to get g, and we are using

the validation error we had on the reduced hypothesis, if you will,

to estimate the out-of-sample performance on the hypothesis

we are actually delivering. And there is a question of how

accurate an estimate this would be for E_out. And we found out that K cannot be too

small, and cannot be too big, in order for this estimate to be reliable. And we ended up with a rule of thumb

that about 20% of the data set goes to validation. That will give you

a reasonable estimate. Now this was an unbiased estimate. E_val can come out better than E_out or worse

than E_out in general, as far as estimating the performance

of g minus is concerned. On the other hand, once you use the

validation error for model selection, which is the main utility for

validation, you end up with a little bit of an optimistic bias, because you

chose a model that performs well on that validation error. Therefore, the validation error is not

going to necessarily be an unbiased estimate of the out-of-sample error. It will have a slight positive,

or optimistic, bias. And we showed an experiment, using

very few examples in this case in order to exaggerate the effect, where we can see the impact. The blue

curve is the validation error, and the red curve is the out-of-sample

error on the same hypothesis, just to pin down the bias. And we realize that, as we increase

the number of examples, the bias goes down. The difference between the

two curves goes down. And indeed, if you have a reasonable-size

validation set, you can afford to estimate a couple of parameters for

sure, without contaminating the data too much. So you can assume that the measurement

you’re getting from the validation set is a reliable estimate. Then, because the number of examples

turned out to be an issue, we introduced the cross-validation, which

is, by and large, the method of validation you’re going to be using

in a practical situation. Because it gets you the

best of both worlds. So here we are illustrating a case where we have

10-fold cross-validation. So you divide the data

set into 10 parts. You train on nine, and validate

on the tenth, and keep that estimate of the error. And you keep repeating as you

choose the validation subset to be one of those. So you have 10 runs. And each of them gives you an estimate

on a small number of examples, 1/10 of the examples. And then by the time you average all of

these estimates, that will give you a general estimate of what the out-of-sample

error would be on 9/10 of the data, in spite of the fact that they

are different 9/10 each time. And in that case, the advantage of it

is that the 9/10 is very close to 1, so the estimate you are

getting is very close. And furthermore, the number of examples

taken into consideration in getting an estimate of the validation

error is really N. You got all of them, albeit in different runs. So this is really the way to go

in cross-validation. And invariably, in any learning
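The 10-fold procedure just described can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names (`k_fold_indices`, `fit_line`, `cross_validation_error`), the synthetic noisy-line data, and the least-squares line with squared error are hypothetical stand-ins, not something from the lecture. Any learning algorithm and error measure could take their place.

```python
import random

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k disjoint folds of nearly equal size."""
    return [list(range(i, n, k)) for i in range(k)]

def fit_line(points):
    """Least-squares fit y = a*x + c; stands in for any learning algorithm."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    a = sxy / sxx
    return a, my - a * mx

def squared_error(model, points):
    """Mean squared error of the fitted line on the given points."""
    a, c = model
    return sum((a * x + c - y) ** 2 for x, y in points) / len(points)

def cross_validation_error(data, k=10):
    """Average the k validation errors, each measured on a held-out fold."""
    folds = k_fold_indices(len(data), k)
    errors = []
    for held_out in folds:
        held = set(held_out)
        train = [p for i, p in enumerate(data) if i not in held]  # 9/10 of the data
        val = [data[i] for i in held_out]                         # 1/10 of the data
        g_minus = fit_line(train)       # hypothesis trained on the reduced set
        errors.append(squared_error(g_minus, val))
    return sum(errors) / len(folds)

# Hypothetical data: a noisy line, 50 points.
random.seed(0)
data = [(x / 10, 2 * (x / 10) + 1 + random.gauss(0, 0.1)) for x in range(50)]
folds = k_fold_indices(len(data), 10)
e_cv = cross_validation_error(data, k=10)
print("cross-validation estimate:", e_cv)
```

Each point lands in exactly one validation fold, which is why the averaged estimate effectively draws on all N examples even though every individual run validates on only a tenth of them.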

situation, you will need to choose a model, a parameter, something–

to make a decision. And validation is the method of choice

in that case, in order to make that decision. OK. So we move on to today’s lecture, which

is Support Vector Machines. Support vector machines are arguably

the most successful classification method in machine learning. And they are very nice, because

there is a principled derivation for the method. There is a very nice optimization

package that you can use in order to get the solution. And the solution also has a very

intuitive interpretation. So it’s a very, very neat piece

of work for machine learning. So the outline will be the following. We are going to introduce the notion

of the margin, which is the main notion in support vector machines. And we’ll ask a question of maximizing

the margin– getting the best possible margin. And after formulating the problem, we

are going to go and get the solution. And we’re going to do

that analytically. It will be a constrained

optimization problem. And we faced one before in

regularization, where we gave a geometrical solution, if you will. This time we are going to do it

analytically, because the formulation is simply too complicated to have

an intuitive geometric solution for. And finally, we are going to expand from

the linear case to the nonlinear case in the usual way, thus expanding

all of the machinery to a case where you can deal with nonlinear surfaces,

instead of just a line in a separable case, which is the main case

we are going to handle. So now let’s talk about

linear separation. Let’s say I have a linearly

separable data set. Just take four points, for example. There are lines that will separate

the red from the blue. Now, when you apply perceptron,

you will get a line. When you apply any algorithm, you will

get a line, and separate– you get 0 training error. And everything is fine. And now there is a curious point

when you ask yourself: I can get different lines. Is there any advantage of choosing

one of the lines over the other? That is the new addition

to the problem. So let’s look at it. Here is a line. So I chose this line to

separate the two. You may not think that this

is the best line. And we’ll try to take our intuition

and understand why this is not the best line. So I’m going to think of a margin,

that is, if this line moves a little bit, when is it

going to cross over? When is it going to start

making an error? So in this case, let’s put it as

a yellow region around it. That’s the margin you have. So if you choose this line, this

is the margin of error. Sort of informal notion. Now you can look at this line. And it does seem to have

a better margin. And you can now look at the problem

closely and say: let me try to get the best possible margin. And then you get this line, which has

this margin, that is exactly right for the blue and red points. Now, let us ask ourselves

the following question. Which is the best line

for classification? As far as the in-sample error is

concerned, all of them give in-sample error 0. As far as generalization questions are

concerned, as far as our previous analysis goes, all of them

are dealing with a linear model with four points. So generalization, as an estimate,

will be the same. Nonetheless, I think you will agree with

me that if you had your choice, you will choose the fat margin. Somehow it’s intuitive. So let’s ask two questions. The first one is: why is

a bigger margin better? Second one. If we are convinced that a bigger

margin is better, then you ask yourself: can I solve for w

that maximizes the margin? Now it is quite intuitive that the

bigger margin is better, because think of a process that is generating

the data. And let’s say that there

is noise in it. If you have the bigger margin, the

chances are the new point will still be on the correct side of the line. Whereas, if I use this one, there’s

a chance that the next red point will be here, and it will be misclassified. Again, I’m not giving any proofs. I’m just giving you an intuition here. So it stands to reason that indeed,

the bigger margin is better. And now we’re going to argue that the

bigger margin is better for a reason that relates to our VC

analysis before. So anybody remember the growth

function from ages ago? What was that? So we take the dichotomies of the

line on points in the plane. And let’s say, we take three points. So on three points, you can get all

possible dichotomies by a line. The blue versus not-blue region. And you can see that by varying where

the line is, I can get all possible 2 to the 3 equals 8 dichotomies here. So you know that the growth

function is big. And we know that the growth function

being big is bad news for generalization. That was our take-home lesson. So now let’s see if this is

affected by the margin. So now we are taking dichotomies, not only

by the line, but also requiring that the dichotomies have a fat margin. Let’s look at dichotomies,

and their margin. Now in this case, I’m putting

the same three points. And I’m putting a line that has the

biggest possible margin for the constellation of points I have. So you can see here. I put

it. It sandwiched them. Every time, it touches all the points. It cannot extend any further because

it will get beyond the points. And when you look at it, this

is a thin margin for this particular dichotomy. This is an intermediate one. This is a fat one. And this is a hugely fat one,

but that’s the constant one. That’s not a big deal. Now let’s say that I told you that you

are allowed to use a classifier, but you have to have at least that

margin for me to accept it. So now I’m requiring the margin

to be at least something. All of a sudden, these guys that used to

be legitimate dichotomies using my model, are no longer allowed. So effectively by requiring the margin

to be at least something, I’m putting a restriction on the growth function. Fat margins imply fewer

dichotomies possible. And therefore, if we manage to separate

the points with a fat dichotomy, we can say that fat

dichotomies have a smaller VC dimension, smaller growth function than

if I didn’t restrict them at all. And, although this is all informal, we

will come at the end of the lecture to a result that estimates the out-of-

sample error based on the margin. And we will find out that indeed, when

you have a bigger margin, you will be able to achieve better out-of-sample

performance. So now that I completely and irrevocably

convinced you that the fat margins are good, let us

try to solve for them. That is, find the w that not only

classifies the points correctly, but does so with the biggest

possible margin. So how are we going to do that? Well the margin is just the distance

from the plane to a point. So I’m going to take from the data set

the point x_n, which happens to be the nearest data point to the line, that we

have used in the previous example. And the line is given by the

linear equation w transposed x equals 0. And since we’re going to use a higher

dimensional thing, I’m not going to refer to it as a line. I’m going to refer to it as a plane– hyperplane really– but

just plane for short. So we’re talking about d-dimensional

space and a hyperplane that separates the points. So we would like to estimate that. And we ask ourselves: if I give you w

and the x’s, can you plug them into a formula and give me the distance

between that plane, that is described by w, and the point x_n? I’m now taking the nearest point,

because then that distance will be the margin that I’m talking about. Now there are two preliminary

technicalities that I’m going to invoke here. And they will simplify the

analysis later on. So here is the first one. The first one is to normalize

w. What do I mean by that? For all the points in the data set,

near and far, when you take w transposed times x_n, you will get

a number that is different from 0. And indeed, it will agree with the

label y_n, because the points are linearly separable. So I can take the absolute value of this,

and claim that it’s greater than 0 for every point. Now I would like to relate w to the

margin, or to the distance. But I realize that here, there is a minor

technicality that is annoying. Let’s say that I multiply the

vector w by a million. Does the plane that I’m

talking about change? No. This is the equation of it. I can multiply by any positive number,

and I get the same plane. So the consequence of that is that any

formula that takes w and produces the margin will have to have, built

in it, scale invariance. We’ll be dividing by something that

takes out that factor that does not affect which plane I’m talking about. So I’m going to do it now, in order

to simplify the analysis later. I’m going to consider all

representations of the same plane. And I’m going to pick one where this is

normalized, by requiring that for the minimum point, this fellow is 1. I can always do that. I can scale w up and down until

I get the closest one to have this equal to 1. There’s obviously no

loss in generality. Because in this case, this is a plane. And I have not missed any

planes by doing that. Now the quantity w transposed x_n, which is
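The normalization step can be checked numerically. The sketch below is a hedged illustration, not the lecture's own code: the four toy points, the weight vector, and the helper names `signal` and `normalize` are assumptions made up for the example. The points carry the constant coordinate x_0 = 1, as in the convention used up to this point of the lecture.

```python
def signal(w, x):
    """w transposed x, with x already carrying the constant coordinate x_0 = 1."""
    return sum(wi * xi for wi, xi in zip(w, x))

def normalize(w, points):
    """Rescale w so that the nearest point has |w transposed x| equal to 1."""
    s = min(abs(signal(w, x)) for x in points)
    return [wi / s for wi in w]

# Hypothetical toy data: (x_0, x_1, x_2) with x_0 = 1.
points = [(1.0, 0.0, 0.0), (1.0, 2.0, 2.0), (1.0, 2.0, 0.0), (1.0, 3.0, 0.0)]

w = [-2.0, 2.0, -2.0]                 # one representation of the plane
w_million = [1e6 * wi for wi in w]    # the same plane, scaled by a million

w_a = normalize(w, points)
w_b = normalize(w_million, points)
nearest = min(abs(signal(w_a, x)) for x in points)

print(w_a, w_b, nearest)
```

Both representations collapse to the same normalized vector, which illustrates why fixing the nearest point's signal at 1 loses no planes: it only picks one scale out of the many that describe each plane.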

the signal, as we talked about it, is a pretty interesting thing. So let’s look at it. I have the plane. So the plane has the signal equals 0. And it doesn’t touch any points. The points are linearly separable. Now when you get the signal

to be positive, you are moving in one direction. You hit the closest point. And then you hit more points, the

interior points, so to speak. And when you go in the other direction

and it’s negative, you hit the other points, the nearest point on the

negative side, and then the interior points which are further out. So indeed that signal actually relates

to the distance, but it’s not the Euclidean distance. It just has an order of the points,

according to which is nearest and which is furthest. But what I’d like to do, I would

like to actually get the Euclidean distance. Because I’m not comparing the

performance of this plane on different points. I’m comparing the performance of

different planes on the same point. So I have to have the same yardstick. And the yardstick I’m going to

use is the Euclidean distance. So I’m going to take this

as a constraint. And when I solve for it, I will find out

that the problem I’m now solving for is much easier to solve. And then I can get the plane. And the plane will be general

under this normalization. The second one is pure technicality. Remember that we had x being in

Euclidean space R to the d. And then we added this artificial

coordinate x_0 in order to take care of w_0 that was the threshold, if you

think of it as comparing with a number, or a bias if you think

of it as adding a number. And that was convenient just to have

the nice vector and matrix representation and so on. Now it turns out that, when you solve for

the margin, the w_1 up to w_d will play a completely different role

from the role w_0 is playing. So it is no longer convenient to

have them as the same vector. So for the analysis of support vector

machines, we’re going to pull w_0 out. So the vector w now is the

old vector w_1 up to w_d. And you take out w_0. And in order not to confuse it and call

it w, because it has a different role, we are going to call

it here b, for bias. OK? So now the equation for the plane is w,

our new w, times x plus b equals 0. And there is no x_0. x_0 used to be multiplied

by b, also known as w_0. So every w you will see in this

lecture will belong to this convention. And now, if you look at the normalization, it

will be: the absolute value of w transposed x_n plus b equals 1. And the plane will be w transposed

x plus b equals 0. Just a convention that will make

our math much more friendly. So these are the technicalities that

I wanted to get out of the way. Now, big box, because it’s

an important thing. It will stay with us. And then we go for computing

the distance. So now, we would like to get

the distance between x_n– we took x_n to be the nearest point, and therefore the distance

will be the margin. And we want to get the distance

from the plane. So let’s look at the geometry

of the situation. I have this as the equation

for the plane. And I have the conditions

that I talked about. This is the geometry. I have a plane. And I have a point x_n. And I’d like to estimate the distance. First statement. The vector w is perpendicular

to the plane. That should be easy enough if you have

seen any geometry before, but it’s not very difficult to argue. But remember now that the vector

w is in the X space. I’m not talking about

the weight space. I’m talking about w as you plug in

the values and you get a vector. And I’m looking at that vector

in the input space X. And I’m saying it’s perpendicular

to the plane. Why is that? Because let’s say that you

pick any two points– call them x dash and x double

dash– on the plane proper. So they are lying there. What do I know about these two points? Well, they are on the plane, so

they had better satisfy the equation of the plane. Right? So I can conclude that it must be that,

when I plug in x dash in that equation, I will get 0. And when I plug in x double

dash, I will get 0. Conclusion: If I take the difference between these

two equations, I will get w transposed times x dash minus

x double dash, equals 0. And now you can see that

good old b dropped out. And this is the reason why it has

a different treatment here. The other guys actually mattered. But the b plays a different role. So when you see an equation like

that, your conclusion is what? Your conclusion is that w, as a vector,

must be orthogonal to x dash minus x double dash, as a vector. So when you look at the plane, here is

the vector x dash minus x double dash. Let me magnify it. And this must be orthogonal

to the vector w. So the interesting thing is that we

didn’t make any restrictions on x dash and x double dash. These could be any two points

on the plane, right? So now the conclusion is that w, which is

the same w– the vector w that defines the plane, is orthogonal to

every vector on the plane. Right? Therefore, it is orthogonal

to the plane. So we got that much. We know that now w has
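The orthogonality argument is easy to verify on a concrete plane. In this minimal sketch the plane (w, b) and the two points on it are hypothetical values chosen for illustration. The code plugs both points into the plane equation and then takes the inner product of w with their difference.

```python
def dot(u, v):
    """Inner product of two vectors given as sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

# Hypothetical plane in two dimensions: x1 - x2 - 1 = 0.
w, b = [1.0, -1.0], -1.0

x_dash = [1.0, 0.0]     # satisfies 1 - 0 - 1 = 0
x_ddash = [3.0, 2.0]    # satisfies 3 - 2 - 1 = 0

# Both points satisfy the plane equation w transposed x plus b equals 0.
on_plane = (dot(w, x_dash) + b, dot(w, x_ddash) + b)

# Subtracting the two equations removes b and leaves w . (x' - x'') = 0.
difference = [p - q for p, q in zip(x_dash, x_ddash)]
inner = dot(w, difference)

print(on_plane, inner)
```

Note that b never enters the final inner product, which mirrors the remark that b drops out when the two equations are subtracted.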

an interpretation. Now we can get the distance. Once you know that w is orthogonal

to the plane, you probably can get the distance. Because what do we have? The distance between x_n and the plane,

and we put them here, is what? Can be computed as follows. Pick any point, one point,

on the plane. We just call it generic x. And then you take the projection of the

vector going from here to here. You project it on the direction which

is orthogonal to the plane. And that will be your distance. Right? So we just need to put the mathematics

that goes with that. So here’s the vector. And here is the other vector, which we

know is orthogonal to the plane. Now if you project this fellow on this

direction, that length will give you the distance. Now in order to get the projection,

what do you do? You get the unit vector

in the direction. So you take w, which is this vector–

could be of any length– and you normalize it by its norm. And you get a unit vector under

which the projection would be simply a dot product. So now the w hat is a shorter w,

if the norm of w happens to be bigger than 1. And what you get– you get the distance being

simply the inner product. You take the unit vector, dot that. And that is your distance. Except for one minor issue. This could be positive or negative

depending on whether w is facing x or facing the other direction. So in order

to get the distance proper, you need the absolute value. So we have a solution for it. Now we can write the distance as– this is the formula. Now I multiply it by w hat. I know what the formula for w hat is. I write it down. And now I have it in this form. Now this can be simplified if I add

the missing term, plus b minus b. Why is that? Can someone tell me what is w^T x plus

b, which is this quantity being subtracted here? This is the value of the equation of

the plane, for a point on the plane. So this will happen to be 0. How about this quantity, w^T x_n

plus b, for my point x_n. Well, that was the quantity

that we insisted on being 1. Remember when we normalized the w,

because w’s could go up and down. And we scaled them such that the

absolute value of this quantity is 1. So all of a sudden, this

thing is just 1. And you end up with the formula for the

distance, given that normalization, being simply 1 over the norm. That’s a pretty easy thing to do. So if you take the plane and insist on

a canonical representation of w by making this part 1 for the nearest

point, then your margin will simply be 1 over the norm of w you used. This I can use, in order now to choose
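The distance derivation can be sketched as code. This is an illustration under assumptions: the plane, the nearest point `x_n`, and the generic point `x` on the plane are hypothetical, chosen so that the normalization (absolute signal equal to 1 at the nearest point) already holds. The sketch computes the distance three ways: by projecting onto the unit vector, by the general formula, and by 1 over the norm.

```python
import math

def dot(u, v):
    """Inner product of two vectors given as sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

# Hypothetical normalized plane: x1 - x2 - 1 = 0.
w, b = [1.0, -1.0], -1.0
x_n = [2.0, 0.0]   # nearest point: w . x_n + b = 1
x = [1.0, 0.0]     # a generic point on the plane: w . x + b = 0

norm_w = math.sqrt(dot(w, w))
w_hat = [wi / norm_w for wi in w]              # unit vector orthogonal to the plane

# Projection of (x_n - x) on the direction w_hat, made positive.
d_projection = abs(dot(w_hat, [p - q for p, q in zip(x_n, x)]))

# Same quantity after adding and subtracting b: |w . x_n + b| / ||w||.
d_formula = abs(dot(w, x_n) + b) / norm_w

# Under the normalization the numerator is 1, so the margin is 1 / ||w||.
margin = 1.0 / norm_w

print(d_projection, d_formula, margin)
```

All three numbers agree only because the chosen w and b satisfy the normalization. For an arbitrary representation of the same plane, the first two still agree but 1 over the norm does not.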

what combination of w’s will give me the best possible margin,

which is the next one. So let’s now formulate the problem. Here is the optimization

problem that resulted. We are maximizing the margin. The margin happens to

be 1 over the norm. So that is what we are maximizing. Subject to what? Subject to the fact that for the nearest

point, which happens to have the smallest value of those guys– so

the minimum over all points in the training set. I took the quantity here

and scaled w up or down in order to make that quantity 1. So I take this as a constraint. When you constrain yourself this way,

then you are maximizing 1 over the norm of w. And that is what you get. So what do we do with this? Well, this is not a friendly

optimization problem. Because if the constraints have

a minimum in them, that’s bad news. Minimum is not a nice

function to have. So what we are going to do now, we are

going to try to find an equivalent problem that is more friendly. Completely equivalent, by very

simple observations. So the first observation is that I

want to get rid of the minimum, not to mention the absolute value. That’s my biggest concern. So the first thing I notice is that the absolute value of the signal, w transposed x_n plus b,

happens to be equal to y_n times that same signal. Why is that? Well, every point is

classified correctly. I’m only considering the planes that

separate the data set correctly. And I’m choosing between them, for the

one that maximizes the margin. Because they are classifying the points

correctly, it has to be that the signal agrees with the label. Therefore when you multiply, the label is

just +1 or -1, and therefore it takes care of the absolute value part. So now I can use this instead

of the absolute value. I still haven’t gotten

rid of the minimum. And I don’t particularly like dividing

1 over the norm, which has a square root in it. But that is very easily handled. Instead of maximizing 1 over the norm,

I’m going to minimize this friendly quadratic quantity, one half w transposed w. I’m minimizing now. So instead of maximizing 1 over the norm, I am

minimizing that. Everybody sees that it’s equivalent. So now we can see. Does anybody see quadratic programming

coming up in the horizon? There’s our quadratic formula. The only thing I need to do is just have

the constraints being friendly constraints, not a minimum

and absolute value. Just inequality constraints

that are linear in nature. And I claim that you can do this by

simply taking subject to these. So this doesn’t bother me, because I

already established that it deals with the absolute value. But here, I’m taking greater than

or equal to 1 for all points. I can see that if the minimum

is 1, then this is true. But it is conceivable that I do this

optimization, and I end up with a quantity for which all of these guys happen

to be strictly greater than 1. That is a feasible point, according

to the constraints. And if this by any chance gives me the

minimum, then that is the minimum I’m going to get. And the problem with that is that this

is a different statement from the statement I made here. That’s the only difference. Well, is it possible that the minimum

will be achieved at a point where this is greater than 1 for all of them? A simple observation tells you: no,

this is impossible. Because let’s say that you

got that solution. You tell me: this is the minimum I

can get for w transposed w, right? And I got it for values where this

is strictly greater than 1. Then what I’m going to do, I’m going

to ask you: give me your solution. And I’m going to give you

a better solution. What am I going to do? I’m going to scale w and b

proportionately down until they touch the 1. You have a slack, right? So I can just pull all of them, just

slightly, until one of them touches 1. Now under those conditions, definitely,

if the original constraints were satisfied, the new

constraints will be satisfied. All of them are just proportional. I can pull out the factor, which

is a positive factor. And indeed, if this is the case, this

will be the case for the other one. And the point is that the w I got is

smaller than yours because I scaled them down, right? So it must be that my solution

is better than yours. Conclusion: When you solve this, the w that you will

get necessarily satisfies these with at least one of those

guys with equality. Which means that the minimum is 1. And therefore, this problem is

equivalent to this problem. This is really very nice. So we started from a concept, and geometry,
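The scaling argument can be replayed numerically. In this hedged sketch the labeled toy points and the starting solution are hypothetical. The starting (w, b) satisfies every constraint with slack, and scaling w and b down proportionately until the tightest constraint touches 1 gives a feasible solution with a smaller objective.

```python
def dot(u, v):
    """Inner product of two vectors given as sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def constraint_values(w, b, data):
    """y_n * (w . x_n + b) for every point; feasibility needs all of them >= 1."""
    return [y * (dot(w, x) + b) for x, y in data]

def objective(w):
    """One half w transposed w, the quantity being minimized."""
    return 0.5 * dot(w, w)

# Hypothetical linearly separable data: (point, label).
data = [((0.0, 0.0), -1), ((2.0, 2.0), -1), ((2.0, 0.0), 1), ((3.0, 0.0), 1)]

# A feasible candidate where every constraint is strictly greater than 1.
w, b = [3.0, -3.0], -3.0
values = constraint_values(w, b, data)

# Scale w and b down by the same positive factor until one constraint equals 1.
s = min(values)
w_scaled, b_scaled = [wi / s for wi in w], b / s
values_scaled = constraint_values(w_scaled, b_scaled, data)

print(values, objective(w))
print(values_scaled, objective(w_scaled))
```

The scaled solution is still feasible and has a strictly smaller objective, so a candidate with slack on every constraint cannot be the minimizer.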

and simplification, and now we end up with this very friendly

statement that we are going to solve. And when you solve it, you’re going to

get the separating plane with the best possible margin. So let’s look at the solution. Formally speaking, let’s put it in

a constrained optimization question. The constrained optimization here– you

minimize this objective function subject to these constraints. We have seen those. And the domain you’re working on,

w happens to be in the Euclidean space R to the d. b happens to be a scalar, belongs

to the real numbers. That is the statement. Now when you have a constrained
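Written out, the statement reads as follows. This is a reconstruction from the spoken description rather than a copy of the slide. The factor of one half is assumed from the later remark that the square goes with the half when differentiating, and it does not change the minimizer.

```latex
\begin{aligned}
\text{minimize}\quad   & \tfrac{1}{2}\,\mathbf{w}^{\mathsf T}\mathbf{w} \\
\text{subject to}\quad & y_n\left(\mathbf{w}^{\mathsf T}\mathbf{x}_n + b\right) \ge 1,
                         \qquad n = 1, \dots, N, \\
                       & \mathbf{w} \in \mathbb{R}^{d}, \quad b \in \mathbb{R}.
\end{aligned}
```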

optimization– we have a bunch of constraints here. And we will need to go an analytic

route in order to solve it. Geometry won’t help us very much. So what we’re going to do here, we are

going to ask ourselves: oh, constrained optimization. I heard of Lagrange. You form a Lagrangian, and then all of

a sudden the constrained problem becomes unconstrained, and you solve it, and

you get the multipliers lambda. Lambda is pretty much what we got

in regularization before. We did it geometrically. We didn’t do it explicitly

with Lagrange. But that’s what you get. Now the problem here is that the

constraints you have are inequality constraints, not equality constraints. That changes the game a little

bit, but just a little bit. Because what people did is simply look

at these and realize that there is a slack here. If I call the slack s squared,

I can make this equality. And then I can solve the old Lagrangian,

with equality. I can comment on that in the

Q&A session, because it’s a very nice approach. And that approach was derived

independently by two sets of people, Karush, which is the first K, and

Kuhn-Tucker, which is the KT. And the Lagrangian under the inequality

constraint is referred to as KKT. So now, let us try to solve this. And I’d like, before I actually go

through the mathematics of it, to remind you that we actually saw this

before in the constrained optimization we solved before under inequality constraints,

which was regularization. And it is good to look at that picture,

because it will put the analysis here in perspective. So in that case, you don’t

have to go through the details. We were minimizing something– you don’t have to worry about the

formula exactly– under a constraint. And the constraint is an inequality

constraint that resulted in weight decay, if you remember. And we had a picture

that went with it. And what we did was, we looked

at the picture and found a condition for the solution. And the condition for the solution

showed that the gradient of your objective function, of the thing you are

trying to minimize, becomes something that is related to

the constraint itself. In this case: normal. The most important aspect to realize is

that, when you solve the constrained problem here, the end result was

that the gradient is not 0. It would have been 0 if the

problem was unconstrained. If I asked you to minimize this, you

just go for gradient equals 0, and solve. So now, because of the constraint, the

constraint kicks in, and you have the gradient being something related

to the constraint. And that’s what will happen

exactly when we have the Lagrangian in this case. But one of the benefits of reminding you of regularization is that there’s

a conceptual dichotomy, no pun intended, between

regularization and the SVM. SVM is what we’re doing here,

maximizing the margin; regularization is what we did before. So let’s look at both cases and ask

ourselves: what are we optimizing, and what is the constraint? If you remember in regularization,

we already have the equation. What we are minimizing is

the in-sample error. So we are optimizing E_in, under the

constraints that are related to w transposed w, the size of the weights. That was weight decay. If you look at the equation we just

found out in order to maximize the margin, what we are actually optimizing

is w transposed w. That is what you’re trying

to minimize. Right? And your constraint is that you’re

getting all the points right. So your constraint is that E_in is 0. So it’s the other way around. But again, because both of them will

blend in the Lagrangian, and you will end up doing something that is

a compromise, it’s conceptually not a big shock that we are reversing roles

here, and minimizing what is in our mind a constraint, and constraining

what is in our mind an objective function to be minimized. Back to the formulation. So now, let’s look at the

Lagrange formulation. And I would like you to pay

attention to this slide. Because once you get the formulation,

we’re not going to do much beyond getting a clean version of the

Lagrangian, and then passing it on to a package of quadratic programming

to give us a solution. But at least, arriving

there is important. So let’s look at it. We are minimizing– this is our objective function–

subject to constraints of this form. First step, take the inequality

constraints and put them in the 0 form. So what do I mean by that? Instead of saying that’s greater than or

equal to 1, you move the 1 over as minus 1, and then require that this is greater

than or equal to 0. And now you see, it got multiplied

by a Lagrange multiplier. So think of this: since this should be

greater than or equal to 0, this is the slack. So the Lagrange multipliers get

multiplied by the slack. And then you add them up. And they become part of the objective. And they come out as a minus, simply

because the inequalities here are in the direction greater than or equal to. That’s what goes with the minus here. I’m not proving any of that. I’m just motivating for you that this

formula makes sense, but there’s mathematics that actually

pins it down exactly. And you’re minimizing this. So now let’s give it a name. It’s a Lagrangian. It is dependent on the variables

that I used to minimize with respect to, w and b. And now I have a bunch of new variables

which are the Lagrange multipliers, the vector alpha, which

is called lambda in other cases. Here it’s standard, alpha. And there are N of them. There’s a Lagrange multiplier

for every point in the set. We are minimizing this
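Putting the pieces just described together (the objective, minus each multiplier times its slack, summed over the N points), the Lagrangian can be written as below. This is a reconstruction from the spoken description, so the typesetting choices are assumptions.

```latex
\mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha})
  = \tfrac{1}{2}\,\mathbf{w}^{\mathsf T}\mathbf{w}
  - \sum_{n=1}^{N} \alpha_n \left( y_n\left(\mathbf{w}^{\mathsf T}\mathbf{x}_n + b\right) - 1 \right)
```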

with respect to what? With respect to w and b. So that was the original thing. The interesting part, which you should

pay attention to, is that you’re actually maximizing with

respect to alpha. Again, I’m not making a mathematical

proof that this method holds. But this is what you do. And it’s interesting because when we

had equality, we didn’t worry about maximization versus minimization. Because all you did, you get

the gradient equals 0. So that applies for both

maximum and minimum. So we didn’t necessarily

pay attention to it. Here you have to pay attention to it,

because you are maximizing with respect to alphas, but the alphas

have to be non-negative. Once you restrict the domain, you can’t

just get the gradient to be 0, because the function– if the function was all over and this

way, you get the minimum. And minimum has gradient 0. But if I tell you to stop here, the

function could be going this way. And this is the point you’re

going to pick. And the gradient here

is definitely not 0. So the question of maximizing versus

minimizing, you need to pay attention here. We are not going to pay too much

attention to it, because we’ll just tell the quadratic programming

guy, please maximize. And it will give us the solution. But that is the problem

we are solving. So now we do at least

the unconstrained part. With respect to w and b, you

are just minimizing this. So let’s do it. We’re going to take the gradient

of the Lagrangian with respect to w. So I’m getting partial by partial

for every weight that appears. And I get the equation here. How do I get that? I can differentiate. So I’m going to differentiate this. I get a w. The squared goes with the half. When I get this, I ask myself: what

is the coefficient of w? I get alpha_n, y_n, and x_n. Right? That one gets multiplied by w for

every n equals 1 to N. So I get that. And I have a minus sign

here, that comes here. Everything else drops out. So this is the formula. And what do I want the gradient to be? I want it to be the vector 0. So that’s a condition. What is the other one? I now get the derivative with

respect to b. b is a scalar. That’s the remaining parameter. And when I look at it,

can we do this? What gets multiplied by b? Oh I guess it’s just the alphas. Everything else drops out. So– oh, not just alphas! It’s y_n. So here’s the b. It gets multiplied

by y_n and alpha. And that’s what I get. And you get this to be equal

to the scalar 0. So optimizing this with respect to

w and b resulted in these two conditions. Now what I’m going to do, I’m going

to go back and substitute with these conditions in the original

Lagrangian, such that the maximization with respect to alpha– which is the tricky part, because alpha

has a range– will become free of w and b. And that formulation is referred to as

the dual formulation of the problem. So let’s substitute. Here is what I got from

the last slide. This one I got from the gradient

with respect to w equals 0. So w has to be this. And this one from the partial

by partial b, equals 0. I get those. And now I’m going to substitute

them in the Lagrangian. And the Lagrangian has that form. Now let’s do this carefully, because

things drop out nicely. And I get a very nice formula at the

end, which is a function of alpha only. So this equals– first part, I get the summation

of the Lagrange multipliers. Where did I get that? I got that because I

have -1 here. It gets multiplied by alpha_n

for all of those. Canceled with this minus, so

I get summation over that. So this part I got. So let me kill the part

that I already used. So I kill the -1. So that part I got. Next. I look at this and say: I have

+b here, right? So when I take +b, it gets

multiplied by y_n alpha_n, summed up from n equals 1 to N. Now, I look at

this and say: oh, the summation of alpha_n y_n from n

equals 1 to N is 0. So the guys that get multiplied

by b, will get to 0. And therefore, I can kill +b. Now when I have it down to this,

it’s very easy to see. Because you look at the form for w, when

you have w transposed w, you are going to get a quadratic version of this. You get some double summation,

alpha alpha y y x x, right? With the proper name of the dummy

variable, to get it right. And when you have here, well, you have

already alpha_n y_n and x_n, and now when you substitute w by this, you’re

going to get exactly the same thing. You’re going to get another alpha,

another y, another x. So this will be exactly the same as

this, except that this one has a factor half, this has

a factor -1. So you add them up. And you end up with this. So we look at this: what

happened to w? What happened to b? All gone. We are now just a function of

the Lagrange multipliers. And therefore, we can call

this L of alpha. Now this is a very nice quantity to

have, because this is a very simple quadratic form in the vector alpha. Alpha here appears as a linear guy. Here appears as a quadratic guy. That’s all. Now I need to put the constraints. I put back the things I took out. And let’s look at the maximization

with respect to alpha, subject to non-negative ones. This is a KKT condition. I have to look for solutions

under these conditions. And I also have to consider the

conditions that I inherited from the first stage. So I have to satisfy this, and I

have to satisfy this, for the solution to be valid. So this one is a constraint over the

alphas, and therefore I have to take it as a constraint here. But I don’t have to take the constraint

here, because that is vacuous as far as alphas are concerned. This does no constraint over

alphas whatsoever. You do your thing. You come up with alphas. And you call whatever that formula

is, the resulting w. Since w doesn’t appear in

optimization, I don’t worry about it at all. So I end up with this thing. Now if I didn’t have those annoying

constraints, I would be basically done. Because I look at this,

that’s pretty easy. I can express one of the alphas in

terms of the rest of the alphas. Right? Factor it out. Substitute for that alpha here. And all of a sudden, I have a purely

unconstrained optimization for a quadratic one. I solve it. I get something, maybe a pseudo inverse

or something, and I’m done. But I cannot do that simply because

I’m restricted to those choices. And therefore, I have to work with

a constrained optimization, albeit a very minor constrained optimization. Now let’s look at the solution. The solution goes with quadratic

programming. So the purpose of the slide here is

to translate the objective and the constraints we had into the coefficients

that you’re going to pass on to a package called quadratic

programming. So this is a practical slide. First, what we are doing is maximizing

with respect to alpha this quantity that we found, subject to

a bunch of constraints. Quadratic programming packages come

usually with minimization. So we need to translate this

into minimization. How are we going to do that? We’re just going to get

the minus of that. So this would become this minus that. So let’s do that. We got the minus, minimum of this. So now it’s ready to go. Now the next step will

be pretty scary. Because what I’m going to do, I’m going

to expand this, isolating the coefficients from the alphas. The alphas are the parameters. You’re not passing alphas to

the quadratic programming. Quadratic programming works

with a vector of variables that you called alpha. What you are passing are the

coefficients of your particular problems that are decided by these

numbers, that the quadratic programming will take, and then will

be able to give you the alphas that would minimize this quantity. So this is what it looks like. I have a quadratic term,

alpha transposed, times a matrix, times alpha. And the entries of that matrix are the coefficients

in the double summation. These are numbers that you read

off your training data. You give me x_1 and y_1. I’m going to compute these numbers

for all of these combinations. And I end up with a matrix. That matrix gets passed to

quadratic programming. And quadratic programming asks you for

the quadratic term, and asks you for the linear term. Where the linear term, just to be

formal, happens to be, since we are just taking minus alpha, it’s -1 transposed

alpha, which is the sum of those guys. So this is the bunch of linear

coefficients that you pass. And then the constraints– you put the constraints again

in the same way, subject to. So there’s a part which asks

you for constraints. And here again, the constraints– you

care about the coefficients of the constraints. So this is a linear equality

constraint. So we are going to pass the y

transposed, which are the coefficients here, as a vector. And it will ask you for, finally, the

range of alphas that you need. And the range of alphas that you need

happens to be between 0, so that would be the vector 0– would

be your lower bound. Infinity will be your upper bound. So you read off this slide. You give it to the quadratic

programming. And the quadratic programming

gives you back an alpha. And if you’re completely discouraged by

this, let me remind you that all of this is just to give you what

to pass to the package. This actually looks exactly like this. That’s all you’re doing. A very simple quadratic function,

with a linear term. You’re minimizing it, subject to linear

equality constraint, plus a bunch of range constraints. And when you expand it, in terms of

numbers, this is what you get. And that’s what we’re going to use. So now we are done. We have done the analysis. We knew what to optimize. It fit one of the standard

optimization tools. It happens to be a convex function in

this case, so that the quadratic programming will be very successful. And then we pass it, and

we get a number back. Just a word of warning

before we go there. You look at the size of this matrix. And it’s N by N. Right? So the dimension of the matrix depends

on the number of examples. Well, if you have a hundred

examples, no sweat. If you have 1000 examples, no sweat. If you have a million examples,

this is really trouble. Because this is really a dense matrix. These numbers could come

up with anything. So all the entries matter. And if you end up with a huge matrix,

quadratic programming will have a pretty hard time finding the solution. To the level where there are tons of

heuristics to solve this problem when the number of examples is big. It’s a practical consideration, but

it’s an important consideration. But basically, if you’re working with

problems– the typical machine learning problem, where you have, let’s

say not more than 10,000, then it’s not formidable. 10,000 is flirting with danger,

but that’s what it is. So pay attention to the fact that, in

spite of the fact that there’s a standard way of solving it, and the

fact that it’s convex, so it’s friendly, it is not that easy when you

get a huge number of examples. And people have hierarchical methods

and whatnot, in order to deal with that case. So let’s say we succeeded. We gave the matrix and the vectors

to quadratic programming. Back comes what? Back comes alpha. This is your solution. So now we want to take this solution,

and solve our original problem. What is w, what is b, what is the

surface, what is the margin? You answer the questions that

all of this formalization was meant to tackle. So the solution is a vector of alphas. And the first thing is that it is very

easy to get the w because, luckily, the formula for w being this was one of

the constraints we got from solving the original one. When we got the gradient with respect to

w, we found out this is the thing. So you get the alphas, you plug them

in, and then you’ll get the w. So you get the vector

of weights you want. Now I would like to tell you a condition

which is very important. And it will be the key to defining

support vectors in this case, which is another KKT condition that will

be satisfied at the minimum, which is the following. Quadratic programming hands you alpha. Let’s say that– alpha is the same length

as the number of examples– let’s say you have 1000 examples. So it gives you a vector of 1000 guys. You look at the vector,

and to your surprise– you don’t know yet whether it’s pleasant

or unpleasant surprise– a whole bunch of the alphas are just 0. The alphas are restricted

to be non-negative. They all have to be greater

than or equal to 0. If you find any one of them negative,

then you say quadratic programming made a mistake. But it won’t make a mistake. It will give you numbers

that are non-negative. But the remarkable part, out of the

1000, more than 900 are 0’s. So you say: something is wrong? Is there a bug in my

thing or something? No. Because of the following. The following condition holds. It looks like a big condition. But let’s read it. This is the constraint in the 0 form. So this is greater than or equal to 1. So minus 1 would be greater

than or equal to 0. This is what we called the slack. So the condition that is guaranteed to

be satisfied, for the point you’re going to get, is that either the slack is

0, or the Lagrange multiplier is 0. The product of them will

definitely be 0. So if there’s a positive slack, which

means that you are talking about an interior point. Remember that I have a plane,

and I have a margin. And the margin touches

on the nearest point. And that is what defines the margin. Then there are interior points, where

the value is bigger than 1. At the nearest points, the

value is exactly 1, so the slack is 0. For the other ones, the slack

will be positive. So for all the interior points, you’re

guaranteed that the corresponding Lagrange multiplier will be 0. OK? I claim that we saw this before, again

in the regularization case. Remember this fellow? We had a constraint which is to

be within the red circle. And we’re trying to optimize a function

that has equi-potentials around this. So this is the absolute minimum. And it grows and grows and grows. And because we are in the constraint,

we couldn’t get the absolute minimum when we went there. When we had the constraint being

vacuous, that is, the constraint doesn’t really constrain us, and the

absolute optimal is inside, we ended up with no need for regularization,

if you remember? And the lambda for regularization

in that case was 0. That is the case, where you have

an interior point, and the multiplier is 0. And then when you got a genuine guy that

you have to actually compromise, you ended up with a condition that

requires lambda to be positive. So these are the guys where

the constraint is active. And therefore you get a positive lambda,

while this guy is by itself 0. So now we come to an interesting

definition. So alpha is largely 0’s,

interior points. The most important points in the game

are the points that actually define the plane and the margin. And these are the ones for which

alpha_n’s are positive. And these are called support vectors. So I have N points. And I classify them, and I

got the maximum margin. And because it’s a maximum margin, it

touched on some of the +1 and some of the -1 points. Those points support the

plane, so to speak. And they’re called support vectors. And the other guys are

interior points. And the mathematics of it tells us that

we can identify those, because we can go for lambdas that happen to be

positive, the alphas in this case. And the alpha greater than 0 will

identify a support vector. Again, when I put a box, it’s

an important thing. So this is an important notion. So let’s talk about support vectors. I have a bunch of points

here to classify. And I go through the entire machinery. I formulate the problem. I get the matrix. I pass it to quadratic programming. I get the alpha back. I compute the w. All of the above. And this is what I get. So where are the support

vectors in this picture? They are the closest ones

to the plane, where the margin region touched. And they happen to be these three. This one, this one, this one. So all of these guys that are here, and

all of these guys that are here will just contribute nothing to the solution. They will get alpha equals

0 in this case. And the support vectors achieve

the margin exactly. They are the critical points. The other guys– their margin, if you will,

is bigger or much bigger. And for the support vectors, you

satisfy this with equal 1. So all of this is fine. Now, we used to compute w in terms of

the summation of alpha_n y_n x_n. Because we said that this is the

quantity we got, when we got the gradient with respect to w equals 0. So this is one of the equations. And this is our way to get the alphas

back, which is the currency we get back from quadratic programming, and

plug it in, in order to get the w. This goes from n equals 1 to N.
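
To make the step from alphas to w concrete, here is a minimal sketch in plain Python. The four data points and the alpha values are hypothetical (they are not from the lecture); the alphas stand in for what a quadratic programming package would hand back, and they can be checked by hand against the conditions above (non-negative, and the sum of alpha_n y_n equal to 0). The helper names `dot`, `quadratic_coefficients`, `recover_w`, and `support_vector_indices` are illustrative only.

```python
# Hypothetical toy data (not from the lecture): four points in the plane.
X = [(0.0, 0.0), (2.0, 2.0), (2.0, 0.0), (3.0, 0.0)]
y = [-1.0, -1.0, 1.0, 1.0]

# Assumed to be what a quadratic programming package returns for this data.
alpha = [0.5, 0.5, 1.0, 0.0]


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def quadratic_coefficients(X, y):
    """N-by-N matrix with entries y_n * y_m * (x_n . x_m)."""
    return [[yn * ym * dot(xn, xm) for xm, ym in zip(X, y)]
            for xn, yn in zip(X, y)]


def recover_w(X, y, alpha):
    """w = sum over n of alpha_n * y_n * x_n."""
    d = len(X[0])
    return [sum(a * yn * xn[i] for a, yn, xn in zip(alpha, y, X))
            for i in range(d)]


def support_vector_indices(alpha, tol=1e-9):
    """Indices n with alpha_n > 0, up to a small numerical tolerance."""
    return [n for n, a in enumerate(alpha) if a > tol]


Q = quadratic_coefficients(X, y)
w = recover_w(X, y, alpha)
sv = support_vector_indices(alpha)

# Constraint inherited from dL/db = 0: sum_n alpha_n * y_n must be 0.
balance = dot(alpha, y)

# Dual objective: sum_n alpha_n - (1/2) * alpha^T Q alpha.
dual_value = sum(alpha) - 0.5 * dot(alpha, [dot(row, alpha) for row in Q])

print("w =", w)
print("support vector indices =", sv)
print("sum of alpha_n * y_n =", balance)
print("dual objective =", dual_value)
```

For this toy data the weight vector comes out as (1, -1), and the dual objective equals one half of w transposed w, which is a useful sanity check on the alphas. The fourth point has alpha equal to 0, so it contributes nothing to the sum.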

Now that I notice that many of the alphas are 0, and alpha is only positive

for support vectors, then I can say that I can sum this up over

only the support vectors. It looks like a minor technicality. So the other terms happen to

be 0, so you excluded them. You just made the notation

more clumsy in this case. But there’s a very important point. Think of alphas now as the

parameters of your model. When they’re 0s, they don’t count. Just expect almost all

of them to be 0. What counts is the actual values of

the parameters that will be some number bigger than 0. So now, your weight vector– it’s a d-dimensional vector– is

expressed in terms of the constants which are your data set, x_n

and their label. Plus few parameters, hopefully few

parameters, which is just the number of support vectors. So you have three support

vectors, then this– let’s say you’re working at

20-dimensional space. So I’m looking at 20-dimensional space. I’m getting a weight. Well, it’s 20-dimensional

space in disguise. Because of the constraint you put, you

got something that is effectively three-dimensional. And now you can realize

why there might be a generalization dividend here. Because I end up with fewer parameters

than the express parameters that are in the value I get. So, we can also– now that we have it– solve for the b. Because you want w and b– b is the bias,

or corresponding to the threshold term, if you will. And it’s very easy to do. Because all you need to do is take any

support vector, any one of them, and for any of them you know that

this equation holds. You already solved for w, by that. So you plug this in. And the only unknown in this

equation would be b. And as a check for you, take any

support vector and plug it in. And you have to find the

same b coming out. That was your check that everything

in the math went through. You take any of them,

and you solve for b. And now you have w and b, and you are ready

with the classification line or hyperplane that you have. Now let me close with the nonlinear

transforms, which will be a very short presentation that has

an enormous impact. We are talking about

a linear boundary. And we are talking about linearly

separable case, at least in this lecture. In the next lecture, I’m going to

go to the non-separable case. But a non-separable case could be

handled here in the same way we handled non-separable case

with the perceptrons. Instead of working in the X space,

we went to the Z space. And I’d like to see what happens to the

problem of support vector machines, as we stated it and solved it, when

you actually move to the higher dimensional space. Is the problem becoming

more difficult? Does it hold? Et cetera. So let’s look at it. So we’re going to work

with z instead of x. And we’re going to work in the Z space

instead of the X space. So let’s first put what we are doing. Analytically, after doing all of the

stuff, and I even forgot what the details are, all I care about is that:

would you please maximize this with respect to alpha, subject to a couple

of sets of constraints. So you look at here. And you can see, when I transform

from x to z, nothing happens to the y’s. The labels are the same. And these are the guys that probably

will be changed, because now I’m working in a new space. So I’m putting them in

a different color. So if I work in the X space, that’s

what I’m working with. And these are the guys that I’m going

to multiply in order to get the matrix that I pass on to

quadratic programming. Now let’s take the usual

nonlinear transform. So this is your X space. And in X space, I give you this data. Well, this data is not separable, not

linearly separable, and definitely not nearly linearly separable. This is the case where you need

a nonlinear transformation. And we did this nonlinear

transformation before. Let’s say you take just

x1 squared and x2 squared. And then you get this, and this

one is linearly separable. So all you’re doing now is

working in the Z space. And instead of getting just a generic

separator, you’re getting the best separator, according to SVM, and then

mapping it back, hoping that it will have dividends in terms

of the generalization. So you look at this. I’m moving from X to Z. So when I go back to here,

what do you do? All you need to do is replace

the x’s with z’s. And then you forget that there

was ever an X space. I have vector z. I do the inner product in order

to get these numbers. These numbers I’m going to pass

on to quadratic programming. And when I get the solution back, I have

the separating plane or line in the Z space. And then when I want to know what

the surface is in the X space, I map it back. I get the pre-image of it. And that’s what I get. The most important aspect to observe

here is that– OK, the solution is easy. Let’s say I move from two-dimensional

to two-dimensional here. Nothing happened. Let’s say I move from two-dimensional

to a million-dimensional. Let’s see how much more difficult

the problem became. What do I do? Now I have a million-dimensional vector,

inner product with a million dimensional vector. That doesn’t faze me at all. Just an inner product. I get a number. But when I’m done, how many

alphas do I have? This is the dimensionality of the

problem that I’m passing to quadratic programming. Exactly the same thing. It’s the number of data points. Has nothing to do with the

dimensionality of the space you’re working in. So you can go to an enormous space,

without paying the price for it in terms of the optimization

you’re going to do. You’re going to get

a plane in that space. You can’t even imagine, because

it’s million-dimensional. It has a margin. The margin will look very

interesting in this case. And supposedly it has good

generalization property. And then you map it back here. But the difficulty of solving

the problem is identical. The only thing that is different is

just getting those coefficients. You’ll be multiplying longer vectors. But that is the least of our concerns. The other one is that you’re going

to get the full matrix of this. And quadratic programming will have

to manipulate the matrix. And that’s where the price is paid. So that price is constant, as long

as you give it this number. It doesn’t care whether it was inner

product of 2 by 2, or inner product of a million by million. it will just hand you the alphas. And then you interpret the alphas in

the space that you created it from. So the w will belong to the Z space. Now let’s look at, if I do the nonlinear

transformation, do I have support vectors? Yes, you have support vectors

for sure in the Z space. Because you’re working exclusively in

the Z space, you get the plane there. You get the margin. The margin will touch some points. These are your support vectors

by definition. And you can identify them even without

looking geometrically at the Z space, because what are the support vectors? Oh, I look at the alphas I get. And the alphas that are positive, these

correspond to support vectors. So without even imagining what the Z

space is like, I can identify which guys happen to have the critical margin

in the Z space, just by looking at the alphas. So support vectors live in the space

you are doing the process in, in this case, the Z space. In the X space, there is

an interpretation. So let’s look at the X space here. If I have these guys, not linearly

separable, and you decided to go to a high-dimensional Z space. I’m not going to tell you what. And you solved the support

vector machine. You got the alphas. You got the line, or the hyperplane

in that space. And then you are putting the boundary

here that corresponds to this guy. And this is what the boundary

looks like. Now, we have alarm bells– overfitting, overfitting! Whenever you see something like

that, you say wait. That’s the big advantage you

get out of support vectors. So I get this surface. This surface is simply what the line in the

Z space with the best margin got. That’s all. So if I look at what the support vectors

are in the Z space, they happen to correspond to points here. They are just data points. Right? So let me identify them here, as

pre-images of support vectors. People will say they are

support vectors. But you need to be careful,

because the formal definition is in the Z space. So they may look like this. So let’s look at it. This is one. This is another. This is another. This is another. And usually they are when you turn. You would think that in the Z space,

this is being sandwiched. So this is what it’s likely to be. Now the interesting aspect here is that

if this is true, then one, two, three, four– I have only four support vectors. So I have only four parameters, really,

expressing w in the Z space. Because that’s what we did. We said that w equals summation, over

the support vectors, of the alphas. Now that is remarkable, because I just

went to a million-dimensional space. w is a million-dimensional vector. And when I did the solution,

and if I get four– only four, which is very lucky

if you are using a million-dimensional space, but just

for illustration. If I get four support vectors, then

effectively, in spite of the fact that I used the glory of the million-dimensional

space, I actually have four parameters. And the generalization behavior will

go with the four parameters. So this looks like a sophisticated

surface, but it’s a sophisticated surface in disguise. It was so carefully chosen that–

there are lots of snakes that can go around and mess up

the generalization. This one will be the best of them. And you have a handle on how good the

generalization is, just by counting the number of support vectors. And that will get us– Yeah, this is a good point

I forgot to mention. So the distance between the support

vectors and the surface here is not the margin. The margins are in the linear

space, et cetera. They’re likely, these guys, to

be close to the surface. But the distance wouldn’t be the same. And there are perhaps other points that

look like they should be support vectors, and they aren’t. What makes them support vectors or

not is that they achieve the margin in the Z space. This is just an illustrative

version of it. And now we come to the generalization

result that makes this fly. And here is the deal. Generalization result: E_out is less

than or equal to something. So you’re doing classification. And you are using the classification

error, the binary error. So this is the probability of error in

classifying an out-of-sample point. The statement here is very

much what you expect. You have the number of support vectors,

which happens to be the number of effective parameters– the

alphas that survived. This is your guy. You divide it by N, well, N

minus 1 in this case. And that will give you

an upper bound on E_out. Now I wish this was exactly

the result. The result is very close to this. In order to get the correct result, you

need to run several versions and get an average in order

to guarantee this. So the real result has to do with

expected values of those guys. So for several runs,

the expected value. But if the expected value lives up to

its name, and you expect the expected value, then in that case, the E_out you

will get in a particular situation will be bounded above by this, which

is a very familiar type of a bound, number of parameters, degrees of

freedom, VC dimension, dot dot dot, divided by the number of examples. We have seen this before. And again, the most important aspect

is that, pretty much like quadratic programming didn’t worry about

the nature of the Z space. Could be million-dimensional. And that didn’t figure out in the

computational difficulty. It doesn’t figure out in the

generalization difficulty. You didn’t ask me about the

million-dimensional space. You asked me, after you were done with

this entire machinery, how many support vectors did you get? If you have 1000 data points, and you

get 10 support vectors, you’re in pretty good shape regardless of

the dimensionality of the space that you visited. Because, then, 10 over 1000– that’s

a pretty good bound on E_out. On the other hand, it doesn’t say that

now I can go to any dimensional space and things would be fine. Because you still are dependent on

the number of support vectors. If you go through this machinery, and

then the number of support vectors out of 1000 is 500, you know

you are in trouble. And trouble is understood in this

case, because that snake will be really a snake– going around every point, going

around every point. So just trying to fit the data

hopelessly, getting so many support vectors that the generalization

question now becomes useless. But this is the main theoretical result

that makes people use support vectors, and support vectors with

the nonlinear transformation. You don’t pay for the computation of

going to the higher dimension. And you don’t get to pay for the

generalization that goes with that. And then when we go to kernel methods,

which is a modification of this next time, you’re not even going to pay for

the simple computational price of getting the inner product. Remember when I told you take an inner

product between a million-vector and itself, and that was minor,

even if it’s minor, we’re going to get away without it. And when we get away without it, we will

be able to do something rather interesting. The Z space we’re going to visit– we

are now going to take Z spaces that happen to be infinite-dimensional. Something completely unthought

of when we dealt with generalization in the old way. Because obviously, in an infinite-dimensional

space, I’m not going to be able to actually computationally

get the inner product. Thank you. So there has to be another way. And the other way will be the kernel. But that will open another set of

possibilities of working in a set of spaces we never imagined touching, and

still getting not only the computation being the same, but also the

generalization being dependent on something that we can measure, which

is the number of support vectors. I will stop here and take questions

after a short break. Let’s start the Q&A. MODERATOR: OK. Can you please first explain

again why you can normalize w transposed x plus b to be 1. PROFESSOR: OK. We would like to solve for

the margin given w. That has dependency on the combination

of w’s you get, which is like the angle that is the relevant one. And also w has an inherent

scale in it. So the problem is that the scale has

nothing to do with which plane you’re talking about. When I take w, the full w and b, and

take 10 times that, they look like different vectors as far as the analysis

is concerned, but they are talking about the same plane. So if I’m going to solve without the

normalization, I will get a solution. But the solution, whatever I’m

optimizing, will invariably have in its denominator something that takes

out the scale, so that the thing is scale-invariant. I cannot possibly solve. And it will tell me that w has to be

this, when in fact any positive multiple of it will serve

the same plane. So all I’m doing myself is simplifying

my life in the optimization. I want the optimization to be

as simple as possible. I don’t want it to be something

over something. Because then I will have trouble

actually getting the solution. Therefore, I started by putting

a condition that does not result in loss of generality. Because if I restrict myself

to w’s, not to planes– all planes are admitted. But every plane is represented

by an infinite number of w’s. And I’m picking one particular w

to represent them that happens to have that form. When I do that and put it as

a constraint, what I end up with, the thing that I’m optimizing happens to

be a friendly guy that goes with quadratic programming and

I get the solution. I could definitely have started

by not putting this condition. Except that I would run into

mathematical trouble later on. That’s all there is to it. Similarly, I could have left w0. And then all of a sudden, every time I

put something, I only tell you: take the norm of the first d guys, or w_1 up

to w_d, and forget the first one. So all of this was just pure technical

preparation that does not alter the problem at all, that makes the

solution friendly later on. MODERATOR: Many people are curious. What happens when the points

are not linearly separable? PROFESSOR: There are two cases. One of them: they are horribly not

linearly separable, like that. And in this case, you

go to a nonlinear transformation, as we have seen. And then there is the slightly

not linearly separable case, as we’ve seen before. And in that case, you will see that the

method I described today is called hard-margin SVM. Hard-margin because the margin

is satisfied strictly. And then you’re going to get another

version of it, which is called soft margin, that allows for a few errors

and penalizes for them. And that will be covered next. But basically, it’s very much in

parallel with the perceptron. Perceptron means linearly separable. If there are a few errors, then

you apply something. Let’s say like the pocket

in that case. But if it’s terribly not linearly

separable, then you go to a nonlinear transformation. And nonlinear transformation here is very

attractive because of the particular positive properties that we discussed. But in general, you actually use

a nonlinear transformation together with the soft version, because you don’t want

the snake to go out of its way just to take care of an outlier. So we are better off just making

an error on the outlier, and making the snake a little bit less wiggly. And we will talk about that

when we get the details. MODERATOR: Could you explain once again

why in this case, just the number of support vectors gives an approximation

of the VC dimension, while in other cases the transform– PROFESSOR: The explanation

I gave was intuitive. It’s not a proof. There is a proof for these terms

that I didn’t even touch on. And the idea is the following. We have come to the conclusion that the

number of parameters, independent parameters or effective parameters, is

the VC dimension in many cases. So to the extent that you can actually

accept that as a rule of thumb, then you look at the alphas. I have as

many alphas as data points. So if these were actually my parameters,

I’d be in deep trouble. Because I have as many parameters as

points, so I’m basically memorizing the points. But the particulars of the problem

result in the fact that, in almost all the cases, the vast majority of the

parameters will be identically 0. So in spite of the fact that they were

open to be non-zero, the fact that the expectation is that almost all of them

will be 0, makes it more or less that the effective number of parameters are the

ones that end up being non-zero. Again, this is not an accurate

statement, but it’s a very reasonable statement. So the number of non-zero parameters,

which corresponds to the VC dimension, also happens to be the

number of the support vectors by definition. Because support vectors are the ones

that correspond to the non-zero Lagrange multipliers. And therefore, we get a rule, which

either counts the number of support vectors or the number of surviving

parameters, if you will. And this is the rule that we had at

the end, that I said that I didn’t prove, but actually gives

you a bound on E_out. MODERATOR: Is there any advantage in

considering the margin, but using a different norm? PROFESSOR: So there

are variations of this. And indeed, some of the aggregation

methods, like boosting, have a margin of their own. And then you can compare that. It’s really the question of the

ease of solving the problem. And if you have a reason for

using one norm or another, for a practical problem. For example, if I see that loss goes

with squared, or loss goes with the absolute value or whatever, and then I

design my margin accordingly, then we go back to the idea of a principled

error measure, in this case margin measure. On the other hand, in most of the cases,

there is really no preference. And it is the analytic considerations

that make me choose one margin or another. But different measures for the margin,

with 1-norm, 2-norm, and other things, have been applied. And there is really no compelling reason

to prefer one over the other in terms of performance. So it really is the analytic

properties that usually dictate the choice. MODERATOR: Is there any pruning method

that can maybe get rid of some of the support vectors, or not really? PROFESSOR: So you’re not happy

with even reducing it to support vectors? You want to get rid of some of them.

Well– Offhand, I cannot think of a method

that I can directly translate into– as if it’s getting rid of

some of the support vectors. What happens for computational reasons

is that when you solve a problem that is huge in data set, you

cannot solve it all. So sometimes what happens is that

you take subsets, and you get the support vectors. And then you take the support vectors as

a union and get the support vectors of the support vectors,

and stuff like that. So these are really computational

considerations. But basically, the support vectors are

there to support the separating plane. So if you let one of them

go, the thing will fall! Obviously, I’m half-joking only. But because really, they are the ones

that dictate the margin, so their existence really tells you

that the margin is valid. And that’s really why they are there. MODERATOR: Some people are worried that

a noisy data set would completely ruin the performance of the SVM. So how does it deal with this? PROFESSOR: It will ruin as much

as it will ruin any other method. It’s not particularly susceptible

to noise. Except obviously when you have noise,

the chance of getting cleanly linearly separable data is not there. And therefore, you’re using

the other methods. And if you’re using strictly

nonlinear transformation, but with hard margin, then I can see

the point of ruining. Because now the snake is

going around noise. And obviously that’s not good, because

you’re fitting the noise. But in those cases, and in almost all of

the cases, you use the soft version of this, which is remarkably similar. It’s different assumptions. But the solution is remarkably

similar. And therefore in that case, you will be

as vulnerable or not vulnerable to noise as you would by

using other methods. MODERATOR: All right. I think

that’s it. PROFESSOR: Very good. We will see you next week.
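A minimal numerical sketch of the normalization argument from the first question in the Q&A above. It assumes a hypothetical three-point data set and an arbitrary separating plane (neither comes from the lecture), and checks two things: multiplying (w, b) by 10 leaves the margin unchanged, and after rescaling so that the nearest point satisfies |w^T x + b| = 1, the margin equals 1/||w||.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def margin(w, b, points):
    """Euclidean distance from the plane w.x + b = 0 to the nearest point."""
    norm = math.sqrt(dot(w, w))
    return min(abs(dot(w, x) + b) for x in points) / norm

def normalize(w, b, points):
    """Rescale (w, b) so that the nearest point satisfies |w.x + b| = 1."""
    scale = min(abs(dot(w, x) + b) for x in points)
    return tuple(wi / scale for wi in w), b / scale

# Hypothetical toy data and plane, chosen only for illustration.
points = [(1.0, 2.0), (3.0, 1.0), (-1.0, -1.0)]
w, b = (2.0, 3.0), -1.5

# Same plane, ten times the scale: the margin does not change.
w10, b10 = tuple(10 * wi for wi in w), 10 * b
m1 = margin(w, b, points)
m10 = margin(w10, b10, points)

# After normalization, the margin is simply 1 / ||w||.
wn, bn = normalize(w, b, points)
mn = margin(wn, bn, points)
inv_norm = 1.0 / math.sqrt(dot(wn, wn))

print(m1, m10, mn, inv_norm)
```

All four printed values agree, which is the sense in which the normalization costs no generality: it picks one representative (w, b) per plane and turns the margin into an expression with no denominator to carry around other than ||w||.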


Support Vector Machine lecture starts at 4:14

Wow, this is brilliant.

There's no god about it! Even so, congratulations!

Thank you very much for the best lecture on SVM in the world. Probably, Vapnik himself would be able to teach/deliver the SVM clearly as you do.

Thanks a lot, very well explained!

seriously dude this is awesome.. after many attempts finally I understand the SVM..

I haven't seen the previous lectures, and I wonder why he calls the vector "w" a "signal"?

At 34:29, observe closely: when Prof. Yaser is explaining the constrained optimization, there is a background sound as his hand moves. "Boshooom"…! It sounds so natural, as if the Prof. did it!

Thanks a lot !! 🙂

The intuition is GREAT! Thx!

awesome

Well explained! Thanks a lot!

it's his hand touching the microphone

Thanks a lot !

The kernel trick (part 3) is not explained in much detail…

I'm still looking for a clear and easy-to-understand explanation of it =)

This is really very nice and helpful in my research work. I would love to know more about the heuristics you talked about for handling large data sets with SVM.

Thank you very much, very helpful !

I meant, Vapnik himself would not be able to teach the subject as clearly as you do.

This lecture is sooo good! One of the cool things is that people here don't assume that you know everything unlike so many other places where they expect that you know about the basic concepts of optimisation and machine learning!

really nice video…understood SVM at last 🙂

All ML teachers are so boring. Good Mathematicians tell stories on numbers.

This is a very well produced lecture. Thank you for sharing. 🙂

haven't got there yet but kernel methods is the next lecture..

I don't quite understand KKT conditions; what foundations do I need to do so?

Watched a video on Lagrange Multipliers and now Im back again.

One cannot say anything less than amazing about your lecture… Thank you so much…

I love his accent! 🙂

SVMs kick ass!

I rewound this a number of times and I finally got it. Really well explained!!

Can I use SVM for sentiment analysis classification?

In sovjet rashiya, machine vector supports you.

Nice, clean presentation.

"I can kill +b"

Very helpful !.. thanks a lot

Such a gentleman and an intelligent professor.

Very nice presentation.

Thank you a lot

Warning: If your IQ is below 160, move on to some other video or Andrew Ng's videos.

Thank you for the lecture Professor!

Just wondering: at 43:26, is that -1 supposed to be an identity matrix times the scalar -1? That's what I assumed at first, but when I look at LAML, the Java quadratic programming library that I'm using, it specifies that c needs to be an n x 1 matrix. So I guess c is just a column of N rows, with each entry being -1?
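A short sketch of how the two pieces of the dual quadratic program can be assembled, assuming the standard form "minimize (1/2) a^T Q a + c^T a subject to y^T a = 0, a >= 0". In that form the linear term is a length-N vector of -1's, not a matrix. The helper names and the toy data are hypothetical, and nothing here assumes how any particular library (LAML included) expects its arguments.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def build_dual_qp(X, y):
    """Return (Q, c) for: minimize 0.5 * a^T Q a + c^T a  s.t.  y^T a = 0, a >= 0."""
    n = len(X)
    # Quadratic term: Q[i][j] = y_i * y_j * (x_i . x_j), an N x N matrix.
    Q = [[y[i] * y[j] * dot(X[i], X[j]) for j in range(n)] for i in range(n)]
    # Linear term: one -1 per data point, i.e. an N x 1 column.
    c = [-1.0] * n
    return Q, c

def dual_objective(alpha, Q, c):
    """Value of 0.5 * alpha^T Q alpha + c^T alpha."""
    n = len(alpha)
    quad = sum(alpha[i] * Q[i][j] * alpha[j] for i in range(n) for j in range(n))
    return 0.5 * quad + dot(c, alpha)

# Hypothetical toy data: two positive points, one negative point.
X = [(1.0, 2.0), (3.0, 1.0), (-1.0, -1.0)]
y = [1, 1, -1]

Q, c = build_dual_qp(X, y)
value = dual_objective([2 / 13, 0.0, 2 / 13], Q, c)
print(len(Q), len(Q[0]), len(c), value)
```

Minimizing this objective is the same as maximizing the Lagrangian in alpha, which is why the sum of the alphas shows up with a minus sign.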

Interesting and Inspiring. A great video, alongside other videos, to help comprehend a basic understanding of the SVM subject.

I'm still worried (my naïve intuition) that if it really comes down to a calculation against those margin points, then it is surely more susceptible to noisy data and overfitting, because I would have thought the noisy points are the ones that end up on the margins.

So I guess I should look at how 'soft' SVMs help.

Thank you Professor for the very informative lecture..!

Can someone here tell me what lecture he covers VC dimensions in ?

I would highly appreciate your replies.

Why at 33:43, the professor says alpha's are non-negative, all of a sudden????

Disclaimer: I haven't watched earlier lectures, in case that is relevant.

Let me know please!!!!!

Summarized question: Why are we maximizing L w.r.t. alpha at 39:25?

Slide13 at 36:06: At extrema of L(w,b,alpha), dL/db=dL/dw=0, giving us w =sum(an*yn*xn) and sum(an*yn)=0. These substitutions make L(w,b,a)=L(alpha) in the slide 14 = extrema of L. Then why are we maximizing this w.r.t alpha?? He said something about that in slide 13 at 33:40, but I could not understand. Can anybody care to explain?
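One way to see the role of the maximization, written out as a sketch in the notation of the comment above. It assumes the hard-margin Lagrangian with multipliers alpha_n >= 0 and takes the exchange of min and max for granted rather than proving it. The point is that setting the gradients to zero performs only the minimization over w and b; the maximization over alpha is what enforces the constraints.

```latex
\mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha})
  = \tfrac{1}{2}\,\mathbf{w}^{\mathsf T}\mathbf{w}
  - \sum_{n=1}^{N} \alpha_n \bigl( y_n(\mathbf{w}^{\mathsf T}\mathbf{x}_n + b) - 1 \bigr),
  \qquad \alpha_n \ge 0 .

% For fixed (w, b), maximizing over alpha >= 0 recovers the constrained objective:
\max_{\boldsymbol{\alpha} \ge 0} \mathcal{L}
  = \begin{cases}
      \tfrac{1}{2}\,\mathbf{w}^{\mathsf T}\mathbf{w}
        & \text{if } y_n(\mathbf{w}^{\mathsf T}\mathbf{x}_n + b) \ge 1 \text{ for all } n, \\
      +\infty & \text{otherwise.}
    \end{cases}

% So the original problem is a min over (w, b) of a max over alpha >= 0.
% Setting the gradients to zero carries out only the minimization part:
\nabla_{\mathbf{w}} \mathcal{L} = 0
  \;\Rightarrow\; \mathbf{w} = \sum_{n} \alpha_n y_n \mathbf{x}_n,
\qquad
\frac{\partial \mathcal{L}}{\partial b} = 0
  \;\Rightarrow\; \sum_{n} \alpha_n y_n = 0 .

% What remains depends on alpha alone and still has to be maximized:
\mathcal{L}(\boldsymbol{\alpha})
  = \sum_{n} \alpha_n
  - \tfrac{1}{2} \sum_{n}\sum_{m} y_n y_m \alpha_n \alpha_m \,
    \mathbf{x}_n^{\mathsf T}\mathbf{x}_m .
```

The same picture explains the sign restriction: if an alpha_n were allowed to be negative, a violated constraint would no longer be pushed to plus infinity, so the Lagrangian would stop penalizing infeasible planes.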

Really helpful explanation..got what SVM is..Thank you so much professor!

from 12:15

It means that you extended the features X with 1 and weights W with b as in perceptron.

And these extensions are removed from X and W after normalization.

so good

I loved loved loved all the lectures , you are an amazing professor !!!!

One word: wonderful…

What a charming prof. Like his teaching style. Thank you Caltech for sharing this

can anyone tell me the lecture where he teaches "generalization"??

Thanks, Dr. Yaser, you are an honor to every Egyptian.

people like you save my life 🙂

I don't understand why we put constraints on alpha's to be greater than 0… If we take a simple example, say of 3 data points, 2 of positive class (yi=1): (1,2) (3,1) and one negative (yi=-1): (-1,-1) – and we calculate using Lagrange multipliers, we will get a perfect w (0.25,0.5) and b = -0.25, but one of our alphas was negative (a1 = 6/32, a2 = -1/32, a3 = 5/32). So why is this a problem?
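A quick numerical check of the three-point example in the comment above, assuming the numbers are read correctly. The second candidate, alpha = (2/13, 0, 2/13) with b = -3/13, was worked out for this sketch and does not come from the source. Both candidates satisfy all the constraints, but the one with a negative multiplier has the smaller margin, so it is feasible without being the maximum-margin plane.

```python
import math

X = [(1.0, 2.0), (3.0, 1.0), (-1.0, -1.0)]
y = [1, 1, -1]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def w_from_alpha(alpha):
    """w = sum_n alpha_n * y_n * x_n."""
    return tuple(sum(a * t * x[k] for a, t, x in zip(alpha, y, X)) for k in range(2))

def check(alpha, b):
    """Return w, the values y_n (w.x_n + b) (each must be >= 1), and the margin 1/||w||."""
    w = w_from_alpha(alpha)
    functional = [t * (dot(w, x) + b) for x, t in zip(X, y)]
    return w, functional, 1.0 / math.sqrt(dot(w, w))

# Multipliers quoted in the comment; the second one is negative.
w_c, func_c, margin_c = check((6 / 32, -1 / 32, 5 / 32), -0.25)

# Candidate with all alphas >= 0 (an assumption of this sketch, not from the source).
w_k, func_k, margin_k = check((2 / 13, 0.0, 2 / 13), -3 / 13)

print(w_c, func_c, margin_c)
print(w_k, func_k, margin_k)
```

In the first solution all three points are forced to sit exactly on the margin. In the second, the point (3, 1) has alpha = 0 and lies strictly beyond the margin, and the margin comes out at roughly 1.80 instead of 1.79. A negative multiplier is the signature of a constraint being held at equality when relaxing it would improve the objective.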

One of the best machine learning lectures. I would like to know how to solve the quadratic programming problem analytically, so that the whole process of getting the hyperplane can be done analytically.

I bow to your teaching

Thank you. Wow! This is by far the best explanation of SVMs I've come across. The right mathematical rigor, lucid concepts, and structured analytical thinking put up a good framework for understanding this complex model in a fun and intuitive way.

Bravo Dr. Yaser, excellent explanation! Now looking forward to the Kernel Methods lecture 🙂

Is that an ashtray in front of the professor?

In the constraint condition of |w^T.xn +b| >=1 how is it guaranteed that for the nearest xn, the |w^T.xn +b| will be 1 ?

I salute you, Sir! What a great way of teaching! I think I understood most of it from just one viewing of these lectures.

Do you teach any other courses? Can you put them on youtube also?

Writing my bachelor's thesis about SVMs at the moment. It's a great introduction and very helpful for understanding the main issues in a short time. Thank you!

Excellent lecture

Amazing how you unravel it , like a movie , the element of suspense , a preview and a resolution.

10/10 would listen again

Why is there a preference between minimizing and maximizing for optimization?

Never seen lecture like this. Thank you!

How simply you explain things. I wonder if I can explain complex things like you do.

Best explanation ever! thank you

Thank you very much for sharing these wonderful lectures! I have some thoughts about the margin. It seems, that start of the PLA with weights defining the hyperplane placed between the two centers of mass of data points is better to achieve the maximum margin, than the start with all-zero weights. Let R1 and R2 be the centers of mass of data points of the "+1" and "-1" categories, respectively. Then the normal vector of the hyperplane is equal to R1 – R2 (direction is important) and the bias vector is equal to (R1 + R2)/2. Thereby, the vector part of the weights is initialized as w = R1-R2 and the scalar part as wo = -(R1-R2, R1+R2)/2 (the inner product of the normal and bias vector multiplied by -1).
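The initialization described in the comment above can be sketched as follows. The function names and the toy data are hypothetical, and the sketch only implements the proposal as stated: w = R1 - R2 and w0 = -(R1 - R2) . (R1 + R2)/2, followed by ordinary PLA updates. It makes no claim that the result is the maximum-margin plane; PLA stops at the first plane that separates the data.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def centroid(points):
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))

def centroid_init(X, y):
    """Plane through the midpoint of the class centroids, normal to R1 - R2."""
    r1 = centroid([x for x, t in zip(X, y) if t == +1])
    r2 = centroid([x for x, t in zip(X, y) if t == -1])
    w = tuple(a - b for a, b in zip(r1, r2))
    mid = tuple((a + b) / 2 for a, b in zip(r1, r2))
    return w, -dot(w, mid), mid

def pla(X, y, w, w0, max_updates=1000):
    """Perceptron learning algorithm starting from (w, w0)."""
    updates = 0
    while updates < max_updates:
        wrong = [(x, t) for x, t in zip(X, y) if t * (dot(w, x) + w0) <= 0]
        if not wrong:
            break
        x, t = wrong[0]  # update on the first misclassified point
        w = tuple(wi + t * xi for wi, xi in zip(w, x))
        w0 += t
        updates += 1
    return w, w0, updates

# Hypothetical, linearly separable toy data.
X = [(2.0, 2.0), (3.0, 3.0), (3.0, 1.0), (-1.0, -1.0), (-2.0, 0.0), (0.0, -2.0)]
y = [1, 1, 1, -1, -1, -1]

w_init, w0_init, mid = centroid_init(X, y)
w_c, w0_c, updates_c = pla(X, y, w_init, w0_init)
w_z, w0_z, updates_z = pla(X, y, (0.0, 0.0), 0.0)
print(updates_c, updates_z)
```

On this toy set the centroid start already separates the classes, so PLA performs no updates from it. Whether that start tends to end closer to the maximum margin in general is the commenter's conjecture and is not tested here.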

What does the first preliminary technicality (12:43), |w^T x| = 1, mean? How is it the same as |w^T x| > 0?

this one was complicated

Good course. Have you got a lecture on AdaBoost and its use with SVMs or other weak learners?

FUCKING BRILLIANT!! Thanks 😀

I have some questions:

1. In slide 6 at 13:53, I still don't understand the reason behind changing the inequality into an equality with 1. The professor just said that it restricts the way we choose w and makes the math friendly. But is there any other reason behind this? For example, could we choose a number other than 1, maybe 2 or 0.5? It seems either of them would also restrict the way we choose w.

2. In slide 9 at 24:56, why is maximizing 1/||w|| equivalent to minimizing 1/2 w^T w? Is there a derivation behind this? I don't get it at all.

any answer will be appreciated
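Both questions above come down to short arguments, sketched here under the lecture's setup: a separating plane (w, b) and margin measured to the nearest point. For the first, any positive constant c would work, and 1 is simply the most convenient choice. For the second, the two problems share the same feasible set and the objective is only passed through monotonic transformations, so the optimal (w, b) is the same.

```latex
% Question 1: any positive constant c would do.
\min_{n} \lvert \mathbf{w}^{\mathsf T}\mathbf{x}_n + b \rvert = c > 0
\;\;\Longrightarrow\;\;
\text{margin} = \frac{c}{\lVert \mathbf{w} \rVert},
\qquad
(\mathbf{w}', b') = \Bigl( \tfrac{\mathbf{w}}{c}, \tfrac{b}{c} \Bigr)
\text{ is the same plane with }
\min_{n} \lvert \mathbf{w}'^{\mathsf T}\mathbf{x}_n + b' \rvert = 1 .

% Question 2: t -> 1/t is decreasing for t > 0, and t -> t^2 / 2 is increasing for t >= 0.
\arg\max_{\mathbf{w}, b} \frac{1}{\lVert \mathbf{w} \rVert}
  = \arg\min_{\mathbf{w}, b} \lVert \mathbf{w} \rVert
  = \arg\min_{\mathbf{w}, b} \tfrac{1}{2} \lVert \mathbf{w} \rVert^{2}
  = \arg\min_{\mathbf{w}, b} \tfrac{1}{2}\, \mathbf{w}^{\mathsf T}\mathbf{w} .
```

The squared form is preferred because it is a smooth quadratic in w, which is what quadratic programming expects; the factor 1/2 only tidies up the derivative.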

Mm, why are we taking the expected value of E_out on the last slide when E_out is already the expected out-of-sample error? With respect to what are we averaging E_out? I didn't quite catch it. Is it about averaging over different transformations?

The probability that a boy passes in front of the camera between 2' and 5' is too high 🙂

This is the best (most geometrically intuitive) SVM lecture I have found so far. Thank you!

This explanation is really great. However, the one in the Machine Learning course by Columbia University NY on edX.org is much more intuitive and better developed. It is worth reviewing.

Thank you sir! BTW, I would have applauded at this moment of the lecture: 22:37

I did not understand what was explained about w at minute 52: how can it be three-dimensional after replacing all x_n with X_n in SV?

what does VC stand for?

46:26 whole bunch of alphas are just zero

Blah blah blah, and in the end you will use Python with sklearn 🙁

I am still a bit confused. At 22:36 he talks about the distance from the point to the plane being set to 1 (as w^T x + b = 1), and yet the distance is 1/||w||. What am I missing?
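A sketch of the step that seems to be missing in the comment above, assuming x_n is the nearest point and x is any point on the plane, so that w^T x + b = 0. The quantity set to 1 is the value of w^T x_n + b, which is not yet a distance; the Euclidean distance needs the unit normal w/||w||.

```latex
\operatorname{dist}(\mathbf{x}_n, \text{plane})
  = \left\lvert \frac{\mathbf{w}^{\mathsf T}}{\lVert \mathbf{w} \rVert}
      (\mathbf{x}_n - \mathbf{x}) \right\rvert
  = \frac{\bigl\lvert (\mathbf{w}^{\mathsf T}\mathbf{x}_n + b)
      - (\mathbf{w}^{\mathsf T}\mathbf{x} + b) \bigr\rvert}{\lVert \mathbf{w} \rVert}
  = \frac{\lvert \mathbf{w}^{\mathsf T}\mathbf{x}_n + b \rvert}{\lVert \mathbf{w} \rVert}
  = \frac{1}{\lVert \mathbf{w} \rVert} .
```

So the normalization fixes the numerator at 1, and the distance is whatever is left in the denominator.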

30:36 what was the pun ?