## Lecture 03 - The Linear Model I

ANNOUNCER: The following program

is brought to you by Caltech. YASER ABU-MOSTAFA: Welcome back. Last time, we discussed the

feasibility of learning. And we realized that learning is

indeed feasible, but only in a probabilistic sense. And we modeled that probabilistic sense

in terms of a bin, whose unknown composition we mapped to the

out-of-sample performance. That performance we don’t know. And in order to be able to tell what E_out

of h is– h is the hypothesis that corresponds to that particular bin– we look at the in-sample. And we realize that the in-sample tracks

the out-of-sample well through the mathematical relationship which is

the Hoeffding Inequality, that tells us that the probability that E_in

deviates from E_out by more than our specified tolerance is a small number. And that small number is a negative

exponential in N. So the bigger the sample, the

more reliable that E_in will track E_out well. That was the basic building block. But then we realized that this

applies to a single bin. And a single bin corresponds

to a single hypothesis. So now we go for a case where we

have a full model, h_1 up to h_M. And we take the simple case of

a finite hypothesis set. And we ask ourselves, what

would apply in this case? We realized that the problem with having

multiple hypotheses is that the probability of something bad

happening could accumulate. Because if there is a 0.5% chance that

the first hypothesis is bad, in the sense of bad generalization, and 0.5% for the second one, we could

be so unlucky as to have these 0.5% accumulate, and end up with a significant

probability that one of the hypotheses will be bad. And when one of the hypotheses will be

bad, if we are further unlucky, and this is the hypothesis we pick as our

final hypothesis, then E_in will not track E_out for the hypothesis

we pick. So we need to accommodate the case where

we have multiple hypotheses. And the argument was extremely simple. g is our notation for the

final hypothesis. It is one of these guys that

the algorithm will choose. Well, the probability that E_in doesn’t

track E_out will obviously be included in the event that E_in for h_1

doesn’t track the out-of-sample for that one, or E_in for h_2 doesn’t track,

or E_in of h_M doesn’t track. The reason is very simple. g is one of the guys. If something bad happens with g, it must

happen with one of these guys at least, the one that was picked. So we can always say that this implies

these things, which is this or this or this or this or this. And after that, we apply a very simple

mathematical rule, which is the union bound. The probability of an event or another

event or another event is at most the sum of the probabilities. That rule applies regardless of the

correlation between these events, because it takes the worst-case scenario. If all the bad events happen

disjointly, then you add up the probabilities. If there is some correlation,

and they overlap, you will get a smaller number. In all of the cases, the probability of

this big event will be less than or equal to the sum of the individual

probabilities. And this is useful because in the coin

flipping case, which started this argument, the events are independent. In the case of the hypotheses of

a model, the events may not be independent, because we

have the same sample. And we are only changing

the hypotheses. So it could be that the deviation here

is related to the deviation here. But the union bound doesn’t care. Regardless of such correlations, you

will be able to get a bound on the probability of this event. And therefore, you will be able to bound

the probability that you care about, which has to do with the

generalization, by the individual Hoeffding bound applied to each of those. And since you have M of them,

you have an added M factor. So the final answer is that the

probability of something bad happening after learning is less than or equal to

this quantity, which is a helpful small quantity, times M. And we realize
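As a quick numeric illustration of this bound (the specific numbers below are made up for illustration, not from the lecture), the right-hand side 2 M exp(-2 eps^2 N) can be computed directly:

```python
import math

def hoeffding_bound(M, N, eps):
    """Union-bound Hoeffding: P[|E_in - E_out| > eps] <= 2 * M * exp(-2 * eps^2 * N)."""
    return 2 * M * math.exp(-2 * eps**2 * N)

# One hypothesis, 1000 examples, tolerance 0.05: a genuinely small bound.
single = hoeffding_bound(M=1, N=1000, eps=0.05)

# Same N and eps, but M = 1000 hypotheses: the added M factor can push
# the right-hand side past 1, at which point the bound says nothing.
many = hoeffding_bound(M=1000, N=1000, eps=0.05)
print(single, many)
```

With a single hypothesis the bound is meaningful; multiplying by a large M makes it grow linearly, which is exactly the problem described here.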

that now we have a problem because if you use a bigger hypothesis

set, M will be bigger. And therefore, the right-hand side here

will become bigger and bigger when you add the M. And therefore, at

some point, it will even become meaningless. And we are not even worried yet about

M being infinity, which will be true for many hypothesis sets,

in which case, this is totally meaningless. However, we weren’t establishing

the final result in learning. We were establishing the principle

that, through learning, you can generalize. And we have established that. It will take us a couple of weeks to

get from that to the ability to say that a general learning model, an infinite

one, will generalize. And we will get the bound on generalization. That’s what the theory of generalization

will address. So today the subject is linear models. And as I mentioned at the beginning,

this is out of sequence. If I were following the logical sequence,

I would go immediately to the theory and take the result with M, which takes care

of the finite case, and then generalize it to

the more general case. However, as I mentioned, I decided to

give you something concrete and practical to work with early on. And then we will go back to

the theory after that. The linear model is one of the most

important models in machine learning. And what we are going to

do in this lecture, we’re going to start with a practical

data set that we are going to use over and over in this class. And then, if you remember the perceptron

that we introduced in the first lecture, the perceptron

is a linear model. So here is the sequence

of the lecture. We are going to take the perceptron and

generalize it to non-separable data. That’s a relief, because we already

admitted that separable data is very rare. And we would like to see what

will happen when we have non-separable data. Then, we are going to generalize this

further to the case where the target function is not a binary classification

function, but a real-valued function. That also is a very important

generalization. And linear regression, as you will

see, is one of the most important techniques that is applied mostly in

statistics and economics, and also in machine learning. Finally, as if we didn’t do enough

generalization already, we are going to take this and generalize

it to a nonlinear case. All in a day’s work,

all in one lecture. It’s a pretty simple model. And at the end of the lecture, you will

be able to actually deal with very general situations. And you may ask yourself, why am I

calling the lecture Linear Model when I’m going to talk about nonlinear

transformation? Well, you’ll realize that nonlinear

transformation remains within the realm of linear models. That’s not obvious. We will see how that materializes. So that’s the plan. Now, let’s look at a real data set that

we are going to use, and will be available to you to try

different ideas on. And it’s very important to try

your ideas on real data. Regardless of how sure you are when

you have a toy data set that you generate, you should always go for

real data sets and see how the system that you thought of performs

in reality. So here is the data set. It comes from ZIP codes

at the post office. So people write the ZIP code. And you extract individual characters,

individual digits. And you would like to take the image,

which happens to be 16 by 16 gray level pixels, and be able to decipher

what is the number in it. Well, that looks easy except that

people write digits in so many different ways. And if you look at it, there will

be some cases like this fellow. Is this a 1 or a 7? Is this a 0 or an 8? So you can see that there

is a problem. And indeed, if you get a human operator

to actually read these things and classify them, they will probably

be making an error of about 2.5%. And we would like to see if machine

learning can at least equal that, which means that we can automate the

process, or maybe beat that. So this is a data set that we

are going to work with. Let’s look at it a little bit more

closely to see how we input it to our algorithm. We have one algorithm so far, which is

the perceptron learning algorithm. We are going to try on this. And then we are going to generalize

it a little bit. The first item is the question

of input representation. What do I mean? This is your input, the raw

input, if you will. Now this is 16 pixels by 16 pixels. So there are 256 real numbers

in that input. If you look at the raw input x, this

would be x_1, x_2, x_3, …, x_256. That’s a very long input to encode

such a simple object. And we add our mandatory x_0. Remember, in linear models, we have

this constant coordinate, x_0 equals 1, we add in order to

take care of the threshold. So this will always be in the

background whether we mention it or not. If you take this raw input and try

the perceptron directly on it, you realize that the linear model in this

case, which has a bunch of parameters, has really just too many parameters. It has 257 parameters. If you are working in

a 257-dimensional space, that is a huge space. And the poor algorithm is trying to

simultaneously determine the values of all of these w’s based on your data set. So the idea of input representation is

to simplify the algorithm’s life. We know something about the problem. We know that it’s not really

the individual pixels that matter. You can probably extract some features

from the input, and then give those to the learning algorithm and let

the learning algorithm figure out the pattern. So this gives us the idea of features. What are features? Well, you extract the

useful information. And as a suggestion, very simple one,

let’s say that in this particular case, instead of giving the raw input

with all of the pixel values, you extract some descriptors of

what the image is like. For instance, you look at this. Depending on whether this is the digit

8 or the digit 1, et cetera, there is a question of the intensity,

average intensity. 1 doesn’t have too many black pixels. 8 has a lot. 5 has some. So if you simply add up the intensity

of all the pixels, you probably will get a number that

is related to the identity. It doesn’t uniquely determine

it, but it’s related. It’s a higher-level representation

of the raw information there. Same as symmetry– if you think of the digit

1, 1 will be symmetric. If you flip it upside down, or you flip

it right and left, you will get something that overlaps

significantly with it. So you can also define a symmetry

measure, which means that you take the symmetric difference between something

and its flipped versions, and you see what you get. If something is symmetric, things will

cancel because it’s symmetric. You’ll get a very small value. And if something is not symmetric, let’s

say like the 5, you will get lots of values in the symmetric

difference. And you will get a high

value for that. So what you are measuring

is the anti-symmetry. You take the negative of that,

and you get the symmetry. So you get another guy,

which is the symmetry. So now, x_1 is the intensity variable, x_2 is the symmetry variable. Now admittedly, you have lost
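These two features can be sketched in code. This is one plausible implementation, assuming a 16 by 16 grid of gray levels; the exact definitions used in the course may differ:

```python
def features(image):
    """Map a 16x16 grayscale image (a list of 16 rows of 16 floats in [0, 1])
    to (intensity, symmetry). One plausible version of the lecture's features;
    the exact definitions used in the course may differ."""
    n = len(image)
    flat = [p for row in image for p in row]
    intensity = sum(flat) / len(flat)          # average of all pixel values

    # Anti-symmetry: average absolute difference between the image and its
    # left-right and up-down flips. Symmetry is the negative of that.
    lr = [row[::-1] for row in image]          # left-right flip
    ud = image[::-1]                           # up-down flip
    diff = sum(abs(image[i][j] - lr[i][j]) + abs(image[i][j] - ud[i][j])
               for i in range(n) for j in range(n))
    symmetry = -diff / (2 * n * n)
    return intensity, symmetry
```

A perfectly symmetric image gets symmetry 0, the best possible value; the more asymmetric the image, the more negative the symmetry, matching the "negative of the anti-symmetry" idea above.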

information in that process. But the chances are you lost as much

irrelevant information as relevant information. So this is a pretty good representation

of the input, as far as the learning algorithm is concerned. And you went from 257-dimensional

to 3-dimensional. That’s a pretty good situation. And you probably realize that having

257 parameters is bad news for generalization, if you extrapolate

from what we said. Having 3 is a much better situation. So this is what we are

going to work with. When you take the linear model

in this case, you just have w_0, w_1, and w_2. And that’s what the perceptron algorithm,

for example, needs to determine. Now let’s look at the illustration

of these features. You have these as your inputs. And x_1 is the intensity, x_2 is the symmetry. What do they look like? They look like this. This is a scatter diagram. Every point here is a data point. It’s one of the digits, one

of the images you have. And I’m taking the simple case of just

distinguishing the 1’s from the 5’s. So I’m only taking digits

that are 1’s or 5’s. And you can always take other

digits versus each other, and then combine the decision. If you can solve this unit problem,

you can generalize it to the other problem. So when you put all the 1’s and all the

5’s in a scatter diagram, you realize for example that the intensity on the

5’s is usually more than the intensity on the 1’s. There are more pixels occupied

by the 5’s than the 1’s. This is the coordinate

which is the intensity. And indeed, the red guys, which happen

to be the 5’s, are tilted a little bit more to the right,

corresponding to the intensity. If you look at the other coordinate,

which is symmetry, the 1 is often more symmetric than the 5. Therefore, the guys that happen to be

the 1’s, that are the blue, tend to be higher on the vertical coordinate. And just by these two coordinates, you

already see that this is almost linearly separable. Not quite, but it’s separable enough

that if you pass a boundary here, you will be getting

most of them right. Now you realize that it’s impossible

really to ask to get all of them right because, believe it or not, this fellow

is a 5, at least meant to be a 5 by the guy who wrote it. So we have to accept the fact that

there will be stuff that is completely undoable. And we will accept an error. It’s not a zero error. But hopefully, it’s a small error. So this is what the features

look like. Now what does the perceptron

learning algorithm do? What it does is this complicated figure,

which takes the evolution of E_in and E_out as a function

of iteration. When you apply the perceptron learning

algorithm, you apply it only to E_in. E_in is the only value you have. E_out is sitting out there. We don’t know what it is. We just hope that E_in tracks it well. Let’s look at the figure. These are the iteration numbers. So this is the first misclassified

example. You go and apply the perceptron learning

algorithm again, again, again for 1000 times. As you do that, E_in, which is the

green curve, will go down and sometimes will go up. We realize that the perceptron learning

algorithm takes care of one point at a time, and therefore may mess

up other points while it’s taking care of a point. So in general, it can go up or down. But the bad news here is that the

data is not linearly separable. And we made the remark that the

perceptron learning algorithm behaves very badly when the data is

not linearly separable. It can go from something pretty

good to something pretty bad, in just one iteration. So this is a very typical behavior of

the perceptron learning algorithm. Because the data is not linearly

separable, the perceptron learning algorithm will never converge. So what do we do? We force it to terminate

at iteration 1000. That is, we stop at 1000 and take

whatever weight vector we have. And we call this g, the final

hypothesis of the perceptron learning algorithm. Now we obviously look at this, and we
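A minimal sketch of this forced-termination PLA. The data format and the rule of always picking the first misclassified point are simplifying assumptions, not the lecture's prescription (the lecture only requires picking some misclassified point):

```python
def sign(s):
    # Convention: sign(0) = +1.
    return 1 if s >= 0 else -1

def pla(data, max_iters=1000):
    """Perceptron learning algorithm, forced to terminate after max_iters
    updates. data is a list of (x, y) pairs where x already includes the
    constant coordinate x_0 = 1 and y is +1 or -1."""
    w = [0.0] * len(data[0][0])
    for _ in range(max_iters):
        mis = [(x, y) for x, y in data
               if sign(sum(wi * xi for wi, xi in zip(w, x))) != y]
        if not mis:
            break                  # linearly separable: PLA has converged
        x, y = mis[0]              # simplification: take the first one
        w = [wi + y * xi for wi, xi in zip(w, x)]   # the PLA update
    return w    # on non-separable data: whatever w is current at the cutoff
```

On separable data the loop exits early; on non-separable data it runs to the cutoff, and, as noted here, the weight vector it happens to hold at that point is not necessarily the best one encountered.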

say, if I only took this guy. This is a better guy than the other. But you know, you’re just applying

the algorithm and cutting it off. Now, one of the things you observe

from here, I plotted E_out. You’re not going to be able to plot E_out

in a real problem that you deal with, if E_out is really

an unknown function. You may be able to estimate it using

some test examples. But all you need to know here is that

E_out is drawn here for illustration, just to tell you what is happening

in reality as you work on the in-sample error. And in this case, you find that E_out

actually tracks the E_in pretty well. There is a difference. So if we go from here to here,

that’s our epsilon. It’s a big epsilon. But the good news is

that it tracks it. When this goes down, this goes down. When this goes up, this goes up. So if you make your decision based on E_in,

the decision based on E_out will also be good. That’s good for generalization. And that is one of the advantages of

something as simple as the perceptron learning algorithm. It doesn’t have too many parameters. And because of our efforts in getting

only three features, it has merely three parameters now. So the chances are that it will generalize

well, which it does. Now what does the final

boundary look like? This is only the illustration here, it’s just– this is the evolution. Eventually, you end up

with a hypothesis. The hypothesis would separate the points

in the scatter diagram you saw. So what does it look like? Well, it looks like this. This is your boundary. This is the final hypothesis, that

corresponds to the hypothesis you got at the final iteration. Well, it’s OK, but definitely

not good. It’s too deep into the blue region. You would have been better

off doing this. And the chances are maybe earlier

guys that had better in-sample error will do that. But that’s what you have to

live with, if you apply the perceptron learning algorithm. So now we go and try to modify the

perceptron learning algorithm in a very simple way, that is the simplest

modification you can ever imagine. So let’s see what happens. This is what the PLA did, right? And when we looked at it, we said:

if we only could keep this value. Well, this value is not a mystery. It happened in your algorithm. You can measure it explicitly. It’s an in-sample error. And you know that it’s better than

the value you ended up with. So in spite of the fact that you’re

doing these iterations according to the prescribed perceptron learning

algorithm rule– modify the weights according to one misclassified point–

you can keep track of the total in-sample error of the intermediate

hypothesis you got. Right? And only keep the guy that happens

to be the best throughout. So you’re going to continue

as if it’s really the perceptron learning algorithm. But when you are at the end, you keep

this guy and report it as the final hypothesis. What an ingenious idea! Now the reason the algorithm is called

the pocket algorithm is because the whole idea is to put the best solution

so far in your pocket. And when you get a better one, you take

the better one, put it in your pocket, and throw the old one. And when you are done, report

the guy in your pocket. We can do that. What does this diagram look like,

when you are looking at the pocket algorithm? Much better. You can look at these values, and

each of them is the best value so far. Here, we went down. And here, we indeed went down. Here, we went up. You see this green thing? Here, we didn’t go up, because the good guy is

in our pocket and that’s what we’re reporting the value for. And we continued with it

until we dropped again. And we dropped again. And we never changed that, because

there was never a better guy than this guy. So when we come to iteration

1000, we have this fellow. Now when you do that, you can use

perceptron learning algorithm with non-separable data, terminate it by

force at some iteration, and report the pocket value. And that will be your

pocket algorithm. And if you look at the classification
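The pocket algorithm can be sketched as follows. The data format and the deterministic choice of the first misclassified point are simplifying assumptions for illustration:

```python
def sign(s):
    return 1 if s >= 0 else -1

def in_sample_error(w, data):
    """Fraction of misclassified points, the quantity the pocket tracks."""
    wrong = sum(1 for x, y in data
                if sign(sum(wi * xi for wi, xi in zip(w, x))) != y)
    return wrong / len(data)

def pocket(data, max_iters=1000):
    """PLA iterations on possibly non-separable data, keeping the best
    weight vector seen so far 'in the pocket' and reporting it at the end.
    data: list of (x, y) with x including the constant coordinate x_0 = 1."""
    w = [0.0] * len(data[0][0])
    best_w, best_err = w[:], in_sample_error(w, data)
    for _ in range(max_iters):
        mis = [(x, y) for x, y in data
               if sign(sum(wi * xi for wi, xi in zip(w, x))) != y]
        if not mis:
            return w                     # separable: current w is perfect
        x, y = mis[0]
        w = [wi + y * xi for wi, xi in zip(w, x)]   # ordinary PLA update
        err = in_sample_error(w, data)
        if err < best_err:               # better than the pocketed one?
            best_w, best_err = w[:], err # replace the guy in the pocket
    return best_w
```

The iterations are exactly the PLA's; the only addition is measuring the in-sample error of each intermediate hypothesis and reporting the best one instead of the last one.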

boundary, PLA versus pocket, this is what we had with the perceptron

learning algorithm. We complained a little bit that it’s

too deep in the blue region. And when you look at the other guy,

which is the pocket algorithm, it looks better. It actually does what we

thought it would do. It separates them better. Still, obviously, it cannot

separate them perfectly. Nothing can, because they are

not linearly separable. On the other hand, this is

a good hypothesis to report. So with this very simple algorithm,

you can actually deal with general inseparable data, but inseparable

data in the sense that it’s basically separable, except for a few points. This guy is bad, and this guy is bad. There’s nothing we can do about them. But there are few, so we will

just settle for this. We’ll see that there are other cases

of inseparable data that is truly inseparable, in which we have to do

something a little bit more drastic. So that’s as far as the classification

is concerned. Now we go to linear regression. The word regression simply

means real-valued output. There is absolutely no other

connotation to it. It’s a glorified way of saying

my output is real-valued. And it comes from earlier

work in statistics. And there’s so much work on it that

people could not get rid of that term. And it is now the standard term. Whenever you have a real-valued

function, you call it a regression problem. So that’s, with that, out of the way. Now, linear regression is

used incredibly often in statistics and economics. Every time you say: are these

variables related to that variable, the first thing that comes to

mind is linear regression. Let me give an example. Let’s say that you would like to relate

your performance in different types of courses, to your

future earnings. This is what you do. You look at– here are the courses I took. Here is the math, science, engineering,

humanities, physical education, other. And you get your GPA in each of them. So here, I got 3.5. Here, I got 3.8. Here, I got 3.2. Here, I got 2.8. 2.8? No, no. That doesn’t happen at Caltech! You go for the other one, et cetera. So you just have the GPAs for the

different groups of courses. Now, you say– someone graduates. I’m going to look 10 years

after graduation, and see their annual income. So the inputs are the GPAs in the

courses at the time they graduated. The output is how much money they

make per year 10 years away from graduation. Now you ask yourself: how do these

things affect the output? So apply linear regression, as you will

see it in detail, and you finally find maybe the math and sciences

are more important. Or maybe all of that is an illusion. It was actually the humanities

that are important. You don’t know. You will see the data, and the data

will tell you what affects what. And any other situation like that,

people simply resort to linear regression. So in order to build it up, we are going

to use the credit example again, in order to be able to contrast it with

the classification problem we have seen before. What do we have? We have in the classification– we have the credit approval,

yes or no. That’s a classification function, binary

function, which says the output is +1 or -1. In the case of regression, we will

have a real-valued function. And the interpretation in this case is

that you’re trying to predict the proper credit line for a customer. The customer applies. And it’s not a question of approving

the credit or not. Do you give them credit limit of $800

or $1,200 or $30,000 or what, depending on their input? So this is a real-valued function. And we are going to apply regression. Now you take the input. This is the same input as we had before,

data from the applicant that are related to the credit behavior,

so the age, the salary. I suspect that the salary will figure

very significantly now when you’re trying to tell the credit line, because

if someone is making 30,000 a year, you probably are not going to give

them a credit line of 200,000. So you can see that this will

probably be affected. And there are other guys that merely

have to do with the stability of the person. Years in residence. If the person has been in the same

residence for 10 years, they are unlikely to skip town. On the other hand, if they have been

there for only one month, well, you don’t know– that type of thing. So you have these variables. You encode them as the input x. And then your output in this case, which

is the linear regression output, is a hypothesis form which takes

this particular form. Let’s spend some time with

it to understand it. First, it’s regression because

the output is real. It’s linear regression because the form,

in terms of the input, is linear. Now, we have seen this before. We sum up, from i = 1 to d, the genuine inputs,

in their weighted version w_i x_i. And then we add the mandatory x_0, which

is 1 and whose weight w_0 takes care of the threshold. Altogether, h(x) = w_0 x_0 + w_1 x_1 + ... + w_d x_d = w^T x. This is the form we have seen before,

except that when we saw it before, we took this as a signal that

we only care about its sign. If it’s plus, we approve credit. If it’s minus, we don’t

approve credit. And we treated it as a credit

score, per se, when you take out the threshold. Now in this case, this is the output. We don’t threshold it. We don’t say it’s +1 or -1. The w_0 is still in there. But we don’t take it as

+1 or -1. We take it as a real number. And this is the dollar amount

we are going to give you as a credit line. Now the signal here will play

a very important role in all the linear algorithms. This is what makes the

algorithm linear. And whether you leave it alone as in

linear regression, you take a hard threshold as in classification or, as we

will see later, you can take a soft threshold, and you get a probability

and all of that– All of these are considered

linear models. And the algorithm depends on this

particular part, which is the signal being linear. We also took the trouble to
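The family of models just described shares one linear signal. A small sketch (the function names are mine, and the soft threshold is only a preview of what the lecture says comes later):

```python
import math

def signal(w, x):
    """The linear signal s = w^T x, with the constant x_0 = 1 included in x."""
    return sum(wi * xi for wi, xi in zip(w, x))

# The same signal, used three ways:
def classify(w, x):            # hard threshold: perceptron / classification
    return 1 if signal(w, x) >= 0 else -1

def regress(w, x):             # no threshold: linear regression output
    return signal(w, x)

def soft_threshold(w, x):      # soft threshold, previewed for a later lecture
    return 1 / (1 + math.exp(-signal(w, x)))
```

All three are linear models in the lecture's sense: the learning happens in the signal w^T x, and only the final mapping of the signal differs.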

put it in vector form. And the vector form will simplify the

calculus that we do in this lecture in order to derive the linear

regression algorithm. But you can always– if you hate

the vector form, you can always go back to this. There is nothing mysterious

about this. This simply has a bunch of parameters,

w_0, w_1, up to w_d. And if you’re trying to minimize something,

you can minimize it with respect to the scalar variables, which

requires only very primitive calculus. But we obviously will do it in the

shorthand version, which is the vector or the matrix form, in order to

be able to get the derivation in an easier way. So that’s the problem. What is the data set in this case? Well, it’s historical data, but it’s

a different set of historical data. The credit line is decided

by different officers. Someone sits down and evaluates your

application and decides that this person gets 1000 limit, this person

gets 5000 limit, and whatnot. All we are trying to do in this

particular example is to replicate what they’re doing. We don’t want the credit

officer to do that. The credit officers sometimes are

inconsistent with one another. They may have a good day or a bad day. So we’d like to figure out what pattern

they collectively have in deciding the credit, and have

an automated system decide that. That’s what the linear regression

system will do for us. The historical data here are again

examples from previous customers. And the previous customers–

this is x_1, and this is y_1. So this is the application

that the customer gave. And this is the credit line

that was given to them. No tracking of credit behavior, we’re

just trying to replicate what the experts do in this case. And then you realize that each of these

y’s is actually a real number, which is the credit line that

is given to customer x_n. And that real number will likely

be a positive integer. It’s a credit line. It’s a dollar amount. And what we are doing is trying

to replicate that. That’s the statement of the problem. So what does linear regression do? First, we have to measure the error. We didn’t talk about that in

the case of classification, because it was so simple. Here, it’s a little bit less simple. And then, we’ll be able to discuss

the error function for classification as well. What do we mean by that? You will have an algorithm that tries

to find the optimal weights. These are the weights you’re

going to have. These weights are going to determine

what hypothesis you get. Some hypotheses will

approximate f well. Some hypotheses will not. We would like to quantify that, to give

a guidance to the algorithm in order to move from one hypothesis

to another. So we will define an error measure. And the algorithm will try to minimize

the error measure by moving from one hypothesis to the next. If you take linear regression, the

standard error function used there is the squared error. Let me write it down. Well, if you had a classification, there

is only a simple agreement or disagreement on a particular example. You either got it right

or got it wrong. There is nothing else. Therefore, in that case, we

just defined binary error. Did you get it right or wrong? And we found the frequency

of getting it right. And we got the E_in and E_out. Here, you are estimating

a credit line. So if the guy gets 1000, and you tell

them 900, that’s not too bad. If the guy gets 1000, and you

tell them 5000, that’s bad. So you need to measure how

bad the situation is. And you define an error measure,

and you define it by the simple squared error. Now, squared error doesn’t have
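Using the credit-line numbers mentioned above, the contrast between the two error measures is easy to see:

```python
def squared_error(h_x, y):
    """Standard error measure for linear regression."""
    return (h_x - y) ** 2

def binary_error(h_x, y):
    """The error measure we used for classification: right or wrong, nothing else."""
    return 0 if h_x == y else 1

# The credit-line example: the right value is 1000. Estimating 900 is not
# too bad; estimating 5000 is much worse, and squared error reflects that,
# while binary error treats both misses identically.
print(squared_error(900, 1000))    # 10000
print(squared_error(5000, 1000))   # 16000000
```

This is why a graded error measure is needed once the output is real-valued: "right or wrong" cannot distinguish a near miss from a wild one.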

an inherent merit here. It just happens to be the standard

error function used with linear regression. And its merit really is the simplicity

in the analytic solution that we are going to get. But when we discuss error measures in

the next lecture, we will go back to the principle, does error

measure matter? Why? How do we choose it? Et cetera. This will be answered in

a principled way next time. But for this time, let’s take this

as a standard error measure we are going to use. When you look at the in-sample error,

you use the error measure. On the particular example n,

n from 1 to N. For each example, this is the contribution

of the error. Each of these is affected by the

same w, because h depends on w. So as you change w, this value will

change for every example. And this is the error in that example. And if you want to get all the in-sample

error, you simply take the average of those. That will give you a snapshot

of how your hypothesis is doing on the data set. And now, we are going to ask our

algorithm to take this error and minimize it. Let’s actually just look at what

happens as an illustration. This is the simplest case

for linear regression. The input is one-dimensional. I

have only one relevant variable. I want to relate your overall GPA to

your earnings 10 years from now. Your overall GPA is x. Your earnings 10 years from now is y. That’s it. OK? [CHUCKLES] I would have properly called this

x_1 according to our notation. And then there would be an x_0,

which is the constant 1. But I didn’t bother, because

I have only one variable. But this is what we have. So you look at this. And you see that, for different

x’s, you have these guys. Wow. Your earnings are going down with– Well, that may not have been the

example that is drawn here. What linear regression does is it tries

to produce a line, which is what you have here, that tries to fit

this data according to the squared-error rule. So it may look like this. And in this case, the threshold

here depends on w_0. The slope depends on w_1, which

is the weight for x. And that is the solution you have. Now you didn’t get it right, but

what you got is some errors. And you realize that– this is

the error on the first example. This is the error on

the second example. And if you sum up the squares of the

lengths of these bars, that is what we called the in-sample error that we defined

in the previous viewgraph. Well, linear regression can apply
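A minimal sketch of this one-dimensional picture, computing the average of the squared vertical bars for one candidate line. The data points and the line's parameters are made up for illustration:

```python
# One-dimensional linear regression picture: a line h(x) = w0 + w1 * x, and
# the in-sample error as the average of the squared vertical errors.
data = [(2.8, 40.0), (3.2, 55.0), (3.5, 60.0), (3.8, 80.0)]  # (GPA, earnings)

w0, w1 = -100.0, 45.0    # some candidate line (not the optimal one)

def h(x):
    return w0 + w1 * x

# Each term is the squared length of one vertical bar in the figure.
E_in = sum((h(x) - y) ** 2 for x, y in data) / len(data)
print(E_in)
```

Changing w0 (the threshold) and w1 (the slope) changes every bar at once, and the algorithm's job is to pick the pair that makes this average as small as possible.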

to more than one dimension. And I can plot 2 dimensions here

just to illustrate it. It’s the same principle. What you have here is you have x_1. If I can get the pointer– OK, we’ll leave it to rest. We have x_1 and x_2. And in this case, the linear

thing is really a plane. And you’re again not separating, but

trying to estimate these guys. And you’re making errors. And in general, when you go to

a higher-dimensional space, the line– which is the reason why we call it

linear– is not really a line. It’s a hyperplane, one dimension short

of the space you are working with. And that’s what you are trying to

use to approximate the guys. Now let’s look at the

expression for E_in. And that is the analytic expression

we are going to try to minimize. And that will make us derive the

linear regression algorithm. We wrote this before. And you have the value of the

hypothesis minus y_n squared. That is because it’s a squared error. And because it’s linear regression, this

value, h of x_n, happens to be w transposed x_n. It’s a linear function of x_n. Now let us try to write this

down in a vector form. I will explain this in detail. But let’s look at this. Instead of the summation, all of

a sudden, I have a norm squared of something that is– Capital X, I haven’t seen

capital X before. I haven’t seen vector y before. Well, it’s basically a consolidation

of the different x_n’s here. x_n is a vector. So you put the vectors in a matrix. You call it X. And you put the

scalars, the y_n, in a vector. And you call it y. The definition of capital X and

the vector y is as follows. For the matrix X, what you do– you

put your first example here. So this would be the constant coordinate 1,

the first coordinate, second coordinate, up to the d-th

coordinate, the last coordinate. And then you go for the second

example, and do the same and construct this matrix. And for y, you put the

corresponding output. This is the output for the first

example, output for the second example, output for the last example. Now one thing to realize

about the matrix X is that it’s pretty tall. The typical situation is that

you have few parameters. We reduced them to three, for example,

in the case of the classification of the digits. But you usually have many, many

examples, in the 1000’s. So this will be a very,

very long matrix. Now the way you take this– well, the norm squared is simply this vector transposed times itself. And when you do it, you realize that what you are doing is summing up contributions from the different components, and each component happens to be exactly what you have here. So this becomes a shorthand for writing this expression.
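In symbols, the shorthand says E_in(w) = (1/N) sum_n (w^T x_n - y_n)^2 = (1/N) ||Xw - y||^2. A quick numerical check of that equivalence, on fabricated data:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 6, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])  # rows are [1, x_1, ..., x_d]
y = rng.normal(size=N)
w = rng.normal(size=d + 1)

# Summation form: average of (w^T x_n - y_n)^2 over the N examples
E_sum = np.mean([(w @ X[n] - y[n]) ** 2 for n in range(N)])

# Matrix form: (1/N) * ||Xw - y||^2
E_mat = np.linalg.norm(X @ w - y) ** 2 / N

print(np.isclose(E_sum, E_mat))  # True: the two forms agree
```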

Now, let’s look at minimizing E_in. When you look at minimizing, you realize that the matrix X, which has the inputs of the data, and y, which has the outputs of the data, are, as far as we are concerned, constants. This is the data set someone gave me. The parameter I’m actually playing

with in order to get a good hypothesis is w. So E_in is of w. And w appears here. And the rest are constants. If I do any calculus of minimization,

it is with respect to w. So I try to minimize this. And what you do– you get the derivative

and equate it with 0, except here, it’s a glorified

derivative. You get the gradient, which is

the derivative on a bunch of them all at once. And there is a formula for it, which

is pretty simple in this case. I will explain it. By the way, if you hate this and, because linear regression is so important, you want to make sure and verify that it’s true, you can always go for the scalar form: get partial E by partial every w, that is, partial w_0, partial w_1, up to partial w_d, and get a formula that is a pretty hairy

one, and then try to reduce it. And– surprise, surprise– you will get the

solution here that we have in matrix form in two steps. Now if you look at this, deal with it in

terms of calculus as if it was just a simple square. If this was a simple square, and w

was the variable, what would the derivative be? You will get 2 sitting outside. Well, you’ve got it here. And then you will get the same

thing in a linear form. You got it here. And then you will get whatever constant

was multiplied by w to sit outside, which you got here. You just got here with a transpose,

because this is really not a square. This is the transpose of

this times itself. That’s where you get the transpose. Pretty straightforward and

standard matrix calculus. So that’s what you have. And then you equate this to 0, but it’s a fat 0. It’s a vector of 0’s. You want all the derivatives to be 0 all at once. And that will define a point where this achieves a minimum.
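In symbols, the minimization just described (same quantities as on the slide) is:

```latex
E_{\text{in}}(w) = \frac{1}{N}\,\lVert Xw - y\rVert^{2},
\qquad
\nabla E_{\text{in}}(w) = \frac{2}{N}\,X^{\mathsf{T}}(Xw - y) = \mathbf{0}
\;\Longrightarrow\;
X^{\mathsf{T}} X w = X^{\mathsf{T}} y .
```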

Now, you would suspect that the solution will be simple, because this is a very simple quadratic form. And indeed, the solution is simple. And if you look at it, you realize that

if I want this to be 0, then I want this to cancel out. So when I multiply X transposed X w, I get

the same thing as X transposed y. So they cancel out, and I get my 0. So you write this down, and you find

that this is the situation. I want this term to be

equal to this term. And that will give me the 0. The interesting thing is that in spite

of the fact that the matrix X is a very tall matrix, definitely

not square, hence not invertible, X transposed X is actually a square

matrix, because X transposed is this way and X is this way. You multiply them, and you get

a pretty small square matrix. And as we will see, the chances are

overwhelming that it will be invertible. So you can actually solve this very

simply, by inverting this. You multiply by the inverse

in this direction. You multiply by this. This will disappear, and you will get

an explicit formula for w, which you were trying to solve for. And when you do that, you will get w

equals this funny symbol, X dagger. What is X dagger? This is simply a shorthand

for writing this. So I got the inverse of that, and

then multiplied it by here. So this is really what I get

to be multiplied by y. I call it X dagger. And indeed, it gets multiplied

by y to give me my w. Now the X dagger is a pretty

interesting notion. It’s called the pseudo-inverse of X.
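A small numerical sketch of the definition, X dagger = (X^T X)^{-1} X^T, on a fabricated tall matrix (NumPy's np.linalg.pinv computes the same object more robustly):

```python
import numpy as np

# Hypothetical tall data matrix: N = 6 examples, d + 1 = 3 columns
X = np.array([[1.0, 0.2, 1.1],
              [1.0, 0.5, 0.3],
              [1.0, 1.5, 2.2],
              [1.0, 2.1, 0.7],
              [1.0, 3.3, 1.9],
              [1.0, 4.0, 2.5]])

X_dagger = np.linalg.inv(X.T @ X) @ X.T  # (X^T X)^{-1} X^T

# X_dagger @ X is the (d+1)-by-(d+1) identity: an inverse of sorts.
print(np.allclose(X_dagger @ X, np.eye(3)))  # True
# The other way around, X @ X_dagger, is N-by-N and is NOT the identity.
print(np.allclose(X @ X_dagger, np.eye(6)))  # False
```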

X, being a non-invertible matrix, does not have an inverse. But it does have a pseudo-inverse. And the pseudo-inverse has

interesting properties. For example, if you take this, the X

dagger, and multiply it by X– so X dagger times X– what do you get? You add X here. You get X transposed X. Oh, I have

X transposed X to the -1 here. So they cancel out, and

I get an identity. So when I multiply X dagger

by X, I get the identity. So it’s OK to call it

an inverse of sorts. It doesn’t work the other way around. The other way around gives us

an interesting matrix, which we’ll talk about later. But basically, this is

the essence of it. If we were in a trivial situation

where X was a square– I have 3 parameters, and I have

3 examples to determine them– that can be solved perfectly. I can actually get this to be 0. And how would you get it to be 0? You would just multiply by the proper

inverse of X in this case, and you will get X inverse y. So this is pretty much similar,

when X is a tall one. And we are not going to get a 0. We’re just going to get a minimum

using the pseudo-inverse. Now I would like you to appreciate the

pseudo-inverse from a computational point of view. This is the formula for the pseudo-inverse

that you will need to compute, in order to get the solution

for linear regression. So let’s look at it. Something is inverted. And when you see inversion in matrix,

you say, oh, computation, computation. If this was a million by a million,

I’m in trouble. If this is 5 by 5, I’m in good shape. So we’d like to know, what kind

of matrix do we have here? Well, nothing mysterious about

what’s inside this. You have this fellow, which

is X transposed. It’s d plus 1, d is the

length of your input, 1 is the added constant variable. So these are the number of parameters. This would be 3 in the digit

classification guy. We have only x_1 and x_2, so d equals 2. d plus 1 equals 3, which corresponds

to x_0, x_1, x_2, or to w_0, w_1, w_2. So this is 3 times N.
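The shapes involved, for a hypothetical d = 2 and N = 1000:

```python
import numpy as np

N, d = 1000, 2                   # many examples, few parameters
rng = np.random.default_rng(5)
X = rng.normal(size=(N, d + 1))  # stand-in for the tall data matrix

print(X.T.shape)        # (3, 1000): X transposed is d+1 by N
print((X.T @ X).shape)  # (3, 3): a small square matrix, cheap to invert
```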

N is the scary one. That’s the number of examples. That could be in the thousands. Now you multiply this by X,

and that’s what you have. The multiplication will be– multiplication is

not that difficult. Even if this is 10,000, I can

multiply this by 10,000. But the good news is that when I go to

this guy, I will have to be dealing with a simpler guy. Let’s just complete the formula first. This is what you have. This is what you are computationally

doing. And if you look at what’s inside

here, it completely shrinks. That is what the matrix inside is. It’s just 3 by 3 in our case. You can invert that. Just accumulating it is the one that

you have to go through all of the examples. And there’s a very simple

way of doing it. It’s not that difficult

to get this fellow. And you can see now that, oh, good

thing that we had 3 parameters. If we had the 257 parameters to begin

with, this would have been 257 by 257. Not that this will discourage us. But if you go for some raw inputs, you

can get something really in the thousands or sometimes

even more than that. So the computational aspect

of this is very simple. And there are so many packages for

computing the pseudo-inverse, or outright getting the solution for linear

regression, that you will never have to do that yourself, except

if you’re doing something very specialized. If you do have something very

specialized, it’s not that bad. So that is the final matrix. And the final matrix will have the

same dimensions as this guy. And if you look at it, this will

be multiplied by what? Multiplied by y, which is y_1,

y_2, y_3, y_N, corresponding to different outputs. And then, as a result of that,

you will get the w’s– w_0, w_1, up to w_d. Indeed, if you multiply this by an N

tall vector, you will get a d plus 1 tall vector, and that’s

what we expect. Let’s now flash the full linear

regression algorithm here. That’s a crowded slide. That is what you do. The first thing is you take the data

that is given to you, and put them in the proper form. What is the proper form? You construct the matrix

X and the vector y. And these are what we

introduced before. This will be the input data matrix, and

this will be the target vector. And once you construct them, you are basically done, because all you are going to do is plug this into a formula, which is the pseudo-inverse. And then you will return the value w, which is the multiplication of that pseudo-inverse with y. And you are done.
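Put together, the whole algorithm fits in a few lines. A sketch with fabricated data (the targets are generated noise-free from y = 1 + 2*x_1 - x_2, so the fit should recover those coefficients):

```python
import numpy as np

def linear_regression(inputs, targets):
    """One-step learning: construct X and y, then return w = X_dagger @ y."""
    N = len(inputs)
    X = np.hstack([np.ones((N, 1)), np.asarray(inputs)])  # prepend x_0 = 1
    y = np.asarray(targets)
    X_dagger = np.linalg.inv(X.T @ X) @ X.T               # pseudo-inverse of X
    return X_dagger @ y                                   # w = [w_0, w_1, ..., w_d]

rng = np.random.default_rng(1)
inputs = rng.normal(size=(20, 2))
targets = 1 + 2 * inputs[:, 0] - inputs[:, 1]  # hypothetical noise-free targets

w = linear_regression(inputs, targets)
print(np.round(w, 6))  # approximately [ 1.  2. -1.]
```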

Now you can call this one-step learning if you want. With the perceptron learning algorithm,

it looked more like learning, because I have

an initial hypothesis. And then I take one example at a time,

and try to figure out what is going on, move this around, et cetera. And after 1000 iterations,

I get something. It looks more like

what we learn. We learn in steps. This looks like cheating. You give me the thing,

and [MAKES SOUND]. And you have the answer. Well, as far as we are concerned,

we don’t care how you got it. If it’s correct and gives you a correct

E_out, you have learned. And because this is so simple, this is

a very popular algorithm that is used often, and used often as a building

block for other guys. We can afford to use it as a building

block, because the step here will be so simple that we can become more

sophisticated in using it. Just one remark about the inversion– this has to be invertible in order

for this formula to hold. Now the chance that this will be invertible in a real application is close to 1. The reason is the following. Usually, you use very few parameters

and tons of examples. You will be very, very, very unlucky

to have these so dependent on each other that you cannot even capture

the dimensionality which is the number of columns. The number of columns is 3, 5, 10,

and you have 10,000 of those. So the chances are overwhelming in

a real problem that this will be invertible. Nonetheless, if it is not invertible,

you can still define the pseudo-inverse. It will not be unique and has

some elaborate features, but it’s not a big deal. That is not a situation you will

encounter in practice. So now we have linear regression. I’m going to tell you that you can

use linear regression not only for a real-valued function, for

regression problems. But you’re also going to be able

to use it for classification. Maybe the perceptron is now

going out of business. It has a competitor now. And the competitor has a very

simple algorithm. So let’s see how this works. The idea is incredibly simple. Linear regression learns

a real-valued function. Yeah, we know that. That is the real-valued function. The value belongs to the real numbers. Fine. Now the main observation, the ingenious

observation, is that binary-valued functions, which are the

classification functions, are also real-valued. +1 and -1, among other things,

happen to be real numbers. So linear regression is not going to

refuse to learn them as real numbers. Right? So what do we do? You use linear regression in order to

get a solution, such that the solution is approximately y_n in the

mean squared sense. For every example, the actual value

of the signal is close to the numerical +1 and the numerical -1. That’s what linear regression does. Now, having done that with y_n equals +1

or -1, you realize that in this case, if you take the

classification version of it– you take the sign of that signal in order

to be able to classify as +1 or -1. If the value is genuinely close to

+1 or -1 numerically, then the chances are when it’s +1,

this would be positive. And when it’s -1, it’s negative. The chances are– you’re getting close to

a number, you’ll probably cross the zero in doing that. And if you cross the zero, the

classification will be correct. So if you take this, and then plug it in

as weights for classification, you will likely get something

that will agree with +1 or -1. That’s a pretty simple trick, because it’s almost free. All you need to do– I have a classification problem. Let’s run linear regression. Do this one-step learning, get a solution, and use it for classification.
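A sketch of the trick on fabricated, linearly separable data; the ±1 labels come from a hypothetical true line:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 100
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])  # rows [1, x_1, x_2]
y = np.sign(X @ np.array([0.2, 1.0, -1.0]))                # ±1 labels from a true line

# One-step linear regression, treating the ±1 labels as real numbers
w = np.linalg.inv(X.T @ X) @ X.T @ y

# Classify by the sign of the signal
predictions = np.sign(X @ w)
accuracy = np.mean(predictions == y)  # high, though not necessarily perfect
```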

Now, let’s see if this is as good as it sounds. Well, the weights are good for classification, so to speak, just by conjecture. But they also may serve as good initial

weights for classification. Remember that the perceptron algorithm,

or the pocket algorithm, are really very slow to get there. You start with a random guy. Half the guys are misclassified. And it just goes around, tries to

correct one, messes up the others, until it gets to the

region of interest. And then it converges. Why not give it a jump start? Why not run linear regression first and get the w’s? We know that the w’s are OK, but they are not really tailored toward classification. But they’re a good initial condition. Feed those to the pocket algorithm, and let it run to the solution, which is a classification solution. That’s a pretty nice idea.
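That warm start can be sketched as follows; the pocket implementation here is a bare-bones stand-in, and the data and iteration budget are made up:

```python
import numpy as np

def pocket(X, y, w, max_iters=200):
    """PLA updates, but keep ('pocket') the best in-sample weights seen so far."""
    best_w = w.copy()
    best_err = np.mean(np.sign(X @ w) != y)
    for _ in range(max_iters):
        mis = np.where(np.sign(X @ w) != y)[0]
        if len(mis) == 0:
            return w                 # data separated: zero in-sample error
        n = mis[0]
        w = w + y[n] * X[n]          # perceptron update on a misclassified point
        err = np.mean(np.sign(X @ w) != y)
        if err < best_err:
            best_w, best_err = w.copy(), err
    return best_w

rng = np.random.default_rng(3)
N = 100
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])
y = np.sign(X @ np.array([0.5, 1.0, -1.0]))

w_reg = np.linalg.inv(X.T @ X) @ X.T @ y  # regression weights: the jump start
w = pocket(X, y, w_reg)                   # pocket can only do as well or better
```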

So let’s actually look at the linear regression boundary. Now, I take an example here. Again, I have the +1 class and the -1 class. And I applied– we’re trying to find, what is the

linear regression solution? Now, we remember, the blue region

and the pink region belong to classification. When you talk about linear regression,

you have the value here. And the signal is 0 here. The signal is positive, more positive,

more positive, more positive. And here, the signal is negative,

more negative, more negative, more negative. There is a real-valued function that

we are trying to interpret as a classification by taking the sign. Now, if you look at what the linear

regression is trying to do when you use it for classification,

all of these guys have a target value -1. It is actually trying to make

the numerical value equal -1 to all of them. So the chances are, these

will be -1. This will be -2, -3. And the linear regression algorithm

is very sad about that. It considers it an error, in spite of

the fact that, when we plug it into the classification, it just

has the correct sign. And that’s all we care about. But we are applying linear regression. It is actually trying very hard to make

all of them -1 at the same time, which obviously it cannot. And you can see now the problem

with linear regression. In its attempt to make this -8 equal to -1, it moved the boundary to the level where it’s in the middle

of the red region. And now, it’s very happy because it

minimized its error function. But that’s not really

the classification. Nonetheless, it’s a good

starting point. And then you take the classification now,

that forgets about the values and tries to adjust it according

to the classification. And you will get a good boundary. That’s the contrast between applying

linear regression for classification and linear classification outright. Now we are done. I’m going to start on nonlinear

transformation. And I’m going to give you a very

interesting tool to play with. Here is the deal. You probably realized that, even when we were dealing with non-separable data, the data were really basically separable, with few exceptions. But in reality, when you take a real

problem, a real-life problem, you will find that the data you are going

to get could be anything. It could be, for example, something

that looks like this. So you want to classify these as

+1’s and these as -1’s. Let’s take the classification

paradigm here. Now I can put the line anywhere. And obviously, I’m in trouble because

this is not linearly separable, even by a long shot. You can look at this and say:

I can see the pattern here. Closer to the center, you have blues. Closer to the peripherals,

you have reds. So it would be very nice if

I could apply a hypothesis that looks like this. Yes. The only problem is that

that’s not linear. We don’t have the tools

to deal with that, yet. Wouldn’t it be nice if in two viewgraphs,

you can use linear regression and linear classification, the

perceptron or the pocket, to apply it to this guy? That’s what will happen. I told you this is

a practical lecture. So we take another example

of nonlinearity. We take the credit line. Now if you look at the credit line, the

credit line is affected by years in residence. We argued that if someone has been in

the same residence for a long time, there is stability and

trustworthiness. And if someone has been there a short time,

there’s a question mark. Now one thing is to say that this is

a variable that affects the output. Another thing to say is that

this is a variable that affects the output linearly. It would be strange if I’m trying to

determine a credit line, to decide that the credit line will be proportional

to the time you have lived in residence. If you have 10 years, 20 years, I will

give you twice the credit line. It doesn’t make sense. Because stability is established

probably by the time you get to 5 years. After that, it’s diminishing returns. So it would be very nice if I can

instead of using the linear one, define nonlinear features,

which is the following. Let’s take the condition, the logical

condition, that the years in residence are less than 1. And in my mind, I’m considering that

this is not very stable. You haven’t been there for very long. And another guy, which is x_i greater

than 5, you have been there for more than 5 years. So you are stable. The notation here, when I put something between these brackets, means that this returns 1 if the condition is true, and returns 0 if the condition is false. So this is 1 or 0, and this is 1 or 0.
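Those two indicator features, as code (thresholds 1 and 5 as in the example; the function name is made up):

```python
def residence_features(years_in_residence):
    """Nonlinear 0/1 features derived from the raw 'years in residence' input."""
    unstable = 1 if years_in_residence < 1 else 0  # [[ x_i < 1 ]]
    stable = 1 if years_in_residence > 5 else 0    # [[ x_i > 5 ]]
    return unstable, stable

print(residence_features(0.5))   # (1, 0): not very stable
print(residence_features(3.0))   # (0, 0): in between
print(residence_features(10.0))  # (0, 1): stable
```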

Now if I had those as variables in my linear regression, they would be much more friendly to the linear formula in deciding the credit line, rather than the crude input. But these are nonlinear

functions of x_i. And again, we have the nonlinearity. And we wonder if we can apply the same

techniques to a nonlinear case. This is the question. Can we use linear models? The key question to ask

is, linear in what? What do I mean? Look at linear regression. What does it implement? It implements this. This is indeed a linear formula. And when you look at the linear

classification counterpart, it implements this. This is a linear formula, and the

algorithm being simple depends on this part being linear. And then you just make a decision

based on that signal. Now, these you would think are called

linear because they are linear in the x’s, which they are. Yeah, I get these inputs. And I combine them linearly. And I get my surface. That’s why I’m calling it linear. However, you will realize that,

more importantly, these guys are linear in w. Now when you go from the definition

of a function to learning, the roles are reversed. The inputs, which are supposed to be

the variable when you evaluate a function, are now constants. They are dictated by the training set. They’re just a bunch of numbers

someone gave me. The real variables, as far as learning

is concerned, are the parameters. The fact that it’s linear in the

parameters is what matters in deriving the perceptron learning algorithm, and

the linear regression algorithm. If you go back to the derivation, it

didn’t matter what the x’s were. The x’s were sitting

there as constants. And their linearity in w is what

enabled the derivation. That results in the algorithm

working, because of linearity in the weights. Now that opens a fantastic possibility,

because now I can take the inputs, which are just constants. Someone gives me data. And I can do incredible nonlinear

transformations to that data. And it will just remain more elaborate

data, but constant. When I get to learn using the

nonlinearly transformed data, I’m still in the realm of linear models,

because the weight that will be given to the nonlinear feature will

have a linear dependency. Let’s look at an example. Let’s say that you take x_1 and x_2. I omitted the constant x_0

here, for simplicity. And these are the guys

that gave us trouble. These are the coordinates. This is x_1. This is x_2. These guys should map to +1. These guys should map to -1. I don’t have a linear separator. OK, fine. These are data, right? So everything that appears within this

box is just a bunch of constant x’s and corresponding constants y. Now I’m going to take

a transformation. I’m going to call it phi. Every point in that space, I’m going

to transform to another space. And my formula for transformation

will be this. I’m assuming here that the origin of

the coordinate system is here. So I’m taking x_1 squared

and x_2 squared. And you can see where I’m leading,

because now I’m measuring distances from the origin. And that seems to be

a helpful guy here. Now in doing this, all I did was take

constants and produce other constants. Now, you can look at this and say:

this is my training data. I take your original training data, do

the transformation, and forget about the original one. Can you solve the problem

in the new space? Oh, yes you can, because that’s what

they look like in the new space. All of a sudden, the red guys, which

happen to be far away, will have bigger values for x_1 squared

and x_2 squared. They will sit here. And the guys that are closer to the

origin, by the time they transform them, they will have smaller

values here. So this is now your new data set. Can you separate this

using a perceptron? Yes, I can. I can put a line going through here. Great. When you get a new point to classify, transform it the same way, classify it here, and then report that. That’s the game.
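The whole game, sketched end to end on fabricated circle-like data. The transformation z = (x_1^2, x_2^2) is the one on the slide; the radius, the sample, and the separating line in z-space are assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=(200, 2))
# Hypothetical target: +1 inside a circle of radius 0.7 about the origin, -1 outside.
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 0.49, 1.0, -1.0)

# The nonlinear transformation phi: (x_1, x_2) -> (z_1, z_2) = (x_1^2, x_2^2)
Z = np.hstack([np.ones((len(X), 1)), X**2])

# In z-space the circle becomes the line z_1 + z_2 = 0.49, so these
# weights separate the transformed data.
w = np.array([0.49, -1.0, -1.0])
predictions = np.sign(Z @ w)
accuracy = np.mean(predictions == y)  # 1.0 on this sample
```

A new point would be run through the same transformation before applying the weights.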

And there is really no limit, at least computationally, in terms of what you can do here. You can dream up really elaborate

nonlinear transformations, transform the data, and then do

the classification. There is a catch. And it’s a big catch. I will stop here. And we’ll continue with the nonlinear

transformation at the beginning of the next lecture. And we’ll take a short break now, before

we go to the Q&A session. We have questions from the online audience. MODERATOR: A popular question is how to figure out the nonlinear transformations in a systematic way, instead of from the data. PROFESSOR: I said

that the nonlinear transformation is a loaded question. And there will be two steps

in dealing with it. I will talk about it a little bit more

elaborately at the beginning of next lecture. And then we are going to talk about the

guidelines for choice, and what you can do and what you cannot do, after we

develop the theory of generalization because it is very sensitive to

the generalization issue. And that should not come as a surprise,

because I can see that I can take the input, which is, let’s say, two

variables corresponding to two parameters. And I want the transformation to be as

elaborate as possible, in order to stand a good chance of being able

to separate them linearly. So I’m going to go all out. I’m just going to keep getting

nonlinear coordinates– x_1, x_1 squared, x_1 cubed, x_1

squared x_2, e to the x, just go on. Now at some point, you should smell

a rat, because you realize that I have this very, very long vector and

corresponding number of parameters. And generalization may become an issue,

which it will become an issue. So there are guidelines for

how far you can go. And also, there are guidelines

for how you can choose them. Do I look at the data and figure

out what is a good nonlinear transformation? Is this allowed? Is this not allowed? What are the ramifications? All of these will become clear only

after you look at the theory part. MODERATOR: OK. There’s a question about slide 15. So regarding the expression of E_in. How does the in-sample error here, or

the out-of-sample error, relate to the probabilistic definition

of last time? PROFESSOR: OK. Here we dealt only with

the in-sample error. So we decided on E_in. And in general in learning, you only

have the in-sample error to deal with. You have on the side a guarantee that

when you do well in-sample, you will do well out-of-sample. So you never handle the out-of-sample

explicitly. You just handle the in-sample, and have

the theoretical guarantee that what you are doing will help

you out-of-sample. Now, the error measure here

was a squared error. Therefore, when you define the in-sample

error, you get the squared error and average it. And when you define the out-of-sample

error, it’s really the expected value of the squared error. Now in the case of the binary

classification, the error was binary. You’re either right or wrong. So you can always define the in-sample error as the average of the question: am I right or wrong on every point? If you are right, there’s no error and you get 0. If you are wrong, you get 1. So you ask yourself: what is the frequency of 1’s in-sample? And that would give you the in-sample error.
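The two error measures, side by side on hypothetical numbers:

```python
import numpy as np

y_true = np.array([1.0, -1.0, 1.0, 1.0, -1.0])   # hypothetical ±1 targets
signal = np.array([0.8, -1.3, 0.4, -0.2, -0.9])  # hypothetical real-valued hypothesis values

# Regression view: average squared deviation from the targets
E_in_squared = np.mean((signal - y_true) ** 2)   # 0.388

# Classification view: frequency of wrong signs (binary error)
E_in_binary = np.mean(np.sign(signal) != y_true)
print(E_in_binary)  # 0.2: one of the five points has the wrong sign
```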

The expected value of that error happens to be the probability of error. That’s why, without going into expectation and in-sample average versus out-of-sample expected value– in the case of classification, we simply

talked about frequency of error and probability of error, not because

they are different, but just because they are simple to state. But in reality, the aspect of them that

made them qualify as in-sample and out-of-sample is that the probability

is the expected value of an error measure that happens to

be a binary error measure. And the frequency of error happens

to be the average value of that error measure. STUDENT: So you showed us a very nice

graph with negative slope about dependence of future income and– PROFESSOR: This

is unintentional. I didn’t think of the income at

the time I drew the graph. So any implication that you should

really do worse in school in order to gain more money is– I disown any such conclusion! STUDENT: OK. But you mentioned the example of

determining future income from grade point average, or at least finding

some correlation. So the question I’m interested

in is, where can we get data? PROFESSOR: You can get– obviously, the alumni association

of every school keeps track of the alumni. And they send them questionnaires. And they have some of the inputs,

and how much money they make. There are a number of parameters. So there will be a number of

schools that have that. And actually, this is actually used. If you realize that something is

related to success or something, you can go back and revise your curriculum

or revise your criteria. So the data is indeed available,

if that’s the question. STUDENT: I mean, it’s available

in principle. But can we get it? PROFESSOR: Oh, we get it. I thought it was generic we. I don’t– obviously, the data will be

anonymous after a while. You’ll just get the GPA and the income,

without knowing who the person is. You are dependent on the kindness of the

alumni associations at different schools, I guess. Or maybe there are some available

in public domain. I have not looked. So my understanding is that you want

to run linear regression, see what happens, and then focus your time

on the courses that matter. That’s the idea now? That’s your feedback? MODERATOR: A technical question. Why is the w_0 included in

the linear regression. So there’s a confusion about this. And

also in that point, what do you do specifically in the binary case? How do you incorporate the

+1’s or -1? There’s some people asking about this. PROFESSOR: Let me

answer one at a time. I’ll talk about the threshold first. Why the threshold is there, right? Let’s look here. If you look at the line here,

the linear regression line. The linear regression line is

not a homogeneous line. It doesn’t pass by the origin. If I told you that you cannot use

a threshold, then the constant part of the equation goes away, and the

line you have will have to pass through the origin. Can you imagine if you were trying

to fit this with a line? Obviously, it would be down there

if you have the negative slope, or if you want to pass through

the points up there. So obviously, I need the constant

in order to get a proper model. And in general, there is

an offset depending on the values of these variables. And the offset is compensated

for by the threshold. That’s why we need the threshold

for linear regression. What is the second question? MODERATOR: In the binary case, when

you use y as +1 or -1, why does that just work? PROFESSOR: Well, if you apply

linear regression, you have the following guarantee at the end. The hypothesis you have has the

least squared error from the targets on the examples. That’s what has been achieved by the

linear regression algorithm. Now the outputs of the examples

being +1 or -1, we can put that together with

the first statement. And then we realize that the output

of my hypothesis is closest to the value +1 or -1 with

a mean squared error. The leap of faith is that, if you are

close to +1 versus -1, then the chances are when you are close to

+1, you are at least positive. And when you are close to -1,

you are at least negative. If you accept that leap of faith, then

the conclusion is that, when you take the threshold of the value of the signal

from linear regression, you will get the classification right

because positive will give you +1. Negative will give you -1. This is not quite the case, because in

the attempt to numerically replicate all the points, the signal for linear

regression can become– let’s say as I mentioned, +7 for some points

and -7 for another point. And the linear regression is trying to

push the w, which is what will end up being the boundary, in order to

capture that numerical value. So in attempting to fit stuff that is

irrelevant to the classification, it may mess up the classification. And that’s why the suggestion is, don’t

use it as a final thing for classification. Just use it as an initial weight, and

then use a proper classification, something as simple as the pocket

algorithm, in order to fine-tune it further in order to get the classification

part, without having to suffer from the numerical angle. MODERATOR: So also on that, does it

make a difference what you use? +1, -1, or something else? PROFESSOR: OK. If it’s plus something and minus the same

thing, it’s a matter of scale. If it’s plus and minus, and not

symmetric, it will be absorbed in the threshold. So it really doesn’t matter. It will just make things

look different. MODERATOR: Regarding the first part of the lecture, how

do you usually come up with features? PROFESSOR: OK. The best approach is to look at the

raw input, and look at the problem statement, and then try to infer what would be a meaningful feature for this problem. For example, the case where I talked

about the years in residence. It does make sense to derive some

features that are closer to the linear dependency. There is no general algorithm

for getting features. This is the part where you work

with the problem, and you try to represent the input in a better way. And the only catch is, if you look

at the data in order to try to derive the features, there is a problem there that

will become apparent when we come to the theory. But the bottom line is that, if you don’t

look at the data, and you study the problem and derive features based

on that, that will almost always be helpful if you don’t have

too many of them. If you have too many of them, it

starts becoming a problem. But to first order, usually when I get a problem, I look at the data. And I can probably think of fewer than a dozen variables that will be helpful. And I put all of them in. And usually, a dozen variables in this case doesn't increase the dimensionality of the input space by much. These are big problems. So I don't suffer much from

the generalization issue. MODERATOR: So added to that,

a short clarification– so the nonlinear transformations– they become features? PROFESSOR: Yeah. We are going to use the word feature. There's a feature space, which is called Z. And anything you get by taking the input and transforming it into something else will be called a feature. And features of features

will also be features. So if you take for example the

classification of the digits, we had the pixel values. That’s the raw input. And then we had the symmetry

and the intensity. These were features. If you go further and find nonlinear

transformations of those, these will also be called features. A feature is any higher-level

representation of a raw input. MODERATOR: Another question is: how
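For the digits example, here is a minimal sketch of two such features. The exact definitions are illustrative assumptions, not necessarily the ones used in the course: intensity as the average pixel value, and symmetry as the negative average difference between an image and its left-right mirror.

```python
import numpy as np

def intensity(img):
    """Average pixel value of the image."""
    return img.mean()

def symmetry(img):
    """Negative mean absolute difference between the image and its mirror:
    0 for a perfectly left-right symmetric image, more negative otherwise."""
    return -np.abs(img - np.fliplr(img)).mean()

# Raw input: a 16x16 grayscale "digit"; the feature map takes 256 raw
# numbers down to just 2 higher-level ones.
img = np.zeros((16, 16))
img[2:14, 7:9] = 1.0                  # a crude centered vertical stroke
features = np.array([intensity(img), symmetry(img)])
```

The point is the dimensionality reduction: a 256-dimensional raw input becomes a 2-dimensional feature vector that still carries what matters for the classification.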

does this analysis change if we cannot assume that the data are independent? PROFESSOR: I'm not clear

about the question. So there is really– I think I get it. Probably when we get the inputs, the

question is independence versus dependence. And the independence was used in

getting the generalization bound. That’s probably the direction

of the question. The independence was from one

data point to another. So I have N inputs. And I want these guys to be generated

independently, according to a probability distribution. If they were originally independent,

and I transformed one of them and transformed the other, the independence

is inherited. There is no question of independence

between coordinates of the same input. The independence was a question

of the independence between the different inputs. MODERATOR: So the different inputs. PROFESSOR: Different

input points. MODERATOR: So another question is, are

there methods that use different hyperplanes and intersections

of them to separate data? PROFESSOR: Correct. The linear model that we have described

is the building block of so many models in machine learning. You will find that if you take a linear

model with a soft threshold, not the hard-threshold version, and you

put a bunch of them together, you will get a neural network. If you take the linear model, and you try

to pick the separating boundary in a principled way, you get

support vector machines. If you take the nonlinear

transformation, and you try to find a computationally efficient way of doing

it, you get kernel methods. So there are lots of methods within

machine learning that build on the linear model. The linear model is somewhat underutilized. It's not glorious, but it does the job. The interesting thing is that if you

have a problem, there is a very good chance that if you take a simple linear

model, you will be able to achieve what you want. You may not be able to brag about it. But you are going to get the job done. And obviously, the other models will

give you incremental performance in some cases. MODERATOR: So a question, getting
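As a glimpse of that building-block role, here is a sketch of the three uses of the same linear signal. The soft threshold shown is the standard logistic function, which is an assumption ahead of the lectures that cover it: a hard threshold gives classification, no threshold gives regression, and a soft threshold gives a probability-like output — the unit that neural networks stack.

```python
import math

def signal(w, x):
    """The linear signal s = w . x shared by all three linear models."""
    return sum(wi * xi for wi, xi in zip(w, x))

def hard_threshold(s):
    """Classification: the sign of the signal."""
    return 1 if s > 0 else -1

def soft_threshold(s):
    """Logistic soft threshold in (0, 1); stacking such units gives
    a neural network."""
    return 1.0 / (1.0 + math.exp(-s))

w = [0.5, -1.0, 2.0]
x = [1.0, 0.2, 0.4]        # first coordinate is the constant 1 for the bias
s = signal(w, x)           # 0.5 - 0.2 + 0.8 = 1.1
```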

a little bit ahead– how do you assess the quality of E_in

and E_out systematically? PROFESSOR: This is

a theoretical question. E_in is very simple: I can assess it by just computing its value at any given point. And this is what makes the algorithm

able to pick the best in-sample hypothesis, by picking the one that

has the smallest in-sample error. The out-of-sample error,

I don’t have access to. There will be some methods described

after the theory that will give us an explicit estimate of the

out-of-sample error. But in general, I rely on the theory

that guarantees that the in-sample error tracks the out-of-sample error,

in order to go all out for the in-sample error, and hope that the

out-of-sample error follows, which we have seen in the graph when we were

looking at the evolution of the perceptron. And the in-sample error

was going down and up. And the out-of-sample error was also

going down and up, albeit with a discrepancy between the two. But they were tracking each other. MODERATOR: So here’s a question
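That tracking can be watched directly in simulation, where — unlike with real data — we can approximate E_out on a large fresh sample from the same distribution. This is a sketch with made-up data and an arbitrary noise rate:

```python
import numpy as np

rng = np.random.default_rng(1)
w_true = np.array([0.2, 1.0, -1.0])

def sample(n):
    """Draw n points i.i.d. from the input distribution, with noisy labels."""
    X = np.hstack([np.ones((n, 1)), rng.uniform(-1, 1, (n, 2))])
    y = np.sign(X @ w_true)
    y[rng.random(n) < 0.1] *= -1      # 10% label noise
    return X, y

def error(w, X, y):
    """Fraction of misclassified points."""
    return np.mean(np.sign(X @ w) != y)

X_in, y_in = sample(100)              # the training sample -> E_in
X_out, y_out = sample(100_000)        # large fresh sample -> approximates E_out

w = np.linalg.pinv(X_in) @ y_in       # linear regression, then threshold
E_in = error(w, X_in, y_in)
E_out = error(w, X_out, y_out)
```

The two error values will not coincide, but with N = 100 the Hoeffding-style guarantee makes a large discrepancy unlikely — the tracking behavior seen in the perceptron plots.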

that’s kind of a confusion. If you want to fit a polynomial, is

this still a linear regression case? PROFESSOR: Correct. Because right now, let’s say we have

a single input variable, x, like the case I gave. So you have x and y. Now you have a line. If you use the nonlinear transformation,

you can transform this x to x, x squared, x cubed, x to the

fourth, x to the fifth, and then fit a line to the new space. And a line in the new space will be

a polynomial in the old space. So this is covered through the

nonlinear transformation. MODERATOR: What is the relation
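A minimal sketch of that idea, with made-up data: build the transformed matrix Z with columns 1, x, x², x³, and run the same one-step least-squares solution. The resulting line in Z-space is a cubic in x-space, and with noiseless data from a cubic target the fit recovers the coefficients exactly.

```python
import numpy as np

rng = np.random.default_rng(2)

# One-dimensional inputs and a cubic target: y = 1 - 2x + 0.5x^3.
x = rng.uniform(-1, 1, 50)
y = 1.0 - 2.0 * x + 0.5 * x**3

# Nonlinear transformation: z = (1, x, x^2, x^3).
Z = np.column_stack([np.ones_like(x), x, x**2, x**3])

# Plain linear regression in Z-space -> a polynomial fit in x-space.
w = np.linalg.pinv(Z) @ y
```

Nothing about the algorithm changed; only the space it runs in did.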

between linear regression least squares and maximum likelihood estimation? PROFESSOR: OK. When you look at linear regression in

the statistics literature, there are many more assumptions about the

probabilities and what the noise is. And you can get actually

more results about it. Under certain conditions, you can

relate it to maximum likelihood. You can say Gaussian noise goes with the squared error. And in this case, minimizing the squared error will

correspond to maximum likelihood. So there is a relationship. On the other hand, I prefer to give the

linear regression in the context of machine learning, without making too

many assumptions about distributions and whatnot, because I want it to be

applied to a general situation rather than applied to a particular

situation. As a result of that, I will be able to

say less in terms of what is the probability of being right or wrong. I just have the generalization from

in-sample to out-of-sample. But that suffices for most machine learning situations. So there is a relationship. And it's studied fairly well

in other disciplines. But it is not of particular

interest to the line of logic that I'm following. MODERATOR: So a popular question is: can you give at least a set of commonly used nonlinear transformations? PROFESSOR: There will be many. When we get to support vector

machines, we will be dealing with a number of transformations, some of

them polynomials like the ones that were mentioned. One of the useful ones is referred

to as radial basis functions. We will talk about that as well. So there will be transformations. And the main point is to be able to

understand what you can and what you cannot do, in terms of jeopardizing the

generalization performance by taking a nonlinear transformation. So after we are done with that theory,

we will have a significant level of freedom of choosing what nonlinear

transform we use. And we’ll have some guidelines of some

of the famous nonlinear transforms. So this is coming up. MODERATOR: I think you already
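As a preview of radial basis functions, here is a sketch in which the centers and the width parameter are arbitrary illustrative choices, not prescriptions: each RBF feature measures closeness to a center, exp(-γ‖x − μ_k‖²), so the transform maps an input x to one coordinate per center.

```python
import numpy as np

def rbf_features(x, centers, gamma=1.0):
    """Map x to the features exp(-gamma * ||x - mu_k||^2), one per center."""
    d2 = ((x - centers) ** 2).sum(axis=1)   # squared distance to each center
    return np.exp(-gamma * d2)

centers = np.array([[0.0, 0.0],             # arbitrary example centers
                    [1.0, 1.0],
                    [-1.0, 1.0]])
z = rbf_features(np.array([0.0, 0.0]), centers)
```

A linear model in z-space then draws boundaries in x-space that bend around the centers — one of the famous nonlinear transforms alluded to above.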

answered this question last time. But again, someone asks, is it

impossible for machine learning to find a pattern of a pseudo-random

number generator? PROFESSOR: Well, if it’s pseudo

random, then in principle, if you get the seed, you can produce it. But the way it’s usually used is you use

a pseudo-random number, and then you take a few bits and have them as

an output for different inputs. So just looking at the inputs and trying

to decipher it– it’s next to impossible. So it’s a practical question. Philosophically, yes you can. Practically, it looks random

for all intents and purposes. MODERATOR: So what are the different

treatments for continuous responses versus discrete responses in I guess– PROFESSOR: Yeah. Obviously, this is dictated

by the problem. If someone comes, and they want to

approve credit, etc, I’m going to use the classification hypothesis set. If someone wants to get a credit line or

something else, then I will have to use regression. So it really is dependent

on the problem. And the funny part is that real numbers

look more sophisticated. Yet the algorithm that goes with them,

which is linear regression, is much easier than the other one. The reason is that the other

one is combinatorial. And combinatorial optimization is

pretty difficult in general. So the answer to the question is that it

depends on the target function that the person is coming up with. And when there is cross fertilization

between the techniques, it’s just a way to use an analytic

advantage from one method to give the other one a jump start, or to give

it a reasonable solution. But it’s a computational question. The distinction is really in the

problem statement itself. MODERATOR: Can you say what makes

a nonlinear transformation good? PROFESSOR: OK. I will be able to talk about this

a little bit more intelligently after the theory. I would like to emphasize that the

theory part will be very important in giving us all the tools to talk, with

authority, about all the issues that are being raised. So there is a reason for including the

theory before we go into more details. This lecture was meant to give you a few standard tools that, as they stand now, you can already use for many applications and many data sets, because now you can deal with non-separable data. You can deal with real-valued data. And you can even deal with some

nonlinear situations. So it's just a toolbox for you to get your hands dirty. And then things will become

more principled when we develop more material. MODERATOR: Yeah, I think that’s it. PROFESSOR: OK, that’s it. We will see you on Thursday.
