Lecture 03 - The Linear Model I

ANNOUNCER: The following program is brought to you by Caltech.

YASER ABU-MOSTAFA: Welcome back. Last time, we discussed the
feasibility of learning. And we realized that learning is
indeed feasible, but only in a probabilistic sense. And we modeled that probabilistic sense
in terms of a bin, which we mapped to the
out-of-sample performance, the performance we don’t know. And in order to be able to tell what E_out
of h is, where h is the hypothesis that corresponds to that particular bin, we look at the in-sample. And we realize that the in-sample tracks
the out-of-sample well through the mathematical relationship which is
the Hoeffding Inequality, that tells us that the probability that E_in
deviates from E_out by more than our specified tolerance is a small number. And that small number is a negative
exponential in N. So the bigger the sample, the
more reliable that E_in will track E_out well. That was the basic building block. But then we realized that this
applies to a single bin. And a single bin corresponds
to a single hypothesis. So now we go for a case where we
have a full model, h_1 up to h_M. And we take the simple case of
a finite hypothesis set. And we ask ourselves, what
would apply in this case? We realized that the problem with having
multiple hypotheses is that the probability of something bad
happening could accumulate. Because if there is a 0.5% chance that
the first hypothesis is bad, in the sense of bad generalization, and 0.5% for the second one, we could
be so unlucky as to have these 0.5% accumulate, and end up with a significant
probability that one of the hypotheses will be bad. And when one of the hypotheses will be
bad, if we are further unlucky, and this is the hypothesis we pick as our
final hypothesis, then E_in will not track E_out for the hypothesis
we pick. So we need to accommodate the case where
we have multiple hypotheses. And the argument was extremely simple. g is our notation for the
final hypothesis. It is one of these guys that
the algorithm will choose. Well, the probability that E_in doesn’t
track E_out will obviously be included in the fact that E_in for h_1
doesn’t track the out-of-sample for that one, or E_in for h_2 doesn’t track,
or E_in of h_M doesn’t track. The reason is very simple. g is one of the guys. If something bad happens with g, it must
happen with one of these guys at least, the one that was picked. So we can always say that this implies
these things, which is this or this or this or this or this. And after that, we apply a very simple
mathematical rule, which is the union bound. The probability of an event or another
event or another event is at most the sum of the probabilities. That rule applies regardless of the
correlation between these events, because it takes the worst-case scenario. If all the bad events happen
disjointly, then you add up the probabilities. If there is some correlation,
and they overlap, you will get a smaller number. In all of the cases, the probability of
this big event will be less than or equal to the sum of the individual
probabilities. And this is useful because in the coin
flipping case, which started this argument, the events are independent. In the case of the hypotheses of
a model, the events may not be independent, because we
have the same sample. And we are only changing
the hypotheses. So it could be that the deviation here
is related to the deviation here. But the union bound doesn’t care. Regardless of such correlations, you
will be able to get a bound on the probability of this event. And therefore, you will be able to bound
the probability that you care about, which has to do with the
generalization, in terms of the individual Hoeffding bounds applied to each of those. And since you have M of them,
you have an added M factor. So the final answer is that the
probability of something bad happening after learning is less than or equal to
this quantity, which is a helpful small quantity, times M. And we realize
that now we have a problem because if you use a bigger hypothesis
set, M will be bigger. And therefore, the right-hand side here
will become bigger and bigger when you add the M. And therefore, at
some point, it will even become meaningless. And we are not even worried yet about
M being infinity, which will be true for many hypothesis sets,
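
A rough numerical sketch of the bound under discussion. It assumes the standard two-sided Hoeffding form, 2 exp(-2 epsilon^2 N), multiplied by M through the union bound; the function name and the sample values of epsilon, N, and M are illustrative choices, not taken from the lecture.

```python
import math

def union_hoeffding_bound(epsilon, n_samples, n_hypotheses=1):
    """Upper bound on P[|E_in(g) - E_out(g)| > epsilon] for a finite hypothesis set.

    Assumes the standard two-sided Hoeffding form, times M via the union bound.
    """
    return 2 * n_hypotheses * math.exp(-2 * epsilon ** 2 * n_samples)

# Illustrative values (assumptions): tolerance 0.05, sample size 1000.
for m in (1, 10, 100, 10_000):
    bound = union_hoeffding_bound(0.05, 1000, m)
    verdict = "informative" if bound < 1 else "vacuous, since any probability is at most 1"
    print(f"M = {m:>6}: bound = {bound:.4f} ({verdict})")
```

With these particular numbers the bound stops saying anything useful somewhere between M = 10 and M = 100, which is the sense in which a large M makes the right-hand side meaningless.
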
in which case, this is totally meaningless. However, we weren’t establishing
the final result in learning. We were establishing the principle
that, through learning, you can generalize. And we have established that. It will take us a couple of weeks to
get from that to the ability to say that a general learning model, an infinite
one, will generalize. And we will get the bound on generalization. That’s what the theory of generalization
will address. So today the subject is linear models. And as I mentioned at the beginning,
this is out of sequence. If I was following the logical sequence,
I would go immediately to the theory and take M, which takes care
of the finite case, and then generalize it to
the more general case. However, as I mentioned, I decided to
give you something concrete and practical to work with early on. And then we will go back to
the theory after that. The linear model is one of the most
important models in machine learning. And what we are going to
do in this lecture, we’re going to start with a practical
data set that we are going to use over and over in this class. And then, if you remember the perceptron
that we introduced in the first lecture, the perceptron
is a linear model. So here is the sequence
of the lecture. We are going to take the perceptron and
generalize it to non-separable data. That’s a relief, because we already
admitted that separable data is very rare. And we would like to see what
will happen when we have non-separable data. Then, we are going to generalize this
further to the case where the target function is not a binary classification
function, but a real-valued function. That also is a very important
generalization. And linear regression, as you will
see, is one of the most important techniques that is applied mostly in
statistics and economics, and also in machine learning. Finally, as if we didn’t do enough
generalization already, we are going to take this and generalize
it to a nonlinear case. All in a day’s work,
all in one lecture. It’s a pretty simple model. And at the end of the lecture, you will
be able to actually deal with very general situations. And you may ask yourself, why am I
calling the lecture Linear Model when I’m going to talk about nonlinear
transformation? Well, you’ll realize that nonlinear
transformation remains within the realm of linear models. That’s not obvious. We will see how that materializes. So that’s the plan. Now, let’s look at a real data set that
we are going to use, and will be available to you to try
different ideas on. And it’s very important to try
your ideas on real data. Regardless of how sure you are when
you have a toy data set that you generate, you should always go for
real data sets and see how the system that you thought of performs
in reality. So here is the data set. It comes from ZIP codes
in the post office. So people write the ZIP code. And you extract individual characters,
individual digits. And you would like to take the image,
which happens to be 16 by 16 gray level pixels, and be able to decipher
what is the number in it. Well, that looks easy except that
people write digits in so many different ways. And if you look at it, there will
be some cases like this fellow. Is this a 1 or a 7? Is this a 0 or an 8? So you can see that there
is a problem. And indeed, if you get a human operator
to actually read these things and classify them, they will probably
be making an error of about 2.5%. And we would like to see if machine
learning can at least equal that, which means that we can automate the
process, or maybe beat that. So this is a data set that we
are going to work with. Let’s look at it a little bit more
closely to see how we input it to our algorithm. We have one algorithm so far, which is
the perceptron learning algorithm. We are going to try on this. And then we are going to generalize
it a little bit. The first item is the question
of input representation. What do I mean? This is your input, the raw
input, if you will. Now this is 16 pixels by 16 pixels. So there are 256 real numbers
in that input. If you look at the raw input x, this
would be x_1, x_2, x_3, dot, dot, dot, dot, and x_256. That’s a very long input to encode
such a simple object. And we add our mandatory x_0. Remember, in linear models, we have
this constant coordinate, x_0 equals 1, we add in order to
take care of the threshold. So this will always be in the
background whether we mention it or not. If you take this raw input and try
the perceptron directly on it, you realize that the linear model in this
case, which has a bunch of parameters, has really just too many parameters. It has 257 parameters. If you are working in
a 257-dimensional space, that is a huge space. And the poor algorithm is trying to
simultaneously determine the values of all of these w’s based on your set. So the idea of input representation is
to simplify the algorithm’s life. We know something about the problem. We know that it’s not really
the individual pixels that matter. You can probably extract some features
from the input, and then give those to the learning algorithm and let
the learning algorithm figure out the pattern. So this gives us the idea of features. What are features? Well, you extract the
useful information. And as a suggestion, very simple one,
let’s say that in this particular case, instead of giving the raw input
with all of the pixel values, you extract some descriptors of
what the image is like. For instance, you look at this. Depending on whether this is the digit
8 or the digit 1, et cetera, there is a question of the intensity,
average intensity. 1 doesn’t have too many black pixels. 8 has a lot. 5 has some. So if you simply add up the intensity
of all the pixels, you probably will get a number that
is related to the identity. It doesn’t uniquely determine
it, but it’s related. It’s a higher-level representation
of the raw information there. The same goes for symmetry: if you think of the digit
1, 1 will be symmetric. If you flip it upside down, or you flip
it right and left, you will get something that overlaps
significantly with it. So you can also define a symmetry
measure, which means that you take the symmetric difference between something
and its flipped versions, and you see what you get. If something is symmetric, things will
cancel because it’s symmetric. You’ll get a very small value. And if something is not symmetric, let’s
say like the 5, you will get lots of values in the symmetric
difference. And you will get a high
value for that. So what you are measuring
is the anti-symmetry. You take the negative of that,
and you get the symmetry. So you get another guy,
which is the symmetry. So now, x_1 is the intensity variable, x_2 is the symmetry variable. Now admittedly, you have lost
information in that process. But the chances are you lost as much
irrelevant information as relevant information. So this is a pretty good representation
of the input, as far as the learning algorithm is concerned. And you went from 257-dimensional
to 3-dimensional. That’s a pretty good situation. And you probably realize that having
257 parameters is bad news for generalization, if you extrapolate
from what we said. Having 3 is a much better situation. So this is what we are
going to work with. When you take the linear model
in this case, you just have w_0, w_1, and w_2. And that’s what the perceptron algorithm,
for example, needs to determine. Now let’s look at the illustration
of these features. You have these as your inputs. And x_1 is the intensity, x_2 is the symmetry. What do they look like? They look like this. This is a scatter diagram. Every point here is a data point. It’s one of the digits, one
of the images you have. And I’m taking the simple case of just
distinguishing the 1’s from the 5’s. So I’m only taking digits
that are 1’s or 5’s. And you can always take other
digits versus each other, and then combine the decision. If you can solve this unit problem,
you can generalize it to the other problem. So when you put all the 1’s and all the
5’s in a scatter diagram, you realize for example that the intensity on the
5’s is usually more than the intensity on the 1’s. There are more pixels occupied
by the 5’s than the 1’s. This is the coordinate
which is the intensity. And indeed, the red guys, which happen
to be the 5’s, are tilted a little bit more to the right,
corresponding to the intensity. If you look at the other coordinate,
which is symmetry, the 1 is often more symmetric than the 5. Therefore, the guys that happen to be
the 1’s, that are the blue, tend to be higher on the vertical coordinate. And just by these two coordinates, you
already see that this is almost linearly separable. Not quite, but it’s separable enough
that if you pass a boundary here, you will be getting
most of them right. Now you realize that it’s impossible
really to ask to get all of them right because, believe it or not, this fellow
is a 5, at least meant to be a 5 by the guy who wrote it. So we have to accept the fact that
there will be stuff that is completely undoable. And we will accept an error. It’s not a zero error. But hopefully, it’s a small error. So this is what the features
look like. Now what does the perceptron
learning algorithm do? What it does is this complicated figure,
which takes the evolution of E_in and E_out as a function
of iteration. When you apply the perceptron learning
algorithm, you apply it only to E_in. E_in is the only value you have. E_out is sitting out there. We don’t know what it is. We just hope that E_in tracks it well. Let’s look at the figure. These are the iteration numbers. So this is the first misclassified
example. You go and apply the perceptron learning
algorithm again and again, 1000 times. As you do that, E_in, which is the
green curve, will go down and sometimes will go up. We realize that the perceptron learning
algorithm takes care of one point at a time, and therefore may mess
up other points while it’s taking care of a point. So in general, it can go up or down. But the bad news here is that the
data is not linearly separable. And we made the remark that the
perceptron learning algorithm behaves very badly when the data is
not linearly separable. It can go from something pretty
good to something pretty bad, in just one iteration. So this is a very typical behavior of
the perceptron learning algorithm. Because the data is not linearly
separable, the perceptron learning algorithm will never converge. So what do we do? We force it to terminate
at iteration 1000. That is, we stop at 1000 and take
whatever weight vector we have. And we call this g, the final
hypothesis of the perceptron learning algorithm. Now we obviously look at this, and we
say, if I only took this guy. This is a better guy than the other. But you know, you’re just applying
the algorithm and cutting it off. Now, one of the things you observe
from here, I plotted E_out. You’re not going to be able to plot E_out
in a real problem that you deal with, because E_out depends on
an unknown target function. You may be able to estimate it using
some test examples. But all you need to know here is that
E_out is drawn here for illustration, just to tell you what is happening
in reality as you work on the in-sample error. And in this case, you find that E_out
actually tracks the E_in pretty well. There is a difference. So if we go from here to here,
that’s our epsilon. It’s a big epsilon. But the good news is
that it tracks it. When this goes down, this goes down. When this goes up, this goes up. So if you make your decision based on E_in,
the decision based on E_out will also be good. That’s good for generalization. And that is one of the advantages of
something as simple as the perceptron learning algorithm. It doesn’t have too many parameters. And because of our efforts in getting
only three features, it has just three parameters now. So the chances are that it will generalize
well, which it does. Now what does the final
boundary look like? What you saw here is only an illustration of the evolution. Eventually, you end up
with a hypothesis. The hypothesis would separate the points
in the scatter diagram you saw. So what does it look like? Well, it looks like this. This is your boundary. This is the final hypothesis, that
corresponds to the hypothesis you got at the final iteration. Well, it’s OK, but definitely
not good. It’s too deep into the blue region. You would have been better
off doing this. And the chances are maybe earlier
guys that had better in-sample error will do that. But that’s what you have to
live with, if you apply the perceptron learning algorithm. So now we go and try to modify the
perceptron learning algorithm in a very simple way, that is the simplest
modification you can ever imagine. So let’s see what happens. This is what the PLA did, right? And when we looked at it, we said:
if we only could keep this value. Well, this value is not a mystery. It happened in your algorithm. You can measure it explicitly. It’s an in-sample error. And you know that it’s better than
the value you ended up with. So in spite of the fact that you’re
doing these iterations according to the prescribed perceptron learning
algorithm rule– modify the weights according to one misclassified point–
you can keep track of the total in-sample error of the intermediate
hypothesis you got. Right? And only keep the guy that happens
to be the best throughout. So you’re going to continue
as if it’s really the perceptron learning algorithm. But when you are at the end, you keep
this guy and report it as the final hypothesis. What an ingenious idea! Now the reason the algorithm is called
the pocket algorithm is because the whole idea is to put the best solution
so far in your pocket. And when you get a better one, you take
the better one, put it in your pocket, and throw the old one away. And when you are done, report
the guy in your pocket. We can do that. What does this diagram look like,
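
The pocket idea can be sketched in a few lines. This is a minimal illustration, not the lecture's implementation: the function names, the choice to treat a signal of exactly 0 as -1, the random choice of misclassified point, and the five-point data set are all assumptions made for the example.

```python
import random

def predict(w, x):
    """Perceptron output: sign of the signal w . x (a signal of 0 is treated as -1)."""
    signal = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if signal > 0 else -1

def in_sample_error(w, X, y):
    """Fraction of training points that w misclassifies."""
    return sum(predict(w, x) != t for x, t in zip(X, y)) / len(X)

def pocket(X, y, max_iters=1000, seed=0):
    """Run PLA updates, but keep the best weights seen so far 'in the pocket'."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    pocket_w, pocket_err = list(w), in_sample_error(w, X, y)
    for _ in range(max_iters):
        misclassified = [(x, t) for x, t in zip(X, y) if predict(w, x) != t]
        if not misclassified:
            break  # zero error: the current w is already in the pocket
        x, t = rng.choice(misclassified)
        w = [wi + t * xi for wi, xi in zip(w, x)]  # ordinary PLA update
        err = in_sample_error(w, X, y)
        if err < pocket_err:                       # strictly better: replace the pocket
            pocket_w, pocket_err = list(w), err
    return pocket_w, pocket_err

# Hypothetical non-separable data; each x starts with the constant x_0 = 1.
X = [[1.0, 1.0, 1.0], [1.0, 2.0, 2.0], [1.0, -1.0, -1.0],
     [1.0, -2.0, -2.0], [1.0, 1.5, 1.5]]
y = [1, 1, -1, -1, -1]
best_w, best_err = pocket(X, y)
print(best_w, best_err)
```

Evaluating the in-sample error after every update costs a full pass over the data, which plain PLA does not need; that is the price of being able to report the best intermediate hypothesis.
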
when you are looking at the pocket algorithm? Much better. You can look at these values, and
it is the best value so far. Here, we went down. And here, we indeed went down. Here, we went up. You see this green thing? Here, we didn’t, because the good guy is
in our pocket and that’s what we’re reporting the value for. And we continued with it
until we dropped again. And we dropped again. And we never changed that, because
there was never a better guy than this guy. So when we come to iteration
1000, we have this fellow. Now when you do that, you can use
the perceptron learning algorithm with non-separable data, terminate it by
force at some iteration, and report the pocket value. And that will be your
pocket algorithm. And if you look at the classification
boundary, PLA versus pocket, this is what we had with the perceptron
learning algorithm. We complained a little bit that it’s
too deep in the blue region. And when you look at the other guy,
which is the pocket algorithm, it looks better. It actually does what we
thought it would do. It separates them better. Still, obviously, it cannot
separate them perfectly. Nothing can, because they are
not linearly separable. On the other hand, this is
a good hypothesis to report. So with this very simple algorithm,
you can actually deal with general inseparable data, but inseparable
data in the sense that it’s basically separable. However, a few points really are hopeless: this guy is bad, and this guy is bad. There’s nothing we can do about them. But there are few, so we will
just settle for this. We’ll see that there are other cases
of inseparable data that is truly inseparable, in which we have to do
something a little bit more drastic. So that’s as far as the classification
is concerned. Now we go to linear regression. The word regression simply
means real-valued output. There is absolutely no other
connotation to it. It’s a glorified way of saying
my output is real-valued. And it comes from earlier
work in statistics. And there’s so much work on it that
people could not get rid of that term. And it is now the standard term. Whenever you have a real-valued
function, you call it a regression problem. So, with that out of the way: linear regression is
used incredibly often in statistics and economics. Every time you say: are these
variables related to that variable, the first thing that comes to
mind is linear regression. Let me give an example. Let’s say that you would like to relate
your performance in different types of courses, to your
future earnings. This is what you do. You look at the courses you took: math, science, engineering,
humanities, physical education, other. And you get your GPA in each of them. So here, I got 3.5. Here, I got 3.8. Here, I got 3.2. Here, I got 2.8. 2.8? No, no. That doesn’t happen at Caltech! You go for the other one, et cetera. So you just have the GPAs for the
different groups of courses. Now, say someone graduates. I’m going to look 10 years
after graduation, and see their annual income. So the inputs are the GPA’s in the
courses at the time they graduated. The output is how much money they
make per year 10 years away from graduation. Now you ask yourself: how do these
things affect the output? So apply linear regression, as you will
see it in detail, and you finally find maybe the math and sciences
are more important. Or maybe all of that is an illusion. It was actually the humanities
that are important. You don’t know. You will see the data, and the data
will tell you what affects what. And any other situation like that,
people simply resort to linear regression. So in order to build it up, we are going
to use the credit example again, in order to be able to contrast it with
the classification problem we have seen before. What do we have? We have in the classification– we have the credit approval,
yes or no. That’s a classification function, binary
function, which says the output is +1 or -1. In the case of regression, we will
have a real-valued function. And the interpretation in this case is
that you’re trying to predict the proper credit line for a customer. The customer applies. And it’s not a question of approving
the credit or not. Do you give them a credit limit of $800
or $1,200 or $30,000 or what, depending on their input? So this is a real-valued function. And we are going to apply regression. Now you take the input. This is the same input as we had before,
data from the applicant that are related to the credit behavior,
so the age, the salary. I suspect that the salary will figure
very significantly now when you’re trying to tell the credit line, because
if someone is making 30,000 a year, you probably are not going to give
them a credit line of 200,000. So you can see that this will
probably be affected. And there are other guys that merely
have to do with the stability of the person. Years in residence. If the person has been in the same
residence for 10 years, they are unlikely to skip town. On the other hand, if they have been
there for only one month, well, you don’t know– that type of thing. So you have these variables. You encode them as the input x. And then your output in this case, which
is the linear regression output, is a hypothesis form which takes
this particular form. Let’s spend some time with
it to understand it. First, it’s regression because
the output is real. It’s linear regression because the form,
in terms of the input, is linear. Now, we have seen this before. We sum up from basically 1 to d. These are the genuine inputs,
the weighted version of the input variables. And then we add the mandatory x_0, which
is 1, which takes care of the threshold, which is w_0. This is the form we have seen before,
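
A small sketch of the linear form, showing the same signal used two ways: thresholded for classification and left as a real number for regression. The weights, the applicant's values, and the function names are hypothetical, chosen only so the arithmetic is easy to follow.

```python
def signal(w, x):
    """Linear signal w_0 * x_0 + w_1 * x_1 + ... + w_d * x_d, with x_0 fixed at 1."""
    return sum(wi * xi for wi, xi in zip(w, [1.0] + list(x)))

def classify(w, x):
    """Perceptron-style use of the signal: keep only its sign (approve or deny)."""
    return 1 if signal(w, x) > 0 else -1

def regress(w, x):
    """Linear regression use of the signal: report it as is (a credit line)."""
    return signal(w, x)

# Hypothetical weights and applicant: age in years, salary in $1000s, years in residence.
w = [-500.0, 10.0, 40.0, 50.0]
applicant = [30, 60, 5]
print(classify(w, applicant), regress(w, applicant))
```

Both functions share `signal`; only the final step differs, which is why both count as linear models.
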
except that when we saw it before, we took this as a signal that
we only care about its sign. If it’s plus, we approve credit. If it’s minus, we don’t
approve credit. And we treated it as a credit
score, per se, when you take out the threshold. Now in this case, this is the output. We don’t threshold it. We don’t say it’s +1 or -1. The w_0 is still in there. But we don’t take it as
+1 or -1. We take it as a real number. And this is the dollar amount
we are going to give you as a credit line. Now the signal here will play
a very important role in all the linear algorithms. This is what makes the
algorithm linear. And whether you leave it alone as in
linear regression, you take a hard threshold as in classification or, as we
will see later, you can take a soft threshold, and you get a probability
and all of that– All of these are considered
linear models. And the algorithm depends on this
particular part, which is the signal being linear. We also took the trouble to
put it in vector form. And the vector form will simplify the
calculus that we do in this lecture in order to derive the linear
regression algorithm. But if you hate
the vector form, you can always go back to this. There is nothing mysterious
about this. This simply has a bunch of parameters,
w_0, w_1, up to w_d. And if I’m trying to minimize something,
you can minimize it with respect to scalar variables, which
applies very primitive calculus. But we obviously will do it in the
shorthand version, which is the vector or the matrix form, in order to
be able to get the derivation in an easier way. So that’s the problem. What is the data set in this case? Well, it’s historical data, but it’s
a different set of historical data. The credit line is decided
by different officers. Someone sits down and evaluates your
application and decides that this person gets 1000 limit, this person
gets 5000 limit, and whatnot. All we are trying to do in this
particular example is to replicate what they’re doing. We don’t want the credit
officer to do that. The credit officers sometimes are
inconsistent with one another. They may have a good day or a bad day. So we’d like to figure out what pattern
they collectively have in deciding the credit, and have
an automated system decide that. That’s what the linear regression
system will do for us. The historical data here are again
examples from previous customers. And the previous customers–
this is x_1, and this is y_1. So this is the application
that the customer gave. And this is the credit line
that was given to them. No tracking of credit behavior, we’re
just trying to replicate what the experts do in this case. And then you realize that each of these
y’s is actually a real number, which is the credit line that
is given to customer x_n. And that real number will likely
be a positive integer. It’s a credit line. It’s a dollar amount. And what we are doing is trying
to replicate that. That’s the statement of the problem. So what does linear regression do? First, we have to measure the error. We didn’t talk about that in
the case of classification, because it was so simple. Here, it’s a little bit less simple. And then, we’ll be able to discuss
the error function for classification as well. What do we mean by that? You will have an algorithm that tries
to find the optimal weights. These are the weights you’re
going to have. These weights are going to determine
what hypothesis you get. Some hypotheses will
approximate f well. Some hypotheses will not. We would like to quantify that, to give
guidance to the algorithm in order to move from one hypothesis
to another. So we will define an error measure. And the algorithm will try to minimize
the error measure by moving from one hypothesis to the next. If you take linear regression, the
standard error function used there is the squared error. Let me write it down. Well, if you had a classification, there
is only a simple agreement on a particular example. You either got it right
or got it wrong. There is nothing else. Therefore, in that case, we
just defined binary error. Did you get it right or wrong? And we found the frequency
of getting it wrong. And we got the E_in and E_out. Here, you are estimating
a credit line. So if the guy gets 1000, and you tell
them 900, that’s not too bad. If the guy gets 1000, and you
tell them 5000, that’s bad. So you need to measure how
bad the situation is. And you define an error measure,
and you define it by the simple squared error. Now, squared error doesn’t have
an inherent merit here. It just happens to be the standard
error function used with linear regression. And its merit really is the simplicity
in the analytic solution that we are going to get. But when we discuss error measures in
the next lecture, we will go back to the principle, does error
measure matter? Why? How do we choose it? Et cetera. This will be answered in
a principled way next time. But for this time, let’s take this
as a standard error measure we are going to use. When you look at the in-sample error,
you use the error measure. On the particular example n,
n from 1 to N. For each example, this is the contribution
of the error. Each of these is affected by the
same w, because h depends on w. So as you change w, this value will
change for every example. And this is the error in that example. And if you want to get all the in-sample
error, you simply take the average of those. That will give me a snapshot
of how my hypothesis is doing on the data set. And now, we are going to ask our
algorithm to take this error and minimize it. Let’s actually just look at what
happens as an illustration. This is the simplest case
for linear regression. The input is one-dimensional. I
have only one relevant variable. I want to relate your overall GPA to
your earnings 10 years from now. Your overall GPA is x. Your earnings 10 years from now is y. That’s it. OK? [CHUCKLES] I would have properly called this
x_1 according to our notation. And then there would be an x_0,
which is the constant 1. But I didn’t bother, because
I have only one variable. But this is what we have. So you look at this. And you see that, for different
x’s, you have these guys. Wow. Your earnings are going down with– Well, that may not have been the
example that is drawn here. What linear regression does is it tries
to produce a line, which is what you have here, that tries to fit
this data according to the squared-error rule. So it may look like this. And in this case, the threshold
here depends on w_0. The slope depends on w_1, which
is the weight for x. And that is the solution you have. Now you didn’t get it right, but
what you got is some errors. And you realize that– this is
the error on the first example. This is the error on
the second example. And if you sum up the squares of the
lengths of these bars and average over the examples, that is what we called the in-sample error that we defined
in the previous viewgraph. Well, linear regression can apply
to more than one dimension. And I can plot 2 dimensions here
just to illustrate it. It’s the same principle. What you have here is you have x_1. If I can get the pointer– OK, we’ll leave it to rest. We have x_1 and x_2. And in this case, the linear
thing is really a plane. And you’re again not separating, but
trying to estimate these guys. And you’re making errors. And in general, when you go to
a higher-dimensional space, the line– which is the reason why we call it
linear– is not really a line. It’s a hyperplane, one dimension short
of the space you are working with. And that’s what you are trying to
use to approximate the guys. Now let’s look at the
expression for E_in. And that is the analytic expression
we are going to try to minimize. And that will make us derive the
linear regression algorithm. We wrote this before. And you have the value of the
hypothesis minus y_n squared. That is because it’s a squared error. And because it’s linear regression, this
value, h of x_n, happens to be w transposed x_n. It’s a linear function of x_n. Now let us try to write this
down in a vector form. I will explain this in detail. But let’s look at this. Instead of the summation, all of
a sudden, I have a norm squared of something that is– Capital X, I haven’t seen
capital X before. I haven’t seen vector y before. Well, it’s basically a consolidation
of the different x_n’s here. x_n is a vector. So you put the vectors in a matrix. You call it X. And you put the
scalars, the y_n, in a vector. And you call it y. The definition of capital X and
the vector y is as follows. For the matrix X, what you do– you
put your first example here. So this would be the constant coordinate 1,
the first coordinate, second coordinate, up to the d-th
coordinate, the last coordinate. And then you go for the second
example, and do the same and construct this matrix. And for y, you put the
corresponding output. This is the output for the first
example, output for the second example, output for the last example. Now one thing to realize
about the matrix X is that it’s pretty tall. The typical situation is that
you have few parameters. We reduced them to three, for example,
in the case of the classification of the digits. But you usually have many, many
examples, in the 1000’s. So this will be a very,
very long matrix. Now the way you take this– well, the
norm squared will be simply this vector transposed times itself. And when you do it, you realize that
what you are doing is summing up contributions from the
different components. And each component happens to be exactly
what you are having here. So this becomes a shorthand for
writing this expression. Now, let’s look at minimizing E_in. When you look at minimizing, you realize
that the matrix X, which has the inputs of the data, and y, which has
the outputs of the data, are, as far as we are concerned, constants. This is the data set someone gave me. The parameter I’m actually playing
with in order to get a good hypothesis is w. So E_in is a function of w. And w appears here. And the rest are constants. If I do any calculus of minimization,
it is with respect to w. So I try to minimize this. And what you do– you get the derivative
and equate it with 0, except here, it’s a glorified
derivative. You get the gradient, which is
the derivative on a bunch of them all at once. And there is a formula for it, which
is pretty simple in this case. I will explain it. By the way, if you hate this,
and you want to make sure, because linear regression is so
important, and you want to verify that it’s true, you can always go for the
scalar form, get partial E_in by partial every w: partial w_0, partial
w_1, up to partial w_d, get a formula that is a pretty hairy
one, and then try to reduce it. And– surprise, surprise– you will get the
solution here that we have in matrix form in two steps. Now if you look at this, deal with it in
terms of calculus as if it was just a simple square. If this was a simple square, and w
was the variable, what would the derivative be? You will get 2 sitting outside. Well, you’ve got it here. And then you will get the same
thing in a linear form. You got it here. And then you will get whatever constant
was multiplied by w to sit outside, which you got here. You just got here with a transpose,
because this is really not a square. This is the transpose of
this times itself. That’s where you get the transpose. Pretty straightforward and
standard matrix calculus. So that’s what you have. And then you equate this
to 0, but it’s a fat 0. It’s a vector of 0’s. You want all the derivatives
to be 0 all at once. And that will define a point where
this achieves a minimum. Now, you would suspect that the solution
will be simple, because this is a very simple quadratic form. And indeed, the solution is simple. And if you look at it, you realize that
if I want this to be 0, then I want this to cancel out. So when I multiply X transposed X w, I get
the same thing as X transposed y. So they cancel out, and I get my 0. So you write this down, and you find
that this is the situation. I want this term to be
equal to this term. And that will give me the 0. The interesting thing is that in spite
of the fact that the matrix X is a very tall matrix, definitely
not square, hence not invertible, X transposed X is actually a square
matrix, because X transposed is this way and X is this way. You multiply them, and you get
a pretty small square matrix. And as we will see, the chances are
overwhelming that it will be invertible. So you can actually solve this very
simply, by inverting this. You multiply by the inverse
in this direction. You multiply by this. This will disappear, and you will get
an explicit formula for w, which you were trying to solve for. And when you do that, you will get w
equals this funny symbol, X dagger. What is X dagger? This is simply a shorthand
for writing this. So I got the inverse of that, and
then multiplied it by here. So this is really what I get
to be multiplied by y. I call it X dagger. And indeed, it gets multiplied
by y to give me my w. Now the X dagger is a pretty
interesting notion. It’s called the pseudo-inverse of X.
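The formula just described can be sketched in a few lines. This is a minimal illustration, assuming NumPy and a small synthetic data set; the sizes, the noise level, and the names `w_true`, `X_dagger`, and `w_pinv` are assumptions made up for the example, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical data: N examples, d = 2 inputs, plus the constant coordinate x_0 = 1.
N, d = 1000, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])  # tall matrix, N by (d+1)
w_true = np.array([0.5, -1.0, 2.0])                         # assumed target weights
y = X @ w_true + 0.1 * rng.normal(size=N)                   # noisy real-valued outputs

# Pseudo-inverse as written on the slide: (X^T X)^{-1} X^T.
X_dagger = np.linalg.inv(X.T @ X) @ X.T
w = X_dagger @ y

# In-sample error: average squared difference between w^T x_n and y_n.
E_in = np.mean((X @ w - y) ** 2)

# Gradient of E_in at w, (2/N) X^T (X w - y); it should be numerically zero at the minimum.
grad = (2.0 / N) * X.T @ (X @ w - y)

# Cross-check against NumPy's own pseudo-inverse routine.
w_pinv = np.linalg.pinv(X) @ y

print("w    =", w)
print("E_in =", E_in)
```

In practice, library routines such as `np.linalg.pinv` or `np.linalg.lstsq` are usually preferred over forming the inverse explicitly, since they behave better numerically.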
X, being a non-invertible matrix, does not have an inverse. But it does have a pseudo-inverse. And the pseudo-inverse has
interesting properties. For example, if you take this, the X
dagger, and multiply it by X– so X dagger times X– what do you get? You add X here. You get X transposed X. Oh, I have
X transposed X to the -1 here. So they cancel out, and
I get an identity. So when I multiply X dagger
by X, I get the identity. So it’s OK to call it
an inverse of sorts. It doesn’t work the other way around. The other way around gives us
an interesting matrix, which we’ll talk about later. But basically, this is
the essence of it. If we were in a trivial situation
where X was a square– I have 3 parameters, and I have
3 examples to determine them– that can be solved perfectly. I can actually get this to be 0. And how would you get it to be 0? You would just multiply by the proper
inverse of X in this case, and you will get X inverse y. So this is pretty much similar,
when X is a tall one. And we are not going to get a 0. We’re just going to get a minimum
using the pseudo-inverse. Now I would like you to appreciate the
pseudo-inverse from a computational point of view. This is the formula for the pseudo-inverse
that you will need to compute, in order to get the solution
for linear regression. So let’s look at it. Something is inverted. And when you see inversion in matrix,
you say, oh, computation, computation. If this was a million by a million,
I’m in trouble. If this is 5 by 5, I’m in good shape. So we’d like to know, what kind
of matrix do we have here? Well, nothing mysterious about
what’s inside this. You have this fellow, which
is X transposed. It has d plus 1 rows: d is the
length of your input, and 1 is for the added constant variable. So this is the number of parameters. This would be 3 in the digit
classification guy. We have only x_1 and x_2, so d equals 2. d plus 1 equals 3, which corresponds
to x_0, x_1, x_2, or to w_0, w_1, w_2. So this is 3 times N.
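To make these dimensions concrete, here is a small sketch, assuming NumPy and an invented N of 10,000 examples with d = 2. Random numbers stand in for real data, and the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

N, d = 10_000, 2  # assumed sizes: many examples, few inputs
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])  # constant coordinate plus d inputs

Xt = X.T       # (d+1) by N: short and very wide
XtX = Xt @ X   # (d+1) by (d+1): the small square matrix inside the pseudo-inverse formula

print(X.shape)    # (10000, 3)
print(Xt.shape)   # (3, 10000)
print(XtX.shape)  # (3, 3)
```

The size of the matrix that gets inverted depends only on the number of parameters, not on the number of examples.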
N is the scary one. That’s the number of examples. That could be in the thousands. Now you multiply this by X,
and that’s what you have. The multiplication will be– multiplication is
not that difficult. Even if this is 10,000, I can
multiply this by 10,000. But the good news is that when I go to
this guy, I will have to be dealing with a simpler guy. Let’s just complete the formula first. This is what you have. This is what you are computationally
doing. And if you look at what’s inside
here, it completely shrinks. That is what the matrix inside is. It’s just 3 by 3 in our case. You can invert that. Just accumulating it is the one that
you have to go through all of the examples. And there’s a very simple
way of doing it. It’s not that difficult
to get this fellow. And you can see now that, oh, good
thing that we had 3 parameters. If we had the 257 parameters to begin
with, this would have been 257 by 257. Not that this will discourage us. But if you go for some raw inputs, you
can get something really in the thousands or sometimes
even more than that. So the computational aspect
of this is very simple. And there are so many packages for
computing the pseudo-inverse, or outright getting the solution for linear
regression, that you will never have to do that yourself, except
if you’re doing something very specialized. If you do have something very
specialized, it’s not that bad. So that is the final matrix. And the final matrix will have the
same dimensions as this guy. And if you look at it, this will
be multiplied by what? Multiplied by y, which is y_1,
y_2, y_3, y_N, corresponding to different outputs. And then, as a result of that,
you will get the w’s– w_0, w_1, up to w_d. Indeed, if you multiply this by an N
tall vector, you will get a d plus 1 tall vector, and that’s
what we expect. Let’s now flash the full linear
regression algorithm here. That’s a crowded slide. That is what you do. The first thing is you take the data
that is given to you, and put them in the proper form. What is the proper form? You construct the matrix
X and the vector y. And these are what we
introduced before. This will be the input data matrix, and
this will be the target vector. And once you construct them, you are
basically done, because all you are going to do– you plug this into
a formula, which is the pseudo-inverse. And then you will return the value w,
that is the multiplication of that pseudo-inverse with y. And you are done. Now you can call this one-step
learning if you want. With the perceptron learning algorithm,
it looked more like learning, because I have
an initial hypothesis. And then I take one example at a time,
and try to figure out what is going on, move this around, et cetera. And after 1000 iterations,
I get something. That looks more like
how we learn. We learn in steps. This looks like cheating. You give me the thing,
and [MAKES SOUND]. And you have the answer. Well, as far as we are concerned,
we don’t care how you got it. If it’s correct and gives you a correct
E_out, you have learned. And because this is so simple, this is
a very popular algorithm that is used often, and used often as a building
block for other guys. We can afford to use it as a building
block, because the step here will be so simple that we can become more
sophisticated in using it. Just one remark about the inversion– X transposed X has to be invertible in order
for this formula to hold. Now the chance that it will be
invertible, in a real application you have, is close to 1. The reason is the following. Usually, you use very few parameters
and tons of examples. You will be very, very, very unlucky
to have these so dependent on each other that you cannot even capture
the dimensionality which is the number of columns. The number of columns is 3, 5, 10,
and you have 10,000 of those. So the chances are overwhelming in
a real problem that this will be invertible. Nonetheless, if it is not invertible,
you can still define the pseudo-inverse. It will not be unique and has
some elaborate features, but it’s not a big deal. That is not a situation you will
encounter in practice. So now we have linear regression. I’m going to tell you that you can
use linear regression not only for a real-valued function, for
regression problems. But you’re also going to be able
to use it for classification. Maybe the perceptron is now
going out of business. It has a competitor now. And the competitor has a very
simple algorithm. So let’s see how this works. The idea is incredibly simple. Linear regression learns
a real-valued function. Yeah, we know that. That is the real-valued function. The value belongs to the real numbers. Fine. Now the main observation, the ingenious
observation, is that binary-valued functions, which are the
classification functions, are also real-valued. +1 and -1, among other things,
happen to be real numbers. So linear regression is not going to
refuse to learn them as real numbers. Right? So what do we do? You use linear regression in order to
get a solution, such that the solution is approximately y_n in the
mean squared sense. For every example, the actual value
of the signal is close to the numerical +1 and the numerical -1. That’s what linear regression does. Now, having done that with y_n equals +1
or -1, you realize that in this case, if you take the
classification version of it– you take the sign of that signal in order
to be able to classify as +1 or -1. If the value is genuinely close to
+1 or -1 numerically, then the chances are when it’s +1,
this would be positive. And when it’s -1, it’s negative. The chances are– you’re getting close to
a number, you’ll probably cross the zero in doing that. And if you cross the zero, the
classification will be correct. So if you take this, and then plug it in
as weights for classification, you will likely get something
that will give you– likely to agree with +1 or -1. That’s a pretty simple trick,
because it’s almost free. All you need to do– I have a classification problem. Let’s run linear regression. It’s almost for free. Do this one-step learning, get
a solution, and use it for classification. Now, let’s see if this is
as good as it sounds. Well, the weights are good for
classification, so to speak, just by conjecture. But they also may serve as good initial
weights for classification. Remember that the perceptron algorithm,
or the pocket algorithm, are really very slow to get there. You start with a random guy. Half the guys are misclassified. And it just goes around, tries to
correct one, messes up the others, until it gets to the
region of interest. And then it converges. Why not give it a jump start? Why not run linear regression
first, get the w’s. We know that the w’s are OK, but they
are not really tailored toward classification. But they’re good initial condition. Feed those to the pocket algorithm, and
let it run to the solution, which is a classification solution. That’s a pretty nice idea. So let’s actually look at the
linear regression boundary. Now, I take an example here. Again, I have the +1 class
and the -1 class. And I applied– we’re trying to find, what is the
linear regression solution? Now, we remember, the blue region
and the pink region belong to classification. When you talk about linear regression,
you have the value here. And the signal is 0 here. The signal is positive, more positive,
more positive, more positive. And here, the signal is negative,
more negative, more negative, more negative. There is a real-valued function that
we are trying to interpret as a classification by taking the sign. Now, if you look at what the linear
regression is trying to do when you use it for classification,
all of these guys have a target value -1. It is actually trying to make
the numerical value equal to -1 for all of them. So the chances are, these
will be -1. This will be -2, -3. And the linear regression algorithm
is very sad about that. It considers it an error, in spite of
the fact that, when we plug it into the classification, it just
has the correct sign. And that’s all we care about. But we are applying linear regression. It is actually trying very hard to make
all of them -1 at the same time, which obviously it cannot. And you can see now the problem
with linear regression. In its attempt to make this -8,
-1, it moved the boundary to the level where it’s in the middle
of the red region. And now, it’s very happy because it
minimized its error function. But that’s not really
the classification. Nonetheless, it’s a good
starting point. And then you take the classification algorithm now,
that forgets about the values and tries to adjust it according
to the classification. And you will get a good boundary. That’s the contrast between applying
linear regression for classification and linear classification outright. Now we are done. I’m going to start on nonlinear
transformation. And I’m going to give you a very
interesting tool to play with. Here is the deal. You probably realized that, even when
dealing with non-separable data, we are dealing with non-separable data that
are really basically separable with few exceptions. But in reality, when you take a real
problem, a real-life problem, you will find that the data you are going
to get could be anything. It could be, for example, something
that looks like this. So you want to classify these as
+1’s and these as -1’s. Let’s take the classification
paradigm here. Now I can put the line anywhere. And obviously, I’m in trouble because
this is not linearly separable, even by a long shot. You can look at this and say:
I can see the pattern here. Closer to the center, you have blues. Closer to the peripherals,
you have reds. So it would be very nice if
I could apply a hypothesis that looks like this. Yes. The only problem is that
that’s not linear. We don’t have the tools
to deal with that, yet. Wouldn’t it be nice if in two viewgraphs,
you can use linear regression and linear classification, the
perceptron or the pocket, to apply it to this guy? That’s what will happen. I told you this is
a practical lecture. So we take another example
of nonlinearity. We take the credit line. Now if you look at the credit line, the
credit line is affected by years in residence. We argued that if someone has been in
the same residence for a long time, there is stability and
trustworthiness. And if someone has been there a short time,
there’s a question mark. Now one thing is to say that this is
a variable that affects the output. Another thing is to say that
this is a variable that affects the output linearly. It would be strange if I’m trying to
determine a credit line, to decide that the credit line will be proportional
to the time you have lived in residence. If you have 20 years instead of 10 years, I will
give you twice the credit line. It doesn’t make sense. Because stability is established
probably by the time you get to 5 years. After that, it’s diminishing returns. So it would be very nice if,
instead of using the raw input linearly, I can define nonlinear features,
such as the following. Let’s take the condition, the logical
condition, that the years in residence are less than 1. And in my mind, I’m considering that
this is not very stable. You haven’t been there for very long. And another guy, which is x_i greater
than 5, you have been there for more than 5 years. So you are stable. The notation here, when I put something
between these brackets, means that this returns 1 if the condition is true, and
returns 0 if the condition is false. So this is 1, 0, and this is 1, 0. Now if I had those as variables in my
linear regression, they would be much more friendly to the linear formula in
deciding the credit line, rather than the crude input. But these are nonlinear
functions of x_i. And again, we have the nonlinearity. And we wonder if we can apply the same
techniques to a nonlinear case. This is the question. Can we use linear models? The key question to ask
is, linear in what? What do I mean? Look at linear regression. What does it implement? It implements this. This is indeed a linear formula. And when you look at the linear
classification counterpart, it implements this. This is a linear formula, and the
algorithm being simple depends on this part being linear. And then you just make a decision
based on that signal. Now, these you would think are called
linear because they are linear in the x’s, which they are. Yeah, I get these inputs. And I combine them linearly. And I get my surface. That’s why I’m calling it linear. However, you will realize that,
more importantly, these guys are linear in w. Now when you go from the definition
of a function to learning, the roles are reversed. The inputs, which are supposed to be
the variable when you evaluate a function, are now constants. They are dictated by the training set. They’re just a bunch of numbers
someone gave me. The real variables, as far as learning
is concerned, are the parameters. The fact that it’s linear in the
parameters is what matters in deriving the perceptron learning algorithm, and
the linear regression algorithm. If you go back to the derivation, it
didn’t matter what the x’s were. The x’s were sitting
there as constants. And their linearity in w is what
enabled the derivation. That results in the algorithm
working, because of linearity in the weights. Now that opens a fantastic possibility,
because now I can take the inputs, which are just constants. Someone gives me data. And I can do incredible nonlinear
transformations to that data. And it will just remain more elaborate
data, but constant. When I get to learn using the
nonlinearly transformed data, I’m still in the realm of linear models,
because the weight that will be given to the nonlinear feature will
have a linear dependency. Let’s look at an example. Let’s say that you take x_1 and x_2. I omitted the constant x_0
here, for simplicity. And these are the guys
that gave us trouble. These are the coordinates. This is x_1. This is x_2. These guys should map to +1. These guys should map to -1. I don’t have a linear separator. OK, fine. These are data, right? So everything that appears within this
box is just a bunch of constant x’s and corresponding constant y’s. Now I’m going to take
a transformation. I’m going to call it phi. Every point in that space, I’m going
to transform to another space. And my formula for transformation
will be this. I’m assuming here that the origin of
the coordinate system is here. So I’m taking x_1 squared
and x_2 squared. And you can see where I’m leading,
because now I’m measuring distances from the origin. And that seems to be
a helpful guy here. Now in doing this, all I did was take
constants and produce other constants. Now, you can look at this and say:
this is my training data. I take your original training data, do
the transformation, and forget about the original one. Can you solve the problem
in the new space? Oh, yes you can, because that’s what
they look like in the new space. All of a sudden, the red guys, which
happen to be far away, will have bigger values for x_1 squared
and x_2 squared. They will sit here. And the guys that are closer to the
origin, by the time you transform them, they will have smaller
values here. So this is now your new data set. Can you separate this
using a perceptron? Yes, I can. I can put a line going through here. Great. When you get a new point to classify,
transform it the same way, classify it here, and then report that. That’s the game. And there is really no limit,
at least computationally, in terms of what you can do here. You can dream up really elaborate
nonlinear transformations, transform the data, and then do
the classification. There is a catch. And it’s a big catch. I will stop here. And we’ll continue with the nonlinear
transformation at the beginning of the next lecture. And we’ll take a short break now, before
we go to the Q&A session. We have questions from the online audience. MODERATOR: A popular question is how to figure out the
nonlinear transformations in a systematic way,
instead of from the data. PROFESSOR: I said
that the nonlinear transformation is a loaded question. And there will be two steps
in dealing with it. I will talk about it a little bit more
elaborately at the beginning of next lecture. And then we are going to talk about the
guidelines for choice, and what you can do and what you cannot do, after we
develop the theory of generalization because it is very sensitive to
the generalization issue. And that should not come as a surprise,
because I can see that I can take the input, which is, let’s say, two
variables corresponding to two parameters. And I want the transformation to be as
elaborate as possible, in order to stand a good chance of being able
to separate them linearly. So I’m going to go all out. I’m just going to keep getting
nonlinear coordinates– x_1, x_1 squared, x_1 cubed, x_1
squared x_2, e to the x, just go on. Now at some point, you should smell
a rat, because you realize that I have this very, very long vector and
a corresponding number of parameters. And generalization may become an issue,
which it will. So there are guidelines for
how far you can go. And also, there are guidelines
for how you can choose them. Do I look at the data and figure
out what is a good nonlinear transformation? Is this allowed? Is this not allowed? What the ramifications are? All of these will become clear only
after you look at the theory part. MODERATOR: OK. There’s a question about slide 15, regarding the expression for E_in. How does the in-sample error here, or
the out-of-sample error, relate to the probabilistic definition
of last time? PROFESSOR: OK. Here we dealt only with
the in-sample error. So we decided on E_in. And in general in learning, you only
have the in-sample error to deal with. You have on the side a guarantee that
when you do well in-sample, you will do well out-of-sample. So you never handle the out-of-sample
explicitly. You just handle the in-sample, and have
the theoretical guarantee that what you are doing will help
you out-of-sample. Now, the error measure here
was a squared error. Therefore, when you define the in-sample
error, you get the squared error and average it. And when you define the out-of-sample
error, it’s really the expected value of the squared error. Now in the case of the binary
classification, the error was binary. You’re either right or wrong. So you can always define the
in-sample error as an average too, the average of the answer to the question: am I right or wrong on each point? If you are right, there’s no error and you get 0. If you are wrong, you get 1. So you ask yourself: what is the
frequency of 1’s in-sample? And that would give you
the in-sample error. The expected value of that
error happens to be the probability of error. That’s why we simply, without going into
expectation and in-sample average versus out-of-sample expected value– in the case of classification, we simply
talked about frequency of error and probability of error, not because
they are different, but just because they are simple to state. But in reality, the aspect of them that
made them qualify as in-sample and out-of-sample is that the probability
is the expected value of an error measure that happens to
be a binary error measure. And the frequency of error happens
to be the average value of that error measure. STUDENT: So you showed us a very nice
graph with negative slope about dependence of future income and– PROFESSOR: This
is unintentional. I didn’t think of the income at
the time I drew the graph. So any implication that you should
really do worse in school in order to gain more money is– I disown any such conclusion! STUDENT: OK. But you mentioned the example of
determining future income from grade point average, or at least finding
some correlation. So the question I’m interested
in is, where can we get data? PROFESSOR: You can get– obviously, the alumni association
of every school keeps track of the alumni. And they send them questionnaires. And they have some of the inputs,
and how much money they make. There are a number of parameters. So there will be a number of
schools that have that. And this is actually used. If you realize that something is
related to success or something, you can go back and revise your curriculum
or revise your criteria. So the data is indeed available,
if that’s the question. STUDENT: I mean, it’s available
in principle. But can we get it? PROFESSOR: Oh, we get it. I thought it was generic we. I don’t– obviously, the data will be
anonymous after a while. You’ll just get the GPA and the income,
without knowing who the person is. You are dependent on the kindness of the
alumni associations at different schools, I guess. Or maybe there are some available
in public domain. I have not looked. So my understanding is that you want
to run linear regression, see what happens, and then focus your time
on the courses that matter. That’s the idea now? That’s your feedback? MODERATOR: A technical question. Why is the w_0 included in
the linear regression. So there’s a confusion about this. And
also in that point, what do you do specifically in the binary case? How do you incorporate the
+1’s or -1? There’s some people asking about this. PROFESSOR: Let me
answer one at a time. I’ll talk about the threshold first. Why is the threshold there, right? Let’s look here. Look at the line here,
the linear regression line. The linear regression line is
not a homogeneous line. It doesn’t pass through the origin. If I told you that you cannot use
a threshold, then the constant part of the equation goes away, and the
line you have will have to pass through the origin. Can you imagine if you were trying
to fit this with a line? Obviously, it would be down there
if you have the negative slope, or if you want to pass through
the points up there. So obviously, I need the constant
in order to get a proper model. And in general, there is
an offset depending on the values of these variables. And the offset is compensated
for by the threshold. That’s why we need the threshold
for linear regression. What is the second question? MODERATOR: In the binary case, when
you use y as +1 or -1, why does that just work? PROFESSOR: Well, if you apply
linear regression, you have the following guarantee at the end. The hypothesis you have has the
least squared error from the targets on the examples. That’s what has been achieved by the
linear regression algorithm. Now the outputs of the examples
being +1 or -1, we can put that together with
the first statement. And then we realize that the output
of my hypothesis is closest to the value +1 or -1 in
the mean-squared-error sense. The leap of faith is that, if you are
close to +1 versus -1, then the chances are when you are close to
+1, you are at least positive. And when you are close to -1,
you are at least negative. If you accept that leap of faith, then
the conclusion is that, when you take the threshold of the value of the signal
from linear regression, you will get the classification right
because positive will give you +1. Negative will give you -1. This is not quite the case, because in
the attempt to numerically replicate all the points, the signal for linear
regression can become– let’s say as I mentioned, +7 for some points
and -7 for another point. And the linear regression is trying to
push the w, which is what will end up being the boundary, in order to
capture that numerical value. So in attempting to fit stuff that is
irrelevant to the classification, it may mess up the classification. And that’s why the suggestion is, don’t
use it as a final thing for classification. Just use it as an initial weight, and
then use a proper classification, something as simple as the pocket
algorithm, in order to fine-tune it further in order to get the classification
part, without having to suffer from the numerical angle. MODERATOR: So also on that, does it
make a difference what you use? +1, -1, or something else? PROFESSOR: OK. If it’s plus something and minus the same
thing, it’s a matter of scale. If it’s plus and minus, and not
symmetric, it will be absorbed in the threshold. So it really doesn’t matter. It will just make things
look different. MODERATOR: Regarding the first part of the lecture, how
do you usually come up with features? PROFESSOR: OK. The best approach is to look at the
raw input, and look at the problem statement, and then try to infer
what would be a meaningful feature for this problem? For example, the case where I talked
about the years in residence. It does make sense to derive some
features that are closer to the linear dependency. There is no general algorithm
for getting features. This is the part where you work
with the problem, and you try to represent the input in a better way. And the only catch is, if you look
at the data in order to try to derive the features, there is a problem there that
will become apparent when we come to the theory. But the bottom line is that, if you don’t
look at the data, and you study the problem and derive features based
on that, that will almost always be helpful if you don’t have
too many of them. If you have too many of them, it
starts becoming a problem. But to first order, usually when I get
a problem, I look at the data. And I can probably think of fewer
than a dozen variables that will be helpful. And I put all of them in. And usually, a dozen variables in
this case doesn’t increase the input space by much. These are big problems. So I don’t suffer much from
the generalization issue. MODERATOR: So added to that,
a short clarification– so the nonlinear transformations– they become features? PROFESSOR: Yeah. The word feature, we
are going to use. There’s a feature space
which is called Z. And anything that you take the input and
transform it into something else, this will be called feature. And features of features
will also be features. So if you take for example the
classification of the digits, we had the pixel values. That’s the raw input. And then we had the symmetry
and the intensity. These were features. If you go further and find nonlinear
transformations of those, these will also be called features. A feature is any higher-level
representation of a raw input. MODERATOR: Another question is: how
does this analysis change if we cannot assume that the data–
if they’re not independent? PROFESSOR: Not clear
about the question. So there is really– I think I get it. Probably when we get the inputs, the
question is independence versus dependence. And the independence was used in
getting the generalization bound. That’s probably the direction
of the question. The independence was from one
data point to another. So I have N inputs. And I want these guys to be generated
independently, according to a probability distribution. If they were originally independent,
and I transformed one of them and transformed the other, the independence
is inherited. There is no question of independence
between coordinates of the same input. The independence was a question
of the independence between the different inputs. MODERATOR: So the different inputs. PROFESSOR: Different
input points. MODERATOR: So another question is, are
there methods that use different hyperplanes and intersections
of them to separate data? PROFESSOR: Correct. The linear model that we have described
is the building block of so many models in machine learning. You will find that if you take a linear
model with a soft threshold, not the hard-threshold version, and you
put a bunch of them together, you will get a neural network. If you take the linear model, and you try
to pick the separating boundary in a principled way, you get
support vector machines. If you take the nonlinear
transformation, and you try to find a computationally efficient way of doing
it, you get kernel methods. So there are lots of methods within
machine learning that build on the linear model. The linear model is somewhat
underutilized. It’s not glorious,
but it does the job. The interesting thing is that if you
have a problem, there is a very good chance that if you take a simple linear
model, you will be able to achieve what you want. You may not be able to brag about it. But you are going to do the job. And obviously, the other models will
give you incremental performance in some cases. MODERATOR: So a question, getting
a little bit ahead– how do you assess the quality of E_in
and E_out systematically? PROFESSOR: This is
a theoretical question. E_in is very simple. I have the value of E_in. I can assess its value by just
looking at its value. I can evaluate it at any given point. And this is what makes the algorithm
able to pick the best in-sample hypothesis, by picking the one that
has the smallest in-sample error. The out-of-sample error,
I don’t have access to. There will be some methods described
after the theory that will give us an explicit estimate of the
out-of-sample error. But in general, I rely on the theory
that guarantees that the in-sample error tracks the out-of-sample error,
in order to go all out for the in-sample error, and hope that the
out-of-sample error follows, which we have seen in the graph when we were
looking at the evolution of the perceptron. And the in-sample error
was going down and up. And the out-of-sample error was also
going down and up, albeit with a discrepancy between the two. But they were tracking each other. MODERATOR: So here’s a question
that’s kind of a confusion. If you want to fit a polynomial, is
this still a linear regression case? PROFESSOR: Correct. Because right now, let’s say we have
a single input variable, x, like the case I gave. So you have x and y. Now you have a line. If you use the nonlinear transformation,
you can transform this x to x, x squared, x cubed, x to the
fourth, x to the fifth, and then fit a line to the new space. And a line in the new space will be
a polynomial in the old space. So this is covered through the
nonlinear transformation. MODERATOR: What is the relation
between linear regression least squares and maximum likelihood estimation? PROFESSOR: OK. When you look at linear regression in
the statistics literature, there are many more assumptions about the
probabilities and what the noise is. And you can get actually
more results about it. Under certain conditions, you can
relate it to the maximum likelihood. You can say, Gaussian noise goes
with the squared error. And in this case, minimizing it will
correspond to maximum likelihood. So there is a relationship. On the other hand, I prefer to give the
linear regression in the context of machine learning, without making too
many assumptions about distributions and whatnot, because I want it to be
applied to a general situation rather than applied to a particular
situation. As a result of that, I will be able to
say less in terms of what is the probability of being right or wrong. I just have the generalization from
in-sample to out-of-sample. But that suffices for most
machine learning situations. So there is a relationship. And it’s studied fairly well
in other disciplines. But it is not of particular
interest to the line of logic that I’m following. MODERATOR: So a popular question is: can
you give at least a set of commonly used nonlinear transformations? PROFESSOR: There will be many. When we get to support vector
machines, we will be dealing with a number of transformations, some of
them polynomials like the ones that were mentioned. One of the useful ones is referred
to as radial basis functions. We will talk about that as well. So there will be transformations. And the main point is to be able to
understand what you can and what you cannot do, in terms of jeopardizing the
generalization performance by taking a nonlinear transformation. So after we are done with that theory,
we will have a significant level of freedom of choosing what nonlinear
transform we use. And we’ll have some guidelines of some
of the famous nonlinear transforms. So this is coming up. MODERATOR: I think you already
answered this question last time. But again, someone asks, is it
impossible for machine learning to find a pattern of a pseudo-random
number generator? PROFESSOR: Well, if it’s
pseudo-random, then in principle, if you get the seed, you can produce it. But the way it’s usually used is you use
a pseudo-random number, and then you take a few bits and have them as
an output for different inputs. So just looking at the inputs and trying
to decipher it– it’s next to impossible. So it’s a practical question. Philosophically, yes you can. Practically, it looks random
for all intents and purposes. MODERATOR: So what are the different
treatments for continuous responses versus discrete responses in I guess– PROFESSOR: Yeah. Obviously, this is dictated
by the problem. If someone comes, and they want to
approve credit, etc, I’m going to use the classification hypothesis set. If someone wants to get a credit line or
something else, then I will have to use regression. So it really is dependent
on the problem. And the funny part is that real numbers
look more sophisticated. Yet the algorithm that goes with them,
which is linear regression, is much easier than the other one. The reason is that the other
one is combinatorial. And combinatorial optimization is
pretty difficult in general. So the answer to the question is that it
depends on the target function that the person is coming up with. And when there is cross-fertilization
between the techniques, it’s just a way to use an analytic
advantage from one method to give the other one a jump start, or to give
it a reasonable solution. But it’s a computational question. The distinction is really in the
problem statement itself. MODERATOR: Can you say what makes
a nonlinear transformation good? PROFESSOR: OK. I will be able to talk about this
a little bit more intelligently after the theory. I would like to emphasize that the
theory part will be very important in giving us all the tools to talk, with
authority, about all the issues that are being raised. So there is a reason for including the
theory before we go into more details. This lecture was meant to give you just
a few standard tools that you can use right away
for many applications and many data sets, because now you
can deal with non-separable data. You can deal with real-valued data. And you can even deal with some
nonlinear situations. So it’s just a toolbox for you
to get your hands wet. And then things will become
more principled when we develop more material. MODERATOR: Yeah, I think that’s it. PROFESSOR: OK, that’s it. We will see you on Thursday.
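
As an illustration of the suggestion made earlier in this Q&A (use linear regression to get initial weights, then fine-tune with something as simple as the pocket algorithm), here is a minimal Python sketch. It is not code from the lecture: the synthetic data set, the function names (linear_regression_weights, classification_error, pocket), and the parameters max_updates and seed are all assumptions made for illustration.

```python
import numpy as np

def linear_regression_weights(X, y):
    """One-step least-squares solution w = pinv(X) @ y."""
    return np.linalg.pinv(X) @ y

def classification_error(w, X, y):
    """Fraction of points where sign(w . x) disagrees with the +1/-1 label."""
    return np.mean(np.sign(X @ w) != y)

def pocket(X, y, w_init, max_updates=1000, seed=0):
    """Perceptron updates, keeping the best weights seen so far 'in the pocket'."""
    rng = np.random.default_rng(seed)
    w = w_init.copy()
    best_w, best_err = w.copy(), classification_error(w, X, y)
    for _ in range(max_updates):
        misclassified = np.flatnonzero(np.sign(X @ w) != y)
        if misclassified.size == 0:
            break  # everything is classified correctly
        i = rng.choice(misclassified)   # pick one misclassified point
        w = w + y[i] * X[i]             # standard perceptron update
        err = classification_error(w, X, y)
        if err < best_err:              # only replace the pocket on improvement
            best_w, best_err = w.copy(), err
    return best_w, best_err

# Made-up, linearly separable 2-D data with labels in {+1, -1}.
rng = np.random.default_rng(1)
N = 200
pts = rng.uniform(-1, 1, size=(N, 2))
X = np.column_stack([np.ones(N), pts])  # x0 = 1 absorbs the threshold
y = np.where(pts[:, 0] + 0.5 * pts[:, 1] + 0.1 > 0, 1.0, -1.0)

w_lin = linear_regression_weights(X, y)        # regression on the +1/-1 targets
err_lin = classification_error(w_lin, X, y)    # error of sign(w_lin . x)
w_pocket, err_pocket = pocket(X, y, w_lin)     # fine-tune for classification

print("in-sample error, regression only:", err_lin)
print("in-sample error, after pocket:   ", err_pocket)
```

Because the pocket starts out holding the regression weights, the in-sample classification error it returns cannot be worse than what linear regression alone gives.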

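The answer above about polynomials (a line in the transformed space is a polynomial in the original space) can be sketched in the same hedged way. The helper poly_features and the noiseless cubic target are hypothetical, chosen only so the result is easy to check. The lecture does not prescribe this code.

```python
import numpy as np

def poly_features(x, degree):
    """Nonlinear transform: map each scalar x to z = (1, x, x^2, ..., x^degree)."""
    return np.vander(x, degree + 1, increasing=True)

# Made-up data: a noiseless cubic in a single input variable x.
x = np.linspace(-1, 1, 50)
y = 1.0 - 2.0 * x + 0.5 * x**3

Z = poly_features(x, 3)        # inputs in the feature space Z
w = np.linalg.pinv(Z) @ y      # ordinary linear regression, but on Z instead of x

print("weights in feature space:", w)
```

The model is still linear in the weights w, which is why the same one-step pseudo-inverse solution applies. The weights are simply the polynomial coefficients in the original x space.
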
59 comments on “Lecture 03 -The Linear Model I”

  1. Thank you, Prof. Yaser Abu-Mostafa for these lectures. The concepts are concisely and precisely explained. I especially like how he explains the equations with real world examples. It makes the course material much more approachable.

  2. For the first time I am finding Machine Learning interesting and learn-able. Thank you very much sir.

  3. Anybody have homeworks accompanying these lectures? I don't have a registered account for the course and the registration is closed now. 🙁

  4. Hello, It's a nice lecture!! Thanks to caltech and Prof Yasser. Can any one tell me where I can get the corresponding slides and textbooks? Thanks

  5. Good lecture.
    I think the question had to do with correlation between a transformed feature and the original feature. This is describing the problem of multicollinearity. With multicollinearity, numerical instability can occur. The weight estimates are unbiased (on average you'd expect them to be correct) but they're unstable – running with slightly different data might give different estimates.

    E.g. Estimate SAT scores. If given weight, height and parent's occupation one might expect all 3 are correlated. The OLS algorithm won't know which to properly credit.

  6. Great lecture! Does the use of features ('features' are "higher level representations of raw inputs") increase the performance of a model out of sample? Does it somehow add information? Or does it simply make it computationally easier to produce a model? I'm working on a problem where this could potentially be very useful.

    I could also see how the use of features could make a model more meaningful to human interpretation, but there is a risk as well that interpretations will vary between people based on what words are being used. 'Intensity' and 'symmetry' are used here which are great examples, but it could very quickly get more abstract or technical.

    Thank you in advance to anyone who has an answer to my question!

  7. What a great authority, fielding questions like a true pundit on the subject. Great respect and thanks a lot.

  8. How was E_out computed in each iteration? Was it using subsample of given sample and estimated E_out on full sample?

  9. I learnt a method in another class where the primary classification features were selected based on things that causes the maximum change in entropy.

  10. In the previous lectures E(in) was used for in-sample performance. Was it replaced by in-sample error in this lecture? Am I missing something?

  11. Wow. This lecture was great. The difference between using linear regression for classification and using linear classification outright couldn't have been explained better (I mean it was just amazing).

  12. It seems to me that there is a small typo on the 18th slide (48:25). To perform classification using linear regression, it seems one needs to check sign(wx - y) rather than sign(wx).

  13. What should I do if I didn't understand all the math in this lecture?
    Do you have some resource that explains it quickly?

  14. Great lecture overall. However, I couldn't really understand how to implement linear regression for classification…

  15. Not sure I'm a fan of the "symmetry" measure. The number 8 in that example is clearly offset from center, the example only would have apparent symmetry because it's a wide number with a lot of black space. If a 1 is slanted and off center, it will literally have nearly 0 apparent symmetry because only its center point would have vertical symmetry. Oh well, we'll see where it goes.

  16. Instead of using linear regression to determine a good estimate of starting weights for a more sophisticated classification algorithm, could we put the hypothesis function inside the sigmoid function? Wouldn't this overcome regression's difficulty with values far from the origin?

  17. Just to clarify, what piece of theory guarantees that the in-sample error will track the out of sample error? E_in <-> E_out

  18. I thought linear regression was for extrapolating data to a line to forecast future predictions. But here it is explained in terms of a separation boundary for classification. Can someone explain?

  19. 37:10 The formula is correct if we define the gradient as the transpose of the Jacobian matrix, not the Jacobian matrix itself. In optimization techniques this convention is very helpful, so I think he uses it.
