## Lecture 8: Recurrent Neural Networks and Language Models

[MUSIC] Stanford University.>>All right, hello everybody. Welcome to Lecture seven or

maybe it’s eight. Definitely today is the beginning of where we talk about models that

really matter in practice. We’ll talk today about the simplest

recurrent neural network model one can think of. But in general, this model family is what most people

now use in real production settings. So it’s really exciting. We only have a little bit

of math in between and a lot of it is quite applied and

should be quite fun. Just one organizational

item before we get started. I’ll have an extra office

hour today right after class. I’ll be again on Queuestatus 68 or so. Last week we had to end at 8:30. And there’s still a lot of

people who had a question, so I’ll be here after class for

probably another two hours or so. Try to get through everybody’s questions. Are there any questions around projects?>>[LAUGH]

>>And organizational stuff? All right, then let’s take a look

at the overview for today. So to really appreciate the power of

recurrent neural networks it makes sense to get a little bit of background

on traditional language models, which have huge RAM requirements and

aren’t really feasible in the kinds of settings where

they obtain their highest accuracies. And then we’ll motivate recurrent

neural networks with language modeling. It’s a very important

kind of fundamental task in NLP that tries to

predict the next word. Something that sounds quite simple but

is really powerful. And then we’ll dive a little bit into

the problems that you can actually quite easily understand once

you have figured out how to take gradients and you actually

understand what backpropagation does. And then we can go and

see how to extend these models and apply them to real sequence tasks

that people really run in practice. All right, so let’s dive right in. Language models. So basically, we want to just compute the probability

of an entire sequence of words. And you might say,

well why is that useful? Why should we be able to compute

how likely a sequence is? And actually comes up for

a lot of different kinds of problems. So one, for instance,

in machine translation, you might have a bunch of potential

translations that a system gives you. And then you might wanna understand

which order of words is the best. So “the cat is small” should get a higher

probability than “small the is cat”. But based on another language

that you translate from, it might not be as obvious. And the other language might have

a reversed word order and whatnot. Another one is when you do speech

recognition, for instance. It also comes up in the machine

translation a little bit, where you might have, well this particular example is clearly

more a machine translation example. But comes up also in speech

recognition where you might wanna understand which word might be the better

choice given the rest of the sequence. So “walking home after school” sounds

a lot more natural than “walking house after school”. But home and

house have the same translation, the same word in German, which is Haus,

H-A-U-S. And you want to know which one is

the better one for that translation. So it comes up in a lot of

different kinds of areas. Now basically it’s hard to compute

the perfect probabilities for all potential sequences ’cause

there are a lot of them. And so what we usually end up doing is

we basically condition on just a window, we try to predict the next word based

on just the previous n words before the one that

we’re trying to predict. So this is, of course,

an incorrect assumption. The next word that I will utter will

depend on many words in the past. But it’s something that had to be done to use traditional count based

machine learning models. So basically we’ll approximate this

overall sequence probability here with just a simpler version. In the perfect sense this would basically

be the product here of each word, given all preceding words

from the first one all the way to the one just

before the i_th one. But in practice, this probability with

traditional machine learning models we couldn’t really compute so we actually approximate that with some

number of n words just before each word. So this is a simple Markov assumption

just assuming the next action or next word that is uttered just

depends on n previous words. And now if we wanted to use traditional

methods that are just basically based on the counts of words and not

using our fancy word vectors and so on. Then the way we would compute and estimate

these probabilities is essentially just by counting. If you want to

get the probability of the second word given the first word, we would just count up how often

these two words co-occur in this order, divided by how often the first

word appears in the whole corpus. Let’s say we have a very large corpus and

we just collect all these counts. And now if we wanted to condition not just

on the previous word but on the two previous words, then we’d

have to compute all these counts. And now you can kind of sense that, well, if we ideally want to condition on as

many previous words as possible, but we have a large vocabulary of, say, 100,000

words, then we’ll have a lot of counts: essentially 100,000 cubed many numbers we would have to store

to estimate all these probabilities. Does that make sense? Are there any questions for

these traditional methods? All right, now, the problem with

that is that the performance usually improves as we have more and

more of these counts. But, also,

you now increase your RAM requirements. And so,

one of the best models of this traditional type actually required 140 gigs of RAM for

just computing all these counts when they wanted to compute them for

a 126-billion-token corpus. So it’s very,

very inefficient in terms of RAM. And a model that basically

stores all these different n-gram counts is something you could never store on a phone or

any small machine. And now, of course, once computer

scientists struggle with a problem like that, they’ll find ways to deal with it,

and so, there are a lot of different

ways you can back off. You say, well, if I don’t find the 4-gram,

or I didn’t store it, because it was not frequent enough,

then maybe I’ll try the 3-gram. And if I can’t find that or I don’t have

many counts for that, then I can back off and estimate my probabilities with fewer

and fewer words in the context size. But in general you want

to have at least tri or 4-grams that you store and the RAM

requirements for those are very large. So that is actually something

that you’ll observe in a lot of comparisons between deep

learning models and traditional NLP models that are based on

just counting words for specific classes. The more powerful your models are, sometimes the RAM requirements can

get very large very quickly, and there are a lot of different ways

people tried to combat these issues. Now our way will be to use

recurrent neural networks. Where basically, they’re similar to

the normal neural networks that we’ve seen already, but they will actually tie

the weights between different time steps. And as you go over it, you keep using, re-using essentially the same

linear plus non-linearity layer. And that will at least in theory,

allow us to actually condition what we’re trying to predict

on all the previous words. And now here the RAM requirements will

only scale with the number of words not with the length of the sequence

that we might want to condition on. So now how’s this really defined? Again, you’ll see different

kinds of visualizations and I’m introducing you to a couple. I like sort of this unfolded one where we

have here an abstract hidden state at time step t, and basically it’s conditioned on

h_t-1, and then here you compute h_t+1. But in general,

the equations here are quite intuitive. We assume we have a list of word vectors. For now,

let’s assume the word vectors are fixed. Later on we can actually loosen

that assumption and get rid of it. And now, at each time step,

to compute the hidden state at that time step, we’ll essentially

just have these two matrices, these two linear layers,

matrix vector products and we sum them up. And that’s essentially similar to

saying we concatenate h_t-1 and the word vector at time step t, and

we also concatenate these two matrices. And then we apply

an element-wise non-linearity. So this is essentially just a standard

single layer neural network. And then on top of that we can

use this as a feature vector, or as our input to our standard

softmax classification layer. To get an output probability for

instance over all the words. So now the way we would write

this out in this formulation is basically the probability that

the next word is at this specific index j, conditioned

on all the previous words is essentially the j_th element

of this large output vector. Yes? What is s? So here you can have different

ways to define your matrices. Some people just use u, v,

and w or something like that. But here we basically use the superscript

just to identify which matrix we have. And these are all different matrices, so

W_(hh), the reason we call it hh is it’s the W that computes the hidden

layer h given the input h_t-1. And then you have W_(hx) here,

which essentially maps x into the same vector space that we have our hidden states in, and

then s is just our softmax W, the weights of the softmax classifier. And so let’s look at the dimensions here. It’s again very important. You have another question? So why do we concatenate and

not add is the question. So they’re the same. So when you write W_(hh) times h_t-1, using the same notation, plus W_(hx) times x, then this is actually the same thing. And so this will now basically be

a vector, and we feed it into the non-linearity, but that doesn’t really change things, so

let’s just look at this inside part here. Now if we concatenate h and

x together, and let’s say x here has a certain

dimensionality which we’ll call d. So x is in R^d and

our h we’ll define to have the dimensionality Dh, so h is in R^(Dh). Now, what would the dimensionality be

if we concatenated these two matrices? So we have here the output, which has to be,

again, a Dh-dimensional vector. And now this vector here, what dimensionality does this vector

have when we concatenate the two? That’s right. So this is (d plus Dh) times one, and here our matrix has Dh rows. The inner dimensions have to match, so the matrix is Dh times (d plus Dh), and

that’s why we can essentially concatenate W_(hh) in this way,

and W_(hx) here. And now we can basically multiply these. And if, again, this is confusing,

you can write out all the indices. And you’ll realize that these

two are exactly the same. Does that make sense? Right, so as you sum up all the values

here, it’ll essentially just get summed up also; it doesn’t matter

if you do it in one go or not. It’s just a single layer neural network

where you concatenate the two inputs, but in many cases for recurrent

neural networks it is written this way. All right. So now, here are two other ways

you’ll often see these visualized. This is kind of a not unrolled version of

a hidden, of a recurrent neural network. And sometimes you’ll also see

sort of this self loop here. I actually find these kinds of

unrolled versions the most intuitive. All right. Now when you start and you. Yup? Good question. So what is x[t]? It’s essentially the word vector for the word that appears

at the t_th time step, as opposed to x_t. Intuitively,

x_t you could define in any way; as you go through

the lectures you’ll actually observe different versions. Intuitively,

x_t is just a vector at index t, but here x[t] is already an input, and

what it means in practice is you actually have to now go to that t_th time

step, find the word identity and pull that word vector from your GloVe or

word2vec vectors, and get that in there. So x_t we used in previous

lectures as the t_th element for instance in the whole embedding matrix,

all our word vectors. So this is just to make it very explicit

that we look up the identity of the word at the tth time step and then get

the word vector for that identity, like the vector in all our word vectors. Yep. So I’m showing here a single layer

neural network at each time step, and then the question is whether that

is standard or just for simplicity? It is actually the simplest and

still somewhat useful. Variant of a recurrent neural network,

though we’ll see a lot of extensions even in this class, and then in the lecture

next week we’ll go to even better versions of these kinds of

recurrent neural networks. But this is actually a somewhat

practical neural network, though we can improve it in many ways. Now, you might be curious when

you just start your sequence, and this is h_0 here and

there aren’t any previous words. What you would do, and the simplest thing,

is you just initialize the vector for the first hidden layer at the first or the

0th time step as just a vector of all 0s. Right, and this is the x[t] definition

that was just described: the column vector of L, which is our embedding matrix,

at index [t], which is the word at time step t. All right, so it’s very important to keep

track properly of all our dimensionalities. Here, W_(S), the softmax matrix, actually goes

over the size of our vocabulary: it’s V times the hidden state size. So the output here is

a vector whose length is the number of words that we

might want to be able to predict. All right, any questions for the feed forward definition of

a recurrent neural network? All right, so how do we train this? Well fortunately, we can use all the same machinery we’ve

already introduced and carefully derived. So basically here we have probability

distribution over the vocabulary and we’re going to use the same exact cross

entropy loss function that we had before, but now the classes are essentially

just the next word. So this actually sometimes

creates a little confusion on the nomenclature that we have

’cause now technically this is unsupervised in the sense that

you just give it raw text. But this is the same kind of objective

function we use when we have supervised training where we have a specific

class that we’re trying to predict. So the class at each time step is

just a word index of the next word. And you’re already familiar with that, here we’re just summing over the entire

vocabulary for each of the elements of Y. And now, to evaluate how well you can predict

the next word over many different words in longer sequences, you could in theory just

take the negative of the average log probability over this entire dataset. But for maybe historical reasons, and also

reasons like information theory and so on that we don’t need to get into, what’s

more common is actually to use perplexity. So that’s just 2 to

the power of this value and, hence, we want to basically

be less perplexed. So the lower our perplexity is,

the less the model is perplexed or confused about what the next word is. And we essentially, ideally we’ll assign

a higher probability to the word that actually appears in the longer

sequence at each time step. Yes? Any reason why 2 to the J? Yes, but it’s sort of a rat hole

we can go down, maybe after class. Information theory bits and

so on, it’s not necessary. All right.>>[LAUGH]
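
The forward step, the cross-entropy loss, and perplexity described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the lecture's own code: it assumes a sigmoid non-linearity and log base 2, and the function name `rnn_lm_perplexity`, the toy dimensions, and all weight values are made up for the example. It also checks numerically that adding the two matrix-vector products gives the same result as multiplying the concatenated matrix by the concatenated vector.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_lm_perplexity(word_ids, L, W_hh, W_hx, W_s):
    """Run a simple RNN over word_ids; return (average log2 loss J, 2**J)."""
    h = np.zeros(W_hh.shape[0])    # h_0 is a vector of all zeros
    total = 0.0
    for t in range(len(word_ids) - 1):
        x_t = L[:, word_ids[t]]    # x[t]: column of the embedding matrix L
        h = 1.0 / (1.0 + np.exp(-(W_hh @ h + W_hx @ x_t)))  # sigmoid (assumed)
        y_hat = softmax(W_s @ h)   # distribution over the vocabulary
        total += -np.log2(y_hat[word_ids[t + 1]])  # class = index of next word
    J = total / (len(word_ids) - 1)
    return J, 2.0 ** J

# Hypothetical toy sizes: vocabulary V, word vector size d, hidden size Dh.
rng = np.random.default_rng(0)
V, d, Dh = 10, 4, 5
L = rng.normal(scale=0.1, size=(d, V))
W_hh = rng.normal(scale=0.1, size=(Dh, Dh))
W_hx = rng.normal(scale=0.1, size=(Dh, d))
W_s = np.zeros((V, Dh))            # all-zero softmax weights: uniform prediction

J, ppl = rnn_lm_perplexity([1, 4, 2, 7, 3], L, W_hh, W_hx, W_s)

# Concatenation versus addition: both give the same pre-activation vector.
h_prev, x = rng.normal(size=Dh), rng.normal(size=d)
same = np.allclose(np.hstack([W_hh, W_hx]) @ np.concatenate([h_prev, x]),
                   W_hh @ h_prev + W_hx @ x)
```

With all-zero softmax weights the model predicts a uniform distribution over the vocabulary, so the perplexity comes out equal to the vocabulary size, which is a handy sanity check for an untrained model.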

>>All right, so now you would think, well this is pretty

simple, we have a single set of W matrices, and training should

be relatively straightforward. Sadly, and this is really the main

drawback of this and a reason why we introduce all these other more powerful

recurrent neural network models, training these kinds of models

is actually incredibly hard. And we can now analyze, using the tools of back propagation and

chain rule and all of that. Now we can analyze and

understand why that is. So basically we’re multiplying here,

the same matrix at each time step, right? So you can kind of think of

this matrix multiplication as amplifying certain patterns over and

over again at every single time step. And so, in a perfect world, we would want the inputs from many time

steps ago to actually be able to still modify what we’re trying to predict

at a later, much later, time step. And so, one thing I would like

to encourage you to do is to try to take the derivatives

with respect to these Ws, if you just had a two or

three word sequence. It’s a great exercise,

great preparation for the midterm. And it’ll give you some

interesting insights. Now, as we multiply the same matrix

at each time step during forward prop, we have to do the same thing during

back propagation. We have, remember, our deltas, our error signals and sort of

the global elements of the gradients. They will essentially at each time step

flow through this network backwards. So when we take our cross-entropy

loss here, we take derivatives, we back propagate we compute our deltas. Now the first time step here that just

happened close to that output would make a very good update and

will probably also make a good update to the word vector here if

we wanted to update those. We’ll talk about that later. But then as you go backwards in

time what actually will happen is your signal might get either too weak,

or too strong. And that is essentially called

the vanishing (or, when it gets too strong, exploding) gradient problem. As you go backwards through time, and you

try to send the error signal at time step t many time steps into the past, you’ll

have the vanishing gradient problem. So, what does that mean and

how does it happen? Let’s define here a simpler, but

similar recurrent neural network that will allow us to give you an intuition and

simplify the math downstream. So here we essentially just say, all

right, instead of our original definition, where we had some kind of f,

some kind of non-linearity (here we use the sigmoid function, but

you could use another one, like the rectified linear units we introduced,

and so on), instead of applying it here, we’ll apply it in the definition

just right in here. So it’s the same thing. And then let’s assume, for now,

we don’t have the softmax. We just have here, a standard,

a bunch of un-normalized scores. Which really doesn’t matter for

the math, but it’ll simplify the math. Now if you want to compute the derivative of the total

error over an entire sequence with respect to your W, then

you basically have to sum up the derivatives of the errors at all the time steps. At each time step, we have an error of how incorrect we

were about predicting the next word. And that’s basically the sum here, and now we’re going to look at the element

at the t_th time step of that sum. So let’s just look at a single time step,

a single error at a single time step. And now even computing that will

require us to have a very large chain rule application,

because essentially this error at time step t will depend on all

the previous time steps too. So you have here the delta or

dE_t over dy_t, where y_t is, sorry, not the hidden state, but the softmax output, or

here these unnormalized score outputs y_t. But then you have to multiply that

with the partial derivative of y_t with respect to the hidden state. So that’s just this guy

right here, or this guy for ht. But now, that one depends on,

of course, the previous one, right? This one here, but it also depends

on that one, and that one, and the one before that, and so on. And so that’s why you have to sum over

all the time steps from the first one, all the way to the current one, where

you’re trying to predict the next word. And now, each of these was

also computed with a W, so you have to multiply partial of that,

as well. Now, let’s dig into

this a little bit more. And you don’t have to worry too

much if this is a little fast. You won’t have to really

go through all of this, but it’s very similar to a lot of

the math that we’ve done before. So you can kind of feel comfortable for

the most part going over it at this speed. So now, remember here,

our definition of h_t. We basically have all these partials

of all the h_t’s with respect to the previous time steps,

the h’s of the previous time steps. Now, to compute each of these,

we’ll have to use the chain rule again. And now, what this means is essentially a partial derivative of a vector

with respect to another vector. Something that if we’re clever with

our backprop definitions before, we never actually have to do in practice,

right? ’cause this is a very large matrix, and we’re combining the computation with the

flow graph, and our delta messages before such that we don’t actually have to

compute explicitly, these Jacobians. But for the analysis of the math here, we’ll basically look at

all the derivatives. So just because we haven’t defined it,

the partial for each of these is essentially what’s called the Jacobian,

where you have all the partial derivatives of each element of the top

here, h_t, with respect to each element of the bottom. And so in general, if you have

a vector valued function output and a vector valued input, and you take

the partials here, you get this large matrix of all the partial derivatives

of all outputs with respect to all inputs. Any questions? All right, so basically here,

a lot of chain rule. And now, we got this beast

which is essentially a matrix. And we multiply, for each partial here, we actually have to multiply all of these,

right? So this is a large product

of a lot of these Jacobians. Now, we can try to simplify this,

and just say, all right. Let’s say, there is an upper bound. And also,

the derivative of h_j with respect to h_j-1,

with this simple definition of h, can actually be computed this way. And now,

we can essentially upper bound the norm of this matrix with

the product of the norms of the factors in this equation right here,

where we have W transpose. And if you remember our

backprop equations, you’ll see some common terms here, but we’ll actually write this out as

not just an element-wise product. We can write the same thing with

a diagonal matrix, where instead of the element-wise product we basically just put the elements into

the diagonal of a larger matrix, and zero-pad

everything that is off diagonal. Now, we multiply these two norms here. And now, we just define beta_W and

beta_h as essentially the upper bounds, some number, a single scalar for each, for how large they

could maximally be, right? We have W, we could easily compute

any kind of norm for our W, right? It’s just a matrix; compute the matrix norm,

we get a single number out. And now, basically, when we write

this all out, we put all this together, then we see that an upper bound for

the norm of this product of Jacobians is essentially the product of the bounds for each one of these

elements. And if we bound each of the elements

here in terms of their upper bounds beta, then we basically have this product,

beta_W times beta_h, taken to the (t - k) power. And so as the sequence gets longer and

longer, and t gets larger and larger, it really depends on the value

of beta whether this either blows up or gets very, very small, right? Now take the norm of this matrix,

for instance; you have

control over that norm, right? You initialize your weight matrix W with some small random values initially

before you start training. If you initialize this to a matrix that

has a norm that is larger than one, then at each back propagation step, and

the longer the time sequence goes, you basically can get a gradient

that is going to explode, because you take some value that’s larger

than one to a large power here. Say, you have 100 time steps or something,

and your norm is just two, then you have two to the 100th as an upper

bound for that gradient. And vice versa: if you initialize your matrix W in

the beginning to a bunch of small random values such that the norm of

your W is actually smaller than one, then the final gradient that will be

sent from h_t to h_k could become a very, very small number, right,

one half to the power of 100. Basically, none of the errors will arrive. The error signal gets smaller and

smaller as you go further and further backwards in time. Yeah. So if the gradient here is exploding, does

that mean a word that is further away has a bigger impact on a word that’s closer? And the answer is when

it’s exploding like that, you’ll get to NaN, not a number, in no time. And that doesn’t even become a practical

issue because the numbers will literally become not a number,

because it’s too large a value to compute. And we’ll have to think

of ways to combat that. It turns out the exploding gradient

problem has some really great hacks that make it easier to deal with than

the vanishing gradient problem. And we’ll get to those in a second. All right, so now,

you might say this could be a problem. Now, why is the vanishing gradient

problem an actual problem in practice? Well, again, it basically prevents

us from allowing a word that appears far back in the past

to have any influence on what we’re trying to predict in

terms of the next word. And so here are a couple of examples from just

language modeling where that is a real problem. So let’s say, for instance,

you have “Jane walked into the room. John walked in too. It was late in the day. Jane said hi to ___.” Now, you can put a probability mass of almost

one that the next word in this blank is John,

right? But if now,

each of these words has a word vector, you feed it into the hidden state,

you compute this. And now, you want the model to pick up

the pattern that if somebody met somebody else, and all this complex stuff, and then they said hi to, then

the next thing is the name. You wanna put a very high probability

on it, but you can’t get your model to actually send that error signal way

back over here, to now modify the hidden state in a way that would allow you

to give John a high probability. And really, this is a large problem in

any kind of time sequence that you have. And many people might

intuitively say, well, language is mostly a sequence problem,

right? You have words that appear

from left to right or in some temporal order as we speak. And so this is a huge problem. And now we’ll have a little bit

of code that we can look into. But before that we’ll have

the awesome Shayne give us a little bit of an

intermission.>>Hi, so let’s take a short break

from recurrent neural networks to talk about transition-based

dependency parsing, which is exactly what you guys saw

this time last week in lecture. So just as a recap, a transition-based

dependency parser is a method of taking a sentence and

turning it into a dependency parse tree. And you do this by looking at

the state of the sentence and then predicting a transition. And you do this over and over again in a greedy fashion until

you have a full transition sequence, which itself encodes the dependency

parse tree for that sentence. So I wanna show you how to get from

the model that you’ll be implementing in your assignment two question two, which you’re hopefully working

on right now, to SyntaxNet. So what is SyntaxNet? SyntaxNet is a model that Google came out with and they claim

it’s the world’s most accurate parser. And its new,

fast, performant TensorFlow framework for syntactic parsing is available for

over 40 languages. The one in English is called

Parsey McParseface.>>[LAUGH]

>>So my slide seemed to have been jumbled a little bit here, but

hopefully you can read through it. So basically the baseline we’re

gonna begin with is the Chen and Manning model which came out in 2014. And Chen and Manning are respectively

your head TA and instructor. And the models that produce SyntaxNet

in just two stages of improvements, those directly modified Chen and Manning’s model, which is exactly what

you guys will be doing in assignment two. And so we’re going to focus today

on the main bulk of these changes, modifications which were

introduced in 2015 by Weiss et al. So without further ado, I’m gonna look

at their three main contributions. So the first one is they leverage

unlabeled data using something called Tri-Training. The second is that they tuned

their neural network and made some slight modifications. And the last and probably most important

is that they added a final layer on top of the model involving a structured

perceptron with beam search. So each of these seeks to solve a problem. So the first one is tri-training. So as you know, in most supervised models, they perform better the more

data that they have. And this is especially the case for

dependency parsing, where as you can imagine there are an

infinite number of possible sentences with a ton of complexity and

you’re never gonna see all of them, and you’re gonna see even some

of them very, very rarely. So the more data you have, the better. So what they did is they took

a ton of unlabeled data and two highly performing dependency parsers

that were very different from each other. And when they agreed, independently

agreed on a dependency parse tree for a given sentence, then that would

be added to the labeled data set. And so now you have ten

million new tokens of data that you can use in addition

to what you already have. And this by itself improved

a highly performing network’s performance by 1%, as measured by

the unlabeled attachment score. So the problem here was not having

enough data for the task and they improved it using this. The second augmentation they made

was by taking the existing model, which is the one you

guys are implementing, which has an input layer

consisting of the word vectors. The vectors for the part of speech tags

and the arc labels with one hidden layer and one soft max layer predicting which

transition and they changed it to this. Now this is actually pretty much the same

thing, except for three small changes. The first is that there are two hidden layers

instead of one hidden layer. The second is that they used

a ReLU nonlinearity function instead of the cube nonlinearity function. And the third and most important is

that they added a perceptron layer on top of the soft max layer. And notice that the arrows,

that it takes in as input the outputs from all

the previous layers in the network. So this perceptron layer wants

to solve one particular problem, and this problem is that greedy algorithms

aren’t able to really look ahead. They make short term decisions and as a result they can’t really

recover from one incorrect decision. So what they said is, let’s allow

the network then to look ahead and so we’re going to have a tree

which we can search over and this tree is the tree of all the possible

partial transition sequences. So each edge is a possible transition

from the state that you’re at. As you can imagine, even with three transitions your tree

is gonna blossom very, very quickly and you can’t look that far ahead and

explore all of the possible branches. So what you have to do

is prune some branches. And for that they use beam search. Now beam search is only

gonna keep track of the top K partial transition

sequences up to a depth of M. Now how do you decide which K? You’re going to use a score computed

using the perceptron weights. You guys probably have a decent idea

at this point of how perceptron works. The exact function they used

is shown here, and I’m gonna leave up the annotations so you can take

a look at it later if you’re interested. But basically those are the three

things that they did to solve the problems with the previous

Chen & Manning model. So in summary, Chen & Manning had

an unlabeled attachment score of 92%, already phenomenal performance. And with those three changes,

they boosted it to 94%, and then there’s only 0.6%

left to get you to SyntaxNet, which is Google’s 2016

state of the art model. And if you’re curious what they did to get

that 0.6%, take a look at Andor et al.’s paper, which uses global normalization

instead of local normalization. So the main takeaway, and

it’s pretty straight forward but I can’t stress it enough, is when you’re

trying to improve upon an existing model, you need to identify the specific

flaws that are in this model, in this case the greedy algorithm, and

solve those problems specifically. In this case they did that

using a semi-supervised method with unlabeled data. They tuned the model better, and they used the structured

perceptron with beam search. Thank you very much.>>[APPLAUSE]
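As a rough illustration of the beam search idea from the presentation, here is a minimal Python sketch. It is not the paper's implementation: the `expand` and `score` callbacks, the toy transition names, and the toy weights are all assumptions made for this example, with `score` standing in for the perceptron score.

```python
import heapq

def beam_search(expand, score, beam_width, max_depth):
    """Keep only the top `beam_width` partial sequences at each depth.

    expand(seq) -> iterable of possible next transitions (hypothetical callback)
    score(seq)  -> float, higher is better (stand-in for the perceptron score)
    """
    beam = [[]]  # start from a single empty transition sequence
    for _ in range(max_depth):
        # Extend every surviving sequence by every possible transition.
        candidates = [seq + [t] for seq in beam for t in expand(seq)]
        if not candidates:
            break
        # Prune: keep only the K best-scoring partial sequences.
        beam = heapq.nlargest(beam_width, candidates, key=score)
    return beam

# Toy setup (assumed): three transitions with fixed per-transition weights.
TOY_WEIGHTS = {"SHIFT": 1.0, "LEFT-ARC": 0.5, "RIGHT-ARC": -1.0}

def toy_expand(seq):
    # Every transition is always allowed in this toy example.
    return list(TOY_WEIGHTS)

def toy_score(seq):
    # Score of a sequence is the sum of its transition weights.
    return sum(TOY_WEIGHTS[t] for t in seq)

best = beam_search(toy_expand, toy_score, beam_width=2, max_depth=3)
```

With three transitions and depth 3, the full tree has 27 leaves, but the beam never holds more than `beam_width` sequences at a time, which is the pruning the presentation describes.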

>>Kind of awesome. You can now look at these

kinds of pictures and you totally know what’s going on. And in like state of the art stuff

that the largest companies in the world publish. Exciting times. All right, so we’re gonna go through a little bit of

like a practical Python notebook sort of implementation that shows you a simple

version of the vanishing gradient problem. Where we don’t even have a full recurrent

neural network, we just have a simple two layer neural network, and even in

those kinds of networks you will see that the error that you start at

the top and the norm of the gradients as you go down through your network,

the norm is already getting smaller. And if you remember these were the two

equations where I said if you get to the end of those two equations you know

all the things that you need to know, and you’ll actually see these three

equations in the code as well. So let’s jump into this. I don’t see it. Let me get out of the presentation. All right, better, all right. Now, zoom in. So here, we’re going to define

a super simple problem. This is code that we started

in 231N (with Andrej), and we just modified it to

make it even simpler. So let’s say our data set,

to keep it also very simple, is just this kind of

classification data set. Where we have basically three classes,

the blue, yellow, and red. And they’re basically in

a spiral cluster form. We’re going to define our

simple nonlinearities. You can kind of see it as a solution

almost to parts of the problem set, which is why we’re only showing it now. And we’ll put this on the website too,

so no worries. You can visit later. But basically, you could define here f,

our different nonlinearities, element-wise, and the gradients for them. So this is f and

f prime if f is a sigmoid function. We’ll also look at the relu, the other

nonlinearity that’s very popular. And here, we just have the maximum between

0 and x, and very simple function. Now, this is a relatively

straight forward definition and implementation of this simple

three layer neural network. Has this input, here our nonlinearity,

our data x, just these points in two dimensional space, the class,

it’s one of those three classes. We’ll have this model here,

we have our step size for SGD, and our regularization value. Now, these are all our parameters,

w1, w2 and w3 for all the outputs, and

variables of the hidden states. Two sets is bigger, all right.>>[LAUGH]
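The element-wise nonlinearities and their gradients mentioned above can be sketched in NumPy roughly as follows. This is not the notebook's actual code; the function names are assumptions chosen for this example.

```python
import numpy as np

def sigmoid(x):
    # Element-wise logistic function f(x) = 1 / (1 + exp(-x)).
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # f'(x) = f(x) * (1 - f(x)); its largest value is 0.25, at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

def relu(x):
    # Element-wise maximum between 0 and x.
    return np.maximum(0.0, x)

def relu_grad(x):
    # 1 where the unit is active (x > 0), 0 elsewhere.
    return (x > 0).astype(float)

x = np.array([-2.0, 0.0, 2.0])
activations = {"sigmoid": sigmoid(x), "relu": relu(x)}
gradients = {"sigmoid": sigmoid_grad(x), "relu": relu_grad(x)}
```

Because the sigmoid derivative never exceeds 0.25, every sigmoid layer that the error passes through can only shrink it, which is one way to see where the smaller gradient norms in the lower layers come from.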

>>All right, now, if our nonlinearity is the relu, then we have here relu,

and we just input x, multiply it. And in this case,

your x can be the entirety of the dataset, cuz the dataset’s so small that, instead of

mini-batches, we can essentially do a full batch. Again, if you have realistic datasets,

you wouldn’t wanna do full batch training, but we can get away with it here. It’s a very tiny dataset. We multiply w1 times x

plus our bias terms, and then we have our element-wise

rectified linear units or relu. Then we compute layer two,

same idea. But now, its input instead of

x is the previous hidden layer. And then we compute our scores this way. And then here, we’ll normalize

our scores with the softmax. Just exponentiate our scores,

sum them. So very similar to the equations

that we walked through. And now,

it’s just basically an if statement. Either we have used relu

as our activations, or we use a sigmoid, but

the math inside is the same. All right, now,

we’re going to compute our loss. Our good friend, the simple average cross

entropy loss plus the regularization. So here,

we have negative log of the probabilities, we sum them up over all the elements. And then here, we have our regularization

as the L2, standard L2 regularization. And we just basically sum up the squares

of all the elements in all our parameters, and I guess it does cut off a little bit. Let me zoom in. All three have the same

amount of regularization, and we add that to our final loss. And now, every 1,000 iterations,

we’ll just print our loss and see what’s happening. And this is something you

always want to do too. You always wanna visualize,

see what’s going on. And hopefully,

a lot of this now looks very familiar. Maybe you implemented it not quite as

efficiently in problem set one, but maybe you have, and

then it’s very, very straightforward. Now, that was the forward propagation,

we can compute our error. Now, we’re going to go backwards, and we’re computing our delta

messages first from the scores. Then we have here, back propagation. And now,

we have the hidden layer activations, transposed times delta

messages to compute w. Again, remember, we have always for

each w here, we have this outer product. And that’s the outer

product we see right here. And now, the softmax was the same

regardless of whether we used a relu or a sigmoid. Let’s walk through the sigmoid here. We now, basically, have our delta scores,

and have here the product. So this is exactly computing delta for

the next layer. And that’s exactly this equation here,

just in Python code. And then again,

we’ll have our updates dw, which is, again, this outer product right there. So it’s a very nice

sort of equations-to-code, almost a nice one to one

mapping between the two. All right, now, we’re going to go through the network

from the top down to the first layer. Again, here, our outer product. And now, we add the derivatives for

our regularization. In this case, it’s very simple, just matrices themselves

times the regularization. And we combine all our gradients

in this data structure. And then we update all our parameters

with our step_size and SGD. All right, then we can evaluate how

well we do on the training set, so that we can basically print out

the training accuracy as we train. All right, now, we’re going to

initialize all the dimensionality. So we have there just our two

dimensional inputs, three classes. We compute our hidden sizes

of the hidden vectors. Let’s say, they’re 50, it’s pretty large. And now, we can run this. All right, we’ll train it with both

sigmoids and rectified linear units. And now,

once we wanna analyze what’s going on, we can essentially now plot some of

the magnitudes of the gradients. So those are essentially the updates as we

do back propagation through the network. And what we’ll see here is

the sum of the gradients for the first and the second layer when

we use sigmoid non-linearities. And basically here, the main takeaway

message is that blue is the first layer, and green is the second layer. So the second layer is

closer to the softmax, closer to what we’re trying to predict. And hence, its gradient is

usually larger in magnitude than the one that arrives

at the first layer. And now, imagine you do this 100 times. And you have intuitively your vanishing

gradient problem in recurrent neural networks. They’ll essentially be zero. They’re already almost half in size over the iterations when

you just had two layers. And the problem is a little less strong

when you use rectified linear units. But even there, you’re going to have

some decrease as you continue to train. All right,

any questions around this code snippet and vanishing gradient problems? No, sure. [LAUGH] That’s a good question. The question is why

are the gradients flatlining. And it’s essentially

because the dataset is so simple that you actually just

perfectly fitted your training data. And then there’s not much else to do

you’re basically in a local optimum and then not much else is happening. So yeah, so these are the outputs where

if you visualize the decision boundaries, here with the relu you

see a little bit more sort of edges, because you have sort of linear

parts of your decision boundary and the sigmoid is a little smoother,

little rounder. All right, so now you can implement a very

quick version to get an intuition for the vanishing gradient problem. Now the exploding gradient problem is,

in theory, just as bad. But in practice,

it turns out we can actually have a hack, that was first introduced by

Tomas Mikolov, and it’s very unmathematical in some ways ’cause say,

all you have is a large gradient of 100. Let’s just cap it to five. That’s it,

you just define the threshold and you say whenever the value is larger

than a certain value, just cut it. Totally not the right

mathematical direction anymore. But turns out to work very

well in practice, yep. So vanishing gradient problems,

how would you cap it? It’s like it gets smaller and

smaller, and you just multiply it? But then it’s like, it might overshoot. It might go in the completely

wrong direction. And you don’t want to have the hundredth

word unless it really matters. You can’t just make all

the hundred words or thousand words of the past

all matter the same amount. Right?

Intuitively. That doesn’t make that much sense either. So this gradient clipping solution

is actually really powerful. And then a couple years after it

was introduced, Yoshua Bengio and one of his students actually gained

a little bit of intuition and it’s something I encourage

you always to do too. Not just in the equations, where you

can write out recurrent neural network, where everything’s one dimensional,

and the math comes out easy and you gain intuition about it. But you can also, and this is what

they did here, implement a very simple recurrent neural network which

just had a single hidden unit. Not very useful for anything in practice

but now, with the single unit w and still the bias term, they can actually visualize exactly

what the error surface looks like. And oftentimes we call it the error

surface or the energy landscape or so, the landscape of

our objective function. And basically, you can see here the size of

the z axis here is the error that you have when you train

on a very simple problem. I forgot what the problem here was but it’s something very simple

like keep around this unit and remember the value and then just

return that value 50 times later. Something simple like that. And what they essentially observe

is that in this error surface or error landscape you have

these high curvature walls. And so as you do an update, each little line here you can interpret as

what happens at an SGD update step. You update your parameters. And you say, in order to minimize

my objective function right now, I’m going to change the value

of my one hidden unit and my bias term just like by this amount

to go over here, go over here. And all of a sudden you hit

these large curvature walls. And then your gradient basically blows up,

and it moves you somewhere way different. And so intuitively what happens here is, if you rescale the gradient with

the special method, then essentially you’re not going to jump to some crazy,

faraway place, but you’re just going to stay in this general area that seemed

useful before you hit that curvature wall. Yeah? So the question is, intuitively,

why wouldn’t such a trick work for the vanishing gradient problem when it does

work for the exploding gradient problem? Why does the reasoning for the vanishing case not apply to

the exploding gradient problem? So intuitively,

this is exactly the issue here. So the exploding,

as you move way too far away, you basically jump out of the area

where you, in this case here for instance, we’re getting closer and

closer to a local optimum, but the local optimum was very

close to a high curvature wall. And without the gradient clipping,

without the clipping trick, you would go way far away. Right, now, with the vanishing gradient

problem, it gets smaller and smaller. So in general clipping doesn’t make sense,

but let’s say, so that’s the obvious answer. You can’t, something gets smaller and

smaller, it doesn’t help to have a maximum and then make it, you know cut it to that

maximum ’cause that’s not the problem. It goes in the opposite direction. And so that’s kind of the most

obvious intuitive answer. Now, you could say, why couldn’t you, if it gets below

a certain threshold, blow it up? But then that would mean that. Let’s say you had. You wanted to predict the word. And now you’re 50 time steps away. And really,

the 51st doesn’t actually impact the word you’re trying to

predict at time step T, right? So you’re 50 time steps before, and

it doesn’t really modify that word. And now you’re artificially going to

blow up and make it more important. So that’s less intuitive than saying, I don’t wanna jump into some completely

different part of my error surface. The wall just comes from this is what

the error surface looks like for a very very simple recurrent neural network

with a very simple kind of problem that it tries to solve. And you can actually use most

of the networks that you have, you can try to make them

have just two parameters and then you can visualize

something like this too. In fact it’s very intuitive

sometimes to do that. When you try different optimizers,

we’ll get to those in a later lecture, like Adam or SGD with momentum,

we’ll talk about those soon. You can actually always try to visualise

that in some simple kind of landscape. This just happens to be the landscape that

this particular recurrent neural network problem has with one-hidden unit and

just a bias term. So the question is, how could we know for sure that this happens with non-linear

activations and multiple weights. So you also have some

non-linearity here in this. So that intuitively wouldn’t prevent

us from transferring that knowledge. Now, in general, it’s very hard. We can’t really visualize

very high dimensional spaces. There is actually now an interesting

new idea that was introduced, I think by Ian Goodfellow

where you can actually try to, let’s say you have your parameter space,

inside your parameter space, you have some kind of cost function. So you say my w matrices are at this value

and so on, and I have some error when all my values are here, and then I start

to optimize and I end up somewhere here. Now the problem is, we can’t

visualize it because it’s usually in realistic settings,

you have at least a million or so

parameters, sometimes 100 million. And so, something crazy might be going

on as you optimize between this. And so, because we can’t visualize it and we can’t even sub-sample it because

it’s such a high-dimensional space. What they do is they actually

draw a line between the point from where they started with their random

initialization before optimization. And end the line all the way to the point where you actually

finished the optimization. And then you can evaluate along

this line at certain intervals, you can evaluate how big your error is. And if that error changes between

two such intervals a lot, then that means we have very

high curvature in that area. So that’s one trick of how

you might use this idea and gain some intuition of

the curvature of the space. But yeah, only in two dimensions can we

get such nice intuitive visualizations. Yeah. So the question is why don’t

we just have less dependence? And the question of course,

it’s a legit question, but ideally we’ll let

the model figure this out. Ideally we’re better at

optimizing the model, and the model has in theory these

long range dependencies. In practice, they rarely ever do. In fact when you implement these, and

you can start playing around with this and this is something I

encourage you all to do too. As you implement your models you can try

to make it a little bit more interactive. Have some IPython Notebook,

give it a sequence and look at the probability of the next word. And then give it a different sequence

where you change words like ten time steps away, and

look again at the probabilities. And what you’ll often observe is that

after seven words or so, the words before actually don’t matter, especially not for

these simple recurrent neural networks. But because this is a big problem, there are actually a lot of

different kinds of solutions. And so the biggest and

best one is one we’ll introduce next week. But a simpler one is to use

rectified linear units and to also initialize both of your w’s,

the ones from hidden to hidden and the ones from the input to the hidden

state with the identity matrix. And this is a trick that I

introduced a couple years ago and then it was sort of combined

with rectified linear units. And applied to recurrent

neural networks by Quoc Le. And so the main idea here is if

you move around in your space. Let’s say you have your h, and usually we have here our whh times h,

plus whx times x. And let’s assume for now that h and

x have the same dimensionality. So then all these

are essentially square matrices. And we have here our different vectors. Now, in the standard initialization,

what you would do is you’d just have a bunch of small random values

in all the different elements of w. And what that means is

as you start optimizing, whatever x is you have some random

projection into the hidden space. Instead, the idea here is we actually

have identity initialization. Maybe you can scale it, so instead

you have a half times the identity, and what does that do? Intuitively when you combine

the hidden state and the word vector? That’s exactly right. If this is an identity initialized matrix. So it’s just, 1, 1, 1,

1, 1, 1 on the diagonal. And you multiply all of these by one half. Same as just having a half,

a half, a half, and so on. And you multiply this with this vector and

you do the same thing here. What essentially that means is that

you have a half, times that vector, plus half times that other vector. And intuitively that means in

the beginning, if you don’t know anything. Let’s not do a crazy random projection

into the middle of nowhere in our parameter space, but just average. And say, well as I move through the space

my hidden state is just a moving average of the word vectors. And then I start making some updates. And it turns out when you look here and you apply this to the very

toy problem of MNIST, which we don’t really have to go into,

but it’s a bunch of small digits. And they’re trying to basically predict what digit it is by going over

all the pixels in a sequence. Instead of using other kinds of neural networks like

convolutional neural networks. And basically we look

at the test accuracy. These are very long time sequences. And the test accuracy for

these is much, much higher. When you use this identity initialization

instead of random initialization, and also using rectified linear units. Now more importantly for

real language modeling, we can compare recurrent neural

networks in this simple form. So we had the question before like,

do these actually matter or did I just kind of describe single

layer recurrent neural networks for the class to describe the concept. And here we actually have these

simple recurrent neural networks, and we basically compare. This one is called Kneser-Ney with 5

grams, so a lot of counts, and some clever back off and smoothing techniques which

we won’t need to get into for the class. And we compare these on

two different corpora and we basically look at the perplexity. So these are all perplexity numbers,

and we look at the neural network or the neural network that’s

combined with Kneser-Ney, assuming probability estimates. And of course when you combine the two

then you don’t really get the advantage of having less RAM. So ideally this by itself would do best,

but in general combining the two

used to still work better. These are results from five years ago,

and the field moves very quickly. I think the best results now are pure

neural network language models. But basically we can see

that compared to Kneser-Ney, even back then, the neural

network actually works very well. And has much lower perplexity than just

the Kneser-Ney or just count-based models. Now one problem that you’ll

observe in a lot of cases, is that the softmax is really,

really large. So your word vectors are one

set of parameters, but your softmax is another set of parameters. And if your hidden state is 1000, and let’s say you have

100,000 different words. Then that’s 100,000 times 1000 dimensional

matrix that you’d have to multiply with the hidden state at

every single time step. So that’s not very efficient, and so one way to improve this is with

a class-based word prediction. Where we first try to predict some

class that we can come up with, and there are different kinds

of things we can do. In many cases you can sort,

just the words by how frequent they are. And say the thousand most frequent

words are in the first class, the next thousand most frequent

words in the second class and so on. And so you first basically classify, try

to predict the class based on the history. And then you predict the word inside

that class, based on that class. And so this one is only

a thousand dimensional, and so you can basically do this. And now the more classes

the better the perplexity, but also the slower the speed

the less you gain from this. And especially at training time

which is what we see here, this makes a huge difference. So if you have just very few classes,

you can actually reduce the number here of seconds

that each epoch takes by almost 10x compared to

having more classes or even more than 10x if you

have the full softmax. And even test time is faster, cuz now

you only essentially evaluate the word probabilities for the classes that

have a very high probability here. All right, one last trick and

this is maybe obvious to some but it wasn’t obvious to others even in

the past when people published on this. But you essentially only need

to do a single backward pass through the sequence, once you accumulate all the deltas

from each error at each time step. So looking at this figure,

really quick again. Here, essentially you have

one forward pass where you compute all the hidden states and

all your errors, and then you only have a single

backwards pass, and as you go backwards in time you keep accumulating

all the deltas of each time step. And so originally people said, for this

time step I’m gonna go all the way back, and then I go to the next time step,

and then I go all the way back, and then the next step, and all the way back,

which is really inefficient, and is essentially the same as combining all the deltas in one clean

back propagation step. And again, it’s kind of an intuitive sort of

implementation trick, but people gave that the term back

propagation through time. All right, now that we have these

simple recurrent neural networks, we can use them for

a lot of fun applications. In fact, take the named entity recognition

that we used as an example with the Window model. In the Window model, you could only

condition the probability of this being a location, a person, or an organization

based on the words in that Window. With the recurrent neural network

you can in theory take and condition these probabilities

on a lot larger context sizes. And so

you can do Named Entity Recognition (NER), you can do entity level sentiment in

context, so for instance you can say. I liked the acting, but

the plot was a little thin. And you can say I want to now for

acting say positive, and predict the positive class for that word. Predict the null class, and

no sentiment, for all the other words, and then plot should get

negative class label. Or you can classify opinionated

expressions, and this is what researchers at Cornell did, where they

essentially used RNNs for opinion mining and essentially wanted

to classify whether each word in a relatively small corpus here is

either the direct subjective expression or the expressive subjective expression,

so either direct or expressive. So basically this is direct

subjective expressions, explicitly mention some private state or

speech event, whereas the ESEs just indicate the sentiment or emotion without

explicitly stating or conveying them. So here’s an example, like the committee as usual has

refused to make any statements. And so you want to classify

as usual as an ESE, and basically give each of these

words here a certain label. And this is something you’ll actually

observe a lot in sequence tagging tasks. Again, it’s all the same model,

the recurrent neural network. You have the soft max at every time step. But now the soft max actually

has a set of classes that indicate whether a certain

expression begins or ends. And so here you would basically

have this BIO notation scheme where you have the beginning,

the inside, or a null (outside) token. It’s not any of the expressions

that I care about. So here you would say for instance,

as usual is an overall ESE expression, so it begins here, and

it’s in the middle right here. And then these are neither ESEs nor DSEs. All right, now they started with

the standard recurrent neural network, and I want you to at some point be able

to glance over these equations, and just say I’ve seen this before. It doesn’t have to be W superscript HH,

and so on. But whenever you see, the summation

order of course, doesn’t matter either. But here, they use W, V, and

U, but then they defined, instead of writing out softmax,

they write g here. But once you look at these equations,

I hope that eventually you’re just like it’s just a recurrent neural network,

right? You have here

your hidden to hidden matrix. You have your input to hidden matrix, and

here you have your softmax weights U. So same idea, but these are the actual

equations from this real paper that you can now kind of read and immediately sort

of have the intuition of what happens. All right, with a unidirectional

recurrent neural network, if we try to make the prediction here

of whether this is an ESE or whatever, named entity recognition,

any kind of sequence labelling task, what’s the problem with

this kind of model? What do you think as we go

from left to right only? What do you think could be a problem for

making the most accurate predictions? That’s right. Words that come after

the current word can’t be helping us to make accurate

predictions at that time step, right? Cuz we only went from left to right. And so one of the most common

extensions of recurrent neural networks is actually to do bidirectional

recurrent neural networks where instead of just going from left to

right, we also go from right to left. And it’s essentially the exact same model. In fact, you could implement it by

changing your input and just reversing all the words of your input, and

then it’s exactly the same thing. And now, here’s the reason why they

don’t have superscripts with WHH, cuz now they have these

arrows that indicate whether you’re going from left to right,

or from right to left. And now, they basically have

this concatenation here, and in order to make a prediction at a certain

time step t they essentially concatenate the hidden states from both the left

direction and the right direction. And those are now the feature vectors. And this vector ht coming from the left,

has all the context, or, again, seven plus words,

depending on how well you train your RNN, from all the words on the left; ht from the right has all the context

from the words on the right, and that is now your feature vector to make an

accurate prediction at a certain time step. Any questions around bidirectional

recurrent neural networks? You’ll see these a lot in all

the recent papers you’ll be learning, in various modifications. Yeah. Have people tried

Convolutional Neural Networks? They have, and we have a special lecture

where we will talk a little bit about Convolutional Neural Networks. So you don’t necessarily have a cycle,

right? You just go, basically as you implement

this, you go once all the way from the left, and you don’t have any interactions with

the step that goes from the right. You can compute your

forward ht’s here for that direction, which

are only coming from the left. And the ht from the other direction

you can compute separately. In fact, you could parallelize this if

you want to be super efficient, and have one core

implement the left direction, and one core implement the right direction. So in that sense it doesn’t make

the vanishing gradient problem worse. But, of course,

just like any recurrent neural network,

it does have the vanishing

gradient problem and the exploding gradient problem, and you have

to be clever about clipping it and so on. Yeah. We call them standard feedforward neural networks or

Window based feedforward neural networks. And now we have recurrent neural networks. And this is really one of

the most powerful families, and we’ll see lots of extensions. In fact, if there’s no other

question we can go even deeper. It is after all deep learning. And so, now you’ll observe [LAUGH] we

definitely had to skip that superscript. And we have different characters here for each of our matrices, because,

instead of just going from left to right, you can also have a deep neural

network at each time step. And so now, to compute the ith

layer at a given time step, you essentially again, have only the things

coming from the left that modify it, but you don’t just take in the vector from the

left, you also take the vector from below. So, in the simplest definition that is

just your x, your input vector right? But as you go deeper you now also have

the previous hidden layer’s input. So the question is, why do we feed the hidden layer into

another hidden layer instead of the y? In fact, you can actually have so

called short circuit connections, too, where each of these h’s can

go directly to the y as well. And so here in this figure you see

that only the top ones go into the y. But you can actually have short circuit

connections where y here has as input not just ht from the top layer,

noted here as capital L, but the concatenation of all the h’s. It’s just another way to make

this monster even more monstrous. And in fact there are a lot of modifications,

in fact, Shayne has a paper on arXiv right now, a search space

odyssey type thing where you have so many different kinds of knobs that you can

tune for even more sophisticated recurrent neural networks of the type that we’ll

introduce next week that, it gets a little unwieldy and it turns out a lot of

the things don’t matter that much, but each can kind of give you a little

bit of a boost in many cases. So if you have three layers,

you have four layers, what’s the dimensionality

of all the layers and the various different kinds of connections

and short circuit connections. We’ll introduce some of these, but

in general this is a pretty decent model, and we’ll eventually abstract away from

how we compute that hidden state, and that will be a more complex kind of cell

type that we’ll introduce next Tuesday. Do we have one more question? So now how do we evaluate this? It’s very important to evaluate

your problems correctly, and we actually talked about this before. When you have a very imbalanced data set,

where some of the classes appear very frequently and others are not very

frequent, you don’t wanna use accuracy. In fact, in these kinds of sentences,

you often observe (this is an extreme one where you have a lot of ESEs and

DSEs) that in many cases it’s just content, standard sort of non-sentiment context and words, and so a lot of these

are actually O, meaning they have no label. And so it’s very important to use F1, and we basically had this question also after

class, but it’s important for all of you to know because the F1 metric is really

one of the most commonly used metrics. And it’s essentially just the harmonic

mean of precision and recall. Precision is just the true

positives divided by true positives plus false positives and recall is just true positives divided

by true positives plus false negatives. And then you have here the harmonic

mean of these two numbers. So intuitively, you can be very

accurate by always saying something or have a very high recall for

a certain class, but if you always miss another class,

that would hurt you a lot. And now here’s an evaluation

that you should also be familiar with where basically this is something

I would like to see in a lot of your project reports too as you analyze the

various hyperparameters that you have. And so one thing they found here is they

have two different data set sizes that they train on,

in many cases if you train with more data, you basically do better. But then also (this is

the depth that we had here, the

number L for all these different layers) it’s not always the case

that more layers are better. In fact here, the highest performance

they get is with three layers, instead of four or five. All right, so let’s recap. Recurrent neural networks, best deep learning model family that

you’ll learn about in this class. Training them can be very hard. Fortunately, you understand

back propagation now. You can gain an intuition of

why that might be the case. We’ll in the next lecture extend

them some much more powerful models the Gated Recurring Units or LSTMs,

and those are the models you’ll see all over the place in all the state

of the art models these days. All right. Thank you.
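As a companion to the deep RNN part of this lecture, here is a minimal sketch of one time step of a stacked recurrent network: each layer combines the vector from the left (its own hidden state at the previous time step) with the vector from below (the input x for the first layer, the lower hidden layer otherwise). The names `W`, `V`, `b`, `deep_rnn_step`, and `y_input`, and the choice of tanh as the nonlinearity, are assumptions made for illustration and are not taken from the slides. Plain Python lists are used to keep the sketch dependency-free.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def deep_rnn_step(x_t, h_prev, params):
    """Compute all hidden layers at one time step of a stacked RNN.

    x_t:    input vector at time t
    h_prev: list of hidden vectors from time t-1, one per layer
    params: list of (W, V, b) per layer; W maps the vector from below,
            V maps the vector from the left (hypothetical names)
    """
    h_new = []
    below = x_t  # the first layer takes the input vector x from below
    for (W, V, b), h_left in zip(params, h_prev):
        pre = [a + c + d
               for a, c, d in zip(matvec(W, below), matvec(V, h_left), b)]
        h = [math.tanh(z) for z in pre]  # tanh is an assumed nonlinearity
        h_new.append(h)
        below = h  # deeper layers take the hidden layer below as input
    return h_new

def y_input(h_layers, short_circuit=False):
    """Vector fed to y: the top layer only, or the concatenation of all layers."""
    if short_circuit:
        return [v for h in h_layers for v in h]
    return list(h_layers[-1])

# Tiny two-layer example with hand-picked weights.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
params = [(identity, zeros, [0.0, 0.0]), (identity, zeros, [0.0, 0.0])]
h_prev = [[0.0, 0.0], [0.0, 0.0]]
h_t = deep_rnn_step([0.5, -0.5], h_prev, params)
```

Calling `y_input(h_t)` gives only the top layer, as in the figure, while `y_input(h_t, short_circuit=True)` gives the concatenation of all the h’s described for short circuit connections.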
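The F1 definition from the evaluation part of this lecture can be sketched in a few lines. The function name `precision_recall_f1` and the toy label sequences are assumptions made for illustration; only the formulas (precision, recall, and their harmonic mean) come from the lecture. The example shows why accuracy is misleading when most tokens are O.

```python
def precision_recall_f1(gold, pred, positive):
    """Precision, recall, and F1 for one class over aligned label sequences."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if p == positive and g == positive)
    fp = sum(1 for g, p in pairs if p == positive and g != positive)
    fn = sum(1 for g, p in pairs if p != positive and g == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean of precision and recall; defined as 0 when both are 0.
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Toy imbalanced data: eight O tokens and two DSE tokens (made-up example).
gold = ["O"] * 8 + ["DSE", "DSE"]

# A tagger that always predicts O looks fine on accuracy but misses every DSE.
always_o = ["O"] * 10
accuracy_always_o = sum(g == p for g, p in zip(gold, always_o)) / len(gold)
f1_always_o = precision_recall_f1(gold, always_o, "DSE")[2]

# A tagger with one correct DSE, one false positive, and one miss.
mixed = ["O"] * 7 + ["DSE", "DSE", "O"]
p_mixed, r_mixed, f1_mixed = precision_recall_f1(gold, mixed, "DSE")
```

Here the always-O tagger reaches 80% accuracy with an F1 of 0 on DSE, which is the failure mode that motivates reporting F1 on imbalanced data sets.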