Lecture 6: Dependency Parsing

[SOUND] Stanford University. >>We'll get started
again with CS224N, Natural Language Processing
with Deep Learning. So, you’re in for a respite,
or a change of pace today. So for today’s lecture, what we’re
principally going to look at is syntax, grammar and dependency parsing. So my hope today is to teach
you in one lecture enough about dependency grammars and
parsing that you’ll all be able to do the main part of
Assignment 2 successfully. So quite a bit of the early part of
the lecture is giving a bit of background about syntax and dependency grammar. And then it’s time to talk about
a particular kind of dependency parsing,
transition-based dependency parsing. And then it's probably only in
the last kind of 15 minutes or so of the lecture that we’ll then get back
into specifically neural network content. Talking about a dependency parser that
Danqi and I wrote a couple of years ago. Okay, so for general reminders, I hope you’re all really aware
that Assignment 1 is due today. And I guess by this stage you’ve either
made good progress or you haven’t. But to give my,
Good housekeeping reminders, I mean it seems like every year there
are people that sort of blow lots of late days on the first assignment for
no really good reason. And that isn’t such
a clever strategy [LAUGH]. So hopefully [LAUGH] you are well
along with the assignment, and can aim to hand it in before
it gets to the weekend. Okay, then secondly today is also the day
that the new assignment comes out. So maybe you won’t look at it
till the start of next week but we’ve got it up ready to go. And so that’ll involve a couple of new
things and in some respects probably for much of it you might not want to start
it until after next Tuesday’s lecture. So two big things will be different for
that assignment. Big thing number one is we’re gonna do
assignment number two using TensorFlow. And that’s the reason why, quite apart
from exhaustion from assignment one, why you probably don't wanna start
it on the weekend is because on Tuesday, Tuesday’s lecture’s gonna be
an introduction to TensorFlow. So you’ll really be more qualified
then to start it after that. And then the other big different thing
in assignment two is we get into some sort of more substantive
natural language processing content. In particular, you guys are going to build
neural dependency parsers, and the hope is that you can learn about everything
that you need to know to do that today. Or perhaps looking at some of
the readings on the website, if you don’t get quite
everything straight from me. Couple more comments on things. Okay, so for final projects. We’re going to sort of post,
hopefully tomorrow or on the weekend, a kind of an outline of what’s in
assignment four, so you can have sort of a more informed meaningful choice between
whether you want to do assignment four, or come up with a final project. The area of assignment four, if you do it, is going to be question answering
over the SQuAD dataset. But we’ve got kind of a page and a half
description to explain what that means, so you can look out for that. But if you are interested in
doing a final project, again, we’ll encourage people to come and meet
with one of the final project mentors or find some other well qualified person
around here to be a final project mentor. So what we’re wanting is that sort of,
everybody has met with their final project mentor
before putting in an abstract. And that means it’d be really great for people to get started doing
that as soon as possible. I know some of you have already
talked to various of us. For me personally, I’ve got final
project office hours tomorrow from 1 to 3 pm so
I hope some people will come by for those. And again, sort of as Richard mentioned, not everybody can possible have Richard or
me as the final project mentor. And besides, there’s some really big
advantages of having some of the PhD student TAs as final project mentors. Cuz really, for things like spending
time hacking on TensorFlow, they get to do it much more than I do. And so Danqi, Kevin, Ignacio, Arun, they've had tons of experience
doing NLP research using deep learning. And so they'd also be great mentors,
and you can look them up for final project advice. The final thing I just want to touch
on is we clearly had a lot of problems, I realize, at keeping up and
coping with people in office hours, and queue status has just
regularly got out of control. I’m sorry that that’s
been kind of difficult. I mean honestly we are trying to work and
work out ways that we can do this better, and we’re thinking of sort of unveiling
a few changes for doing things for the second assignment. If any of you peoples have any better
advice as to how things could be organized so that they could work
better feel free to send a message on Piazza with suggestions
of ways of doing it. I guess yesterday I ran down
Percy Liang and said, Percy, Percy, how do you do it for CS221? Do you have some big
secrets to do this better? But unfortunately I seem to come away
with no big secrets cuz he sort of said: “we use queue status and we use the Huang
basement”, what else are you meant to do? So I’m still looking for
that divine insight [LAUGH] that will tell me how to get this
problem better under control. So if you’ve got any good ideas,
feel free to share. But we’ll try to get
this under control as much as we can for the following weeks. Okay, any questions, or
should I just go into the meat of things? Okay. All right, so what we’re going
to want to do today is work out how to put structures over
sentences in some human language. All the examples I'm going to show are for
English, but in principle, the same techniques can be applied to
any language, where these structures are going to sort of reveal
how the sentence is made up. So that the idea is that sentences and
parts of sentences have some kind of structure and there are sort of regular
ways that people put sentences together. So, we can sort of start off with very
simple things that aren’t yet sentences like “the cat” and “a dog”, and they
seem to kind of have a bit of structure. We have an article, or
what linguists often call a determiner, that’s followed by a noun. And then, well, for those kind of phrases, which get called noun
phrases that describe things, you can kind of make them bigger and there
are sort of rules for how you can do that. So you can put adjectives in
between the article and the noun. You can say the large dog or a barking dog
or a cuddly dog, and things like that. And, well, you can put things like what I
call prepositional phrases after the noun so you can get things like “a large dog
in a crate” or something like that. And so, traditionally what linguists and
natural language processors have wanted to do is describe
the structure of human languages. And there are effectively two key tools
that people have used to do this. One of these key tools, and I think in general the only one
you have seen a fraction of, is what in computer science terms
is most commonly referred to as context-free grammars, which are often referred to
by linguists as phrase structure grammars. That goes along with
the notion of constituency, and so for that what we are doing is writing
these context-free grammar rules. At least if you are a Stanford
undergrad or something like that, I know that way back in CS103, you spent a whole lecture learning about
context-free grammars, and their rules. So I could start writing some rules that
might start off saying a noun phrase goes to a determiner followed by a noun. Then I realized that noun phrases
would get a bit more complicated. And so I came up with this new rule
that says: a noun phrase goes to a determiner, an optional adjective, a noun, and then
an optional prepositional phrase, where a prepositional phrase is a preposition
followed by another noun phrase. Because, I can say a crate,
or, a large crate. Or, a large crate by the door. And then, well I can go along
even further, and I could say, you know a large barking
dog by the door in a crate. So then I noticed, wow I can put
in multiple adjectives there and I can stick on multiple prepositional
phrases, so I'm using that star, the Kleene star that you also see in regular expressions, to say that
you can have zero or any number of these. And then I can start making a bigger
thing like, talk to the cuddly dog. Or, look for the cuddly dog. And, well, now I’ve got a verb
followed by a prepositional phrase. And so, I can sort of build
up a constituency grammar.
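If it helps to see those rules written down, here is a minimal sketch of that toy grammar using NLTK. This is my own illustration, not anything from the lecture slides; the particular category names and word lists are just assumptions for the example.

```python
import nltk

# Roughly the rules built up above: NP -> Det (Adj) N (PP), PP -> P NP.
# Recursion on NP PP plays the role of the Kleene star.
toy_grammar = nltk.CFG.fromstring("""
  S   -> VP
  VP  -> V PP | V NP
  NP  -> Det N | Det Adj N | NP PP
  PP  -> P NP
  Det -> 'the' | 'a'
  Adj -> 'large' | 'barking' | 'cuddly'
  N   -> 'dog' | 'cat' | 'crate' | 'door'
  V   -> 'talk' | 'look'
  P   -> 'to' | 'for' | 'by' | 'in'
""")

parser = nltk.ChartParser(toy_grammar)
for tree in parser.parse("look for the cuddly dog by the door".split()):
    print(tree)
```

So that's one way of organizing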
the structure of sentences and, you know,
in 20th century, dragging into 21st century, America, this has been
the dominant way of doing it. I mean it’s what you see mainly in your
Intro CS class when you get taught about regular languages and context free
languages and context sensitive languages. You’re working up the Chomsky
hierarchy. But Noam Chomsky did not actually invent
the Chomsky hierarchy to torture CS undergrads with formal content
to fill the CS103 class. The original purpose of the Chomsky
hierarchy was actually to understand the complexity of human languages and
to make arguments about their complexity. If you look more broadly, and
sorry, it’s also dominated, sorta linguistics in America in the last
50 years through the work of Noam Chomsky. But if you look more broadly than that,
this isn't actually the dominant form of syntactic description
that is being used for understanding of
the structure of sentences. So what else can you do? So there is this other alternative view of
linguistic structure which is referred to as dependency structure, and
what you're doing with dependency structure is that you're describing the structure
of a sentence by taking each word and saying what it's a dependent on. So, if a word
kind of modifies or is an argument of another word, you're
saying it's a dependent of that word. So, barking dog, barking is a dependent
of dog, because it's a modifier of it. Large barking dog, large is a modifier of
dog as well, so it’s a dependent of it. And dog by the door, so the by the door
is somehow a dependent of dog. And we’re putting
a dependency between words, and we normally indicate those
dependencies with arrows. And so we can draw dependency
structures over sentences that say how they're represented as well. And, right in the first class,
I gave examples of ambiguous sentences. A lot of those ambiguous sentences, we
can think about in terms of dependencies. So do you remember this one,
scientists study whales from space. Well that was an ambiguous headline. And well, why is it an ambiguous headline? Well it’s ambiguous because
there’s sort of two possibilities. So in either case there’s
the main verb study. And it’s the scientist that’s studying,
that’s an argument of study, the subject. And it’s the whales that are being
studied, so that’s an argument of study. That’s the object. But the big difference is then,
what are you doing with the from space. You saying that it’s modifying study,
or are you saying it’s modifying whales? And like, if you sort of just
quickly read the headline It sounds like it’s the bottom one, right? It’s whales from space. And that sounds really exciting. But [LAUGH] what the article was meant to
be about was, really, that they were being able to use satellites to
track the movements of whales. And so it’s the first one where the,
from space, is modifying how they're being studied. And so thinking about ambiguities of
sentences, many of them can be thought about in terms of these dependency
structures, as to what's modifying what.
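Just to make that concrete, here is a tiny sketch, in plain Python, of the two competing dependency analyses of that headline. The dependent-to-head encoding is my own hypothetical illustration, and "from" is attached to "space" as a case-marker-like dependent, in the universal dependencies style discussed later in the lecture.

```python
sentence = ["scientists", "study", "whales", "from", "space"]

# Reading 1: "from space" modifies "study" (they study the whales from space).
# Reading 2: "from space" modifies "whales" (whales that are from space).
# Each analysis maps a dependent's index to its head's index.
study_from_space  = {0: 1, 2: 1, 4: 1, 3: 4}
whales_from_space = {0: 1, 2: 1, 4: 2, 3: 4}

for heads in (study_from_space, whales_from_space):
    print([(sentence[h], "->", sentence[d]) for d, h in sorted(heads.items())])
```

And this is just a really common thing in natural language, because these kinds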
of questions of what modifies what, really dominate a lot of
questions of interpretation. So, here’s the kind of sentence you find when you’re reading
the Wall Street Journal every morning. The board approved its acquisition by
Royal Trustco Limited of Toronto for $27 a share at its Monthly meeting. And as I’ve hopefully indicated by
the square brackets, if you look at the structure of this sentence, it sort
of starts off as subject, verb, object. The board approved its acquisition, and then everything after that is a whole
sequence of prepositional phrases. By Royal Trustco Ltd, of Toronto, for
$27 a share, at its monthly meeting. And well, so then there’s the question of,
what’s everyone modifying? So the acquisition is by
Royal Trustco Ltd, so that’s, by Royal Trustco Ltd is modifying
the thing that immediately precedes that. And of Toronto is modifying the company,
Royal Trustco Limited, so that’s modifying the thing that
comes immediately preceding it. So you might think this is easy, everything just modifies the thing
that’s coming immediately before it. But that, then stops being true. So, what’s for $27 a share modifying? Yeah so
that's modifying the acquisition, so then we're jumping back
a few candidates and saying it's modifying the acquisition. And
then, actually, at its monthly meeting: that wasn't the Toronto, the Royal
Trustco Ltd, or the acquisition, that was when the approval was happening, so
that jumps all the way back up to the top. So in general the situation is that if
you’ve got some stuff like a verb and a noun phrase, then you start
getting these prepositional phrases. Well, the prepositional
phrase can be modifying, either this noun phrase or the verb. But then when you get to
the second prepositional phrase. Well, there was another noun phrase
inside this prepositional phrase. So, now there's three choices. It can be modifying this noun phrase,
that noun phrase or the verb phrase. And then we get to another one. So it’s now got four choices. And you don’t get sort of
a completely free choice, cuz you do get a nesting constraint. So once I’ve had for $27 a share
referring back to the acquisition, the next prepositional phrase has to,
in general, refer to either the acquisition or
approved. I say in general because
there are exceptions to that. And I’ll actually talk about that later. But most of the time in English,
it’s true. You have to sort of refer to
the same one or further back, so you get a nesting relationship. But I mean, even if you obey that nesting
relationship, the result is that you get an exponential number of
ambiguities in a sentence based on in the number of prepositional phrases
you stick on the end of the sentence. And so the series of the exponential
series you get of these Catalan numbers. And so Catalan numbers actually show up in a lot of places in
theoretical computer science. Because any kind of structure
that is somehow sort of similar, if you’re putting these constraints in,
you get Catalan series. So, are any of you doing CS228? Yeah, so another place the Catalan series turns up
is that when you've got a graph and you're triangulating it, the number of
ways that you can triangulate your graph is also giving you Catalan numbers.
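As a quick sanity check on that claim, here is a two-line computation of the Catalan numbers in plain Python; the connection to PP-attachment counts is the nesting argument above.

```python
from math import comb

def catalan(k):
    return comb(2 * k, k) // (k + 1)

# The number of nested attachment patterns grows as 1, 2, 5, 14, 42, 132, ...
print([catalan(k) for k in range(1, 7)])
```

Okay, so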
human languages get very ambiguous. And we can hope to describe
them on the basis of sort of looking at these dependencies. So that's important concept one. The other important concept I wanted to
introduce at this point is this idea of linguistics having annotated
data in the form of treebanks. This is probably a little
bit small to see exactly. But what this is, is we’ve got sentences. These are actually sentences
that come off Yahoo Answers. And what’s happened is,
human beings have sat around and drawn in the syntactic structures of
these sentences as dependency graphs and those things we refer to as treebanks. And so a really interesting thing
that’s happened starting around 1990 is that people have devoted a lot of resources to building up these kind
of annotated treebanks and various other kinds of annotated linguistic
resources that we’ll talk about later. Now in some sense, from the viewpoint of
sort of modern machine learning in 2017, that’s completely unsurprising, because all the time what we do
is say we want labelled data so we can take our supervised classifier and
chug on it and get good results. But in many ways, it was kind of
a surprising thing that happened, which is sort of different to the whole
of the rest of history, right? Cuz for the whole of the rest of
the history, it was back in this space of, well, to describe linguistic
structure what we should be doing is writing grammar rules that describe
what happens in linguistic structure. Where here, we’re no longer even
attempting to write grammar rules. We’re just saying, give us some sentences. And I’m gonna diagram these sentences and
show you what their structure is. And tomorrow give me a bunch more and
I’ll diagram them for you as well. And if you think about it, in a way,
that initially seems kind of a crazy thing to do, cuz it seems
like just putting structures over sentences one by one seems really,
really inefficient and slow. Whereas, if you’re writing a grammar, you’re writing this thing
that generalizes, right? The whole point of grammar is that
you’re gonna write this one small, finite grammar. And it describes an infinite
number of sentences. And so surely,
that’s a big labor saving effort. But, slightly surprisingly, but maybe it
makes sense in terms of what’s happened in machine learning, that it’s just turned
out to be kind of super successful, this building of explicit,
annotated treebanks. And it ends up giving us a lot of things. And I sort of mention a few
of their advantages here. First, it gives you
a reusability of labor. But the problem of human beings
handwriting grammars is that they tend to, in practice, be almost unreusable,
because everybody does it differently and has their own idea of the grammar. And people spend years working on one and
no one else ever uses it. Where effectively, these treebanks have
been a really reusable tool that lots of people have then built on top of to
build all kinds of natural language processing tools, of part of speech
taggers and parsers and things like that. They’ve also turned out to be a really
useful resource, actually, for linguists, because they give a kind of record of how real languages
are spoken, complete with syntactic analyses that you can do all kinds of
quantitative linguistics on top of. It’s genuine data that’s broad
coverage. When people just work with their intuitions as to what
are the grammar rules of English, they think of some things but
not of other things. And so this is actually a better way to
find out all of the things that actually happen. For anything that's sort
of probabilistic or machine learning, it gives some sort
of not only what’s possible, but how frequent it is and what other
things it tends to co-occur with and all that kind of distributional
information that’s super important. And crucially, crucially, crucially,
and we’ll use this for assignment two, it’s also great because it gives you
a way to evaluate any system that you built because this gives us what we treat
as ground truth, gold standard data. These are the correct answers. And then we can evaluate any tool on
how good it is at reproducing those. Okay, so that’s the general advertisement. And what I wanted to do now is sort of
go through a bit more carefully for sort of 15 minutes, what are dependency
grammars and dependency structure? So we’ve sort of got that straight. I guess I’ve maybe failed to say, yeah. I mentioned there was this sort of
constituency context-free grammar viewpoint and
the dependency grammar viewpoint. Today, it’s gonna be all dependencies. And what we’re doing for
assignment two is all dependencies. We will get back to some notions of
constituency and phrase structure. You’ll see those coming back in
later classes in a few weeks’ time. But this is what we’re
going to be doing today. And that’s not a completely random choice. It’s turned out that, unlike what’s
happened in linguistics in most of the last 50 years, in the last decade
in natural language processing, it’s essentially been swept by
the use of dependency grammars, that people have found dependency
grammars just a really suitable framework on which to build semantic
representations to get out the kind of understanding of language that
they’d like to get out easily. They enable the building of very fast, efficient parsers,
as I’ll explain later today. And so in the last sort of ten years, you’ve just sort of seen this huge sea
change in natural language processing. Whereas, if you pick up a conference
volume around the 1990s, it was basically all phrase structure grammars and one or
two papers on dependency grammars. And if you pick up a volume now, what you’ll find out is that of the papers
they’re using syntactic representations, kind of 80% of them are using
dependency representations. Okay, yes.>>What’s that,
a phrase structure grammar? Phrase structure, what’s the phrase
structure grammar, that’s exactly the same as the context-free grammar
when a linguist is speaking. [LAUGH] Yes,
formally, a context-free grammar. Okay, so
what does a dependency syntax say? So the idea of dependency syntax
is to say that the sort of model of syntax is we have relationships
between lexical items, words, and only between lexical items. They’re binary, asymmetric relations,
which means we draw arrows. And we call those arrows dependencies. So the whole, there is a dependency
analysis of bills on ports and immigration were submitted by
Senator Brownback, Republican of Kansas. Okay, so that’s a start,
normally when we do dependency parsing,
by giving them a name for some grammatical relationship. So I’m calling this the subject, and
it’s actually a passive subject. And then this is an auxiliary modifier,
and Republican of Kansas is an appositional
phrase that’s coming off of Brownback. And so we use this kind of
typed dependency grammars. And interestingly,
I’m not going to go through it, but there’s sort of some interesting
math that if you just have this, although it’s notationally very different, from context-free grammar,
these are actually equivalent to a restricted kind of context-free
grammar with one addition. But things become sort of a bit more
different once you put in a typing of the dependency labels, but I won't
go into that in great detail, right. So a substantive theory
of dependency grammar for a language,
we’re then having to make some decisions. So what we’re gonna do is when we,
we’re gonna draw these arrows between two things, and
I’ll just mention a bit more terminology. So we have an arrow and its got what we
called the tail end of the arrow, I guess. And the word up here is sort of the head. So bills is an argument of submitted, were
is an auxiliary modifier of submitted. And so this word here is normally referred
to as the head, or the governor, or the superior, or
sometimes even the regent. I’ll normally call it the head. And then the word at
the other end of the arrow, the pointy bit,
I’ll refer to as the dependent, but other words that you can sometimes
see are modifier, inferior, subordinate. Some people who do dependency grammar
really get into these classist notions of superiors and inferiors, but
I’ll go with heads and dependents. Okay, so the idea is you
have a head of a clause and then the arguments or the dependents. And then when you have a phrase like, by Senator Brownback, Republican of Kansas, it's got a head which is here
being taken as Brownback and then it’s got words beneath it. And so one of the main parts of
dependency grammars at the end of the day is you have to make decisions
as to which words are heads and which words are then the dependents of
the heads of any particular structure. So in these diagrams I’m showing you here,
and the ones I showed you back a few pages,
what I’m actually showing you here is analysis according
to universal dependencies. So universal dependencies is
a new tree banking effort which I’ve actually been
very strongly involved in. That sort of started
a couple of years ago and there are pointers in both
earlier in the slides and on the website if you wanna go off and
learn a lot about universal dependencies. I mean it’s sort of
an ambitious attempt to try and have a common dependency representation
that works over a ton of languages. I could prattle on about it for ages, and if by some off chance there’s
time at the end of the class I could. But probably there won’t be so I won’t
actually tell you a lot about that now. But I will just mention one thing that
probably you’ll notice very quickly. And we’re also going to be using this
representation in the assignment that’s being given out today,
the analysis of universal dependencies treats prepositions sort of differently
to what you might have seen elsewhere. In many accounts of
English grammar, or references you might have heard in some English classroom,
prepositions have objects. In universal dependencies,
prepositions don’t have any dependents. Prepositions are treated kind
of like they were case markers, if you know any language like, German, or Latin, or Hindi, or
something that has cases. So that the by is sort of treated as
if it were a case marker of Brownback. So this is sort of an oblique modifier,
the by Senator Brownback. And so it's actually treating
Brownback here as the head, with the preposition by as sort of like
a case-marking dependent of it. And that was sort of done to get more
parallelism across different languages of the world. But I'll just mention that. Other properties of the dependencies:
normally dependencies form a tree. So there are formal properties
that goes along with that. That means that they’ve got a single-head,
they're acyclic, and they're connected. So there are sort of graph-
theoretic properties. Yeah, I sort of mentioned that really dependencies have dominated
most of the world. So just very quickly on that. The famous first linguist was Panini, who wrote his Grammar of Sanskrit
around the fifth century BCE. Really most of the work that Panini
did was kind of on sound systems and the makeup of words,
phonology and morphology, which we mentioned as linguistic
levels in the first class. And he only did a little bit of
work on the structure of sentences. But the notation that he used for structure of sentences was essentially
a dependency grammar of having word relationships being
marked as dependencies. Question? Yeah, so the question is,
well, compared to CFGs and PCFGs, dependency grammars
look strongly lexicalized, they're between words, and
does that make it harder to generalize. I honestly feel I just
can’t do justice to that question right now if I’m gonna get
through the rest of the lecture. But I will make two comments, so I mean, there’s certainly the natural way
to think of dependency grammars, they’re strongly lexicalized, you’re
drawing relationships between words. Whereas the simplest way of thinking of
context-free grammars is you've got these rules in terms of categories, like noun phrase goes to determiner noun,
optional prepositional phrase. And so, that is a big difference. But it kind of goes both ways. So, normally, when actually, natural
language processing people wanna work with context-free grammars,
they frequently lexicalize them so they can do more precise probabilistic
prediction, and vice versa. If you want to do generalization and
dependency grammar, you can still use at least
notions of parts of speech to give you a level of generalization
as more like categories. But nevertheless, the kind of natural
ways of sort of turning them into probabilities, and machine learning
models are quite different. Though, on the other hand,
there’s sort of some results, or sort of relationships between them. But I would think I’d better
not go on a huge digression. But you have another question? That means to rather than just have
categories like noun phrase to have categories like a noun phrase headed
by dog, and so it’s lexicalized. Let’s leave this for
the moment though, please, okay. [LAUGH]
Okay, so that’s Panini, and
there’s a whole big history, right? So, essentially for
Latin grammarians, what they did for the syntax of Latin,
again, not very developed. They mainly did morphology, but it was essentially a dependency
kind of analysis that was given. There was sort of a flowering of Arabic
grammarians in the first millennium, and they essentially had a dependency grammar. I mean, by contrast, I mean, really kind
of context free grammars and constituency grammar only got invented almost in
the second half of the 20th century. I mean, it wasn’t actually Chomsky
that originally invented them, there was a little bit of earlier work in
Britain, but only kind of a decade before. So, there was this French
linguist Lucien Tesniere, he is often referred to as the father
of modern dependency grammar, he’s got a book from 1959. Dependency grammars have been very popular
in more sorta free word order languages, cuz notions sort of like context-free
grammars work really well for languages like English that
have very fixed word order, but a lot of other languages of the world
have much freer word order. And that’s often more naturally
described with dependency grammars. Interestingly, one of the very first
natural language parsers developed in the US was also a dependency parser. So, David Hays was one of the first
US computational linguists. And one of the founders of the Association
for Computational Linguistics which is our main kind of academic association where
we publish our conference papers, etc. And he actually built in 1962,
a dependency parser for English. Okay, so
a lot of history of dependency grammar. So, couple of other fine points
to note about the notation. People aren’t always consistent in
which way they draw the arrows. I’m always gonna draw the arrows, so
they point, go from a head to a dependent, which is the direction
which Tesniere drew them. But there are some other people who
draw the arrows the other way around. So, they point from
the dependent to the head. And so, you just need to look and
see what people are doing. The other thing that’s very commonly done,
and we will do in our parses, is you stick this pseudo-word,
which might be called ROOT or WALL, or some other name like that,
at the start of the sentence. And that kind of makes the math and
formalism easy, because, then, every sentence starts with
root and something is a dependent of root. Or, turned around the other way, if you
think of what parsing a dependency grammar means is for every word in
the sentence you’re going to say, what is it a dependent of,
because if you do that you’re done. You’ve got the dependency
structure of the sentence. And what you’re gonna want to say is,
well, it’s either gonna be a dependent of some other word in the sentence,
or it’s gonna be a dependent of the pseudo-word ROOT, which is meaning
it’s the head of the entire sentence. And so, we’ll go through some
specifics of dependency parsing the second half of the class. But the kind of thing that you
should think about is well, how could we decide which
words are dependent on what? And there are certain various information
sources that we can think about. So yeah, it’s sort of totally natural with
the dependency representation to just think about word relationships. And that’s great, cuz that’ll fit super
well with what we’ve done already in distributed word representations. So actually,
doing things this way just fits well with a couple of tools we
already know how to use. We’ll want to say well,
discussion of issues, is that a reasonable attachment,
a reasonable lexical dependency? And that's a lot of the information
that we’ll actually use, but there’s some other sources of
information that we’d also like to use. Dependency distance, so sometimes,
there are dependency relationships in sentences between words that are 20 words
apart when you got some big long sentence, and you’re referring that back
to some previous clause, but it’s kind of uncommon. Most of dependencies are pretty short
distance, so you want to prefer that. Many dependencies don’t, sort of,
span certain kinds of things. So, if you have the kind of dependencies
that occur inside noun phrases, like adjective modifier,
they’re not gonna cross over a verb. It’s unusual for many kinds of
dependencies to cross over a punctuation, so it’s very rare to have a punctuation
between a verb and a subject and things like that. So, looking at the intervening
material gives you some clues. And the final source of information is
sort of thinking about heads, and thinking how likely they are to have to dependence
in what number, and on what sides. So, the kind of information there is,
right, a word like the, is basically not likely to have
any dependents at all, anywhere. So, you’d be surprised if it did. Words like nouns can have dependents, and
they can have quite a few dependents, but they’re likely to have some kinds like
determiners and adjectives on the left, other kinds like prepositional
phrases on the right verbs tend to have a lot of dependence. So, different kinds of words have
different kinds of patterns of dependence, and so there’s some information
there we could hope to gather. Okay, yeah,
I guess I’ve already said the first point. How do we do dependency parsing? In principle, it’s kind of really easy. So, we’re just gonna take every
word in the sentence and say, make a decision as to what word or
root this word is a dependent of. And we do that with a few constraints. So normally, we require that only
one word can be a dependent of root, and we’re not going to allow any cycles. And if we do both of those things, we’re guaranteeing that we make
the dependencies into a tree. And normally,
we want to make our dependencies a tree. And there's one other property
I then wanted to mention, that if you draw your
dependencies as I have here, all the dependencies have been drawn
as loops above the words. It’s different if you’re allowed to
put some of them below the words. There’s then a question as to
whether you can draw them like this. So that they have that kind of nice,
little nesting structure, but none of them cross each other. Or whether, like these two that I’ve
got here, where they necessarily cross each other, and
I couldn’t avoid them crossing each other. And what you’ll find is in most languages,
certainly English, the vast majority of dependency
relationships have a nesting structure relative to the linear order. And if a dependency tree is fully nesting, it’s referred to as
a projective dependency tree, that you can lay it out in this plane,
and have sort of a nesting relationship. But there are few structures in English where you’d get things
that aren’t nested and yet crossing. And this sentence is
a natural example of one. So I’ll give a talk
tomorrow on bootstrapping. So something that you can do with
noun modifiers, especially if they’re kind of long words like bootstrapping or
techniques of bootstrapping, is you can sort of move them towards
the end of the sentence, right. I could have said I’ll give
a talk on bootstrapping tomorrow. But it sounds pretty natural to say, I’ll
give a talk tomorrow on bootstrapping. But this on bootstrapping is
still modifying the talk. And so that’s referred to by
linguists as right extraposition. And so when you get that kind of
rightward movement of phrases, you then end up with these crossing lines. And that gives you what’s referred to
as a non-projective dependency tree. So, importantly,
it is still a tree if you sort of ignore the constraints of linear order,
and you're just drawing it out, as a graph in theoretical computer
science, right, it’s still a tree. It’s only when you consider this extra
thing of the linear order of the words, that you’re then forced
to have the lines across. And so that property which you don’t
actually normally see mentioned in theoretical computer science
discussions of graphs is then this property that’s
referred to as projectivity.
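Here is a rough sketch of that projectivity property in code, under the assumption that a parse is given as an array of head indices with 0 for the ROOT pseudo-word. The head indices for the bootstrapping sentence below are my own hypothetical analysis, roughly following the lecture's description.

```python
def is_projective(heads):
    """heads[i] is the head of word i+1; 0 means the ROOT pseudo-word."""
    arcs = [(min(d, h), max(d, h)) for d, h in enumerate(heads, start=1)]
    for (l1, r1) in arcs:
        for (l2, r2) in arcs:
            # Two arcs cross if one starts strictly inside the other
            # and ends strictly outside it.
            if l1 < l2 < r1 < r2:
                return False
    return True

# "I 'll give a talk tomorrow on bootstrapping":
# the arc give -> tomorrow crosses the arc talk -> bootstrapping.
heads = [3, 3, 0, 5, 3, 3, 8, 5]
print(is_projective(heads))   # False, a non-projective tree
```

Yes.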
>>[INAUDIBLE]>>So the question is, is it possible to recover the order of the words
from a dependency tree. So given how I’ve defined dependency
trees, the strict answer is no. They aren’t giving you the order at all. Now, in practice, people write down
the words of a sentence in order and have these crossing brackets, right, crossing
arrows when they’re non-projective. And, of course, it would be a
straightforward thing to index the words. And, obviously, it’s a real thing about
languages that they have linear order. One can’t deny it. But as I’ve defined dependency structures,
yeah, you can’t actually recover
the order of words from them. Okay, one more slide before
we get to the intermission. Yeah, so in the second half of the class, I’m gonna tell you about
a method of dependency parsing. I just wanted to say, very quickly,
there are a whole bunch of ways that people have gone
about doing dependency parsing. So one very prominent way of doing
dependency parsing is using dynamic programming methods, which is normally what people have
used for constituency grammars. A second way of doing it
is to use graph algorithms. So a common way of doing dependency
parsing, you’re using MST algorithms, Minimum Spanning Tree algorithms. And that’s actually a very
successful way of doing it. You can view it as kind of
a constraint satisfaction problem. And people have done that. But the way we’re gonna look at it is
this fourth way which is, these days, most commonly called transition
based-parsing, though when it was first introduced, it was quite often called
deterministic dependency parsing. And the idea of this is that
we’re kind of greedily going to decide which word each
word is a dependent of, guided by having a machine
learning classifier. And this is the method you’re
going to use for assignment two. So one way of thinking about this is, so far in this class,
we only have two hammers. One hammer we have is word vectors, and
you can do a lot with word vectors. And the other hammer we have is
how to build a classifier as a feedforward neural network
with a softmax on top so it classifies between various classes. And it turns out that if
those are your two hammers, you can do dependency parsing this way and
it works really well. And so, therefore, that’s a great
approach for using in assignment two. And it’s not just a great approach for
assignment two. Actually method four is the dominant
way these days of doing dependency parsing because it has
extremely good properties of scalability. That greedy word there is a way of
saying this is a linear time algorithm, which none of the other methods are. So in the modern world
of web-scale parsing, it’s sort of become most
people’s favorite method. So I’ll say more about that very soon. But before we get to that, we have Ajay doing our research spotlight
with one last look back at word vectors.>>Am I on?
Okay, awesome, so let’s take a break from
dependency parsing and talk about something we should
know a lot about, word embeddings. So for today’s research highlight, we’re
gonna be talking about a paper titled, Improving Distributional Similarity with
Lessons Learned from Word Embeddings. And it’s authored by Levy, et al. So in class we’ve learned two major
paradigms for generating word vectors. We’ve learned count-based
distributional models, which essentially utilize a co-occurrence
matrix to produce your word vectors. And we’ve learned SVD,
which is Singular Value Decomposition. And we haven’t really talked about PPMI. But, in effect, it still uses that co-occurrence matrix to
produce sparse vector encodings for words. We’ve also learned neural
network-based models, which you all should have
lots of experience with now. And, specifically, we’ve talked
about Skip-Gram Negative Sampling, as well as CBOW methods. And GloVe is also a neural
network-based model. And the conventional wisdom is that
neural network-based models are superior to count-based models. However, Levy et al proposed
that hyperparameters and system design choices are more important,
not the embedding algorithms themselves. So they’re challenging
this popular convention. And so, essentially,
what they do in their paper is propose a slew of hyperparameters that,
when implemented and tuned over, the count-based distributional models
pretty much approach the performance of neural network-based models,
to the point where there’s no consistent, better choice across the different
tasks that they tried. And a lot of these
hyperparameters were actually inspired by these neural network-based
models such as Skip-Gram. So if you recall, which you all
should be very familiar with this, we have two hyperparameters in Skip-Gram. We have the number of negative samples
that we’re sampling, as well as the unigram distributions smoothing
exponent, which we fixed at 3 over 4. But it can be thought of as
more of a system design choice. And these can also be transferred
over to the count-based variants. And I'll go over those very quickly. So the single
hyperparameter that Levy et al. proposed that had the biggest
impact in performance was Context Distribution Smoothing
which is analogous to the unigram distribution
smoothing constant 3 over 4 here. And in effect they both
achieved the same goal which is to sort of smooth out your distribution
such that you’re penalizing rare words. And using this hyperparameter
which interestingly enough, the optimal alpha they
found was exactly 3 over 4, which is the same as
the Skip-Gram Unigram smoothing exponent. They were able to increase performance
by an average of three points across tasks, which
is pretty interesting.
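To make that one hyperparameter concrete, here is a small sketch of what raising context counts to the 3/4 power does to a distribution. The numbers are my own toy example, not from the paper.

```python
import numpy as np

counts = np.array([1000.0, 100.0, 10.0, 1.0])   # hypothetical context counts

plain    = counts / counts.sum()
smoothed = counts**0.75 / (counts**0.75).sum()

print(plain)     # rare contexts get almost no probability mass
print(smoothed)  # the distribution is flattened, which damps the spuriously
                 # high PMI values that rare contexts would otherwise produce
```

And they also propose Shifted PMI, which I'm not gonna get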
into the details of this. But this is analogous to
the negative sampling, choosing the number of
negative samples in Skip-Gram. And they’ve also proposed a total
of eight hyperparameters. And we've described one of them which
is the Context Distribution Smoothing. So here’s the results. And this is a lot of data, and if you’re
confused, that’s actually the conclusion that I want you to arrive at because
clearly there’s no trend here. So, what the authors did was
take all four methods, tried three different windows, and then test
all the models across a different task. And those are split up into word
similarity and analogy task. And all of these methods are tuned to find the best hyperparameters
to optimize for the performance. And the best models are bolded, and as you
can see there’s no consistent best model. So, in effect, they’re challenging
the popular convention that neural network-based models
are superior to the count-based models. However, there’s a few
things to note here. Number one, adding hyperparameters
is never a great thing because now you have to train those
hyperparameters which takes time. Number two,
we still have the issues with count-based distributional models specifically
with respect to the computational issues of storing PPMI counts
as well as performing SVD. So the key takeaways here is that the
paper challenges the conventional wisdom that neural network-based models are in
fact superior to count-based models. Number two,
while model design is important, hyperparameters are also key for
achieving good results. So this implies specifically to
you guys especially if you’re doing a project instead
of assignment four. You might implement the model but
that might only take you halfway there. For some models, finding your optimal
hyperparameters might take days or even weeks. So don't discount their importance. And, finally, my personal interest within
ML is in deep representation learning. And this paper specifically excites
me because I think it sort of displays that there’s still lots
of work to be done in the field. And so, the final takeaway
is challenge the status quo. Thank you.>>[APPLAUSE]>>Okay, thanks a lot Ajay. Okay and so
now we’re back to learning about how to build a transition based
dependency parser. So, maybe in 103 or compilers class,
formal languages class, there’s this notion of
shift reduced parsing. How many of you have seen shift
reduced parsing somewhere? A minority it turns out. They just don’t teach formal languages the
way they used to in the 1960s in computer science anymore.>>[LAUGH]
>>You’ll just have to spend more time with Jeff Ullman. Okay, well I won’t assume that
you’ve all seen that before. Okay, essentially what
we’re going to have is, I’ll just skip these two slides and
go straight to the pictures. Because, they will be
much more understandable. But before I go on, I’ll just
mention the picture on this page, that’s a picture of Joakim Nivre. So Joakim Nivre is a computational
linguist in Uppsala, Sweden who pioneered this approach of
transition based dependency parsing. He’s one of my favorite
computational linguists. I mean he was also an example,
going along with what Ajay said, of sort of doing something unpopular and out of the mainstream and
proving that you can get it to work well. So at an age when everyone else was trying
to build sort of fancy dynamic program parsers Joakim said no,no, what I’m
gonna do, is I’m just gonna take each successive word and have a straight
classifier that says what to do with that. And go onto the next word completely
greedy cuz maybe that’s kinda like what humans do with incremental
sentence processing and I’m gonna see how well
I can make that work. And it turned out you can
make it work really well. So and then sort of transition based
parsing has grown to this sort of really widespread dominant
way of doing parsing. So it’s good to find something different
to do. If everyone else is doing something, it's good to think of something else
that might be promising that you got an idea from. And I also like Joakim because he’s
actually another person that’s really interested in human languages and linguistics which actually seems
to be a minority of the field of natural language processing
when it comes down to it. Okay, so here’s some more formalism,
but I’ll skip that as well and show it to you afterwards and
I’ll give you the idea of what an arc-standard transition-based
dependency parser does. So what we’re gonna do is were going
to have a sentence we want to parse, I ate fish, and so we’ve got some rules
for parsing which is the transition scheme which is written so
small you can’t possibly read it. And this is how we start. So we have two things,
we have a stack, and a stack is kinda got the gray
cartouche around that. And we start off parsing any
sentence by putting one thing on the stack, which is our root symbol. Okay and
the stack has its top towards the right. And then we have this other thing
which gets referred to as the buffer. And the buffer is the orange cartouche and the buffer is the sentence
that we’ve got to deal with. And so the thing that we regard as the top
of the buffer is the thing to the left, because we’re gonna be taking
off successive words, right? So the top of both of them is sort of at
that intersection point between them. Okay and so,
to do parsing under this transition-based scheme there are three
operations that we can perform. We can perform, they’re called Shift,
Left-Arc and Right-Arc. So the first one that we’re
gonna do is shift operation. So shift is really easy. All we do when we do a shift is we take
the word that’s on the top of the buffer and put it on the top of the stack. And then we can shift again and we take the word that’s on the top of the
buffer and put it on the top of the stack. Remember the stack,
the top is to the right. The buffer, the top is to the left. That’s pretty easy, right? Okay, so there are two other
operations left in this arc-standard transition scheme which were left arc and
right arc. So what left arc and right arc
are gonna do is we’re going to make attachment decisions by adding
a word as the dependent, either to the left or to the right. Okay, so what we do for left arc is on the stack we say that
the second to the top of the stack is a dependent of
the thing that’s the top of the stack. So, I is a dependent of ate, and we remove
that second top thing from the stack. So that’s a left arc operation. And so now we’ve got a stack
with just [root] ate on it. But we collect up our decisions, so we’ve
made a decision that I is a dependent of ate, and that’s that said A that I am
writing in small print off to the right. Okay, so
we still had our buffer with fish on it. So the next thing we’re gonna do is
shift again and put fish on the stack. And so at that point our buffer is empty, we’ve moved every word on to
the stack in our sentence. And we have on it root ate fish, okay. So then the third operation we have is right arc, and right arc is
just the opposite of left arc. So for the right arc operation, we say
the thing that’s on the top of the stack should be made a dependent of the thing
that’s second to top on the stack. We remove it from the stack and
we add an arc saying that. So we right arc, so we say fish is a dependent of ate,
and we remove fish from the stack. We add a new dependency saying
that fish is a dependent of ate. And then we right arc one more time so then we’re saying that ate is
the dependent of the root. So we pop it off the stack and we’re
just left with root on the stack, and we’ve got one new dependency saying
that ate is a dependent of root. So at this point, And
I’ll just mention, right, in reality there’s,
I left out writing the buffer in a few of those examples there just because it was
getting pretty crowded on the slide. But really the buffer is always there,
right, it’s not that the buffer disappeared and came back again,
it’s just I didn’t always draw it. So but in our end state, we’ve got one thing on the stack,
and we’ve got nothing in the buffer. And that’s the good state
that we want to be in if we finish parsing our sentence correctly. And so we say, okay,
we’re in the finished state and we stop. And so that is almost all there is to arc-standard
transition based parsing.
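To make that walkthrough concrete, here is a minimal sketch of the arc-standard machinery in plain Python. This is my own illustration, not the assignment's starter code; it just replays exactly that shift / left-arc / right-arc sequence on "I ate fish".

```python
def run_transitions(words, transitions):
    stack, buffer, arcs = ["ROOT"], list(words), []
    for t in transitions:
        if t == "SHIFT":
            stack.append(buffer.pop(0))   # top of buffer moves to top of stack
        elif t == "LEFT-ARC":
            dep = stack.pop(-2)           # second-to-top becomes dependent of top
            arcs.append((stack[-1], dep))
        elif t == "RIGHT-ARC":
            dep = stack.pop()             # top becomes dependent of second-to-top
            arcs.append((stack[-1], dep))
    return stack, arcs

stack, arcs = run_transitions(
    ["I", "ate", "fish"],
    ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC", "RIGHT-ARC"])
print(stack)   # ['ROOT']  -- one thing on the stack, empty buffer: finished
print(arcs)    # [('ate', 'I'), ('ate', 'fish'), ('ROOT', 'ate')]
```

So if we just sort of go back to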
these slides that I skipped over. Right, so we have a stack and our buffer,
and then on the side we have a set of dependency arcs A which starts
off empty and we add things to. And we have this sort of set of actions
which are kind of legal moves that we can make for parsing, and so
this was how things are. So we have a start condition, ROOT on the
stack, buffer is the sentence, no arcs. We have the three operations
that we can perform. Here I’ve tried to write
them out formally, so the sort of vertical bar is sort of
appends an element to a list operation. So this is sort of having wi as the first
word on the buffer, it’s written the opposite way around for the stack
because the head’s on the other side. And so we can sort of do this shift
operation of moving a word onto the stack and these two arc operations
add a new dependency. And then removing one word from the stack
and our ending condition is one thing on the stack which will
be the root and an empty buffer. And so
that’s sort of the formal operations. So the idea of transition based
parsing is that you have this sort of set of legal moves to parse a sentence
in sort of a shift reduced way. I mean this one I referred to as
arc-standard cuz it turns out there are different ways you can define
your sets of dependencies. But this is the simplest one,
the one we’ll use for the assignment, and one that works pretty well. Question? I was gonna get to that. So I’ve told you the whole
thing except for one thing which is this just gives
you a set of possible moves. It doesn’t say which
move you should do when. And so
that’s the remaining thing that’s left. And I have a slide on that. Okay, so the only thing that’s left
is to say, gee, at any point in time, like we were here, at any point in time,
you’re in some configuration, right. You’ve got certain things on there,
certain things in the stacks, certain things in your buffer, you have
some set of arcs that you’ve already made. And which one of these
operations do I do next? And so that’s the final thing. And the way that you do that,
that Nivre proposed, is well what we should do is just
build a machine learning classifier. Since we have a tree bank
with parses of sentences, we can use those parses
of sentences to see which sequence of operations would
give the correct parse of a sentence. I am not actually gonna go
through that right now. But if you have the structure
of a sentence in a tree bank, you can sort of work out deterministically
the sequence of shifts and reducers that you need
to get that structure. And it’s indeed unique, right, that for
each tree structure there’s a sequence of shifts and left arcs and right arcs
that will give you the right structure. So you take the tree, you read off
the correct operation sequence, and therefore you’ve got a supervised
classification problem. Say in this scenario, what you
should do next is you should shift, and so you’re then building
a classifier to try to predict that.
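Here is a rough sketch of that "read off the correct operation sequence" step: a simplified static oracle of my own, assuming a projective gold tree given as a dependent-to-head map over 1-based word indices, with 0 for ROOT.

```python
def oracle_transitions(n_words, gold_head):
    stack, buffer, attached = [0], list(range(1, n_words + 1)), set()
    transitions = []
    def pending_deps(h):   # gold dependents of h that are not yet attached
        return sum(1 for d, g in gold_head.items() if g == h and d not in attached)
    while buffer or len(stack) > 1:
        if len(stack) >= 2 and gold_head.get(stack[-2]) == stack[-1]:
            attached.add(stack.pop(-2)); transitions.append("LEFT-ARC")
        elif (len(stack) >= 2 and gold_head.get(stack[-1]) == stack[-2]
              and pending_deps(stack[-1]) == 0):
            attached.add(stack.pop()); transitions.append("RIGHT-ARC")
        else:
            stack.append(buffer.pop(0)); transitions.append("SHIFT")
    return transitions

# "I ate fish": I -> ate, fish -> ate, ate -> ROOT
print(oracle_transitions(3, {1: 2, 3: 2, 2: 0}))
# ['SHIFT', 'SHIFT', 'LEFT-ARC', 'SHIFT', 'RIGHT-ARC', 'RIGHT-ARC']
```

So in the early work that started off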
with Nivre and others in the mid 2000s, this was being done with conventional
machine learning classifiers. So maybe an SVM, maybe a perceptron,
a kind of maxent / soft max classifiers, various things, but sort of some
classifier that you're gonna use. So if you're just deciding between
the operations, shift left arc, right arc,
you have got at most three choices. Occasionally you have less because if
there’s nothing left on the buffer you can’t shift anymore, so then you’d
only have two choices left maybe. But something I didn’t mention
when I was showing this is when I added to the arc set, I didn’t only
say that fish is an object of ate. I said,
the dependency is the object of ate. And so
if you want to include dependency labels, the standard way of doing that is you just
have subtypes of left arc and right arc. So rather than having three choices, if you have approximately 40
different dependency labels. As we will in assignment two and
in universal dependencies, you actually end up with a space
of 81-way classification: 40 left-arc types plus 40 right-arc types plus shift. Because you have classes with
names like left arc as an object. Or left arc as an adjectival modifier. For the assignment,
you don’t have to do that. For the assignment,
we're just doing untyped dependency trees. Which sort of makes it a bit more
scalable and easy for you guys. So it’s only sort of a three way
decision is all you’re doing. In most real applications, it’s really
handy to have those dependency labels. Okay. And then what do we use as features? Well, in the traditional model, you sort
of looked at all the words around you. You saw what word was on
the top of the stack. What was the part of speech of that word? What was the first word in the buffer? What was its parts of speech? Maybe it’s good to look at the thing
beneath the top of the stack. And what word and part of speech it is. And further ahead in the buffers. So you’re looking at a bunch of words. You’re looking at some attributes of those
words, such as their part of speech. And that was giving you
a bunch of features. Which are the same kind of classic,
categorical, sparse features of
traditional machine learning. And people were building
classifiers over that. Yeah, Question? So yeah, the question is are most
treebanks annotated with part of speech? And the answer is yes. Yeah, so I mean. We’ve barely talked about
part of speech so far, things like
nouns and verbs. So the simplest way of doing
dependency parsing is you first run a part-of-speech tagger to
assign parts of speech to words. And then you’re doing the syntactic
structure of dependency parsing over a sequence of word,
part-of-speech tag pairs. Though there has been other work
that’s done joint parsing and part of speech tag
prediction at the same time. Which actually has some advantages,
because you can kind of explore. Since the two things are associated, you can get some advantages
from doing it jointly. Okay, on the simplest possible model,
which was what Nivre started to explore. There was absolutely no search. You just took the next word,
ran your classifier. And said, that’s the object of the verb,
what’s the next word? Okay, that one’s a noun modifier. And you went along and
just made these decisions. Now you could obviously think,
gee, maybe if I did some more searching and explored different alternatives, I could do a bit better. And the answer is yes, you can. So there's a lot of work in dependency parsing which uses various forms of beam search, where you explore different alternatives. And if you do that, it gets a ton slower, and gets a teeny bit better in terms of your performance results. Okay, but especially if you start from the greediest end, or you have a small beam, the secret of this type of parsing is that it gives you extremely fast, linear-time parsing. Because you're just going through
your corpus, no matter how big, and saying: what's the next word? Okay, attach it there. What's the next word? Attach it there. And you keep on chugging through. So when people, like prominent search engines in suburbs south of us, want to parse the entire content of the Web, they use a parser like this because it goes super fast. Okay.
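To make the no-search, linear-time loop concrete, here is a minimal sketch of greedy arc-standard-style parsing. It assumes a hypothetical classifier called `predict_transition`; it illustrates the control flow, not MaltParser's actual code.

```python
def greedy_parse(words, predict_transition):
    """Greedy transition-based parsing: one classifier call per decision, no search."""
    stack = ["ROOT"]
    buffer = list(words)      # words still waiting to be processed
    arcs = []                 # committed (head, dependent) pairs

    while buffer or len(stack) > 1:
        action = predict_transition(stack, buffer, arcs)
        if action == "SHIFT" and buffer:
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC" and len(stack) > 1:
            dependent = stack.pop(-2)           # second-from-top gets the top as head
            arcs.append((stack[-1], dependent))
        elif action == "RIGHT-ARC" and len(stack) > 1:
            dependent = stack.pop()             # top gets the new top as head
            arcs.append((stack[-1], dependent))
        else:
            break                               # no legal action left: stop
    return arcs
```

Each word is shifted once and reduced once, so the number of decisions is linear in sentence length, which is why this style of parser is fast enough to run over the whole web.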
And so, what was shown was that these kinds of greedy dependency parsers have accuracy slightly below the best dependency parsers possible, but their performance is actually kind of close to it. And the fact that they're so fast and scalable more than makes up for their teeny performance decrease. So that's kind of exciting. Okay, so then for the last few minutes
I now want to get back to neural nets. Okay so where are we at the moment? So at the moment we have a configuration
where we have a stack and a buffer, and parts of speech for the words. And as we start to build some structure, the things that we've taken off the stack when we build arcs, we can kind of think of them as starting to build up a tree as we go, as I've indicated with that example below. So, the classic way of doing that is you could then say, okay, well, we've got all of these features, like top of stack's word is good, or top of stack's word is bad, or top of stack's word is easy; top of stack's part of speech is adjective; top of stack's part of speech is noun. And if you start doing that, when you've got a combination of
positions and words and parts of speech. You very quickly find that the number
of features you have in your model is sort of on the order of ten million. Extremely, extremely large. But you know, that's precisely how these kinds of parsers were standardly built in the 2000s. So you're building these huge machine learning classifiers over sparse features. And commonly you even had features that were conjunctions of things, as that helped you predict better. So you had features like: the second word on the stack is has, and its tag is present-tense verb, and the top word on the stack is good, and things like that would be one feature. And that's where you easily get into the ten million plus features.
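Here is a minimal sketch of what those sparse, conjoined, categorical features look like as strings; the template names are my own invention for illustration, not any particular parser's feature set. Each distinct string gets its own indicator feature and its own weight, which is how you end up with tens of millions of them.

```python
def sparse_features(stack, buffer):
    """stack and buffer hold (word, tag) pairs; returns indicator-feature strings."""
    feats = []
    if stack:
        w0, t0 = stack[-1]
        feats += ["s0.word=" + w0, "s0.pos=" + t0]
    if len(stack) > 1:
        w1, t1 = stack[-2]
        feats.append("s1.word=" + w1)
        # A conjunction feature combining three pieces of the configuration,
        # e.g. "s1.word=has+s1.pos=VBZ+s0.word=good".
        feats.append("s1.word=%s+s1.pos=%s+s0.word=%s" % (w1, t1, stack[-1][0]))
    if buffer:
        wb, tb = buffer[0]
        feats += ["b0.word=" + wb, "b0.pos=" + tb]
    return feats

print(sparse_features([("has", "VBZ"), ("good", "JJ")], [("control", "NN")]))
```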
So even doing this already worked quite well. But the starting point for going on is saying, well, it didn't work completely great, and we wanna do better than that. And we'll go on and
do that in just a minute. But before I do that, I should mention
just the evaluation of dependency parsing. Evaluation of dependency
parsing is actually very easy. Cuz for each word we're saying, what is it a dependent of? So we're making a choice of what each word is a dependent of, and then there's a right answer, which we get from our treebank, which is the gold standard. So we're essentially just counting how often we are right, which is an accuracy measure. And so, there are two ways
that that’s commonly done. One way is that we just look at
the arrows and ignore the labels, and that's often referred to as the UAS measure, the unlabeled attachment score. Or we can also pay attention to the labels, and say you're only right if you also get the label right, and that's referred to as the LAS, the labeled attachment score.
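Here is a minimal sketch of how those two scores are computed, assuming each parse is given as one (head, label) pair per word; the function and variable names are just illustrative, not from any official evaluation script.

```python
def attachment_scores(predicted, gold):
    """Return (UAS, LAS): fraction of words with the right head,
    and with the right head and the right label."""
    assert len(predicted) == len(gold)
    correct_heads = correct_labeled = 0
    for (p_head, p_label), (g_head, g_label) in zip(predicted, gold):
        if p_head == g_head:
            correct_heads += 1                  # counts toward UAS
            if p_label == g_label:
                correct_labeled += 1            # counts toward LAS
    return correct_heads / len(gold), correct_labeled / len(gold)

# All three heads right, but one label wrong: UAS = 1.0, LAS = 2/3.
pred = [(2, "nsubj"), (0, "root"), (2, "nmod")]
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
print(attachment_scores(pred, gold))
```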
Yes? So the question is, don't you have waterfall effects: if you get something wrong high up, won't that destroy everything else further down? You do get some of that, because, yes, one decision will
prevent some other decisions. It's typically not so bad, because even if you mis-attach something like a prepositional phrase, you can still get right all of the attachments inside the noun phrase that's inside that prepositional phrase. So it's not so bad. And actually, dependency parsing evaluation suffers much less badly from waterfall effects than CFG parsing, which is worse in that respect. So it's not so bad. Okay, I had one slide there which I think I should skip. Okay, I'll skip on to the neural ones. Okay, so, people could build quite good machine learning dependency parsers on
these kinds of categorical features. But nevertheless, there were problems with doing that. So, Problem #1 is the features were just super sparse. You typically might have a treebank that's on the order of about a million words, and if you're then trying to train 15 million features, which are kinda different combinations of configurations, not surprisingly, a lot of those configurations you've seen only once or twice. So, you just don't have any accurate model of what happens in different configurations. You're just kind of getting these
weak feature weights, and crossing your fingers and
hoping for the best. Now, it turns out that in modern machine learning, crossing your fingers works pretty well. But, nevertheless,
you’re suffering a lot from sparsity. Okay, the second problem is,
you also have an incompleteness problem, because lots of configurations you'll see at run time will be configurations that you just never happened to see in training. When exquisite was the second word on the stack, and the top word of the stack was speech, or something. For any kind of word pair, you've only seen a small fraction of them. Lots of things you don't have features for. The third one is a little bit surprising. It turned out that when you looked at
these symbolic dependency parsers, and you ask what made them slow. What made them slow
wasn’t running your SVM, or your dot products in your logistic
regression, or things like that. All of those things were really fast. What these parsers were ending up
spending 95% of their time doing is just computing these features, and
looking up their weights, because you had to sort of walk around the stack and the buffer and put together a feature name, and then you had to look it up in some big hash table to get a feature number and a weight for it. And all the time is going on that, so even though they're linear time, that slowed them down a ton. So, in a paper in 2014, Danqi and I developed this alternative
where we said well, let’s just replace that all
with a neural net classifier. So that way, we can have a dense
compact feature representation and do classification. So, rather than having our 10
million categorical features, we’ll have a relatively modest
number of dense features, and we’ll use that to decide our next action. And so, I want to spend the last
few minutes sort of showing you how that works, and this is basically
question two of the assignment. Okay, and basically, just to give you
the headline, this works really well. So, this sort of shows the outcome. The first parser is MaltParser. So, it has pretty good UAS and LAS, and it had this advantage that it was really fast. And since I said that's been the preferred method, I'll give you some contrast in gray. So, these are two of
the graph-based parsers. So, the graph-based parsers have been somewhat more accurate, but they were kind of like two orders of magnitude slower. So, if you didn't wanna parse much stuff and you wanted accuracy, you'd use them. But if you wanted to parse the web, no one used them. And so, the cool thing was that by doing this as a neural network dependency parser, we were able to get much better accuracy. We were able to get accuracy that was virtually as good as the best graph-based parsers at that time. And we were actually able to build
a parser that works significantly faster than MaltParser, because of
the fact that it wasn’t spending all this time doing feature combination. It did have to do more
vector matrix multiplies, of course, but that’s a different story. Okay, so how did we do it? Well, so, our starting point was
the two tools we have, right? Distributed representation. So, we’re gonna use distributed
representations of words. So, similar words have close by vectors,
we’ve seen all of that. We’re also going to use part, in our POS,
we use part-of-speech tags and dependency labels. And we also learned distributed
representations for those. That’s kind of a cool idea, cuz it’s also the case that parts of
speech some are more related than others. So, if you have a fine grain
part-of-speech set where you have plural nouns and proper names as
different parts of speech from nouns, singular, you want to say
that they are close together. So, we also had distributed
representations for those. So now,
we have the same kind of configuration. We’re gonna run exactly the same
transition based dependency parser. So, the configuration
is no different at all. But what we’re going to extract
from it is the starting point: we extract certain positions, just like Nivre's MaltParser, but then what we're gonna do is, for each of these positions, like top of stack, second on the stack, front of buffer, etc., we're then going to look them up in our embedding matrix and come up with a dense representation. So, you might be representing
words as sort of a 50 or 100 dimensional word vector representation
of the kind that we’ve talked about. And so, we get those representations for
the different words as vectors, and then what we’re gonna do is just
concatenate those into one longer vector. So, any configuration of the parser is just being represented as one long vector. Well, perhaps not that long; our vectors are sort of more around 1,000 dimensions, not 10 million.
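Here is a minimal numpy sketch of that lookup-and-concatenate step, with made-up embedding matrices; the exact numbers of positions and dimensions are assumptions for illustration, not the parser's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
E_word = rng.normal(size=(10000, 50))   # word embeddings, 50-d each
E_pos  = rng.normal(size=(45, 50))      # part-of-speech-tag embeddings
E_dep  = rng.normal(size=(40, 50))      # dependency-label embeddings

def input_vector(word_ids, pos_ids, dep_ids):
    """Look up the chosen configuration positions (top of stack, front of
    buffer, etc.) and concatenate their embeddings into one long vector."""
    pieces  = [E_word[i] for i in word_ids]
    pieces += [E_pos[i] for i in pos_ids]
    pieces += [E_dep[i] for i in dep_ids]
    return np.concatenate(pieces)

# e.g. 18 word positions, 18 tag positions, 12 arc-label positions
x = input_vector(range(18), range(18), range(12))
print(x.shape)   # (2400,) -- on the order of a thousand dimensions, not 10 million
```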
Yeah? Sorry, right, the question is: what is this dependency I'm feeding in as an input? The dependency I'm feeding in here as an input is that when I've previously built some arcs that are in my arc set, I'm thinking maybe it'll be useful to use those arcs as well, to help predict the next decision. So, I'm using previous decisions on arcs as well to predict my follow-up decisions. Okay, so how do I do this? And this is essentially what
you guys are gonna build. From my configuration,
I take things out of it. I get their embedding representations, and I concatenate them together, and that's my input layer. I then run that through a hidden layer of a feedforward neural network. From the hidden layer, I run that through a softmax layer, and I get an output layer, which is a probability distribution over my different actions, in the standard softmax way. And of course, I don't know what any of these numbers are gonna be. So, what I'm gonna be doing is I'm going to be using cross-entropy error and then back-propagating down to learn things. And this is the whole model, and it learns super well, and it produces a great dependency parser.
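Here is a minimal numpy sketch of that whole model, just to fix the shapes: the concatenated embedding vector goes through one hidden layer, then a softmax over the actions, and training minimizes the cross-entropy of the correct action. The layer sizes are made up, and it uses a ReLU hidden layer as suggested for the assignment; this is an illustration, not the paper's or the assignment's actual code.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    z = z - z.max()                       # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def forward(x, W1, b1, W2, b2):
    h = relu(W1 @ x + b1)                 # hidden layer over the concatenated embeddings
    return softmax(W2 @ h + b2)           # probability distribution over parser actions

def cross_entropy(p, gold_action):
    # Loss for one training configuration whose correct action is gold_action;
    # its gradients get back-propagated to all weights and embeddings.
    return -np.log(p[gold_action])

# Toy shapes: 2400-d input, 200 hidden units, 3 actions (shift / left arc / right arc).
rng = np.random.default_rng(0)
x  = rng.normal(size=2400)
W1 = rng.normal(size=(200, 2400)) * 0.01
b1 = np.zeros(200)
W2 = rng.normal(size=(3, 200)) * 0.01
b2 = np.zeros(3)
p = forward(x, W1, b1, W2, b2)
print(p, cross_entropy(p, gold_action=0))
```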
I'm running a tiny bit short of time, but let me just, I think I'll have to rush this, but I'll just say it. So, non-linearities. We've mentioned
non-linearities a little bit. We haven’t said very much about them, and I just want to say a couple more
sentences on non-linearities. Something like a softmax: you can say that using a logistic function gives you a probability distribution, and that's kind of what you get in generalized linear models in statistics. In general, though, for neural networks, having these non-linearities sort of lets us do function approximation by putting together these various
neurons that have some non-linearity. We can sorta put together little
pieces like little wavelets to do functional approximation. And the crucial thing to notice is you
have to use some non-linearity, right? Deep networks are useless unless you put
something in between the layers, right? If you just have multiple linear layers, they could just be collapsed down into one linear layer, because the composition of linear transformations, of affine transformations, is just an affine transformation. So deep networks without
non-linearities do nothing, okay? And so we’ve talked about
logistic non-linearities. A second very commonly used
non-linearity is the tanh non-linearity. Tanh is normally written a bit differently, but if you actually do your little bit of math, tanh is really the same as a logistic, just sort of stretched and moved a little bit. And so tanh has the advantage that
it’s sort of symmetric around zero. And so that often works a lot better
if you’re putting it in the middle of a new neural net. But in the example I showed you earlier,
But in the example I showed you earlier, and for what you guys will be using for the dependency parser, the suggestion for the first layer is this linear rectifier layer. And linear rectifier non-linearities are kind of freaky. They're not some interesting curve at all. Linear rectifiers just map things to zero if they're negative, and they're linear if they're positive. And when these were first introduced,
I thought these were kind of crazy. I couldn’t really believe that these
were gonna work and do anything useful. But they’ve turned out to
be super successful, so in the middle of neural networks, these
days, often the first thing you try, and often what works the best, is what's called a ReLU, a rectified linear unit. And they just sort of effectively
have these nice properties where if you’re on the positive
side the slope is just 1, which means that they transmit error in the back-propagation step really well, linearly back down through the network. And if they go negative, that gives enough of a non-linearity that they're just sort of being turned off in certain configurations. And so these ReLU non-linearities have just been super, super successful. And that's what we suggest that you use in the dependency parser.
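And here is a tiny sketch of the rectifier itself and its derivative, matching that description: slope 1 on the positive side, so error back-propagates through unchanged, and zero on the negative side, so the unit is simply switched off. This is just an illustration, not the assignment code.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Slope is 1 where the unit is active, 0 where it has been turned off.
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))        # [0.  0.  0.  0.5 2. ]
print(relu_grad(z))   # [0. 0. 0. 1. 1.]
```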
Okay, so I should stop now. But this kind of putting a neural network into a transition-based parser was just a super successful idea. So if any of you heard about the Google announcements of Parsey McParseface and SyntaxNet for their kind of open-source dependency parser, it's essentially exactly the same idea as this, just done with a bigger, scaled-up, better-optimized neural network. Okay, thanks a lot.
