## Finding Dependency Trees Two – Georgia Tech – Machine Learning

Right. So in order for this to be minimized, you would have to have picked a parent that tells you a lot about yourself. All right. Because entropy is information; entropy is randomness. And if I pick a really good parent, then knowing something about the parent tells me something about me, and my entropy will be low. So if I can find the set of parents for each of my features such that I get the most out of knowing the value of those parents, then I will have the lowest sum of conditional entropies. You with me?>>Yeah.>>Okay, now this is very nice, and you would think we'd be done, except it's not entirely clear how you would go about computing this. Actually, I know how you go about computing it, and let me tell you, it's excruciatingly painful. But it turns out that there's a cute little trick you can do to make it less excruciatingly painful, and what I'm going to do is define a slightly different version of this function, which I'm going to call J prime. So as I said before, we want to minimize this particular cost function J, which we get directly from the Kullback–Leibler divergence. All I've done is define a new function J prime, where I've added this term: minus the sum of all of the unconditional entropies of each of the features. Now, I'm able to do this because nothing in this term depends upon pi, and so it doesn't actually change the best pi. Does that make sense?>>Yeah, except I keep thinking about pie.>>Mm, pie. Mm, I'm sorry, you got me distracted. Okay, but do you see how minimizing this versus minimizing this should give me the same pi?>>I do. I mean, it's sort of like adding a constant when you've got a max. It doesn't change which element gives you the max.>>Right. But by adding this term, I've actually come up with something kind of cute. What is this expression, Michael? Do you know?>>It looks kind of familiar from information theory. Is that cross-entropy?>>No. Well, sort of, but no.
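A small Python sketch of the point being made here (not from the lecture; the function names and toy data are illustrative): subtracting the pi-independent term, the unconditional entropy, shifts every candidate's score by the same amount, so the same parent minimizes both J and J prime.

```python
import math
from collections import Counter

def entropy(samples):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())

def cond_entropy(xs, parents):
    """H(X | Parent): entropy of X within each parent value, weighted by frequency."""
    n = len(xs)
    groups = {}
    for x, p in zip(xs, parents):
        groups.setdefault(p, []).append(x)
    return sum(len(g) / n * entropy(g) for g in groups.values())

# Toy data: x1 is a copy of x0, while x2 is only weakly related to x1.
x0 = [0, 0, 1, 1, 0, 1, 0, 1]
x1 = [0, 0, 1, 1, 0, 1, 0, 1]
x2 = [0, 1, 0, 1, 1, 0, 1, 0]

# Score each candidate parent of x1 under J and under J prime.
for name, parent in [("x0", x0), ("x2", x2)]:
    j = cond_entropy(x1, parent)        # this feature's term of J(pi)
    j_prime = j - entropy(x1)           # same term after subtracting H(x1)
    print(name, round(j, 3), round(j_prime, 3))
```

Both columns rank x0 ahead of x2, because the `- entropy(x1)` shift is the same for every candidate parent.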
Is it mutual information?>>It is, in fact, mutual information.>>In fact.>>It's the negative of mutual information. So minimizing this expression is the same thing as maximizing mutual information. Now, I went through this for a couple of reasons, Michael. One is, I think it's easier to kind of see what mutual information is. But also because this is going to induce a very simple algorithm, which I'll show you in a second, for figuring out how to find a dependency tree. And the trick here is to realize that conditional entropies are directional. Right?>>Wait, so conditional entropy is, is, oh, is that, that's the quantity that we started with?>>Right.>>At the top, with the H, okay, huh.>>So this is directional, right? Xi depends upon this.>>Yeah.>>And if I were to do H of pi given xi, I would get a completely different number. Mutual information, on the other hand, is bidirectional.>>Interesting.>>It's going to turn out to be easier to think about. But before we do that, let's just make certain that this makes sense. We wanted to minimize the Kullback–Leibler divergence, because that's what Shannon told us to do. We worked it all out, and it turned out we really want to minimize this kind of cost function, which is another way of writing a sum of conditional entropies. We threw this little trick in, which just allows us to turn those conditional entropies into mutual information. And what this basically says is that to find the best pi, the best parents, the best dependency tree, means you should maximize the mutual information between every feature and its parent.>>Nice.>>Right. And that sort of makes sense, right? I want to be associated with the parent that gives the most information about me.>>Charles, I have a question.>>Yes?>>I'm just a little bit confused.
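The directionality point can be made concrete with a short Python sketch (not from the lecture; names and data are illustrative): conditional entropy changes when you swap its arguments, but mutual information, I(X;Y) = H(X) - H(X|Y), comes out the same either way.

```python
import math
from collections import Counter

def entropy(samples):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())

def cond_entropy(xs, ys):
    """H(X | Y) -- directional: in general H(X|Y) != H(Y|X)."""
    n = len(xs)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def mutual_info(xs, ys):
    """I(X; Y) = H(X) - H(X | Y)."""
    return entropy(xs) - cond_entropy(xs, ys)

# y merges pairs of x's values, so x determines y but not vice versa.
x = [0, 1, 2, 3, 0, 1, 2, 3]
y = [0, 0, 1, 1, 0, 0, 1, 1]

print(cond_entropy(x, y), cond_entropy(y, x))  # different numbers
print(mutual_info(x, y), mutual_info(y, x))    # the same number
```

Here H(x|y) is 1 bit while H(y|x) is 0 bits, yet I(x;y) and I(y;x) are both 1 bit, which is why the mutual-information form is easier to work with.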
The xi's, what are they, where's that bound?>>Oh, by the summation.>>The what summation?>>You know, the summation that's always been there the entire time I've been talking.>>Oh, I'm sorry I missed that.>>I don't know how you missed it, man, it's right there.>>Yeah, I mean, you didn't actually put the index in the summation, so I guess I was confused.>>My fault. Right, so just to be clear, for anyone who noticed, the [INAUDIBLE] version is a summation over all possible variables in the distribution. And what we've done here is carry that all the way through, and so these two steps here in particular: this new cost function that we're trying to minimize is a sum over the negative mutual information between every variable, every feature, and its parent. And that's the same thing as trying to maximize the sum of the mutual informations between every feature and its parent.>>Hmm, interesting.>>Right, and again, I think that makes a lot of sense, right? Basically, the best dependency tree is the one that captures the dependencies the best.>>I see. All right. That's cool. All right, so now we need to figure out how to do that optimization.>>Exactly right. And it turns out it's gloriously easy.
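The lecture stops just before the algorithm, but the classic way to maximize the total mutual information over a tree is Chow and Liu's method: build a maximum spanning tree where edge weights are pairwise mutual informations. A sketch under that assumption (the function names, greedy Prim-style search, and toy data are mine; this assumes discrete features):

```python
import math
from collections import Counter

def entropy(col):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(col)
    return -sum((c / n) * math.log2(c / n) for c in Counter(col).values())

def mutual_info(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def dependency_tree(columns):
    """Greedy maximum spanning tree over pairwise mutual information.
    Returns the parent index of each feature (the root's parent is None)."""
    n = len(columns)
    parent = [None] * n
    in_tree = {0}                          # pick feature 0 as an arbitrary root
    while len(in_tree) < n:
        best = None
        for i in range(n):                 # each feature not yet in the tree...
            if i in in_tree:
                continue
            for j in in_tree:              # ...considers each tree node as parent
                mi = mutual_info(columns[i], columns[j])
                if best is None or mi > best[0]:
                    best = (mi, i, j)
        _, i, j = best                     # attach the highest-MI edge
        parent[i] = j
        in_tree.add(i)
    return parent

# Toy data: x1 copies x0, x2 is the complement of x0, x3 is noise.
cols = [
    [0, 0, 1, 1, 0, 1, 0, 1],
    [0, 0, 1, 1, 0, 1, 0, 1],
    [1, 1, 0, 0, 1, 0, 1, 0],
    [0, 1, 1, 0, 1, 0, 0, 1],
]
print(dependency_tree(cols))
```

The strongly dependent features (x1, x2) attach to the x0/x1 cluster, while the noise feature x3 contributes nothing no matter where it lands, which matches the claim that the best tree is the one that captures the dependencies best.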