Quite recently I (Matt) learned from my colleague Brett a little bit about information theory and how people have developed precise methods to tell us how much one random variable can “teach us” about another random variable. As he explained to me, this theory can be applied to assign a numerical value assessing how good a set of predictions is. (All of this is originally due to Claude Shannon.) For example, every day you wake up and the weatherman on channel 2 says it’s going to rain some days, and the weatherman on channel 4 says it’s going to rain on other days. After a year of listening to both of these predictions, you want to decide who is better so you only have to listen to one channel. How are you supposed to make your decision?

One issue is that everyone has some prior information about the weather where they live. As an extreme case, suppose you live in the desert, where it’s sunny 99% of the days. The weatherman on channel 2 is lazy – he just gets up every morning, says it’s going to be sunny, and goes home. He’s right 99% of the time, but he hasn’t helped us any! The weatherman on channel 4, though, predicts it’s going to rain 1% of the time, but when he does he’s right only 50% of the time. What is the probability that it’s going to rain when he says it’s going to rain? Obviously it’s 50% – and he has given us some real information. On the other hand, when he predicts that it’s going to be sunny, he’s right almost 99.5% of the time – not bad.
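As a quick sanity check on those percentages, here is a short computation using the rule P(A and B) = P(A | B) P(B) (a sketch; the variable names are mine, and the numbers are just the ones from the story):

```python
# Desert weather: it rains 1% of days.
p_rain = 0.01

# Channel 4's weatherman predicts rain 1% of the time,
# and when he predicts rain he is right 50% of the time.
p_predict_rain = 0.01
p_rain_given_predict_rain = 0.50

# Joint probability that he predicts rain AND it actually rains.
p_both = p_predict_rain * p_rain_given_predict_rain  # 0.005

# When he predicts sun (the other 99% of days), how often is he right?
# P(no rain | predict sun) = 1 - P(rain AND predict sun) / P(predict sun)
p_rain_and_predict_sun = p_rain - p_both             # 0.005
p_predict_sun = 1 - p_predict_rain                   # 0.99
p_correct_given_sun = 1 - p_rain_and_predict_sun / p_predict_sun

print(round(p_correct_given_sun, 4))  # 0.9949, i.e. almost 99.5%
```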

In what follows, I’m going to explain how to measure how much information the weather man (or any random variable) tells us about the weather (or any other random variable). By necessity, it’s a bit technical.

The first thing to discuss is the entropy function, which will measure “how random” our variable is. We’ll start with a single random variable which takes on the value 0 with probability $p$ and the value 1 with probability $1-p$ (imagine flipping a biased coin). We wish to measure this variable’s ‘randomness’. If $p$ is zero or one, then we should have no randomness, and if $p$ is exactly 1/2, we should have maximal randomness. We can use Shannon’s entropy function to make this measurement:

$$H(p) = -p \log p - (1-p) \log (1-p),$$
and in general for a discrete random variable $X$ this would read (we take the sum over all possible outcomes $i$ for $X$):

$$H(X) = -\sum_i P(X = i) \log P(X = i).$$
This formula might seem unmotivated, but it turns out that if you want an entropy function which satisfies some pretty intuitive conditions (for us those conditions would be that the function should be continuous in $p$, symmetric in $p$ and $1-p$, and maximized when $p = 1/2$; in general there is one last condition which I won’t go into), then this formula is forced upon you (up to a constant) – awesome! For our simple biased coin, we can graph $H(p)$ as a function of $p$: it is zero at $p = 0$ and $p = 1$ and peaks at $p = 1/2$.

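If you’d rather compute than graph, here is a minimal sketch of that entropy function (the function name is mine; values are in nats, i.e. natural logarithms, which is also what I use in the weatherman calculation below):

```python
import math

def entropy(p):
    """Shannon entropy of a biased coin, in nats (natural log)."""
    if p in (0.0, 1.0):
        return 0.0  # no randomness at all at the extremes
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

# Zero at the endpoints, symmetric in p and 1-p, maximal at p = 1/2:
for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(p, round(entropy(p), 4))

# The maximum value is log 2, about 0.6931 nats (exactly 1 bit).
```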
Now, we’ll move on to the interaction between two variables. Suppose you want to send a single bit of information to your friend (that is, just a zero or a one) across a noisy line that flips your message with probability $p$ (between 0 and 1) and transmits the correct message $1-p$ of the time. There are two random variables here: $X$, the bit you choose to send, and the output variable, $Y$, the bit your friend receives. (You can also think of the weatherman example – the weatherman says it rains or it doesn’t, and then it rains or it doesn’t.) We’d like to measure how much of the output is explained by the input – and of course the percentage of time the flips are made is key. The general formula for the conditional entropy of two variables $X$ and $Y$ is:

$$H(X \mid Y) = -\sum_{i,j} P(X = i, Y = j) \log P(X = i \mid Y = j).$$
We should read this as adding up the entropies of $X$ if we know which value of $Y$ has occurred. Let’s assume for the moment that you choose to send your bit at random, that is, $P(X = 0) = P(X = 1) = 1/2$. As a function of $p$ in our example, the conditional entropy is:

$$H(X \mid Y) = -p \log p - (1-p) \log (1-p) = H(p).$$
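If you’d like to check that numerically, here is a short sketch: it builds the joint distribution of $X$ and $Y$ for the noisy channel and applies the conditional-entropy formula directly (the helper names `cond_entropy` and `bsc_joint` are mine, not standard):

```python
import math

def cond_entropy(joint):
    """H(X|Y) = -sum_{i,j} P(X=i, Y=j) log P(X=i | Y=j), in nats.

    `joint[i][j]` is P(X = i, Y = j).
    """
    h = 0.0
    for i in range(2):
        for j in range(2):
            p_ij = joint[i][j]
            p_y = joint[0][j] + joint[1][j]  # marginal P(Y = j)
            if p_ij > 0:
                h -= p_ij * math.log(p_ij / p_y)
    return h

def bsc_joint(p):
    """Joint distribution for a fair input bit sent across a line
    that flips the bit with probability p."""
    return [[(1 - p) / 2, p / 2],
            [p / 2, (1 - p) / 2]]

# H(X|Y) for the noisy line equals the biased-coin entropy H(p):
for p in (0.1, 0.25, 0.5):
    h = cond_entropy(bsc_joint(p))
    h_p = -p * math.log(p) - (1 - p) * math.log(1 - p)
    print(round(h, 4), round(h_p, 4))  # the two columns agree
```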
I’ll let you work out what it is if you select to send the zero bit with some probability $q$ instead. There is one last definition, but we’ve already done all the hard work. We define the mutual information function of our two random variables as

$$I(X; Y) = H(X) - H(X \mid Y).$$
It turns out that this is equal to

$$I(X; Y) = H(X) + H(Y) - H(X, Y).$$
(This is always less than or equal to $H(X)$, so if you like you can divide by $H(X)$ to normalize it.) You should think of this function as the entropy of $X$ explained by $Y$. If $X$ and $Y$ are independent (so that there is no information about the first ‘encoded’ in the other), then $H(X \mid Y) = H(X)$ and the information is zero. If, on the other hand, $Y$ ‘explains’ all the entropy of $X$, then $H(X \mid Y) = 0$, so $I(X; Y) = H(X)$ (this comes straight from the definition) and the normalized information would be one.
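Both formulas for the mutual information are easy to check numerically. Here is a sketch (the joint distribution is made up purely for illustration, and the helper `H` is my own shorthand):

```python
import math

def H(probs):
    """Entropy (in nats) of an iterable of probabilities."""
    return -sum(q * math.log(q) for q in probs if q > 0)

# A small made-up joint distribution for (X, Y);
# rows are values of X, columns are values of Y.
joint = [[0.30, 0.10],
         [0.15, 0.45]]

p_x = [sum(row) for row in joint]        # marginal distribution of X
p_y = [sum(col) for col in zip(*joint)]  # marginal distribution of Y

# Conditional entropy H(X|Y) straight from its definition.
h_x_given_y = -sum(
    joint[i][j] * math.log(joint[i][j] / p_y[j])
    for i in range(2) for j in range(2)
)

# The two expressions for the mutual information agree:
info1 = H(p_x) - h_x_given_y
info2 = H(p_x) + H(p_y) - H([q for row in joint for q in row])
print(round(info1, 6), round(info2, 6))

# For independent variables the information is (essentially) zero:
indep = [[0.4 * 0.7, 0.4 * 0.3],
         [0.6 * 0.7, 0.6 * 0.3]]
gap = H([0.4, 0.6]) + H([0.7, 0.3]) - H([q for r in indep for q in r])
print(abs(gap) < 1e-9)  # True
```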

Let’s return to our weathermen. Remember our first weatherman predicts ‘no rain’ every day, and goes home. Call his ‘random’ variable $Y$ and the rain/no-rain variable $X$. Then you can compute that $H(X \mid Y)$ and $H(X)$ are the same, so that the information between the weather and the prediction is zero (as expected). On the other hand, our second weatherman predicts ‘no rain’ 99% of the time and ‘rain’ 1% of the time, but when he predicts rain he is only right 50% of the time. I worked through all the probabilities, and (using the natural logarithm) I calculated that $I(X; Y) \approx 0.0176$. The second weatherman is giving us some real information, as we expected! You can do the same thing using more complicated discrete random variables, or use continuous random variables (and all the sums become integrals as usual).
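Here is that weatherman calculation worked out in code, as a sketch (the probabilities are the ones from the story, natural logarithms throughout, and the names are mine):

```python
import math

# Joint distribution of (X = weather, Y = prediction) for channel 4.
# It rains 1% of days; he predicts rain 1% of the time and is right
# half the time when he does.
p = {
    ("rain", "rain"): 0.01 * 0.5,   # predicts rain, it rains
    ("sun", "rain"): 0.01 * 0.5,    # predicts rain, it's sunny
    ("rain", "sun"): 0.01 - 0.005,  # predicts sun, it rains anyway
    ("sun", "sun"): 0.99 - 0.005,   # predicts sun, it's sunny
}

def H(probs):
    """Entropy (in nats) of an iterable of probabilities."""
    return -sum(q * math.log(q) for q in probs if q > 0)

# Marginal distributions of the weather and of the prediction.
p_x, p_y = {}, {}
for (x, y), q in p.items():
    p_x[x] = p_x.get(x, 0.0) + q
    p_y[y] = p_y.get(y, 0.0) + q

# I(X; Y) = H(X) + H(Y) - H(X, Y)
info = H(p_x.values()) + H(p_y.values()) - H(p.values())
print(round(info, 4))  # 0.0176 nats

# Channel 2's weatherman is constant, so his Y has zero entropy and
# the mutual information with the weather is zero.
```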

I should add a word of caution – this is not a ‘utility’ measurement. Someone who hates false positives 1000 times more than false negatives might prefer the first weatherman for personal (but strange) reasons. This is just a measurement of how much information is passed between the two variables. Another word of caution is that we will not know the actual probability distribution of rain or not rain – but since we’ve been paying attention to the weather for a year, we could be happy using our observations as an approximation of that distribution.

Hi Matt,

This is really great and intriguing but I’m confused about a couple of things. Why should we read that formula as “adding up the entropies of X if we know which value of Y has occurred”? Is that supposed to be obvious?

Next, when we say X is a fair coin, how are you getting that formula for the conditional entropy? Are you not using the above formula? I’d expect to see 1/2 somewhere.

Finally, what do you mean by H(X, Y) in the definition of the mutual information function?

Thanks!

Cathy

Hi Cathy,

Thanks for the careful reading!

a) It’s not quite obvious – you have to use the formula P(A and B) = P(A | B) P(B). Then inside the summation you get:

-P(X = i | Y = j) P(Y = j) log P(X = i | Y = j).

So what you actually get is the sum of the entropies of the conditional distributions of X, weighted by the probabilities of Y.

b) I am using the above formula! It’s awesome, isn’t it? There are four terms to add up. P(X = 0, Y = 0) = (1-p)/2 = P(X = 1, Y = 1), P(X = 0, Y = 1) = P(X = 1, Y = 0) = p/2, and P(X = 0) = P(Y = 0) = 1/2. The 1/2s cancel inside the log term, and the four terms add up to the formula above.

c) Yes, I didn’t explain this. It just means the entropy of the joint distribution, defined exactly how you would think it should be (i.e. exactly analogous to the entropy of a single random variable).

Cool, thanks!
