## An overview of the Naive Bayes algorithm

Daniel and I (Matt DeLand) are working on a data analysis project together. For the purposes of this post, the details of the project and the data are mostly irrelevant. Basically, there is a single attribute we would like to predict – let us call this attribute Days – a fair amount of training data, and some data we are trying to make predictions on.

The ‘Days’ attribute that we are trying to predict is discrete, and can take on the values 0, 1, 2, 3, or 4 or more (some of the data has been merged, to make things more manageable). Let’s take a quick look at this data:

You’ll notice that the data is heavily skewed toward zero. In fact, all of the other attributes look quite similar. In the previous post, Daniel talks about why a Bayesian approach might be a good fit for such a scenario, so you can read that or take it for granted.

One way in which we can proceed is to try to compute the probability that Days takes on one of these values for each prediction we are trying to make. Let us assume that the attributes we are trying to predict from are labeled $a_1, \ldots, a_n$. Then we are trying to compute

$P(Days = d | a_1, \ldots, a_n )$

One way to proceed is to apply Bayes’ formula which says that

$P(Days = d | a_1, \ldots, a_n) = P( a_1, \ldots, a_n | Days = d) \cdot P(Days = d) / P(a_1, \ldots, a_n)$

At first, you might think this looks more complicated than what we started with, but the point of the Naive Bayes approach is to simplify the computation of the terms on the right hand side. The main Simplifying Assumption of the approach is to assume that, given Days = d, that the variables $a_1, \ldots, a_n$ are independent. In a formula then,

$P(a_1, \ldots, a_n | Days = d) = \prod P(a_i | Days = d)$

Again, you might think we’ve just introduced more terms (and we have), but actually if we have enough training data then the terms on the right hand side are quite straightforward to compute. To compute $P(a_1 | Days = d)$, we can simply count all the times that $a_1$ occurs when $Days = d$ and divide by the total number of times that $Days = d$ (this is another application of Bayes’ formula!).

We do this for each possible value of $d$ to arrive at a probability distribution of the variable we are trying to predict given the prior attributes. These conditional probabilities are often called the posterior distributions. Let’s see this in action in an R function.


posterior
p1 p2 p3 p4
attributes
for (i in 1:length(attributes)) {
bin <- min(which(Attributebins[[a]] > attributes[a])) - 1
# e.g. Attributebins[i] = [0,5,10,15,20], attributes[i] = 7
p1 = p1 * Probs[[1]][[i]][bin]
p2 = p2 * Probs[[2]][[i]][bin]
p3 = p3 *Probs[[3]][[i]][bin]
p4 = p4 *Probs[[4]][[i]][bin]

}

p1 = p1 * nrow(day0) / nrow(data)
p2 = p2 * nrow(day1) / nrow(data)
p3 = p3 * nrow(day23) / nrow (data)
p4 = p4 * nrow(day4) / nrow(data)

con p1 = p1 * con
p2 = p2 * con
p3 = p3 * con
p4 = p4 * con

probabilities
return (probabilities)

}


Some explanations are in order.

First, the vector ‘attributes’ (line 1) are the attributes from which we are trying to predict the value of Days.

What we found was that there were not enough values of Days = 2 and Days = 3, so we lumped them together into one ‘bin’. This explains why we are only computing 4 probabilities (p1 corresponds to Days = 0, p2 to Days = 1, p3 to Days = 2 or 3, p4 to Days = 4 or more).

Also, our attributes could take on many different values, so we binned them together in order to make the computation of $P(a_i | Days = d)$ more reliable. The bins are stored in Attributebins (a list of vectors), and so given attribute number $i$, we first decide which bin $a_i$ falls into (line 12).

The next step is to compute the number of times that attribute $i$ occurs at the same time as Days = d, but this computation we have precomputed (since it needs doing for each prediction, so we save computation time) and stored in the Probs data frame.

We then simply keep a running product of the probabilities over all the attributes, one for each of Days = 0,1,2-3, and 4 or more (lines 15-18). At the end we multiply by the probability of Days = d, and this is just a count of the number of times days = d in the training data divided by the size of the training data (lines 22-25).

There is one last point, which is that, in the formula above we are supposed to divide by $P(a_1, \ldots, a_n)$. However, since we are computing $P(Days = d | a_1, \ldots, a_n)$ for each possible value of $d$, we know that these conditional probabilities are going to add up to 1. A little algebra should convince you that this forces $P(a_1, \ldots, a_n)$ to be the unique constant such that it multiplies with $p1 + p2 + p3 + p4$ to give one. This is the last step in the algorithm (lines 27-31). We use the probabilities to compute expected values, $0 \cdot p1 + p2 + 2.5\cdot p3 + 4 \cdot p4$. Here is a sample of what the result of applying this algorithm to the training data looks like:

You might think it doesn’t look very accurate – and you would be right! But these results are among the best predictions that we have been able to make using any approach. Naive Bayes is a very useful approach because the computations are so simple. But be careful when applying this algorithm, it may not be the case that your variables are independent! Nevertheless, this may give you a nice starting point for making predictions about future data, and we’ll discuss approaches to the non-independent case in the future.