In the last few months I’ve been collaborating a lot with Matt DeLand on a number of different problems in data mining. Matt is really smart, technically proficient, and a great person to work with, so if you are looking for someone to solve your data problems, look no further. He has also been kind enough to contribute posts on a regular basis. The first of these will be a few words about coding up a version of the naive Bayes algorithm, but before he does so I decided to talk in general about how we arrived at the conclusion that this particular approach might be fruitful. Matt will talk in more detail about the problem and the data at hand, so I’ll just say by way of introduction that we are dealing with a set of about 145,000 rows and 95 attributes, where we are trying to predict an outcome which we call “Days.”

Let’s first look at a histogram of Days, to get an idea of how the outcome might be distributed.
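The histogram itself is one line of R. Here is a minimal sketch, assuming the data set has been read into a data frame called data; a synthetic, heavily zero-inflated Days column stands in for the real one so the snippet runs on its own:

```r
# Synthetic stand-in for the real data set: heavily zero-inflated counts,
# roughly mimicking the shape described below (about 85% zeros).
set.seed(1)
data <- data.frame(Days = ifelse(runif(145000) < 0.85, 0,
                                 rgeom(145000, prob = 0.4) + 1))

# Histogram of the outcome, with one bin per integer value
hist(data$Days, breaks = seq(-0.5, max(data$Days) + 0.5, by = 1),
     main = "Histogram of Days", xlab = "Days")
```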

Well, it seems like most of the output is 0. As a matter of fact, about 85% of the rows have 0 as the output, which we can verify immediately:

> length(which(data$Days==0))/nrow(data)
[1] 0.8474433

A model that predicts 0 all the time will be 85% correct, so the key here is to find one that can also deal with the small percentage of outliers. Here is the full distribution of 0 to 15 days:

> for (i in 0:15){
+   print(length(which(data$Days==i)))}
[1] 124975
[1] 9299
[1] 4548
[1] 2882
[1] 1819
[1] 1093
[1] 660
[1] 474
[1] 316
[1] 263
[1] 209
[1] 145
[1] 135
[1] 111
[1] 65
[1] 479
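The same counts can also be obtained in a single call with table; a sketch, again assuming a data frame called data with a Days column (simulated here so the snippet is self-contained):

```r
# Simulated stand-in for the real Days column (any integers in 0..15 work)
set.seed(1)
data <- data.frame(Days = sample(0:15, 1000, replace = TRUE,
                                 prob = c(0.85, rep(0.01, 15))))

# Counts for each value 0..15 in one call; factor() keeps empty levels visible
table(factor(data$Days, levels = 0:15))
```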

Now let’s take a look at some of the attributes.

The three below are good representatives of the lot.

We might potentially be able to model these as normally distributed around 0 with very shallow tails, or perhaps with a Gamma distribution with k=1. There are also a couple of attributes that exhibit a mixed-distribution phenomenon, as seen below.
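One quick way to eyeball candidate shapes like these is to overlay fitted densities on an attribute’s histogram. A sketch under the same assumptions as before; the synthetic vals column stands in for a real attribute, and a Gamma with k = 1 is just an exponential:

```r
# Synthetic non-negative attribute standing in for one of the real columns
set.seed(1)
vals <- abs(rnorm(10000))   # peaked near 0, shallow tail

# Histogram on the density scale, with two candidate fits overlaid
hist(vals, breaks = 50, freq = FALSE,
     main = "Candidate distributions", xlab = "Attribute value")
# Gamma with shape k = 1, i.e. an exponential matched by its mean
curve(dexp(x, rate = 1 / mean(vals)), add = TRUE, lty = 1)
# Half-normal centered at 0, matched by its second moment
curve(2 * dnorm(x, mean = 0, sd = sqrt(mean(vals^2))), add = TRUE, lty = 2)
```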

So not only is our model going to have to catch outliers or shallow tails, but it is also going to have to deal with attributes following different types of distributions.

One thing we could always try is to see if any of the attributes are closely correlated with the output. This could tell us if there are flags in the data which could help predict outlier outputs. Running the following few lines produces a plot of correlations between Days and all the other attributes.

s <- seq(-1, 1, length.out=ncol(data))
plot(1:ncol(data), s, type="n", xlab="Attributes", ylab="Correlation with Days")
for (c in 1:ncol(data)){
  points(c, cor(data$Days, data[,c]), col=c)
}

Other than the last dot, which is cor(Days, Days), the correlation coefficients are all pretty low, so this first naive approach did not produce any immediate insight.
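For the record, the whole correlation vector can also be computed in one call, which makes it easy to sort and pick out the strongest candidates. A sketch with a small synthetic frame standing in for the real 95-attribute data set (the column names here are illustrative):

```r
# Small synthetic frame standing in for the real data set
set.seed(1)
data <- data.frame(A1 = rnorm(500), A2 = rnorm(500), A3 = rnorm(500))
data$Days <- rpois(500, lambda = exp(data$A1))   # A1 made mildly predictive

# cor() against the whole frame returns every coefficient at once
cors <- cor(data$Days, data)
print(sort(abs(cors[1, ]), decreasing = TRUE))
```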

Now, since a Bayesian learner models the distributions directly from the data, and does not require a priori input of the distribution type, this approach becomes a reasonable next step.
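To give a flavor of what that means, here is a minimal hand-rolled sketch of a discretized naive Bayes classifier in base R. Everything here is illustrative (the synthetic data, the decile binning, the two-class labels), not the implementation Matt will describe; the point is just that the per-class likelihoods come straight from counts, with no distributional assumption:

```r
# Hand-rolled sketch of a discretized naive Bayes classifier in base R.
set.seed(1)
n <- 2000
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- factor(ifelse(df$x1 + rnorm(n) > 1, "late", "zero"))

# Discretize each attribute into decile bins, then estimate P(bin | class)
# directly from counts -- no distributional assumption needed.
bins <- function(v) cut(v, breaks = quantile(v, 0:10 / 10),
                        include.lowest = TRUE)
df$b1 <- bins(df$x1); df$b2 <- bins(df$x2)

prior <- table(df$y) / n
# "+ 1" is Laplace smoothing, so no bin gets probability exactly 0
lik1 <- prop.table(table(df$b1, df$y) + 1, margin = 2)  # P(b1 | y)
lik2 <- prop.table(table(df$b2, df$y) + 1, margin = 2)  # P(b2 | y)

# Log-posterior (up to a constant) for each row; predict the argmax class
post <- sapply(levels(df$y), function(cl)
  log(prior[cl]) +
  log(lik1[cbind(as.character(df$b1), cl)]) +
  log(lik2[cbind(as.character(df$b2), cl)]))
pred <- levels(df$y)[max.col(post)]
mean(pred == df$y)   # training accuracy of the sketch
```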