Category Archives: Data Mining

Parallel Computing with R

One of the reasons that R can be quite slow is that by default it uses only one core, regardless of how many your machine actually has. There are a number of ways to get better computing time out of R and, with almost no code overhead, speed things up by a factor approaching the number of cores available locally. Most of the packages are designed for running on network clusters, but they work equally well, albeit likely not as quickly, on just one machine. Luckily, many of them have very nice high-level wrappers that essentially hide all of the low-level maintenance. In addition, the examples to follow provide a good introduction to parallel computing in case you decide to take it to the next level by linking multiple machines together.

I will give a brief survey of how a few of these packages work on a single machine (extending this to a ‘real’ cluster basically only requires making sure that all packages and dependencies are installed on all machines and that passwordless ssh login is enabled).



Are You Ready For Some #football?

While I (Matt) was sitting here watching Monday Night Football, I decided to see who else was doing the same – especially because it’s halftime! You may have heard about Twitter – they have an awesome API which allows us to pull all sorts of data from it. If you use Python, it’s (literally) easy to install using

easy_install twitter

There’s all kinds of cool stuff we could do, but I won’t subject you to everything I tried. What I ended up doing is searching for tweets which contained the text ‘MNF’ (for Monday Night Football!), and then finding out who was retweeting those tweets. This gives us a directed graph (tweeter -> retweeter) from which we can start to visualize and understand who are the “most important people” talking about the game (besides us, of course). I should say that I learned how to do some of this from the excellent O’Reilly book, “Mining the Social Web” by Matthew Russell.

The first step is to query the API to find tweets containing this tag:

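import twitter
tw = twitter.Twitter(domain = "")

# Fetch pages 1 through 9 of search results, up to 100 tweets per page
results = []
for page in range(1, 10):
    results.append(tw.search(q = 'MNF', rpp = 100, page = page))
# Flatten the pages into a single list of tweet texts
tweets = [ r['text']
           for result in results
           for r in result['results'] ]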

The next step is to search each tweet to decide whether it is a retweet. This involves searching for the text ‘RT’ or ‘via’, which you are no doubt familiar with if you use Twitter, and recording the name of the original tweeter. The relevant tool is Python’s regular expression library (re), and the relevant command is:

import re
rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)
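As a rough illustration of how this pattern can be used, here is a minimal sketch that pulls the credited user names out of a tweet's text. The helper name `retweet_sources` and the sample tweets are made up for this example and are not part of the original analysis.

```python
import re

# Same pattern as above: "RT" or "via", followed by one or more @mentions.
rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)

def retweet_sources(text):
    """Return the user names credited in a retweet, or [] if there are none."""
    sources = []
    for _marker, mentions in rt_patterns.findall(text):
        # mentions looks like " @user1 @user2"; pull out each bare user name
        sources.extend(re.findall(r"@(\w+)", mentions))
    return sources

# Hypothetical tweets, made up for illustration
examples = [
    "RT @espn: Halftime on MNF",
    "Great catch! (via @SportsCenter)",
    "Just watching the game",
]
for text in examples:
    print(text, "->", retweet_sources(text))
```

Because the pattern is case-insensitive and not anchored to word starts, it can also fire on words that merely end in "rt" or "via" and are followed by a mention, so some manual checking of the matches is probably worthwhile.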

After stripping the user names from the retweeted tweets, we add them to a directed graph, which can be done using the Python package networkx. Just loop over all retweeted tweets from the step above and add them to the graph:

import networkx
g = networkx.DiGraph()
# s is the original tweeter; keyword edge attributes work across networkx versions
g.add_edge(s, tweet["from_user"], tweet_id = tweet['id'])

There’s all kinds of cool stuff you can do with this graph object, but I’m just going to skip most of it and show you the picture (since I have to get back to the game, of course). I manipulated it so that we only see the largest connected components of our graph:

There you go, the most important (i.e. had their tweets retweeted the most) MNF watchers are ‘ESPN’, ‘Sportscenter’, ‘JasonWitten’, ‘PeytonsHead’, ‘JordinSparks’, ‘TristinKennedy’, and ‘OmyBoyBaby’. It seems like we’re in good company!
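To make the two graph computations concrete (counting how often each user was retweeted, and keeping only the largest connected piece), here is a small plain-Python sketch. It deliberately uses dictionaries and a breadth-first search instead of networkx, and the edge list is invented for illustration, so treat it as a picture of the idea rather than the code behind the plot.

```python
from collections import Counter, defaultdict, deque

# Hypothetical (tweeter, retweeter) edges, made up for illustration
edges = [
    ("espn", "alice"),
    ("espn", "bob"),
    ("espn", "carol"),
    ("SportsCenter", "bob"),
    ("dave", "erin"),
]

# "Importance" here = number of times a user was retweeted
retweet_counts = Counter(tweeter for tweeter, _ in edges)

# Undirected adjacency, used only to find connected components
neighbors = defaultdict(set)
for tweeter, retweeter in edges:
    neighbors[tweeter].add(retweeter)
    neighbors[retweeter].add(tweeter)

def components(neighbors):
    """Yield each connected component as a set of user names (BFS)."""
    seen = set()
    for start in list(neighbors):
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node] - component:
                component.add(nxt)
                queue.append(nxt)
        seen |= component
        yield component

largest = max(components(neighbors), key=len)
print(retweet_counts.most_common(1))
print(sorted(largest))
```

Edge direction is ignored when finding components, which corresponds to what graph libraries usually call weakly connected components.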

Lagged auto-correlation and Wikipedia editors

There was a data mining competition put up on Kaggle a few months ago, where the challenge was to predict Wikipedia editor behavior. The data set contained a log of the behavior for about 50,000 editors over a period of roughly 5 years, with the main file containing roughly 222 million rows. Due to the volume, I decided to use Python over R for this particular investigation. Similar to the data that was looked at in the Naive Bayes post, the distribution here was heavily skewed with a very long tail, i.e. most editors had very little activity, whereas a few were incredibly active (the busiest having over 300,000 edits in this period of 5 years!).

Since we are dealing with human behavior over a prolonged period of time, I thought it would be interesting to see if there were any cyclical patterns for these very active editors, and perhaps use this later in the prediction process. One way to proceed is to use lagged auto-correlation plots, as described in this great post by Cathy. The basic idea is to take a time series and produce a list of correlations, or another series, where the k-th correlation coefficient is obtained by shifting the series by k ticks and correlating it with the original. If there are any cycles in the original time series, these will show up as spikes in the plot of the correlation coefficients.
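The idea can be sketched in a few lines of plain Python. The daily edit counts below are synthetic (a made-up weekly rhythm), and the function names are my own, so this only illustrates the technique and is not the code used on the competition data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def lagged_autocorrelation(series, max_lag):
    """Correlate the series with itself shifted by 1, 2, ..., max_lag ticks."""
    return [pearson(series[:-lag], series[lag:]) for lag in range(1, max_lag + 1)]

# Hypothetical daily edit counts with a weekly rhythm: busy on days 0 and 1
edits = [30 if day % 7 in (0, 1) else 5 for day in range(70)]

correlations = lagged_autocorrelation(edits, max_lag=10)
# Lag with the highest correlation; correlations[0] corresponds to lag 1
best_lag = max(range(1, 11), key=lambda lag: correlations[lag - 1])
print(best_lag)
```

With this synthetic series the spike lands at the lag equal to the period of the cycle. Real edit logs are noisier, so the spikes would be lower and broader, and a constant series would need special handling since its variance is zero.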

I decided to start with the editor who had 300k-plus edits.


An overview of the Naive Bayes algorithm

Daniel and I (Matt DeLand) are working on a data analysis project together. For the purposes of this post, the details of the project and the data are mostly irrelevant. Basically, there is a single attribute we would like to predict – let us call this attribute Days – a fair amount of training data, and some data we are trying to make predictions on.

The ‘Days’ attribute that we are trying to predict is discrete, and can take on the values 0, 1, 2, 3, or 4 or more (some of the data has been merged, to make things more manageable). Let’s take a quick look at this data:

You’ll notice that the data is heavily skewed toward zero. In fact, all of the other attributes look quite similar. In the previous post, Daniel talks about why a Bayesian approach might be a good fit for such a scenario, so you can read that or take it for granted.

One way in which we can proceed is to try to compute the probability that Days takes on one of these values for each prediction we are trying to make. Let us assume that the attributes we are trying to predict from are labeled a_1, \ldots, a_n. Then we are trying to compute

P(Days = d | a_1, \ldots, a_n )

To do so, we can apply Bayes’ formula, which says that

P(Days = d | a_1, \ldots, a_n) = P( a_1, \ldots, a_n | Days = d) \cdot P(Days = d) / P(a_1, \ldots, a_n)
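To see the formula at work, here is a small sketch that estimates each term from counts and combines them. The training rows and attribute values are invented for illustration, and estimating the full joint term directly like this is only workable on tiny data.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical training rows: (attributes a_1..a_n as a tuple, Days)
rows = [
    (("young", "urban"), 0),
    (("young", "urban"), 0),
    (("young", "urban"), 1),
    (("old", "urban"), 0),
    (("old", "rural"), 2),
    (("old", "rural"), 0),
]

total = len(rows)
days_counts = Counter(d for _, d in rows)   # how often each Days value occurs
attr_counts = Counter(a for a, _ in rows)   # how often each attribute tuple occurs
joint_counts = Counter(rows)                # how often each (attributes, Days) pair occurs

def posterior(d, attrs):
    """P(Days = d | attrs) via Bayes' formula, with each term estimated from counts."""
    likelihood = Fraction(joint_counts[(attrs, d)], days_counts[d])  # P(attrs | Days = d)
    prior = Fraction(days_counts[d], total)                          # P(Days = d)
    evidence = Fraction(attr_counts[attrs], total)                   # P(attrs)
    return likelihood * prior / evidence

print(posterior(0, ("young", "urban")))
```

Here the result agrees with simply counting: two of the three rows with attributes ("young", "urban") have Days = 0. The sketch assumes both the Days value and the attribute tuple appear in the training rows; otherwise it divides by zero.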
