There was a data mining competition put up on Kaggle a few months ago, where the challenge was to predict Wikipedia editor behavior. The data set contained a log of the behavior of about 50,000 editors over a period of roughly 5 years, with the main file containing roughly 222 million rows. Due to the volume, I decided to use Python over R for this particular investigation. Similar to the data that was looked at in the Naive Bayes post, the distribution here was highly skewed with a long tail: most editors had very little activity, whereas a few were incredibly active (the busiest having over 300,000 edits in this period of 5 years!).

Since we are dealing with human behavior over a prolonged period of time, I thought it would be interesting to see if there were any cyclical patterns for these very active editors, and perhaps use this later in the prediction process. One way to proceed is to use lagged auto-correlation plots, as described in this great post by Cathy. The basic idea is to take a time series and produce a list of correlations, or another series, where the correlation coefficient at lag k is obtained by shifting the series by k ticks and correlating it with the original. If there are any cycles in the original time series, these will show up as spikes in the plot of the correlation coefficient series.
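To make the idea concrete, here is a toy sketch on made-up data (the series, the 7-tick cycle, and the function name are mine, not from the competition data): a series that repeats every 7 ticks produces a large correlation at lag 7 and a much smaller one at an off-cycle lag.

```python
import numpy as np

# Illustrative only: a series with a 7-tick cycle plus a little noise
rng = np.random.default_rng(0)
n = 700
base = np.array([5, 0, 0, 1, 0, 0, 2] * (n // 7), dtype=float)
series = base + rng.normal(0, 0.1, n)

def lagged_corr(x, lag):
    # Correlate the series with a copy of itself shifted by `lag` ticks
    return np.corrcoef(x[lag:], x[:-lag])[0, 1]

spike = lagged_corr(series, 7)  # at the cycle length: close to 1
off = lagged_corr(series, 3)    # off-cycle: much lower
```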

I decided to start with the 300k-plus editor.

First I read the data into Python using the csv module, and from that created two lists, ‘memberId’ and ‘timestamp.’ Since each editor appears many times in the memberId list and the timestamps are not in chronological order, I first needed to take care of two things: find the editor with the highest number of edits and create a (chronological) time series for him.
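The loading step might look something like the sketch below; note that the file name, the tab delimiter, and the column order are my assumptions, not the competition's actual layout.

```python
import csv

def load_log(path):
    # Read the edit log into two parallel lists. The tab-delimited
    # layout and column order here are assumptions for illustration.
    memberId, timestamp = [], []
    with open(path) as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader)                  # skip the header row
        for row in reader:
            memberId.append(row[0])   # editor id
            timestamp.append(row[1])  # edit timestamp string
    return memberId, timestamp
```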

I made a list of unique editors:

umembers = set(memberId)

Then I created a list of time series, one for each editor in umembers. Since I could not find a function in the Python standard library that takes a list and a value and returns a list of all indexes where that value occurs, I just wrote one.

timeseries = []

def all_indices(qlist, value):
    indices = []
    idx = -1
    while 1:
        try:
            idx = qlist.index(value, idx + 1)
            indices.append(idx)
        except ValueError:
            break
    return indices

for mem in umembers:
    index = all_indices(memberId, mem)
    ts = [timestamp[i] for i in index]
    timeseries.append(ts)
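As an aside, calling all_indices once per editor rescans the full 222-million-row list for every member. A single pass that buckets timestamps by editor id avoids this; a sketch (reusing the memberId/timestamp names from above, function name is mine):

```python
from collections import defaultdict

def group_timestamps(memberId, timestamp):
    # One pass over the log: collect each editor's timestamps in order
    by_member = defaultdict(list)
    for mem, ts in zip(memberId, timestamp):
        by_member[mem].append(ts)
    return by_member
```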

From this I could figure out which editor had the most edits and pick him out of the list.

m = max(len(i) for i in timeseries)  # length of the longest time series
mguy = []
for t in timeseries:
    if len(t) == m:
        mguy = t

For the rest of the analysis I needed to do some arithmetic with timestamps, and the datetime module from the Python standard library is very good for this, so I decided to convert the mguy time series to datetime objects.

from datetime import datetime

mguydt = []
for item in mguy:
    mguydt.append(datetime.strptime(item, "%Y-%m-%d %H:%M:%S"))

Now I was set to aggregate the data by the number of edits mguy had each day. I chose January 1st, 2005 as a starting point.

from datetime import datetime

begin = datetime(2005, 1, 1, 0, 0)
timediff = [item - begin for item in mguydt]
timediff_days = [item.days for item in timediff]
ts = [timediff_days.count(i) for i in range(0, max(timediff_days) + 1)]
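One note on the last line: `timediff_days.count(i)` rescans the list once per day, which gets slow over a 5-year span. collections.Counter does the same aggregation in one pass; a sketch (the function name is mine):

```python
from collections import Counter

def daily_counts(timediff_days):
    # One pass: tally edits per day offset, filling empty days with 0
    c = Counter(timediff_days)
    return [c[i] for i in range(max(timediff_days) + 1)]
```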

So now that I had my list of daily edits for mguy I was ready to create a lagged auto-correlation graph and see if there were any patterns in mguy’s behavior.

from matplotlib.pylab import *

def lag_cor(data, lag):
    in1 = [data[i] for i in range(lag, len(data))]
    in2 = [data[i] for i in range(0, len(data) - lag)]
    return corrcoef(in1, in2)[1, 0]

cor = [lag_cor(ts, i) for i in range(0, 501)]

figure()
plot(cor)
show()

And the resulting plot is:

Immediately I could see that, at least at first, there were some pretty regular spikes at roughly 50-day intervals, supporting the idea of behavior cycles. Moreover, it took a while for the graph to drop off, with the first 30 lags or so staying above a correlation of 0.3. In another post I’ll talk about how this can be useful in trying to predict this editor’s future behavior.
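As a closing aside, the same lagged correlations can be computed with pandas, which wasn't used in this post; `Series.autocorr(lag)` correlates the series with a shifted copy of itself (its handling of the truncated ends differs very slightly from the lag_cor function above). A sketch, assuming `ts` is the daily-count list built earlier:

```python
import pandas as pd

def lag_corrs(daily_counts, max_lag=500):
    # Lagged auto-correlations via pandas: each entry correlates the
    # series with a copy of itself shifted by `lag` days
    s = pd.Series(daily_counts, dtype=float)
    return [s.autocorr(lag) for lag in range(max_lag + 1)]
```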