Category Archives: Technical

Parallel Computing with R

One of the reasons that R can be quite slow is that by default it uses only one core, regardless of how many your machine actually runs. There are a number of ways to get better computing time using R and with almost no code overhead increase performance by at a factor of at least the number cores locally available. Most of the packages are designed for running network clusters, but they work equally, albeit likely not as quickly, well with just one machine. Luckily many of them have very nice high-level wrappers that essentially hide all of the low-level maintenance. In addition, the examples to follow provide a good introduction to parallel computing in the case you decided to take it to the next level, linking multiple machines together, etc.

I will give a brief survey on the workings of a few of these packages in view of just one machine (extending this to a ‘real’ cluster basically only requires making sure that all packages and dependencies are installed in all machines and passwordless ssh login is enabled).

Continue reading

Statistics and Algebra. An Example.

Written by : Matt

There is a developing field called algebraic statistics which explores probability and statistics problems involving discrete random variables using methods coming from commutative algebra and algebraic geometry. The basic point is that the parameters for such statistical models are often constrained by polynomial relationships – and these are exactly the subject of commutative algebra and algebraic geometry. I would like to learn something more about this relationship, so in this post I’ll describe one example that I worked through – it comes from a book on the subject written by Bernd Sturmfels. Disclaimer : the rest of this post is technical.

Continue reading

Gambling and Shannon’s Entropy Function Part 2.

In the last post I gave an introduction of Kelly’s paper where he describes optimal gambling strategies based on information received over a potentially noisy channel. Here I’ll talk about the general case, where the channel has several inputs symbols, each with a given probability  of being transmitted, and which represent the outcome of some chance event. First, we need to set up some notation:

p(s) –  probability that the transmitted symbol is s.

p(r | s) – the conditional probability that the received symbols is r given that the transmitted symbol is s.

p(r, s) –  the joint probability that the received symbol is r and the transmitted symbol is s.

q(r) –  probability that the received symbol is r.

q(s | r) – the conditional probability that the transmitted symbol is s given that the received symbols is r.

\alpha_s – the odds paid on the occurrence of s, i.e. the number of dollars returned on a one-dollar bet on s.

a(s/r) – the fraction of capital that the gambler decides to bet on the occurrence of s after receiving r.

Continue reading

Are You Ready For Some #football?

While I (Matt) was sitting here watching Monday Night Football, I decided to see who else was doing the same – especially because it’s halftime! You may have heard about Twitter – they have an awesome API which allows us to pull all sorts of data from it. If you use Python, it’s (literally) easy to install using

easy_install twitter

There’s all kinds of cool stuff we could do, but I won’t subject you everything I tried. What I ended up doing is searching for tweets which contained the text ‘MNF’ (for monday night football!), and then searching who was retweeting those tweets. This gives us a directed graph (tweeter -> retweeter) from which we can start to visualize and understand who are the “most important people” talking about the game (besides us, of course). I should say that I learned how to do some of this from the excellent O’Reilly book, “Mining the Social Network” by Matthew Russell.

The first step is to query the API to find tweets containing this tag:

import twitter
tw = twitter.Twitter(domain = "api.twitter.com")
tw.trends()

results = []
for page in range(1,10):
    results.append(tw.search(q = 'MNF', rpp = 100, page = page))
tweets = [ r['text'] \
           for result in results \
           for r in result['results'] ]

The next step is to search each tweet to decide if it was retweeted or not – this involves searching for the text ‘RT’ or ‘via’, which you are no doubt familiar with if you use twitter, and recording the name of the original tweeter. The relevant tool to do this is to use Python’s regular expression library (re), and the relevant comman is:

rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)

After stripping the user names from the retweeted tweets we are going to add the user names into a directed graph which can be done using the Python package networkx. Just loop over all retweeted tweets from the step above, and add them to the graph

g = networkx.DiGraph()
...
g.add_edge(s, tweet["from_user"], {'tweet_id' : tweet['id'] } )

There’s all kinds of cool stuff you can do with this graph object, but I’m just going to skip most of it and show you the picture (since I have to get back to the game, of course). I manipulated it so that we only see the largest connected components of our graph:

There you go, the most important (i.e. had their tweets retweeted the most) MNF watchers are ‘ESPN’, ‘Sportscenter’, ‘JasonWitten’, ‘PeytonsHead’, ‘JordinSparks’, ‘TristinKennedy’, and ‘OmyBoyBaby’. It seems like we’re in good company!