Monthly Archives: September 2011

Are You Ready For Some #football?

While I (Matt) was sitting here watching Monday Night Football, I decided to see who else was doing the same – especially because it’s halftime! You may have heard about Twitter – they have an awesome API which allows us to pull all sorts of data from it. If you use Python, it’s (literally) easy to install using

easy_install twitter

There’s all kinds of cool stuff we could do, but I won’t subject you to everything I tried. What I ended up doing is searching for tweets which contained the text ‘MNF’ (for Monday Night Football!), and then finding who was retweeting those tweets. This gives us a directed graph (tweeter -> retweeter) from which we can start to visualize and understand who are the “most important people” talking about the game (besides us, of course). I should say that I learned how to do some of this from the excellent O’Reilly book, “Mining the Social Web” by Matthew Russell.

The first step is to query the API to find tweets containing this tag:

import twitter
tw = twitter.Twitter(domain = "")

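results = []
for page in range(1, 10):
    # NOTE: the method name and query keyword were lost from this line;
    # tw.search(q=...) is an assumed reconstruction of the search call.
    results.append(tw.search(q='MNF', rpp=100, page=page))
# Flatten the pages of results into a single list of tweet texts
tweets = [r['text']
          for result in results
          for r in result['results']]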

The next step is to search each tweet to decide whether it is a retweet – this involves searching for the text ‘RT’ or ‘via’, which you are no doubt familiar with if you use Twitter, and recording the name of the original tweeter. The tool for this is Python’s regular expression library (re), and the relevant pattern is:

import re
rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)
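To make the pattern a little more concrete, here is a minimal sketch of how it might be applied to a single tweet's text. The helper name original_tweeters and the sample tweet are made up for illustration; only rt_patterns comes from the step above.

```python
import re

rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)

def original_tweeters(text):
    """Return the user names credited after 'RT' or 'via' in a tweet's text.

    Hypothetical helper, not part of the original post.
    """
    sources = []
    # Each match is a (marker, names) pair, e.g. ('RT', ' @ESPN')
    for _marker, names in rt_patterns.findall(text):
        # Pull every @name out of the matched run, dropping the '@'
        sources.extend(re.findall(r"@(\w+)", names))
    return sources

# Made-up sample tweet, for illustration only
example = "RT @ESPN: Cowboys lead at halftime #MNF"
print(original_tweeters(example))
```

A tweet with no ‘RT’ or ‘via’ credit simply gives back an empty list. Note that the pattern is deliberately loose, so any word ending in ‘rt’ or ‘via’ that is directly followed by an @name will also match.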

After stripping the user names from the retweeted tweets, we add them to a directed graph, which can be done with the Python package networkx. Just loop over all the retweeted tweets from the step above and add an edge for each one:

import networkx
g = networkx.DiGraph()
# s is the original tweeter's name pulled out of the retweet text
g.add_edge(s, tweet["from_user"], tweet_id=tweet["id"])
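Putting the loop together, a rough sketch might look like the following. It assumes each tweet is a dict with 'text', 'from_user' and 'id' fields (the field names used above); the function name build_retweet_graph and the sample records are invented for illustration.

```python
import re
import networkx

rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)

def build_retweet_graph(tweets):
    """Build a directed graph with one edge: original tweeter -> retweeter.

    Hypothetical helper; assumes each tweet is a dict with
    'text', 'from_user' and 'id' keys.
    """
    g = networkx.DiGraph()
    for tweet in tweets:
        for _marker, names in rt_patterns.findall(tweet["text"]):
            for s in re.findall(r"@(\w+)", names):
                # Keep the tweet id on the edge as an attribute
                g.add_edge(s, tweet["from_user"], tweet_id=tweet["id"])
    return g

# Made-up sample records, for illustration only
sample_tweets = [
    {"id": 1, "from_user": "fan_a", "text": "RT @ESPN: Halftime on #MNF"},
    {"id": 2, "from_user": "fan_b", "text": "Good game so far (via @ESPN)"},
    {"id": 3, "from_user": "fan_c", "text": "Watching #MNF"},
]
g = build_retweet_graph(sample_tweets)
print(sorted(g.edges(data=True)))
```

Tweets that credit nobody (like the third sample) never touch the graph, so only people involved in at least one retweet show up as nodes.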

There’s all kinds of cool stuff you can do with this graph object, but I’m just going to skip most of it and show you the picture (since I have to get back to the game, of course). I manipulated it so that we only see the largest connected components of our graph:

There you go: the most important MNF watchers (i.e., the ones whose tweets were retweeted the most) are ‘ESPN’, ‘Sportscenter’, ‘JasonWitten’, ‘PeytonsHead’, ‘JordinSparks’, ‘TristinKennedy’, and ‘OmyBoyBaby’. It seems like we’re in good company!
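For the curious, here is one way the trimming and ranking could be sketched. This is not necessarily what I ran: it assumes networkx 2 or later, keeps only the single largest weakly connected component, and ranks by out-degree, which counts distinct retweeters rather than total retweets. The helper names and the toy graph are made up.

```python
import networkx

def largest_component(g):
    """Return a copy of the subgraph on the largest weakly connected component."""
    biggest = max(networkx.weakly_connected_components(g), key=len)
    return g.subgraph(biggest).copy()

def most_retweeted(g, n=5):
    """Rank users by out-degree, i.e. by how many distinct users retweeted them."""
    ranked = sorted(g.out_degree(), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

# Toy graph, for illustration only: edges point tweeter -> retweeter
toy = networkx.DiGraph()
toy.add_edges_from([("hub", "a"), ("hub", "b"), ("hub", "c"), ("x", "y")])

core = largest_component(toy)
print(sorted(core.nodes()))
print(most_retweeted(core, n=1))
```

Edges point from the original tweeter to the retweeter, so a high out-degree means a lot of different people passed your tweets along.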

Data journalism

The Guardian has decided to start a data journalism page (thanks to Matt for sending me the link the other day). The page lists a number of statistics-heavy stories, such as “how countries compare in science and technology jobs” or “natural disasters of the last ten years,” and doesn’t just list the data, but also provides a way to download it in Excel format.

The data sets are very small, but still fun to look at; however, the idea of data journalism as a whole is pretty intriguing. Wouldn’t it be amazing if journalists started providing the raw data along with stories and reports that rely on statistics? Of course, some proprietary and privacy issues would have to be treated with care, but a lot of worthwhile things could come out of such a news format.

Rating 401(k)’s

There is a company called Brightscope that has been rating most of the big 401(k) plans, and this has certainly ruffled a few feathers in the investment advisor business (see full article here). This is far from an open source model in the spirit of what Cathy proposes, because it is unclear exactly how Brightscope itself rates the plans. Still, it is an indication that there is public demand for such things, and perhaps a first step of sorts in the “open” direction. If Brightscope made its rating system available, that might help elucidate the situation. In addition, in a Wikileaks-type move, the company published the names and disciplinary records of thousands of stockbrokers and investment advisors.

Perhaps it is time to create open source models to rate a variety of companies and public goods, especially ones that require more expertise and technical knowledge.

Google and the immediate access to information

There was an interesting article by James Gleick in the NY Review of Books a little while ago. The article is a good overview of the evolution of Google, its role in today’s world, and its plans for the future. One bit that especially struck me is the transcript of a conversation with Google’s founders Page and Brin on how they see the company shaping our interaction with information.

“It [Google] will be included in people’s brains,” said Page. “When you think about something and don’t really know much about it, you will automatically get information.”

“That’s true,” said Brin. “Ultimately I view Google as a way to augment your brain with the knowledge of the world. Right now you go into your computer and type a phrase, but you can imagine that it could be easier in the future, that you can have just devices you talk into, or you can have computers that pay attention to what’s going on around them….”

…Page said, “Eventually you’ll have the implant, where if you think about a fact, it will just tell you the answer.”

There is something more than just the initial terror of a pseudo-cyborg utopia, or how the founders of the company that has essentially become the gateway to information see themselves re-inventing the human. There is also an underlying confusion between (immediate) access to information, knowledge, and understanding, as well as how these interact.

Google is not a means of access to knowledge; it’s a portal to information, and it is up to the recipient to give that information the necessary context. Most of the information on the web comes with very limited context, and at best provides factual accuracy. Of course, immediate access to such facts can be helpful and can also aid in a process, but I am not convinced that the world that Brin and Page imagine is necessarily one of progress and innovation – it is certainly not one of understanding, which is a completely different story altogether. Many of the great strides in human civilization, whether in art, science, technology, etc., have come from limitations: from boundaries and the inability to immediately answer questions, to have particular tools or freedoms, or to gather certain information. These are no great insights, but perhaps ones that Google is ignoring.