Category Archives: Data

Data journalism

The Guardian has decided to start a data journalism page (thanks to Matt for sending me the link the other day). The page lists a number of statistics-heavy stories, such as “how countries compare in science and technology jobs” or “natural disasters of the last ten years,” and doesn’t just list the data, but also provides a way to download it in Excel format.

The data sets are very small, but still fun to look at; however, the idea of data journalism as a whole is pretty intriguing. Wouldn’t it be amazing if journalists started providing the raw data alongside stories and reports that rely on statistics? Of course, some proprietary and privacy issues would have to be treated with care, but a lot of worthwhile things could come out of such a news format.

HTML Parsing and Python

There’s a lot of data on the web – but sometimes getting to it can be quite a challenge.
This post is an introduction to using Python to read data out of HTML source. Python already has (at least) two useful parser classes designed explicitly for this task: HTMLParser and SGMLParser (the latter lives in the sgmllib module). As near as I can tell, they’re quite similar, so I will focus on SGMLParser (SGML stands for Standard Generalized Markup Language). The official Python documentation for this class is slightly lacking if you’ve never used a parser before, but it really isn’t so hard in the end.

The first thing to do is to find a website that you want to collect data from. I’m going to choose the website http://www.weather.com – more specifically, I’m going to read the New York City hourly weather predictions at http://www.weather.com/weather/hourbyhour/graph/USNY0996. To read the data from a website, you need to understand how it is stored in the HTML. This means actually looking at the HTML!

Of course you can do this from your web browser – but at some point we need to get the HTML into Python, so let’s do it that way.

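import urllib  # Python 2 standard library

# Open the URL, read the whole page source into a string, then close the connection
sock = urllib.urlopen("http://www.weather.com/weather/hourbyhour/graph/USNY0996")
wsource = sock.read()
sock.close()
print wsource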
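
The snippet above is written in Python 2 syntax (urllib.urlopen and the print statement). For anyone following along on Python 3, here is a minimal sketch of the same download step using only the standard library. The helper name fetch_html, the timeout, and the UTF-8 fallback are my own assumptions for illustration, not part of the original code.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Download a page and return its source as a text string.

    Hypothetical helper: the name, the timeout, and the UTF-8
    fallback are illustrative assumptions.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()  # bytes, not text, in Python 3
        # Use the charset declared in the response headers, if any.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")
```

Calling fetch_html("http://www.weather.com/weather/hourbyhour/graph/USNY0996") should play the role of wsource here, assuming the page is still served at that address. Either way you end up with the page source in a string, which is all the next steps need.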

There’s actually a lot more you can do with urllib, but this code just opens the URL, reads in the HTML, and saves it in a string. When we print it out, we see a lot of HTML that we’re going to have to sort through in order to figure out how our data is stored. The first thing you might notice is that the Weather Channel is trying to collect information it has stored in its cookie. Let’s not worry about that, and instead notice that the data we’re interested in is stored in a table, wrapped in tags like this one:

<div class="hbhWxTime">4pm</div>

Other websites will be organized differently, but it generally seems to be the case (especially if the data is inserted into the HTML by some automated process) that the data will sit inside tags.
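
To make the "data sits inside tags" idea concrete, here is a rough sketch of pulling the text out of every div with a given class. It uses html.parser.HTMLParser, the Python 3 home of the HTMLParser class mentioned earlier (sgmllib was removed in Python 3), so it is a stand-in for the SGMLParser approach rather than the same code. The class name ClassTextParser, the helper extract_by_class, and the sample markup are assumptions for illustration; only the hbhWxTime div comes from the page source shown above.

```python
from html.parser import HTMLParser


class ClassTextParser(HTMLParser):
    """Collect the text inside every <div> carrying a given class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0     # > 0 while we are inside a matching <div>
        self.buffer = []   # text pieces of the match in progress
        self.values = []   # finished matches, in page order

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # a nested <div> inside a match
        elif self.target_class in (dict(attrs).get("class") or "").split():
            self.depth = 1   # entering a matching <div>

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:  # the matching <div> just closed
                self.values.append("".join(self.buffer).strip())
                self.buffer = []

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def extract_by_class(html, target_class):
    """Return the text of each <div class=target_class> in html."""
    parser = ClassTextParser(target_class)
    parser.feed(html)
    parser.close()
    return parser.values


# Made-up sample markup; "hbhWxTemp" is a hypothetical class name.
sample = (
    '<div class="hbhWxTime">4pm</div>'
    '<div class="hbhWxTemp">61</div>'
    '<div class="hbhWxTime">5pm</div>'
)
print(extract_by_class(sample, "hbhWxTime"))  # prints ['4pm', '5pm']
```

The parser never sees the page as a tree. It just reports start tags, end tags, and text in the order they appear, so the depth counter is what remembers whether the current text belongs to a div we care about.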

Data

A running list of interesting data and data repositories on the web.

1. CMU stats data repository.

This repository contains a large number of data sets covering anything from baseball to body-fat statistics.

2. Kaggle

After the announcement of the “Netflix Prize,” data mining competitions such as the ones listed here are becoming more and more prevalent. These can be fun, challenging, and a great way to work with data as well as have something to talk about at an interview.

3. Economic Research Service

From the site: The International Macroeconomic Data Set provides data from 1969 through 2020 for real (adjusted for inflation) gross domestic product (GDP), population, real exchange rates, and other variables for the 190 countries and 34 regions that are most important for U.S. agricultural trade.

4. Bulk Census Data

Here you can download raw bulk US census data in usable form – very cool! Check out the post about it as well as some other resources.