There’s a lot of data on the web – but sometimes getting to it can be quite a challenge.
This post is an introduction to using Python to read data out of HTML source. Python already ships with (at least) two useful classes designed explicitly for this task: HTMLParser (in the module of the same name) and SGMLParser (in the sgmllib module). As near as I can tell, they're quite similar, so I will focus on SGMLParser (SGML stands for Standard Generalized Markup Language). The official Python documentation for this class is slightly lacking if you've never used a parser before, but it really isn't so hard in the end.
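Both classes work the same way: you subclass them and override callback methods that fire as tags and text stream past. In modern Python 3 both of these Python 2 modules are gone, and the equivalent is html.parser.HTMLParser; here is a minimal sketch of the callback pattern (the sample HTML string is invented for illustration):

```python
from html.parser import HTMLParser  # Python 3 successor to HTMLParser/sgmllib


class TitleParser(HTMLParser):
    """Collects the text between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag; tag names arrive lowercased.
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Called for the raw text between tags.
        if self.in_title:
            self.title += data


p = TitleParser()
p.feed("<html><head><title>Hourly Forecast</title></head></html>")
print(p.title)  # -> Hourly Forecast
```

The parser never returns anything directly; you stash whatever you want in instance attributes as the callbacks fire, then read them off afterward.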
The first thing to do is to find a website that you want to collect data from. I'm going to choose the website http://www.weather.com – more specifically, I'm going to read the New York City hourly weather predictions at http://www.weather.com/weather/hourbyhour/graph/USNY0996. To read the data from a website, you need to understand how it is stored in the HTML. This will have to involve actually looking at the HTML!
Of course you can do this from your web browser – but at some point we need to get the HTML into Python, so let's do it that way.
import urllib

sock = urllib.urlopen("http://www.weather.com/weather/hourbyhour/graph/USNY0996")
wsource = sock.read()
There’s actually a lot more you can do with urllib, but this code just opens the URL, reads in the HTML, and saves it in a string. When we print it out, we see a lot of HTML that we’re going to have to sort through in order to figure out how our data is stored. The first thing you might notice is that The Weather Channel is trying to collect information it has stored in its cookie. Let’s not worry about that, and instead notice that the data we’re interested in is stored in a table, separated by tags like this one:
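A note for readers on Python 3: urllib.urlopen moved to urllib.request.urlopen, and read() now returns bytes that must be decoded into a string. A sketch of the equivalent fetch (the URL is the one from this post and may no longer resolve):

```python
from urllib.request import urlopen  # Python 3 home of urlopen


def fetch(url):
    """Download a page and return its HTML as a str."""
    sock = urlopen(url)
    try:
        # read() returns bytes in Python 3, so decode before doing string work
        return sock.read().decode("utf-8", errors="replace")
    finally:
        sock.close()


# wsource = fetch("http://www.weather.com/weather/hourbyhour/graph/USNY0996")
```

The actual call is left commented out since it requires a live network connection (and the page layout has surely changed since this was written).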
Other websites will be organized differently, but what generally seems to be the case (especially if the data is inserted into the HTML by some automated process) is that the data will sit inside tags.
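To make that concrete, here is a sketch of a parser that collects whatever text sits inside a table's <td> cells. It is written against Python 3's html.parser (the successor to these Python 2 modules), and the sample row is invented for illustration – it is not the actual weather.com markup:

```python
from html.parser import HTMLParser


class CellParser(HTMLParser):
    """Collects the text of every <td> cell it encounters."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # Entering a cell: start accumulating a new string
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data


parser = CellParser()
parser.feed("<table><tr><td>9 am</td><td>72F</td></tr></table>")
print(parser.cells)  # -> ['9 am', '72F']
```

Once you know which tags fence in the values you want, the same pattern – flip a flag on the opening tag, accumulate in handle_data, flip it off on the closing tag – recovers the data no matter how messy the surrounding page is.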