HTML Parsing and Python

There’s a lot of data on the web – but sometimes getting to it can be quite a challenge.
This post is an introduction to using Python to read data out of HTML source. Python already ships with (at least) two packages designed explicitly for this task: HTMLParser and SGMLParser. As near as I can tell they’re quite similar, so I will focus on SGMLParser (SGML stands for standard generalized markup language). The official Python documentation for this class is a little sparse if you’ve never used a parser before, but it really isn’t so hard in the end.

The first thing to do is to find a website that you want to collect data from. I’m going to choose the website http://www.weather.com – more specifically, I’m going to read the New York City hourly weather predictions at http://www.weather.com/weather/hourbyhour/graph/USNY0996. To read the data from a website, you need to understand how it is stored in the HTML. This will have to involve actually looking at the HTML!

Of course you can do this from your web browser – but at some point we need to get the HTML into Python, so let’s do it that way.

import urllib

# open the url, read the html into a string, and close the connection
sock = urllib.urlopen("http://www.weather.com/weather/hourbyhour/graph/USNY0996")
wsource = sock.read()
sock.close()
print wsource

There’s actually a lot more you can do with urllib, but this code just opens the URL, reads in the HTML, and saves it in a string. When we print it out, we see a lot of HTML that we’ll have to sort through in order to figure out how our data is stored. The first thing you might notice is that the Weather Channel is trying to collect information it has stored in its cookie. Let’s not worry about that; instead, notice that the data we’re interested in is stored in a table, separated by tags like this one:

<div class="hbhWxTime">4pm</div>

Other websites will be organized differently, but what generally seems to be the case (especially if the data is inserted into the html by some automated process) is that the data will sit inside tags.

<div class="hbhWxDate">
<div>Sun
Sep 11</div>
</div>
<div class="hbhWxTime">
<div>6 pm</div>
</div>
<div class="hbhWxImg">
<div style="cursor: pointer;" onclick="gotoTableView('/weather/hourbyhour/USNY0996','18')"><img src="http://i.imwx.com/web/common/wxicons/45/30.gif" alt="" width="45" height="45" border="0" /></div>
</div>
<div class="hbhWxTemp">
<div>72° F</div>
</div>
<div class="hbhWxPrecip">
<div>Precip:
10%</div>
</div>

We see that we don’t actually want to read the text inside every div tag, only the ones labeled with the classes hbhWxTime, hbhWxTemp, and hbhWxPrecip. The Python SGMLParser is set up to call special methods every time a tag is opened and every time a tag is closed. For us, the relevant methods will be start_div (and end_div), along with handle_data, which is called on everything that is not a tag. Here is our Python class:

from sgmllib import SGMLParser

class weatherParser(SGMLParser):
    def reset(self):
        # reset is called by the SGMLParser constructor; clear all flags
        SGMLParser.reset(self)
        self.timeflag = 0
        self.tempflag = 0
        self.precipflag = 0
    def start_div(self, attr):
        # pick off the value of the "class" attribute, if there is one
        cl = [v for k, v in attr if k == 'class']
        if cl:
            if cl[0] == 'hbhWxTime': self.timeflag = 1
            elif cl[0] == 'hbhWxTemp': self.tempflag = 1
            elif cl[0] == 'hbhWxPrecip': self.precipflag = 1
    def handle_data(self, text):
        if self.timeflag == 1:
            print text, ":",
            self.timeflag = 0
        if self.tempflag == 1:
            print text, 'degrees.',
            self.tempflag = 0
        if self.precipflag == 1:
            # the precipitation text arrives in two pieces ("Precip:"
            # and the percentage), hence the second flag value
            print text,
            self.precipflag = 2
        elif self.precipflag == 2:
            print text
            self.precipflag = 0

The reset method, which the SGMLParser constructor calls for us, sets all the relevant flags to zero. Then every time a “div” tag is encountered, the parser calls our method start_div with the list of attributes and their values. Inside, we pick off the “class” attribute and test whether the div tag corresponds to time, temperature, or precipitation. By the time handle_data is called on the “content” of the tag, we will have flagged whether or not we are interested in that data. (Because of how the div tags are arranged on this page, the precipitation flag needs slightly different handling.)
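
As an aside for readers on Python 3, where sgmllib no longer exists, the same flag technique can be sketched with the standard library’s html.parser. The HTML snippet below is a made-up stand-in for the real weather.com page, reusing its class names:

```python
from html.parser import HTMLParser

class WeatherParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.flag = None     # which labeled div we are inside, if any
        self.records = []    # collected (class name, text) pairs

    def handle_starttag(self, tag, attrs):
        # remember when we enter one of the div tags we care about
        if tag == "div":
            cl = dict(attrs).get("class")
            if cl in ("hbhWxTime", "hbhWxTemp", "hbhWxPrecip"):
                self.flag = cl

    def handle_data(self, data):
        # record the text content of a flagged div, then clear the flag
        text = data.strip()
        if self.flag and text:
            self.records.append((self.flag, text))
            self.flag = None

# a made-up fragment in the shape of the weather.com markup
snippet = """
<div class="hbhWxTime"><div>6 pm</div></div>
<div class="hbhWxTemp"><div>72&deg; F</div></div>
<div class="hbhWxPrecip"><div>Precip: 10%</div></div>
"""

p = WeatherParser()
p.feed(snippet)
print(p.records)
```

The idea is identical: start_div becomes handle_starttag, and a flag remembers which labeled div we were inside when the text arrives.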
Now we’re ready to create a parser object and feed it our source (this assumes the class above was saved in a file called weatherParser.py):

import weatherParser
wparser = weatherParser.weatherParser()
wparser.feed(wsource)
wparser.close()

The output:

6 pm : 72 degrees. Precip: 10%
7 pm : 71 degrees. Precip: 10%
8 pm : 68 degrees. Precip: 10%
9 pm : 67 degrees. Precip: 10%
10 pm : 67 degrees. Precip: 10%
11 pm : 67 degrees. Precip: 15%
12 am : 67 degrees. Precip: 10%
1 am : 66 degrees. Precip: 10%

A cool night in the city! Now you can set this script to run every hour, update your database, test the weather.com predictions, or display the forecast on your desktop however you want. Happy scraping.
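
For instance, running the script at the top of every hour can be handled by cron; this is just an illustrative entry with hypothetical paths, assuming the code above was saved as weather.py:

```shell
# run the scraper hourly, appending its output to a log
# (both paths are hypothetical -- adjust them to your setup)
0 * * * * python /home/you/weather.py >> /home/you/weather.log 2>&1
```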

(posted by Matt DeLand)


5 thoughts on “HTML Parsing and Python”

  1. human mathematics says:

    Thanks for this. For the Rubyists it’s Nokogiri and Heroku. For Perlmonks: LWP. One also wants an XPath parser. I think your example looks easier. (But I didn’t have to read through the documentation, you’re just giving us the answer.)

  2. Matt DeLand says:

    I have since learned that there’s a better solution, especially when the HTML is malformed (which is often), and that’s to use the Python module lxml (in fact, it turns out that the SGML parser has been deprecated). After gathering the source with urllib, here is how to gather all the “times”, continuing the example from above:

    import lxml.html
    html = lxml.html.fromstring(wsource)

    times = []
    elements = html.find_class("hbhWxTime")
    for element in elements:
        times.append(element.text_content())

    If your html is well labeled with the CSS classes, then picking out the data is incredibly easy.
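
A short sketch of the same lookup on a made-up fragment, for anyone who wants to try lxml without fetching the page; it also shows the equivalent XPath query (this assumes lxml is installed):

```python
import lxml.html

# a made-up fragment in the shape of the weather.com markup
snippet = """
<div>
  <div class="hbhWxTime"><div>6 pm</div></div>
  <div class="hbhWxTime"><div>7 pm</div></div>
</div>
"""
doc = lxml.html.fromstring(snippet)

# find_class, as in the comment above
times = [el.text_content() for el in doc.find_class("hbhWxTime")]

# the same selection written as an XPath query
times_xpath = doc.xpath('//div[@class="hbhWxTime"]//text()')
print(times, times_xpath)
```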

  3. Have you tried to scrape weatherspark.com? That site totally rocks. What I’d like to do is collect LOTS of data about the forecasts of the various weather services (three outside the US, four inside) versus the actual results and “grade” the forecasts on various metrics. I believe all of this information is available on weatherspark.com but in graph form; do you know how to turn it into a flat file?

    • human mathematics says:

      My fiancee claims that AccuWeather, Weather.com, and the local news consistently make the same kind of wrong predictions. I can’t remember all her examples but it would be of the form “During this season, coming out of a cold spell, they predict warmth will come earlier than it actually does.”

      I assume meteorologists do lots of stats and would thus be careful to avoid correlated errors, even correlated conditional errors. If you do succeed in scraping one or more weather prediction services, will you let me know? Because I’d like to test her theory out.

  4. Matt DeLand says:

    The site does rock!
    They were harder to deal with than weather.com, but I think I managed to get their temperature predictions. It’s not working totally perfectly yet but I can send you the script (if you want). Does Weatherspark really have forecasts from 7 different services? I could only find 4 obvious ones.
