There’s a lot of data on the web – but sometimes getting to it can be quite a challenge.
This post is an introduction to using Python to read data out of HTML source. Python 2’s standard library already has (at least) two useful classes designed explicitly for this task: HTMLParser and SGMLParser (the latter lives in the sgmllib module). As near as I can tell, they’re quite similar, so I will focus on SGMLParser (SGML stands for Standard Generalized Markup Language). The official Python documentation for this class is slightly lacking if you’ve never used a parser before, but it really isn’t so hard in the end.
The first thing to do is to find a website that you want to collect data from. I’m going to use http://www.weather.com; more specifically, I’m going to read the New York City hourly weather predictions at http://www.weather.com/weather/hourbyhour/graph/USNY0996. To read the data from a website, you need to understand how it is stored in the HTML, and that means actually looking at the HTML!
Of course you can do this from your web browser, but at some point we need to get the HTML into Python, so let’s do it that way.
    import urllib

    # Open the page, read the whole response into a string, then close the socket
    sock = urllib.urlopen("http://www.weather.com/weather/hourbyhour/graph/USNY0996")
    wsource = sock.read()
    sock.close()
    print wsource
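The snippet above is Python 2. On Python 3, urlopen lives in urllib.request and returns bytes, so a rough equivalent might look like the sketch below. The helper name fetch_html, the 10-second timeout, and the UTF-8 fallback are my own assumptions, not part of the original post.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Return the document at `url` as text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as resp:
        raw = resp.read()
        # Use the charset from the Content-Type header when one is sent;
        # falling back to UTF-8 is an assumption, not a guarantee.
        charset = resp.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


# Usage (commented out so that loading this sketch makes no network request):
# wsource = fetch_html("http://www.weather.com/weather/hourbyhour/graph/USNY0996")
# print(wsource)
```

Decoding to text right away keeps the later parsing step simple, since Python 3's parsers expect a string rather than raw bytes.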
There’s actually a lot more you can do with urllib, but the code above just opens the URL, reads in the HTML, and saves it in a string. When we print it out, we see a lot of HTML that we’re going to have to sort through in order to figure out how our data is stored. The first thing you might notice is that The Weather Channel is trying to collect information it has stored in its cookie. Let’s not worry about that, and instead notice that the data we’re interested in is stored in a table, separated by tags like this one:
    <div class="hbhWxTime">4pm</div>
Other websites will be organized differently, but in general (especially if the data is inserted into the HTML by some automated process) the data will sit inside tags. On our page, one hour of the forecast looks like this:
    <div class="hbhWxDate">
      <div>Sun Sep 11</div>
    </div>
    <div class="hbhWxTime">
      <div>6 pm</div>
    </div>
    <div class="hbhWxImg">
      <div style="cursor: pointer;" onclick="gotoTableView('/weather/hourbyhour/USNY0996','18')"><img src="http://i.imwx.com/web/common/wxicons/45/30.gif" alt="" width="45" height="45" border="0" /></div>
    </div>
    <div class="hbhWxTemp">
      <div>72° F</div>
    </div>
    <div class="hbhWxPrecip">
      <div>Precip: 10%</div>
    </div>
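Before picking out particular tags, it can help to watch how a parser walks through markup like this. The sketch below uses Python 3's html.parser.HTMLParser (sgmllib is not available on Python 3) and records each callback in order. The class name EventLogger is mine, and skipping whitespace-only text is a choice I made to keep the log short.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record every parser callback in order (illustrative name)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, e.g. [("class", "hbhWxTime")]
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Ignore the whitespace that sits between tags
        if data.strip():
            self.events.append(("data", data.strip()))


logger = EventLogger()
logger.feed('<div class="hbhWxTime"> <div>6 pm</div> </div>')
logger.close()
for event in logger.events:
    print(event)
```

One difference worth knowing: HTMLParser passes the tag name as an argument to handle_starttag and handle_endtag, whereas SGMLParser looks for methods named after the tag, such as start_div and end_div.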
We see that we don’t actually want to read the text inside every div tag, only the ones whose class marks them as time, temperature, or precipitation (hbhWxTime, hbhWxTemp, and hbhWxPrecip). SGMLParser is set up to call special methods every time a tag is opened and every time a tag is closed. For us, the relevant methods will be start_div (and end_div), plus handle_data, which is called for everything that is not a tag. Here is our Python class:
    from sgmllib import SGMLParser

    class weatherParser(SGMLParser):
        def reset(self):
            # Called by the SGMLParser constructor: clear all flags
            SGMLParser.reset(self)
            self.timeflag = 0
            self.tempflag = 0
            self.precipflag = 0

        def start_div(self, attr):
            # attr is a list of (name, value) pairs; collect the class values
            cl = [v for k, v in attr if k == 'class']
            if cl:
                # cl is a list, so compare its first element, not the list itself
                if cl[0] == 'hbhWxTime':
                    self.timeflag = 1
                elif cl[0] == 'hbhWxTemp':
                    self.tempflag = 1
                elif cl[0] == 'hbhWxPrecip':
                    self.precipflag = 1

        def handle_data(self, text):
            if self.timeflag == 1:
                print text, ":",
                self.timeflag = 0
            if self.tempflag == 1:
                print text, 'degrees.',
                self.tempflag = 0
            if self.precipflag == 1:
                # Print this piece of text and the next one as well
                print text,
                self.precipflag = 2
            elif self.precipflag == 2:
                print text
                self.precipflag = 0
In reset, which the SGMLParser constructor calls, we reset the parser and set all the relevant flags to zero. Then every time a “div” tag is encountered, the parser calls our method start_div with the list of attributes and their values. Inside it, we pick off the “class” attribute and test whether our div tag corresponds to time, temperature, or precipitation. By the time handle_data is called on the “content” of the tag, we will have flagged whether or not we are interested in that data. (Because of how the div tags are arranged on this page, we have to do something slightly different with the precipitation flag: it prints two consecutive pieces of text before resetting.)
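For readers on Python 3, where sgmllib no longer exists, the same flag idea can be sketched with html.parser.HTMLParser. This is not the class from this post: the names HourlyForecastParser, FIELDS, and SAMPLE are hypothetical, it collects rows in a list instead of printing, and it assumes the first non-blank text after a flagged div is the value we want.

```python
from html.parser import HTMLParser

# Map the page's class names to the field each one holds
FIELDS = {"hbhWxTime": "time", "hbhWxTemp": "temp", "hbhWxPrecip": "precip"}


class HourlyForecastParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.pending = None  # field waiting for its text, if any
        self.current = {}    # fields collected for the hour in progress
        self.rows = []       # one dict per completed hour

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        # A class attribute may be missing or hold several names
        for cls in (dict(attrs).get("class") or "").split():
            if cls in FIELDS:
                self.pending = FIELDS[cls]

    def handle_data(self, data):
        text = data.strip()
        if self.pending is None or not text:
            return
        self.current[self.pending] = text
        self.pending = None
        # Once time, temp, and precip are all present, the hour is complete
        if len(self.current) == len(FIELDS):
            self.rows.append(self.current)
            self.current = {}


# A trimmed copy of the markup shown earlier, used as test input
SAMPLE = """
<div class="hbhWxDate"><div>Sun Sep 11</div></div>
<div class="hbhWxTime"><div>6 pm</div></div>
<div class="hbhWxTemp"><div>72° F</div></div>
<div class="hbhWxPrecip"><div>Precip: 10%</div></div>
"""

parser = HourlyForecastParser()
parser.feed(SAMPLE)
parser.close()
print(parser.rows)
```

Returning rows rather than printing makes the result easier to store or compare later, at the cost of drifting a little from the print-as-you-go style used in this post.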
Now we’re ready to create an instance of the class and feed it our source (this assumes the class above is saved in a file called weatherParser.py):
    import weatherParser

    wparser = weatherParser.weatherParser()
    wparser.feed(wsource)
    wparser.close()
    6 pm : 72 degrees. Precip: 10%
    7 pm : 71 degrees. Precip: 10%
    8 pm : 68 degrees. Precip: 10%
    9 pm : 67 degrees. Precip: 10%
    10 pm : 67 degrees. Precip: 10%
    11 pm : 67 degrees. Precip: 10%
    12 am : 67 degrees. Precip: 10%
    1 am : 66 degrees. Precip: 10%
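If you would rather keep these forecasts than just print them, one option is to turn each output line back into numbers and store them. The sketch below is Python 3 and is only one way to do it: the function names, the table name forecast, and its column names are all hypothetical, and the regular expression assumes lines formatted exactly like the output above.

```python
import re
import sqlite3

# Matches lines such as "6 pm : 72 degrees. Precip: 10%"
LINE_RE = re.compile(r"(\d{1,2} [ap]m) : (-?\d+) degrees\. Precip: (\d+)%")


def parse_forecast_line(line):
    """Return (hour, temp_f, precip_pct), or None if the line does not match."""
    m = LINE_RE.fullmatch(line.strip())
    if m is None:
        return None
    return m.group(1), int(m.group(2)), int(m.group(3))


def save_forecasts(conn, lines):
    """Insert every parseable line into the forecast table; return the count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS forecast "
        "(hour TEXT, temp_f INTEGER, precip_pct INTEGER)"
    )
    rows = [r for r in map(parse_forecast_line, lines) if r is not None]
    conn.executemany("INSERT INTO forecast VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# An in-memory database keeps the example self-contained
conn = sqlite3.connect(":memory:")
saved = save_forecasts(conn, ["6 pm : 72 degrees. Precip: 10%",
                              "12 am : 67 degrees. Precip: 10%"])
print(saved, "rows saved")
```

Passing a file path to sqlite3.connect instead of ":memory:" would keep the rows between runs.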
A cool night in the city! Now you can set this script to run every hour, update your database, test the weather.com predictions, or display the forecast on your desktop background however you want. Happy scraping.
(posted by Matt DeLand)