Using Python's HTMLParser class
Note: To get the most out of this tutorial, make sure you understand object-oriented programming in Python, specifically the concept of inheritance.
Python is a powerful yet simple language whose standard library includes many tools that make complex tasks easy while remaining flexible. One such tool is the HTMLParser class, which, as you might have guessed from the name, parses HTML and lets you easily sort through the information in tags. It was also the basis of my Google interface, which can be found here: http://www.hellboundhackers.org/code/readcode.php?id=1145
To save writing more code, I will refer to my banked code as an example. I made mine object oriented, with HTMLParser as the parent class. The first step in __init__ is always to initialize the HTMLParser variables; you can do this with HTMLParser.__init__(self).
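As a rough illustration of that first step, here is a minimal sketch of a subclass that calls the parent initializer before setting up its own state. It assumes Python 3, where the class lives in the html.parser module (the article's own listing uses the Python 2 module name). MyParser and tag_count are hypothetical names chosen for this example, and the handler method it overrides is explained in the next paragraph.

```python
from html.parser import HTMLParser  # Python 3 location of the class (assumption: Python 3)


class MyParser(HTMLParser):  # hypothetical example class
    def __init__(self):
        # Let the parent class set up its internal buffers first
        HTMLParser.__init__(self)
        # Then add any state of your own (hypothetical attribute)
        self.tag_count = 0

    def handle_starttag(self, tag, attrs):
        # Count every opening tag the parser reports
        self.tag_count += 1


parser = MyParser()
parser.feed('<p>Hello <b>world</b></p>')
print(parser.tag_count)  # 2: one <p> and one <b>
```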
HTMLParser has several built-in methods, but unlike those of most classes, they are intended to be overridden by you to suit your purpose. The most used of these is handle_starttag. This method has a standard signature: besides self, it takes two arguments from HTMLParser, tag and attrs. tag holds the lowercased name of the tag (e.g. 'a', 'p', 'body'), and attrs holds a list of (name, value) pairs, for example [('href', 'www.hellboundhackers.org')].
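To see what tag and attrs actually contain, the sketch below records every call to handle_starttag and prints the result. It assumes Python 3 (html.parser). TagLogger, seen, and the sample HTML string are made up for this illustration.

```python
from html.parser import HTMLParser  # assumption: Python 3


class TagLogger(HTMLParser):  # hypothetical example class
    def __init__(self):
        HTMLParser.__init__(self)
        self.seen = []  # one (tag, attrs) entry per opening tag

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        self.seen.append((tag, attrs))


logger = TagLogger()
logger.feed('<P class="intro">See <a href="http://www.hellboundhackers.org" title="HBH">HBH</a></P>')
for tag, attrs in logger.seen:
    print(tag, attrs)
# p [('class', 'intro')]
# a [('href', 'http://www.hellboundhackers.org'), ('title', 'HBH')]
```

Note that the tag name comes back in lowercase even though the sample markup writes it as P.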
You must name the method handle_starttag, or it won't work: HTMLParser calls the method by that name for every opening tag it is fed. In the Spider example further down, I just wanted all non-empty 'a' tags, then a specific link from that list. Your program may be different, but the principle is the same: you are sorting through the tags.
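The sorting idea can be sketched on its own, without any network access. The example below is only one possible approach, not the banked code itself: it keeps the href of every 'a' tag that has one, then picks out the links containing a keyword. It assumes Python 3. LinkFilter, the sample page, and the 'python' keyword are all hypothetical.

```python
from html.parser import HTMLParser  # assumption: Python 3


class LinkFilter(HTMLParser):  # hypothetical example class
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The handler sees every opening tag, so skip anything that is not a link
        if tag != 'a':
            return
        # Look the attribute up by name instead of relying on its position
        href = dict(attrs).get('href')
        if href:
            self.links.append(href)


page = '<a name="top"></a><a href="/forum">Forum</a><a href="/articles/python">Python</a>'
parser = LinkFilter()
parser.feed(page)

# Second pass: pick a specific link out of the collected list
wanted = [link for link in parser.links if 'python' in link]
print(parser.links)  # ['/forum', '/articles/python']
print(wanted)        # ['/articles/python']
```

Looking up href by name is a small design choice: it keeps working when another attribute, such as class or title, happens to come first in the tag.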
Once you have written your tag handler, you can feed it some HTML. This is done with a single method call, feed(). Once you have opened a URL with urlopen and read the response, you will have some HTML in a variable. You just run self.feed() with that HTML as its argument, and the parser calls your handler for each opening tag it finds. Here is the most basic parser possible, so you can see the flow of execution.
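# Python 2 module names; in Python 3 these live in html.parser and urllib.request
from HTMLParser import HTMLParser
from urllib2 import urlopen


class Spider(HTMLParser):
    def __init__(self, url):
        # Initialise the parent class before feeding it any data
        HTMLParser.__init__(self)
        req = urlopen(url)
        self.feed(req.read())

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag; keep only 'a' tags that have attributes
        if tag == 'a' and attrs:
            # attrs is a list of (name, value) pairs; this prints the first
            # attribute's value, which is the link when href comes first
            print("Found link => %s" % attrs[0][1])


Spider('http://www.hellboundhackers.org')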
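For readers on Python 3, the same flow might look roughly like the sketch below. This is an assumed port, not the original banked code: the imports move to html.parser and urllib.request, and the downloaded bytes are decoded to text before being fed to the parser. LinkSpider, found, and crawl are hypothetical names, and parsing is kept separate from fetching so the parser can be tried on a plain string.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkSpider(HTMLParser):  # hypothetical Python 3 counterpart of Spider
    def __init__(self):
        HTMLParser.__init__(self)
        self.found = []

    def handle_starttag(self, tag, attrs):
        # Same rule as the listing above: non-empty 'a' tags, first attribute value
        if tag == 'a' and attrs:
            self.found.append(attrs[0][1])


def crawl(url):
    """Hypothetical helper: download a page and return the links found in it."""
    with urlopen(url) as response:
        # Fall back to UTF-8 when the server does not declare a charset
        charset = response.headers.get_content_charset() or 'utf-8'
        html = response.read().decode(charset, errors='replace')
    spider = LinkSpider()
    spider.feed(html)
    return spider.found


# Offline demonstration: feed a string directly instead of fetching a URL
spider = LinkSpider()
spider.feed('<html><body><a href="http://www.hellboundhackers.org">HBH</a><a>empty</a></body></html>')
for link in spider.found:
    print("Found link => %s" % link)
```

Calling crawl('http://www.hellboundhackers.org') would perform the network request that the original listing performs inside __init__.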
If this article helped you, please rate + comment
ghost 16 years ago
Simple article, yet very informative. I've been thinking of learning Python for a while now, and this got me more interested than before. Thanks for a -Very Good- article JJ :happy:
clone4 16 years ago
Simple, nice ;) I wish I didn't have to learn PHP now, otherwise I would definitely have started with Python; it really is a great language –Very Good– ;)
ghost 16 years ago
Damn it… I was just getting into it when it ended abruptly! Great content. In your next Py article, go into more depth and show more variety in examples. Can't wait. :)
ghost 16 years ago
your example ['href','www.hellboundhackers.org'] should have http:// in it
your code is missing tabs.
That said, it's very, very simple, but it mostly does the job. You should mention that handle_starttag is called for EVERY tag fed into it with HTMLParser.feed