
Using Python's HTMLParser class


By ghostghost | 52177 Reads

Note: To get the most out of this tutorial, make sure you understand OO programming in Python, specifically the concept of inheritance.

Python is a powerful yet simple language whose standard library makes many complex tasks easy while still remaining flexible. One such tool is the HTMLParser class, which, as you might have guessed from the name, parses HTML and lets you easily sort through the information in tags. It was also the basis of my Google interface, which can be found here: http://www.hellboundhackers.org/code/readcode.php?id=1145

To save writing more code, I will refer to my banked code as an example. I made mine object oriented, with HTMLParser as the parent class. The first step in __init__ is always to initialize the HTMLParser variables; you can do this with HTMLParser.__init__(self).
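
As a minimal sketch of that first step, the class below calls the parent initializer before setting up any state of its own. The class name TagCounter and the sample HTML are made up for illustration, and the import fallback is an assumption so the snippet also runs on Python 3, where the class lives in html.parser.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2, as used in this article


class TagCounter(HTMLParser):  # hypothetical name, for illustration only
    def __init__(self):
        # Set up the parent's internal state first; feed() relies on it.
        HTMLParser.__init__(self)
        # Our own state comes after the parent has been initialised.
        self.start_tags = 0

    def handle_starttag(self, tag, attrs):
        # Count every start tag the parser reports.
        self.start_tags += 1


counter = TagCounter()
counter.feed('<p>Hello <a href="/news">news</a></p>')
print(counter.start_tags)  # 2: one <p> and one <a>
```

If the call to HTMLParser.__init__ is left out, feed() fails because the parser's internal buffer was never created.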

HTMLParser has several built-in handler methods, but unlike the methods of most classes, these are intended to be overridden by you to suit your purpose. The most used of these is handle_starttag. This method has a fixed signature, and it receives two arguments from HTMLParser: tag and attrs. tag holds the lowercased name of the tag (e.g. 'a', 'p', 'body'), and attrs holds a list of (name, value) pairs, one per attribute, for example [('href', 'www.hellboundhackers.org')].
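
To see what the two arguments actually look like, here is a small sketch that records every call to handle_starttag. The class name AttrRecorder and the sample markup are invented for this example; the import fallback is only there so it runs on Python 2 or 3.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2, as used in this article


class AttrRecorder(HTMLParser):  # hypothetical name, for illustration only
    def __init__(self):
        HTMLParser.__init__(self)
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # Store the arguments exactly as the parser hands them over.
        self.seen.append((tag, attrs))


recorder = AttrRecorder()
recorder.feed('<A HREF="http://www.hellboundhackers.org" class="nav">HBH</A><br>')

for tag, attrs in recorder.seen:
    print("%s -> %r" % (tag, attrs))
```

Tag and attribute names arrive lowercased, attribute values keep their original text, and a tag with no attributes gets an empty list.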

You must name the method you create handle_starttag, or it won't be called. In the example below, I just wanted all non-empty 'a' tags, then a specific link from that list. Your program may be different, but the principle is the same: you are sorting through the tags.
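
One way to sketch that kind of sorting is to collect the href of every 'a' tag and then pick out the link you want. The class name LinkCollector, the sample HTML and the 'readcode' filter are assumptions made for this example, not the banked code itself; the href is looked up by name rather than by position so that other attributes do not get in the way.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2, as used in this article


class LinkCollector(HTMLParser):  # hypothetical name, for illustration only
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return  # ignore everything that is not an anchor
        # attrs is a list of (name, value) pairs, so dict() gives lookup by name.
        href = dict(attrs).get('href')
        if href:
            self.links.append(href)


# Made-up sample page: one anchor without an href, two with.
SAMPLE = ('<a name="top"></a><a href="/articles">Articles</a>'
          '<p>text</p><a href="/code/readcode.php?id=1145">Code</a>')

collector = LinkCollector()
collector.feed(SAMPLE)

# Second pass: keep only the specific link we are after.
wanted = [link for link in collector.links if 'readcode' in link]
print(wanted)
```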

Once you have written your tag handler, you can feed it some HTML. This takes a single call to feed(), a method your class inherits from HTMLParser. Once you have set up a request and 'urlopen'ed it, you can read the HTML into a variable. You then pass that variable to self.feed(), and the parser calls your handler automatically for every start tag it finds. Here is the most basic parser possible, so you can see the flow of execution.

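# Python 2. In Python 3 the equivalents are html.parser.HTMLParser
# and urllib.request.urlopen.
from HTMLParser import HTMLParser
from urllib2 import urlopen

class Spider(HTMLParser):
    def __init__(self, url):
        # Initialise the parent parser before feeding it anything.
        HTMLParser.__init__(self)
        req = urlopen(url)
        # feed() calls handle_starttag once for every start tag in the page.
        self.feed(req.read())

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; look the href up by name.
        href = dict(attrs).get('href')
        if tag == 'a' and href:
            print "Found link => %s" % href

Spider('http://www.hellboundhackers.org')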
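
feed() does not need the whole page at once. As a rough sketch, the parser below is fed three pieces that cut tags in half; it holds on to an incomplete tag until the rest arrives. The class name StartTagLog and the chunks are invented for this example, and nothing is fetched from the network.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2, as used in this article


class StartTagLog(HTMLParser):  # hypothetical name, for illustration only
    def __init__(self):
        HTMLParser.__init__(self)
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


parser = StartTagLog()

# Made-up chunks that split '<body>' and the href attribute across feeds.
chunks = ['<html><bo', 'dy><a hre', 'f="/x">x</a></body></html>']
for chunk in chunks:
    parser.feed(chunk)
parser.close()  # tell the parser there is no more input

print(parser.tags)  # ['html', 'body', 'a']
```

This is handy when reading a large page in blocks instead of calling read() once.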

If this article helped you, please rate + comment

Comments
ghost 16 years ago

This definitely helped me understand the HTMLParser for Python.

ghost 16 years ago

A simple yet very informative article. I've been thinking of learning Python for a while now, and this got me more interested than before. Thanks for a -Very Good- article, JJ :happy:

clone4 16 years ago

Simple, nice ;) I wish I didn't have to learn PHP now, otherwise I would definitely have started with Python. It really is a great language –Very Good– ;)

ynori7 16 years ago

Nice article, very good rating from me

ghost 16 years ago

Nice article JJ

ghost 16 years ago

Damn it… I was just getting into it when it ended abruptly! Great content. In your next Py article, go into more depth and show more variety in examples. Can't wait. :)

ghost 16 years ago

your example ['href','www.hellboundhackers.org'] should have http:// in it

your code is missing tabs.

That said, it's very, very simple, but it mostly does the job. You should mention that handle_starttag is called for EVERY tag fed into it with HTMLParser.feed

ghost 16 years ago

Very Nice article!! :D

bahpomet1105 9 years ago

Helped me create an awesome spider, thanks for the article.