Book Review: Spidering Hacks

“Spidering Hacks” by O’Reilly is targeted at everyone who wants to automate surfing the web and has a little bit of programming experience. Though each hack in the book covers a particular topic, much like the recipes in the Cookbook series, each also contains a lot of generally applicable material.

Furthermore, the more of them you read, the more you understand how useful a spider can be, and why you should use and build spiders in a particular manner. Rather than just teaching you about spidering, the book shows that combining simple programs, and modifying or combining code from existing programs (like the hacks in the book), can take you very far very quickly.

Like all books in the Hacks series, this one is made up of 100 hacks. The first few hacks are an introduction to spidering and scraping: what it is, why it is useful, how good spiders should behave, and advice on how to keep your spider out of trouble. Though it is possible to skip this introduction and go right to a particular hack, reading it can save you a lot of problems later, and prevent your spider from getting banned by sites.
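
To make "good behavior" concrete: the review does not list the book's specific advice, but one widely used convention is to honor a site's robots.txt file. The following is a minimal sketch in Python (the book itself works in Perl), using only the standard library. The robots.txt content, the URLs, and the agent name are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real spider would download this
# from the site it is about to visit.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

AGENT = "example-spider"  # hypothetical user-agent name


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask before fetching: is this agent allowed to request the URL?
print(parser.can_fetch(AGENT, "http://example.com/index.html"))
print(parser.can_fetch(AGENT, "http://example.com/private/data.html"))

# Requested pause between requests, in seconds, if the site specifies one.
print(parser.crawl_delay(AGENT))
```

A spider that checks `can_fetch` before every request and sleeps for the crawl delay between requests is much less likely to get banned.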

The next 25 hacks are a lot more technical and are an introduction to spidering tools and techniques. If you have a little bit of Perl knowledge, these will show you everything you need to know in order to easily modify all the programs presented in the rest of the book, and write programs of your own.

All of the most important Perl modules and programs for spidering are introduced and explained. The simplest of these modules are LWP::Simple and LWP::UserAgent, which let you issue HTTP requests and access their results.
Several hacks then explain how to use these modules to do things like posting form data, accessing sites that require authentication, and using secure HTTP.
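
For readers who have not seen what posting form data with authentication involves, here is a rough sketch. It is written in Python's standard library rather than with the Perl LWP modules the book uses, and it only builds the request without sending it. The URL, form field, and credentials are hypothetical.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request


def build_form_request(url, fields, user, password):
    """Build (but do not send) a form POST with an HTTP Basic auth header."""
    # Form fields are URL-encoded and sent as the request body.
    body = urlencode(fields).encode("ascii")
    request = Request(url, data=body)

    # HTTP Basic authentication: base64 of "user:password".
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request.add_header("Authorization", "Basic " + token)
    return request


# Hypothetical URL, form field, and credentials.
request = build_form_request(
    "https://example.com/login",
    {"q": "spidering hacks"},
    "reader",
    "secret",
)
print(request.get_method())  # a request that carries a body is sent as POST
print(request.data)
```

Passing the request to `urllib.request.urlopen` would actually send it; using an `https://` URL is all that is needed for secure HTTP here.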

Once you know how to get the content of a page, you will need to parse it in order to do something useful with it. There are several ways of doing so, the simplest of which is using regular expressions (Hack 23). These work very well for simple things (e.g. searching a page for a particular word), but more elaborate scraping is simpler with the HTML::TreeBuilder and HTML::TokeParser modules. Both of these are explained and then used throughout the book.
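
The difference between the two approaches can be sketched as follows. This is an illustration in Python's standard library, not the book's Perl code: `re` stands in for the regular-expression approach and `html.parser` plays the role of a token-based parser such as HTML::TokeParser. The sample page and the helper names are invented for the example.

```python
import re
from html.parser import HTMLParser

# Made-up page content for illustration.
PAGE = """<html><body>
<p>Spidering is fun.</p>
<a href="/hacks/23">Hack 23</a>
<a class="ext" href="http://example.com/">Example</a>
</body></html>"""


def mentions(word, html):
    """Simple case: a regular expression is enough to look for a whole word."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, html, re.IGNORECASE) is not None


class LinkCollector(HTMLParser):
    """More elaborate case: collect (href, link text) pairs with a real parser."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


print(mentions("spidering", PAGE))
print(extract_links(PAGE))
```

The parser version does not care about attribute order or extra attributes such as `class`, which is exactly where hand-written regular expressions tend to break.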

The description of each module in this chapter also includes short but fully functional (and useful) sample programs. These show how to use the modules to accomplish non-trivial tasks without requiring you to read their entire documentation first. This is something that is stressed throughout the book: getting you up to speed with a tool or technique and allowing you to learn as you go by providing you with examples and links to other sources.

After covering these tools, the next two hacks cover a module called WWW::Mechanize. This lets you do in very few lines something that would require lots of code with LWP::Simple, HTML::TreeBuilder and regular expressions. Using efficient tools is stressed throughout the book. It might be possible to do everything by writing completely new Perl code, but tools like wget, XPath, RSS::Extract and the ones discussed above can make the job a lot simpler.

This chapter also gives a lot of background knowledge on things that are not directly related to spidering but will make your life a lot easier, such as installing Perl modules, using pipes, or running scripts with cron.

The remaining chapters are very different from the first two. Instead of covering concepts and tools, most of the hacks found in these accomplish a particular goal by combining particular tools. These four chapters cover collecting files, accessing online databases, keeping your collections up to date, and making your files available and easy to spider. Each of the hacks in these chapters can be downloaded and run, but they also come with explanations and lots of advice on how to adapt them for other sites and purposes. By reading the hacks you will really understand how to use all the tools discussed in the beginning of the book.

The third chapter teaches how to browse through a website and find out how you can most easily access its contents from your script. Two great examples of this are Hacks 33 and 34: these two case studies show how taking the time to analyze a site, and to see how it is structured and how information is passed around (e.g. as particular keywords in GET requests), can save a lot of coding effort and keep the script code very simple.
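
As a small illustration of the GET-request point: once you have worked out which query keywords a site expects, the script often reduces to building the right URL. The sketch below uses Python's standard library rather than the book's Perl, and the base URL and the parameter names `q` and `page` are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def build_search_url(base, **params):
    """Append URL-encoded query parameters to a base URL."""
    return base + "?" + urlencode(params)


# Hypothetical site and parameter names discovered by browsing the site.
url = build_search_url("http://example.com/search", q="perl spider", page=2)
print(url)

# The same module can take an observed URL apart again, which helps when
# you are studying how a site passes information around.
print(parse_qs(urlparse(url).query))
```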

The fourth chapter, Gleaning Data from Databases, is by far the longest chapter in the book. The hacks listed here are all about accessing online databases and then either downloading the information or processing it in some other way. Many of the hacks take a user query, feed it into the database, and then feed results from one database into other databases to get further information. Through this the reader also learns how to combine data from different sites, which makes it easy to replace one of these sites with one of their choosing.

Throughout the book many hacks actually give advice on how different input or output methods can be used with the hack. An excellent example of this is Hack 99: Creating an IM interface, which takes Hack 80 and adds an AIM interface to it. The program creates an AIMbot that upon query sends back the most recent reports from Bugtraq. It is of course very simple to replace the code from Hack 80 with any of the other hacks from the book, or your own custom code, and create something new and useful.

The hack also includes directions on where to find more information on how to use ICQ or Jabber instead of AIM. In my opinion this hack best shows what Spidering Hacks is about. The program presented is usable as it is, and there is plenty of information on how to modify it. But what distinguishes the Hacks series from Cookbook-style books is that many of the hacks are also very original and innovative ideas rather than just reference implementations of things you might commonly need and know about. The best thing about this book is that every hack shows how experts write spiders, but is also a starting program for a hack of your own.

The last two chapters are rather short and show you how to keep your collection up-to-date and easy to spider for others. The RSS and XML-RPC hacks are especially interesting.

Most of the hacks are useful on Unix (including Mac OS X) and Windows, and the few platform-specific hacks (e.g. 86, 82) can easily be adapted to be useful on other platforms. Unix is often assumed to be the platform being used (because Perl is standard on most Unix versions), but the authors usually try to address Windows as well. A good example of this is cron, which is always given as the standard way of running scripts on a regular basis, but Hack 91 explains what to do if cron is not available for your system, and how to do the equivalent on Windows.

It should be said that the book does require a good bit of programming experience and Perl knowledge to fully understand, but many of the hacks are usable and useful straight out of the book. If, however, you want to do anything other than use the exact scripts from the book, you will need to write or at least edit a bit of Perl code. Fortunately, an introductory Perl book will be enough to get you up to speed. There is also one Java and one Python hack in the book, but all other programs are in Perl.

Finally, O’Reilly is trying to build an online community around the Hacks series, and the book’s website allows people to leave comments on particular hacks, in addition to the usual reader reviews and tables of contents.


5 Comments

2004-01-14 3:55 pm

Anonymous
Too bad it’s in perl, php would be more useful for me as it’s what i’m working with at the moment. I kind of dislike the cookbook (the php one), i remember i had a hard time finding out about string concatenation the first time i used it, had to use google to find out that it’s with a “.” instead of the “+” (i come from a c/c++ background :-P).
2004-01-14 4:04 pm

Anonymous
PHP isn’t such a good GLUE. To spider something like Perl will act a lot better. If you really want something cool, a threaded Python script would be the way to go. Python has an AIM module to facilitate the aimbot functions of this as well. PHP is a great web development system, but isn’t really designed as well to be used as a GLUE.
2004-01-14 4:08 pm

Anonymous
I was interested in hack #35 (“gathering movies from the Library of Congress” as noted by an amazon.com reviewer), but strangely enough, the source code archive downloaded from oreilly.com doesn’t contain any code for this particular “hack”…
2004-01-14 4:22 pm

Anonymous
PHP isn’t such a good GLUE. To spider something like Perl will act a lot better.

Yes, tell that to my boss. If they say php then php it is…
2004-01-15 11:13 am

Anonymous
So cool that I just bought it