posted by Can Sar on Wed 14th Jan 2004 01:58 UTC

"Spidering Hacks book review, Page 2"
The third chapter teaches how to browse through a website and find out how you can most easily access its contents from your script. Two great examples of this are Hacks 33 and 34. These two case studies show how taking the time to analyze a site, to see how it is structured and how information is passed around (e.g. as particular keywords in GET requests), can save a lot of coding effort and keep the script code very simple.
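
To give a rough idea of what such an analysis buys you (this is my own sketch, not code from the book), suppose a site's search form turns out to send a plain GET request with a keyword and a page number. The host name and the parameter names `q` and `page` are hypothetical, chosen only for illustration. Once the pattern is known, the script can build result URLs directly instead of simulating the form:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical search endpoint; a real site will use its own path and keys.
BASE_URL = "http://www.example.com/search"


def build_search_url(keyword, page=1):
    """Build the URL the site's search form would have requested."""
    query = urlencode({"q": keyword, "page": page})
    return f"{BASE_URL}?{query}"


def read_query(url):
    """Split an observed URL back into its GET parameters for inspection."""
    return parse_qs(urlsplit(url).query)


url = build_search_url("perl spiders", page=2)
print(url)
print(read_query(url))
```

The second function is the analysis half: paste in a URL copied from the browser and it shows which keywords the site actually passes around.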

The fourth chapter, Gleaning Data from Databases, is by far the longest chapter in the book. The hacks listed here are all about accessing online databases and then either downloading the information or processing it in some other way. Many of the hacks take a user query, feed it into a database, and then feed the results from one database into other databases to get further information. Through this the reader also learns how to combine data from different sites, which makes it easy to replace one of these sites with one of their own choosing.
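
The chaining idea can be sketched in a few lines. This is an illustration under stated assumptions rather than anything taken from the book: the two dictionaries stand in for online databases, and every name and value in them is a placeholder. In a real spider each lookup function would fetch and parse a page instead.

```python
from typing import Callable, Dict, List

# A lookup takes one query string and returns a list of result strings.
Lookup = Callable[[str], List[str]]

# Placeholder data standing in for two separate online databases.
AUTHORS_TO_TITLES = {"Author A": ["Title One", "Title Two"]}
TITLES_TO_SHOPS = {"Title One": ["Shop X"], "Title Two": ["Shop X", "Shop Y"]}


def titles_by_author(author: str) -> List[str]:
    """First source: answer the user's query."""
    return AUTHORS_TO_TITLES.get(author, [])


def shops_for_title(title: str) -> List[str]:
    """Second source: enrich each result of the first one."""
    return TITLES_TO_SHOPS.get(title, [])


def chain(query: str, first: Lookup, second: Lookup) -> Dict[str, List[str]]:
    """Feed the query to `first`, then feed each result to `second`."""
    return {item: second(item) for item in first(query)}


print(chain("Author A", titles_by_author, shops_for_title))
```

Because `chain` only depends on the shape of a lookup, swapping one site for another means passing in a different function and nothing else.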

Throughout the book, many hacks give advice on how different input or output methods can be used with the hack. An excellent example of this is Hack 99: Creating an IM interface, which takes Hack 80 and adds an AIM interface to it. The program creates an AIMbot that, when queried, sends back the most recent reports from Bugtraq. It is of course very simple to replace the code from Hack 80 with any of the other hacks from the book, or with your own custom code, and create something new and useful.

The Hack also includes directions on where to find more information on how to use ICQ or Jabber instead of AIM. In my opinion, this Hack best shows what Spidering Hacks is about. The program presented is usable as it is, and there is plenty of information on how to modify it. But what distinguishes the Hacks series from Cookbook-style books is that many of the hacks are also very original and innovative ideas rather than just reference implementations of things you might commonly need and know about. The best thing about this book is that every hack shows how experts write spiders, but is also a starting program for a hack of your own.
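
The separation that makes this kind of swapping easy can be sketched as follows. This is my own minimal sketch and not the book's code: it does no networking at all, `latest_reports` is a hypothetical placeholder for the data-gathering part, and the `latest` command is an invented convention. The point is only that the responder does not care where its data comes from or which messaging service delivers the question.

```python
from typing import Callable, List


def latest_reports() -> List[str]:
    """Placeholder for the data-gathering half (a spider would go here)."""
    return ["report one", "report two"]


def make_responder(source: Callable[[], List[str]],
                   limit: int = 5) -> Callable[[str], str]:
    """Wrap any data source in a function that answers one chat message."""
    def respond(message: str) -> str:
        if message.strip().lower() != "latest":
            return "Send 'latest' to get the most recent items."
        items = source()[:limit]
        return "\n".join(items) if items else "Nothing new."
    return respond


respond = make_responder(latest_reports)
print(respond("latest"))
```

Any messaging front end would then only need to pass incoming text to `respond` and send back the returned string, and any other data source can be dropped in through `make_responder`.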

The last two chapters are rather short and show you how to keep your collection up-to-date and easy for others to spider. The RSS and XML-RPC hacks are especially interesting.

Most of the hacks are useful on Unix (including Mac OS X) and Windows, and the few platform-specific hacks (e.g. 82 and 86) can easily be adapted to other platforms. Unix is often assumed to be the platform being used (because Perl is standard on most Unix versions), but the authors usually try to address Windows as well. A good example of this is cron, which is always given as the standard way of running scripts on a regular basis; Hack 91 explains what to do if cron is not available for your system, and how to do the equivalent on Windows.

It should be said that the book requires a good bit of programming experience and Perl knowledge to fully understand, although many of the hacks are usable and useful straight out of the book. If, however, you want to do anything other than run the exact scripts from the book, you will need to write or at least edit a bit of Perl code. Fortunately, an introductory Perl book will be enough to get you up to speed. Note also that the book contains one Java hack and one Python hack; all other programs are in Perl.

Finally, O'Reilly is trying to build an online community around the Hacks series, and the book's website allows people to leave comments on particular hacks, in addition to the usual reader reviews and table of contents.
