posted by Can Sar on Wed 14th Jan 2004 01:58 UTC
"Spidering Hacks" by O'Reilly is aimed at anyone with a little programming experience who wants to automate surfing the web. Each hack covers a particular topic, much like the recipes in the Cookbook series, but each also contains plenty of material that applies more generally.

Furthermore, the more hacks you read, the better you understand how useful a spider can be and why you should build and use spiders in a particular manner. Rather than just teaching you about spidering, the book shows that combining simple programs, and modifying or combining code from existing ones (like the hacks in the book), can take you very far very quickly.

Like all books in the Hacks series, this one is made up of 100 hacks. The first few hacks are an introduction to spidering and scraping: what it is, why it is useful, how good spiders should behave, and advice on how to keep your spider out of trouble. Though it is possible to skip this introduction and go right to a particular hack, reading it can save you a lot of problems later, and prevent your spider from getting banned by sites.
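
As one illustration of what "behaving well" can mean, here is a minimal sketch of checking a site's robots.txt rules before fetching a page. It is written in Python with the standard library rather than in the book's Perl, and it is not taken from the book. The robots.txt content, the agent name "example-spider" and the example.com URLs are all made up for the example; a real spider would download the file from the site it intends to visit.

```python
import urllib.robotparser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Hypothetical name under which the spider identifies itself.
AGENT = "example-spider"


def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def may_fetch(parser, url):
    """Return True if the rules allow AGENT to request this URL."""
    return parser.can_fetch(AGENT, url)


rules = build_parser(ROBOTS_TXT)
print(may_fetch(rules, "http://example.com/index.html"))
print(may_fetch(rules, "http://example.com/private/data.html"))
```

A spider that skips every URL for which may_fetch returns False, and pauses between requests, is much less likely to end up banned.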

The next 25 hacks are a lot more technical and are an introduction to spidering tools and techniques. If you have a little bit of Perl knowledge, these will show you everything you need to know in order to easily modify all the programs presented in the rest of the book, and write programs of your own.

All of the most important Perl modules and programs for spidering are introduced and explained. The simplest of these modules are LWP::Simple and LWP::UserAgent, which let you issue HTTP requests and access their results. Several hacks then explain how to use these modules to do things like posting form data, accessing sites that require authentication, and using secure HTTP.

Once you know how to get the content of a page, you will need to parse it in order to do something useful with it. There are several ways of doing so, the simplest of which is using regular expressions (Hack 23). These work very well for simple things (e.g. searching a page for a particular word), but more elaborate scraping is simpler with the HTML::TreeBuilder and HTML::TokeParser modules. Both of these are explained and then used throughout the book.
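
To make the contrast concrete, here is a rough sketch of the two approaches. It uses Python's standard library as a stand-in for the Perl modules named above, so it is an analogue rather than code from the book. The sample page and the helper names (mentions, LinkCollector, extract_links) are invented for illustration.

```python
import re
from html.parser import HTMLParser

# A small made-up page, inlined so the sketch needs no network access.
PAGE = """<html><body>
<h1>Spidering news</h1>
<p>Read the <a href="/hacks/1.html">first hack</a> or the
<a href='/hacks/2.html'>second hack</a>.</p>
</body></html>"""


def mentions(word, html):
    """Regular-expression approach: is the word anywhere in the page?"""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, html, re.IGNORECASE) is not None


class LinkCollector(HTMLParser):
    """Parser approach: collect (href, link text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


print(mentions("spidering", PAGE))
print(extract_links(PAGE))
```

The word search is a one-liner, while pulling out each link together with its text is easier to get right with a parser that copes with both quoting styles and with markup spread over several lines.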

The description of each module in this chapter also includes short but fully functional (and useful) sample programs. These show how to use the modules to accomplish non-trivial tasks without requiring you to read their entire documentation first. This is something that is stressed throughout the book: getting you up to speed with a tool or technique, and letting you learn as you go by providing examples and links to other sources.

After covering these tools, the next two hacks cover a module called WWW::Mechanize. It lets you do in very few lines something that would require lots of code with LWP::Simple, HTML::TreeBuilder and regular expressions. Using efficient tools is stressed throughout the book. It might be possible to do everything by writing completely new Perl code, but tools like wget, XPath, RSS::Extract and the ones discussed above can make the job a lot simpler.

This chapter also gives a lot of background on things that are not directly related to spidering but will make your life a lot easier, such as installing Perl modules, using pipes, or running scripts with cron.

The remaining chapters are very different from the first two. Instead of covering concepts and tools, most of the hacks found in these accomplish a particular goal by combining particular tools. These four chapters cover collecting files, accessing online databases, keeping your collections up to date, and making your files available and easy to spider. Each of the hacks in these chapters can be downloaded and run, but they also come with explanations and lots of advice on how to adapt them for other sites and purposes. By reading the hacks you will really understand how to use all the tools discussed in the beginning of the book.
