The Evolution of Web Scraping
May 8th, 2008
One of my very first professional programming jobs was to write scrapers to extract job listings from various sites and massage them into a format for import into the CareerSite application. In 1997 this meant Perl, enough complicated regular expressions to make you go blind, and LWP::UserAgent. Forms were rarely an issue and one just had to crawl one's way through the site, happily lifting the content. Even if a form did get in the way, it was a relatively simple exercise to set the params and fire off a properly formed POST. Cookies and JavaScript were rarely an issue.
A typical script might look like this:
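use strict;
use warnings;
use LWP::UserAgent;

# Identify the bot, then POST the search form fields.
my $ua = LWP::UserAgent->new;
$ua->agent("MikeBot");
my $response = $ua->post( "http://www.monster.com/search",
    {
        "category"     => "IT",
        "company_name" => "IBM"
    }
);
...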
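For readers who do not speak Perl, here is a rough Ruby sketch of the same idea using only the standard library. It builds the form-encoded POST but does not send it; the URL and field names are simply carried over from the Perl script, and the helper name build_search_request is made up for illustration.

```ruby
require 'net/http'
require 'uri'

# Endpoint and field names are carried over from the Perl script;
# they are illustrative, not a documented API.
SEARCH_URL = URI.parse('http://www.monster.com/search')

# Build (but do not send) a form-encoded POST request.
# build_search_request is a hypothetical helper name.
def build_search_request(url, fields)
  request = Net::HTTP::Post.new(url.path)
  request['User-Agent'] = 'MikeBot'
  # set_form_data URL-encodes the fields into the body and sets the
  # Content-Type to application/x-www-form-urlencoded.
  request.set_form_data(fields)
  request
end

request = build_search_request(SEARCH_URL, 'category' => 'IT', 'company_name' => 'IBM')
puts request.body
```

Sending it would be a matter of passing the request to Net::HTTP.start(SEARCH_URL.host, SEARCH_URL.port) { |http| http.request(request) }, which is left out here so the sketch runs without touching the network.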
WWW::Mechanize came along later, extended LWP::UserAgent, and solved the repetitive tasks of cookie handling, form processing, traversing links, etc. WWW::Mechanize is still actively maintained and can be used to effectively crawl reasonable web sites.
Recently, I volunteered to assist one of our clients in the evaluation of various tools and approaches to web content scraping for the purpose of obtaining competitive pricing information. Fast-forward to 2008, and I have discovered that the tools and techniques of several years ago are no longer effective due to the heavy use of JavaScript/Ajax dynamic forms and the proliferation of .NET.
I would have to defer to someone like Cris Barbero as to why .NET and view state introduce friction in my crawling efforts; I just know it is no longer a matter of POSTing the right form fields. To be fair, JSF has added a layer of complexity as well. Clinging to the good old days of being able to fake a browser in a fairly small Perl library wasn’t going to cut it in this new and richly interactive world.
Enter Watir. For those who are not aware, the Watir project is a Ruby-based open source browser automation tool generally used for testing. It can also be leveraged for content scraping. The important thing is that it drives a browser instead of trying to act like a browser. Acting like a browser is much more difficult these days.
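One plausible source of that friction, offered as a hedged illustration rather than a full explanation: ASP.NET WebForms pages typically carry server-generated hidden inputs such as __VIEWSTATE and __EVENTVALIDATION, and the server expects them to be echoed back with every POST. A scraper therefore has to fetch the page, lift those values, and merge them into its own form fields. The sketch below does that against a made-up sample page; the markup, the field values, the keyword field, and the helper name hidden_fields are all assumptions.

```ruby
require 'uri'

# Sample markup standing in for a fetched ASP.NET page.
# The field values and the "keyword" input are made up.
SAMPLE_PAGE = <<HTML
<form method="post" action="Search.aspx">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTIzNDU2Nzg5Ozs+" />
  <input type="hidden" name="__EVENTVALIDATION" value="/wEWAgKj" />
  <input type="text" name="keyword" />
</form>
HTML

# Collect every hidden input so it can be echoed back with the next POST.
# A regex keeps the sketch dependency-free, but it assumes the attribute
# order type, name, value; a real crawler would use an HTML parser.
# hidden_fields is a hypothetical helper name.
def hidden_fields(html)
  fields = {}
  pattern = /<input[^>]*type="hidden"[^>]*name="([^"]*)"[^>]*value="([^"]*)"/i
  html.scan(pattern) do |name, value|
    fields[name] = value
  end
  fields
end

# Merge the server-issued state with the field we actually care about.
form = hidden_fields(SAMPLE_PAGE).merge('keyword' => 'laptop')
puts URI.encode_www_form(form)
```

Even with the hidden fields handled, any value that JavaScript computes in the browser is still out of reach for this approach.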
We have used Watir to build a monitoring application for one of our clients and for the most part it has worked well despite IE’s best attempts to sabotage the effort.
I liked the idea of Watir but could not stomach the idea of trying to build a robust crawler on the back of IE and Windows. A few short googles later and FireWatir came into my life. I could now drive Firefox on a Linux system or, more importantly for prototyping purposes, my beloved MacBook.
I must digress and profess my love for the beauty and terseness of Perl. Kids these days can’t understand it because C programming is nothing more than an Academic exercise. The rage for scripting these days is Ruby. All the cool kids are using it and I have been searching for an excuse to learn the language syntax for some time. The important thing to remember is that Ruby != Rails. Rails is a gigantic and indisputable FAIL in all but the most simple of contexts.
Here is a sample of how easy it is to use FireWatir:
# Load the FireWatir library.
require 'firewatir'
# Mix in the FireWatir module so Firefox can be referenced directly.
include FireWatir
# Start a Firefox instance to drive.
ff = Firefox.new
# Open Yahoo Mail.
ff.goto("http://mail.yahoo.com")
# Enter your user name.
ff.text_field(:name, "login").set("User_Name")
# Enter your password.
ff.text_field(:name, "passwd").set("Password")
# Click the Sign In button.
ff.button(:value, "Sign In").click
# Click the Sign Out link.
ff.link(:text, "Sign Out").click
# Close the browser.
ff.close
I can now easily crawl, test, or functionally monitor websites using an actual browser on a robust operating system.

2 Responses to “The Evolution of Web Scraping”
By That Guy on May 8, 2008 | Reply
I’m a scraper: I take static HTML pages and convert them to RSS feeds. While I don’t know your specific technical requirements, I recommend you look at feed43.com for scraping. I log in, point their service to the web page, and set up parsing instructions. You can experiment with free feeds to see if it helps.
By ftorres on May 9, 2008 | Reply
If you want to write a web scraper in C# or VB.NET, look at http://www.InCisif.net.