scrAPI is an excellent Ruby gem for scraping web sites. The example is a little confusing, so let’s deconstruct it.
The original example code from the cheat sheet is supposed to scrape an eBay search listing. To turn it into a working example, copy the entire search URL from your browser’s address bar and tell scrAPI to fetch and parse it.
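Before handing a pasted URL to scrAPI, it can help to confirm that Ruby parses it as a web address at all. Here is a minimal sketch using only the standard uri library. The helper name parse_search_url and the example URL (including its query parameter) are hypothetical and not eBay's actual format:

```ruby
require "uri"

# Hypothetical helper: returns a URI object for an http(s) URL,
# or nil when the pasted text is not a usable web address.
def parse_search_url(text)
  uri = URI.parse(text.strip)
  uri.is_a?(URI::HTTP) ? uri : nil
rescue URI::InvalidURIError
  nil
end

# Made-up URL for illustration; paste your own search URL instead.
uri = parse_search_url("http://search.ebay.com/search?query=g15")
puts uri.host   # search.ebay.com
puts uri.query  # query=g15
```

A nil result means the text should be fixed before it goes anywhere near the scraper.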
# Note: you will probably need to run gem install tidy to get
# the shared library (a .so file on Linux, a .dll on Windows).
require "rubygems"
require "scrapi" # Comes from gem install scrapi
# For each ebay_auction, do this:
ebay_auction = Scraper.define do
  process "h3.ens>a", :description => :text, :url => "@href"
  # Find an h3 tag whose class includes "ens", then the "a" tag directly inside it.
  # As instructed by :text, copy the text between the opening and closing "a" tags
  # (for example, "Religiously shaped soda can") into :description.
  # As instructed by "@href", copy the link's href attribute into :url.
  process "td.ebcPr>span", :price => :text
  # Next, find a td tag whose class includes "ebcPr", then the "span" tag
  # directly inside it. As instructed by :text, extract the price that the
  # span encloses and save it into :price.
  process "div.ebPicture>a>img", :image=>"@src"
  # Here, find a div tag whose class includes "ebPicture". Follow it into an "a"
  # tag, and then into an "img" tag. Store the value of the image's "src"
  # attribute (the URL of the picture) into :image.
  result :description, :url, :price, :image # Return all of this to the caller.
end
# This is the main scraper.
ebay = Scraper.define do
  array :auctions
  process "table.ebItemlist tr.single", :auctions=>ebay_auction
  # This tells the scraper where to start: a table element with the class
  # "ebItemlist", and inside it every tr element with the class "single".
  # Each matching row is handed to ebay_auction and collected in :auctions.
  result :auctions
end
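To see what those selectors pick out, here is a rough stand-in that uses REXML from Ruby's standard distribution instead of scrAPI. The HTML fragment is invented for illustration (the class names come from the scraper above, the values are made up), and extract_auctions is a hypothetical helper. The XPath expressions are only approximate translations of the CSS selectors: they match an exact class attribute, whereas a CSS class selector also matches elements that carry additional classes.

```ruby
require "rexml/document"

# Invented, well-formed fragment that mimics one row of a listing.
SAMPLE_HTML = <<~'HTML'
  <table class="ebItemlist">
    <tr class="single">
      <td><div class="ebPicture"><a href="/item/1"><img src="/img/1.jpg"/></a></div></td>
      <td><h3 class="ens"><a href="/item/1">Religiously shaped soda can</a></h3></td>
      <td class="ebcPr"><span>$9.99</span></td>
    </tr>
  </table>
HTML

# Hypothetical helper: returns one hash per matching table row.
def extract_auctions(html)
  doc = REXML::Document.new(html)
  rows = REXML::XPath.match(doc, "//table[@class='ebItemlist']//tr[@class='single']")
  rows.map do |row|
    link  = REXML::XPath.first(row, ".//h3[@class='ens']/a")            # h3.ens>a
    price = REXML::XPath.first(row, ".//td[@class='ebcPr']/span")       # td.ebcPr>span
    image = REXML::XPath.first(row, ".//div[@class='ebPicture']/a/img") # div.ebPicture>a>img
    {
      :description => link.text,
      :url         => link.attributes["href"],
      :price       => price.text,
      :image       => image.attributes["src"]
    }
  end
end

extract_auctions(SAMPLE_HTML).each { |auction| puts auction.inspect }
```

REXML needs well-formed markup, so this only works on tidy input like the sample. Real eBay pages are messier, which is where the tidy gem comes in.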
BTW, I’m doing this on Windows because of an issue with CRUD operations on an SSL-based svn repository not working within Linux.
On my Windows installation, I used the One-Click Ruby Installer, and then ran the following to get scrAPI:
# From a command prompt:
> gem install tidy
> gem install scrapi
I had to modify the code a little so that I could use scrapi: the require statement at the top couldn’t find scrapi, and the script errored out when it tried to make a new Scraper.
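One alternative to requiring the gem by its full install path is to ask RubyGems where the gem lives. This is only a sketch, assuming a RubyGems version that provides Gem::Specification.find_by_name (the Ruby 1.8 setup described here may predate it). The helper name gem_lib_file is made up for illustration:

```ruby
require "rubygems"

# Hypothetical helper: returns the path to lib/<file> inside an installed
# gem, or nil when RubyGems cannot find the gem.
def gem_lib_file(gem_name, file)
  spec = Gem::Specification.find_by_name(gem_name)
  File.join(spec.gem_dir, "lib", file)
rescue Gem::LoadError
  nil
end

path = gem_lib_file("scrapi", "scrapi.rb")
if path
  puts "scrapi found at #{path}"  # this is the path you could pass to require
else
  puts "scrapi is not installed; try: gem install scrapi"
end
```

This avoids baking the Ruby version and the gem version into the script, so the same file keeps working after an upgrade.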
Here is a working example:
require "rubygems"
require 'c:\ruby\lib\ruby\gems\1.8\gems\scrapi-1.2.0\lib\scrapi.rb'
ebay_auction = Scraper.define do
  process "h3.ens>a", :description => :text, :url => "@href"
  process "td.ebcPr>span", :price => :text
  process "div.ebPicture>a>img", :image=>"@src"
  result :description, :url, :price, :image
end
ebay = Scraper.define do
  array :auctions
  process "table.ebItemlist tr.single", :auctions=>ebay_auction
  result :auctions
end
puts ebay.scrape(URI.parse("http://search.ebay.com/[do your own search, and replace it here]"))
The URI I used came from a search I ran earlier (just the term “g15”).
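The puts call above prints each auction in whatever default form the object has. For more readable output, you can loop over the auctions and format the fields yourself. The sketch below does not call scrAPI. It uses a plain Struct named Auction as a stand-in, on the assumption that each scraped auction responds to description, url, price and image as declared by result. The helper format_auction and the sample values are invented:

```ruby
# Stand-in for the objects scrAPI is assumed to return.
Auction = Struct.new(:description, :url, :price, :image)

# Hypothetical helper: formats one auction as a single readable line.
def format_auction(auction)
  "#{auction.description} (#{auction.price}) #{auction.url}"
end

# Invented sample data in place of a real scrape result.
sample = [
  Auction.new("Religiously shaped soda can", "http://example.com/item/1",
              "$9.99", "http://example.com/img/1.jpg")
]

sample.each { |auction| puts format_auction(auction) }
```

With a real scrape, the same loop would run over the value returned by ebay.scrape instead of the sample array.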
Example and code based on work done by: Assaf Arkin at http://labnotes.org
Update (March 11 07): The formatting errors are due to how much WordPress likes to eat code. Enclosing everything in pre tags is a workaround, but it munges the code into a bundle of fun. I’ll redo it eventually, and drop it into dzone’s code snippets.