Using scrAPI — Some documentation

scrAPI is an excellent Ruby gem for scraping web sites. The example on its cheat sheet is a little confusing, so let’s deconstruct it.

The original example code from the cheat sheet is supposed to scrape an eBay search listing. To make it a working example, you cut and paste the entire search URL (straight from your browser window, no less) and tell scrAPI to fetch and parse it.


# Note, that you will probably need to do gem install tidy to
# get the .so (shared object file for linux), and .dll (Windows)
require "rubygems"
require "scrapi" # Comes from gem install scrapi

# For each ebay_auction, do this..
ebay_auction = Scraper.define do
  process "h3.ens>a", :description => :text, :url => "@href"
  # Search for an h3 tag that has "ens" as one of its classes, then move on to
  # the "a" tag directly inside it. Copy the text between the opening and closing
  # "a" tags ("Religiously shaped soda can") into :description, as instructed
  # by :text. Copy the value of the tag's href attribute (the http link)
  # into :url, as instructed by "@href".
  process "td.ebcPr>span", :price => :text
  # Next, search for a td tag that has "ebcPr" as one of its classes and move on
  # to the "span" tag directly inside it. Extract the price that the span
  # encloses, as instructed by :text, and save it into :price.
  process "div.ebPicture>a>img", :image => "@src"
  # Here, search for a div tag that has "ebPicture" as one of its classes. Follow
  # it into an "a" tag, and then on to an img tag. Store the value of the image's
  # src attribute (the link to the image) into :image.
  result :description, :url, :price, :image # Return all of this to the caller.
end

# This is the main scraping function
ebay = Scraper.define do
  array :auctions
  process "table.ebItemlist tr.single", :auctions => ebay_auction
  # This tells the scraper where to start: it finds a table element with the
  # class "ebItemlist", then each tr element inside it with the class "single",
  # and runs ebay_auction on that row.
  result :auctions
end
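
For anyone wondering what :text and "@href" actually pick up, here is a small sketch that uses REXML (shipped with Ruby) instead of scrAPI, so it runs without the gem. The HTML fragment is made up to fit the selectors, and the XPath expressions are only a rough stand-in for the CSS selectors: [@class='ens'] matches the exact attribute value, whereas the CSS .ens matches any tag that has ens among its classes.

```ruby
require "rexml/document"

# Hypothetical fragment shaped like one row of the listing the selectors expect.
html = <<-'HTML'
<table class="ebItemlist">
  <tr class="single">
    <td><h3 class="ens"><a href="http://example.com/item/1">Religiously shaped soda can</a></h3></td>
    <td class="ebcPr"><span>$9.99</span></td>
  </tr>
</table>
HTML

doc = REXML::Document.new(html)

# Rough equivalent of "h3.ens>a"
link = REXML::XPath.first(doc, "//h3[@class='ens']/a")
description = link.text        # the text between <a> and </a>, i.e. what :text asks for
url = link.attributes["href"]  # the value of the href attribute, i.e. what "@href" asks for

# Rough equivalent of "td.ebcPr>span"
price = REXML::XPath.first(doc, "//td[@class='ebcPr']/span").text

puts description, url, price
```

So "@href" is not a variable defined anywhere in the script. It is a string that names an attribute of the matched tag.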

BTW, I’m doing this on Windows because of an issue with SSL-based svn repository CRUD not working within Linux.

On my Windows installation, I used the instant Ruby One Click installer, and then I did the following to get scrAPI:

# From a command prompt:
> gem install tidy
> gem install scrapi

I had to modify the code a little bit so that I could use scrapi: the require statement at the top couldn’t find scrapi, and the script errored out when it tried to make a new Scraper.
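
One way around a failing require "scrapi" is to point require at the gem's full path. That ties the script to one machine and one gem version, so a possible refinement is to try the plain require first and fall back only when it fails. This is just a sketch: require_first is a helper name made up for this post, not part of scrAPI or RubyGems, and the demo uses a deliberately missing library name plus "set" from the standard library so that it runs anywhere.

```ruby
# Try each candidate in order and return the first one that loads.
# (require_first is a hypothetical helper, not a scrAPI or RubyGems method.)
def require_first(*candidates)
  candidates.each do |name|
    begin
      require name
      return name
    rescue LoadError
      next # this candidate is not available, try the next one
    end
  end
  raise LoadError, "none of these could be loaded: #{candidates.join(', ')}"
end

# Demo: the first name does not exist, so the helper falls through to "set".
loaded = require_first("no_such_library_for_demo", "set")
puts loaded
```

For the script in this post, the call would be require_first("scrapi", 'c:\ruby\lib\ruby\gems\1.8\gems\scrapi-1.2.0\lib\scrapi.rb'), with the path adjusted to wherever the gem lives on your machine.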

Here is a working example:

require "rubygems"
require 'c:\ruby\lib\ruby\gems\1.8\gems\scrapi-1.2.0\lib\scrapi.rb'

ebay_auction = Scraper.define do
  process "h3.ens>a", :description => :text, :url => "@href"
  process "td.ebcPr>span", :price => :text
  process "div.ebPicture>a>img", :image => "@src"
  result :description, :url, :price, :image
end

ebay = Scraper.define do
  array :auctions
  process "table.ebItemlist tr.single", :auctions => ebay_auction
  result :auctions
end

puts ebay.scrape(URI.parse("http://search.ebay.com/[do your own search, and replace it here]"))

The URI I used was from a search I ran earlier (just the term “g15”); paste your own search URL in its place.
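
Calling puts on the whole result is not very readable. As far as I can tell from the scrAPI README, a scraper with several result fields hands back objects with one accessor per field, so you can loop over the auctions and print what you want. The sketch below assumes that behaviour and fakes it with a plain Struct and invented data, so it runs without scrAPI or a network connection.

```ruby
# Stand-in for what ebay_auction is assumed to return for each row.
# The real scraper builds these objects itself; this Struct and its data are invented.
Auction = Struct.new(:description, :url, :price, :image)

auctions = [
  Auction.new("Religiously shaped soda can", "http://example.com/item/1",
              "$9.99", "http://example.com/1.jpg"),
  Auction.new("Logitech G15 keyboard", "http://example.com/item/2",
              "$45.00", nil)
]

# One line per auction: price, description, then the link in parentheses.
lines = auctions.map do |a|
  "#{a.price}  #{a.description}  (#{a.url})"
end

puts lines
```

With the real scraper, auctions would be the return value of ebay.scrape(...) instead of the hand-built array.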

Example and code based on work done by: Assaf Arkin at http://labnotes.org

Update (March 11 07): The formatting errors are due to how much WordPress likes to eat code. Enclosing everything in pre tags is a workaround, but it munges the code into a bundle of fun. I’ll redo it eventually, and drop it into dzone’s code snippets.


6 responses to “Using scrAPI — Some documentation”

  1. mark

    thanks this helped me :)

  2. naelee

    Helped me a lot as well. Thanks so much! The more ruby code you can explain just like this, the better! This is super! I’m a newbie to programming and Ruby, so this helps define things much better for me!

  3. naelee

    # Note, that you will probably need to do gem install tidy to
    # get the .so (shared object file for linux), and .dll (Windows)
    require “rubygems”
    require “scrapi” # Comes from gem install scrapi
    Here is a snippet of the first part of this code…and my question about it is below that:

    # For each ebay_auction, do this..
    ebay_auction = Scraper.define do
    process “h3.ens>a”, :description => :text, :url => “@href”
    # Search for an h3 tag with the word “ens” inside its class name.
    # Move on to the “A” tag.
    # From the “A” tag, copy out the text of what is inside the “A” bracket tags
    # religiously shaped soda can
    # (”Religiously shaped soda can”) as instructed by :text. Copy that into :description.
    # Copy the (http) link as told by “@href” into :url.

    Where is @href defined??

  4. naelee

    oops, in my previous post, I meant to put, “Here is a snippet of the first part of this code…” before the entire code reference… instead, I accidentally stuck it in the middle!

  5. morecode

    It has been a very long time since i’ve done scrAPI work, so i’m a bit rusty.

    I believe though, @href should come from: process “h3.ens>a”, :description => :text, :url => “@href”

  6. jukus

    Thanks for putting this together, labnotes should link this tbh.
