Using scrAPI — Some documentation

scrAPI is an excellent Ruby gem for scraping web sites. The example on its cheat sheet is a little confusing, so let's deconstruct it.

The original example code from the cheat sheet is supposed to scrape an eBay search listing. To turn it into a working example, copy the entire search URL (from your browser window, no less) and tell scrAPI to fetch and parse it.


# Note that you will probably need to run "gem install tidy" to
# get the .so (shared object file for Linux) or the .dll (Windows).
require "rubygems"
require "scrapi" # Comes from gem install scrapi

# For each ebay_auction, do this:
ebay_auction = Scraper.define do
  process "h3.ens>a", :description => :text, :url => "@href"
  # Find an h3 tag with the class "ens", then move on to the "a" tag
  # directly inside it. As instructed by :text, copy the text enclosed by
  # the "a" tags (for example "Religiously shaped soda can") into
  # :description. Copy the (http) link, as told by "@href", into :url.
  process "td.ebcPr>span", :price => :text
  # Next, find a td tag with the class "ebcPr" and move on to the "span"
  # tag directly inside it. As instructed by :text, extract the price that
  # the "span" tag encloses and save it into :price.
  process "div.ebPicture>a>img", :image => "@src"
  # Here, find a div tag with the class "ebPicture". Follow it into an "a"
  # tag, and then into an "img" tag. Take what is in the image's "src"
  # attribute and store that URL into :image.
  result :description, :url, :price, :image # Return all of this to the caller.
end

# This is the main scraping function
ebay = Scraper.define do
  array :auctions
  process "table.ebItemlist tr.single", :auctions => ebay_auction
  # This tells the scraper where to start. It jumps down to a table element
  # with the class "ebItemlist", then to each tr element inside it with the
  # class "single", and hands each of those rows to ebay_auction.
  result :auctions
end

BTW, I’m doing this on Windows because of an issue with SSL-based svn repository CRUD not working within Linux.

On my Windows installation, I used the instant Ruby One Click installer, and then I did the following to get scrAPI:

# From a command prompt:
> gem install tidy
> gem install scrapi

I had to modify the code a little bit so that I could use scrapi: the require statement at the top couldn’t find scrapi, and the script errored out when it tried to make a new Scraper.

Here is a working example:

require "rubygems"
require 'c:\ruby\lib\ruby\gems\1.8\gems\scrapi-1.2.0\lib\scrapi.rb'

ebay_auction = Scraper.define do
  process "h3.ens>a", :description => :text, :url => "@href"
  process "td.ebcPr>span", :price => :text
  process "div.ebPicture>a>img", :image => "@src"
  result :description, :url, :price, :image
end

ebay = Scraper.define do
  array :auctions
  process "table.ebItemlist tr.single", :auctions => ebay_auction
  result :auctions
end

puts ebay.scrape(URI.parse("http://search.ebay.com/[do your own search, and replace it here]"))

The URI is one that I searched for earlier (just the term “g15”).
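
The hard-coded path in the require line only matches that one Windows install. As a rough sketch (not something from the scrAPI docs), one way to keep the script portable is to try the plain gem name first and fall back to the absolute path only if that fails. The helper name require_first is made up for this example; the two candidates are the ones used in this post.

```ruby
require "rubygems"

# Hypothetical helper: try each candidate in order and return the first
# one that loads. Returns nil when none of them can be required.
def require_first(candidates)
  candidates.each do |candidate|
    begin
      require candidate
      return candidate
    rescue LoadError
      next # Not found under this name; try the next candidate.
    end
  end
  nil
end

# Both entries come from the post: the gem name first, then the
# hard-coded Windows path as a last resort.
SCRAPI_CANDIDATES = [
  "scrapi",
  'c:\ruby\lib\ruby\gems\1.8\gems\scrapi-1.2.0\lib\scrapi.rb'
]

loaded = require_first(SCRAPI_CANDIDATES)
if loaded
  puts "Loaded scrAPI via #{loaded}"
else
  puts "scrAPI not found; try gem install scrapi"
end
```

If the fallback path is the one that ends up loading, that is a hint the gem load path itself is worth a second look.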

Example and code based on work done by: Assaf Arkin at http://labnotes.org

Update (March 11 07): The formatting errors are due to how much WordPress likes to eat code. Enclosing everything in pre tags is a workaround, but it munges the code into a bundle of fun. I’ll redo it eventually, and drop it into dzone’s code snippets.


Dual booting between Windows and Linux (on separate hard disks)

My personal desktop system has two separate operating systems installed, and I dual boot between them.
Note: depending on how GRUB is configured on your system, your configuration file might be menu.lst or grub.conf. These files are normally located in /boot/grub, and are the ones that need to be edited for this guide.

Hardware:
2x 160 GB Seagate Barracuda Hard Disks in a software RAID-1 setup for Ubuntu Linux.
1x 80GB Western Digital Special Edition Hard Disk for Windows XP Media Center Edition.

Windows has its own bootloader called NTLDR that it uses.
Most current Linux distributions use GRUB (instead of LILO).

There is no need to replace NTLDR at all. Some have even gotten NTLDR to boot Linux by copying the first 512 bytes of their Linux partition into a file. There have also been some reports of virus scanners complaining after people replaced NTLDR. So it’s probably a good idea to leave it intact and alone.

We’ll cover GRUB here, since a lot of people like using it to boot.

Normally, dual booting is very simple if you have both Linux and Windows on the same disk (you simply specify the options in /boot/grub/menu.lst as such):

# The normal way to do it with Windows on the same disk
title Microsoft Windows XP
root (hd0,1)
savedefault
makeactive
chainloader +1

Here hd0 is the first disk (/dev/sda, although you have to guess; more on that in a minute), and the 1 after the comma is the second partition (/dev/sda2 in `fdisk -l`), because GRUB counts both disks and partitions from zero.
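
To make the off-by-one concrete, here is a small Ruby sketch that converts a Linux device name into GRUB notation. The function name grub_name is made up for this example, and it assumes the BIOS lists the disks in the same order Linux does, which is exactly the assumption that breaks on multi-disk setups like mine.

```ruby
# Hypothetical helper: convert a Linux device name such as /dev/sda2 into
# GRUB (legacy) notation such as (hd0,1). GRUB counts disks and
# partitions from zero; Linux letters disks from "a" and numbers
# partitions from 1.
# Assumption: the BIOS disk order matches the Linux disk order.
def grub_name(device)
  match = device.match(%r{\A/dev/(?:sd|hd)([a-z])([1-9]\d*)?\z})
  raise ArgumentError, "unrecognised device: #{device}" unless match

  disk = match[1].ord - "a".ord          # sda -> 0, sdb -> 1, ...
  return "(hd#{disk})" if match[2].nil?  # whole disk, no partition number

  partition = match[2].to_i - 1          # sda1 -> 0, sda2 -> 1, ...
  "(hd#{disk},#{partition})"
end

puts grub_name("/dev/sda2")
puts grub_name("/dev/sdb1")
```

Treat the result as a first guess only; if the entry doesn't boot, the disk order is the first thing to question.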

It is a little different when you have Windows on a separate disk altogether. My hard disks were set up as follows.

IDE Channel 0:
Seagate 160GB (Master)
Seagate 160GB (Slave)

IDE Channel 1:
CD/DVD Drive (Master)
Western Digital 80GB (Slave)

Now, GRUB’s numbering, (hdX,Y), is very different from what you will see in `fdisk -l`. I spent quite a bit of time wondering why (hd2,0) wouldn’t boot into Windows.

Update: Don’t want to go through the pain of partition hunt and peck? Try some of this Ruby code.

To figure out which one is your Windows partition, reboot, and at the GRUB menu press `c` to get a command shell.
Try root (hd0,0), then increment the partition number all the way up to 7: (hd0,1), (hd0,2), (hd0,3) and so on. See if you get something along the lines of “unknown, partition type 0x7.”

If you don’t, move on to the next disk, (hd1,0), (hd1,1) and so on, until you either get it or get an error message telling you that the disk doesn’t exist.

When you get a partition type of 0x7, that’s Windows NTFS.
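
If you would rather have a checklist than increment the numbers in your head, a few lines of Ruby can print every command to try. This is only a sketch: the function names are made up for this example, and the disk and partition counts are whatever you pass in.

```ruby
# Hypothetical helper: build the list of "root (hdX,Y)" commands to type
# at the GRUB prompt, disk by disk and partition by partition.
def probe_commands(disks, partitions)
  (0...disks).flat_map do |disk|
    (0...partitions).map { |part| "root (hd#{disk},#{part})" }
  end
end

# Hypothetical helper: true when a line copied from GRUB's reply reports
# partition type 0x7, which is what an NTFS partition shows up as.
def ntfs_reply?(reply)
  !!(reply =~ /partition type 0x7\b/i)
end

# Print a checklist for two disks with up to eight partitions each.
probe_commands(2, 8).each { |command| puts command }
```

ntfs_reply? is only useful if you copy GRUB's replies down somewhere; at the prompt itself you just read the screen.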

The rest from here is simple. Windows needs to be told that it is on the master hard disk so that it will work properly (there are some reports of problems without this, for some reason).

This is fairly easy in GRUB. Insert the following:

# For Windows that’s on another disk
# (Replace ‘(hd1)’ with the drive that Windows is on for you)
# You may also want to put this at the bottom, since some automated grub editors will overwrite this entry when they’re used
title Windows XP
map (hd0) (hd1)
map (hd1) (hd0)
chainloader (hd1,0)+1

Save, and reboot.

Parts derived from UbuntuGuide.org and other sources.

How to fix FATAL ERROR: Bad primary partition 1: Partition ends in the final partial cylinder

This came up with the Lexar Jumpdrive that I was attempting to re-partition. According to sources round the net, the partition tables might have been overlapping.

Here is the error in its full context (if you didn’t catch the title):
FATAL ERROR: Bad primary partition 1: Partition ends in the final partial cylinder

The solution is to first blow away all the partitions and write the empty table to the disk (reboot if you’re asked to in order to see the changes to the table), then create your new partitions over it.

Update: Just to add some more detail to this process: it was done entirely in Linux. Since cfdisk (the curses fdisk program) bombs out without allowing you to edit your partitions, you’ll need to edit the USB drive’s partitions using ol’ fdisk.

It’s not that bad once you get used to it. Start it as you would cfdisk (`fdisk /dev/usb_drive`), bring up the help menu (m), get a printout of all the partitions (p), delete the partitions (d, checking the help menu again if you need to), and then write the partition table to the USB drive (w).

Reboot if it asks. You should be able to re-partition normally after this.

The meaning of 8-bit, 16-bit, and 32-bit

Quite simply (but requiring more detail for other uses):

2^8 = 256
2^16 = 65536
2^32 = 4,294,967,296
2^x = 2 to the power of x, where x is the number of bits.

So 8 bits are simply 2 to the power of 8, which gives you 256 distinct values (0 through 255 in decimal). You’re basically doing number conversions, similar to how you would convert pounds to kilos.

These numbers, incidentally, are what some people call “special numbers,” because they appear everywhere in computers. Port numbers, for instance, are 16-bit, so you can have port numbers from 1 to 65535 (port 0 is reserved).
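
The same arithmetic as a quick Ruby sketch, in case you want to check other bit widths. The two function names are made up for this example.

```ruby
# Number of distinct values that fit in the given number of bits.
def value_count(bits)
  2**bits
end

# Largest unsigned value for the given number of bits. It is one less
# than the count, because counting starts at 0.
def max_unsigned(bits)
  value_count(bits) - 1
end

[8, 16, 32].each do |bits|
  puts "#{bits}-bit: #{value_count(bits)} values, 0 to #{max_unsigned(bits)}"
end
```

This is why the highest port number is 65535 and not 65536.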

Ruby on Rails 1.2.1 is out (Ruby on Rails)

$ sudo gem update rails --include-dependencies

If you have RAILS_GEM_VERSION uncommented and set to a specific version of Rails in environment.rb, make sure that this is commented out, before aimlessly wandering about wondering why some new functions don’t work (like I did).
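
If you want to check for this without opening the file, a rough Ruby sketch like the following looks for an uncommented RAILS_GEM_VERSION line. The function name is made up for this example, and the '1.1.6' in the sample text is only an illustrative value.

```ruby
# Hypothetical helper: return the version pinned by an uncommented
# RAILS_GEM_VERSION assignment, or nil when the line is missing or
# commented out.
def pinned_rails_version(source)
  source.each_line do |line|
    next if line =~ /\A\s*#/ # skip commented-out lines
    match = line.match(/\bRAILS_GEM_VERSION\s*=\s*['"]([^'"]+)['"]/)
    return match[1] if match
  end
  nil
end

# Sample text standing in for environment.rb; the version is made up.
sample = "# Be sure to restart your web server when you modify this file.\n" \
         "RAILS_GEM_VERSION = '1.1.6'\n"

version = pinned_rails_version(sample)
if version
  puts "Rails is pinned to #{version}; comment the line out to use the new gem"
else
  puts "No pinned Rails version found"
end
```

To run it against a real application, pass in File.read("config/environment.rb") from the application's root directory instead of the sample string.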

(Dave Thomas, author of Agile Web Development with Rails certainly has had his fair share of screaming when it came to getting people to upgrade)

You may want to just simply do:

$ sudo gem update --include-dependencies

(As new versions of capistrano/mongrel and others have been released too)