lachlanjc edited this page Dec 31, 2014 · 4 revisions

Unfortunately, real-world web pages are rarely clean, lean, and simple enough for a 6-year-old to scrape. Instead, we usually have to clean up or transform the data we retrieve before using it. This is where callbacks come in. A callback is specified as a block, similar to Nested properties. The key difference is that a property with a callback takes a selector argument in addition to the block, whereas a nested property takes no arguments other than the block itself. The example below revisits our GitHub scraper:

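require 'wombat'

class GithubScraper
  include Wombat::Crawler
  base_url "http://www.github.com"
  path "/"

  # The selector picks the element; the block is the callback.
  # The block receives the extracted text, and its return value
  # becomes the final value of the property.
  explore "xpath=//ul/li[2]/a" do |e|
    e.gsub(/Explore/, "LOVE")
  end
end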

Outputs:

{
  "explore"=>"LOVE GitHub"
}

See the difference? The explore property takes a selector argument and a block. The block is called once the property has been extracted from the page, and it receives the exact text that the selector matched. Inside the block, you can clean up, transform, or otherwise modify that text. The value returned by the block becomes the final value of the property. In this case, the GitHub page gives us the string Explore GitHub for that selector; we replace the word Explore with LOVE and get "LOVE GitHub".
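To make the callback step concrete without hitting the network, here is a minimal plain-Ruby sketch of the behaviour described above. The helper `apply_callback` is hypothetical and is not part of Wombat; it only mimics the idea that the extracted text is handed to the block and the block's return value becomes the property value. The input strings are assumed stand-ins for whatever the selector would match.

```ruby
# Hypothetical stand-in for the callback step; NOT Wombat's implementation.
# It passes the extracted text to the block (if one is given) and returns
# whatever the block returns. Without a block, the text is returned as is.
def apply_callback(extracted_text)
  return extracted_text unless block_given?
  yield(extracted_text)
end

# Same transformation as the explore property in the scraper above.
explore = apply_callback("Explore GitHub") { |e| e.gsub(/Explore/, "LOVE") }

# No callback: the extracted text is used unchanged.
untouched = apply_callback("Explore GitHub")

# A typical cleanup callback: trim whitespace and normalise case.
cleaned = apply_callback("  Explore GitHub\n") { |e| e.strip.downcase }

puts explore
puts untouched
puts cleaned
```

Because the block's return value is what ends up in the result, a callback that forgets to return the transformed string (for example, one that ends with a `puts`) would leave the property set to `nil` in this sketch.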
