Callbacks
Unfortunately, real-world web pages are rarely clean and lean enough to be scraped by a 6-year-old. Instead, we usually have to clean up or transform the data we retrieve before using it. This is where callbacks come in. Callbacks are specified using a block, similar to Nested properties. The key difference is that a property with a callback takes the selector argument as well as the block, whereas a nested property takes no arguments other than the block itself. Let's see the example below, again with our GitHub scraper:
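    require 'wombat'

    class GithubScraper
      include Wombat::Crawler

      base_url "http://www.github.com"
      path "/"

      # The block receives the text extracted by the selector;
      # its return value becomes the value of the property.
      explore "xpath=//ul/li[2]/a" do |e|
        e.gsub(/Explore/, "LOVE")
      end
    end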
Outputs:
    {
      "explore"=>"LOVE GitHub"
    }
See the difference? The explore property takes the selector argument and a block. The block is called once the property has been extracted from the page, and it receives the exact text found there. Inside the block, you get the chance to clean up, transform, or otherwise modify that text. The value returned from the block becomes the final value of the property. In this case, the GitHub page gives us the string "Explore GitHub" for that selector; we then replace the word "Explore" with the word "LOVE" and get "LOVE GitHub".
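To try out the callback logic without hitting the network, the flow described above can be sketched in plain Ruby. This is only an illustration, not Wombat's implementation: `apply_callback` is a hypothetical helper, and the string `"Explore GitHub"` is assumed to be what the selector returned.

```ruby
# Hypothetical helper (not part of Wombat): hands the extracted text to the
# callback block, if one is given, and returns whatever the block returns.
# Without a block, the extracted text is returned unchanged.
def apply_callback(extracted)
  block_given? ? yield(extracted) : extracted
end

# Assumed stand-in for the text the selector would extract from the page.
raw_text = "Explore GitHub"

# Same block body as the explore property in the scraper above.
with_callback = apply_callback(raw_text) { |e| e.gsub(/Explore/, "LOVE") }
without_callback = apply_callback(raw_text)

puts with_callback    # prints "LOVE GitHub"
puts without_callback # prints "Explore GitHub"
```

Because the block is ordinary Ruby, any cleanup that works on a string can go there, for example `e.strip` to trim whitespace or `e.to_i` to turn a counter into a number.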