felipecsl edited this page Jul 31, 2012 · 8 revisions

Iterators

If you need to iterate over a list of nodes that match the same selector, use an :iterator property. It takes a selector and a block, and runs the block once for each element on the page that matches the selector. In effect, it narrows the scraping context to each matched element in turn, so the properties declared inside the block are scraped relative to that element. An example makes this clearer:

Wombat.crawl do
  base_url "http://www.github.com"
  path "/explore"

  repositories "css=ol.ranked-repositories>li", :iterator do
    repo 'css=h3'
    description 'css=p.description'
  end
end

Outputs:

{  
  "repositories"=> 
  [
    {
      "repo"=>"EightMedia / hammer.js",
      "description"=> "A javascript library for multi-touch gestures :// You can touch this"
    },
    {
      "repo"=>"gummikana / email_mom.php",
      "description"=>"A small script that emails my mom when I'm abroad telling her that I'm alive."
    },
    {
      "repo"=>"hagino3000 / Struct.js",
      "description"=>"C Struct like object for JavaScript"
    },
    {
      "repo"=>"tumblr / policy", 
      "description"=>""
    },
    {
      "repo"=>"NaturalNode / natural",
      "description"=>"general natural language facilities for node"
    }
  ]
}
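
To make the narrowing of context concrete, here is a minimal plain-Ruby sketch of the idea. It is not Wombat's implementation: the Node struct, its find_all method, and the iterate helper are hypothetical stand-ins for a parsed document and for the iterator, and matching is done on tag names only instead of real CSS selectors.

```ruby
# Hypothetical stand-in for a parsed HTML node: tag name, text, child nodes.
Node = Struct.new(:tag, :text, :children) do
  # Return every descendant with the given tag, in document order.
  def find_all(tag_name)
    (children || []).flat_map do |child|
      (child.tag == tag_name ? [child] : []) + child.find_all(tag_name)
    end
  end
end

# Assumed model of :iterator: evaluate the nested properties once per matched
# node, with every lookup scoped to that node rather than the whole page.
def iterate(root, selector, properties)
  root.find_all(selector).map do |node|
    properties.each_with_object({}) do |(name, tag), result|
      first = node.find_all(tag).first # default behaviour: first match only
      result[name] = first ? first.text : nil
    end
  end
end

page = Node.new("ol", nil, [
  Node.new("li", nil, [Node.new("h3", "EightMedia / hammer.js"),
                       Node.new("p", "A javascript library")]),
  Node.new("li", nil, [Node.new("h3", "tumblr / policy"),
                       Node.new("p", "")])
])

repositories = iterate(page, "li", { "repo" => "h3", "description" => "p" })
p repositories
```

Each li yields one hash, which is why the real output above is an array with one entry per repository.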

Remember that, by default, a property returns only the first element that matches its selector. In the example above, even if an li contains several h3 elements, only the first one is returned. To retrieve all matching elements, use the :list option instead of the default :text, for example repo 'css=h3', :list.
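
The difference between the two formats can be pictured with a small plain-Ruby sketch. The extract helper below is hypothetical and only models the shape of the result (a single string versus an array); it does not call Wombat.

```ruby
# Hypothetical model of how the format option shapes a property's value.
# `matches` stands for the texts of every element the selector matched.
def extract(matches, format = :text)
  case format
  when :text then matches.first # default: only the first match
  when :list then matches       # every match, in document order
  else raise ArgumentError, "unsupported format: #{format.inspect}"
  end
end

headings = ["first heading", "second heading"]
p extract(headings)        # a single string
p extract(headings, :list) # an array of strings
```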
