Scrapy Headless

This is a plugin to make it easier to use scrapy with headless browsers, at the moment it only works with selenium grid as a driver.

Installation

For now the project is in a private bit bucket repo, so install it from there:

pip install scrapy-headless

Usage

You will first need to have a selenium grid server running, you may find some examples on: https://github.com/SeleniumHQ/docker-selenium/wiki/Getting-Started-with-Docker-Compose

The easiest way is by using docker-compose, here is a example docker-compose.yml file:

selenium-hub:
  image: selenium/hub
  ports:
  - 4444:4444

chrome:
  image: selenium/node-chrome
  links:
  - selenium-hub:hub
  environment:
    - HUB_PORT_4444_TCP_ADDR=hub
    - GRID_TIMEOUT=180 # Default timeout is 30s might be low for Selenium
  volumes:
  - /dev/shm:/dev/shm

And just,

$ docker-compose up -d

And, if you want more browser instances

$ docker-compose up -d --scale chrome=3 # For 3 browsers

On scrapy you will need to update your settings, for example:

from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

SELENIUM_GRID_URL = 'http://localhost:4444/wb/hub'  # Example for local grid with docker-compose
SELENIUM_NODES = 1  # Number of nodes(browsers) you are running on your grid
SELENIUM_CAPABILITIES = DesiredCapabilities.CHROME  # Example for Chrome

# You need also to change the default download handlers, like so:
DOWNLOAD_HANDLERS = {
    "http": "scrapy_selenium.SeleniumDownloadHandler",
    "https": "scrapy_selenium.SeleniumDownloadHandler",
}

You may also set a proxy for your selenium requests:

SELENIUM_PROXY = 'http://proxy.url:port'

Now all you need to do, is on your spider, for the requests you want handled by selenium use HeadlessRequest instead of scrapy's Request, for example:

from scrapy import Spider
from scrapy_headless import HeadlessRequest


class SomeSpider(Spider):
    ...
    def some_parser(self, response):
        ...
        yield HeadlessRequest(some_url, callback=self.other_parser)

If you need to do something with the driver after getting the url you may also set a driver_callback:

from scrapy import Spider
from scrapy_headless import HeadlessRequest


class SomeSpider(Spider):
    ...
    def some_parser(self, response):
        ...
        yield HeadlessRequest(some_url, callback=self.other_parser, driver_callback=self.process_webdriver)

    def process_webdriver(self, driver):
        ...

Future

Ideally this download handler should be able to use any of the following:

Selenium Grid
Selenium (without grid)
Pyppeteer

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
scrapy_headless		scrapy_headless
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Scrapy Headless

Installation

Usage

Future

About

Releases

Packages

Contributors 3

Languages

License

scrapy-plugins/scrapy-headless

Folders and files

Latest commit

History

Repository files navigation

Scrapy Headless

Installation

Usage

Future

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages