docker build --force-rm -t nutch .
Selenium hub with 10 Chrome nodes and 10 Firefox nodes each in headless mode
docker-compose -f docker-compose_selenium_nutch_solr.yaml up -d --scale chrome=10 --scale firefox=10
docker-compose -f docker-compose_nutch_solr.yaml up -d
docker-compose -f docker-compose_selenium_nutch_solr_tor.yaml up -d --scale firefox=40
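The three start-up commands above can be wrapped in one small helper so the node counts are easy to change. This is only a sketch: the function name stack_up, the variant labels hub/plain/tor and the DRY_RUN variable are hypothetical, while the compose file names and default scales are copied from the commands above.

```shell
# Hypothetical helper: start one of the stacks listed above.
# Usage: stack_up <hub|plain|tor> [firefox_nodes] [chrome_nodes]
# Set DRY_RUN=1 to print the command instead of running it.
stack_up() {
  variant=$1
  case "$variant" in
    hub)   cmd="docker-compose -f docker-compose_selenium_nutch_solr.yaml up -d --scale chrome=${3:-10} --scale firefox=${2:-10}" ;;
    plain) cmd="docker-compose -f docker-compose_nutch_solr.yaml up -d" ;;
    tor)   cmd="docker-compose -f docker-compose_selenium_nutch_solr_tor.yaml up -d --scale firefox=${2:-40}" ;;
    *)     echo "unknown variant: $variant" >&2; return 2 ;;
  esac
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$cmd"      # only show what would be executed
  else
    $cmd             # no argument contains spaces, so word splitting is safe here
  fi
}
```

For example, `DRY_RUN=1 stack_up hub 5 5` prints the hub command with five nodes of each browser without starting anything.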
This is an option when not using Selenium HUB.
- Install Chrome browser:
- edit sources.list
vi /etc/apt/sources.list
# add at the bottom of the file
deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main
- Download the signing key
wget https://dl.google.com/linux/linux_signing_key.pub
apt-key add linux_signing_key.pub
- Install the stable version of Google Chrome
apt update
apt install google-chrome-stable
NB: You may need to refresh the package index and then upgrade your packages:
apt update
apt upgrade
- download chrome driver from the download page
cd ~
wget https://chromedriver.storage.googleapis.com/2.44/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
rm chromedriver_linux64.zip
- If necessary, change the location of the ChromeDriver binary in nutch-default.xml or nutch-site.xml by setting the value of selenium.grid.binary
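Before pointing selenium.grid.binary at a driver, it can help to confirm that the file exists and is executable. A minimal sketch: check_driver is a hypothetical helper, and /root/chromedriver assumes the archive was unpacked in root's home directory as in the steps above.

```shell
# Hypothetical helper: verify that a WebDriver binary exists and is executable.
check_driver() {
  driver_path=$1
  if [ ! -f "$driver_path" ]; then
    echo "missing: $driver_path" >&2
    return 1
  fi
  if [ ! -x "$driver_path" ]; then
    echo "not executable: $driver_path (try: chmod +x $driver_path)" >&2
    return 1
  fi
  echo "ok: $driver_path"
}

# Example call (the path is an assumption):
# check_driver /root/chromedriver
```

The same check applies to the other drivers described in this document.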
This is an option when not using Selenium HUB.
- Install Firefox browser:
apt install firefox
- download gecko driver from the download page
cd ~
wget https://github.com/mozilla/geckodriver/releases/download/v0.23.0/geckodriver-v0.23.0-linux64.tar.gz
tar -zxvf geckodriver-v0.23.0-linux64.tar.gz
rm geckodriver-v0.23.0-linux64.tar.gz
- If necessary, change the location of the geckodriver binary in nutch-default.xml or nutch-site.xml by setting the value of selenium.grid.binary
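In nutch-site.xml the setting takes the form of a property element inside the configuration element. As a sketch, the helper below prints such an element. nutch_property is a hypothetical function, and /root/geckodriver is an assumed path that matches unpacking the archive in root's home directory.

```shell
# Hypothetical helper: print one Hadoop-style <property> element.
# Values are inserted as-is, so characters such as & or < are not XML-escaped.
nutch_property() {
  printf '<property>\n  <name>%s</name>\n  <value>%s</value>\n</property>\n' "$1" "$2"
}

# Example value (assumption): geckodriver unpacked in /root.
nutch_property selenium.grid.binary /root/geckodriver
```

The printed element can be pasted into conf/nutch-site.xml. Other properties can be emitted the same way by changing the two arguments.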
This is an option when not using Selenium HUB.
- Install Opera browser by downloading the latest version from the download link
wget http://download4.operacdn.com/ftp/pub/opera/desktop/56.0.3051.99/linux/opera-stable_56.0.3051.99_amd64.deb
dpkg -i opera-stable_56.0.3051.99_amd64.deb
apt install -f
NB Update to the appropriate Opera version.
- download opera driver from the download page
cd ~
wget https://github.com/operasoftware/operachromiumdriver/releases/download/v.2.40/operadriver_linux64.zip
unzip operadriver_linux64.zip
rm operadriver_linux64.zip
mv operadriver_linux64/operadriver /root
chmod +x /root/operadriver
- If necessary, change the location of the Opera driver binary in nutch-default.xml or nutch-site.xml by setting the value of selenium.grid.binary
- Set the value of selenium.driver in conf/nutch-site.xml to the Selenium driver you want to test
- If no screen is attached to the server, set selenium.enable.headless to true
- crawl
# connect to the nutch container
docker exec -it nutch bash
# execute the crawl
/root/nutch/bin/crawl -i -D solr.server.url=http://solr:8983/solr/mycore -s urls crawler 1
- check the result
- Test your result in Solr by opening localhost:8983/ in your browser
- navigate to the created core mycore
- execute the default query: *:*
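The same check can be scripted instead of clicking through the Solr admin UI. A sketch, assuming Solr is reachable on localhost:8983 and the core is named mycore as above. solr_select_url is a hypothetical helper, and rows=5 is an arbitrary choice.

```shell
# Hypothetical helper: build a Solr select URL for the match-all query *:*.
# Usage: solr_select_url [base_url] [core]
solr_select_url() {
  base=${1:-http://localhost:8983/solr}
  core=${2:-mycore}
  # %3A is the URL-encoded colon of *:*
  echo "$base/$core/select?q=*%3A*&rows=5"
}

# Fetch the first results (requires curl and a running Solr):
# curl -s "$(solr_select_url)"
```

A non-zero number of matching documents in the response suggests that the crawl was indexed.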
Regarding redirects: if you want the fetcher to follow redirects immediately, adjust http.redirect.max
(e.g., set it to 3) and the Fetcher will follow redirects right away instead of deferring them.
For quick testing, you can also set the required parameters on the command line, e.g.:
% bin/nutch parsechecker -Dplugin.includes='protocol-selenium|parse-tika' \
-Dselenium.grid.binary=.../geckodriver \
-Dselenium.enable.headless=true \
-followRedirects \
-dumpText https://nutch.apache.org