
Scraper with Selenium config behaves differently with or without headless option #65

retdop opened this issue Feb 20, 2019 · 2 comments

retdop commented Feb 20, 2019

Crawling and testing are not working with the following configuration.

[screenshot from 2019-02-20 18-18-44: the spider configuration]

The XPath doesn't seem to match any element.

[screenshot from 2019-02-20 18-20-31: the empty XPath result]

The crawler works when I comment out `options_selenium.add_argument('headless')` in masterspider.py, line 101.

This is very odd, as chromedriver is supposed to behave identically with or without the headless flag.

PagesJaunes is known to have implemented scraping protections; this may be related.
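
For anyone reproducing this, here is a minimal sketch of the toggle in question, assuming a standard Selenium Chrome setup. The variable name follows masterspider.py; the XPath is a placeholder, not the one from the screenshot:

```python
from selenium import webdriver

options_selenium = webdriver.ChromeOptions()
options_selenium.add_argument('headless')  # commenting this out makes the crawl work

driver = webdriver.Chrome(options=options_selenium)
driver.get('https://www.pagesjaunes.fr/')

# Headless: returns an empty list; headful: matches as expected.
elements = driver.find_elements_by_xpath('//div[@class="placeholder"]')
print(len(elements))
driver.quit()
```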

retdop commented Feb 20, 2019

Enhancement suggestions:

  • add an option to start Chrome headless
  • in a spider's configuration form, add an option to set custom request headers (a rough sketch of the User-Agent workaround follows the link below)

See https://medium.com/@addnab/puppeteer-quick-fix-for-differences-between-headless-and-headful-versions-of-a-webpage-5b168bd5fe4a
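
The article's quick fix is to override the User-Agent, since headless Chrome advertises itself as "HeadlessChrome" and some sites serve different content when they see that. A sketch of the same idea in Selenium; the UA string is illustrative, not taken from the project:

```python
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('headless')
# Present a regular (headful) Chrome User-Agent while running headless.
options.add_argument(
    'user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'
)
driver = webdriver.Chrome(options=options)
```

Note this only changes the User-Agent, not arbitrary request headers, which is why the comment below still considers a proxy for the general case.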


retdop commented Feb 20, 2019

After some research, it turns out to be quite complicated to change request headers in Selenium (the easiest way is to route traffic through a local proxy...).

Also, there seem to be quite a few differences between Chrome and headless Chrome, possibly by design. So the best solution would actually be to offer an option to use Firefox (geckodriver) instead of Chrome, which actually solves the problem here (tested).
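
A minimal sketch of that alternative: headless Firefox via geckodriver, which behaved the same headless and headful on this site in the test above. Assumes geckodriver is on PATH:

```python
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options_firefox = Options()
options_firefox.add_argument('-headless')  # Firefox's headless flag

driver = webdriver.Firefox(options=options_firefox)
driver.get('https://www.pagesjaunes.fr/')
driver.quit()
```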
