[Discussion] Avoiding blacklists: rotating proxies or varying scraping params #11
Replies: 4 comments
-
I don't think those features will be implemented directly in this project. Eventually I would like to provide a Spider class interface to be used with Scrapy. Scrapy already supports AutoThrottle and has some recommended practices on how to distribute scrapers and avoid getting banned. If there is another scraping framework with those features, we might also provide an interface for it.
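For anyone who wants to see what the Scrapy side looks like, here is a rough sketch of a `settings.py` fragment that turns on AutoThrottle and slows the crawl down. The setting names are Scrapy's own; the values are only illustrative assumptions, not something this project ships or recommends.

```python
# Hypothetical settings.py fragment for a Scrapy project.
# The numbers below are illustrative guesses; tune them for your target site.

# Let Scrapy adapt the delay to the server's observed response latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0          # initial delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60.0           # upper bound when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # aim for one request in flight at a time

# Baseline delay between requests to the same site; AutoThrottle never goes below it.
DOWNLOAD_DELAY = 3.0
# Wait a random 0.5x to 1.5x of DOWNLOAD_DELAY so requests are not evenly spaced.
RANDOMIZE_DOWNLOAD_DELAY = True

# Hard cap on parallel requests per domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1
```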
-
I can implement rotating proxies if there is interest, but I don't want to do it if you won't accept it. If I added the ability to rotate proxies, would you accept the pull request? https://pypi.org/project/requests/
OR
You can easily update the code yourself to accept a list of proxies as an argument. But I am willing to do it.
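To make the "list of proxies as an argument" idea concrete, here is a minimal sketch of what I have in mind. The helper name `rotating_proxies` and the proxy URLs are made up for illustration; nothing like this exists in the project today. It only builds the `proxies` dictionaries in the shape that `requests` expects and cycles through them round-robin.

```python
from itertools import cycle
from typing import Dict, Iterable, Iterator


def rotating_proxies(proxy_urls: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield requests-style ``proxies`` dicts, cycling through ``proxy_urls`` forever.

    Hypothetical helper: the name and behaviour are assumptions for this sketch.
    """
    urls = list(proxy_urls)
    if not urls:
        raise ValueError("at least one proxy URL is required")
    # Route both plain and TLS traffic through the same proxy on each turn.
    return ({"http": url, "https": url} for url in cycle(urls))


# Placeholder proxy addresses; substitute your own.
pool = rotating_proxies(["http://10.0.0.1:8080", "http://10.0.0.2:8080"])
first = next(pool)
second = next(pool)
third = next(pool)  # wraps around to the first proxy again
```

Each request would then pick up the next entry, e.g. `requests.get(url, proxies=next(pool), timeout=10)`.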
-
How likely am I to get banned from Facebook if I scrape 3 Facebook pages, at least 2 pages of results each, every 15 minutes? All through a single machine with a single IP address (without logging in to Facebook). I should really get started looking into Scrapy and all the stuff you all have mentioned above.
-
I have been scraping FB groups for 6 years using my own C# app (until it broke recently). It seems (from my experience) that scraping from a mature FB account with lots of "normal" activity will not get you banned. I have downloaded groups with > 1 million posts and comments, which entailed more than 8 weeks of continuous downloading. If you create a new FB account and start scraping from it, you will be insta-banned. Also, DO NOT rotate proxies. Your account will be locked immediately, because logging in from two places in close succession usually means some hacker has your login details, so FB locks your account until you go through verification. I have been downloading social media since 1999.
-
Seeing as getting blacklisted is likely a concern for many people using a module such as this, I'd like to start a discussion on ideas for avoiding it.
What are people's thoughts on implementing features that reduce the chance of getting blacklisted, such as rotating proxies, or allowing adjustment of scraping params such as the time between requests?
Has anybody implemented something similar in their own usage? What's worked and what hasn't?
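As a starting point for the "time between requests" idea, here is a small sketch of a randomized delay between fetches. The names `jittered_delay` and `polite_iter` are hypothetical and not part of this module; the 0.5 jitter default is an arbitrary assumption.

```python
import random
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def jittered_delay(base_seconds: float, jitter: float = 0.5) -> float:
    """Return a delay drawn uniformly from base * (1 - jitter) to base * (1 + jitter)."""
    if base_seconds < 0 or not 0 <= jitter <= 1:
        raise ValueError("base_seconds must be >= 0 and jitter must be in [0, 1]")
    return base_seconds * random.uniform(1 - jitter, 1 + jitter)


def polite_iter(
    items: Iterable[T],
    base_seconds: float,
    jitter: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[T]:
    """Yield items one by one, pausing a randomized interval between them."""
    for index, item in enumerate(items):
        if index > 0:  # no pause before the very first item
            sleep(jittered_delay(base_seconds, jitter))
        yield item
```

Usage would be something like `for url in polite_iter(urls, base_seconds=30): fetch(url)`, where `fetch` stands in for whatever does the actual request. The `sleep` parameter is injectable only so the pause can be stubbed out when testing.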