Weedmaps.com Listings Scraper
This bot scrapes all the listings in a region, along with the details of each listing.
Clone the repository:
git clone https://github.com/skdcodes/freelancer-python-luminati-weedmaps.git
cd freelancer-python-luminati-weedmaps
You need python3 and pip3 installed. Install the Python dependencies by running:
pip3 install -r requirements.txt
Run the script:
python3 scrape.py
After the script completes, an Excel sheet is created at dumps/data.xlsx
-
You can choose to scrape listings by either region or a GeoPoint pin. Select the mode by setting the variable GATHER_TYPE = "region" or GATHER_TYPE = "pin", as below:
    # scrape.py
    GATHER_TYPE = "region"
When region is chosen, only the variables REGION and RADIUS are considered. When pin is chosen, only the variables CENTER and RADIUS are considered.
-
The REGION variable is the region "slug" from which you would like to scrape listings. For example, to scrape all locations, set REGION = "earth". Similarly, for a more specific region such as California, set REGION = "california":
    # scrape.py
    REGION = "california"
    RADIUS = 75
-
The CENTER variable is a GeoPoint-like object that takes a latitude lat and a longitude lng. All listings within RADIUS of this coordinate are scraped.
    # scrape.py
    CENTER = {"lat": 34.04871368408203, "lng": -118.2420196533203}
    RADIUS = 75
For an arbitrary GeoPoint and radius, at most 10,000 listings can be retrieved.
-
Threading is implemented. By default, 20 requests are performed at a time.
    # scrape.py
    MAX_WORKERS = 20
-
A rotating proxy service is of paramount importance. I used the Luminati.io service for this purpose, so that requests are spread across many IP addresses.
    # scrape.py
    PROXIES = {
        'http': 'http://lum-customer-hl_233xze5-zone-static:g32kc5833f20t@zproxy.lum-superproxy.io:22225',
        'https': 'http://lum-customer-hl_233xze5-zone-static:g32kc5833f20t@zproxy.lum-superproxy.io:22225',
    }
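The settings above could fit together roughly as follows. This is a minimal sketch, not the code in scrape.py: the helper names active_settings, fetch_listing and fetch_all are hypothetical, and fetch_listing is a stand-in that performs no network request. It only illustrates that each GATHER_TYPE considers its own pair of variables and that MAX_WORKERS bounds the number of concurrent requests.

```python
from concurrent.futures import ThreadPoolExecutor

# Values copied from the configuration examples above.
GATHER_TYPE = "pin"
REGION = "california"
CENTER = {"lat": 34.04871368408203, "lng": -118.2420196533203}
RADIUS = 75
MAX_WORKERS = 20


def active_settings(gather_type):
    """Return only the variables that the chosen mode considers (hypothetical helper)."""
    if gather_type == "region":
        return {"region": REGION, "radius": RADIUS}
    if gather_type == "pin":
        return {"center": CENTER, "radius": RADIUS}
    raise ValueError(f"unknown GATHER_TYPE: {gather_type!r}")


def fetch_listing(listing_id):
    """Hypothetical stand-in for one HTTP request; no network access happens here."""
    return {"id": listing_id, "mode": GATHER_TYPE}


def fetch_all(listing_ids, max_workers=MAX_WORKERS):
    """Fetch listings concurrently, with at most max_workers calls in flight."""
    # pool.map returns results in the same order as the input ids.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_listing, listing_ids))
```

Calling fetch_all(range(50)) would run the 50 stand-in requests on up to 20 threads and return the results in input order.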
- As of this writing, a total of 2,104 listings were successfully scraped around Los Angeles, California in a 70mi radius with center 34.04871368408203, -118.2420196533203.
- The listings search uses Elasticsearch, therefore a maximum of 10,000 listings can be retrieved.
- Only the User-Agent header has to be set to cloak your bot.
- Rate limiting per IP is implemented. Therefore a rotating proxy service like Luminati.io is essential to spread requests across many IP addresses.
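The last two notes can be sketched with the Python standard library alone. This is an assumption-laden illustration, not the code in scrape.py: it uses urllib rather than whichever HTTP client the script actually relies on, the proxy URL and credentials are placeholders, the User-Agent string is an arbitrary example, and make_opener is a hypothetical helper name.

```python
import urllib.request

# Placeholder proxy URL and credentials; substitute your own rotating proxy.
PROXIES = {
    "http": "http://USERNAME:PASSWORD@proxy.example.com:22225",
    "https": "http://USERNAME:PASSWORD@proxy.example.com:22225",
}

# Arbitrary example value; any browser-like User-Agent string could be used.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64)"


def make_opener(proxies=PROXIES, user_agent=USER_AGENT):
    """Build an opener that routes traffic through the proxy and sets User-Agent."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxies))
    # Replace the default "Python-urllib/x.y" header with a browser-like one.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener
```

Building the opener performs no network traffic; a request is only sent when opener.open(url) is called.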