A bot that scrapes all available listings from https://weedmaps.com by GeoPoint. It saves each listing's details into an Excel sheet.
Weedmaps.com Listings Scraper

This bot scrapes all listings, along with each listing's details, in a given region or around a GeoPoint pin.

Install

Clone the repository:

git clone https://github.com/skdcodes/freelancer-python-luminati-weedmaps.git
cd freelancer-python-luminati-weedmaps

You need python3 and pip3 installed. Install the Python dependencies by running:

pip3 install -r requirements.txt

Usage

Run the script:

python3 scrape.py

After the script completes, an Excel sheet will be created at dumps/data.xlsx.
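As a rough illustration of the dump step, nested listing data has to be flattened into one row per listing before it can be written to a sheet. This is only a sketch; the field names below are hypothetical and may not match what scrape.py actually extracts.

```python
# Hypothetical sketch of flattening a nested listing dict into an
# Excel-friendly row. Field names are assumptions, not the real schema.
def flatten_listing(listing):
    """Pick out scalar fields from a nested listing dict."""
    latlng = listing.get("latlng", {})
    return {
        "name": listing.get("name"),
        "city": listing.get("city"),
        "lat": latlng.get("lat"),
        "lng": latlng.get("lng"),
        "rating": listing.get("rating"),
    }

listings = [
    {"name": "Demo Dispensary", "city": "Los Angeles",
     "latlng": {"lat": 34.05, "lng": -118.24}, "rating": 4.5},
]
rows = [flatten_listing(l) for l in listings]
# A dataframe library would then write the rows, e.g.:
# pandas.DataFrame(rows).to_excel("dumps/data.xlsx", index=False)
```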

Features

  • You can choose to scrape listings either by region or by a GeoPoint pin. Set the mode by assigning GATHER_TYPE = "pin" or GATHER_TYPE = "region" as below:

    # scrape.py
    
    GATHER_TYPE = "region"

    When region is chosen, only the variables REGION and RADIUS are considered.

    When pin is chosen, only variables CENTER and RADIUS are considered.
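The dispatch described above might look like the following sketch. The variable names match the README; the resolve_search_area helper is hypothetical, not the function scrape.py actually uses.

```python
# Hypothetical sketch of the GATHER_TYPE dispatch described above.
GATHER_TYPE = "region"   # or "pin"
REGION = "california"
CENTER = {"lat": 34.04871368408203, "lng": -118.2420196533203}
RADIUS = 75

def resolve_search_area():
    """Return only the parameters relevant to the chosen mode."""
    if GATHER_TYPE == "region":
        return {"region_slug": REGION, "radius": RADIUS}
    if GATHER_TYPE == "pin":
        return {"lat": CENTER["lat"], "lng": CENTER["lng"], "radius": RADIUS}
    raise ValueError(f"unknown GATHER_TYPE: {GATHER_TYPE!r}")
```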

  • The REGION variable is the region "slug" from which you would like to scrape listings. For example, to scrape all locations, set REGION = "earth". Similarly, for a more specific region, say California, set REGION = "california":

    # scrape.py
    
    REGION = "california"
    RADIUS = 75
  • The CENTER variable is a GeoPoint-like object that takes a latitude lat and a longitude lng. All listings within RADIUS of this coordinate are scraped.

    # scrape.py
    
    CENTER = {"lat": 34.04871368408203, "lng": -118.2420196533203}
    RADIUS = 75

    For an arbitrary GeoPoint and radius, at most 10,000 listings can be retrieved.
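The 10,000-listing ceiling (see Notes) means pagination has to be capped. A minimal sketch, assuming a hypothetical page size of 100 (the real scrape.py may paginate differently):

```python
# Sketch of capping pagination at the 10,000-result window.
# PAGE_SIZE is an assumption for illustration.
PAGE_SIZE = 100
MAX_RESULTS = 10_000

def pages_needed(total_hits):
    """Number of pages to request, capped at the 10,000-result window."""
    capped = min(total_hits, MAX_RESULTS)
    return -(-capped // PAGE_SIZE)  # ceiling division
```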

  • Threading is implemented. By default, 20 requests are performed at a time.

    # scrape.py
    
    MAX_WORKERS = 20
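The threading described above can be sketched with a standard thread pool; fetch_listing here is a hypothetical stand-in for the real request function in scrape.py.

```python
# Sketch of running MAX_WORKERS concurrent fetches with a thread pool.
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 20

def fetch_listing(listing_id):
    # Placeholder: the real script would perform an HTTP request here.
    return {"id": listing_id}

def fetch_all(listing_ids):
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        # map() preserves input order in its results.
        return list(pool.map(fetch_listing, listing_ids))
```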
  • A rotating proxy service is essential. I've used the Luminati.io service to fire bursts of requests without being rate-limited.

    # scrape.py
    
    PROXIES = {
        'http': 'http://lum-customer-hl_233xze5-zone-static:g32kc5833f20t@zproxy.lum-superproxy.io:22225',
        'https': 'http://lum-customer-hl_233xze5-zone-static:g32kc5833f20t@zproxy.lum-superproxy.io:22225',
    }

Notes

  • As of this writing, a total of 2,104 listings were successfully scraped around Los Angeles, California, within a 70 mi radius of center 34.04871368408203, -118.2420196533203.
  • The listings search uses Elasticsearch, so a maximum of 10,000 listings can be retrieved.
  • Only the User-Agent header has to be set to cloak the bot.
  • Rate limiting is applied per IP, so a rotating proxy service like Luminati.io is essential to fire bursts of requests.
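The last two notes can be combined into one request configuration: a browser-like User-Agent plus the rotating-proxy mapping, as they would be passed to the requests library. The User-Agent string and proxy credentials below are placeholders.

```python
# Sketch of the cloaking setup: placeholder User-Agent and proxy values.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    )
}
PROXIES = {
    "http": "http://user:pass@zproxy.lum-superproxy.io:22225",
    "https": "http://user:pass@zproxy.lum-superproxy.io:22225",
}
# Each request would then be made as:
# response = requests.get(url, headers=HEADERS, proxies=PROXIES, timeout=30)
```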
