This tool crawls the product catalogue of the online grocery store Oda. The codebase defines abstractions that can be used to build custom adapters for different crawling use cases.
The project is my solution to a home assignment I completed when applying for a position at the company in October 2021. For a detailed presentation of my solution (choices, approach, tradeoffs, improvements), check out my notes on the solution.
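As a rough idea of what the adapter abstractions mentioned above could look like, here is a hypothetical sketch. It is not the project's actual API; the names `Adapter`, `ProductAdapter`, `extract_links`, and `extract_data` are made up for illustration:

```python
from abc import ABC, abstractmethod
from typing import Optional

from bs4 import BeautifulSoup


class Adapter(ABC):
    """Decides which links to follow and what data to extract from a page."""

    @abstractmethod
    def extract_links(self, soup: BeautifulSoup) -> list[str]:
        ...

    @abstractmethod
    def extract_data(self, soup: BeautifulSoup) -> Optional[dict]:
        ...


class ProductAdapter(Adapter):
    """Example adapter: follow every link, scrape a product title if present."""

    def extract_links(self, soup: BeautifulSoup) -> list[str]:
        # A real adapter would filter by URL pattern (e.g. /products/ pages).
        return [a["href"] for a in soup.find_all("a", href=True)]

    def extract_data(self, soup: BeautifulSoup) -> Optional[dict]:
        title = soup.find("h1")
        return {"title": title.get_text(strip=True)} if title else None
```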
To run this project you need:

- python >3.9
- pipenv
You can then install the dependencies:

```sh
pipenv install
```
This project has 2 dependencies:

- `requests` as HTTP client
- `beautifulsoup4` as HTML parser
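As a rough illustration of how these two libraries fit together in a crawler (the URL is just an example):

```python
import requests
from bs4 import BeautifulSoup

# requests handles HTTP, beautifulsoup4 handles the HTML parsing.
response = requests.get("https://oda.com/no/products/")
soup = BeautifulSoup(response.text, "html.parser")
links = [a["href"] for a in soup.find_all("a", href=True)]
print(f"found {len(links)} links")
```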
Run the tests with:

```sh
python -m unittest discover tests
```
Start the crawler with:

```sh
python main.py
```
There is a hardcoded max number of page visits, but it's possible to stop the parsing at any moment with CTRL+C.
Before stopping, the program will save the current state in 3 files:

- `oda_frontier_<DATETIME>.json`, for debugging
- `oda_visited_<DATETIME>.json`, for debugging
- `oda_products_<DATETIME>.csv` <-- this is the main application output
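A minimal sketch of this save-on-interrupt behaviour, assuming a timestamp format and in-memory state shapes that may differ from the actual implementation:

```python
import csv
import json
from datetime import datetime


def save_state(frontier, visited, products):
    """Persist crawler state using the file-name pattern described above."""
    stamp = datetime.now().strftime("%Y-%m-%dT%H-%M-%S")  # assumed format
    with open(f"oda_frontier_{stamp}.json", "w") as f:
        json.dump(list(frontier), f, indent=2)
    with open(f"oda_visited_{stamp}.json", "w") as f:
        json.dump(list(visited), f, indent=2)
    with open(f"oda_products_{stamp}.csv", "w", newline="") as f:
        csv.writer(f).writerows(products)


if __name__ == "__main__":
    frontier, visited = ["https://oda.com/no/products/"], set()
    products = [("name", "price", "category")]  # header row only, as a demo
    try:
        pass  # the crawl loop would run here, bounded by the max-visit cap
    except KeyboardInterrupt:
        pass  # CTRL+C lands here; fall through and save whatever we have
    finally:
        save_state(frontier, visited, products)
```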
This is the outcome of a full run. Check out the /output folder.
Here we get a glimpse of how the crawler discovered and visited all URLs.
This shows:
- an initial discovery phase
- a peak around 1500 visits (with the frontier topping out at ~2800 URLs)
- an almost-linear, smooth slope going through all the products in the frontier
- a final "bumpy" ride when reaching the bottom part of the frontier, discovering the remaining products
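To give an idea of where a curve like this comes from, here is a hypothetical BFS-style crawl loop that records the frontier size at each visit (names and structure are illustrative, not the project's actual code):

```python
from collections import deque


def crawl(seeds, fetch_links, max_visits=2000):
    """BFS-style loop returning a (pages visited, frontier size) history."""
    frontier = deque(seeds)
    visited, history = set(), []
    while frontier and len(visited) < max_visits:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        # fetch_links is supplied by the caller: fetch the page, return links.
        for link in fetch_links(url):
            if link not in visited:
                frontier.append(link)
        history.append((len(visited), len(frontier)))
    return visited, history
```

Plotting `history` gives the dynamics described above: the frontier grows while new pages link out faster than they are consumed, peaks, and then drains.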
Using the crawler output, I created some visualizations on https://rawgraphs.io/
All products grouped by category (and subcategory):
All products grouped by category, sized by price:
Guess which category is the red one, where half the products are cheap while the other half are expensive.
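For anyone who wants to poke at the numbers behind these charts, here is a quick sketch for summarising the products CSV before uploading it to RAWGraphs; the file path and the `category`/`price` column names are assumptions about the output format:

```python
import csv
from collections import defaultdict

prices = defaultdict(list)
with open("output/oda_products.csv", newline="") as f:  # assumed path
    for row in csv.DictReader(f):
        prices[row["category"]].append(float(row["price"]))

# Per-category product count and price range, which makes a price
# spread like the red category above easy to spot.
for category, values in sorted(prices.items()):
    print(f"{category}: n={len(values)}, min={min(values):.2f}, max={max(values):.2f}")
```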