This tool crawls the product catalogue of the online grocery store Oda. The codebase defines abstractions that can be used to build custom adapters for different crawling use cases.
The project is my solution to a home assignment I completed when applying for a position at the company in October 2021. For a detailed presentation of my solution (choices, approach, tradeoffs, improvements), check out my notes on the solution.
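As a rough idea of what the adapter abstractions mentioned above could look like, here is a hypothetical sketch. It is not the project's actual API; the names `Adapter`, `ProductAdapter`, `extract_links`, and `extract_data` are made up for illustration:

```python
from abc import ABC, abstractmethod
from typing import Optional

from bs4 import BeautifulSoup


class Adapter(ABC):
    """Decides which links to follow and what data to extract from a page."""

    @abstractmethod
    def extract_links(self, soup: BeautifulSoup) -> list[str]:
        ...

    @abstractmethod
    def extract_data(self, soup: BeautifulSoup) -> Optional[dict]:
        ...


class ProductAdapter(Adapter):
    """Example adapter: follow every link, scrape a product title if present."""

    def extract_links(self, soup: BeautifulSoup) -> list[str]:
        # A real adapter would filter by URL pattern (e.g. /products/ pages).
        return [a["href"] for a in soup.find_all("a", href=True)]

    def extract_data(self, soup: BeautifulSoup) -> Optional[dict]:
        title = soup.find("h1")
        return {"title": title.get_text(strip=True)} if title else None
```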
To run this project you need:

- python >3.9
- pipenv
You can then install the dependencies:

```sh
pipenv install
```
This project has 2 dependencies:

- `requests` as HTTP client
- `beautifulsoup4` as HTML parser
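As a rough illustration of how these two libraries fit together in a crawler (the URL is just an example):

```python
import requests
from bs4 import BeautifulSoup

# requests handles HTTP, beautifulsoup4 handles the HTML parsing.
response = requests.get("https://oda.com/no/products/")
soup = BeautifulSoup(response.text, "html.parser")
links = [a["href"] for a in soup.find_all("a", href=True)]
print(f"found {len(links)} links")
```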
Run the tests with:

```sh
python -m unittest discover tests
```
Start the crawler with:

```sh
python main.py
```
There is a hardcoded max number of page visits, but it's possible to stop the parsing at any moment with CTRL+C.
Before stopping, the program will save the current state in 3 files:

- `oda_frontier_<DATETIME>.json`, for debugging
- `oda_visited_<DATETIME>.json`, for debugging
- `oda_products_<DATETIME>.csv` <-- this is the main application output
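A minimal sketch of this save-on-interrupt behaviour, assuming a timestamp format and in-memory state shapes that may differ from the actual implementation:

```python
import csv
import json
from datetime import datetime


def save_state(frontier, visited, products):
    """Persist crawler state using the file-name pattern described above."""
    stamp = datetime.now().strftime("%Y-%m-%dT%H-%M-%S")  # assumed format
    with open(f"oda_frontier_{stamp}.json", "w") as f:
        json.dump(list(frontier), f, indent=2)
    with open(f"oda_visited_{stamp}.json", "w") as f:
        json.dump(list(visited), f, indent=2)
    with open(f"oda_products_{stamp}.csv", "w", newline="") as f:
        csv.writer(f).writerows(products)


if __name__ == "__main__":
    frontier, visited = ["https://oda.com/no/products/"], set()
    products = [("name", "price", "category")]  # header row only, as a demo
    try:
        pass  # the crawl loop would run here, bounded by the max-visit cap
    except KeyboardInterrupt:
        pass  # CTRL+C lands here; fall through and save whatever we have
    finally:
        save_state(frontier, visited, products)
```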
This is the outcome of a full run. Check out the /output folder.
Here we get a glimpse of how the crawler discovered and visited all URLs.
This shows:
- an initial discovery phase
- a peak around 1500 visits (with the frontier topping out at ~2800 URLs)
- an almost-linear, smooth slope going through all the products in the frontier
- a final "bumpy" ride when reaching the bottom part of the frontier, discovering the remaining products
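To give an idea of where a curve like this comes from, here is a hypothetical BFS-style crawl loop that records the frontier size at each visit (names and structure are illustrative, not the project's actual code):

```python
from collections import deque


def crawl(seeds, fetch_links, max_visits=2000):
    """BFS-style loop returning a (pages visited, frontier size) history."""
    frontier = deque(seeds)
    visited, history = set(), []
    while frontier and len(visited) < max_visits:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        # fetch_links is supplied by the caller: fetch the page, return links.
        for link in fetch_links(url):
            if link not in visited:
                frontier.append(link)
        history.append((len(visited), len(frontier)))
    return visited, history
```

Plotting `history` gives the dynamics described above: the frontier grows while new pages link out faster than they are consumed, peaks, and then drains.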
Using the crawler output, I created some visualizations on https://rawgraphs.io/
All products grouped by category (and subcategory):
All products grouped by category, sized by price:
Guess which category is the red one, where half the products are cheap while the other half are expensive.
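For anyone who wants to poke at the numbers behind these charts, here is a quick sketch for summarising the products CSV before uploading it to RAWGraphs; the file path and the `category`/`price` column names are assumptions about the output format:

```python
import csv
from collections import defaultdict

prices = defaultdict(list)
with open("output/oda_products.csv", newline="") as f:  # assumed path
    for row in csv.DictReader(f):
        prices[row["category"]].append(float(row["price"]))

# Per-category product count and price range, which makes a price
# spread like the red category above easy to spot.
for category, values in sorted(prices.items()):
    print(f"{category}: n={len(values)}, min={min(values):.2f}, max={max(values):.2f}")
```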