Convention & Expos Tradefest Scraper

Events scraper app with Scrapy and Selenium.

This project lets a user scrape the "furniture" section of tradefest.io (a platform for finding convention and expo events), using Scrapy as the framework to extract, transform, and store the data.

Index:

Usage: with Docker
1. Getting and Running the docker image
2. What data we are scraping and where it is stored
Information:
Maintainer

Usage: with Docker

1. Getting and Running the docker image

The image is hosted on the Docker Hub registry and is freely available as fantaso/scrapy_tradefest.

Pull the image from Docker Hub:

docker pull fantaso/scrapy_tradefest

Or run it directly:

mkdir -p output/logs \
&& docker run --rm -v "$(pwd)"/output:/home/app/output -t fantaso/scrapy_tradefest

Here, -v "$(pwd)"/output:/home/app/output binds a volume: it maps the output/ folder on our computer to the /home/app/output folder inside the Docker container, where the scraped data is stored.
Because of permission problems when Docker binds a container folder to the local machine, we first need to create the folder with mkdir -p output/logs so the container can run without errors.

NOTE: This extra step is the price of running the container as a non-root user while binding a volume to the local machine.
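
The two shell steps above (create the folder first, then run the container with the volume bound) can also be scripted. The following is a minimal sketch in Python, not part of the project itself: the helper names `prepare_output_dir` and `build_docker_command` are hypothetical, and only the image name, the container path, and the flags are taken from the command shown above.

```python
from pathlib import Path
from typing import List

IMAGE = "fantaso/scrapy_tradefest"       # image name from this README
CONTAINER_OUTPUT = "/home/app/output"    # container path from this README


def prepare_output_dir(base: Path) -> Path:
    """Create output/logs under `base` before the container starts.

    Equivalent to `mkdir -p output/logs`: creating the folder ourselves
    means it belongs to the local user before Docker binds it.
    """
    output_dir = base / "output"
    (output_dir / "logs").mkdir(parents=True, exist_ok=True)
    return output_dir


def build_docker_command(output_dir: Path) -> List[str]:
    """Build the argument list for the `docker run` command shown above."""
    return [
        "docker", "run", "--rm",
        "-v", f"{output_dir.resolve()}:{CONTAINER_OUTPUT}",
        "-t", IMAGE,
    ]
```

To actually start the container, pass the result to `subprocess.run(build_docker_command(prepare_output_dir(Path.cwd())), check=True)`; this assumes Docker is installed and on the PATH.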

2. What data we are scraping and where it is stored

We want to store the scraped data on our own machine, so we bind the output folder inside the Docker container (/home/app/output) to the output/ folder on the local machine (the PC where the Docker container runs).

Output folder contains:

feeds contains all the scraped data in different formats (csv, xml, json)
logs contains the scraping log files
media contains the scraped images, as well as thumbnails generated automatically from those images in different sizes (small, medium).

NOTE: all files generated for logs and feeds are named after the time at which the Docker image is run, e.g. "2020-06-28 23:47:18.csv".
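
As an illustration of this naming scheme, here is a minimal Python sketch that produces file names in the same format as the example above. The function name `timestamped_names` and the exact format string are assumptions inferred from that one example; the project's own naming code is not shown in this README.

```python
from datetime import datetime

# Feed formats listed in this README.
FEED_FORMATS = ("csv", "xml", "json")


def timestamped_names(started_at, extensions=FEED_FORMATS):
    """Return one file name per extension, stamped with the start time."""
    # Assumed format, chosen to match "2020-06-28 23:47:18.csv".
    stamp = started_at.strftime("%Y-%m-%d %H:%M:%S")
    return [f"{stamp}.{ext}" for ext in extensions]


names = timestamped_names(datetime(2020, 6, 28, 23, 47, 18))
print(names[0])  # 2020-06-28 23:47:18.csv
```

Note that names like these contain a space and colons, so they need quoting in a shell, and colons are not valid in file names on Windows.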

Fields to be scraped from each event or expo:

url: url of the detailed event
listed_name: name of the event in the paginated list
detailed_name: name of the event in the detailed event page
date: date of the event
city: city where the event takes place
country: country where the event takes place
venue: location or place where the event takes place
duration: time duration of the event
final_grade: rating of the event (client naming requirement!)
total_reviews: number of reviewers (related to the "final_grade")
attendees: number of people who attended the event
exhibitors: number of exhibitors that took part in the event
hashtags: tags of the event
website: official website of the event
description: description of the event
image_urls: URL of the event's logo (image)
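
The field list above could be modelled as a single record type. The sketch below uses a plain Python dataclass; the class name `EventItem` and all field types are assumptions (the README only gives names and descriptions), and the project's actual item definition may look different.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class EventItem:
    """One scraped event; field names follow the list in this README."""

    url: str
    listed_name: Optional[str] = None
    detailed_name: Optional[str] = None
    date: Optional[str] = None
    city: Optional[str] = None
    country: Optional[str] = None
    venue: Optional[str] = None
    duration: Optional[str] = None
    final_grade: Optional[str] = None
    total_reviews: Optional[str] = None
    attendees: Optional[str] = None
    exhibitors: Optional[str] = None
    # Assumed to be lists because the field names are plural.
    hashtags: List[str] = field(default_factory=list)
    website: Optional[str] = None
    description: Optional[str] = None
    image_urls: List[str] = field(default_factory=list)


# Placeholder values for illustration only, not real scraped data.
item = EventItem(url="https://example.com/event", city="Berlin")
record = asdict(item)  # plain dict, ready to be written to a feed
```

Values are kept as optional strings here because a scraped page may omit any of them; converting counts such as attendees to integers would be a separate cleaning step.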

Information:

Technology Stack
Python		Back-End
Scrapy		Scraper Framework
Selenium		Browser Automation
Docker		Container

Maintainer

Get in touch --> fantaso
