Events scraper app with Scrapy and Selenium.
This project lets a user scrape the "furniture" section of tradefest.io (a platform for finding convention and expo events), using Scrapy as the framework to extract, transform and store the data.
- Getting and running the Docker image
- What data we are scraping and where it is stored
The image is hosted on the Docker Hub registry and is freely available as fantaso/scrapy_tradefest.
Pull the image from Docker Hub:
docker pull fantaso/scrapy_tradefest
Or run it directly:
mkdir -p output/logs \
&& docker run --rm -v "$(pwd)"/output:/home/app/output -t fantaso/scrapy_tradefest
- Here, `-v "$(pwd)"/output:/home/app/output` binds a volume: it maps the `output/` folder on our computer to the `/home/app/output` folder inside the Docker container, where the scraped data is stored, and keeps the two in sync.
- Because of permission problems when binding a folder between the Docker container and our local machine, we first need to create the folder with `mkdir -p output/logs` before running the container.
NOTE: This step is needed because the container runs as a non-root user. If the host folder does not exist when the volume is bound, Docker creates it owned by root, and the container's user may then be unable to write to it.
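The two steps above (create the folder first, then bind it into the container) can be sketched in Python. This is only an illustration of the host-to-container mapping: the helper names `prepare_output` and `docker_run_args` are hypothetical and not part of the project, while the image name and the container path are the ones given in this README.

```python
from pathlib import Path
from typing import List

IMAGE = "fantaso/scrapy_tradefest"     # image name from this README
CONTAINER_OUTPUT = "/home/app/output"  # container folder from this README


def prepare_output(base: Path) -> Path:
    """Create output/logs up front, like `mkdir -p output/logs`."""
    output = base / "output"
    (output / "logs").mkdir(parents=True, exist_ok=True)
    return output.resolve()


def docker_run_args(host_output: Path) -> List[str]:
    """Build the `docker run` arguments with a host:container bind mount."""
    return [
        "docker", "run", "--rm",
        "-v", f"{host_output}:{CONTAINER_OUTPUT}",
        "-t", IMAGE,
    ]
```

Passing the result of `docker_run_args(prepare_output(Path.cwd()))` to `subprocess.run` would be equivalent to the shell command shown earlier, assuming Docker is installed and on the `PATH`.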
We want to store the scraped data on our machine, so we bind the `output/` folder inside the Docker container to a folder on our local machine (the PC where the Docker container runs).
The output folder contains:
- feeds: all the scraped data in different formats (csv, xml, json)
- logs: the scraping log files
- media: the images we wanted to scrape, as well as thumbnails automatically generated from those images in different sizes (small, medium)
NOTE: all log and feed files are named with the date and time at which the Docker image is run, e.g. `2020-06-28 23:47:18.csv`.
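The naming rule in the note above can be sketched as follows. The format string is inferred from the single example file name, and the assumption that feed files sit directly inside `output/feeds` is not confirmed by this README; `feed_name` and `latest_feed` are hypothetical helpers, not functions from the project.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional

# Assumed from the example "2020-06-28 23:47:18.csv"
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def feed_name(started_at: datetime, extension: str) -> str:
    """Build a file name such as '2020-06-28 23:47:18.csv'."""
    return f"{started_at.strftime(TIMESTAMP_FORMAT)}.{extension}"


def latest_feed(feeds_dir: Path, extension: str = "csv") -> Optional[Path]:
    """Return the most recent feed file, or None if there is none.

    Names in this timestamp format sort chronologically as plain strings,
    so the last name in sorted order is the newest run.
    """
    candidates = sorted(feeds_dir.glob(f"*.{extension}"), key=lambda p: p.name)
    return candidates[-1] if candidates else None
```

Because the timestamp starts with the year and is zero-padded, no date parsing is needed to find the newest run.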
Fields to be scraped from each event or expo:
- url: URL of the detailed event page
- listed_name: name of the event in the paginated list
- detailed_name: name of the event on the detailed event page
- date: date of the event
- city: city where the event takes place
- country: country where the event takes place
- venue: location or place where the event takes place
- duration: time duration of the event
- final_grade: rating of the event (client naming requirement!)
- total_reviews: number of reviewers (related to the "final_grade")
- attendees: number of people who attended the event
- exhibitors: number of exhibitors that were part of the event
- hashtags: tags of the event
- website: official website of the event
- description: description of the event
- image_urls: URL of the logo (image) of the event
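To make the shape of one scraped record concrete, the fields listed above can be written as a plain Python dataclass. This is a minimal sketch and not the project's own item definition (which may well be a Scrapy item): the class name `Event` and every type annotation are assumptions, since this README only gives the field names and their meaning. `hashtags` and `image_urls` are modelled as lists because their names are plural.

```python
from dataclasses import asdict, dataclass, field, fields
from typing import List, Optional


@dataclass
class Event:
    """One scraped event; all types here are assumptions."""
    url: str
    listed_name: str
    detailed_name: Optional[str] = None
    date: Optional[str] = None
    city: Optional[str] = None
    country: Optional[str] = None
    venue: Optional[str] = None
    duration: Optional[str] = None
    final_grade: Optional[float] = None   # rating of the event
    total_reviews: Optional[int] = None   # reviewers behind final_grade
    attendees: Optional[int] = None
    exhibitors: Optional[int] = None
    hashtags: List[str] = field(default_factory=list)
    website: Optional[str] = None
    description: Optional[str] = None
    image_urls: List[str] = field(default_factory=list)


# Column names in the same order as the field list
FIELD_NAMES = [f.name for f in fields(Event)]
```

`asdict(event)` turns a record into a dictionary, which is a convenient form for writing csv or json rows.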
Technology Stack | Role |
---|---|
Python | Back-End |
Scrapy | Scraper Framework |
Selenium | Browser Automation |
Docker | Container |
Get in touch --> fantaso