This project builds a data analysis pipeline for data science job postings published on Indeed. The pipeline is not limited to data science, though: it can classify job postings from any sector.
This pipeline generates a .html file with:
- Clusters 2D Graph
- Clusters Keywords Ranking
- TF-IDF Ranking
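
To give an idea of what a TF-IDF ranking means, here is a minimal sketch in plain Python. It is not the code used by `app.py`: the function name `tfidf_ranking`, its parameters, the whitespace tokenization and the exact weighting (term frequency times `log(N / document frequency)`, summed over all documents) are assumptions made for illustration only. The `excluded_words` parameter plays the role that a toxic-words list might play.

```python
import math
from collections import Counter


def tfidf_ranking(documents, excluded_words=frozenset(), top_n=5):
    """Rank terms by their summed TF-IDF score (illustrative sketch only)."""
    # Naive tokenization: lowercase, split on whitespace, drop excluded words.
    tokenized = [
        [word for word in doc.lower().split() if word not in excluded_words]
        for doc in documents
    ]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    # Sum tf * idf over all documents for each term.
    scores = Counter()
    for tokens in tokenized:
        if not tokens:
            continue
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[term])
            scores[term] += tf * idf

    return scores.most_common(top_n)


if __name__ == "__main__":
    # Hypothetical job descriptions, not real scraped data.
    jobs = ["python sql python", "sql excel", "sql tableau"]
    for term, score in tfidf_ranking(jobs, excluded_words={"excel"}):
        print(f"{term}: {score:.3f}")
```

A term that appears in every document (here `sql`) gets a score of zero, which is why TF-IDF tends to surface the words that distinguish one job posting from another.
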
Check the "Brazillian Data Science Jobs Market: A Deep Analysis" on the web!
Folder | Description |
---|---|
db/ | Folder where your Scrapy database will be saved |
output/ | Folder where your graphs and results will be saved |
ARGS | USAGE |
---|---|
[db-title] | Title of your Scrapy database (e.g., datascience_db) |
[urls-file] | Name of the file with your Indeed URLs (take a look at sample.urls) |
[toxicwords-file] | Name of the file listing the words to exclude from the analysis (take a look at sample.toxicwords) |
[num-clusters] | Number of clusters to identify, either a range (e.g., 2-8) or a single value (e.g., 8) |
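
As an illustration of the two accepted `[num-clusters]` forms, the sketch below turns the argument into a list of candidate cluster counts. This is not taken from `app.py`: the helper name `parse_num_clusters` is hypothetical, and treating the range as inclusive on both ends is an assumption.

```python
def parse_num_clusters(arg: str) -> list[int]:
    """Parse '8' into [8] and '2-8' into [2, 3, ..., 8] (hypothetical helper)."""
    text = arg.strip()
    if "-" in text:
        # Range form, assumed inclusive on both ends.
        low_text, high_text = text.split("-", 1)
        low, high = int(low_text), int(high_text)
    else:
        # Single-value form.
        low = high = int(text)
    if low < 1 or high < low:
        raise ValueError(f"invalid cluster specification: {arg!r}")
    return list(range(low, high + 1))


if __name__ == "__main__":
    print(parse_num_clusters("8"))
    print(parse_num_clusters("2-8"))
```
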
Paraphrasing The Beatles: " All you need is docker 🐳 "
git clone https://github.com/HelioNeves/mut.git
cd mut
docker build . -t mut
docker run -ti --name MUT-env mut /bin/bash
python3 scraper.py [db-title] [urls-file]
python3 app.py [db-title] [toxicwords-file] [num-clusters]