A prototype for scraping Glassdoor ratings for a given portfolio holdings file (powered by Google's Custom Search API and Selenium).
- Take a list of names from a holdings file
- Determine the Glassdoor homepage for each company via Google's search API, using the "cleaned up" company name as the query
- Scrape Glassdoor's main page for company information and top-level ratings, plus an additional modal/pop-up for granular ratings (mimicking user clicks via Selenium)
- Organize and merge the output back as columns into the original holdings file.
This is a prototype for further development. While you might find the snippets contained here helpful, getting from one step to the next still takes some hand-holding, in addition to setting up a Google Custom Search Engine API account, a VPN, etc. Feel free to message me if you need any guidance.
Removes common junk, such as share-class designations, from each company name.
Input: all_2020_12_18.csv. This is just a list of the names we want to collect information on.
Output: company_queries_2020_12_18.csv
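The cleaning rules themselves are not listed here, so the following is only a minimal sketch of what such a step might look like. The token list in `JUNK_PATTERNS` and the function name `clean_company_name` are hypothetical and would need tuning against the real holdings file.

```python
import re

# Hypothetical junk tokens (share classes, legal forms, listing tags).
# These are illustrative assumptions, not the rules used by the prototype.
JUNK_PATTERNS = [
    r"\bclass [a-z]\b",
    r"\bcl [a-z]\b",
    r"\b(inc|corp|corporation|co|ltd|plc|llc|sa|nv|ag)\b\.?",
    r"\b(adr|ads|ord|reg)\b",
]


def clean_company_name(raw_name: str) -> str:
    """Return a lower-cased search query with junk tokens removed."""
    name = raw_name.lower()
    for pattern in JUNK_PATTERNS:
        name = re.sub(pattern, " ", name)
    # Drop leftover punctuation but keep "&" (e.g. "at&t").
    name = re.sub(r"[^a-z0-9& ]+", " ", name)
    # Collapse runs of whitespace.
    return " ".join(name.split())
```

For instance, a name like "Berkshire Hathaway Inc. Cl B" would be reduced to a query containing only the company's words, which tends to give cleaner search results than the raw holdings name.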
Input: company_queries_2020_12_18.csv
Output: ./google_results/json/<company_id>.json
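As a rough illustration of this step, the sketch below builds a request against Google's Custom Search JSON API and saves the raw response under the company's id. The helper names, the `num` value, and the way the API key and engine id are passed in are assumptions; only the endpoint and the `key`, `cx`, `q`, and `num` parameters come from the public API.

```python
import json
import urllib.parse
import urllib.request
from pathlib import Path

SEARCH_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(query, api_key, engine_id, num=5):
    """Build the request URL for one cleaned-up company name."""
    params = {"key": api_key, "cx": engine_id, "q": query, "num": num}
    return SEARCH_ENDPOINT + "?" + urllib.parse.urlencode(params)


def save_search_result(company_id, query, api_key, engine_id,
                       out_dir="./google_results/json"):
    """Fetch the search response and write it to <out_dir>/<company_id>.json."""
    url = build_search_url(query, api_key, engine_id)
    with urllib.request.urlopen(url, timeout=30) as response:
        payload = json.load(response)
    out_path = Path(out_dir) / f"{company_id}.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return out_path
```

Saving the raw JSON per company, rather than only the top link, makes it possible to re-run the next step without spending additional API quota.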
Input: ./google_results/json/<company_id>.json
Output: ./google_results/top_google_results_2020_12_18.csv
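A possible shape for this step is sketched below: it reads each saved response, takes the first entry of the `items` list (the field the Custom Search JSON API uses for results), and writes one row per company. The column names and function names are assumptions made for illustration.

```python
import csv
import json
from pathlib import Path


def top_result(response):
    """Return title and link of the first search hit, or None if there is none."""
    items = response.get("items") or []
    if not items:
        return None
    first = items[0]
    return {"title": first.get("title", ""), "link": first.get("link", "")}


def collect_top_results(json_dir, out_csv):
    """Write one CSV row per <company_id>.json file found in json_dir."""
    rows = []
    for path in sorted(Path(json_dir).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            hit = top_result(json.load(fh)) or {"title": "", "link": ""}
        rows.append({"company_id": path.stem, **hit})
    with open(out_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["company_id", "title", "link"])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Companies with no search hits are kept as rows with empty fields so they remain visible when the results are merged back later.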
Input: ./google_results/top_google_results_2020_12_18.csv
Outputs:
- ./extracts/overview/<glassdor_link.html> (main page)
- ./extracts/overview_extra/<glassdor_link.html> (additional info)
- ./extracts/errors/<glassdor_link.html> (pages that encountered errors)

Note: Sleeps randomly between requests (min 10 seconds, max 30 seconds)
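The random pause between page loads could be sketched as follows. The `fetch_page` callable stands in for the Selenium logic, which is not reproduced here, and all names in the block are hypothetical; only the 10 to 30 second range comes from the note above.

```python
import random
import time

MIN_SLEEP_SECONDS = 10
MAX_SLEEP_SECONDS = 30


def polite_pause(sleep=time.sleep, rng=random):
    """Sleep for a random duration in [10, 30] seconds and return it."""
    delay = rng.uniform(MIN_SLEEP_SECONDS, MAX_SLEEP_SECONDS)
    sleep(delay)
    return delay


def scrape_all(links, fetch_page, sleep=time.sleep):
    """Fetch every link, recording failures instead of aborting the run."""
    results = {}
    errors = {}
    for link in links:
        try:
            # fetch_page is a placeholder for the Selenium-driven download.
            results[link] = fetch_page(link)
        except Exception as exc:
            errors[link] = repr(exc)
        # Pause after every attempt, successful or not.
        polite_pause(sleep=sleep)
    return results, errors
```

Passing `sleep` in as a parameter keeps the loop testable without waiting: a test can supply a no-op function while production code uses the default `time.sleep`.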
Input:
- ./extracts/overview/<glassdor_link.html> (main page)
- ./extracts/overview_extra/<glassdor_link.html> (additional info)

Outputs: ./extracted_glassdoor.csv
Note: Uses multiprocessing to loop through all the raw html files
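A minimal sketch of fanning the parsing out over worker processes is shown below. Extracting the page `<title>` is only a stand-in for the real field extraction, which depends on Glassdoor's markup and is not shown here; `parse_file` and `parse_all` are hypothetical names.

```python
import re
from multiprocessing import Pool
from pathlib import Path

# Stand-in extraction: the real script pulls ratings and company details.
TITLE_PATTERN = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def parse_file(path):
    """Parse one saved HTML file into a row (a dict)."""
    html = Path(path).read_text(encoding="utf-8", errors="replace")
    match = TITLE_PATTERN.search(html)
    title = match.group(1).strip() if match else ""
    return {"file": Path(path).name, "title": title}


def parse_all(paths, processes=None):
    """Parse all files, in parallel unless processes == 1."""
    paths = list(paths)
    if processes == 1:
        # Serial path, convenient for debugging a single failing file.
        return [parse_file(p) for p in paths]
    with Pool(processes=processes) as pool:
        return pool.map(parse_file, paths)
```

When running this as a script, the call to `parse_all` should sit under an `if __name__ == "__main__":` guard, since `multiprocessing` may re-import the module in each worker on some platforms.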
Formats the output to specification. Uses the company websites from the original data to verify the mapping against the company homepage field from Glassdoor.
Input: ./extracted_glassdoor.csv
Outputs: ./glassdoor_ratings.csv (main page)
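One way the website check could work is to compare normalized hostnames, as in the sketch below. This is an assumption about the matching rule, not the prototype's actual logic, and both function names are hypothetical.

```python
from urllib.parse import urlparse


def normalize_domain(url):
    """Reduce a URL or bare domain to a lower-case host without 'www.'."""
    url = (url or "").strip().lower()
    if not url:
        return ""
    if "://" not in url:
        # urlparse only fills in the hostname when a scheme is present.
        url = "http://" + url
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def websites_match(holdings_url, glassdoor_url):
    """True when both sides resolve to the same non-empty host."""
    a = normalize_domain(holdings_url)
    b = normalize_domain(glassdoor_url)
    return bool(a) and a == b
```

A strict host comparison like this will flag legitimate variations (regional domains, redirects to a parent company) as mismatches, so mismatched rows are probably best reviewed by hand rather than dropped automatically.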