This project focuses on analyzing job salary data from Indeed using Python, Selenium, and SQL, and visualizing the results in Power BI. The data is scraped from Indeed job postings, pre-processed using Pandas in Python, and then loaded into a SQL database. SQL queries are used to analyze the data and generate insights, and the results are visualized in Power BI to create an interactive dashboard.
The project is structured as follows:
- data_collection.ipynb: Jupyter notebook for web scraping Indeed job postings using Selenium and storing the data in CSV format.
- data_preprocessing.py: Python script for data pre-processing using Pandas, cleaning and transforming the scraped data for analysis.
- database.sql: SQL script for creating the database and tables to store the pre-processed job salary data.
- sql_queries.sql: SQL queries for analyzing the job salary data, calculating averages, aggregating data, and extracting insights.
- job_salary_analysis.pbix: Power BI project file containing the dashboard visualizations and data connections.
To collect job salary data, we use Python and Selenium to scrape Indeed job postings. The data_collection.ipynb notebook navigates to the desired job search page, extracts relevant information such as job title, salary, location, employer, and employer rating, and stores the data in a CSV file.
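The scraping step might look roughly like the sketch below. It is an illustration only, not the contents of data_collection.ipynb: the CSS selectors, the column names, and the function names `scrape_jobs` and `save_rows` are hypothetical. Indeed's real markup is not described here and changes over time, so inspect the page before relying on any selector. The sketch also assumes Chrome is installed locally.

```python
import csv

# Column names are an assumption based on the fields listed above.
FIELDS = ["job_title", "salary", "location", "employer", "employer_rating"]

# Hypothetical selectors: replace them after inspecting the live page.
CARD_SELECTOR = "div.job-card"
FIELD_SELECTORS = {
    "job_title": "h2.title",
    "salary": ".salary",
    "location": ".location",
    "employer": ".company",
    "employer_rating": ".rating",
}


def scrape_jobs(url):
    """Open one search results page and return a list of row dicts."""
    # Imported here so save_rows() stays usable without Selenium or a browser.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        rows = []
        for card in driver.find_elements(By.CSS_SELECTOR, CARD_SELECTOR):
            row = {}
            for field, selector in FIELD_SELECTORS.items():
                # find_elements returns an empty list when nothing matches,
                # so a posting without a salary or rating yields None.
                found = card.find_elements(By.CSS_SELECTOR, selector)
                row[field] = found[0].text.strip() if found else None
            rows.append(row)
        return rows
    finally:
        driver.quit()  # always release the browser


def save_rows(rows, path):
    """Write the scraped rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)  # None values are written as empty cells
```

A run would then be `save_rows(scrape_jobs(search_url), "jobs.csv")`, where `search_url` is the Indeed search page you want to collect and `jobs.csv` is a placeholder file name.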
The scraped CSV data is pre-processed by the data_preprocessing.py script. This script uses Pandas to remove duplicates, handle missing values, convert data types, and transform the data into a format suitable for analysis.
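A minimal sketch of this kind of cleaning is shown below. It is not the actual data_preprocessing.py. It assumes the column names `job_title`, `salary`, `location`, `employer`, and `employer_rating`, and it assumes salary text shaped like "$50,000 - $70,000 a year"; both are guesses made for illustration. The pay period (hourly versus yearly) is deliberately ignored.

```python
import pandas as pd


def preprocess(df):
    """Clean scraped postings; `preprocess` is a hypothetical helper name."""
    df = df.copy()

    # Trim stray whitespace so identical postings compare equal.
    for col in ["job_title", "location", "employer"]:
        df[col] = df[col].str.strip()

    # Drop rows without a title, then exact duplicates.
    df = df.dropna(subset=["job_title"]).drop_duplicates()

    # Convert ratings to numbers; unparseable values become NaN.
    df["employer_rating"] = pd.to_numeric(df["employer_rating"], errors="coerce")

    # Pull the first one or two numbers out of the salary text.
    no_commas = df["salary"].str.replace(",", "", regex=False)
    bounds = no_commas.str.extract(r"(\d+(?:\.\d+)?)\D*(\d+(?:\.\d+)?)?")
    df["salary_min"] = pd.to_numeric(bounds[0], errors="coerce")
    # A single figure such as "$60,000 a year" is used for both bounds.
    df["salary_max"] = pd.to_numeric(bounds[1], errors="coerce").fillna(df["salary_min"])

    return df.reset_index(drop=True)
```

Typical use would be `preprocess(pd.read_csv("jobs.csv")).to_csv("jobs_clean.csv", index=False)`, with both file names as placeholders.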
To store and analyze the job salary data, we set up a SQL database using the database.sql script. The script creates the necessary tables and schema to store the pre-processed data. The data from the CSV file is then loaded into the database for further analysis.
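The loading step can be sketched in Python with the standard-library `sqlite3` module. SQLite is only a stand-in here, since the database engine used by database.sql is not stated. The table name `job_salaries`, its columns, and the helper `load_csv` are likewise assumptions for illustration; the real schema is the one defined in database.sql.

```python
import csv
import sqlite3

# Hypothetical schema, used only for this example.
SCHEMA = """
CREATE TABLE IF NOT EXISTS job_salaries (
    job_title       TEXT NOT NULL,
    location        TEXT,
    employer        TEXT,
    employer_rating REAL,
    salary_min      REAL,
    salary_max      REAL
)
"""
COLUMNS = ["job_title", "location", "employer",
           "employer_rating", "salary_min", "salary_max"]


def load_csv(conn, csv_path):
    """Create the table if needed, insert every CSV row, return the row count."""
    conn.execute(SCHEMA)
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = [
            # Empty CSV cells are stored as NULL rather than empty strings.
            tuple(record[col] or None for col in COLUMNS)
            for record in csv.DictReader(f)
        ]
    placeholders = ", ".join("?" for _ in COLUMNS)
    with conn:  # commit on success, roll back if an insert fails
        conn.executemany(
            f"INSERT INTO job_salaries ({', '.join(COLUMNS)}) VALUES ({placeholders})",
            rows,
        )
    return len(rows)
```

With a server database such as MySQL or PostgreSQL the same pattern applies, but the connection call and the placeholder style depend on the driver.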
The sql_queries.sql script contains SQL queries for analyzing the job salary data. These queries calculate average salaries, aggregate data by location or employer, identify top-paying employers, and extract other relevant insights, using SQL functions, joins, and aggregations.
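To make the kinds of queries concrete, the sketch below runs two of them from Python against an in-memory SQLite database. The queries are examples written for this illustration, not copies of sql_queries.sql. The table `job_salaries`, its columns, and the four sample rows are made up, and the single-table layout means no joins are shown.

```python
import sqlite3

# Average of the salary midpoint per location, highest first.
AVG_BY_LOCATION = """
SELECT location,
       COUNT(*) AS postings,
       ROUND(AVG((salary_min + salary_max) / 2.0), 2) AS avg_salary
FROM job_salaries
WHERE salary_min IS NOT NULL
GROUP BY location
ORDER BY avg_salary DESC
"""

# Employers ranked by average salary midpoint, limited to the top N.
TOP_EMPLOYERS = """
SELECT employer,
       ROUND(AVG((salary_min + salary_max) / 2.0), 2) AS avg_salary
FROM job_salaries
WHERE salary_min IS NOT NULL
GROUP BY employer
ORDER BY avg_salary DESC
LIMIT ?
"""


def run_demo():
    """Build a tiny made-up table and return the results of both queries."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE job_salaries ("
        "job_title TEXT, location TEXT, employer TEXT, "
        "salary_min REAL, salary_max REAL)"
    )
    conn.executemany(
        "INSERT INTO job_salaries VALUES (?, ?, ?, ?, ?)",
        [
            ("HR Analyst", "Austin", "Acme", 50000, 70000),
            ("HR Manager", "Austin", "Globex", 80000, 100000),
            ("Recruiter", "Remote", "Acme", 40000, 50000),
            ("HR Intern", "Remote", "Initech", None, None),  # no salary posted
        ],
    )
    by_location = conn.execute(AVG_BY_LOCATION).fetchall()
    top_employers = conn.execute(TOP_EMPLOYERS, (2,)).fetchall()
    conn.close()
    return by_location, top_employers
```

Postings without a salary are filtered out before averaging, so they affect neither the counts nor the averages.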
The job_salary_analysis.pbix file is a Power BI project file that connects to the SQL database and imports the pre-processed job salary data. It includes multiple interactive visualizations and dashboards to present the analysis results. The visualizations can be customized to display key metrics, trends, comparisons, and any other insights derived from the SQL queries.
To run this project, follow these steps:
- Install the necessary Python libraries: Selenium and Pandas.
- Execute the data_collection.ipynb notebook to scrape job salary data from Indeed and save it as a CSV file.
- Run the data_preprocessing.py script to clean and transform the scraped data.
- Execute the database.sql script to set up the SQL database and tables.
- Load the pre-processed data into the SQL database using your preferred method (e.g., SQL client).
- Run the SQL queries from sql_queries.sql to analyze the job salary data and extract insights.
- Open the job_salary_analysis.pbix Power BI project file to explore the visualizations and dashboards.
- Feel free to customize the scripts, queries, and visualizations according to your specific requirements and data analysis goals.
The HR Job Salary Analysis project demonstrates the end-to-end process of collecting job salary data from Indeed, pre-processing it using Python, storing it in a SQL database, analyzing the data using SQL queries, and visualizing the results in Power BI. This workflow allows for a comprehensive exploration of job salary trends, comparisons, and other insights to support data-driven decision-making in HR departments.