SC1015-Data-Science-Project

About

Mini-Project for SC1015 (Introduction to Data Science and Artificial Intelligence), focusing on employee attrition. The dataset, IBM HR Analytics Employee Attrition & Performance by PAVANSUBHASH, was obtained from Kaggle.

For a detailed walkthrough, please view the source code in the following order:

DataVisualisation.ipynb
DataCleaning.ipynb
MachineLearningModel_(Non_resampled).ipynb
Resampling.ipynb
MachineLearningModel_(resampled).ipynb

Contributors

Xcoga
DivineValleys
kiannylim

Problem Definition

Which factors (personal or workplace) play a bigger role in determining attrition rates in a company?

Motivation

Attrition rate is the rate at which people leave a company over time. It is a metric that indicates how well the company retains its employees. Companies need to retain talented and capable workers in order to stay functional and competitive with other businesses in the industry. Hence, we need to know which factors to improve in order to retain desirable employees.

Models Used

Decision Tree
Random Forest Classifier
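
As a rough illustration of how these two models can be fitted and compared, here is a minimal sketch using scikit-learn. It does not reproduce the notebooks: the data is synthetic (generated with `make_classification`), and the class ratio, hyperparameters and the `scores` dictionary are assumptions chosen only for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the HR data (assumption): an imbalanced binary
# target where class 1 plays the role of "Yes" attrition.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.84, 0.16], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hyperparameters here are illustrative, not the ones used in the project.
models = {
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    scores[name] = {
        "accuracy": model.score(X_test, y_test),
        # Recall on class 1 shows how many "Yes" cases the model catches.
        "recall_yes": recall_score(y_test, y_pred),
    }

for name, result in scores.items():
    print(name, result)
```

On an imbalanced target, accuracy alone can look good while recall on the minority class stays low, which is why both numbers are reported in the sketch.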

Conclusion

Resampling the dataset with SMOTE() significantly improved the performance of the decision tree and random forest in predicting "Yes" attrition values.
'StandardHours', 'EmployeeCount' and 'Over18' are constant variables (zero variance), so they have no relationship with Attrition.
'EmployeeNumber' is an index with no relation to Attrition.
Through the Chi-Square test, 'MonthlyIncome', 'DistanceFromHome', 'PercentSalaryHike', 'PerformanceRating' and 'YearsSinceLastPromotion' were found to be independent of Attrition and were dropped.
Random Forest performed consistently well on the remaining variables after resampling.
Feature importances showed that JobSatisfaction, Age, JobInvolvement, TotalWorkingYears and EnvironmentSatisfaction (in decreasing order) are the top 5 most important features in predicting attrition.
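
The Chi-Square screening mentioned above can be sketched roughly as follows, using a contingency table from `pandas.crosstab` and `scipy.stats.chi2_contingency`. This is an illustration under assumptions, not the project's actual code: the data is randomly generated, the column names `FeatureDependent` and `FeatureNoise`, the helper `chi_square_p_value` and the 0.05 significance level are all hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
n = 1000

# Hypothetical data: Attrition depends on FeatureDependent but is generated
# independently of FeatureNoise.
dependent = rng.choice(["High", "Low"], size=n, p=[0.3, 0.7])
p_leave = np.where(dependent == "High", 0.35, 0.10)
attrition = np.where(rng.random(n) < p_leave, "Yes", "No")
noise = rng.choice(["A", "B", "C"], size=n)

df = pd.DataFrame(
    {"FeatureDependent": dependent, "FeatureNoise": noise, "Attrition": attrition}
)


def chi_square_p_value(frame, feature, target="Attrition"):
    """Return the p-value of a chi-square independence test of feature vs target."""
    table = pd.crosstab(frame[feature], frame[target])  # contingency table
    chi2, p_value, dof, expected = chi2_contingency(table)
    return p_value


alpha = 0.05  # assumed significance level
p_values = {
    col: chi_square_p_value(df, col) for col in ["FeatureDependent", "FeatureNoise"]
}
# Columns where independence cannot be rejected are candidates for dropping.
to_drop = [col for col, p in p_values.items() if p >= alpha]

print(p_values)
print("candidates to drop:", to_drop)
```

A large p-value only means the test found no evidence of dependence at the chosen level; whether to drop the column remains a judgment call.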

What did we learn from this project?

Handling imbalanced datasets by resampling with the imblearn package
Building a Random Forest Classifier with the sklearn.ensemble package
Using Label Encoding to convert categorical data into integers with the sklearn.preprocessing package
Using GitHub
Building a contingency table for the Chi-Square test
Using the Chi-Square test from the scipy package to check independence of variables when cleaning data
Converting categorical/numerical columns to a list
Using cross-validation to tune hyperparameters for the Random Forest model
Using feature importance analysis to find which variables are the most important in predicting employee attrition
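
Several of the points above (listing categorical columns, label encoding, cross-validated tuning and feature importances) can be combined in one short sketch. It is only an illustration under assumptions: the toy DataFrame, its column names and values, the parameter grid and the `f1` scoring choice are all hypothetical and not taken from the notebooks.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder

rng = np.random.default_rng(0)
n = 400

# Hypothetical toy data; names and values are placeholders.
df = pd.DataFrame({
    "Department": rng.choice(["Sales", "R&D", "HR"], size=n),
    "Satisfaction": rng.integers(1, 5, size=n),  # values 1..4
    "Age": rng.integers(18, 60, size=n),
})
# Toy target: low satisfaction makes "Yes" more likely.
p_yes = np.where(df["Satisfaction"] <= 2, 0.6, 0.1)
df["Attrition"] = np.where(rng.random(n) < p_yes, "Yes", "No")

# Collect the non-numeric columns into a list, then label-encode each one.
categorical_cols = [
    col for col in df.columns if not pd.api.types.is_numeric_dtype(df[col])
]
encoded = df.copy()
encoders = {}
for col in categorical_cols:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(encoded[col])

X = encoded.drop(columns="Attrition")
y = encoded["Attrition"]  # "No" -> 0, "Yes" -> 1 (classes sorted alphabetically)

# Small illustrative grid, tuned with 5-fold cross-validation.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=5, scoring="f1"
)
search.fit(X, y)

# Impurity-based importances of the best model, highest first.
importances = pd.Series(
    search.best_estimator_.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(search.best_params_)
print(importances)
```

Keeping each fitted `LabelEncoder` in the `encoders` dictionary makes it possible to map the integers back to the original labels later with `inverse_transform`.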
