- In this notebook, we analyse and predict attrition on the IBM employee dataset.
- The dataset has 1470 rows and 35 columns.
- We split the columns into numerical and categorical types and analyse each group separately.
- For the categorical columns, we plot the count of each category.
- For the numerical columns, we check for outliers.
- We also check for columns with high multicollinearity.
- Here we remove:
  - Redundant values.
  - Columns with high collinearity.
  - Columns in which more than 96% of the values are the same.
  - Columns that have the least correlation with the target column.
- After this step, the data has 1340 rows and 35 columns.
- RobustScaler is a scaling technique that subtracts the median and scales the data according to a quantile range (by default the IQR, the interquartile range). The IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile). Because the median and the IQR are barely affected by extreme values, the scaler is robust to outliers, which makes it a good fit for data with so many outliers that removing them would drastically reduce the amount of training data.
- We model the data with XGBoost's XGBClassifier, using hyper-parameter tuning.
- The model achieves an accuracy score of 87.30% and a log_loss of 4.38.
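
The robust scaling described above can be sketched in a few lines of plain Python. This is only an illustration and not the notebook's actual code: the helper `robust_scale` and the sample values are hypothetical. It is meant to mirror what RobustScaler does with its default settings (centre on the median, divide by the 25th to 75th percentile range).

```python
from statistics import median, quantiles


def robust_scale(values):
    """Centre values on their median and divide by the IQR (Q3 - Q1)."""
    # method="inclusive" interpolates linearly between the data points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        # Assumption: for a (near-)constant column we only centre the data
        iqr = 1.0
    med = median(values)
    return [(v - med) / iqr for v in values]


# Hypothetical column with one extreme outlier
sample = [1, 2, 3, 4, 100]
scaled = robust_scale(sample)
print(scaled)  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

The outlier (100) still maps to a large value, but it does not distort the scale of the remaining points, because the median (3) and the IQR (2) are computed from the bulk of the data.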