This is my first machine learning project, implemented to test my knowledge of the Random Forest Classifier as introduced in fast.ai's Introduction to Machine Learning course.
There are two notebooks in this project.
- actual.ipynb - The full steps I've taken (including debugging and reasoning)
- submission.ipynb - All exploratory steps removed; a straightforward notebook that produces the submission file (CSV).
- FastAI v0.7x
- Functions to calculate Quadratic Weighted Kappa
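The project's own kappa helpers are not reproduced here, but the metric itself can be sketched with scikit-learn's `cohen_kappa_score` (the wrapper name below is illustrative):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Quadratic Weighted Kappa: agreement between two sets of ordinal
# ratings, penalising disagreements by the square of their distance.
def quadratic_weighted_kappa(y_true, y_pred):
    return cohen_kappa_score(y_true, y_pred, weights="quadratic")

y_true = np.array([0, 1, 2, 3, 3, 2])
y_pred = np.array([0, 1, 1, 3, 2, 2])
print(quadratic_weighted_kappa(y_true, y_pred))
```

Because the weights grow quadratically with distance, predicting 1 for a true 3 is penalised far more than predicting 2.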
- train.csv - contains the main features
- train_labels.csv - the actual labels for the training set
- test.csv - features of a test set
The goal of this project is to classify the `accuracy_group` of a participant based on the results of the games they have played.
More information can be found on the Kaggle competition page.
- Inner merge the training features with the training labels (forming 865k observations across 16 columns).
- Convert data type of categorical columns.
- Numericalise the categorical values, impute missing values with mean or mode, and one-hot encode categorical variables. Finally, split into features and labels.
Using a very basic Random Forest Classifier, a model is fitted and evaluated on the training and validation sets.
As observed, the model severely overfits to the training set.
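The overfitting pattern is easy to reproduce; a minimal sketch, using random synthetic data in place of the real frame:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; labels are random, so nothing generalises.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 16))
y = rng.integers(0, 4, size=1000)  # accuracy_group in {0, 1, 2, 3}

X_trn, X_val, y_trn, y_val = train_test_split(X, y, random_state=0)

m = RandomForestClassifier(n_estimators=40, n_jobs=-1, random_state=0)
m.fit(X_trn, y_trn)

# Fully grown trees memorise the training set, so the training kappa
# is near 1 while the validation kappa stays near 0.
trn_kappa = cohen_kappa_score(y_trn, m.predict(X_trn), weights="quadratic")
val_kappa = cohen_kappa_score(y_val, m.predict(X_val), weights="quadratic")
print(trn_kappa, val_kappa)
```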
At each step, I create backups of, and shortcuts to, the objects in the Python workspace for convenience.
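For instance, a backup step of the kind meant here might look like this (names illustrative):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Keep a cheap restore point before a destructive transformation.
df_backup = df.copy()
df = df.drop(columns=["a"])

# If the step goes wrong, restore from the backup.
df = df_backup.copy()
print(df.shape)
```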
These steps will not be explicitly mentioned below.
- Feature Importance
I retained only variables with feature importance greater than 0.010, corresponding to 15 variables.
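The 0.010 threshold is from the notebook; the filtering itself can be sketched as follows (the data and column names here are synthetic):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
cols = [f"f{i}" for i in range(16)]
X = pd.DataFrame(rng.normal(size=(500, 16)), columns=cols)
# Make the label depend on two columns so they rank highly.
y = (X["f0"] + X["f1"] > 0).astype(int)

m = RandomForestClassifier(n_estimators=40, random_state=0).fit(X, y)

# Rank features by importance and keep those above the threshold.
fi = pd.Series(m.feature_importances_, index=cols).sort_values(ascending=False)
to_keep = fi[fi > 0.010].index
X_keep = X[to_keep]
print(len(to_keep), "features retained")
```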
Again, a simple Random Forest Classifier model is fitted and evaluated on this new dataset.
The only improvement is that significantly less time is needed to train the model. In general, the fewer features we retain, the poorer the predictions.
- Dendrogram

To identify similar variables, a dendrogram was plotted.
Columns that are similar were taken out one at a time to identify their effects.
Two variables are identified as redundant and are removed.
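The dendrogram recipe follows the usual fast.ai-course pattern of clustering on Spearman rank correlation; a sketch on synthetic columns (the real notebook uses the retained features):

```python
import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as hc
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

# Synthetic frame with a deliberately near-duplicate column.
rng = np.random.default_rng(0)
base = rng.normal(size=500)
df = pd.DataFrame({
    "x1": base,
    "x1_dup": base + rng.normal(scale=0.01, size=500),
    "x2": rng.normal(size=500),
})

# Spearman correlation -> distance matrix -> hierarchical linkage.
corr = spearmanr(df).correlation
dist = squareform(1 - np.round(corr, 4), checks=False)
z = hc.linkage(dist, method="average")
# hc.dendrogram(z, labels=df.columns.tolist())  # plot in a notebook

# Near-duplicate columns merge first, at a very small distance.
print(z[0])
```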
A Random Forest Classifier is again fitted.
Overfitting is significantly reduced: validation kappa increased from 0.5826 to 0.6161.
Using randomised hyperparameter search, the best hyperparameters were obtained.
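The search might be set up along these lines with scikit-learn's `RandomizedSearchCV`; the parameter distributions and data below are illustrative, not the ones actually used:

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = rng.integers(0, 4, size=300)

# Score candidates by quadratic weighted kappa.
qwk = make_scorer(cohen_kappa_score, weights="quadratic")

param_dist = {
    "n_estimators": randint(20, 120),
    "max_features": ["sqrt", "log2", 0.5],
    "min_samples_leaf": randint(1, 10),
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_dist, n_iter=5, cv=3, scoring=qwk, random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```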
Using only the significant variables as found above, together with the best hyperparameters identified, the final model is fitted.
The validation kappa increased to 0.6267, which is the best performance so far.
The pre-processing steps were applied to the test set to ensure compatibility with the model obtained. Although a complete set of predictions was obtained, the notebook and submission file could not be uploaded to Kaggle for evaluation due to an unknown technical issue.
Hence, the only performance metric for this model is the validation kappa, which is higher than all the test-set kappas on the post-deadline Kaggle leaderboard.