Chapter No. | Title |
---|---|
1 | Problem Statement |
2 | Implementation |
2.1 | About Dataset |
2.2 | Exploratory Data Analysis and Data Pre-processing |
2.3 | Feature Engineering |
3 | Training Process |
3.1 | Models Used |
3.2 | Metric Used |
3.3 | Parameter Tuning |
3.4 | Best Parameters |
4 | Conclusion |
5 | References |
In a PUBG game, up to 100 players start in each match (matchId). Players can be on teams (groupId) which get ranked at the end of the game (winPlacePerc) based on how many other teams are still alive when they are eliminated. In game, players can pick up different ammunition, revive downed-but-not-out (knocked) teammates, drive vehicles, swim, run, shoot, and experience all of the consequences -- such as falling too far or running themselves over and eliminating themselves.
You are provided with a large number of anonymized PUBG game stats, formatted so that each row contains one player's post-game stats. The data comes from matches of all types: solos, duos, squads, and custom; there is no guarantee of there being 100 players per match, nor at most 4 players per group.
You must create a model which predicts players' finishing placement based on their final stats, on a scale from 1 (first place) to 0 (last place).
The PUBG dataset has up to 100 players in each match, with matches uniquely identified by their matchId. Players can form a team within a match, in which case they share the same groupId and the same final placement in that particular match.
The data therefore contains a variety of groupings based on the number of members in a team (no more than 4), and the matchType can be solo, duo, squad, or custom. Match types are further classified by perspective mode, i.e. TPP (third-person) or FPP (first-person).
There are approximately 3 million training data points and 1.3 million testing data points, with 29 features in total. They are summarised as follows:
Sr.No. | Feature | Type | Description |
---|---|---|---|
1 | Id | String | Unique Id for each Player. |
2 | matchId | String | Id to identify matches. |
3 | groupId | String | Id to identify the group. |
4 | assists | Real | Number of enemy players this player damaged that were killed by teammates. |
5 | boosts | Real | Number of boost items used. |
6 | damageDealt | Real | Total damage dealt. Note: self-inflicted damage is subtracted. |
7 | DBNOs | Real | Number of enemy players knocked. |
8 | headshotKills | Real | Number of enemy players killed with headshots. |
9 | heals | Real | Number of healing items used. |
10 | killPlace | Real | Ranking in match of number of enemy players killed. |
11 | killPoints | Real | Kills-based external ranking of player. |
12 | kills | Real | Number of enemy players killed. |
13 | killStreaks | Real | Max number of enemy players killed in a short amount of time. |
14 | longestKill | Real | Longest distance between player and player killed at time of death. This may be misleading, as downing a player and driving away may lead to a large longestKill stat. |
15 | matchDuration | Real | Duration of match in seconds. |
16 | maxPlace | Real | Worst placement we have data for in the match. |
17 | numGroups | Real | Number of groups we have data for in the match. |
18 | rankPoints | Real | Elo-like ranking of players. |
19 | revives | Real | Number of times this player revived teammates. |
20 | rideDistance | Real | Total distance travelled in vehicles measured in metres. |
21 | roadKills | Real | Number of kills while in a vehicle. |
22 | swimDistance | Real | Total distance travelled by swimming measured in metres. |
23 | teamKills | Real | Number of times this player killed a teammate. |
24 | vehicleDestroys | Real | Number of vehicles destroyed. |
25 | walkDistance | Real | Total distance travelled on foot measured in metres. |
26 | weaponsAcquired | Real | Number of weapons picked up. |
27 | winPoints | Real | Win-based external ranking of players. |
28 | matchType | Categorical | Identifies the matchType. |
29 | winPlacePerc | Real | This is a percentile winning placement, where 1 corresponds to 1st place, and 0 corresponds to last place in the match. |
The EDA was quite interesting, as the training dataset is about 3 million rows and roughly 688.7 MB in size; any computation involving it would therefore have been somewhat difficult.
Looking at the column datatypes, most were float64 and int64, so we downcast every numerical column to the smallest datatype that holds its values, reducing the training dataset to 237.5 MB.
Before | After |
---|---|
688.7 MB | 237.5 MB |
Computation on the downcast dataset is therefore considerably faster than on the original.
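A minimal sketch of this downcasting step, assuming pandas and a hypothetical train.csv file name:

```python
import numpy as np
import pandas as pd

def reduce_mem_usage(df: pd.DataFrame) -> pd.DataFrame:
    # Downcast each numeric column to the smallest dtype that holds its values.
    for col in df.select_dtypes(include=[np.number]).columns:
        if np.issubdtype(df[col].dtype, np.integer):
            df[col] = pd.to_numeric(df[col], downcast="integer")
        else:
            df[col] = pd.to_numeric(df[col], downcast="float")
    return df

train = pd.read_csv("train.csv")  # hypothetical file name
print(f"before: {train.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
train = reduce_mem_usage(train)
print(f"after:  {train.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
```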
The dataset contained only a single null value, and that row was removed. We also dropped the Id column, as it is of no use in decision making.
There are 16 match types, shown below, formed from combinations of FPP/TPP with solo, duo, squad, custom, etc. We generalise them into only solo, duo, and squad, and then apply label encoding to the matchType column.
Mapping of the label encoding: solo - 1; duo - 0; squad - 2
We will use this encoding for the rest of the project.
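A minimal sketch of this step; the rule that any mode name containing "solo" or "duo" maps accordingly, with every remaining mode treated as squad, is our assumption:

```python
def simplify_match_type(mt: str) -> str:
    # Assumed rule: mode names containing "solo"/"duo" map accordingly;
    # all remaining modes (squad variants, customs, events) become "squad".
    if "solo" in mt:
        return "solo"
    if "duo" in mt:
        return "duo"
    return "squad"

train["matchType"] = train["matchType"].apply(simplify_match_type)
# Matches the encoding above (and sklearn's LabelEncoder on these three labels):
train["matchType"] = train["matchType"].map({"duo": 0, "solo": 1, "squad": 2})
```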
- assists and kills: The number of assists a player made for the team and the number of kills the player scored. From the graphs below, the count of zeros is very high, yet both remain important features for determining the final rank.
- roadKills and teamKills: roadKills is the number of enemies killed while in a vehicle, whereas teamKills is the number of times a player killed a member of their own team. Both events are highly unlikely, as the figures below show, so these features appear to be of little use.
- headshotKills and DBNOs: headshotKills is the number of kills the player scored with headshots, and DBNOs is the number of enemies the player knocked down. Both indicate a player's skill, which makes them good metrics for predicting the player's final placement.
- boosts and heals: Boosts and heals are items that restore the player's health in the game; boosts take effect immediately, whereas heals take longer. Both can be important features for further decision making.
According to the data provided, players with the same groupId in a match form a group, and that group shares the same target placement for that match. In our view this was one of the main challenges the model faced: different feature values mapped to the same target value, confusing the learning process. To alleviate this, we decided to group the data points by groupId and matchId and aggregate their feature values, so that each group in a match is represented as one row.
Based on this idea, we represent all players on the same team as a single entity/player/team. The dataset is thereby reduced by grouping rows on groupId; each row now represents a team, or an individual in the case of solo mode.
What about aggregating the other columns? For those we used sum, mean, and max. For example:
- kills: we take the sum of the kills scored by all teammates.
- killPlace: we take the mean over all players in the team.
- rideDistance: we take the max over all players in the same team.
The idea behind which aggregation is applied to which column is as follows:
- If a feature describes teamwork, we take its sum (e.g. kills, assists).
- If it is a scaling feature, we take its mean.
- If a feature describes the quality of an individual player in a team, we take its max, so the team is affected positively by its best member.
Following table shows the columns and the corresponding aggregation function which is applied to it.
Columns | Functions | Columns | Functions |
---|---|---|---|
matchId | max | maxPlace | mean |
assists | sum | numGroups | mean |
boosts | sum | rankPoints | max |
damageDealt | sum | matchType | mean |
DBNOs | sum | revives | sum |
headshotKills | sum | rideDistance | max |
heals | sum | roadKills | sum |
killPlace | mean | swimDistance | sum |
killPoints | max | teamKills | sum |
kills | sum | vehicleDestroys | sum |
killStreaks | max | walkDistance | max |
longestKill | mean | weaponsAcquired | sum |
matchDuration | max | winPoints | max |
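A minimal sketch of the grouping step, assuming the pandas frame train from earlier; only a subset of the table above is spelled out, and the remaining columns would be added to agg_spec in the same way:

```python
agg_spec = {
    "assists": "sum", "boosts": "sum", "damageDealt": "sum", "DBNOs": "sum",
    "headshotKills": "sum", "heals": "sum", "kills": "sum", "revives": "sum",
    "killPlace": "mean", "longestKill": "mean", "numGroups": "mean",
    "killStreaks": "max", "walkDistance": "max", "rideDistance": "max",
    "matchDuration": "max", "winPoints": "max",
    "winPlacePerc": "max",  # identical within a group, so max is safe
}
reduced = train.groupby(["matchId", "groupId"], as_index=False).agg(agg_spec)
```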
We have significantly reduced the dataset's memory footprint, but is it legitimate to reduce the dataset this way? Let's look at some plots: the discrete features, when plotted, show a distribution very similar to the original one.
We also plotted the continuous features for both the original dataset and the reduced one and found them similar as well.
Since both distributions look alike, let's check how the correlation of each feature with winPlacePerc compares before and after the reduction.
As the above table shows, the feature correlations with winPlacePerc differ very little between the original dataset and the reduced one.
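A minimal sketch of how this before/after comparison can be computed, assuming the train and reduced frames from above and a recent pandas:

```python
import pandas as pd

corr = pd.concat(
    [
        train.corr(numeric_only=True)["winPlacePerc"].rename("original"),
        reduced.corr(numeric_only=True)["winPlacePerc"].rename("reduced"),
    ],
    axis=1,
)
print(corr.sort_values("original", ascending=False))
```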
Given these observations, we use the Reduced_GroupBy dataset for all further training.
1) walkDistance | boosts | kills(size of points) | winPlacePerc:
From the above graph, we observe that as boost consumption increases, a player's chance of winning the match increases; logically, a player with a high chance of winning tends to be in fights and needs boosts. walkDistance also matters: it tends to be high for the players/teams with a high chance of winning, since staying in the game requires travelling to remain inside the safe zone.
2) heals | boosts | damageDealt(size of points) | winPlacePerc:
The above graph shows that, for high winPlacePerc, alongside boosts and heals, players with high damageDealt also tend toward a high winPlacePerc.
3) boosts and heals | winPlacePerc:
From the above graph we see that boosts and heals both show a positive relation with winPlacePerc, with boosts showing the stronger one. We may be able to do something with both of these features later.
4) kills(matchType wise) | winPlacePerc:
From the above graph we can say that the chance of winning rises with the number of kills, but the effect weakens going from solo to squad: squads play more strategically, and the focus there is less on kills.
- Handling some Anomalies:
While analysing the dataset we found some irregularities in the data itself, which we now handle one by one.
- Players who made kills without travelling any distance:
The above graph shows players whose total travelled distance (walk + ride + swim) is zero, yet who have killed enemies. This looks suspicious, so we remove those rows.
- longestKill = 0 metres yet kills > 0:
Here the longest kill distance is zero metres while the kill count is non-zero, which is logically impossible, so we drop those rows too.
- teamKills and rideDistance:
In PUBG, a player can kill a teammate only with a weapon (e.g. a grenade) or by running them over with a vehicle. Yet the above graph shows players who killed a teammate while having acquired no weapon and driven no vehicle, so those rows are dropped as well.
- roadKills and rideDistance:
From the above graph, some players have roadKills, i.e. kills made from a vehicle, yet have not ridden any vehicle, so we drop those rows too.
Similarly, we observed a few more anomalies, stated below.
- Players who have not walked yet have consumed heals and boosts, which is not possible, so those rows are dropped.
- It is not possible to acquire weapons without walking any distance.
- A player cannot assist a teammate if their walkDistance is 0.
- A player cannot deal damage without having walked a single metre.
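A hedged sketch of all these filters combined into one boolean mask over the train frame; the exact combinations are our reading of the anomalies listed above:

```python
total_distance = train["walkDistance"] + train["rideDistance"] + train["swimDistance"]
anomalous = (
    ((total_distance == 0) & (train["kills"] > 0))                     # kills with zero movement
    | ((train["longestKill"] == 0) & (train["kills"] > 0))             # kills, yet 0 m longest kill
    | ((train["teamKills"] > 0) & (train["weaponsAcquired"] == 0)
       & (train["rideDistance"] == 0))                                 # team kill, no weapon/vehicle
    | ((train["roadKills"] > 0) & (train["rideDistance"] == 0))        # road kills without riding
    | ((train["walkDistance"] == 0)
       & ((train["heals"] > 0) | (train["boosts"] > 0)))               # items used without walking
    | ((train["walkDistance"] == 0) & (train["weaponsAcquired"] > 0))  # weapons without walking
    | ((train["walkDistance"] == 0) & (train["assists"] > 0))          # assists without walking
    | ((train["walkDistance"] == 0) & (train["damageDealt"] > 0))      # damage without walking
)
train = train[~anomalous]
```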
After performing this data pre-processing, we reduced the original dataset's size by a significant amount.
Summary of the dataset transitions so far:
We tried adding new features based on our knowledge of the game. The new features are as follows:
1. killsPerMeter = kills / walkDistance
2. healsPerMeter = heals / walkDistance
3. totalHeals = heals + boosts
4. totalHealsPerMeter = totalHeals / walkDistance
5. totalDistance = walkDistance + rideDistance + swimDistance
6. headshotRate = headshotKills / kills
7. assistsAndRevives = assists + revives
8. itemsAcquired = heals + boosts + weaponsAcquired
9. healsOverBoosts = heals / boosts
10. walkDistanceOverHeals = walkDistance / heals
11. walkDistanceAndHeals = walkDistance * heals
12. walkDistanceOverKills = walkDistance / kills
13. walkDistanceAndKills = walkDistance * kills
14. boostsOverTotalDistance = boosts / totalDistance
15. boostsAndTotalDistance = boosts * totalDistance
These engineered features showed high correlation with the target, indicating they will be good features for learning. A sketch of how they can be computed follows.
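A minimal sketch of computing these features with pandas; mapping x/0 and 0/0 to 0 in the ratio features is our assumption about how the divisions were guarded:

```python
import numpy as np
import pandas as pd

def safe_div(a: pd.Series, b: pd.Series) -> pd.Series:
    # Guard against zero denominators: x/0 and 0/0 both become 0.
    return (a / b).replace([np.inf, -np.inf], np.nan).fillna(0)

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    df["totalHeals"] = df["heals"] + df["boosts"]
    df["totalDistance"] = df["walkDistance"] + df["rideDistance"] + df["swimDistance"]
    df["killsPerMeter"] = safe_div(df["kills"], df["walkDistance"])
    df["healsPerMeter"] = safe_div(df["heals"], df["walkDistance"])
    df["totalHealsPerMeter"] = safe_div(df["totalHeals"], df["walkDistance"])
    df["headshotRate"] = safe_div(df["headshotKills"], df["kills"])
    df["assistsAndRevives"] = df["assists"] + df["revives"]
    df["itemsAcquired"] = df["heals"] + df["boosts"] + df["weaponsAcquired"]
    df["healsOverBoosts"] = safe_div(df["heals"], df["boosts"])
    df["walkDistanceOverHeals"] = safe_div(df["walkDistance"], df["heals"])
    df["walkDistanceAndHeals"] = df["walkDistance"] * df["heals"]
    df["walkDistanceOverKills"] = safe_div(df["walkDistance"], df["kills"])
    df["walkDistanceAndKills"] = df["walkDistance"] * df["kills"]
    df["boostsOverTotalDistance"] = safe_div(df["boosts"], df["totalDistance"])
    df["boostsAndTotalDistance"] = df["boosts"] * df["totalDistance"]
    return df

reduced = add_features(reduced)
```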
We tried training the following models on the dataset:
- Linear Regression:
As it is a simple model, it serves as the baseline against which the other models are compared. Linear regression is a statistical method for modelling the relationship between independent variables and a dependent variable; here, the dependent variable is winPlacePerc. The parameters of the unknown function mapping the independent variables to the dependent variable are estimated from the data, and once fitted, the model predicts the target for any new data it is given.
The model assumes a linear relationship of the form

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$$

This is then solved using the ordinary least squares solution, wherein the model parameters are chosen to minimise the squared differences between the predicted and actual values of the target variable:

$$\hat{\boldsymbol\beta} = \arg\min_{\boldsymbol\beta} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2$$
- Ridge Regression:
Ridge is an extension of ordinary linear regression wherein a regulariser term is added: an $\ell_2$ penalty $\alpha\lVert\boldsymbol\beta\rVert_2^2$ on the least squares objective, which penalises large weights and shrinks them toward zero (unlike the $\ell_1$ penalty of lasso, it does not make the weights sparse). Regularisation is used against overfitting, and the amount added can be tuned via $\alpha$. In Bayesian terms, ridge regression corresponds to placing a Gaussian prior on the weights. In the given dataset the chance of overfitting was very low, as the number of data points is extremely high compared to the number of features, but we still wanted to see whether the error changed after using a regulariser. The MSE values were exactly the same as for the ordinary least squares solution, indicating that a regulariser is not needed.
- Random Forest:
Random Forest is one of the main models used for predictive modelling, as it takes an ensemble approach. Being non-linear, it was a natural model to try on this dataset, and as expected the loss dropped compared to the linear models. In a Random Forest, multiple decision trees are built during training; at test time, the predictions of the individual trees are averaged to give the final predicted value. The forest is a strong learner assembled from many decision trees (weak learners), built by repeatedly sampling random subsets of the whole dataset with replacement.
This is called bagging. It reduces the variance of the final model, in turn leading to a more consistent estimator.
- LightGBM
Light Gradient Boosting Machine (LightGBM) is a gradient boosting method that uses tree-based learners. Gradient boosting builds a strong learner by adding weak learners one at a time using gradient-based updates. LightGBM's specialty is that it grows trees leaf-wise, unlike most other approaches, which grow them level-wise: at each step it splits the leaf with the maximum delta loss, so the deeper the tree grows, the more complex the model becomes.
LightGBM is extremely popular for large datasets because it trains fast and requires comparatively little memory, and it also supports GPU learning. On smaller datasets leaf-wise growth can lead to overfitting, but as the dataset used here is very large, it works best. However, with so many parameters available, hyperparameter tuning is a bit cumbersome.
- XGBoost
XGBoost is the abbreviation of eXtreme Gradient Boosting. It also uses the gradient-boosted decision tree algorithm: new models are added to the existing ones to decrease the loss, with the combined result of all models forming the final prediction, and gradient descent is used to minimise the loss as new models are added. XGBoost's execution time is very small, and it likewise supports leaf-based tree growth. It is a very popular model in Kaggle competitions due to its ability to handle large datasets.
As we had multiple models, we used the Mean Squared Error (MSE) metric to compare their performance and identify the best one.
Mean Squared Error measures the square of the difference between the actual value and the predicted value, averaged over all data points.
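In symbols, with $y_i$ the actual winPlacePerc, $\hat{y}_i$ the model's prediction, and $n$ the number of data points:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$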
- Best parameters and MSE per model:
Models | Parameters | MSE |
---|---|---|
Linear Regression | n_jobs=-1 | 0.012892 |
Ridge Regression | alpha=10, max_iter=1000, solver='svd' | 0.012892 |
Random Forest | max_depth=35, max_features=None, min_samples_split=20, n_estimators=95, n_jobs=-1, oob_score=True, warm_start=True, criterion='squared_error' | 0.005542 |
XGBoost | gamma=0.0295, n_estimators=125, max_depth=15, eta=0.113, subsample=0.8, colsample_bytree=0.8, tree_method='gpu_hist', max_leaves=1250, reg_alpha=0.0995, colsample_bylevel=0.8, num_parallel_tree=20 | 0.004973 |
LightGBM | colsample_bytree=0.8, learning_rate=0.03, max_depth=30, min_split_gain=0.00015, n_estimators=250, num_leaves=2200, reg_alpha=0.1, reg_lambda=0.001, subsample=0.8, subsample_for_bin=45000, n_jobs=-1, max_bin=700, num_iterations=5200, min_data_in_bin=12 | 0.004829 |
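A minimal sketch of fitting the best-performing LightGBM configuration through its scikit-learn API, using a subset of the parameters from the table; the train/validation split and the dropped ID columns are our additions:

```python
import lightgbm as lgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X = reduced.drop(columns=["winPlacePerc", "matchId", "groupId"])
y = reduced["winPlacePerc"]
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = lgb.LGBMRegressor(
    colsample_bytree=0.8, learning_rate=0.03, max_depth=30,
    min_split_gain=0.00015, n_estimators=250, num_leaves=2200,
    reg_alpha=0.1, reg_lambda=0.001, subsample=0.8,
    subsample_for_bin=45000, n_jobs=-1,
)
model.fit(X_tr, y_tr)
print("validation MSE:", mean_squared_error(y_val, model.predict(X_val)))
```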
In this project, a variety of machine learning models were experimented with. As mentioned earlier, the approach that works best for this dataset is the one where the data points are grouped, the feature dimension is enriched by aggregating over these groups, and some manual features are added on top. LightGBM, being fast and efficient on large datasets, works best among the models.
- Dataset: https://www.kaggle.com/competitions/pubg-version-3/data
- Linear Regression: Linear Regression Documentation
- Random Forest Regressor: Random Forest Regressor Documentation
- LightGBM: https://lightgbm.readthedocs.io/en/v3.3.2/
- Kaggle Notebook: https://www.kaggle.com/code/rajkachhadiya/pubg-eda-and-feature-engineering-and-lightgbm
- Kaggle Profile: https://www.kaggle.com/rajkachhadiya