Source: ESPN Cricket World Cup
Expand For Steps
Step 1: Install R Studio
Step 2: Download ODI Matches - Data
Step 3: Clean Data and Get Format ready
Step 4: Clone/Download the Repository
Step 5: Make necessary changes [e.g. add new match data to the WC_Train.csv file]
Step 6: Do the necessary exploratory data analysis (EDA)
Step 7: Run Random Forest Model
Step 8: Store Results in Random Forest Prediction.csv
Step 9: Run Logistic Regression Model
Step 10: Store Results in Logistic Regression Prediction.csv
Step 11: Run Compare Model Predict
Step 12: Store Models vs Actual Results in Compare Predict - RF vs. LR
- Objective
- Approach
- Data Collection
- Data Cleaning
- Exploratory Data Analysis
- Build Random Forest Model
- Random Forest Results
- Build Logistic Regression Model
- Logistic Regression Results
- Compare Model Performance
- LICENSE
- Acknowledgement
To predict ICC World Cup 2019 cricket matches based on each team's past performance.
- Collect data from – Link
- Data Cleaning and Data Normalization
- Exploratory Data Analysis Link - Repository
- Build Random Forest Model Link - Repository
- Performance of RF model & Results Link - Repository
- Build Logistic Regression Model Link - Repository
- Performance of LR model & Results Link - Repository
- Compare Models performance vs. Actual Match Results
In this study, our approach is to predict ICC WC 2019 matches from past ODI match results. On that basis, stronger teams such as Australia, India and New Zealand are expected to perform better, and weaker teams such as Pakistan and West Indies are expected to struggle. This is not our opinion: it is what the past ODI match data reveals about the strong and weak contenders for World Cup 2019.
Hence, we decided to study past ODI matches from 2007 to 2018. To collect the dataset, we used HowStats.
For data collection, we extracted ODI matches year by year [since 1987] and stored the dataset in Excel sheets. However, for our study we considered only ODI matches played from 2007 to 2018, because we believe very old match results [such as those from the early 1990s] should not have a significant impact on team-wise performance in the 2019 WC. Hence, we decided to study the most recent team-wise performances.
After extracting the data from HowStats, we stored the datasets in Excel sheets, one per year.
For cleaning, we relied heavily on the 'Text to Columns' function [basically, we used a few Excel functions to clean the entire dataset].
NOTE: Due to the lack of data for Afghanistan's matches, we decided to exclude Afghanistan from the study. [Had we included Afghanistan in the WC 2019 prediction study, the model would probably have predicted Afghanistan losing every match, and could have become biased!]
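The exclusion described in the note can be sketched in a few lines of base R. This is only an illustration on made-up rows: the column names Trim.Team.A and Trim.Team.B are taken from the code later in this README, while `matches`, `excluded` and `matches.clean` are hypothetical names, and the original cleaning was done in Excel rather than in R.

```r
# Toy rows standing in for the cleaned ODI data (values are made up).
matches <- data.frame(
  Trim.Team.A = c("India", "Afghanistan", "England", "Pakistan"),
  Trim.Team.B = c("Australia", "Sri Lanka", "Afghanistan", "West Indies"),
  Year = c(2015, 2016, 2017, 2018),
  stringsAsFactors = TRUE
)

# Keep a match only if neither side is the excluded team.
excluded <- "Afghanistan"
keep <- matches$Trim.Team.A != excluded & matches$Trim.Team.B != excluded

# droplevels() removes the now-unused factor level, so no empty
# dummy column is created for the excluded team later on.
matches.clean <- droplevels(matches[keep, ])

nrow(matches.clean)
levels(matches.clean$Trim.Team.A)
```

Dropping the unused level matters because the dummy-variable step builds one column per factor level, including levels that no longer occur in the data.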
For the WC 2019 match prediction study we decided to use data from 2007 to 2018. Many studies find that more data makes a model better, which is true. But for the objective of this study we deliberately limited the number of observations, because we feel that team performance from the early 1990s no longer reflects today's teams (especially the players, who have a significant impact on winning or losing a particular match). For example, West Indies was once a star-performing team, but for the last decade or longer the team has barely been able to win consistently.
We also assume that the more matches a team plays, the more ODI experience it has, and that this contributes to the team's overall performance.
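One possible way to look at the "matches played" assumption is to count how often each team appears on either side of a match. The sketch below uses made-up rows; the column names follow the code later in this README, and `played` is a hypothetical name, not a variable from the repository.

```r
# Toy match list (values are made up for illustration).
matches <- data.frame(
  Trim.Team.A = c("India", "England", "India", "Australia"),
  Trim.Team.B = c("Australia", "India", "England", "England"),
  stringsAsFactors = FALSE
)

# A team's experience is counted as its appearances as Team A plus Team B.
played <- table(c(matches$Trim.Team.A, matches$Trim.Team.B))
played <- sort(played, decreasing = TRUE)
played
```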
For the training dataset, we chose 983 observations, where most of the variables are factors.
> dim(wc) ## Dimensions of the dataset
> str(wc) ## Structure of the dataset
Hence, before building the supervised learning models, we converted the factors into dummy variables. Exploring the data with the rpivotTable(wc) function gave some interesting findings.
As the chart shows, over the last 2 years (2017 & 2018) England and India have been winning consistently and are trending at the top positions.
Similarly, you can see that the 2011 World Cup final was between India and Sri Lanka. In that cluster of years Australia was a top contender for the final, so how did Sri Lanka reach it? India knocked Australia out in the 2nd quarter-final, and Sri Lanka faced New Zealand in the semi-final and won by 5 wickets.
Similarly, for World Cup 2015, the bar chart shows how New Zealand emerged from 2012 to 2014 and challenged Australia in the 2015 WC final.
In World Cup 2019, the strong contenders are India, England, New Zealand and South Africa.
We loaded the dataset into R and created the train variable for cricket matches from 2007 to 2018.
NOTE: As of 26th June, the code has been tuned for more accurate results, and WC 2019 matches are also included to train the model.
wc = read.csv('WC_Train.csv')
## Data From 2007 World Cup till 2018 Cricket Matches
train = wc[which(wc$Year >= 2007 & wc$Year <=2018),]
For the supervised learning technique RF, we converted the Team A and Team B categorical variables into dummy variables.
## Create dummy variables for Team A and Team B (TRAIN)
Team.A.matrix = model.matrix(~ Trim.Team.A - 1, data = train)
train = data.frame(train, Team.A.matrix)
Team.B.matrix = model.matrix(~ Trim.Team.B - 1, data = train)
train = data.frame(train, Team.B.matrix)
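To make the model.matrix() calls above concrete, here is a small sketch on a made-up three-row data frame (`toy` and `dummies` are hypothetical names). The `- 1` in the formula drops the intercept, so every team level gets its own 0/1 column.

```r
# Three made-up matches, only the Team A column.
toy <- data.frame(Trim.Team.A = factor(c("India", "England", "India")))

# One indicator column per factor level; levels are ordered alphabetically.
dummies <- model.matrix(~ Trim.Team.A - 1, data = toy)

colnames(dummies)
dummies
```

Each row has exactly one 1 across the dummy columns. Note that data.frame() applies make.names() to column names by default, so a column for a team name containing a space would end up with a dot in place of the space once it is bound back onto train.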
As discussed earlier, the target variable in the study is Team.A.Won, which is '1' when Team A won a particular match and '0' when Team A lost it. Here, '0' means Team B won the match. With the randomForest() function from the randomForest library we built a random forest model on the train dataset. After tuning the model, we predicted results as both 'class' and 'prob' types.
print(wc.rf.tune)
test1$Team.A.Win = predict(wc.rf.tune, test1, type = 'class')
test1$Team.A.Score = predict(wc.rf.tune, test1, type = 'prob')
The results are stored in the Random Forest Prediction.csv file.
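The code that builds wc.rf.tune is not shown in this README, so the following is only a minimal sketch of fitting and predicting with randomForest() on simulated data, not the tuned model from the repository. The data, the helper `rf.sketch` and the `ntree` value are assumptions for illustration, and the example only runs if the randomForest package is installed.

```r
# Simulated stand-in for the training data (random outcomes, not real matches).
set.seed(42)
teams <- c("India", "England", "Australia", "New Zealand")
n <- 200
toy <- data.frame(
  Trim.Team.A = factor(sample(teams, n, replace = TRUE), levels = teams),
  Trim.Team.B = factor(sample(teams, n, replace = TRUE), levels = teams)
)
# The target must be a factor so that randomForest() fits a classifier.
toy$Team.A.Won <- factor(rbinom(n, 1, 0.5), levels = c(0, 1))

# Hypothetical helper: fit on `train`, return class and probability predictions.
rf.sketch <- function(train, newdata, ntree = 500) {
  fit <- randomForest::randomForest(Team.A.Won ~ ., data = train, ntree = ntree)
  list(
    fit = fit,
    class = predict(fit, newdata, type = "response"),   # predicted 0/1 label
    prob = predict(fit, newdata, type = "prob")[, "1"]  # P(Team A wins)
  )
}

if (requireNamespace("randomForest", quietly = TRUE)) {
  out <- rf.sketch(toy[1:150, ], toy[151:200, ])
  print(out$fit)  # includes the out-of-bag error estimate
  head(data.frame(class = out$class, prob = out$prob))
}
```

With type = "prob", predict() returns one column per class, which is why the sketch picks the "1" column as Team A's winning probability.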
The random forest model had a high error rate, and even after tuning the model we were not able to reduce the error. We were not fully satisfied with the results, and hence decided to also use the supervised learning technique Logistic Regression to predict the ICC Cricket World Cup 2019 matches.
NOTE: As of 26th June, the code has been tuned for more accurate results, and WC 2019 matches are also included to train the model.
Results from after the 26 June matches are stored in the Random Forest Prediction after 25th June Matches.csv file.
Similarly, for Logistic Regression we created a train dataset of ODI matches from 2007 to 2018 and created dummy variables, then modeled the target variable Team.A.Won against all the independent variables.
NOTE: As of 26th June, the code has been tuned for more accurate results, and WC 2019 matches are also included to train the model.
NOTE: As of 08th July, the code has been tuned for semi-final predictions.
logit = Team.A.Won ~ . # A few variables are not significant; however, because they represent teams, we decided to keep all variables.
logit.plot = glm(logit, data = train, family = binomial)
summary(logit.plot)
However, we also found that a few dummy variables among the independent variables are not significant for the study [such as Bangladesh and West Indies]. In the end, we decided to keep the dummy variables for all teams in the study.
Using the logit.plot model, we predicted the 2019 World Cup matches in the test1 file and stored the results in the Logistic Regression Prediction.csv file. We also evaluated the Logistic Regression model. However, we believe the true evaluation of the model is the actual match results.
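The prediction step is not shown in full in this README, so here is a minimal sketch of how predicted probabilities from a glm() fit can be turned into a 0/1 win label with ifelse(). The data is simulated and the predictor `strength.gap` is a hypothetical stand-in for the team dummy variables; the names `predict.logit` and `Team.A.Win` mirror those used elsewhere in this document.

```r
# Simulated training data: a single made-up predictor and a 0/1 outcome.
set.seed(1)
n <- 120
toy <- data.frame(strength.gap = rnorm(n))
toy$Team.A.Won <- rbinom(n, 1, plogis(1.5 * toy$strength.gap))

# Logistic regression, same family as in the study.
fit <- glm(Team.A.Won ~ strength.gap, data = toy, family = binomial)

# type = "response" returns probabilities rather than log-odds.
new.matches <- data.frame(strength.gap = c(-2, 0.1, 2))
predict.logit <- predict(fit, newdata = new.matches, type = "response")

# Convert each probability to a predicted result: 1 = Team A wins, 0 = Team B wins.
Team.A.Win <- ifelse(predict.logit > 0.5, 1, 0)

data.frame(new.matches, predict.logit, Team.A.Win)
```

The 0.5 cut-off matches the threshold used in the model evaluation code; a different cut-off would trade off the two kinds of misclassification.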
To evaluate the model, we plotted the ROC curve and calculated the accuracy of the predicted results.
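## Model Evaluation
library(SDMTools) # assumption: confusion.matrix(obs, pred, threshold) is taken from the SDMTools package
m3.matrix = confusion.matrix(test1$Team.A.Win, predict.logit, threshold = 0.5)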
m3.matrix
library(pROC)
m3.roc = roc(test1$Team.A.Win, predict.logit)
m3.roc
plot(m3.roc)
## ON RESULT RATIOS DATA SET
accuracy.logit<-sum(diag(m3.matrix))/sum(m3.matrix)
accuracy.logit
[1] 0.7567568
This gives a model accuracy of about 75.7%. The predicted results for the WC 2019 matches are stored in the Logistic Regression Prediction.csv file.
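The accuracy calculation used here, sum(diag(m)) / sum(m), is the share of matches on the diagonal of the confusion matrix, i.e. the correctly predicted ones. A small worked sketch in base R, on made-up labels and probabilities (all names and values below are illustrative, not results from the study):

```r
# Made-up actual outcomes and predicted probabilities for ten matches.
actual    <- c(1, 0, 1, 1, 0, 1, 0, 0, 1, 0)
predicted <- c(0.8, 0.3, 0.6, 0.4, 0.2, 0.9, 0.7, 0.1, 0.45, 0.35)

# Apply the same 0.5 threshold used in the evaluation code.
pred.class <- ifelse(predicted > 0.5, 1, 0)

# Rows are actual results, columns are predictions.
cm <- table(actual = factor(actual, levels = c(0, 1)),
            predicted = factor(pred.class, levels = c(0, 1)))
cm

# Correct predictions sit on the diagonal.
accuracy <- sum(diag(cm)) / sum(cm)
accuracy  # 0.7 for this toy data: 7 of 10 correct
```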
NOTE: As of 26th June, the code has been tuned for more accurate results, and WC 2019 matches are also included to train the model.
NOTE: As of 08th July, the code has been tuned for semi-final predictions.
Results from after the 26 June matches are stored in the Logistic Regression Prediction after 25th June Matches.csv file.
We also built a CHAID model to predict the WC 2019 matches. However, we didn't get good output from that model, and hence we didn't highlight it in the study. CHAID codes in Repository
Out of a total of 37 matches, 4 had no result (due to rain), and CHAID predicted only 17 matches correctly (the predicted team actually won the match). Hence, the success ratio for the model is 48.6% (17/35 matches). CHAID results
Based on the two supervised learning techniques, we built models that can predict the outcome of a WC 2019 match even before the match starts, and we compared the model results with the actual match results.
Hence, we uploaded the results of both models, RF and LR, in Compare Predict - RF vs. LR
colnames(ComparePredict)[colnames(ComparePredict) == 'Team.A.Win'] = 'RF Team.A.Win'
colnames(ComparePredict)[colnames(ComparePredict) == 'Team.A.Win.1'] = 'LR Team.A.Win'
colnames(ComparePredict)[colnames(ComparePredict) == 'Team.A.Score.1'] = 'Prob % RF Team.A.Win'
colnames(ComparePredict)[colnames(ComparePredict) == 'predict.logit'] = 'Prob % LR Team.A.Win'
In the same .csv file we also manually entered actual match result.
- Random Forest predicted 23 matches correctly out of 34: 67.6% correct
- Logistic Regression predicted 22 matches correctly out of 34: 64.7% correct
Note: Afghanistan's matches and matches abandoned due to rain are not included in the result score.
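The scoring rule in the note can be sketched as follows. The table is entirely made up and the column names (`rf`, `lr`, `actual`) and the helper `hit.rate` are hypothetical, not the columns of the comparison file; a missing `actual` value stands for a match with no result.

```r
# Made-up comparison table: model picks and actual results (1 = Team A won).
compare <- data.frame(
  Team.A = c("India", "England", "Afghanistan", "Pakistan", "Australia"),
  Team.B = c("South Africa", "Pakistan", "Australia", "Sri Lanka", "West Indies"),
  rf     = c(1, 1, 0, 1, 0),
  lr     = c(1, 0, 0, 1, 1),
  actual = c(1, 0, 0, NA, 1)   # NA = no result (e.g. rain)
)

# Score only matches with a result that do not involve Afghanistan.
scored <- !is.na(compare$actual) &
  compare$Team.A != "Afghanistan" &
  compare$Team.B != "Afghanistan"

# Share of scored matches where the prediction equals the actual result.
hit.rate <- function(pred, actual) mean(pred == actual)

rf.acc <- hit.rate(compare$rf[scored], compare$actual[scored])
lr.acc <- hit.rate(compare$lr[scored], compare$actual[scored])

c(scored.matches = sum(scored), rf = rf.acc, lr = lr.acc)
```

The same arithmetic gives the reported figures: 23 / 34 is about 67.6% and 22 / 34 is about 64.7%.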
However, a few matches were very close calls in terms of the predicted % probability of winning for the team.
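One simple way to flag such close calls is to mark predictions whose probability lies within a small margin of 50%. The probabilities and the 0.10 margin below are assumptions chosen for illustration; the study does not define a specific cut-off.

```r
# Made-up predicted probabilities of Team A winning.
prob <- c(0.52, 0.91, 0.45, 0.30, 0.58)

# Hypothetical margin: anything within 10 points of 50% counts as a close call.
margin <- 0.10
close.call <- abs(prob - 0.5) <= margin

which(close.call)   # positions of the close-call matches
```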
NOTE: Python code updated with a Neural Network technique to predict WC 2019 results.
FINAL RESULT: ENGLAND WON THE ICC WORLD CUP 2019 [Our prediction was a 74.12% probability of England winning WC 2019 and a 25.8% probability of New Zealand winning]. We would have been happier if our predicted probabilities had been closer to 50%, because the match went into a Super Over and both teams were very close to winning the trophy.
- Work closely on overfitting in model building.
- Build model based on CART
- Build model based on LDA
- Build model based on Neural Network
- Data Collection from various sources
- This was our first time working on a real-time Machine Learning project. It was interesting to choose data from previous matches and build Random Forest and Logistic Regression models.
- Initially we tried to build a CHAID model; however, because of the (numerical) data we were not able to fit the model the way we wanted. Hence, we decided to create dummy variables for the categorical variables and build Random Forest (RF) and Logistic Regression (LR) models.
- Our initial thought was that RF would not give good results, and hence we depended on LR. But we saw that in a few matches RF worked very well.
- Converted probability results into binary (0 or 1) [Logistic Regression] based on match win, using the ifelse() function. Simple!
This Project/Repository is licensed under the MIT License.
This Project/Repository is part of Great Learning - Cricket World Cup Challenge.