Merge pull request #5 from fkhan72/master

included EDA in the proposal and created the EDA file
UBC-MDS · Jan 18, 2020 · 4cbe450
2 parents 1aa1ec2 + 5e26c5e
commit 4cbe450

Showing 2 changed files with 338 additions and 4 deletions.
diff --git a/README.md b/README.md
@@ -12,13 +12,14 @@ For this project, we'll be using data on applications for vanity license plates
 ### Research Question  
 What features are the strongest predictors of a rejected license plate?
 
-### Analysis Plan  
-We will be performing our analysis in Python to take advantage of the scikit-learn package. We will use sklearn's CountVectorizer funciton to engineer features from each plate. The features will be character strings of varying lengths (n-grams). We will use sklearn's GridSearchCV to optimize the length of n-grams used (i.e. 2,3 and 4 letter strings, 4,5 and 6 letter strings, etc.). Then, we'll fit a MultinomialNB model to the training split of the data and evaluate model performance. Once we feel the model is optimized, we can identify the strongest predictors of rejected license plates using the predict_proba attribute of the fit model.
+### EDA
+The dataset contains two sets of `plate` configurations: those that were accepted and those that were rejected. Out of 133,636 total examples, only 1,646 (1.23%) belong to the rejected class, so there is significant class imbalance.  
 
-We are aware that the dataset is imbalanced (100,000+ examples in the accepted class and ~2000 examples in the rejected class) and are exploring ways to account for this. One method we're considering is to undersample the accepted class.
 
-### EDA plan  
+### Analysis Plan  
+We will perform our analysis in Python to take advantage of the scikit-learn package. We will use sklearn's CountVectorizer class to engineer features from each plate. The features will be character strings of varying lengths (n-grams). We will use sklearn's GridSearchCV to optimize the length of n-grams used (e.g. 2-, 3- and 4-letter strings, or 4-, 5- and 6-letter strings). Then, we'll fit a MultinomialNB model to the training split of the data and evaluate model performance. Once we feel the model is optimized, we can identify the strongest predictors of rejected license plates using the predict_proba method of the fitted model.
 
+We are aware that the dataset has significant class imbalance and are exploring ways to account for this. One method we're considering is to undersample the accepted class.
 
 ### Report plan
 We plan to display a subset of the strongest features in a bar graph, and are considering adding a word map to enhance this visualization. We can also show a confusion matrix to assess the efficacy of our model in performing its prescribed task.