Merge pull request #6 from cgostic/master
Edit proposal for grammar
Keanna-K authored Jan 19, 2020
2 parents 4cbe450 + 3e8f16c commit 2ceaa3b
Showing 1 changed file with 4 additions and 7 deletions.
# L1c3nc3 t0 C0d3
**DCSI_522_group-415**
Authors: Keanna Knebel, Cari Gostic, Furqan Khan

## Project Proposal

### Dataset
For this project, we'll be using data on applications for vanity license plates
What features are the strongest predictors of a rejected license plate?

### EDA
The dataset contains two classifications of `plate` configurations, accepted and rejected. A useful visualization we created in our EDA was a bar graph comparing the counts of observations in each class. The analysis showed significant class imbalance: of 133,636 total examples, only 1,646 (1.23%) belong to the rejected class. This is important to note, because a model that does not account for this imbalance will likely achieve a very high overall score but perform very poorly when predicting the rejected class.
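As a quick sanity check, the imbalance figure quoted above can be reproduced from the class counts (the counts below are taken from the EDA; variable names are illustrative):

```python
# Class counts reported in the EDA
total_examples = 133_636
rejected_count = 1_646
accepted_count = total_examples - rejected_count

# Share of examples in the minority (rejected) class
rejected_pct = rejected_count / total_examples * 100
print(f"Rejected class: {rejected_pct:.2f}% of all examples")  # ~1.23%
```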

### Analysis Plan
We will perform our analysis in Python to take advantage of the scikit-learn package. We will use sklearn's CountVectorizer function to engineer features from each plate; the features will be character strings of varying lengths (n-grams). We will use sklearn's GridSearchCV to optimize the length of the n-grams used (i.e., 2-, 3-, and 4-letter strings; 4-, 5-, and 6-letter strings; etc.). Then, we'll fit a MultinomialNB model to the training split of the data and evaluate model performance. Once we feel the model is optimized, we can identify the strongest predictors of rejected license plates using the predict_proba method of the fit model.

We are also exploring ways to account for the significant class imbalance. One method we're considering is to undersample the accepted class.
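One way to undersample the accepted class, sketched with pandas (the toy data frame and its `plate`/`status` column names are hypothetical, not from the project repository):

```python
import pandas as pd

# Hypothetical mini-dataset with the same shape of imbalance
df = pd.DataFrame({
    "plate":  ["COOL4U", "FAST1", "SUNNY", "HELLO1", "BADWRD", "RUDE42"],
    "status": ["accepted"] * 4 + ["rejected"] * 2,
})

rejected = df[df["status"] == "rejected"]
accepted = df[df["status"] == "accepted"]

# Randomly sample the accepted class down to the size of the rejected class
balanced = pd.concat([rejected, accepted.sample(n=len(rejected), random_state=123)])
print(balanced["status"].value_counts())
```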

### Report Plan
We plan to display a subset of the strongest features in a bar graph, and are considering adding a word cloud to enhance this visualization. We can also show a confusion matrix to assess the efficacy of our model in performing its prescribed task.
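The confusion matrix itself is straightforward to produce with scikit-learn; the label vectors below are hypothetical placeholders for the model's test-set predictions:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true vs. predicted labels (1 = rejected)
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)
```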
