Merge pull request #6 from cgostic/master
Edit proposal for grammar
Keanna-K authored Jan 19, 2020
2 parents 4cbe450 + 3e8f16c commit 2ceaa3b
Showing 1 changed file with 4 additions and 7 deletions.
# L1c3nc3 t0 C0d3
**DCSI_522_group-415**
Authors: Keanna Knebel, Cari Gostic, Furqan Khan

## Project Proposal

### Dataset
For this project, we'll be using data on applications for vanity license plates
What features are the strongest predictors of a rejected license plate?

### EDA
The dataset contains two classifications of `plate` configurations, accepted and rejected. A useful visualization we created in our EDA was a bar graph comparing the counts of observations in each class. The analysis showed significant class imbalance: of 133,636 total examples, only 1,646 (1.23%) belong to the rejected class. This is important to note, because a model that does not account for this imbalance will likely achieve a very high overall score but perform very poorly when predicting the rejected class.
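As a quick sanity check, the imbalance figure quoted above can be reproduced from the class counts (the counts below are taken from the EDA; variable names are illustrative):

```python
# Class counts reported in the EDA
total_examples = 133_636
rejected_count = 1_646
accepted_count = total_examples - rejected_count

# Share of examples in the minority (rejected) class
rejected_pct = rejected_count / total_examples * 100
print(f"Rejected class: {rejected_pct:.2f}% of all examples")  # ~1.23%
```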

### Analysis Plan
We will perform our analysis in Python to take advantage of the scikit-learn package. We will use sklearn's CountVectorizer function to engineer features from each plate; the features will be character strings of varying lengths (n-grams). We will use sklearn's GridSearchCV to optimize the length of the n-grams used (i.e., 2-, 3-, and 4-letter strings; 4-, 5-, and 6-letter strings; etc.). Then, we'll fit a MultinomialNB model to the training split of the data and evaluate model performance. Once we feel the model is optimized, we can identify the strongest predictors of rejected license plates using the predict_proba method of the fit model.

We are also exploring ways to account for the significant class imbalance. One method we're considering is to undersample the accepted class.
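One way to undersample the accepted class, sketched with pandas (the toy data frame and its `plate`/`status` column names are hypothetical, not from the project repository):

```python
import pandas as pd

# Hypothetical mini-dataset with the same shape of imbalance
df = pd.DataFrame({
    "plate":  ["COOL4U", "FAST1", "SUNNY", "HELLO1", "BADWRD", "RUDE42"],
    "status": ["accepted"] * 4 + ["rejected"] * 2,
})

rejected = df[df["status"] == "rejected"]
accepted = df[df["status"] == "accepted"]

# Randomly sample the accepted class down to the size of the rejected class
balanced = pd.concat([rejected, accepted.sample(n=len(rejected), random_state=123)])
print(balanced["status"].value_counts())
```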

### Report Plan
We plan to display a subset of the strongest features in a bar graph, and are considering adding a word cloud to enhance this visualization. We can also show a confusion matrix to assess the efficacy of our model in performing its prescribed task.
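The confusion matrix itself is straightforward to produce with scikit-learn; the label vectors below are hypothetical placeholders for the model's test-set predictions:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true vs. predicted labels (1 = rejected)
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)
```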
