Merge pull request #5 from fkhan72/master
included EDA in the proposal and created the EDA file
cgostic authored Jan 18, 2020
2 parents 1aa1ec2 + 5e26c5e commit 4cbe450
Showing 2 changed files with 338 additions and 4 deletions.
9 changes: 5 additions & 4 deletions README.md
@@ -12,13 +12,14 @@ For this project, we'll be using data on applications for vanity license plates
### Research Question
What features are the strongest predictors of a rejected license plate?

### Analysis Plan
We will perform our analysis in Python to take advantage of the scikit-learn package. We will use sklearn's CountVectorizer class to engineer features from each plate. The features will be character strings of varying lengths (n-grams). We will use sklearn's GridSearchCV to optimize the length of the n-grams used (e.g. 2-, 3- and 4-letter strings; 4-, 5- and 6-letter strings; etc.). Then, we'll fit a MultinomialNB model to the training split of the data and evaluate model performance. Once we feel the model is optimized, we can identify the strongest predictors of rejected license plates using the predict_proba method of the fitted model.
### EDA
The dataset contains two sets of `plate` configurations: those that were accepted and those that were rejected. Out of 133,636 total examples, only 1,646 (1.23%) belong to the rejected class, so there is significant class imbalance.
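
The proportion quoted above can be checked with a few lines of arithmetic. This is only a sketch that uses the two counts from the paragraph; the variable names are illustrative.

```python
# Counts quoted in the EDA paragraph.
total_examples = 133_636
rejected_examples = 1_646

# Everything that is not rejected is assumed to be in the accepted class.
accepted_examples = total_examples - rejected_examples
rejected_share = rejected_examples / total_examples

print(f"accepted examples: {accepted_examples}")
print(f"rejected share: {rejected_share:.2%}")
print(f"accepted per rejected: {accepted_examples / rejected_examples:.1f}")
```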

We are aware that the dataset is imbalanced (100,000+ examples in the accepted class and ~2000 examples in the rejected class) and are exploring ways to account for this. One method we're considering is to undersample the accepted class.

### EDA plan
### Analysis Plan
We will perform our analysis in Python to take advantage of the scikit-learn package. We will use sklearn's CountVectorizer class to engineer features from each plate. The features will be character strings of varying lengths (n-grams). We will use sklearn's GridSearchCV to optimize the length of the n-grams used (e.g. 2-, 3- and 4-letter strings; 4-, 5- and 6-letter strings; etc.). Then, we'll fit a MultinomialNB model to the training split of the data and evaluate model performance. Once we feel the model is optimized, we can identify the strongest predictors of rejected license plates using the predict_proba method of the fitted model.

We are aware that the dataset has significant class imbalance and are exploring ways to account for this. One method we're considering is to undersample the accepted class.
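
As a rough illustration of undersampling, the sketch below randomly keeps as many accepted examples as there are rejected ones. The helper name `undersample_majority`, the 1:1 target ratio, and the toy data are all assumptions made for the example; the project may settle on a different ratio or technique.

```python
import random


def undersample_majority(plates, labels, majority_label=0, seed=0):
    """Randomly drop majority-class examples until both classes are equal in size."""
    rng = random.Random(seed)
    majority = [i for i, label in enumerate(labels) if label == majority_label]
    minority = [i for i, label in enumerate(labels) if label != majority_label]

    # Keep a random subset of the majority class, no larger than the minority.
    kept = rng.sample(majority, k=min(len(majority), len(minority)))

    # Preserve the original ordering of the surviving examples.
    indices = sorted(kept + minority)
    return [plates[i] for i in indices], [labels[i] for i in indices]


# Hypothetical toy data: 8 accepted (0) and 2 rejected (1) plates.
plates = [f"PLATE{i}" for i in range(10)]
labels = [0] * 8 + [1] * 2

balanced_plates, balanced_labels = undersample_majority(plates, labels)
print(balanced_plates, balanced_labels)
```

Undersampling would normally be applied to the training split only, so that evaluation still reflects the original class proportions.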

### Report plan
We plan to display a subset of the strongest features in a bar graph, and are considering adding a word map to enhance this visualization. We can also show a confusion matrix to assess the efficacy of our model in performing its prescribed task.
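
To make the confusion matrix concrete, here is a small standard-library sketch that tallies its four cells for a binary problem. The labels and predictions are invented for illustration, and `confusion_counts` is a hypothetical helper; scikit-learn's `sklearn.metrics.confusion_matrix` computes the same tally.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Tally true/false positives and negatives, treating `positive` as the rejected class."""
    counts = Counter(tp=0, fp=0, fn=0, tn=0)
    for truth, prediction in zip(y_true, y_pred):
        if truth == positive:
            counts["tp" if prediction == positive else "fn"] += 1
        else:
            counts["fp" if prediction == positive else "tn"] += 1
    return counts


# Hypothetical labels and predictions (1 = rejected, 0 = accepted).
y_true = [1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

counts = confusion_counts(y_true, y_pred)
precision = counts["tp"] / (counts["tp"] + counts["fp"])
recall = counts["tp"] / (counts["tp"] + counts["fn"])
print(dict(counts), precision, recall)
```

With a rejected class this rare, precision and recall on the rejected class are likely to be more informative than overall accuracy.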