Merge pull request #46 from fkhan72/master
makefile graph included in README
cgostic authored Feb 8, 2020
2 parents 0915587 + 5fd5908 commit 8e20ffb
Showing 13 changed files with 62 additions and 18 deletions.
Binary file added Makefile_graph.png
26 changes: 24 additions & 2 deletions README.html

Large diffs are not rendered by default.

27 changes: 26 additions & 1 deletion README.md
@@ -1,3 +1,4 @@

# L1c3nc3 t0 C0d3
**DSCI_522_group-415**
Authors: Keanna Knebel, Cari Gostic, Furqan Khan
@@ -14,6 +15,26 @@ The final report can be found [here.](https://ubc-mds.github.io/DSCI_522_group_4

## Usage

### 1. Using Docker
*Note: the instructions in this section also depend on running the commands in a Unix shell (e.g., Terminal or Git Bash). If you are using Windows Command Prompt, replace `/$(pwd)` with PATH_ON_YOUR_COMPUTER.*

1. Install [Docker](https://www.docker.com/get-started)
2. Download/clone this repository
3. Use the command line to navigate to the root of this downloaded/cloned repo
4. Type the following to run the analysis:

```
docker run --rm -v /$(pwd):/home/522_project fkhan72/522_proj:v1.0 make -C /home/522_project all
```

5. Type the following to clean up the analysis:

```
docker run --rm -v /$(pwd):/home/522_project fkhan72/522_proj:v1.0 make -C /home/522_project clean
```

### 2. Using Bash/Terminal

To replicate the analysis performed in this project, clone this GitHub repository, install the required [dependencies](#package-dependencies) listed below, and run the following commands in your command line/terminal from the root directory of this project:

1. 01_download_data.R
@@ -41,7 +62,7 @@ python scripts/04_data_model.py --file_path_read="data/processed/" --filename_x_
Rscript -e "rmarkdown::render('docs/05_generate_report.rmd')"
```

### Running complete project
#### Running complete project

To run the entire project, run the following commands in your command line/terminal from the root directory of this project:

@@ -55,7 +76,11 @@ To clear the generated outputs from the scripts, run the following commands in y
make clean
```

#### Makefile graph

![](Makefile_graph.png)


## Package Dependencies

### Python 3.7.3 and Python packages:
5 changes: 1 addition & 4 deletions docs/05_generate_report.Rmd
@@ -35,7 +35,7 @@ Definitive answers to these questions will benefit the NYSDMV by reducing th
## Data
The datasets used in this analysis include all accepted vanity plate submissions and rejected vanity plate submissions between October 2010 and September 2014. @wnyc The datasets were sourced from the WNYC Data News Team's license-plates repository and can be found [here.](https://github.com/datanews/license-plates) The rejected plates were rejected during the second stage of review. In other words, they do not contain any character strings from the red-guide. The raw data has two columns, `Date` and `plate`, where `Date` is not included in this analysis. Another column, `outcome`, was added before combining the accepted and rejected datasets to indicate the class of each observation as "accepted" or "rejected". The `plate` column contains the submitted alphanumeric character string of length 2 to 8. An initial evaluation shows a large imbalance in the count of observations for the two classes, with 131,990 in class "accepted" and 1,646 in class "rejected" (**Figure 1**).
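
A minimal sketch of the labelling-and-combining step described above; the toy data frames, plate strings, and use of pandas here are illustrative assumptions, not the project's actual file paths or code.

```python
import pandas as pd

# Toy stand-ins for the two raw datasets (the real files have columns Date and plate)
accepted = pd.DataFrame({"plate": ["SUNSHINE", "GOLFER1"]})
rejected = pd.DataFrame({"plate": ["RUDEWRD"]})

# Label each observation's class, then combine the two datasets,
# mirroring the `outcome` column described above
accepted["outcome"] = "accepted"
rejected["outcome"] = "rejected"
plates = pd.concat([accepted, rejected], ignore_index=True)

print(plates["outcome"].value_counts())  # the real data: 131,990 accepted vs 1,646 rejected
```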

```{r class imbalance image, fig.cap = 'Figure 1: The number of examples per classification',echo=FALSE, fig.align='center'}
```{r class imbalance image, fig.cap = 'Figure 1: The number of examples per classification',out.width = "300px", echo=FALSE, fig.align='center'}
knitr::include_graphics("../results/examples_per_classification.png")
```

@@ -68,8 +68,6 @@ knitr::include_graphics("../results/train_val_error.png")

Additionally, using an `ngram_range = (2,2)` is preferable in that it greatly reduces the number of correlated features and reduces model overfitting. For example, if we included all 2- and 3-length n-grams, every 3-length n-gram would be correlated with two 2-length n-grams (e.g. `CAT` is made up of `CA`, `AT`). Any more complex feature we could add can be described by the combination of two or more simple length-2 n-grams. Longer n-grams are also more specific to the training dataset, and are therefore less likely to generalize to outside data. Our `MultinomialNB`, run against a feature set transformed by a `CountVectorizer` with `analyzer = 'char'` and `ngram_range = (2,2)`, achieves the validation model metrics shown in **Table 1**.
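
A minimal sketch of the model configuration named above (character 2-grams fed to a naive Bayes classifier); the toy plate strings, labels, and the use of a scikit-learn pipeline are assumptions for illustration, not the project's actual training code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Character bigrams only, e.g. "CAT" -> "CA", "AT"
model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(2, 2)),
    MultinomialNB(),
)

# Toy training data (hypothetical plates and labels)
plates = ["COOLCAR", "NYC4EVA", "GR8GOLF", "FASTER1"]
labels = ["accepted", "accepted", "accepted", "rejected"]
model.fit(plates, labels)
print(model.predict(["GOLF4ME"]))
```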



```{r eval metrics validation, fig.cap = 'Table 1: Model validation evaluation metrics',echo=FALSE, fig.align='center', out.width = '30%'}
knitr::include_graphics("../results/clf_val_report.png")
```
@@ -80,7 +78,6 @@ Though the recall for the rejected class is fairly low, its precision is decent

The optimized model shows similar accuracy on the testing dataset. As seen in **Table 2**, our model achieves a testing accuracy score of 0.7397 overall. As the validation metrics showed, the model predicts the `accepted` class better than it predicts the `rejected` class, which is not ideal, but it also has a higher precision than recall, which, as discussed above, is preferable in this context.
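
The per-class precision and recall reported in Tables 1 and 2 are of the kind produced by scikit-learn's `classification_report`; the labels below are made up solely to show the call, not the project's actual test results.

```python
from sklearn.metrics import classification_report

# Hypothetical true and predicted labels standing in for the held-out test split
y_test = ["accepted", "accepted", "rejected", "accepted", "rejected"]
y_pred = ["accepted", "accepted", "accepted", "accepted", "rejected"]

# Prints per-class precision, recall, and f1-score, plus overall accuracy
print(classification_report(y_test, y_pred))
```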


```{r eval metrics testing, fig.cap = 'Table 2: Model testing evaluation metrics', echo=FALSE, fig.align='center', out.width = '30%'}
knitr::include_graphics("../results/clf_test_report.png")
```
20 changes: 10 additions & 10 deletions docs/05_generate_report.html

Large diffs are not rendered by default.

Binary file modified results/class_proportion_bl.png
Binary file modified results/clf_test_report.png
Binary file modified results/clf_val_report.png
Binary file modified results/examples_per_classification.png
Binary file modified results/ngram_length_counts.png
Binary file modified results/predictors_2_2.png
Binary file modified results/train_val_error.png
2 changes: 1 addition & 1 deletion scripts/03_EDA.py
@@ -108,7 +108,7 @@ def main(file_path_raw, file_path_pro, accepted_plates_csv, rejected_plates_csv,
).properties(title = "Counts of n-gram frequency by length",
width = 400,
height = 80,
columns = 1,
# columns = 1,
background = 'white'))

n_g_len_chart.save(file_path_img+'ngram_length_counts.png', scale_factor = 2.0)
