Merge pull request #46 from fkhan72/master
makefile graph included in README
cgostic authored Feb 8, 2020
2 parents 0915587 + 5fd5908 commit 8e20ffb
Showing 13 changed files with 62 additions and 18 deletions.
Binary file added Makefile_graph.png
26 changes: 24 additions & 2 deletions README.html

Large diffs are not rendered by default.

27 changes: 26 additions & 1 deletion README.md
@@ -1,3 +1,4 @@

# L1c3nc3 t0 C0d3
**DSCI_522_group-415**
Authors: Keanna Knebel, Cari Gostic, Furqan Khan
@@ -14,6 +15,26 @@ The final report can be found [here.](https://ubc-mds.github.io/DSCI_522_group_4

## Usage

### 1. Using Docker
*Note: the instructions in this section also depend on running the commands in a Unix shell (e.g., Terminal or Git Bash). If you are using Windows Command Prompt, replace `/$(pwd)` with PATH_ON_YOUR_COMPUTER.*

1. Install [Docker](https://www.docker.com/get-started)
2. Download/clone this repository
3. Use the command line to navigate to the root of this downloaded/cloned repo
4. Type the following to run the analysis:

```
docker run --rm -v /$(pwd):/home/522_project fkhan72/522_proj:v1.0 make -C /home/522_project all
```

5. Type the following to clean up the analysis:

```
docker run --rm -v /$(pwd):/home/522_project fkhan72/522_proj:v1.0 make -C /home/522_project clean
```

### 2. Using Bash/Terminal

To replicate the analysis performed in this project, clone this GitHub repository, install the required [dependencies](#package-dependencies) listed below, and run the following commands in your command line/terminal from the root directory of this project:

1. 01_download_data.R
@@ -41,7 +62,7 @@ python scripts/04_data_model.py --file_path_read="data/processed/" --filename_x_
Rscript -e "rmarkdown::render('docs/05_generate_report.rmd')"
```

### Running complete project
#### Running complete project

To run the entire project, run the following commands in your command line/terminal from the root directory of this project:

@@ -55,7 +76,11 @@ To clear the generated outputs from the scripts, run the following commands in y
make clean
```

#### Makefile graph

![](Makefile_graph.png)


## Package Dependencies

### Python 3.7.3 and Python packages:
5 changes: 1 addition & 4 deletions docs/05_generate_report.Rmd
@@ -35,7 +35,7 @@ Definitive answers to these questions will benefit the NYSDMV by reducing th
## Data
The datasets used in this analysis include all accepted vanity plate submissions and rejected vanity plate submissions between October 2010 and September 2014. @wnyc The datasets were sourced from the WNYC Data News Team's license-plates repository and can be found [here.](https://github.com/datanews/license-plates) The rejected plates were rejected during the second stage of review. In other words, they do not contain any character strings from the red-guide. The raw data has two columns, `Date` and `plate`, where `Date` is not included in this analysis. Another column, `outcome`, was added before combining the accepted and rejected datasets to indicate the class of each observation as "accepted" or "rejected". The `plate` column contains the submitted alphanumeric character string of length 2 to 8. An initial evaluation shows a large imbalance in the count of observations for the two classes, with 131,990 in class "accepted" and 1,646 in class "rejected" (**Figure 1**).
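
A minimal sketch of the labelling-and-combining step described above; the toy data frames, plate strings, and use of pandas here are illustrative assumptions, not the project's actual file paths or code.

```python
import pandas as pd

# Toy stand-ins for the two raw datasets (the real files have columns Date and plate)
accepted = pd.DataFrame({"plate": ["SUNSHINE", "GOLFER1"]})
rejected = pd.DataFrame({"plate": ["RUDEWRD"]})

# Label each observation's class, then combine the two datasets,
# mirroring the `outcome` column described above
accepted["outcome"] = "accepted"
rejected["outcome"] = "rejected"
plates = pd.concat([accepted, rejected], ignore_index=True)

print(plates["outcome"].value_counts())  # the real data: 131,990 accepted vs 1,646 rejected
```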

```{r class imbalance image, fig.cap = 'Figure 1: The number of examples per classification',echo=FALSE, fig.align='center'}
```{r class imbalance image, fig.cap = 'Figure 1: The number of examples per classification',out.width = "300px", echo=FALSE, fig.align='center'}
knitr::include_graphics("../results/examples_per_classification.png")
```

@@ -68,8 +68,6 @@ knitr::include_graphics("../results/train_val_error.png")

Additionally, using an `ngram_range = (2,2)` is preferable in that it greatly reduces the number of correlated features and reduces model overfitting. For example, if we included all 2- and 3-length n-grams, every 3-length n-gram would be correlated with two 2-length n-grams (e.g. `CAT` is made up of `CA`, `AT`). Any more complex feature we could add can be described by the combination of two or more simple length-2 n-grams. Longer n-grams are also more specific to the training dataset, and are therefore less likely to generalize to outside data. Our `MultinomialNB`, run against a feature set transformed by a `CountVectorizer` with `analyzer = 'char'` and `ngram_range = (2,2)`, achieves the validation model metrics shown in **Table 1**.
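
A minimal sketch of the model configuration named above (character 2-grams fed to a naive Bayes classifier); the toy plate strings, labels, and the use of a scikit-learn pipeline are assumptions for illustration, not the project's actual training code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Character bigrams only, e.g. "CAT" -> "CA", "AT"
model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(2, 2)),
    MultinomialNB(),
)

# Toy training data (hypothetical plates and labels)
plates = ["COOLCAR", "NYC4EVA", "GR8GOLF", "FASTER1"]
labels = ["accepted", "accepted", "accepted", "rejected"]
model.fit(plates, labels)
print(model.predict(["GOLF4ME"]))
```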



```{r eval metrics validation, fig.cap = 'Table 1: Model validation evaluation metrics',echo=FALSE, fig.align='center', out.width = '30%'}
knitr::include_graphics("../results/clf_val_report.png")
```
@@ -80,7 +78,6 @@ Though the recall for the rejected class is fairly low, its precision is decent

The optimized model shows similar accuracy on the testing dataset. As seen in **Table 2**, our model achieves a testing accuracy score of 0.7397 overall. As the validation metrics showed, the model predicts the `accepted` class better than it predicts the `rejected` class, which is not ideal, but it also has a higher precision than recall, which, as discussed above, is preferable in this context.
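
The per-class precision and recall reported in Tables 1 and 2 are of the kind produced by scikit-learn's `classification_report`; the labels below are made up solely to show the call, not the project's actual test results.

```python
from sklearn.metrics import classification_report

# Hypothetical true and predicted labels standing in for the held-out test split
y_test = ["accepted", "accepted", "rejected", "accepted", "rejected"]
y_pred = ["accepted", "accepted", "accepted", "accepted", "rejected"]

# Prints per-class precision, recall, and f1-score, plus overall accuracy
print(classification_report(y_test, y_pred))
```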


```{r eval metrics testing, fig.cap = 'Table 2: Model testing evaluation metrics', echo=FALSE, fig.align='center', out.width = '30%'}
knitr::include_graphics("../results/clf_test_report.png")
```
20 changes: 10 additions & 10 deletions docs/05_generate_report.html

Large diffs are not rendered by default.

Binary file modified results/class_proportion_bl.png
Binary file modified results/clf_test_report.png
Binary file modified results/clf_val_report.png
Binary file modified results/examples_per_classification.png
Binary file modified results/ngram_length_counts.png
Binary file modified results/predictors_2_2.png
Binary file modified results/train_val_error.png
2 changes: 1 addition & 1 deletion scripts/03_EDA.py
@@ -108,7 +108,7 @@ def main(file_path_raw, file_path_pro, accepted_plates_csv, rejected_plates_csv,
).properties(title = "Counts of n-gram frequency by length",
width = 400,
height = 80,
columns = 1,
# columns = 1,
background = 'white'))

n_g_len_chart.save(file_path_img+'ngram_length_counts.png', scale_factor = 2.0)
