IMD_D3_M2_Base_R_Plots.Rmd

---
title: "Day 3: Data Viz - Module 2"
author: "Ellen Cheng & Kate Miller"
output:
  html_document
---

#### Plotting with Base R {.tabset}

```{r, echo = FALSE, eval = TRUE, warning = FALSE, message = FALSE}

viz_dat <- readRDS("Data/viz_dat.RDS")
```

<details open><summary class='drop'>Data Used in this Section</summary>
```{r, echo = TRUE, eval = FALSE, warning = FALSE, message = FALSE}

# This section requires the `viz_dat` data frame created earlier in the module. If you don't have this data frames in the global environment, run the code below.

viz_dat <- readRDS("viz_dat.RDS") # this file is provided with the training materials. If this file is not in your current working directory, provide the file path in the function argument
```
</details>
<br>

<details open><summary class='drop'>Packages Used in this Section</summary>
```{r, echo = TRUE, message = FALSE, warning = FALSE}

# Packages used in this section
library(tidyverse)
```
</details>
<br>

<details open><summary class='drop'>Why plot in base `R`?</summary>
Plots can reveal data issues that we had overlooked with our data checks. I sometimes use base R plotting functions to create quick-and-dirty plots for this purpose. If I want to do a lot of plot customizations, I find `ggplot` functions easier to remember and use.

In addition to quick plots, base R plotting functions have another important use. Many `R` packages have functions that create specific data visualizations when `R`'s generic `plot()` is applied to an object of a particular class. For example, `lm()` is a statistical function that fits a general linear model to data and saves the model as an object of class `lm`. Run the code below to see what happens when we apply `plot()` to an object of class `lm`.

```{r lm_plots, echo = TRUE, eval = FALSE, warning = FALSE, message = FALSE}

# NOTE: `lm` is part of the `stats` package that is automatically installed with `R`. This code comes from the Examples section of `?lm`.

# Create fake data
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)

# Simple linear regression of `weight` as a function of group
mod <- lm(weight ~ group) # assign model results to the name `mod`
summary(mod) # model summary
class(mod) # class of `mod` is `lm`
plot(mod) # residual plots
```

We aren't going to discuss statistics or residual plots today, but it's useful to know these common uses for `plot()`.
</details>
<br>

<details open><summary class='drop'>Feeding Data to Base R Plotting Functions</summary>

Look in the help file for the R graphics package that is automatically installed with `R` (in RStudio's 'Packages' tab, click the hyperlink for the package called 'graphics'). Here you will find many functions for base R graphics. For example, the `boxplot()` function draws box plots. 

Typically, we can feed data to most base R plotting functions in one of two ways:

1. Provide the data as a formula of the form `y ~ x` where `y` is a vector of data for the y-axis and `x` is a vector of data for the x-axis. For example, `plot(y ~ x, data = my_data)` or `plot(my_data$y ~ my_data$x)`.

2. Provide the vectors as separate arguments. For example, `plot(x = my_data$x, y = my_data$y)` When providing vectors as separate arguments, each vector must be a standalone vector. That is, we can't just give column names and provide the data frame via a data argument.

With the non-formula approach, the argument names may not be `x` and `y`. For example, with bar plots, the argument name `height` is used instead of `y`. Refer to the help (e.g., `?barplot`) for each plotting function to see what arguments it accepts.

I tend to use the formula approach for feeding data to base R plotting functions, because it is more consistent across the different types of plots. In other words, I'm too lazy to look up what argument names to use with each plot type for the non-formula approach.

</details>
<br>

<details open><summary class='drop'>Create a Box Plot</summary>

Let's use base R functions to create a box plot of park visitors by year, for a subset of `viz_dat` that meets these criteria:

- **region** is Intermountain
- **unit_type** is National Monument
- **year** is 2000, 2005, 2010, or 2015 (each year will be a separate box plot)

We want to know how much spread there is (across park units) in the number of visitors to National Monuments of the Intermountain region. We also want to know if visitor counts differed much among the four years.

```{r base_boxplots, echo = TRUE, warning = FALSE, message = FALSE}
# Check out `?boxplot`

# Subset the data
subdat <- viz_dat %>%
  dplyr::filter(
    year %in% c(2000, 2005, 2010, 2015),
    unit_type == "National Monument",
    region == "Intermountain")

boxplot(visitors ~ year, data = subdat) # the basic boxplot, with data provided as a formula

# An alternative way to feed the data as a formula
# boxplot(subdat$visitors ~ subdat$year) # no need for `data` argument
```

This plot is okay, but even for a basic plot I would want to make the y-axis labels more understandable. Because the visitor numbers are large, `R` is using scientific notation for the y-axis labels. So instead of 200,000 we see 2e+05. Let's change that in the box plot and add plot and axes titles.

```{r base_boxplots2, echo = TRUE, warning = FALSE, message = FALSE}

# Save this boxplot to `p_box` so we can look at the underlying structure
p_box <- boxplot(visitors/1000 ~ year, # so now, 200,000 will be shown as 200 on the y-axis
        data = subdat,
        main = "National Monuments in the Intermountain Region", # plot title
        xlab = "Year", # x-axis title
        ylab = "Number of visitors, in 1000's") # y-axis title
```

Not bad for basic box plots. Now 
look at the data underlying the box plots. In `?boxplot`, the section titled 'Value' explains what these numbers mean. For any 

```{r str_base_boxplots2, echo = TRUE, warning = FALSE, message = FALSE}
p_box
```

From the boxplots and numbers, it looks like visits were generally down in 2010 and 2015 compared to 2000 and 2005. The spread of visitor counts across these 34 National Monuments is huge, though.

</details>
<br>

<details open><summary class='drop'>Graphical Parameters for Base R Plots</summary>

Type `?par` in the console to see the many graphical parameters (arguments) that can be used to modify base R plots. Unless you plan to use base R functions to create complex plots, you can probably get by with remembering a small handful of useful graphical parameters, such as `pch =` to change scatter plot point types, `col =` to set color, and a variety of `cex` arguments to change font size and point size.

Let's create a basic scatter plot and use some of these graphical parameters to gussy up the plot.

</details>
<br>

<details open><summary class='drop'>Create a scatter plot</summary>

<div class="alert alert-info">
  <strong>CHALLENGE:  Create a scatter plot showing the relationship between park visits (y-axis) and gas price (x-axis) for Arches National Park (unit code is ARCH).</strong>
  
  We want to know if annual visitor counts at Arches National Park is correlated with gas price (annual national average).
  
  The data for this plot should be the 16 records of data for Arches National Park (one data point per year). Assign this subset of data to the name 'arch_dat' because we will be using it again for later plots.
  
  Make these modifications to the default scatter plot:
  
  - Show visitor count as 1000's of visitors (i.e., divide the visitor count by 1000) as we did for the box plots.
  
  - Use the graphical parameter `pch` to change the point type from the default open circle to a solid circle (see `?points` to find the number corresponding to a filled circle. The examples at the end of `?points` show how to use `pch` as a plot argument)
  
  - Increase point size to 1.5 instead of the default of 1
  
  **Remember, the Internet is a great source of help when coding in R!**
  </div>

The final plot should show this relationship:

```{r base_scatter, echo = FALSE, message = FALSE, warning = FALSE}

# Subset the data
arch_dat <- subset(viz_dat, unit_code == "ARCH")

# Visitors vs. gas price
plot(visitors/1000 ~ gas_constant, 
     data = arch_dat, 
     pch = 19, # change point shape to a solid circle
     cex = 1.5, # increase point size
     main = "Annual Gas Prices vs Visitor Counts at Arches National Park, 2000 - 2015", 
     xlab = "Gas price (constant 2015 dollars/gallon)", 
     ylab = "Number of visitors, in 1000's")
```

<details><summary class='drop'>Answer to Challenge Question</summary>
```{r, echo = TRUE, eval = FALSE, message = FALSE, warning = FALSE}

# Subset the data
arch_dat <- subset(viz_dat, unit_code == "ARCH")

# Visitors vs. gas price
plot(visitors/1000 ~ gas_constant, 
     data = arch_dat, 
     pch = 19, # change point shape to a solid circle
     cex = 1.5, # increase point size
     main = "Annual Gas Prices vs Visitor Counts at Arches National Park, 2000 - 2015", 
     xlab = "Gas price (constant 2015 dollars/gallon)", 
     ylab = "Number of visitors, in 1000's")
```
</details>

This plot shows that annual visitor counts increased as gas prices increased. That's not the relationship we might have expected to see. But correlation is not causation, and the resolution of data influences the pattern we see. Maybe if we had used average June - July gas prices within 330 km of Arches National Park (instead of annual national averages) we might have seen a different relationship with visitor counts. Maybe, maybe not.

The scatter plot shows another interesting bit of information. What is that funny point that is marching to its own beat?! Gas price was low (~2.50 per gallon in 2015 dollars) and the visitor count was really high, ~ 1.4 million. A quick look at the data shows that this funny point was for the year 2015. 

It would be interesting to see if 2015 was a high visitor count year for other park units as well. We will save that question for ggplot.

For now, let's see if we gain any additional insight from plots showing trends in visits to Arches National Park and trends in national gas prices. Try creating these plots on your own by modifying the code slightly. 

```{r, echo = TRUE, warning = FALSE, message = FALSE}
# Visitors vs. year
plot(visitors/1000 ~ year, data = arch_dat, pch = 19, cex = 1.5, main = "Visits to Arches National Park, 2000 - 2015", xlab = "Year", ylab = "Number of visitors, in 1000's")
```

The plot above shows that from 2004 to 2015, visitor numbers at Arches National Park increased every year. Visitor counts REALLY jumped in 2014 and 2015.

```{r, echo = TRUE, warning = FALSE, message = FALSE}
# Gas price vs. year
plot(gas_constant ~ year, data = arch_dat, pch = 19, cex = 1.5, main = "Average National Gas Prices, 2000 - 2015", xlab = "Year", ylab = "Gas price (constant 2015 dollars/gallon)")
```

The plot above shows that gas prices have gone up and down between 2000 and 2015. There was a huge drop in gas price between 2008 and 2009 and again between 2014 and 2015.

These scatter plots give interesting insight on trends in national gas prices and trends in visits to Arches National Park. The pattern we saw in our initial scatter plot--positive correlation between annual visitor counts and national average gas price--was almost certainly a spurious correlation. It's important to look at data from many different directions and to not jump to conclusions based on a narrow exploration of the data.
</details>
<br>

<details open><summary class='drop'>Create a Line Graph</summary>

Line graphs are often used for showing changes in continuous data over time. Let's make one last plot in base R before we move on to `ggplot`.

<div class="alert alert-info">
  <strong> CHALLENGE: Re-create the plot of trends in visitor counts at Arches National Park, using a line graph instead of a scatter plot. If you're feeling ambitious, make it a line graph with over-plotted points.</strong>
</div>

The final plot (with over-plotted points) should show this relationship:

```{r base_line, echo = FALSE, message = FALSE, warning = FALSE}

# Line graph
plot(visitors/1000 ~ year, data = arch_dat, type = "o", main = "Visits to Arches National Park, 2000 - 2015", xlab = "Year", ylab = "Number of visitors, in 1000's") # can use `type = "o" to get line plots with points
```

<details><summary class='drop'>Answer to Challenge Question</summary>
```{r, echo = TRUE, eval = FALSE, message = FALSE, warning = FALSE}

# Line graph
plot(visitors/1000 ~ year, data = arch_dat, type = "o", main = "Visits to Arches National Park, 2000 - 2015", xlab = "Year", ylab = "Number of visitors, in 1000's") # can use `type = "o" to get line plots with points
```
</details>