\cleardoublepage
# (APPENDIX) Appendix {-}
# Statistical Background {#appendixA}
```{r setup_appA, include=FALSE}
chap <- "A"
lc <- 0
rq <- 0
# **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`**
# **`r paste0("(RQ", chap, ".", (rq <- rq + 1), ")")`**
knitr::opts_chunk$set(
tidy = FALSE,
out.width = '\\textwidth',
fig.height = 4,
warning = FALSE
)
options(scipen = 99, digits = 3)
# Set random number generator seed value for replicable pseudorandomness.
set.seed(76)
```
## Basic statistical terms {#appendix-stat-terms}
Note that all the following statistical terms apply only to *numerical* variables, except for the *distribution*, which exists for both numerical and categorical variables.
### Mean
The *mean* is the most commonly reported measure of center. It is commonly called the *average*, though this term can be a little ambiguous. The mean is the sum of all of the data elements divided by how many elements there are. If we have $n$ data points, the mean is given by:
$$Mean = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
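As a small illustration (the numbers here are made up for demonstration), we can apply this formula by hand and compare it to R's built-in `mean()` function:

```{r mean-example-appA}
x <- c(3, 5, 7, 9, 12)
sum(x) / length(x)  # (3 + 5 + 7 + 9 + 12) / 5 = 7.2
mean(x)             # same result: 7.2
```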
### Median
The median is calculated by first sorting a variable's data from smallest to largest. After sorting the data, the middle element in the list is the *median*. If the middle falls between two values, then the median is the mean of those two middle values.
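For example, using R's `median()` function on two small made-up vectors, one with an odd and one with an even number of values:

```{r median-example-appA}
median(c(3, 5, 7, 9, 12))  # odd number of values: the middle value, 7
median(c(3, 5, 7, 9))      # even number of values: the mean of 5 and 7, i.e. 6
```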
### Standard deviation
We will next discuss the *standard deviation* ($sd$) of a variable. The formula can be a little intimidating at first, but it is important to remember that it is essentially a measure of how far we expect a given data value to be from the mean:
$$sd = \sqrt{\frac{(x_1 - Mean)^2 + (x_2 - Mean)^2 + \cdots + (x_n - Mean)^2}{n - 1}}$$
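Applying this formula by hand to the same kind of small made-up vector, and comparing with R's built-in `sd()` function:

```{r sd-example-appA}
x <- c(3, 5, 7, 9, 12)
sqrt(sum((x - mean(x))^2) / (length(x) - 1))  # approximately 3.49
sd(x)                                         # same result
```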
### Five-number summary
The *five-number summary* consists of five summary statistics: the minimum, the first quartile AKA the 25^th^ percentile, the second quartile AKA the median or 50^th^ percentile, the third quartile AKA the 75^th^ percentile, and the maximum. The five-number summary of a variable is used when constructing boxplots, as seen in Section \@ref(boxplots).
The quartiles are calculated as
- first quartile ($Q_1$): the median of the first half of the sorted data
- third quartile ($Q_3$): the median of the second half of the sorted data
The *interquartile range (IQR)*\index{interquartile range (IQR)} is defined as $Q_3 - Q_1$ and is a measure of how spread out the middle 50% of values are. The IQR corresponds to the length of the box in a boxplot.
The median and the interquartile range are not influenced by the presence of outliers in the ways that the mean and standard deviation are. They are, thus, recommended for skewed datasets. We say in this case that the median and interquartile range are more *robust to outliers*.
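In R, `quantile()` and `IQR()` compute these summaries. Note that R's default method interpolates between sorted values, so its quartiles can differ slightly from the median-of-each-half description above; the vector here is made up for illustration:

```{r fivenum-example-appA}
x <- c(2, 4, 5, 7, 8, 11, 30)
quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))  # the five-number summary
IQR(x)                                         # Q3 - Q1 = 9.5 - 4.5 = 5
```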
### Distribution
The *distribution* of a variable shows how frequently different values of a variable occur. Looking at the visualization of a distribution can show where the values are centered, show how the values vary, and give some information about where a typical value might fall. It can also alert you to the presence of outliers.
Recall from Chapter \@ref(viz) that we can visualize the distribution of a numerical variable using binning in a histogram and that we can visualize the distribution of a categorical variable using a barplot.
### Outliers
*Outliers* correspond to values in the dataset that fall far outside the range of "ordinary" values. In the context of a boxplot, by default they correspond to values below $Q_1 - (1.5 \cdot IQR)$ or above $Q_3 + (1.5 \cdot IQR)$.
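We can apply this $1.5 \cdot IQR$ rule in R to a made-up vector (again using `quantile()`'s default interpolated quartiles):

```{r outlier-example-appA}
x <- c(2, 4, 5, 7, 8, 11, 30)
q1 <- unname(quantile(x, 0.25))  # 4.5
q3 <- unname(quantile(x, 0.75))  # 9.5
iqr <- q3 - q1                   # 5
# Values below q1 - 1.5 * iqr or above q3 + 1.5 * iqr are flagged as outliers.
x[x < q1 - 1.5 * iqr | x > q3 + 1.5 * iqr]  # 30 falls above 9.5 + 7.5 = 17
```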
## Normal distribution {#appendix-normal-curve}
Let's next discuss one particular kind of distribution: \index{distribution!normal} *normal distributions*. Such bell-shaped distributions are defined by two values: 1) the *mean* $\mu$ ("mu") which locates the center of the distribution and 2) the *standard deviation* $\sigma$ ("sigma") which determines the variation of the distribution. In Figure \@ref(fig:normal-curves), we plot three normal distributions where:
1. The solid normal curve has mean $\mu = 5$ \& standard deviation $\sigma = 2$.
1. The dashed normal curve has mean $\mu = 5$ \& standard deviation $\sigma = 5$.
1. The dotted normal curve has mean $\mu = 15$ \& standard deviation $\sigma = 2$.
```{r normal-curves, echo=FALSE, fig.cap="Three normal distributions.", purl=FALSE, out.width="80%"}
library(tidyverse)
all_points <- tibble(
domain = seq(from = -10, to = 25, by = 0.01),
`mu = 5, sigma = 2` = dnorm(x = domain, mean = 5, sd = 2),
`mu = 5, sigma = 5` = dnorm(x = domain, mean = 5, sd = 5),
`mu = 15, sigma = 2` = dnorm(x = domain, mean = 15, sd = 2)
) %>%
gather(key = "Distribution", value = "value", - domain) %>%
mutate(
Distribution = factor(
Distribution,
levels = c("mu = 5, sigma = 2",
"mu = 5, sigma = 5",
"mu = 15, sigma = 2")
)
)
for_labels <- all_points %>%
filter(between(domain, 3.795, 3.805) & Distribution == "mu = 5, sigma = 2" |
between(domain, 0.005, 0.0105) & Distribution == "mu = 5, sigma = 5" |
between(domain, 16.005, 16.015) & Distribution == "mu = 15, sigma = 2")
all_points %>%
ggplot(aes(x = domain, y = value, linetype = Distribution)) +
geom_line() +
ggrepel::geom_label_repel(data = for_labels, aes(label = Distribution),
nudge_x = c(-1, -2.1, 1)) +
theme_light() +
theme(
axis.title.y = element_blank(),
axis.title.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
legend.position = "none"
)
```
Notice how the solid and dashed line normal curves have the same center due to their common mean $\mu$ = 5. However, the dashed line normal curve is wider due to its larger standard deviation of $\sigma$ = 5. On the other hand, the solid and dotted line normal curves have the same variation due to their common standard deviation $\sigma$ = 2. However, they are centered at different locations.
When the mean $\mu$ = 0 and the standard deviation $\sigma$ = 1, the normal distribution has a special name: the *standard normal distribution* or the *$z$-curve*\index{distribution!standard normal}.
Furthermore, if a variable follows a normal curve, there are *three rules of thumb* we can use:
1. 68% of values will lie within $\pm$ 1 standard deviation of the mean.
1. 95% of values will lie within $\pm$ 1.96 $\approx$ 2 standard deviations of the mean.
1. 99.7% of values will lie within $\pm$ 3 standard deviations of the mean.
Let's illustrate this on a standard normal curve in Figure \@ref(fig:normal-rule-of-thumb). The dashed lines are at -3, -1.96, -1, 0, 1, 1.96, and 3. These 7 lines cut up the x-axis into 8 segments. The areas under the normal curve for each of the 8 segments are marked and add up to 100%. For example:
1. The middle two segments represent the interval -1 to 1. The shaded area above this interval represents 34% + 34% = 68% of the area under the curve. In other words, 68% of values.
1. The middle four segments represent the interval -1.96 to 1.96. The shaded area above this interval represents 13.5% + 34% + 34% + 13.5% = 95% of the area under the curve. In other words, 95% of values.
1. The middle six segments represent the interval -3 to 3. The shaded area above this interval represents 2.35% + 13.5% + 34% + 34% + 13.5% + 2.35% = 99.7% of the area under the curve. In other words, 99.7% of values.
```{r normal-rule-of-thumb, echo=FALSE, fig.cap="Rules of thumb about areas under normal curves", purl=FALSE, out.width = "80%"}
shade_3_sd <- function(x) {
y <- dnorm(x, mean = 0, sd = 1)
y[x <= -3 | x >= 3] <- NA
return(y)
}
shade_2_sd <- function(x) {
y <- dnorm(x, mean = 0, sd = 1)
y[x <= -1.96 | x >= 1.96] <- NA
return(y)
}
shade_1_sd <- function(x) {
y <- dnorm(x, mean = 0, sd = 1)
y[x <= -1 | x >= 1] <- NA
return(y)
}
labels <- tibble(
x = c(-3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5),
label = c("0.15%", "2.35%", "13.5%", "34%", "34%", "13.5%", "2.35%", "0.15%")
) %>%
mutate(y = rep(0.3, times = n()))
ggplot(data = tibble(x = c(-4, 4)), aes(x)) +
geom_text(data = labels, aes(y=y, label = label)) +
# Trace normal curve
stat_function(fun = dnorm, args = list(mean = 0, sd = 1), n = 1000) +
# Shade and delineate +/- 3 SD
stat_function(fun = shade_3_sd, geom = "area", fill = "black", alpha = 0.25, n = 1000) +
# annotate(geom = "segment", x = c(3, -3), xend = c(3, -3), y = 0, yend = dnorm(3, mean = 0, sd = 1)) +
# Shade and delineate +/- 2 SD
stat_function(fun = shade_2_sd, geom = "area", fill = "black", alpha = 0.25, n = 1000) +
# annotate(geom = "segment", x = c(1.96, -1.96), xend = c(1.96, -1.96), y = 0, yend = dnorm(1.96, mean = 0, sd = 1)) +
# Shade and delineate +/- 1 SD
stat_function(fun = shade_1_sd, geom = "area", fill = "black", alpha = 0.25, n = 1000) +
# annotate(geom = "segment", x = c(1, -1), xend = c(1, -1), y = 0, yend = dnorm(1, mean = 0, sd = 1)) +
geom_vline(xintercept = c(-3, -1.96, -1, 0, 1, 1.96, 3), linetype = "dashed", alpha = 0.5) +
# Axes
scale_x_continuous(breaks = seq(from = -3, to = 3, by = 1)) +
labs(x = "z", y = "") +
theme(axis.title.y = element_blank(), axis.text.y = element_blank(), axis.ticks.y = element_blank())
```
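We can check these three rules of thumb against R's `pnorm()` function, which returns the area under the standard normal curve to the left of a given value:

```{r pnorm-example-appA}
pnorm(1) - pnorm(-1)        # area within 1 SD of the mean: about 0.683
pnorm(1.96) - pnorm(-1.96)  # area within 1.96 SDs of the mean: about 0.950
pnorm(3) - pnorm(-3)        # area within 3 SDs of the mean: about 0.997
```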
```{block, type='learncheck', purl=FALSE}
\vspace{-0.25in}
**_Learning check_**
\vspace{-0.25in}
```
<!--
Consider LC using this later on: <https://gallery.shinyapps.io/dist_calc/>
-->
Say you have a normal distribution with mean $\mu = 6$ and standard deviation $\sigma = 3$.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What proportion of the area under the normal curve is less than 3? Greater than 12? Between 0 and 12?
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** What is the 2.5th percentile of the area under the normal curve? The 95th percentile? The 100th percentile?
```{block, type='learncheck', purl=FALSE}
\vspace{-0.25in}
\vspace{-0.25in}
```
## log10 transformations {#appendix-log10-transformations}
At its simplest, log10 transformations return base 10 *logarithms*. For example, since $1000 = 10^3$, running `log10(1000)` returns `3` in R. To undo a log10-transformation, we raise 10 to this value. For example, to undo the previous log10-transformation and return the original value of 1000, we raise 10 to the power of 3 by running `10^3`, which returns `1000` in R. \index{log-transformations}
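For example, in R:

```{r log10-example-appA}
log10(1000)  # 3, since 1000 = 10^3
10^3         # 1000, undoing the log10-transformation
log10(76)    # 1.880814, between log10(10) = 1 and log10(100) = 2
```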
Log-transformations allow us to focus on changes in *orders of magnitude*. In other words, they allow us to focus on *multiplicative changes* instead of *additive ones*. Let's illustrate this idea in Table \@ref(tab:logten) with examples of prices of consumer goods in 2019 US dollars.
<!--
We can also frame such changes as being relative percentage increases/decreases instead of absolute increases/decreases.
-->
```{r logten, echo=FALSE}
tibble(Price = c(1,10,100,1000,10000,100000,1000000)) %>%
mutate(
`log10(Price)` = log10(Price),
Price = scales::dollar(Price),
`Order of magnitude` = c("Singles", "Tens", "Hundreds", "Thousands", "Tens of thousands", "Hundreds of thousands", "Millions"),
`Examples` = c("Cups of coffee", "Books", "Mobile phones", "High-definition TVs", "Cars", "Luxury cars & houses", "Luxury houses")
) %>%
knitr::kable(
digits = 3,
caption = "log10-transformed prices, orders of magnitude, and examples",
booktabs = TRUE
) %>%
kableExtra::kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16),
latex_options = c("hold_position"))
```
Let's make some remarks about log10-transformations based on Table \@ref(tab:logten):
1. When purchasing a cup of coffee, we tend to think of prices ranging in single dollars, such as \$2 or \$3. However, when purchasing a mobile phone, we don't tend to think of their prices in units of single dollars such as \$313 or \$727. Instead, we tend to think of their prices in units of hundreds of dollars like \$300 or \$700. Thus, cups of coffee and mobile phones are of different *orders of magnitude* of price.
1. Let's say we want to know the log10-transformed value of \$76. This would be hard to compute exactly without a calculator. However, since \$76 is between \$10 and \$100 and since log10(10) = 1 and log10(100) = 2, we know log10(76) will be between 1 and 2. In fact, log10(76) is 1.880814.
1. log10-transformations are *monotonic*, meaning they preserve orders. So if Price A is lower than Price B, then log10(Price A) will also be lower than log10(Price B).
1. Most importantly, increments of one on the log10 scale correspond to *relative multiplicative changes* in the original scale and not *absolute additive changes*. For example, increasing a log10(Price) from 2 to 3 corresponds to a multiplicative increase by a factor of 10: \$100 to \$1000.