session-05-presentation.qmd

---
title: Statistical programming
title-slide-attributes:
  data-background-image: mvtec-cover-statistical-programming-4x3.png
  data-background-size: contain
  data-background-opacity: "0.1"
subtitle: Inferential statistics
author: Marc Comas-Cufí
format: 
  revealjs:
    self-contained: true
    smaller: true
    logo: mvtec-cover-statistical-programming-4x3.png
fontsize: 12pt
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, fig.align = 'center', dev.args = list(bg = 'transparent'),
                      collapse=TRUE, comment = "#>")
colorize <- function(x, color) {
  if (knitr::is_latex_output()) {
    sprintf("\\textcolor{%s}{%s}", color, x)
  } else if (knitr::is_html_output()) {
    sprintf("<span style='color: %s;'>%s</span>", color, 
      x)
  } else x
}
library(tidyverse)
library(latex2exp)
library(lubridate)
library(egg)
library(broom)
library(car)
# install.packages('gisfski')
theme_set(theme_minimal())
```

## Today's session

```{r, echo=FALSE, results='asis'}
cat(readr::read_lines("session-05-content.md"), sep='\n')
```

# Statistical inference 

## Objective

Given a sample $X = \{x_1, \dots, x_N\}$, we are interested in analysing a model $\text{Model}(\theta)$, $\theta \in \Theta$, under the assumption that $x_i$'s are independent and identically distributed, i.e. $x_i \sim \text{Model}(\theta)$.

### Examples

* Given $X = \{x_1, \dots, x_N\}$ numeric, ... _study $E[x_i]$, assuming $x_i$'s follow the same numerical distribution $f(\theta)$_.
* Given $X = \{x_1, \dots, x_N\}$ binary 0/1, ... _study $E[x_i]$, assuming $x_i$'s follow a $Bin(n=1,\pi)$_.

## Types of inference

* __Confidence intervals__. Find $\theta_{-}$ and $\theta_{+}$ such that certain $\text{P}(\theta_{-} \leq \theta \leq \theta_{+})$ holds.
* __Hypothesis testing__. Is a certain hypothesis, $H_0$, about $\text{Model}(\theta)$ plausible?


# Confidence intervals

## About confidence intervals {.smaller}

* Confidence interval are regions where we think certain parameters $\theta$ is. 
* For a level of confidence $1-\alpha$ (we should think $\alpha$ small), we talk about $(1-\alpha)100\%$ confidence interval (CI) for parameter $\theta$.
* A $(1-\alpha)100\%$ CI for $\theta$ is a region between $\theta_{-}$ and $\theta_{+}$ such that
$$
\text{P}(\theta_{-} \leq \theta \leq \theta_{+}) = 1-\alpha.
$$
* We can have confidence intervals for different types of parameters $\theta$:
    * For the expected value of a random variable: __mean__, $\mu$, if continuous or __proportion__, $\pi$, if dichotomous (binomial with $n=1$).
    * For coefficients/parameters of a model.

## Confidence intervals for proportions {.smaller background-color=#CCE5FF}

Let's assume $X \sim Bin(n=1, \pi)$, for example:

```{r, echo=TRUE}
X1 = c(0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0)
```

* What is our best guess for $\pi$?
* How can we measure our uncertainty about $\pi$?

Suppose we have:

```{r, echo=TRUE}
X2 = c(0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0)
```

* If we decide to use this new sample, how it will affect to our uncertainty of $\pi$?

<!--

## Maximum likelihood

* If $X \sim Bin(n=1, \pi)$ we have,

$$
\text{P}(X = 1) = \pi \text{ and } \text{P}(X=0) = 1-\pi
$$

in a more compact way $\text{P}(X = x) = \pi^x (1-\pi)^{1-x}$.

* For an independent sample $x_1, \dots, x_n$, the likelihood of $\pi given the sample is 
$$
\mathcal{L}\left(\pi | x_1,\dots,x_n\right) = \prod_{i=1}^n \text{P}(X = x_i) = \prod_{i=1}^n \pi^{x_i} (1-\pi)^{1-{x_i}}.
$$

* We can obtain the "best" parameter by maximising $\mathcal{L}\left(\pi | x_1,\dots,x_n\right)$ with respect $\pi$.

-->

## Bootstrap {.smaller}

* Bootstrap is a simple and powerful tool that, among others, is useful to build confidence intervals for any type of parameter.
* The method consists in resampling from the original sample multiple times __using replacement__.

```{r, echo=TRUE}
v_proportions = replicate(10000, mean(sample(X1, length(X1), replace = TRUE)))

# Summary the vector of means
summary(v_proportions)
```

* A 95%-CI for the proportion using sample `X1` would be:

```{r, echo=TRUE}
alpha = 0.05
quantile(v_proportions, c(alpha/2, 1-alpha/2))
```

## Asymptotic results

If $X \sim Bin(n=1, \pi)$ and $N$ is high enough,
$$
\bar{X} \sim N(\mathbb{E}[X], \sqrt{\text{var}[X]/N}) \iff \pi = \mathbb{E}[X] \sim N(\bar{X}, \sqrt{\pi (1-\pi)/N}).
$$

* Assuming $\pi (1-\pi) \approx \bar{X} (1-\bar{X})$, we know how $\pi$ is distributed,
$$
\pi = \mathbb{E}[X] \sim N(\bar{X}, \sqrt{\bar{X} (1-\bar{X})/N})
$$

* A 95%-CI for the proportion using sample `X1` would be:

```{r, echo=TRUE}
p = mean(X1)
N = length(X1)
qnorm(c(alpha/2, 1-alpha/2), p, sqrt(p*(1-p)/N))

# Equivalently,
# p + qnorm(c(alpha/2, 1-alpha/2)) * sqrt(p*(1-p)/N) or
# p + c(1,-1) * qnorm(alpha/2) * sqrt(p*(1-p)/N)
```


## Sample size and uncertainty

* The higher the sample the lower the uncertainty. $N_1 = 20$, $N_2=100$.

#### Bootstrap

```{r, echo=TRUE}
replicate(10000, mean(sample(X1, length(X1), replace = TRUE))) %>%
  quantile(c(alpha/2, 1-alpha/2))
replicate(10000, mean(sample(X2, length(X2), replace = TRUE))) %>%
  quantile(c(alpha/2, 1-alpha/2))
```

#### Asymptotic aproximation

```{r, echo=TRUE}
N1 = length(X1); N2 = length(X2)
p1 = mean(X1); p2 = mean(X2)
qnorm(c(alpha/2, 1-alpha/2), p1, sqrt(p1*(1-p1)/N1))
qnorm(c(alpha/2, 1-alpha/2), p2, sqrt(p2*(1-p2)/N2))
```

<!--

 ## Activity: Visualizing uncertainty

* Taking into an account the generating process, what visualization do you prefer?

```{r, echo=TRUE}
df = map_df(LETTERS[1:10], ~tibble(group = .x, p = mean(rbinom(100, 1, 0.25))))
```

```{r, fig.height=4.2, fig.width=7}
p1 = ggplot(data=df) +
  geom_point(aes(y = group, x = p), size = 2) +
  labs(x = '', y = 'Some groups', title = 'Proportions of outcomes in certain groups', subtitle = 'Unordered proportions')

p2 = ggplot(data=mutate(df, group = fct_reorder(group, p))) +
  geom_point(aes(y = group, x = p), size = 2) +
  labs(x = '', y = '', subtitle = 'Ordered proportions')

p3 = ggplot(data=mutate(df, group = fct_reorder(group, p))) +
  geom_segment(aes(y = group, yend = group, x = p-1.96*sqrt(p*(1-p)/50), xend = p+1.96*sqrt(p*(1-p)/50))) +
  geom_point(aes(y = group, x = p), size = 2) +
  labs(x = 'Some outcome', y = 'Some groups', subtitle = 'Ordered proportions with 95% CI')

p4 = ggplot(data=df) +
  geom_segment(aes(y = group, yend = group, x = p-1.96*sqrt(p*(1-p)/50), xend = p+1.96*sqrt(p*(1-p)/50))) +
  geom_point(aes(y = group, x = p), size = 2) +
  labs(x = 'Some outcome', y = '', subtitle = 'Unordered proportions with 95% CI')

ggarrange(p1, p2, p3, p4, nrow = 2)
```


-->

## Confidence interval for means

Given a numeric sample:

```{r, echo=TRUE}
X = c(11.2, 10.28, 10.24, 10.69, 8.34, 11.06, 9.53, 10.6, 8.78, 9.16,
      9.52, 10.63, 10.18, 9.06, 9.87, 11.18, 9.91, 9.26, 9.25, 10.35,
      9.52, 10.61, 9.83, 10.14, 9.42, 8.82, 8.84, 9.65, 11.09, 9.49,
      10.03, 10.59, 10.64, 10.74, 10.59, 9.04, 8.52, 9.83, 9.62, 9.45,
      10.83, 9.65, 10.37, 10.38, 9.66, 10.45, 9.99, 11.11, 10.47, 9.41)
```

* Bootstrap approach

```{r, echo=TRUE}
N = length(X)
x_means = replicate(10000, mean(sample(X, N, replace = TRUE)))
quantile(x_means, c(alpha/2, 1-alpha/2))
```

* Asymptotic results (sampling distribution)

```{r, echo=TRUE}
mean(X) + qt(c(alpha/2, 1-alpha/2), 49) * sqrt(var(X)/N)
```

```{r, eval=FALSE, include=FALSE}
df_mu = tibble(
  mu = seq(8,12,length = 1000),
  vers = sapply(mu, function(mu) prod(dnorm(X, mu, 1)))
)
ggplot(data=df_mu) +
  geom_line(aes(x = mu, y = vers))
mu_opt = slice_max(df_mu, vers) %>% pull(mu)
df_sigma = tibble(
  sigma = seq(0.5, 1, length=1000),
  vers = sapply(sigma, function(sigma) prod(dnorm(X, mu_opt, sigma)))
)
ggplot(data=df_sigma) +
  geom_line(aes(x = sigma, y = vers))
sigma_opt = slice_max(df_sigma, vers) %>% pull(sigma)
mu_opt
mean(X)
sigma_opt
sd(X)
```

## `diamonds` activity {background-color=#CCE5FF}

Explain the price's of the diamonds using information from the other variables. What are the more relevant variables to explain the price? 

```{r}
data(diamonds, package = 'ggplot2')
diamonds
```

* `diamonds` is a well-known dataset and analyzed on many websites through the internet. You can borrow ideas from there.
* If you want, you can use either `quarto` o `Rmarkdown` to generate your document.

# Hypothesis testing

## About hypothesis testing

* Hypothesis testing is a process for which __after assuming a certain hypothesis__ we associate our sample to the realization of a certain random variable.
    * Step 1. __Set a hypothesis__ $H_0$ about the distribution generating $X$.
    * Step 2. __Collect__ a sample $X$.
    * Step 3. __Compute a test statistics from sample__ $X$, for which we know its distribution. 
    * Step 4. __Calculate the probability__, $p$, to obtain test statistics associated to samples that are less likely to hold hypothesis $H_0$. The probability $p$ is called $p$-value.
    * Step 5. __Decide__ about the credibility of $H_0$ using probability $p$.

## Hypothesis testing for the mean

* Step 1. $H_0: \mu =$ 10
* Step 2. Collect a sample

```{r x_num, echo=TRUE}
X_num = c(10.2, 8.8, 11.2, 9, 9.6, 10.3, 10.2, 13.2, 12.5, 7)
```

```{r}
N = length(X_num)
tstat = round((mean(X_num) - 10)/(sd(X_num)/sqrt(N)), 2)
```

* Step 3. Compute the test statistic $t = (\bar{x} - \mu)/(s_x/\sqrt{N}) = `r tstat`$.
    * If $H_0$ is true, $t$ is a realization of $t_{N-1}$.
* Step 4. The probability of obtaining "rarer" test statistics is given by the red area:

```{r t_test_anim, fig.height=1.8, out.width="90%", animation.hook="gifski", include=FALSE}
t_dist = tibble(
  x = seq(-3.5, 3.5, 0.01),
  fx = dt(x, 10-1))

p1 = ggplot() +
  geom_area(data=t_dist, aes(x=x, y = fx), col = 'grey', alpha = 0.8, fill = 'skyblue') +
  geom_segment(aes(x=tstat, xend = tstat, y = 0, yend = dt(tstat, 9))) +
  annotate("text", x = tstat, y = -0.025, label = sprintf("t=%0.2f",tstat), size = 3) +
  
  labs(x = '', y = 'density', col = '') +
  scale_color_manual(values = c('blue', 'red')) +
  theme(legend.position = 'top') +
  scale_x_continuous(breaks = seq(-3, 3, 1)) +
  scale_y_continuous(limits = c(NA,0.5)) +
  theme(axis.title.x = element_blank())

p2 = p1 +
  geom_area(data=filter(t_dist, x > tstat), aes(x=x, y = fx), alpha = 0.4, fill = 'red')

p3 = p2 +
  geom_area(data=filter(t_dist, x < -tstat), aes(x=x, y = fx), alpha = 0.4, fill = 'red') +
  annotate("text", x = -tstat, y = -0.025, label = sprintf("-%0.2f",tstat), col = 'red', size = 3)

p1
p2
p3
p3
```

```{r, fig.height=1.8, out.width="90%"}
p1
```

* Step 5. Is it rare an event that rarer events happen with probability  `r round(2*(1-pt(tstat, 9)), 4)`? 

## Hypothesis testing for the mean

* Step 1. $H_0: \mu =$ 10
* Step 2. Collect a sample

```{r, ref.label='x_num', echo=TRUE}
```


* Step 3. Compute the test statistic $t = (\bar{x} - \mu)/(s_x/\sqrt{N}) = `r tstat`$.
    * If $H_0$ is true, $t$ is a realization of $t_{N-1}$.
* Step 4. The probability of obtaining "rarer" test statistics is given by the red area:

```{r, fig.height=1.8, out.width="90%"}
p2
```

* Step 5. Is it rare an event that rarer events happen with probability  `r round(2*(1-pt(tstat, 9)), 4)`? 

## Hypothesis testing for the mean

* Step 1. $H_0: \mu =$ 10
* Step 2. Collect a sample

```{r, ref.label='x_num', echo=TRUE}
```

* Step 3. Compute the test statistic $t = (\bar{x} - \mu)/(s_x/\sqrt{N}) = `r tstat`$.
    * If $H_0$ is true, $t$ is a realization of $t_{N-1}$.
* Step 4. The probability of obtaining "rarer" test statistics is given by the red area:

```{r, fig.height=1.8, out.width="90%"}
p3
```

* Step 5. Is it rare an event that rarer events happen with probability  `r round(2*(1-pt(tstat, 9)), 4)`? 

## Hypothesis testing for the mean

* Step 1. $H_0: \mu =$ `r colorize("9", "blue")`
* Step 2. Collect a sample

```{r, ref.label='x_num', echo=TRUE}
```

```{r}
tstat = round((mean(X_num) - 9)/(sd(X_num)/sqrt(N)), 2)
```

* Step 3. Compute the test statistic $t = (\bar{x} - \mu)/(s_x/\sqrt{N}) =$ `r colorize(tstat, 'blue')`.
    * If $H_0$ is true, $t$ is a realization of $t_{N-1}$.
* Step 4. The probability of obtaining "rarer" test statistics is given by the red area:

```{r, ref.label='t_test_anim', fig.height=1.8, out.width="90%", animation.hook="gifski"}
```

* Step 5. Is it rare an event that rarer events happen with probability  `r colorize(round(2*(1-pt(tstat, 9)), 4), 'blue')`? 

## Statistical significance

We say that a result has statistical significance when the result is very unlikely to be occurred given the null hypothesis. Before a hypothesis test is done, a significance level, $\alpha$, is set. Usually $\alpha=0.05$.

* So in a hypothesis test is critical to know what is the null hypothesis, and
* what are the assumptions to assert that the test statistic follows a certain distribution.

## Hypothesis testing for proportions

```{r}
p0 = 0.4
```

* Step 1. $H_0: \pi = `r p0`$. Is it reasonable to assume $\pi = `r p0`$?
* Step 2. Given a dichotomous sample $X=\{x_1, \dots, x_{20}\}$:

```{r, echo=TRUE}
X_bin = c(0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0)
```

```{r}
N = length(X_bin)
n = sum(X_bin)
```

* Step 3. $\sum_i x_i = `r sum(X_bin)` \sim Bin(n=N, \pi)$
* Step 4. The probability of obtaining “rarer” test statistics is given by the red area:

```{r, fig.height=2.2, out.width="90%", animation.hook="gifski"}
pmf = tibble(
  X = 0:20,
  Probability = dbinom(X, size = 20, prob = p0),
  col1 = X == n,
  col2 = Probability <= dbinom(n, size = 20, prob = p0)
)
p1 = ggplot(data = pmf) +
  geom_bar(aes(x=X, y = Probability, fill = col1), col = 'black', width = 1, stat = 'identity', alpha = 0.8) + 
  labs(x = '') +
  scale_y_continuous(limits = c(0,0.2)) +
  scale_x_continuous(breaks = 0:20, minor_breaks = NULL, labels = sprintf("X=%d", 0:20)) +
  scale_fill_manual(guide = 'none', values = c('skyblue', 'red')) +
  theme(axis.text.x = element_text(size = 7), axis.title.x = element_blank()) 

p2 = p1 +
  geom_hline(yintercept = dbinom(n, size = 20, prob = p0), linetype = 'dotted')

p3 = ggplot(data = pmf) +
  geom_bar(aes(x=X, y = Probability, fill = col2), col = 'black', width = 1, stat = 'identity', alpha = 0.8) + 
  geom_hline(yintercept = dbinom(n, size = 20, prob = p0), linetype = 'dotted') +
  labs(x = '') +
  scale_y_continuous(limits = c(0,0.2)) +
  scale_x_continuous(breaks = 0:20, minor_breaks = NULL, labels = sprintf("X=%d", 0:20)) +
  scale_fill_manual(guide = 'none', values = c('skyblue', 'red')) +
  theme(axis.text.x = element_text(size = 7), axis.title.x = element_blank())

p1
p2
p3
p3
```

* Step 5. Is it rare an event that rarer events happen with probability  `r round(sum(filter(pmf, col2)$Probability), 4)`? 

## One sample t-test with R

* $H_0: \mu = 10$

```{r, echo=TRUE}
t.test(X_num, mu = 10)
```

## One sample t-test with R

* $H_0: \mu = 9$

```{r, echo=TRUE}
t.test(X_num, mu = 9)
```

## Exact binomial test with R

* $H_0: \pi = 0.4$

```{r, echo=TRUE}
binom.test(sum(X_bin), n = length(X_bin), p = 0.4)
```


## Comparing group means (1)

* Step 1.

```{r, echo=TRUE}
data = tibble(
  X = c(80.8, 78.2, 81, 79.9, 80.7, 79.3, 78.7, 79.6, 79.5, 79.3,  # <- A
        81.8, 81.4, 78.3, 79.4, 82.1, 81.3, 79, 80.5, 81.1, 80.2), # <- B
  G = c(rep('A', 10), rep('B', 10))
)
```

* Step 2. $H_0: \mu_A = \mu_B$

```{r, fig.height=2.5}
ggplot() +
  geom_boxplot(data=data, aes(x = G, y = X))
```

## Comparing group means (2)

* Step 3, 4, 5. Test statistics and p-values are easily calculated with R.

```{r, echo=TRUE}
t.test(X~G, data = data)
```

## Independence tests

* $H_0$ when comparing the means of two groups are we are deciding whether the expected value of a numerical variable is independent of a binary group variable.

Depending on the nature of the two variables, other tests exist to contrast their independence.

* Between categorical variables: Fisher exact test, or chi-squared contingency table test if Fisher does not work.
  
```{r, eval=FALSE, echo=TRUE}
fisher.test(table(X,Y)) # chisq.test(table(X,Y))
```

* Between numerical variable: test for Association/Correlation Between Paired Samples.


```{r, eval=FALSE, echo=TRUE}
cor.test(X,Y)
```
    
* Between categorical ($X$) and numerical ($Y$) variables: ANOVA test.

```{r, eval=FALSE, echo=TRUE}
summary(aov(Y~X))
```
    
<!-- ## Activity: Comparing group means (1)

* After downloading data from [https://covid.ourworldindata.org/data/owid-covid-data.csv](https://covid.ourworldindata.org/data/owid-covid-data.csv).

```{r}
if(!file.exists("owid-covid-data.csv")){
download.file("https://covid.ourworldindata.org/data/owid-covid-data.csv",
              destfile = "owid-covid-data.csv")
}
cols_ = cols(
  .default = col_double(),
  iso_code = col_character(),
  continent = col_character(),
  location = col_character(),
  date = col_date(format = ""),
  tests_units = col_character()
)
data = read_csv("owid-covid-data.csv", col_types = cols_)
```

* For each country, we construct a new variable called `rate`:

```{r, echo=TRUE}
data = data %>%
  group_by(location) %>%
  mutate(
    new_cases = replace_na(new_cases, 0),
    acc_cases = cumsum(new_cases),
    active_cases = lag(acc_cases) - lag(acc_cases, 15)
  ) %>%
  mutate(
    rate = new_cases/active_cases
  ) %>%
  select(location, date, rate)
```

* `rate` measures how many new cases do we have in comparison to the number of new cases detected the last two weeks.

-->

<!-- ## Activity: Comparing group means (2)

* We only keep observations from Germany and France, and from April to October:

```{r, echo=TRUE}
data = data %>%
  mutate(month = month(date)) %>%
  filter(location %in% c('Germany', 'France'), 
         year(date) == 2020, 4 <= month, month <= 10) %>%
  mutate(month = factor(month(date), labels = month.abb[4:10]))
```

* To detect differences between countries, we decide to do a two-sample t-test in each month:

```{r, echo=TRUE}
dttest = data %>%
  group_by(month) %>%
  nest() %>%
  mutate(
    t_test = map(data, ~t.test(rate~location, data=.x)),
    map_df(t_test, broom::tidy))
```

-->

<!-- ## Activity: Comparing group means (3)

* To visualize the comparison between countries, we try three different approaches.

```{r}
p1 = ggplot() +
  geom_boxplot(data=data, aes(x = month, y = rate, fill = location)) +
  labs(y = 'Infectiuous rate', x = '', fill = 'Country') + 
  theme(axis.text.x = element_blank(), legend.position = 'top')

# p1 = ggplot() +
#   geom_boxplot(data=filter(data, rate > 0), aes(x = month, y = rate, fill = location)) +
#   labs(y = 'Infectiuous rate', x = '', fill = 'Country') + 
#   scale_y_continuous(trans = scales::logit_trans()) +
#   theme(axis.text.x = element_blank(), legend.position = 'top')

p2 = ggplot() +
  geom_point(data=dttest, aes(x = month, y = estimate)) +
  geom_segment(data=dttest, aes(x = month, xend = month, y = conf.low, yend = conf.high)) +
  geom_hline(yintercept = 0, linetype = 'dotted') +
  labs(y = TeX('$r_{France}-r_{Germany}$'), x = '') +
  theme(axis.text.x = element_blank())

p3 = ggplot() +
  geom_point(data=dttest, aes(x = month, y = statistic, col = p.value < 0.05)) +
  geom_hline(yintercept = 0) +
  geom_segment(data=dttest, aes(x = month, xend = month, y = 0, yend = statistic, col = p.value < 0.05)) +
  labs(y = 't-statistic', x = '') + 
  scale_color_manual(guide = 'none', values = c('black', 'red'))

ggarrange(p1, p2, p3, ncol = 1)
```

-->

<!-- ## Activity: Comparing group means (4)

* Comment about the pros and cons of this three visualizations.
* Which one do you prefer?
* Think about another visualization?
* Draw some conclusions from this comparison.

 # Other tests -->

## Categorical distribution

**Pearson's chi-squared test**

* $H_0$: $X \sim Cat(\boldsymbol\pi = (\pi_1, \dots, \pi_k))$.
* Test statistic:
$$
\chi^2 = \sum_{i=1}^k \frac{(O_i-E_i)^2}{E_i} \sim {\chi^2}_{k-1}
$$
where $O_i$ is the number of times $i$ was observed and $E_i = n \times \pi_i$.

```{r, fig.height=2.5}
chi_square_test = function(nsample, prob){
  categories = LETTERS[1:length(prob)]
  X = sample(categories, size = nsample, replace = TRUE, prob = prob)
  O = sapply(categories, function(cat) sum(X == cat))
  E = nsample * prob
  sum((O-E)^2/E)
}

prob = c(0.2,0.2,0.2,0.2,0.2)
chi = tibble(x = replicate(1000, chi_square_test(100, prob)))
chi_dist = tibble(x = seq(0, 15, 0.01),
                  f = dchisq(x, length(prob)-1))

ggplot() +
  geom_histogram(data=chi, aes(x=x, y=..density..), bins = 50) +
  geom_line(data=chi_dist, aes(x=x, y=f), col = 'blue', size = 1) +
  labs(x=TeX("$\\chi^2$"), title = TeX("Distribution of $\\chi^2$ statistics simulated from Cat($\\pi = (.2,0.2,0.2,0.2,0.2)$)"))
```

<!--

## A/B testing

```{r, include=FALSE, eval=FALSE}
cookie_cats %>% filter(sum_gamerounds > 0) %>% do(broom::tidy(prop.test(table(.$version, .$retention_7))))
```


[Mobile Game Retention Analysis: A/B Testing](https://www.kaggle.com/code/jyunyolin/mobile-game-retention-analysis-a-b-testing/notebook)

-->

## Gaussian distribution

**Shapiro-Wilk test**

* $H_0$: $X \sim N(\mu,\sigma)$
* Test statistic:
$$
W = \frac{(\sum_{i=1}^n a_i x_{[i]})^2}{(\sum_{i=1}^n (x_i - \bar{x})^2)^2} \sim f_{W}
$$
where $a_i$'s are certain constants, $x_{[i]}$ is the $i$-th smallest observation in $X$ and $f$ is a probability distribution for r.v. $W$.

```{r, echo=TRUE}
set.seed(1)
x = rnorm(100)
shapiro.test(x)
```

## Comparing group variances (homoscedasticity)

For $X$ categorical and $Y$ numerical.

* $H_0$: $\text{Var}(Y | X)$ is constant.

```{r, eval=FALSE, echo=TRUE}
bartlett.test(Y~X)
```

* Bartlett's test assumes normality. For non-normal variables better to use the Levene test (available in package `car`)

```{r, eval=FALSE, echo=TRUE}
car::leveneTest(Y~X)
```

## `broom` Statistical Objects into Tibbles

```{r, echo=TRUE}
binom.test(c(682, 243), p = 3/4)
```

With `broom`:

```{r, echo=TRUE}
library(broom)
tidy(binom.test(c(682, 243), p = 3/4))
```

...

```{r, echo=TRUE}
km_ = kmeans(iris[1:4], 2, nstart = 100)
tidy(km_)
glance(km_)
# augment(km_, iris[1:4])
```

# That's all for today

## Next week session

__Overview Data Science and Data preprocessing__ with _Karina Gibert_.