binary-models.Rmd

---
title: "Models for Binary Data"
description: | 
  Logistic regression / item response models
date: 06-11-2021
author:
  - first_name: "Andrew"
    last_name: "Ellis"
    url: https://github.com/awellis
    affiliation: Cognitive psychology, perception & methods, Univerity of Bern 
    affiliation_url: https://www.kog.psy.unibe.ch
    orcid_id: 0000-0002-2788-936X

citation_url: https://awellis.github.io/learnmultilevelmodels/item-response-models.html
bibliography: ./bibliography.bib
output: 
    distill::distill_article:
      toc: true
      toc_float: true
      toc_depth: 2
      code_folding: false
      css: ./css/style.css
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
                      cache = TRUE,
                      warning = FALSE,
                      message = FALSE)

# recall <- bcogsci::df_recall_complete |> 
#   select(subj, session, block, trial, set_size, response, rt, correct, response_category) |> 
#   drop_na()
# recall |> write_csv(file = "data/recall-oberauer.csv")

```

```{r packages}
library(knitr)
library(kableExtra)
library(tidyverse)
library(patchwork)
library(brms)

# set ggplot theme
theme_set(theme_grey(base_size = 14) +
            theme(panel.grid = element_blank()))

# set rstan options
rstan::rstan_options(auto_write = TRUE)
options(mc.cores = 4)
```


# Logistic Regression

This example is based on[Bayesian Data Analysis for Cognitive Science](https://vasishth.github.io/bayescogsci/book/) and uses data from @oberauerWorkingMemoryCapacity2019.

In this study, the effect of word list length on working memory capacity was assessed. Subjects were shown word lists of varying length (set size: 2, 4, 6, or 8 words) and were required to recall the correct word according to its position in the list. A single trial is shown in Figure \@ref(fig:oberauer-2019).

```{r oberauer-2019, echo=FALSE, fig.cap="Subjects were required to indicate the correct word from the studied list according to its cued position."}
knitr::include_graphics("images/oberauer-2019.png")
```


```{r}
recall <- read_csv(file = "https://raw.githubusercontent.com/awellis/learnmultilevelmodels/main/data/recall-oberauer.csv")

recall <- recall |> 
  mutate(subj = as_factor(subj))
```

```{r}
glimpse(recall)
```


Count the number of subjects and number of trials per set size for each subjetc.

```{r}
recall |> distinct(subj) |> count()
recall |> count(subj, set_size)
```

The latter is equivalent to:

```{r}
recall |> group_by(subj, set_size) |>
  summarize(N = n())
```


Before fitting any models, we will center the `set_size` variable.

```{r}
recall <- recall |> 
  mutate(c_set_size = set_size - mean(set_size))
```


We are interested in subjects' probability of responding correctly based on the set size. First, we can get a maximum likelihood estimate:


```{r}
recall_sum <- recall %>% 
  group_by(subj, set_size) %>% 
  summarise(accuracy = mean(correct),
            sd = sd(correct))
```


```{r}
recall_sum |> 
  ggplot(aes(set_size, accuracy)) +
  geom_line() +
  geom_point() +
  facet_wrap(~subj)
```

## Likelihood


Instead of computing point estimates, we can model these data using a generalized linear model. We know that each response is either correct or an error, and thus the most sensible distribution is a Bernoulli.

$$ correct_i \sim Bernoulli(\theta_{i})$$

Each response $i$ is drawn from a Bernoulli distribution, with a probability of being correct of $\theta_i$. THe probability of an error response is  $1-\theta_i$.

We want to predict the probability of getting a correct response using a linear predictor. Since the linear predictor in on the real line, and the parameter we are trying to predict lies in the interval $[0, 1]$, we need to transform the linear predictor. This is conventionally written as

$$ g(\theta_i) = logit(\theta_i) = b_0 + b_{set\_size} \cdot set\_size_i $$
with the link function $g()$ being applied to the parameter of the likelihood distribution. An alternative way of writing this is

$$ \theta_i = g^{-1}(b_0 + b_{set\_size} \cdot set\_size_i) $$

where $g^{-1}$ is the inverse link function.


In a logistic regression, the link function is the `logit` function:  

$$ logit(\theta) = log \bigg( \frac{\theta}{(1-\theta)} \bigg) $$
and gives the log-odds, i.e. the natural logarithm of the odds of a success.

Therefore(for a single subject)

$$ log \bigg( \frac{\theta_i}{1-\theta_i} \bigg) = b_0 + b_{set\_size} \cdot set\_size_i $$
The coefficients $b_0$ and $b_{set\_size}$ have an additive effect on the log-odds scale. If we want this effect on the probability scale, we need to apply the inverse link function. This is is function $f(x) = 1/(1 + exp(-x))$, and is known a the logistic function. It is also the cumulative distribution function of the logistic distribution, and exist in R under the name `plogis()`.

For example, if the log-odds are 0.1, then the probability is 0.525:

```{r}
logodds <- 0.1
prob <- plogis(logodds)
prob
```


We can plot the log-odds on the x axis, and the cumulative distribution on the y axis. Is it noticeable that the log-odds with an absolute value greater than approximately 5 lead to probabilities of 0 and 1, asymptotically. This is relevant when considering prior distributions.


```{r}
d1 <- tibble(x = seq(-5, 5, by = 0.01),
            y = plogis(x))

d1 %>% 
    ggplot(aes(x, y)) +
    geom_hline(yintercept = 0.5, linetype = 3) +
    geom_vline(xintercept = 0, linetype = 3) +
    geom_line(size = 2, color = "steelblue") +
    xlab("log-odds") + ylab("probability of success")
```


## Prior distributions 

Since we centered our predictor variable `c_set_size`, the zero point represents the average set size. The intercept  will therefore represent the expected log-odds for the average set size. With minimal prior knowledge, we may assume that at the average set size, a subject may be equally likely to give a correct or error response. In that case, $\theta = 0.5$, and the log-odds are $0$. 

We can express our uncertainty using a normal distribution centred at 0, with a standard deviation of 1, i.e. we are 95% certain that the log-odds will lie between $[-2, 2]$.


```{r}
qnorm(c(.025, .975), mean = 0, sd = 1.0)
```

YOu can plot the prior by using random draws from a normal distribution. The figure on the left shows the normal distribution on the log-odds scale, and the figure on the right shows the transformed values, on the probability scale.

```{r}
library(patchwork)

samples <- tibble(Intercept = rnorm(1e5, 0, 1.0),
                  p = plogis(Intercept))

p_logodds <- samples %>% 
  ggplot(aes(Intercept)) +
  geom_density() +
  ggtitle("Log-odds")

p_prob <- samples %>% 
  ggplot(aes(p)) +
  geom_density() +
  ggtitle("Probability")


p_logodds + p_prob
```

Our `normal(0, 1.0)` prior is very similar to using a `Beta(2, 2)` on the probability scale..

```{r}
tibble(x = seq(0, 1, by = 0.01),
       dens =  dbeta(x, shape1 = 2, shape2 = 2)) %>% 
  ggplot(aes(x, dens)) +
  geom_line()
```


For the average effect of `set_size` we will choose a prior on the log-odds scale that is uninformative on the probability scale. A `normal(0, 0.5)` prior expresses the belief that the an increase of the set size by one unit will lead to an change in the log-odds of somewhere between $[-1, 1]$, with a probability of 95%. The figure on the right again shows the effect transformed onto the probability scale.

```{r}
samples2 <- tibble(b_set_size = rnorm(1e5, 0, 0.5),
                          p = plogis(b_set_size))

p_logodds <- samples2 %>% 
  ggplot(aes(b_set_size)) +
  geom_density() +
  ggtitle("Log-odds")

p_prob <- samples2 %>% 
  ggplot(aes(p)) +
  geom_density() +
  ggtitle("Probability")


p_logodds + p_prob
```


## Fitting models


We have set our priors on the average (population-level) effects of the set size.
Let's look at the default priors in a multilevel model

```{r}
library(brms)
```


```{r}
get_prior(correct ~ 1 + c_set_size + (1 + c_set_size | subj),
                family = bernoulli(link = logit),
                data = recall)
```


```{r}
priors <- prior(normal(0, 1.0), class = Intercept) +
    prior(normal(0, .1), class = b, coef = c_set_size)

fit_recall_1 <- brm(correct ~ 1 + c_set_size + (1 + c_set_size | subj),
                family = bernoulli(link = logit),
                prior = priors,
                data = recall,
                file = "models/fit_recall-1",
                file_refit = "on_change")
```


```{r}
fit_recall_1
```


```{r}
fit_recall_1 |> mcmc_plot()
```

We can plot the expected effect of set size on the probability of giving a correct response using `conditional_effects. The default shows the exptected value of the posterior predictive distribution.

```{r}
fit_recall_1 |> 
  conditional_effects()
```

If you want the expected log-odds, you can use the argument `method = 'posterior_linpred'`.

```{r}
fit_recall_1 |> 
  conditional_effects(method = 'posterior_linpred')
```


# Item Response Models

We will look at an example from @burknerBayesianItemResponse2020a. I would thoroughly recommend reading this if you are interested in logistics regression models, and Item Response Theory (IRT) models in particular.

IRT models are widely applied in the human sciences to model persons’ responses on a set of items measuring one or more latent constructs.


## Response distributions {#respdists}

The response format of the items will critically determine which distribution
is appropriate to model individuals' responses on the items. The possibility
of using a wide range of response distributions within the same framework and
estimating all of them using the same general-purpose algorithms is an
important advantage of Bayesian statistics. 

If the response $y$ is a binary success (1) vs. failure (0) indicator, the
canonical family is the *Bernoulli* distribution with density

$$
y \sim \text{Bernoulli}(\psi) = \psi^y (1-\psi)^{1-y},
$$
where $\psi \in [0, 1]$ can be interpreted as the success probability. Common
IRT models that can be built on top of the Bernoulli distribution are the 1, 2,
and 3 parameter logistic models (1PL, 2PL, and 3PL models).


This results in what is known as a
*generalized linear model* (GLM). That is, the predictor term $\eta = \theta_p +
\xi_i$ is still linear but transformed, as a whole, by a non-linear function
$f$, which is commonly called 'response function'. For Bernoulli distributions,
we can canonically use the logistic response function

$$
f(\eta) = \text{logistic}(\eta) = \frac{\exp(\eta)}{1 + \exp(\eta)},
$$

which yields values $f(\eta) \in [0, 1]$ for any real value $\eta$. As a result,
we could write down the model of $\psi$ as

$$
\psi = \frac{\exp(\theta_p + \xi_i)}{1 + \exp(\theta_p + \xi_i)},
$$

which is known as the Rasch or 1PL model. Under the above model,
we can interprete $\theta_p$ as the ability of person $p$ (higher values of $\theta_p$ imply higher success probabilities regardless of the
administered item). $\xi_i$ can be interpreted as the easiness of item
$i$ (higher values of $\xi_i$ imply higher success probabilities regardless of
the person to which the item is administered).


## Data

The dataset is taken from @deboeck2011 and is included in the `lme4` package.

There are 24 items, based on four frustrating situations, two of which where someone else is to be blamed (e.g., “A bus fails to stop for me”), and two of which where one is to be blamed oneself (e.g., “I am entering a grocery store when it is about to close”).  Each of these situations is combined with each of three behaviors, cursing, scolding, and shouting, leading to 4x3 combinations.  These 12 combinations are formulated in two modes, a wanting mode and a doing mode, so that in total there are 24 items.  An example is "A bus fails tostop for me.  I would want to curse".
 
 
```{r}
data("VerbAgg", package = "lme4") 
VerbAgg <- VerbAgg |> as_tibble()

head(VerbAgg, 10)
```

The variables are:

```{r echo=FALSE}
tribble(~Variable, ~Description,
  "Anger", "the subject's Trait Anger score as measured on the State-Trait Anger Expression Inventory (STAXI)",
  "Gender", "the subject's gender - a factor with levels M and F",
  "item", "the item on the questionaire, as a factor",
  "resp", "the subject's response to the item - an ordered factor with levels no < perhaps < yes",
  "id", "the subject identifier, as a factor",
  "btype", "behavior type - a factor with levels curse, scold and shout",
  "situ", "situation type - a factor with levels other and self indicating other-to-blame and self-to-blame",
  "mode", "behavior mode - a factor with levels want and do",
  "r2", "dichotomous version of the response - a factor with levels N and Y") |> 
  kbl() |> 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
```


## Partial pooling for items

```{r}
formula_1pl <- bf(r2 ~ 1 + (1 | item) + (1 | id))
```


```{r}
get_prior(formula_1pl,
                  data = VerbAgg, 
                  family = bernoulli())
```

To impose a small amount of regularization on the model, we'll set
$\text{half-normal}(0, 3)$ priors on the hierarchical standard deviations of
person and items parameters. Given the scale of the logistic response function,
this can be regarded as a weakly informative prior.

```{r}
prior_1pl <- 
  prior("normal(0, 3)", class = "sd", group = "id") + 
  prior("normal(0, 3)", class = "sd", group = "item")
```


```{r}
fit_1pl <- brm(formula_1pl,
                  prior = prior_1pl,
                  data = VerbAgg, 
                  family = bernoulli(),
                  file = "models/fit_va_1pl") |> 
  add_criterion("loo")
```


```{r}
summary(fit_1pl)
plot(fit_1pl, ask = FALSE)
```       


## No pooling for items

```{r}
fit_1pl_nopooling <- brm(r2 ~ 0 + item + (1 | id),
                  prior =  prior("normal(0, 3)", class = "sd", group = "id"),
                  data = VerbAgg, 
                  family = bernoulli(),
                  file = "models/fit_1pl_nopooling") |> 
  add_criterion("loo")
```

```{r}
summary(fit_1pl_nopooling)
```


## Person and item parameters

```{r}
# extract person parameters
ranef_1pl <- ranef(fit_1pl)
person_pars_1pl <- ranef_1pl$id

head(person_pars_1pl, 10)
```

```{r}
# extract item parameters
item_pars_1pl <- coef(fit_1pl)$item
head(item_pars_1pl, 10)
```


```{r}
# plot item parameters
item_pars_1pl[, , "Intercept"] %>%
	as_tibble() %>%
	rownames_to_column() %>%
	rename(item = "rowname") %>%
	mutate(item = as.numeric(item)) %>%
	ggplot(aes(item, Estimate, ymin = Q2.5, ymax = Q97.5)) +
	geom_pointrange() +
	coord_flip() +
	labs(x = "Item Number")
```

```{r}
# plot person parameters
person_pars_1pl[, , "Intercept"] %>%
	as_tibble() %>%
	rownames_to_column() %>%
	arrange(Estimate) %>%
	mutate(id = seq_len(n())) %>%
	ggplot(aes(id, Estimate, ymin = Q2.5, ymax = Q97.5)) +
	geom_pointrange(alpha = 0.4) +
	coord_flip() +
	labs(x = "Person Number (Sorted)")
```


```{r}
library(tidybayes)

fit_1pl |> 
  spread_draws(r_id[id,Intercept])
```


```{r eval=FALSE, include=FALSE}
varying_subject_effects <- fit_1pl |> 
  spread_draws(b_Intercept, r_id[id, Intercept]) |> 
  median_qi(ability = b_Intercept + r_id, .width = c(.95)) |> 
  arrange(ability) 

varying_subject_effects |> 
  mutate(id = as_factor(id),
         id = fct_reorder(id, ability, .fun = 'min')) |> 
  ggplot(aes(y = id, x = ability, xmin = .lower, xmax = .upper)) +
  geom_pointinterval(alpha = 0.4) 
```


```{r}
varying_item_effects <- fit_1pl |> 
  spread_draws(b_Intercept, r_item[item, Intercept]) |> 
  median_qi(difficulty = b_Intercept + r_item, .width = c(.95)) |> 
  arrange(difficulty) 

varying_item_effects |> 
  ggplot(aes(y = item, x = difficulty, xmin = .lower, xmax = .upper)) +
  geom_pointinterval(alpha = 0.4) +
  ggtitle("Partial pooling estimates")
```


```{r eval=FALSE, include=FALSE}
r2 ~ Anger + Gender + btype + situ + mode + mode:Gender +
  (0 + Gender | item) + (0 + mode | id)
```


```{r}
population_item_effects <- fit_1pl_nopooling |> 
  gather_draws(`b_item.*`, regex = TRUE) |> 
  median_qi(.width = c(.95))
```


```{r}
population_item_effects |> 
  ggplot(aes(y = .variable, x = .value, xmin = .lower, xmax = .upper)) +
  geom_pointinterval(alpha = 0.4) +
  ggtitle("No pooling estimates")
```


```{r eval=FALSE, include=FALSE}
library(lme4)
# specify a 1PL model with lme4 for comparison
lme4_va_1pl <- glmer(
	r2 ~ 1 + (1 | item) + (1 | id),
	data = VerbAgg,
	family = binomial()
)
summary(lme4_va_1pl)

# person and item parameters are similar to those obtained by brms
coef(lme4_va_1pl)$item
ranef(lme4_va_1pl)$id
```