Class10_notes_filled.Rmd

---
title: "Class10_notes"
author: "Anita Kurm"
date: "11/27/2019"
output: html_document
---

## Welcome to Class 10!

Today we will learn how to perform **logistic regression** in R and kick-start your Portfolio 5! In the Part 2 of this markdown you will also find out more about **the problem of multicollinearity in linear models** and ways to avoid it.


### Set-up: Data for Portfolio 5

The data you will be working with today are from a sound symbolism experiment: kikibobo.Rda. You will notice that data comes in .Rda format, which is short for .RData (or .rdata). Typically files of this format are *R objects* that were saved using save() command. To load them, we just need the load() command. You should see a new object (a data frame) namded kikibobo3 will appear in your environment.

```{r}
#an alternative spelling of pacman::p_load() is to call library(pacman) followed by p_load
library(pacman);p_load(tidyverse, boot, lmerTest, caret, e1071)

#load data
load(file = "kikibobo.Rda") 
```


Note that you can't give a name to the output of this command the way you would give a name 'df' to an output of read_csv(). If you try to give a name to the output of load(), your new object will just contain a name of the R object in your file, but not the data it contains.

```{r}
#try to give a name and fail
df <- load(file = "kikibobo.Rda") 
df #the output is just a character string 'kikibobo3'

#delete df from environment
rm(df)
```


#### Set-up Exercise:  
Now that we have data loaded (object named kikibobo3), let's do some simple pre-processing and data eye-balling:


1. Make a (better) copy of kikibobo3 by creating a tibble named 'kiki' using the command as_tibble()

2. Remove the column X - we don't need it

3. Anonymize data! Right now the id column might contain sensitive information. We want to preserve some kind of ID, so we can use it in our analysis further, but we don't want it the way it is written right now:
        
        - Make id a factor - to make sure that every distict ID is treated as its own thing
        
        - Then make id a numeric variable - to turn all level names into numbers - this way they are distinct and don't contain all the sensitive information
        
        - Then make id a factor again - this time all of its levels will be written down as numbers

4. Check out first several rows using head() function.

5. Check what class of variable 'shape' is. Use command class() for that. 

6. Pick any factor variable in the kiki tibble and use command levels() on that variable to see what levels are possible in this variable

7. Check how many levels there are in the factor variable by either looking at 
```{r}
#make the dataframe a tibble and give it a shorter name
kiki <- as_tibble(kikibobo3)

#remove the X column - we don't know what it is, but we know it isn't important
kiki$X <- NULL

#anonymize data
kiki$id <- as.factor(kiki$id) #id is now a factor with 35 levels - meaning we have 35 participants
kiki$id <- as.numeric(kiki$id) #turn those factors into numbers - data is anonymized
kiki$id <- as.factor(kiki$id) #turn back into the factor - levels are now numbers instead of possibly sensitive information


#check out what data looks like by looking at it's head (several first rows)
head(kiki) #looks


#you can check what class of a variable you're dealing with
class(kiki$shape)

#you can also check which levels there are in a factor variable
levels(kiki$shape)

#you can also see how many levels there are by using length() function - or by checking properties of the tibble in the environment
length(levels(kiki$shape)) 

```
 

### Part 1: Logistic Regression in R and its 

Logistic regression is another type of linear models - something we have worked with for a while now!

Quick recap on what it is:
- Logistic regression is for categorical (binary) outcome variables
- Predicts probability of a certain kind of y given values of x
- Same output as linear model, but estimates are in the log-odd scale
- Very common classification algorithm in machine learning


#### Generalized linear model: GLM
R makes it very easy to fit a logistic regression model by using a function called glm(). The fitting process of glm() is not so different from the one used in linear regression, excepth we also specify an additional parameter called family:

      glm(outcome ~ predictors, data, family = binomial)


See help(glm) for other modeling options. See help(family) for other allowable link functions for each family.
```{r}
#make a GLM model for shape
m <- glm(shape ~ consonant, kiki, family = binomial)
summary(m)
```


#### Mixed effects logistic regression model
If we have repeated measures, we need to make a mixed effect model, just like in the previous class!
The general syntax we will be using is:

  lme4::glmer(Outcome ~ Predictors + (1+Slope|Intercept),data = data, family = binomial)
  
    - lme4::glmer           asking to look for glmer command from lme4 pacakge
    
    - glmer()               asking to fit generalized mixed effect model (GLMM)
    
    - Outcome ~ Predictors  standard linear model formula
    
    - 1+Slope|Intercept     random effects
    
    - data                  name of the dataset
    
    - family = binomial     The ability to specify the family is exactly what makes a logistic regression model so easy to fit. GLMM, by default binomial means binomial(link = "logit"), but it can actually use other link functions too for other puproses
    

```{r}
#make a GLMM model for shape
m2 <- lme4::glmer(shape ~ consonant + (1+consonant|id), kiki, family = binomial)
summary(m2)
```


#### Output interpretation

1) The singularity warning - points out that your model is almost too complex (most likely overfits data) and some of your predictor estimates are close to 0. It's just a warning and you can still use estimates from the model - but most likely it isn't the best model and you should consider a simpler model. Check the most popular response here to know more about recommendations: https://www.researchgate.net/post/What_does_singular_fit_mean_in_Mixed_Models 

2) Model fit:
Information criteria - can be used to compare with other models

loglik - Log-likelihood statistic:
    how much unexplained information left after model is fitted
    large values = poor fit

deviance - like F-ration in linear regression


3) The estimates

Outcome variable of a binomial logistic regression has two levels (categories), and regardless of actual names of levels specific to the outcome variable we have, the first level is encoded as 0 and the second level is encoded as 1. And estimates we are getting are **probabilities of level 1**.
    On the example of our data: levels(kiki$shape) are "curved" "jagged" - meaning curved is level 0 and jagged is level 1. Therefore, probabilities we are going to get from the model estimates are for 'jagged' shape
    
```{r}
#see levels of outcome variable: 0 and 1. Model estimates log odds for level 1
levels(kiki$shape)


#see levels of predictor: base level and other levels... Model intercept shows log odds of outcome's level 1 at base level! Other estimates show how outcome's log odds change in other levels compared to the base level.
levels(kiki$consonant)
```
  
    
Estimates of the logistic regression are on the log-odds scale, which means that beta estimates show change in the odds resulting from a unit change in the predictor. Log odds are easier to interpret on either odds or probability scale. I think *probability scale* is more intuitive and recommend using it!

To get normal probabilities out of log odds, we will need package boot and its function for inverse logit:

```{r}
#see summary again to see log odds
summary(m2)

#log odds into probability for estimate of intercept (the base level of factor/the first factor - curved)
boot::inv.logit(-2.178216)

#log odds into probability for estimate of going from letter B to letter K
boot::inv.logit(-2.178216 + 3.290078) 

```

Intercept shows that probability of jagged shape at base level (consonant B) is 0.10 (10%), whcih becomes 75% when consonant changes to K. 

    -> If you want to know probability of the other category - just subtract probability of given category from 1: probability of curved shape in case of consonant B is 1-0.1 = 0.9, probability of curved shape in case of consonant K is 1-0.75=0.25 
    
    
    -> In binomial logistic regression, you can look at the scale of outcome probabilities as a scale from category 0 to category 1 (e.g. the scale from 'curved' to 'jagged'). 
    Interpretation: on the scale from curved to jagged, consonant B is very close to 'curved' (estimate of 0.1 - close to 0), while consonant K is very close to 'jagged' (estimate of 0.75 - close to 1)
    

#### Visualization

There is a lot of things going on in our data

aesthetics to show more differentiation: x, y, color, shape  

geom_point() vs geom_jitter()

geom_jitter() parameters

```{r}
#visualise the data:
ggplot(kiki, aes(shape, size, color = word, shape = word)) + geom_point() 

#spread points out - so much better
ggplot(kiki, aes(shape, size, color = word, shape = word)) + geom_jitter(width = .4, height = .4) 
```


#### Write-up in APA style


#### Assumptions
Credit: https://www.statisticssolutions.com/assumptions-of-logistic-regression/ 

Logistic regression *does not* make many of the key assumptions of linear regression and general linear models that are based on ordinary least squares algorithms:

      - logistic regression does not require a linear relationship between the dependent and independent variables.  
      - the error terms (residuals) do not need to be normally distributed.
      - homoscedasticity is not required

However, some other assumptions still apply!

  - binary logistic regression requires the dependent variable to be binary, i.e. a factor with two levels
      
  - logistic regression requires the observations to be independent of each other.  In other words, the observations should not come from repeated measurements or matched data - **USE glmer() TO BUILD MIXED EFFECT MODELS THAT DEAL WITH VIOLATION OF THIS ASSUMPTION**
      
  - logistic regression requires there to be little or no multicollinearity among the independent variables.  This means that the independent variables should not be too highly correlated with each other.
      
  - logistic regression assumes linearity of independent variables and log odds.
    
  - logistic regression typically requires a large sample size.  A general guideline is that you need at minimum of 10 cases with the least frequent outcome for each independent variable in your model. For example, if you have 5 independent variables and the expected probability of your least frequent outcome is .10, then you would need a minimum sample size of 500 (10*5 / .10).

```{r}
plot(m)
```

#### Logistic Regression as 'classification algorithm'

Caret package:

The predict() function is good to know:
– For linear models it will give you predictions of y for a certain x-value

– For logistic models it will give you the predictions in percentage for a certain value of your outcome

predict(model, newdata = test_dataframe,  type = 'response')
    
    by default: newdata is not specified and then original data to which model was fit is going to be used for predictions
    
    by default: type is not specified and then predictions are given in log odds, when specified 'response' predictions are given in probabilities


confusionMatrix(predicted_categories, actual_categories, positive category) :
    needs a vector of predictions and a vector of actual values - in the same format to be comparable (in this case both have to be factors with levels 'jagged' and 'curved')
    
    needs to know which level of outcome is 'positive': for which category performance metrics should be calculated 
      
      https://stackoverflow.com/questions/33431694/caret-package-defining-positive-result/33432399 
      
    produces Confusion Matrix: Prediction against Actual values
    
    evaluates performance of our classifier: Sensitivity, Specificity,

```{r}
library(caret)

#let's remind ourselves the model: shape (jagged is level 1) ~ consonant
summary(m) 

#let's predict *probabilities* of jagged shape using consonants from the original dataframe
predicted_probs = predict(m, type = 'response')

#extract actual shape categories from the original dataframe
actual_categories = kiki$shape

#make a dataframe to see predicted probabilities of jagged category against the actual category
pred_df <- tibble(predicted_probs, actual_categories)

#make a new column to see more easily what our model predicted: if probability is less than 0.5 - it predicted 'curved', otherwise - it predicted jagged
pred_df$predicted_category = ifelse(pred_df$predicted_probs < 0.5, "curved", "jagged")

#make it a factor (it's a character due to the way we specified ifelse outcomes)
pred_df$predicted_category <- as_factor(pred_df$predicted_category)


confusionMatrix(pred_df$predicted_category, pred_df$actual_categories, positive ="jagged")
```


### Part 1 Exercise

Find Portfolio 5 assignment on blackboard and start doing it :) 


### Part 2: Multicollinearity

1. Correlation matrix

https://stackoverflow.com/questions/31238284/correlation-matrix-of-a-bunch-of-categorical-variables-in-r/31240202


```{r}
#Stroop task data
stroop <- list.files(path = "EmotionalStroopTask", pattern = "*.csv", full.names = T) %>% 
  lapply(read_csv) %>% 
  bind_rows()

#check cor()
#DF[] <- lapply(DF,as.integer)
#library(sjPlot)
#sjp.corr(DF)
#sjt.corr(DF)
#round(cor(as.integer(stroop)),2)

#plot as heatmap
#qplot(x=Var1, y=Var2, data=melt(cor(child, use="p")), fill=value, geom="tile") +
#   scale_fill_gradient2(limits=c(-1, 1))


```


2. Variance Inflation Factor (VIF)

Variance Inflation Factor measures how much the behavior (variance) of an independent variable is influenced, or inflated, by its interaction/correlation with the other independent variables

vif() from package car
(GVIF stands for generalized variance inflation factor - is more comparable to normal VIFs when it )
  
Variance inflation factors range from 1 upwards;  a large VIF on an independent variable indicates a highly collinear relationship to the other variables
e.g a VIF of 1.9 tells you that the variance of a particular coefficient is 90% bigger than what you would expect if there was no multicollinearity


```{r}

pacman::p_load(car)
?car::vif()

data()

model <- lm(Reaction_time ~ Color + Valence, stroop)
summary(model)

vif(model)
```


#Making data for portfolio
```{r}
kiki2$choice <- as.character(kiki2$choice)
kiki2$shape <- ifelse(startsWith(kiki2$choice, "curved"), "curved", "jagged")
kiki2$size <-  ifelse(endsWith(kiki2$choice, c("big1", "ig2"), "big", "small")
```