-
Notifications
You must be signed in to change notification settings - Fork 3
/
Class10_notes_filled.Rmd
371 lines (213 loc) · 14.5 KB
/
Class10_notes_filled.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
---
title: "Class10_notes"
author: "Anita Kurm"
date: "11/27/2019"
output: html_document
---
## Welcome to Class 10!
Today we will learn how to perform **logistic regression** in R and kick-start your Portfolio 5! In the Part 2 of this markdown you will also find out more about **the problem of multicollinearity in linear models** and ways to avoid it.
### Set-up: Data for Portfolio 5
The data you will be working with today are from a sound symbolism experiment: kikibobo.Rda. You will notice that data comes in .Rda format, which is short for .RData (or .rdata). Typically files of this format are *R objects* that were saved using save() command. To load them, we just need the load() command. You should see a new object (a data frame) namded kikibobo3 will appear in your environment.
```{r}
#an alternative spelling of pacman::p_load() is to call library(pacman) followed by p_load
library(pacman);p_load(tidyverse, boot, lmerTest, caret, e1071)
#load data
load(file = "kikibobo.Rda")
```
Note that you can't give a name to the output of this command the way you would give a name 'df' to an output of read_csv(). If you try to give a name to the output of load(), your new object will just contain a name of the R object in your file, but not the data it contains.
```{r}
#try to give a name and fail
df <- load(file = "kikibobo.Rda")
df #the output is just a character string 'kikibobo3'
#delete df from environment
rm(df)
```
#### Set-up Exercise:
Now that we have data loaded (object named kikibobo3), let's do some simple pre-processing and data eye-balling:
1. Make a (better) copy of kikibobo3 by creating a tibble named 'kiki' using the command as_tibble()
2. Remove the column X - we don't need it
3. Anonymize data! Right now the id column might contain sensitive information. We want to preserve some kind of ID, so we can use it in our analysis further, but we don't want it the way it is written right now:
- Make id a factor - to make sure that every distict ID is treated as its own thing
- Then make id a numeric variable - to turn all level names into numbers - this way they are distinct and don't contain all the sensitive information
- Then make id a factor again - this time all of its levels will be written down as numbers
4. Check out first several rows using head() function.
5. Check what class of variable 'shape' is. Use command class() for that.
6. Pick any factor variable in the kiki tibble and use command levels() on that variable to see what levels are possible in this variable
7. Check how many levels there are in the factor variable by either looking at
```{r}
#make the dataframe a tibble and give it a shorter name
kiki <- as_tibble(kikibobo3)
#remove the X column - we don't know what it is, but we know it isn't important
kiki$X <- NULL
#anonymize data
kiki$id <- as.factor(kiki$id) #id is now a factor with 35 levels - meaning we have 35 participants
kiki$id <- as.numeric(kiki$id) #turn those factors into numbers - data is anonymized
kiki$id <- as.factor(kiki$id) #turn back into the factor - levels are now numbers instead of possibly sensitive information
#check out what data looks like by looking at it's head (several first rows)
head(kiki) #looks
#you can check what class of a variable you're dealing with
class(kiki$shape)
#you can also check which levels there are in a factor variable
levels(kiki$shape)
#you can also see how many levels there are by using length() function - or by checking properties of the tibble in the environment
length(levels(kiki$shape))
```
### Part 1: Logistic Regression in R and its
Logistic regression is another type of linear models - something we have worked with for a while now!
Quick recap on what it is:
- Logistic regression is for categorical (binary) outcome variables
- Predicts probability of a certain kind of y given values of x
- Same output as linear model, but estimates are in the log-odd scale
- Very common classification algorithm in machine learning
#### Generalized linear model: GLM
R makes it very easy to fit a logistic regression model by using a function called glm(). The fitting process of glm() is not so different from the one used in linear regression, excepth we also specify an additional parameter called family:
glm(outcome ~ predictors, data, family = binomial)
See help(glm) for other modeling options. See help(family) for other allowable link functions for each family.
```{r}
#make a GLM model for shape
m <- glm(shape ~ consonant, kiki, family = binomial)
summary(m)
```
#### Mixed effects logistic regression model
If we have repeated measures, we need to make a mixed effect model, just like in the previous class!
The general syntax we will be using is:
lme4::glmer(Outcome ~ Predictors + (1+Slope|Intercept),data = data, family = binomial)
- lme4::glmer asking to look for glmer command from lme4 pacakge
- glmer() asking to fit generalized mixed effect model (GLMM)
- Outcome ~ Predictors standard linear model formula
- 1+Slope|Intercept random effects
- data name of the dataset
- family = binomial The ability to specify the family is exactly what makes a logistic regression model so easy to fit. GLMM, by default binomial means binomial(link = "logit"), but it can actually use other link functions too for other puproses
```{r}
#make a GLMM model for shape
m2 <- lme4::glmer(shape ~ consonant + (1+consonant|id), kiki, family = binomial)
summary(m2)
```
#### Output interpretation
1) The singularity warning - points out that your model is almost too complex (most likely overfits data) and some of your predictor estimates are close to 0. It's just a warning and you can still use estimates from the model - but most likely it isn't the best model and you should consider a simpler model. Check the most popular response here to know more about recommendations: https://www.researchgate.net/post/What_does_singular_fit_mean_in_Mixed_Models
2) Model fit:
Information criteria - can be used to compare with other models
loglik - Log-likelihood statistic:
how much unexplained information left after model is fitted
large values = poor fit
deviance - like F-ration in linear regression
3) The estimates
Outcome variable of a binomial logistic regression has two levels (categories), and regardless of actual names of levels specific to the outcome variable we have, the first level is encoded as 0 and the second level is encoded as 1. And estimates we are getting are **probabilities of level 1**.
On the example of our data: levels(kiki$shape) are "curved" "jagged" - meaning curved is level 0 and jagged is level 1. Therefore, probabilities we are going to get from the model estimates are for 'jagged' shape
```{r}
#see levels of outcome variable: 0 and 1. Model estimates log odds for level 1
levels(kiki$shape)
#see levels of predictor: base level and other levels... Model intercept shows log odds of outcome's level 1 at base level! Other estimates show how outcome's log odds change in other levels compared to the base level.
levels(kiki$consonant)
```
Estimates of the logistic regression are on the log-odds scale, which means that beta estimates show change in the odds resulting from a unit change in the predictor. Log odds are easier to interpret on either odds or probability scale. I think *probability scale* is more intuitive and recommend using it!
To get normal probabilities out of log odds, we will need package boot and its function for inverse logit:
```{r}
#see summary again to see log odds
summary(m2)
#log odds into probability for estimate of intercept (the base level of factor/the first factor - curved)
boot::inv.logit(-2.178216)
#log odds into probability for estimate of going from letter B to letter K
boot::inv.logit(-2.178216 + 3.290078)
```
Intercept shows that probability of jagged shape at base level (consonant B) is 0.10 (10%), whcih becomes 75% when consonant changes to K.
-> If you want to know probability of the other category - just subtract probability of given category from 1: probability of curved shape in case of consonant B is 1-0.1 = 0.9, probability of curved shape in case of consonant K is 1-0.75=0.25
-> In binomial logistic regression, you can look at the scale of outcome probabilities as a scale from category 0 to category 1 (e.g. the scale from 'curved' to 'jagged').
Interpretation: on the scale from curved to jagged, consonant B is very close to 'curved' (estimate of 0.1 - close to 0), while consonant K is very close to 'jagged' (estimate of 0.75 - close to 1)
#### Visualization
There is a lot of things going on in our data
aesthetics to show more differentiation: x, y, color, shape
geom_point() vs geom_jitter()
geom_jitter() parameters
```{r}
#visualise the data:
ggplot(kiki, aes(shape, size, color = word, shape = word)) + geom_point()
#spread points out - so much better
ggplot(kiki, aes(shape, size, color = word, shape = word)) + geom_jitter(width = .4, height = .4)
```
#### Write-up in APA style
#### Assumptions
Credit: https://www.statisticssolutions.com/assumptions-of-logistic-regression/
Logistic regression *does not* make many of the key assumptions of linear regression and general linear models that are based on ordinary least squares algorithms:
- logistic regression does not require a linear relationship between the dependent and independent variables.
- the error terms (residuals) do not need to be normally distributed.
- homoscedasticity is not required
However, some other assumptions still apply!
- binary logistic regression requires the dependent variable to be binary, i.e. a factor with two levels
- logistic regression requires the observations to be independent of each other. In other words, the observations should not come from repeated measurements or matched data - **USE glmer() TO BUILD MIXED EFFECT MODELS THAT DEAL WITH VIOLATION OF THIS ASSUMPTION**
- logistic regression requires there to be little or no multicollinearity among the independent variables. This means that the independent variables should not be too highly correlated with each other.
- logistic regression assumes linearity of independent variables and log odds.
- logistic regression typically requires a large sample size. A general guideline is that you need at minimum of 10 cases with the least frequent outcome for each independent variable in your model. For example, if you have 5 independent variables and the expected probability of your least frequent outcome is .10, then you would need a minimum sample size of 500 (10*5 / .10).
```{r}
plot(m)
```
#### Logistic Regression as 'classification algorithm'
Caret package:
The predict() function is good to know:
– For linear models it will give you predictions of y for a certain x-value
– For logistic models it will give you the predictions in percentage for a certain value of your outcome
predict(model, newdata = test_dataframe, type = 'response')
by default: newdata is not specified and then original data to which model was fit is going to be used for predictions
by default: type is not specified and then predictions are given in log odds, when specified 'response' predictions are given in probabilities
confusionMatrix(predicted_categories, actual_categories, positive category) :
needs a vector of predictions and a vector of actual values - in the same format to be comparable (in this case both have to be factors with levels 'jagged' and 'curved')
needs to know which level of outcome is 'positive': for which category performance metrics should be calculated
https://stackoverflow.com/questions/33431694/caret-package-defining-positive-result/33432399
produces Confusion Matrix: Prediction against Actual values
evaluates performance of our classifier: Sensitivity, Specificity,
```{r}
library(caret)
#let's remind ourselves the model: shape (jagged is level 1) ~ consonant
summary(m)
#let's predict *probabilities* of jagged shape using consonants from the original dataframe
predicted_probs = predict(m, type = 'response')
#extract actual shape categories from the original dataframe
actual_categories = kiki$shape
#make a dataframe to see predicted probabilities of jagged category against the actual category
pred_df <- tibble(predicted_probs, actual_categories)
#make a new column to see more easily what our model predicted: if probability is less than 0.5 - it predicted 'curved', otherwise - it predicted jagged
pred_df$predicted_category = ifelse(pred_df$predicted_probs < 0.5, "curved", "jagged")
#make it a factor (it's a character due to the way we specified ifelse outcomes)
pred_df$predicted_category <- as_factor(pred_df$predicted_category)
confusionMatrix(pred_df$predicted_category, pred_df$actual_categories, positive ="jagged")
```
### Part 1 Exercise
Find Portfolio 5 assignment on blackboard and start doing it :)
### Part 2: Multicollinearity
1. Correlation matrix
https://stackoverflow.com/questions/31238284/correlation-matrix-of-a-bunch-of-categorical-variables-in-r/31240202
```{r}
#Stroop task data
stroop <- list.files(path = "EmotionalStroopTask", pattern = "*.csv", full.names = T) %>%
lapply(read_csv) %>%
bind_rows()
#check cor()
#DF[] <- lapply(DF,as.integer)
#library(sjPlot)
#sjp.corr(DF)
#sjt.corr(DF)
#round(cor(as.integer(stroop)),2)
#plot as heatmap
#qplot(x=Var1, y=Var2, data=melt(cor(child, use="p")), fill=value, geom="tile") +
# scale_fill_gradient2(limits=c(-1, 1))
```
2. Variance Inflation Factor (VIF)
Variance Inflation Factor measures how much the behavior (variance) of an independent variable is influenced, or inflated, by its interaction/correlation with the other independent variables
vif() from package car
(GVIF stands for generalized variance inflation factor - is more comparable to normal VIFs when it )
Variance inflation factors range from 1 upwards; a large VIF on an independent variable indicates a highly collinear relationship to the other variables
e.g a VIF of 1.9 tells you that the variance of a particular coefficient is 90% bigger than what you would expect if there was no multicollinearity
```{r}
pacman::p_load(car)
?car::vif()
data()
model <- lm(Reaction_time ~ Color + Valence, stroop)
summary(model)
vif(model)
```
#Making data for portfolio
```{r}
kiki2$choice <- as.character(kiki2$choice)
kiki2$shape <- ifelse(startsWith(kiki2$choice, "curved"), "curved", "jagged")
kiki2$size <- ifelse(endsWith(kiki2$choice, c("big1", "ig2"), "big", "small")
```