-
Notifications
You must be signed in to change notification settings - Fork 3
/
Class4_notes_filled.Rmd
500 lines (319 loc) · 19.5 KB
/
Class4_notes_filled.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
---
title: "Class4_notes"
author: "Anita Kurm"
date: "9/24/2019"
output: html_document
---
```{r setup}
knitr::opts_chunk$set(echo = TRUE)
```
## Welcome to Class 4
Today we will be looking into assumptions of normality and ways to check them! The functions we will be using today come from packages 'tidyverse' and 'pastecs'. Make sure you load them in the chunk below:
```{r load/install packages}
pacman::p_load(tidyverse, pastecs)
```
Make a new R code chunk using keyboard shortcut
Ctrl + Alt + I (Cmd + Option + I on macOS).
Once the code chunk appears, change the name of the chunk, so it looks like this (you can choose the name yourself):
{r nameofthechunk}
Then, import the personality test data inside of it, using read.csv() function. Note, that I called it 'df', which is called upon a lot of times in this document further on. If you choose another identifier, you'll have to switch all of 'df' in the code to your identifier :)
To set working directory to the location of the data file, either save this R Markdown file into the same folder as data,or use knitr::opts_knit$set(root.dir = 'Path to the folder with the file').
```{r load_data}
df <- read.csv("NEW_CogSciPersonalityTest2019.csv")
```
### Part 1: Visually comparing measurements across different groups of partcipants
#### Bar plot, box plot, violin plot
"The boxplot compactly displays the distribution of a continuous variable. It visualises five summary statistics (the median, two hinges and two whiskers), and all "outlying" points individually."
credit and more info: https://ggplot2.tidyverse.org/reference/geom_boxplot.html
Box plots are more informative than bar plots!
Even more information can be displayed with a violin plot!
"The idea of a violin plot is to combine a box plot with a density plot. Since it relies on density estimation, the plot only makes sense if a sufficient number of data are available for obtaining reliable estimates"
credit and more info: https://www.r-bloggers.com/box-plot-alternatives-beeswarm-and-violin-plots/
To see for yourself the same data represented using these different plots, run the chunk below:
```{r}
#I will use the ggplot base, theme and labels several times, so I wrote it down as a variable 'rplot'
rplot <- ggplot(df, aes(x = df$gender, y = df$ratio2d4d, colour = df$gender)) +
theme_minimal() +
labs(x = "Gender", y = "Ratio 2D:4D")
#Bar plot
rplot +
geom_bar(stat = 'summary', fun.y = mean, width = 0.25, fill = 'white') +
ggtitle("Bar plot")
#Box plot
rplot +
geom_boxplot(width = 0.5) +
ggtitle("Box Plot")
#Violin plot
rplot +
geom_violin() +
ggtitle("Violin Plot")
```
Box plots show the median line in the middle and violin plot doesn't have any lines at all. So if we want to see the mean as well, we should ask for it using stat_summary:
```{r}
#Box plot
rplot +
geom_boxplot(width = 0.5) +
ggtitle("Box Plot with mean") +
stat_summary(fun.y = mean, geom = "point", shape = 23, colour = "Black") #draw a point for the mean
#Violin plot
rplot +
geom_violin() +
ggtitle("Violin Plot") +
stat_summary(fun.y = mean, geom = "point", shape = 23, colour = "Black") #draw a point for the mean
```
´´´
#### Exercise for Part 1
1. Numerically compare mean values of hours_music_per_week data between people who prefer Option L and Option S in taste_cola.
```{r}
df %>% group_by(taste_cola) %>% summarise(mean(hours_music_per_week))
```
2. Make a corresponding box plot (that is also showing the mean) and discuss if the box plot changes your interpretation of mean values in the groups.
```{r}
#Box plot
ggplot(df, aes(x = df$taste_cola, y = df$hours_music_per_week, colour = df$taste_cola)) +
geom_boxplot(width = 0.5) +
ggtitle("Music Time depending on Cola")+
stat_summary(fun.y = mean, geom = "point", shape = 23, colour = "Black")
```
3. Do you think something is abnormal about our data? Are some data points problematic? Try to fix it and see what changes when you make a new plot.
Hint: You can try to remove the outlier(s) (you could use filter) and make the same plot again on data without the outliers – what do you see now? Additionally, you can also re-calculate mean values and see what changed numerically.
```{r}
filtered <- df %>% filter(hours_music_per_week<100)
ggplot(filtered, aes(x = filtered$taste_cola, y = filtered$hours_music_per_week, colour = filtered$taste_cola)) +
geom_boxplot(width = 0.5) +
ggtitle("Music Time depending on Cola (filtered)")+
stat_summary(fun.y = mean, geom = "point", shape = 23, colour = "Black")
filtered %>% group_by(taste_cola) %>% summarise(mean(hours_music_per_week))
```
---
### Part 2: Mean + Error bars, confidence intervals
Out of all plots above, we can derive some information about our data and see some differences between groups. We even plotted the mean values, so we can visually compare them. It is however hard to say, how prominent is the difference between the means. One of ways to understant difference between mean values better is to draw confidence intervals. By comparing the confidence intervals of different means we can start to get some idea about whether the means came from the same population or different populations.
Confidence intervals are often visualized as error bars. Error bars can be drawn using geom_errorbar() and stat_summary(fun.data = , geom = "errorbar")
An error bar can represent the standard deviation, or the standard error, but more often than not it shows the 95% confidence interval of the
mean.
Run the chunk below to see different types of errorbars on the same data.
```{r}
#Visualizing mean and standard deviation
rplot +
geom_bar(aes(fill=gender), stat='summary', fun.y = mean, width = 0.5) +
geom_errorbar(aes(ymin = mean(df$ratio2d4d) - sd(df$ratio2d4d), ymax = mean(df$ratio2d4d) + sd(df$ratio2d4d)), width = 0.1, color = 'black') +
ggtitle("Using Standard Deviation")+
ylim(NA, 1.5) #just making sure that y-axis is consistent in all of the graphs by changing its limits (from the most minimal possible value to 1.5)
#Visualizing mean and standard error of the mean
rplot +
geom_bar(aes(fill=gender), stat='summary', fun.y = mean, width = 0.5) +
stat_summary(fun.data = mean_se, geom = "errorbar", color = 'black', width = 0.1) +
ggtitle("Using Standard Error of the Mean")+
ylim(NA, 1.5) #limiting the y axis
#Visualizing mean and 95% confidence intervals
rplot +
geom_bar(aes(fill=gender), stat='summary', fun.y = mean, width = 0.5) +
stat_summary(fun.data = mean_cl_normal, geom = "errorbar", color = 'black', width = 0.1) +
ggtitle("Using 95% Confidence Intervals")+
ylim(NA, 1.5) #limiting the y axis
```
*Standard deviation* is a measure of dispersion of the data from the mean.
"Small standard deviations represented a scenario in which most data points were close to the mean, a large standard deviation represented a situation in which data points were widely spread from the mean" - Field
*Standard error of the mean* is a measure of how precise is our estimate of the mean.
"it is a measure of how representative a sample is likely to be of the population. A large standard error (relative to the sample mean) means that there is a lot of variability between the means of different samples and so the sample we have might not be representative of the population. A small standard error indicates that most sample means are similar to the population mean and so our sample is likely to be an accurate reflection of the population" - Field
*95% Confidence Interval*: when you see a 95% confidence interval for a mean, think of it like this: if we’d collected 100 samples, calculated the mean and then calculated a confidence interval for that mean then for
95 of these samples, the confidence intervals we constructed would contain the true value of the mean in the population.
So what should we draw?
"It depends. If the message you want to carry is about the spread and variability of the data, then standard deviation is the metric to use. If you are interested in the precision of the means or in comparing and testing differences between means then standard error is your metric. Of course deriving confidence intervals around your data (using standard deviation) or the mean (using standard error) requires your data to be normally distributed"
source: https://www.r-bloggers.com/standard-deviation-vs-standard-error/
"An error bar can represent the standard deviation, or the standard error, but more often than not it shows the 95% confidence interval of the mean" - Field
#### Part 2 Exercise
Modify the last boxplot you made in Part 1 Exercises by adding various error bars to it. Which one do you think works the best for comparing mean values of different groups?
```{r}
ggplot(filtered, aes(x = filtered$taste_cola, y = filtered$hours_music_per_week, colour = filtered$taste_cola)) +
geom_boxplot(width = 0.5) +
ggtitle("Music Time depending on Cola (filtered)")+
stat_summary(fun.y = mean, geom = "point", shape = 23, colour = "Black")
```
---
### Part 3: Assumptions of normality
Credit: http://bradleyboehmke.github.io/tutorials/assumptions_normality#shapiro
Let's fabricate some perfectly normally distributed data, so we can then compare it to our actual data!
NB! We will do this just for educational purposes! After this class, when you will do your portfolio, you will just check your actual data and make judgement about its distrubution without comparing it to a fabricated normal distirbution.
```{r}
#make some random normally distributed data with sample size of 62 and mean scores resembling mean scores of actual data
normdist <- data.frame(hours_music_per_week = rnorm(62, mean = 13.74), shoesize = rnorm(62, mean = 40.65))
```
#### Checking assumptions visually:
##### Histogram + Normal curve
Histograms can be used to inspect distributions in your data
X-axis represents the different values in your variable
Y-axis represents
– counts (how many times does this value occur?)
OR
- density (what percentage of the total data has this value?)
Plotting normal curve: stat_function(fun = dnorm, args = list(mean = mean(data$ColumnOfInterest, na.rm = TRUE), sd = sd(data$ColumnOfInterest, na.rm = TRUE)), colour= "black", size = 1)
```{r}
ggplot(normdist, aes(x = hours_music_per_week)) +
geom_histogram(aes(y = ..density..), binwidth = 0.25, color = "black", fill = "white") +
ggtitle("This is what music time would look like if it was normally distributed") +
stat_function(fun = dnorm, args = list(mean = mean(normdist$hours_music_per_week, na.rm = TRUE), sd = sd(normdist$hours_music_per_week, na.rm = TRUE)), colour= "darkgreen", size = 1)+
theme_classic()
ggplot(df, aes(x = hours_music_per_week)) +
geom_histogram(aes(y = ..density..), binwidth = 0.25) +
ggtitle("This is what music time actually looks like") +
stat_function(fun = dnorm, args = list(mean = mean(df$hours_music_per_week, na.rm = TRUE), sd = sd(df$hours_music_per_week, na.rm = TRUE)), colour= "darkgreen", size = 1)+
theme_classic()
#Remember we removed the outlier of 100 hours of music per week? Let's do the same graph for the filtered data
ggplot(filtered, aes(x = hours_music_per_week)) + #note that I use 'filtered' dataframe here :) It might be called differently for you
geom_histogram(aes(y = ..density..), binwidth = 0.25) +
ggtitle("This is what music time looks like without the biggest outlier") +
stat_function(fun = dnorm, args = list(mean = mean(filtered$hours_music_per_week, na.rm = TRUE), sd = sd(filtered$hours_music_per_week, na.rm = TRUE)), colour= "darkgreen", size = 1)+
theme_classic()
```
##### QQ-plot
Tests the emperical, z-scored data against the theoretical, normally distributed data using qqnorm(data$column)
```{r}
ggplot(normdist, aes(sample = hours_music_per_week ))+
stat_qq()+
stat_qq_line(colour= 'red')
#qq-plot of fabricated normally distributed data
qqnorm(normdist$hours_music_per_week)
#qq-plot of actual data
qqnorm(df$hours_music_per_week)
#qqplot after removing the outlier
qqnorm(filtered$hours_music_per_week)
```
#### Checking assumptions numerically:
We use stat.desc() function from library 'pastecs'. We exclude basic metrics by writing basic = FALSE and include normality test by writing norm = TRUE. For ease of interpretation, we round the results of stat.desc using function round(), where we first specify the object to round up and then specify how many signs after coma we want to see. As a result we get a long line:
```{r}
#normality test and other metrics for hours_music_per_week
round(pastecs::stat.desc(df$hours_music_per_week, basic = FALSE, norm = TRUE), digits = 2)
round(pastecs::stat.desc(df$hours_music_per_week, basic = FALSE, norm = TRUE), digits = 2)
round(pastecs::stat.desc(normdist$hours_music_per_week, basic = FALSE, norm = TRUE), digits = 2)
```
We can also use cbind() function to perform stat.desc on multiple columns at the same time (columns don't have to be from the same dataframe):
```{r}
round(pastecs::stat.desc(cbind(normdist$hours_music_per_week, df$hours_music_per_week), basic = FALSE, norm = TRUE), digits = 2)
```
Understanding the statistic of a Shapiro-Wilk test of normality (normtest.W) and its associated probability (normtest.p):
If the test is non-significant (p>.05) it tells us that the distribution of the sample is not significantly different from a normal distribution. If, however, the test is significant (p < .05) then the distribution in question is significantly different from a normal distribution.
#### Part 3 Exercises
1. Make a histogram of the variable where participant chose a random number
2. Change the y-axis so it displays densities instead of counts
3. Add a normal curve to the plot – is this data normally distributed? If not, what is ”wrong” with it?
```{r}
ggplot(df, aes(x = choose_rand_num)) +
geom_histogram(aes(y = ..density..), binwidth = 0.25) +
ggtitle("Random number data") +
stat_function(fun = dnorm, args = list(mean = mean(df$choose_rand_num, na.rm = TRUE), sd = sd(df$choose_rand_num, na.rm = TRUE)), colour= "darkgreen", size = 1)+
theme_classic()
```
4. Are shoe size data normally distributed? Provide visual (histogram with normal curve; qqplot) and numerical evidence (the statistic of a Shapiro-Wilk test) for your claim.
```{r}
ggplot(df, aes(x = shoesize)) +
geom_histogram(aes(y = ..density..), binwidth = 0.25) +
ggtitle("shoesize data") +
stat_function(fun = dnorm, args = list(mean = mean(df$shoesize, na.rm = TRUE), sd = sd(df$shoesize, na.rm = TRUE)), colour= "darkgreen", size = 1)+
theme_classic()
qqnorm(df$shoesize)
round(pastecs::stat.desc(df$shoesize, basic = FALSE, norm = TRUE), digits = 2)
```
---
### Part 4: Data transformation
From Fabio's slides:
- Positive skew or Unequal variances -> log transformation: log(Xi)
Limmits: Can’t deal with negative numbers
R syntax: log()
- Positive skew or Unequal variances -> square root transformation: √Xi
Limits: Bigger effect than log(), still can’t deal with negative numbers
R syntax: sqrt()
- Positive skew - Unequal variances ->reciprocal transformation: 1/Xi
Limits: Good for negative numbers, but reverses scores
R syntax: 1/...
- Negative skew -> Reverse score transformations
Reverse the data (Xi-max(X)) before running any of the above transformations
mutate() from tidyverse is really good for transforming data:
```{r}
mdf <- df %>% mutate(hours_log = log(hours_music_per_week))
ggplot(mdf, aes(x = hours_log)) +
geom_histogram(aes(y = ..density..), binwidth = 0.25, color= 'black', fill = 'white') +
ggtitle("Log trasnformed") +
stat_function(fun = dnorm, args = list(mean = mean(mdf$hours_log, na.rm = TRUE), sd = sd(mdf$hours_log, na.rm = TRUE)), colour= "darkgreen", size = 1)+
theme_classic()
```
#### Part 4 Exercises:
1. Make a new dataset and add 3 new columns with mutate: log(hours_music_per_week), sqrt(hours_music_per_week) and 1/hours_music_per_week
2. Make histograms and look at the discriptive statistics of these new variables – which transformation should we choose?
```{r}
mdf <- df %>% mutate(hours_log = log(hours_music_per_week), hours_sqrt = sqrt(hours_music_per_week), hours_inv = 1/hours_music_per_week)
ggplot(mdf, aes(x = hours_log)) +
geom_histogram(aes(y = ..density..), binwidth = 0.25) +
ggtitle("Log trasnformed") +
stat_function(fun = dnorm, args = list(mean = mean(mdf$hours_log, na.rm = TRUE), sd = sd(mdf$hours_log, na.rm = TRUE)), colour= "darkgreen", size = 1)+
theme_classic()
ggplot(mdf, aes(x = hours_sqrt)) +
geom_histogram(aes(y = ..density..), binwidth = 0.25) +
ggtitle("Squared") +
stat_function(fun = dnorm, args = list(mean = mean(mdf$hours_sqrt, na.rm = TRUE), sd = sd(mdf$hours_sqrt, na.rm = TRUE)), colour= "darkgreen", size = 1)+
theme_classic()
ggplot(mdf, aes(x = hours_inv)) +
geom_histogram(aes(y = ..density..), binwidth = 0.25) +
ggtitle("Invert") +
stat_function(fun = dnorm, args = list(mean = mean(mdf$hours_inv, na.rm = TRUE), sd = sd(mdf$hours_inv, na.rm = TRUE)), colour= "darkgreen", size = 1)+
theme_classic()
```
---
### Part 5: Good-to-know ggplot stuff
Useful geom: geom_point() to make scatterplot and see relationship between two continuous measures.
```{r}
ggplot(df, aes(x = breath_hold, y = tongue_twist)) +
geom_point()
```
Aesthetics can be specified in different places, not just in the base layer:
```{r}
ggplot(df) + #no aesthetics in the base
geom_point(aes( x = choose_rand_num, y = sound_level_pref)) #aesthetics specified in geom
```
Memorizing a ggplot and then adding geoms to the memorized variable can be easier (Example from Part 1)
```{r}
#I will use the ggplot base, theme and labels several times, so I wrote it down as a variable 'rplot'
rplot <- ggplot(df, aes(x = df$gender, y = df$ratio2d4d, colour = df$gender)) +
theme_minimal() +
labs(x = "Gender", y = "Ratio 2D:4D")
#Bar plot
rplot +
geom_bar(stat = 'summary', fun.y = mean, width = 0.25, fill = 'white') +
ggtitle("Bar plot")
#Box plot
rplot +
geom_boxplot(width = 0.5) +
ggtitle("Box Plot")
#Violin plot
rplot +
geom_violin() +
ggtitle("Violin Plot")
```
Faceting
We can present the same plot in different groups by faceting it.
To use faceting in a plot, we can use facet_wrap( ~ y, nrow = integer, ncol = integer)
For instance, we can plot how mean shoesize differs depending on gender in every of three handedness groups by writing the following code:
```{r}
ggplot(df, aes(x = df$gender, y = df$shoesize, fill = gender)) +
facet_wrap(~handedness) + #faceting by handedness
geom_bar(stat = 'summary', fun.y = mean, width = 0.25)+
theme_minimal() +
labs(x = "Gender", y = "Shoesize") +
ggtitle("Faceting")
```
Saving ggplot (into your working directory)
```{r}
plot1 <- ggplot(df, aes(x=shoesize)) + geom_bar() #write a plot down
ggsave("myggplot.png") # saves the last plot.
ggsave("mystoredggplot.png", plot=plot1) # save a stored ggplot
```
#### Part 5 Exercises
1. Make a scatterplot showing balloon balance time against tongue twister time, write that plot down as a variable
2. Check if there are national differences in balloon balance time in relation to tongues twister time. Do so by using faceting by native language on the variable you made in Question 1
3. Save the faceted ggplot into your working directory as a png file.
#### Optional R Markdown stuff
In the very top of your document you wrote your name to the 'author'. Try to add a hyperlink to your GitHub to your name there in the similar format:
author: "[Anita Kurm](https://kbroman.org)"
Knit your rmarkdown and see if the link works :)
### Homework:
Finish your Portfolio and submit it on bb!