---
title: "Class5_notes"
author: "Anita"
date: "10/9/2019"
output: html_document
---
## Welcome to Class 5!
Today we will go beyond descriptive statistics in R and look at *correlations*!
We will need the tidyverse and pastecs packages.
We will also need small_subset.csv as an example; I recommend calling it 'df' to be consistent with my code. It's just a subset of the personality test data.
```{r load/install packages/load data}
#libraries
#import data
df <-
```
### Part 1: Calculating Covariance and Correlation
We will try to understand and reproduce the equations for covariance and correlation, using the relation between shoesize and breath hold in the personality test data as an example.
#### Understanding Covariance equation

$$
\mathrm{cov}(x, y) = \frac{\text{sum of all cross-product deviations}}{\text{degrees of freedom}}
$$

$$
\text{cross-product deviation} = \text{deviation of } x \text{ value} \times \text{deviation of } y \text{ value}
$$

For the equation, we need:

- the deviation of each x value from the mean of x and of each y value from the mean of y (for every row in our data)
- degrees of freedom = number of all observations - 1

Shoesize will be the x variable and breath hold will be the y variable.
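
Written in symbols, the same definition reads as follows. The notation ($x_i$ and $y_i$ for the values in row $i$, $\bar{x}$ and $\bar{y}$ for the means, $N$ for the number of observations) is introduced here for illustration and is not used elsewhere in these notes.

$$
\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1}
$$
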
```{r}
#Here we use mutate() to make new columns using values from existing columns
df <- df %>%
  mutate(shoesize_dev = , # value of shoesize - mean(of all values of shoesize in our data)
         breath_dev = , # value of breath hold - mean(of all values of breath hold in our data)
         crossProdDev = ) # Multiply deviation of shoesize by deviation of breath hold to get cross-product deviations
#Calculate number of rows in our data and subtract 1 to get degrees of freedom
degrees = #number of rows can be calculated using nrow(df)
#Now we have all values we need to calculate covariance
covariance = #sum all cross-products of deviations and divide this sum by degrees of freedom
#see the result:
covariance
#LUCKILY, R HAS A FUNCTION FOR IT: cov(x,y)
#try it and compare the result with the manually calculated covariance
```
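
As a cross-check of the steps above, here is a minimal sketch on made-up numbers. The vectors `toy_shoesize` and `toy_breath` are hypothetical stand-ins, not values from small_subset.csv, and all `toy_` names are chosen here only for illustration. The manual result should agree with `cov()`.

```{r}
# Hypothetical toy data (NOT from small_subset.csv)
toy_shoesize <- c(38, 40, 42, 44, 46)  # stand-in for shoesize
toy_breath <- c(30, 45, 40, 60, 55)    # stand-in for breath hold (seconds)

# Deviation of every value from the mean of its variable
toy_shoesize_dev <- toy_shoesize - mean(toy_shoesize)
toy_breath_dev <- toy_breath - mean(toy_breath)

# Cross-product deviations, one per observation
toy_cross_prod_dev <- toy_shoesize_dev * toy_breath_dev

# Degrees of freedom: number of observations minus 1
toy_degrees <- length(toy_shoesize) - 1

# Covariance: sum of cross-product deviations divided by degrees of freedom
toy_covariance <- sum(toy_cross_prod_dev) / toy_degrees

# Compare with the built-in function; all.equal() tolerates rounding error
toy_covariance
cov(toy_shoesize, toy_breath)
all.equal(toy_covariance, cov(toy_shoesize, toy_breath))
```

The same pattern should carry over to the columns of `df`, with `nrow(df) - 1` as the degrees of freedom.
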
#### Understanding Correlation equation

$$
\mathrm{correlation}(x, y) = \frac{\text{covariance}}{\text{standardisation term}} = \frac{\text{covariance}}{\text{standard deviation of } x \times \text{standard deviation of } y}
$$

For the equation, we need:

- the value of the covariance, which we already calculated
- the standard deviations of both variables, which we can get with the sd() function

```{r}
#Standardize covariance by dividing it by the product of standard deviations of both variables
correlation =
#LUCKILY, THERE IS A FUNCTION FOR IT: cor.test(x, y, method = 'pearson')
#try it and compare results with manually calculated correlation
#Now try to store the output of cor.test(x, y, method = 'pearson') in a variable called output
output =
#Now try to access the estimate from the stored output by writing output$estimate, store this value in a variable called r_output
r_output =
#see if there is difference between correlation coefficients calculated manually and estimate of the cor.test() function
```
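
The standardisation step can be sketched on the same kind of made-up numbers. Again, `toy_shoesize` and `toy_breath` are hypothetical and not taken from small_subset.csv. The sketch also shows that `cor.test()` stores its estimate as a named number, which `unname()` strips for a clean comparison.

```{r}
# Hypothetical toy data (NOT from small_subset.csv)
toy_shoesize <- c(38, 40, 42, 44, 46)
toy_breath <- c(30, 45, 40, 60, 55)

# Manual correlation: covariance divided by the product of the standard deviations
toy_correlation <- cov(toy_shoesize, toy_breath) /
  (sd(toy_shoesize) * sd(toy_breath))

# Built-in test: the estimate is stored in the list element 'estimate'
toy_output <- cor.test(toy_shoesize, toy_breath, method = "pearson")
toy_r_output <- unname(toy_output$estimate)  # drop the "cor" label

# The two values should match up to rounding error
toy_correlation
toy_r_output
all.equal(toy_correlation, toy_r_output)
```
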
#### Testing for Pearson's Correlation assumptions
The most important assumption to check for is normality of data. You should always check normality of both variables.
The quickest way might be to use stat.desc() function from pastecs.
```{r}
#test
round(pastecs::stat.desc(cbind(df$shoesize, df$breath_hold), basic = FALSE, norm = TRUE), digits = 2)
#is it normally distributed?
```
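
To get a feel for what the normality rows mean, here is a small sketch with simulated data, using base R's `shapiro.test()` (the Shapiro-Wilk test that stat.desc() reports when `norm = TRUE`). The two simulated vectors and their names are assumptions made up for this illustration.

```{r}
set.seed(42)  # make the simulated numbers reproducible

# Simulated examples: one roughly normal variable, one clearly skewed variable
toy_normal <- rnorm(100, mean = 50, sd = 10)
toy_skewed <- rexp(100, rate = 0.1)

# Shapiro-Wilk test: a small p-value suggests the data deviate from normality
toy_test_normal <- shapiro.test(toy_normal)
toy_test_skewed <- shapiro.test(toy_skewed)

# W statistic and p-value for each variable
c(W = unname(toy_test_normal$statistic), p = toy_test_normal$p.value)
c(W = unname(toy_test_skewed$statistic), p = toy_test_skewed$p.value)
```

For the shoesize and breath hold columns, the same statistic and p-value appear in the normtest.W and normtest.p rows of the stat.desc() output above.
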
It is not really normally distributed, so what do we do?
1) We can try to transform our data to make it more normal
e.g. a log transform definitely helps the breath hold data; shoesize seems to be trickier, though. We won't do it for this data now.
or!
2) Another way around the problem with non-normally distributed data is to use other correlation coefficients, like Spearman's rho or Kendall's tau.
```{r}
#Running Spearman correlation test: cor.test(x,y, method = 'spearman')
output_spearman <-
r_spearman <- output_spearman$estimate #writing down the estimate
#seeing output and result
output_spearman
r_spearman
#Running Kendall correlation test: cor.test(x,y, method = 'kendall')
output_kendall <-
tau <- output_kendall$estimate #writing down the estimate
#seeing output and result
output_kendall
tau
#how similar are estimates for correlation using spearman's rho and kendall's tau?
```
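
A minimal sketch of both rank-based tests on made-up numbers (the `toy_` vectors are hypothetical, not from small_subset.csv). It also illustrates one way to think about Spearman's rho: it is Pearson's correlation computed on the ranks of the data.

```{r}
# Hypothetical toy data (NOT from small_subset.csv), no tied values
toy_shoesize <- c(38, 40, 42, 44, 46)
toy_breath <- c(30, 45, 40, 60, 55)

# Spearman's rho
toy_output_spearman <- cor.test(toy_shoesize, toy_breath, method = "spearman")
toy_rho <- unname(toy_output_spearman$estimate)

# Kendall's tau
toy_output_kendall <- cor.test(toy_shoesize, toy_breath, method = "kendall")
toy_tau <- unname(toy_output_kendall$estimate)

# Spearman's rho is Pearson's r calculated on ranks
toy_rho_from_ranks <- cor(rank(toy_shoesize), rank(toy_breath))

toy_rho
toy_rho_from_ranks
toy_tau
```
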
---
### Part 2: Working with Reading Experiment Data
We've got some interesting data to work with!
#### Prepare Reading Experiment Data
Load your reading experiment logfile (it should be in the same folder as this Rmd file, which is your working directory).
```{r}
rdf <-
```
We have one continuous variable in our logfile - reading time. What do you think could have affected it? Normally, the length of the word relates to the time needed to read it.
Here is an example of how to calculate word length for all words in the dataframe:
```{r}
#create a random dataframe as an example
example <- data.frame(words = c("This", "is", "not", "a", "real", "dataframe"),
rt = rnorm(n = 6, mean = 2, sd = 0.1)) #sample 6 random values from a normal distribution with the mean of 2 and sd of 0.1
#words need to be characters in order to calculate their length!
example$words <- as.character(example$words)
#count characters in the column 'words' using function nchar() and put it into a new column
example$wordlength <- nchar(example$words)
#see the new 'example' dataframe with word length values
example
```
Given the example above, calculate the length of the words in your logfile:
```{r}
```
#### Analysis of reading data
1. Assumptions: are your data normally distributed?
    - Use stat.desc() on RT and word length.
2. You can try transformations:
    - Use mutate() to create log(RT), sqrt(RT) and 1/RT.
    - Go through the assumptions check again: did the transformation fix your data, or do you need to use a correlation test for non-normally distributed data?
3. Correlation test:
    - Perform a correlation test on your data using cor.test(). Can you use Pearson's test, or do you need to use Spearman or Kendall?

Steps 4 and 5 continue in Part 3.
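
Step 2 can be sketched on simulated data as follows. The data frame `toy_rdf` and its column `rt` are hypothetical (your logfile will have its own column names), and the sketch assumes the dplyr package (part of tidyverse) is installed. It uses base R's `shapiro.test()` for a quick normality check of each version of the variable.

```{r}
library(dplyr)

set.seed(5)  # make the simulated numbers reproducible

# Simulated, right-skewed reading times in seconds (NOT real experiment data)
toy_rdf <- data.frame(rt = rlnorm(60, meanlog = -1, sdlog = 0.5))

# Add the three transformed versions of the reading time as new columns
toy_rdf <- toy_rdf %>%
  mutate(rt_log = log(rt),
         rt_sqrt = sqrt(rt),
         rt_inv = 1 / rt)

# Shapiro-Wilk p-value for every column: larger values mean less evidence
# against normality
toy_p_values <- sapply(toy_rdf, function(column) shapiro.test(column)$p.value)
toy_p_values
```
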
---
### Part 3: Scatter plot and reporting results
#### Visualization
4. Make a scatterplot of the reading times and word lengths and add a regression line, using the following code:
ggplot(dataframe,aes(x, y))+
geom_point( )+
geom_smooth(method="lm") # lm stands for linear model, so geom_smooth will draw a straight regression line
```{r}
```
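
Filled in with simulated data, the template above could look like the sketch below. The data frame `toy_rdf`, its columns `wordlength` and `rt`, and the axis labels are assumptions made up for illustration; replace them with your own dataframe and column names. It assumes ggplot2 (part of tidyverse) is installed.

```{r}
library(ggplot2)

set.seed(7)  # make the simulated numbers reproducible

# Simulated data (NOT real experiment data): longer words take slightly longer to read
toy_rdf <- data.frame(wordlength = sample(1:10, 50, replace = TRUE))
toy_rdf$rt <- 0.3 + 0.03 * toy_rdf$wordlength + rnorm(50, mean = 0, sd = 0.05)

# Scatterplot with a straight regression line (method = "lm")
toy_plot <- ggplot(toy_rdf, aes(x = wordlength, y = rt)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Word length (characters)", y = "Reading time (s)")

toy_plot
```
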
#### Reporting the results
5. Report the results in APA format:
r(degrees of freedom) = correlation coefficient estimate, p = p-value
Degrees of freedom are (N - 2) for correlations
You can also report shared variance: $R^2$
Example: “Reading time (RT) was found to negatively correlate with word length, r(60) = -0.71, p = .02, $R^2$ = 0.50.”
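
All the numbers needed for this report are stored in the output of cor.test(). The sketch below pulls them out and pastes them into one string. The data are simulated and the `toy_` names are hypothetical. The exact number formatting (decimals, leading zeros) is a simplification, so check it against the APA style you are asked to follow.

```{r}
set.seed(10)  # make the simulated numbers reproducible

# Simulated data (NOT real experiment data): 62 observations give 60 degrees of freedom
toy_n <- 62
toy_wordlength <- sample(1:10, toy_n, replace = TRUE)
toy_rt <- 0.3 + 0.03 * toy_wordlength + rnorm(toy_n, mean = 0, sd = 0.1)

toy_output <- cor.test(toy_wordlength, toy_rt, method = "pearson")

toy_df <- as.integer(toy_output$parameter)  # degrees of freedom, N - 2
toy_r <- unname(toy_output$estimate)        # correlation coefficient
toy_r_squared <- toy_r^2                    # shared variance

# Very small p-values are conventionally reported as "p < .001"
toy_p_text <- if (toy_output$p.value < 0.001) {
  "p < .001"
} else {
  sprintf("p = %.3f", toy_output$p.value)
}

toy_report <- sprintf("r(%d) = %.2f, %s, R2 = %.2f",
                      toy_df, toy_r, toy_p_text, toy_r_squared)
toy_report
```
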