Class5_notes.Rmd

---
title: "Class5_notes"
author: "Anita"
date: "10/9/2019"
output: html_document
---

## Welcome to Class 5!

Today we will go beyond descriptive statistics in R and look at *correlations*!


We will need libraries tidyverse and pastecs
We will also need the small_subset.csv as an example, I recommend calling it 'df' to be consistent with my code. It's just a subset from the personality test data.

```{r load/install packages/load data}
#libraries


#import data
df <- 
```


### Part 1: Calculating Covariance and Correlation

We will try to understand/repeat equations of covariance and correlation from the example of relation of shoesize to breath hold data from the personality test.

#### Understanding Covariance equation

            sum of all Cross-product deviations       
cov(x,y) = ------------------------------------- 
                  degrees of freedom                                    


Cross-product deviation = deviation_of_x_value * deviation_of_y_value


For the equation, we need:
  deviation of x value from the mean and deviation of y value from the mean (for every row in our data)
  degrees of freedom = Number of all observations - 1
  
Shoesize will be the x variable
Breath hold will be the y variable

```{r}

#Here we use mutate() to make new columns using values from existing columns
df <- df %>%
  mutate(shoesize_dev =      ,      # value of shoesize - mean(of all values of shoesize in our data)
         breath_dev =        ,      # value of breath hold - mean(of all values of breath hold in our data)
         crossProdDev =      )      # Multiply deviation of shoesize by deviation of breath hold to get cross-product deviations


#Calculate number of rows in our data and subtract 1 to get degrees of freedom
degrees =                           #number of rows can be calculated using nrow(df)

  
#Now we have all values we need to calculate covariance
covariance =                        #sum all cross-products of deviations and divide this sum by degrees of freedom
  

#see the result:  
covariance

#LUCKILY, R HAS A FUNCTION FOR IT: cov(x,y)
#try it and comapre results with manually calculated covariance

```


#### Understanding Correlation equation

                       covariance                             covariance 
correlation(x,y) = --------------------- =  -------------------------------------------------
                  Standardisation term      standard deviation of x * standard deviation of y

For the equation, we need:
  value of covariance - we already calculated that
  standard deviations of both variables - we can use sd() function for that
  
```{r}

#Standardize covariance by dividing it by the product of standard deviations of both variables
correlation = 

#LUCKILY, THERE IS A FUNCTION FOR IT:   cor.test(x, y, method = 'pearson')
#try it and compare results with manually calculated correlation


#Now try to store the output of cor.test(x, y, method = 'pearson') in a variable called output
output = 

#Now try to access the estimate from the stored output by writing output$estimate, store this value in a variable called r_output
r_output =


#see if there is difference between correlation coefficients calculated manually and estimate of the cor.test() function


```


#### Testing for Pearson's Correlation assumptions 

The most important assumption to check for is normality of data. You should always check normality of both variables.
The quickest way might be to use stat.desc() function from pastecs.

```{r}
#test
round(pastecs::stat.desc(cbind(df$shoesize, df$breath_hold), basic = FALSE, norm = TRUE), digits = 2)

#is it normally distributed?

```


It is not really normally distributed, so what do we do?

1) We can try to transform our data to make it more normal
  e.g.log transform definitely helps breathhold data, shoesize seems to be trickier though. We won't do it for this data now.

or!

2) Another way around the problem with non-normally distributed data is to use other correlation coefficients, like Spearman's rho or Kendall's tau. 

```{r}
#Running Spearman correlation test: cor.test(x,y, method = 'spearman')
output_spearman <- 
r_spearman <- output_spearman$estimate      #writing down the estimate

#seeing output and result
output_spearman
r_spearman


#Running Kendall correlation test: cor.test(x,y, method = 'kendall')
output_kendall <- 
tau <- output_kendall$estimate               #writing down the estimate

#seeing output and result
output_kendall
tau


#how similar are estimates for correlation using spearman's rho and kendall's tau?

```

--- 

### Part 2: Working with Reading Experiment Data

We've got some interesting data to work with!

#### Prepare Reading Experiment Data
Load your reading experiment logfile (it should be in the same folder as this Rmd file, which is your working directory).

```{r}

rdf <- 
```


We have one continuous variable in our logfile - reading time. What do you think could have affected it? Normally, the length of the word relates to the time needed to read it.


Here is an example of how to calculate word length for all words in the dataframe: 
```{r}
#create a random dataframe as an example 
example <- data.frame(words = c("This", "is", "not", "a", "real", "dataframe"),
             rt = rnorm(n = 6, mean = 2, sd = 0.1)) #sample 6 random values from a normal distribution with the mean of 2 and sd of 0.1


#words need to be characters in order to calculate their length!
example$words <- as.character(example$words)

#count characters in the column 'words' using function nchar() and put it into a new column
example$wordlength <- nchar(example$words)

#see the new 'example' dataframe with word length values 
example

```


Given the example above, calculate length of words in your logfile:
```{r}

```

#### Analysis of reading data 
1. Assumptions: are your data normally distributed?
– use stat.desc() on RT and word length

2. You can try transformations:
– Use mutate to create log(RT), sqrt(RT) and 1/RT
– Go through the assumptions check again: did transformation fix your data or do you need to use a correlation test for non-normally distributed data?

3. Correlational test:
– Perform a correlational test on your data using cor.test() - Can you use Pearson's test or do you need to you Spearman or Kendall?


Steps 4 and 5 continue in Part 3


--- 

### Part 3: Scatter plot and reporting results

#### Visualization

4. Make a scatterplot of the reaction times and word lengths and add a regression line, using the following code

ggplot(dataframe,aes(x, y))+
  geom_point( )+
  geom_smooth(method="lm")   # lm stands for linear model, so geom_smooth will draw a straight regression line
  

```{r}

```


#### Reporting the results

5.Report the results in APA format:
  r(degrees of freedom) = correlation coefficient estimate, p = p-value
  Degrees of freedom are (N - 2) for correlations
  You can also report shared variance: R2 

Example:  “Reading time (RT) was found to negatively correlate with word length, r(60) = - 0.71, p = .02, R2  = 0.50”