04-basic_statistics.Rmd


## Summarizing data

### Summary statistics

```{r echo=FALSE, eval=TRUE, message=FALSE, warning=FALSE}
library(knitr)
options(scipen = 999)
# This code automatically tidies code so that it does not run over the page
opts_chunk$set(tidy.opts = list(width.cutoff = 50), tidy = TRUE, rownames.print = FALSE, rows.print = 10)
opts_chunk$set(cache = TRUE)
```

This section discusses how to produce and analyse basic summary statistics. We make a distinction between categorical and continuous variables, for which different statistics are permissible.

[You can download the corresponding R-Code here](./Code/04-basic_statistics (1).R)


<br>
<br>

OK to compute... | Nominal | Ordinal | Interval | Ratio
------------- | ------------- | ------------- | --- | ---
frequency distribution | Yes | Yes | Yes | Yes
median and percentiles | No | Yes | Yes | Yes
mean, standard deviation, standard error of the mean | No | No | Yes | Yes
ratio, or coefficient of variation | No | No | No | Yes
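
To make the table concrete, the following minimal sketch computes one permissible statistic per row. The data frame `toy` and all of its columns are made up for illustration and are not part of the Spotify data.

```{r}
# Hypothetical toy data with one variable per scale level
toy <- data.frame(
  genre  = factor(c("pop", "rock", "pop", "rap", "pop")),    # nominal
  rating = factor(c("low", "high", "mid", "mid", "high"),
                  levels = c("low", "mid", "high"),
                  ordered = TRUE),                            # ordinal
  temp_c = c(15, 25, 35, 20, 30),                             # interval
  sales  = c(10, 40, 20, 30, 50)                              # ratio
)

# Frequency distribution: permissible for all scale levels
table(toy$genre)

# Median of an ordinal variable: type = 1 returns an observed category
toy_median <- quantile(toy$rating, probs = 0.5, type = 1)
toy_median

# Mean and standard deviation: interval and ratio variables
mean(toy$temp_c)
sd(toy$temp_c)

# Coefficient of variation: only meaningful for ratio variables
toy_cv <- sd(toy$sales) / mean(toy$sales)
toy_cv
```

Note that `quantile()` accepts ordered factors only with `type = 1` or `type = 3`, which is why the type is set explicitly here.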

As an example data set, we will use the Spotify music data. Let's load and inspect the data first. Students of the course can get the data via Learn\@WU. If you are not enrolled in the course, please contact [Daniel Winkler](https://www.wu.ac.at/en/imsm/about-us/team/daniel-winkler).

```{r, include=FALSE}
library(openssl)
passphrase <- charToRaw("MRDAnils")
key <- sha256(passphrase)
url <- "https://raw.githubusercontent.com/IMSMWU/mrda_data_pub/master/secret-music_data.rds"
download.file(url, "./data/secret_music_data.rds", method = "auto", quiet = FALSE, mode = "wb") # binary mode keeps the .rds file intact on Windows
encrypted_music_data <- readRDS("./data/secret_music_data.rds")
music_data <- unserialize(aes_cbc_decrypt(encrypted_music_data, key = key))
```


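```{r, eval=FALSE}
# Read the data and assign it to an object so that it can be used later on
music_data <- readRDS("music_data.rds")
head(music_data) # inspect the first rows
```

### Categorical variables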

Categorical variables contain a finite number of categories or distinct groups and are also known as qualitative variables. There are different types of categorical variables:

* **Nominal variables**: variables that have two or more categories but no logical order (e.g., music genres). A dichotomous variable is simply a nominal variable that has only two categories (e.g., gender).
* **Ordinal variables**: variables that have two or more categories that can also be ordered or ranked (e.g., income groups).

For this example, we are interested in the following two variables:

* "genre": the music genre the song is associated with, subsetted for the most frequent genres.
* "explicit": whether the lyrics of the tracks are explicit or not (0 = not explicit, 1 = explicit).

You can find a full description of the variables [here](https://developer.spotify.com/documentation/web-api/reference/personalization/get-users-top-artists-and-tracks/).

As a first step, we convert the variables to factor variables using the ```factor()``` function to assign appropriate labels according to the scale points:

```{r}
# Keep only tracks from the most frequent genres
s.genre <- c("pop","hip hop","rock","rap","indie")
music_data <- subset(music_data, top.genre %in% s.genre)

# Convert to factors and attach labels to the 0/1 codes
music_data$genre_cat <- as.factor(music_data$top.genre)
music_data$explicit_cat <- factor(music_data$explicit, levels = 0:1, 
    labels = c("not explicit", "explicit"))
```
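
As a small illustration of how ```factor()``` maps values to labels, consider the following sketch. The vector `toy_explicit` is made up and merely mimics the 0/1 coding described above.

```{r}
# Hypothetical 0/1 indicator, coded like the "explicit" variable
toy_explicit <- c(0, 1, 1, 0, 1)
toy_explicit_cat <- factor(toy_explicit, levels = 0:1,
                           labels = c("not explicit", "explicit"))

levels(toy_explicit_cat)     # labels follow the order given in 'levels'
as.integer(toy_explicit_cat) # internal integer codes start at 1, not at 0

# Values that are not listed in 'levels' are silently converted to NA
toy_unlisted <- factor(c(0, 1, 2), levels = 0:1,
                       labels = c("not explicit", "explicit"))
toy_unlisted
```

The last call shows why it is worth checking the result with ```table()``` after a conversion: unexpected codes do not raise an error but turn into missing values.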

The ```table()``` function creates a frequency table. Let's start with the number of occurrences of the categories associated with the genre and explicitness variables separately:

```{r}
table(music_data[,c("genre_cat")]) #absolute frequencies
table(music_data[,c("explicit_cat")]) #absolute frequencies
```

The table shows that there are more tracks with non-explicit lyrics than tracks with explicit lyrics. For variables with more categories, this might be less obvious, and we might use the ```summary()``` function, which produces further statistics. If the variable is stored as a numeric 0/1 indicator, the reported mean equals the share of explicit tracks.

```{r}
summary(music_data$explicit)
```
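
What ```summary()``` returns depends on the type of its input. The following sketch uses a made-up vector `toy_flag` (not part of the music data) to show the difference between a numeric indicator and a factor.

```{r}
# Hypothetical 0/1 indicator with two ones out of eight values
toy_flag <- c(0, 1, 0, 0, 1, 0, 0, 0)

# Numeric input: quartiles and the mean (here the share of ones)
toy_num_summary <- summary(toy_flag)
toy_num_summary

# Factor input: the number of observations per category
toy_fac_summary <- summary(factor(toy_flag, levels = 0:1,
                                  labels = c("not explicit", "explicit")))
toy_fac_summary
```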

Often, we are interested in the relative frequencies, which can be obtained by using the ```prop.table()``` function.

```{r}
prop.table(table(music_data[,c("genre_cat")])) #relative frequencies
prop.table(table(music_data[,c("explicit_cat")])) #relative frequencies
```
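
Relative frequencies are often easier to read as percentages. A minimal sketch on a made-up genre vector (`toy_genre` is hypothetical):

```{r}
# Hypothetical genre labels
toy_genre <- c("pop", "rock", "pop", "rap", "pop", "rock", "pop", "rap")

toy_counts <- table(toy_genre)      # absolute frequencies
toy_share <- prop.table(toy_counts) # relative frequencies

sum(toy_share)                      # relative frequencies sum to 1
toy_percent <- round(100 * toy_share, 1)
toy_percent                         # percentages, rounded to one decimal
```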

Now let's investigate whether the distribution of genres differs by explicitness. To do this, we simply apply the ```table()``` function to both variables:

```{r}
table(music_data[,c("genre_cat", "explicit_cat")]) #absolute frequencies
```

Again, it might be more meaningful to look at the relative frequencies using ```prop.table()```:

```{r}
prop.table(table(music_data[,c("genre_cat", "explicit_cat")])) #relative frequencies
```

Note that the above output shows the overall relative frequencies when explicit and non-explicit songs are considered together, so that all cells sum to 1. In this context, it might be even more meaningful to look at the conditional relative frequencies. This can be achieved by adding ```,2``` as the second argument (the margin) to the ```prop.table()``` command, which tells R to compute the relative frequencies by column (in our case the explicitness variable), so that each column sums to 1:

```{r}
prop.table(table(music_data[,c("genre_cat", "explicit_cat")]),2) #conditional relative frequencies
```
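
The difference between the two margins can be checked on a small made-up cross table (the object `toy_tab` and its values are hypothetical):

```{r}
# Hypothetical cross table of genre by explicitness
toy_tab <- table(
  genre    = c("pop", "pop", "pop", "rock", "rock", "rap"),
  explicit = c("no", "no", "yes", "no", "yes", "yes")
)

toy_by_row <- prop.table(toy_tab, 1) # margin 1: shares within each genre
toy_by_col <- prop.table(toy_tab, 2) # margin 2: shares within each explicitness group

rowSums(toy_by_row) # every row sums to 1
colSums(toy_by_col) # every column sums to 1
```

Which margin is appropriate depends on the question: margin 1 answers "what share of each genre is explicit?", whereas margin 2 answers "how are the genres distributed among explicit and among non-explicit songs?".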

### Continuous variables

#### Descriptive statistics

Continuous variables are numeric variables that can take on any value on a measurement scale (i.e., there is an infinite number of values between any two values). There are different types of continuous variables:

* **Interval variables**: while the zero point is arbitrary, equal intervals on the scale represent equal differences in the property being measured. E.g., on a temperature scale measured in Celsius the difference between a temperature of 15 degrees and 25 degrees is the same difference as between 25 degrees and 35 degrees but the zero point is arbitrary. 
* **Ratio variables**: have all the properties of an interval variable, but also have an absolute zero point. When the variable equals 0.0, it means that there is none of that variable (e.g., number of products sold, willingness-to-pay, mileage a car gets).
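
The practical consequence of an arbitrary zero point is that ratios of interval-scaled values are not meaningful, while differences are. A short sketch with made-up temperatures (the object names are hypothetical):

```{r}
# Two hypothetical temperatures in degrees Celsius (interval scale)
toy_temp_c <- c(15, 30)
toy_temp_c[2] / toy_temp_c[1] # equals 2, but 30 degrees is not "twice as warm"

# The same temperatures in Kelvin, which has an absolute zero (ratio scale)
toy_temp_k <- toy_temp_c + 273.15
toy_ratio_k <- toy_temp_k[2] / toy_temp_k[1]
toy_ratio_k                   # about 1.05

# Differences are unaffected by shifting the zero point
all.equal(diff(toy_temp_c), diff(toy_temp_k))
```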

Computing descriptive statistics in R is easy and there are many functions from different packages that let you calculate summary statistics (including the ```summary()``` function from the ```base``` package). In this tutorial, we will use the ```describe()``` function from the ```psych``` package:

```{r message=FALSE, warning=FALSE, paged.print = FALSE}
library(psych)
psych::describe(music_data[,c("trackPopularity", "adv_spending")])
```

In the above command, we used the ```psych::``` prefix to avoid ambiguity and to make sure that R uses the ```describe()``` function from the ```psych``` package, since there are other packages that also contain a ```describe()``` function. Note that you could also compute these statistics separately by using the respective functions (e.g., ```mean()```, ```sd()```, ```median()```, ```min()```, ```max()```, etc.).
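
As a minimal sketch of computing such statistics one by one, the following code uses a made-up vector `toy_x` that contains a missing value. The standard error of the mean is computed here as the standard deviation divided by the square root of the number of non-missing observations.

```{r}
# Hypothetical data with one missing value
toy_x <- c(2, 4, 4, 5, 7, 9, NA)

toy_n <- sum(!is.na(toy_x)) # number of non-missing observations
toy_stats <- c(
  n      = toy_n,
  mean   = mean(toy_x, na.rm = TRUE),
  sd     = sd(toy_x, na.rm = TRUE),
  median = median(toy_x, na.rm = TRUE),
  min    = min(toy_x, na.rm = TRUE),
  max    = max(toy_x, na.rm = TRUE),
  se     = sd(toy_x, na.rm = TRUE) / sqrt(toy_n)
)
round(toy_stats, 2)
```

Without ```na.rm = TRUE```, each of these functions would return ```NA``` as soon as the input contains a missing value.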

The ```psych``` package also contains the ```describeBy()``` function, which lets you compute the summary statistics separately for each sub-group. For example, we could easily compute the summary statistics by explicitness as follows:

```{r message=FALSE, warning=FALSE}
describeBy(music_data[,c("trackPopularity", "adv_spending")], music_data$explicit_cat)
```

Note that you could just as well use other packages to compute the descriptive statistics. For example, you could have used the ```stat.desc()``` function from the ```pastecs``` package:

```{r message=FALSE, warning=FALSE, paged.print = FALSE}
library(pastecs)
stat.desc(music_data[,c("trackPopularity", "adv_spending")])
```

Computing statistics by group is also possible by using the wrapper function ```by()```. Within the function, you first specify the data on which you would like to perform the grouping ```music_data[,c("trackPopularity", "adv_spending")]```, followed by the grouping variable ```music_data$explicit_cat``` and the function that you would like to execute (e.g., ```stat.desc()```):

```{r message=FALSE, warning=FALSE, paged.print = FALSE}
library(pastecs)
by(music_data[,c("trackPopularity", "adv_spending")],music_data$explicit_cat,stat.desc)
```
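
The function passed to ```by()``` does not have to come from a package. The following sketch applies the base function ```colMeans()``` to each group of a made-up data frame (`toy_df` and its columns are hypothetical):

```{r}
# Hypothetical data with two numeric variables and one grouping factor
toy_df <- data.frame(
  popularity = c(10, 20, 30, 40, 50, 60),
  spending   = c(1, 2, 3, 4, 5, 6),
  group      = factor(c("a", "a", "a", "b", "b", "b"))
)

# by() splits the data frame by 'group' and applies the function to each piece
toy_means <- by(toy_df[, c("popularity", "spending")], toy_df$group, colMeans)
toy_means

# Results for a single group can be extracted by the group label
toy_means[["a"]]
```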

These examples illustrate that there are often many different ways to reach a goal in R. Which one you choose depends on what type of information you seek (the results provide slightly different information) and on personal preferences.

#### Creating subsets

From the above statistics it is clear that the data set contains some severe outliers on some variables. For example, the maximum amount of spending on advertisement is `r round(max(na.omit(music_data$adv_spending)),1)` units. You might want to investigate these cases and delete them if they turn out to bias your analyses. For normally distributed data, observations whose standardized values exceed 3 in absolute terms (i.e., that lie more than 3 standard deviations from the mean) are suspicious. Let's check if potential outliers exist in the data:

```{r message=FALSE, warning=FALSE, paged.print = FALSE}
library(dplyr)
music_data %>%
  mutate(adv_spending_std = as.vector(scale(adv_spending))) %>%
  filter(abs(adv_spending_std) > 3) %>%
  select(id, trackName, adv_spending, adv_spending_std)
```
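
With its default settings, ```scale()``` subtracts the mean and divides by the standard deviation, i.e., it computes $z_i = (x_i - \bar{x}) / s$. The following sketch verifies this on a made-up vector (`toy_spend` is hypothetical) that contains one extreme value:

```{r}
# Hypothetical spending data: 19 ordinary values and one extreme value
toy_spend <- c(rep(c(8, 10, 12), times = 6), 10, 200)

# Standardize by hand and with scale()
toy_z_manual <- (toy_spend - mean(toy_spend)) / sd(toy_spend)
toy_z_scale <- as.vector(scale(toy_spend))
all.equal(toy_z_manual, toy_z_scale) # both approaches agree

# Positions of observations more than 3 standard deviations from the mean
toy_flagged <- which(abs(toy_z_scale) > 3)
toy_flagged
toy_spend[toy_flagged]
```

Keep in mind that extreme values inflate the mean and the standard deviation that are used for standardizing, so in very small samples no observation can reach an absolute standardized value of 3.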

Indeed, there appear to be two potential outliers, which we may wish to exclude before we start fitting models to the data. You could easily create a subset of the original data, which you would then use for estimation, using the ```filter()``` function from the ```dplyr``` package. For example, the following code creates a subset that keeps only cases whose standardized advertising spending is below 3 in absolute value:

```{r message=FALSE, warning=FALSE, paged.print = FALSE}
library(dplyr)
estimation_sample <- music_data %>%
  mutate(adv_spending_std = as.vector(scale(adv_spending))) %>%
  filter(abs(adv_spending_std) < 3)
psych::describe(estimation_sample[,c("trackPopularity", "adv_spending")])
```
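
One detail worth keeping in mind is that ```filter()``` drops rows for which the condition evaluates to ```NA```. If the variable used for filtering contains missing values, those rows are therefore removed together with the outliers. A minimal sketch on made-up data (`toy_data` and its columns are hypothetical):

```{r message=FALSE, warning=FALSE}
library(dplyr)

# Hypothetical standardized values with one outlier and one missing value
toy_data <- data.frame(id = 1:5, z = c(-0.5, 0.2, 3.4, NA, 1.1))

# The outlier (id 3) and the row with the missing value (id 4) are both dropped
toy_kept <- toy_data %>% filter(abs(z) < 3)
toy_kept$id
nrow(toy_data) - nrow(toy_kept) # number of rows removed

# To retain rows with missing values, allow NA explicitly
toy_kept_na <- toy_data %>% filter(is.na(z) | abs(z) < 3)
toy_kept_na$id
```

Comparing the number of rows before and after filtering is a quick way to confirm that only the intended cases were excluded.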