-
Notifications
You must be signed in to change notification settings - Fork 4
/
04-data-manipulation.Rmd
117 lines (81 loc) · 4.18 KB
/
04-data-manipulation.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
---
title: "Data manipulation"
output:
html_document:
code_download: true
toc: true
toc_float: false
highlight: tango
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
echo = TRUE,
message = FALSE,
warning = FALSE
)
library(tidyverse)
penguins <- read_csv("data/penguins.csv")
```
## Creating summary tables
In spreadsheet, pivot tables or dynamic tables are powerful tools for summarizing data in different ways. We can create these tables using the `group_by` and `summarize` functions from the tidyverse package.
Let's take a look at our report, right after we read the data, we create a summary table with the number of cases and the average body mass for each specie in our dataset.
We can use `group_by()` to group our data by **species** and `summarize()` to calculate the number of cases and the average **body_mass_g** for each grouping.
```{r}
penguins %>%
group_by(species) %>%
summarise(count = n(),
mean_body_mass = mean(body_mass_g, na.rm = TRUE))
```
The function `n()` returns the number of cases and the function `mean` computes the average. The parameter `na.rm = TRUE` tells the function to remove non available values (`NA`) before performing the calculation.
> Question: What if you don’t use `group_by()` before `summarise()`? Let’s try it and discuss what’s going on.
```{r}
penguins %>%
summarise(count = n(),
mean_body_mass = mean(body_mass_g, na.rm = TRUE))
```
So if we don't group_by first, we will get a single summary statistic (n and mean in this case) for the whole dataset.
Now, we want the average of the body mass in kilograms and with two decimals. We can change our `summarise()` call to get that calculation.
```{r}
penguins %>%
group_by(species) %>%
summarise(count = n(),
mean_body_mass = round(mean(body_mass_g/1000, na.rm = TRUE), 2))
```
First we divide body mass by 1000 to transform from grams to kilograms and then we use the function `round()` to get only two decimals.
Now we are at the point where we actually want to save this summary information as a variable so we can use it in further analyses and formatting.
So let's add a variable assignment to that first line:
```{r}
summary <- penguins %>%
group_by(species) %>%
summarise(count = n(),
mean_body_mass = round(mean(body_mass_g/1000, na.rm = TRUE), 2))
```
## Formatig the table
To make our table look better in a report, we have to change some of its elements. For example, the name of the variables should be informative for a person (at the moment they are useful names for a computer), and we can add a title to it and change the alignment of the numbers.
We can do that with the `kable` function.
```{r}
summary %>%
knitr::kable(col.names = c("Specie", "Count", "Mean Body Mass"),
align = "lcc",
caption = "Summary of penguins on the data set by specie.")
```
We can do a lot more with other functions (or "verbs") of the tidyverse. Let's see a couple more
## Creating new variables
There are a lot of times in which we want to operate beyond the original data. This is often done by adding a column. We do this with the `mutate()` function.
We can add a new variable (called `prop`) representing the proportion of penguins in each **species** with by using `mutate()`.
```{r}
penguins %>%
group_by(species) %>%
summarise(count = n(),
mean_body_mass = round(mean(body_mass_g/1000, na.rm = TRUE), 2)) %>%
mutate(prop = count/nrow(penguins))
```
## Filtering data
In our report we also make plot by species. To do this we need to keep only the data of the species we want to analyse. We use the `filter()` function to filter data frames, this function returns a new data frame with only the rows that satisfy some conditions.
We can use logical operators (`>`, `<`, `>=`, `<=`, `==`) to create conditions from numeric variables These operators are also useful for dates and text strings. Look at the code to see a condition for text type of data.
```{r}
penguins %>%
filter(species == "Gentoo")
```
> Activity: Write code to filter all the penguins in the Torgersen island.
> Activity: Now write the code to filter all the penguins with flipper lenght greater than 200