-
Notifications
You must be signed in to change notification settings - Fork 31
/
hummell_dplyr.Rmd
262 lines (207 loc) · 13.6 KB
/
hummell_dplyr.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
---
title: "Tidyverse Vignette"
author: "Joshua Hummell, Gabriella Martinez"
date: "4/8/2021, 4/23/2021"
output:
html_document:
theme: cerulean
highlight: kate
code_folding: "show"
---
## Dplyr {.tabset .tabset-pills}
### TidyVerse Vignette CREATE
#### {.tabset}
##### Assignment Overview
Your task here is to Create an Example. Using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset. (25 points)
Later, you'll be asked to extend an existing vignette. Using one of your classmate’s examples (as created above), you'll then extend his or her example with additional annotated code. (15 points)
You should clone the provided repository. Once you have code to submit, you should make a pull request on the shared repository. You should also update the README.md file with your example.
After you’ve created your vignette, please submit your GitHub handle name in the submission link provided below. This will let your instructor know that your work is ready to be peer-graded.
You should complete your submission on the schedule stated in the course syllabus.
##### Packages and Data
```{r message=FALSE}
library(dplyr)
library(reactable)
```
```{r}
murders <- read.csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/murder_2016/murder_2015_final.csv')
```
##### select()
**Hands down my favorite R package in Tidyverse is Dplyr**
Dplyr allows for easy data manipulation and, therefore, is highly useful for everyday work!
Select data columns with ease
```{r}
murders %>% select(state)
```
##### filter()
easily filter data
```{r}
murders %>%
filter(city == 'Baltimore')
```
##### select(), group_by(), summarise(), arrange()
Easily Aggregate Date
```{r}
state <- murders %>%
select(state, change) %>%
group_by(state) %>%
summarise(state_totals = sum(change)) %>%
arrange(desc(state_totals))
state
```
##### left_join()
and even join data
```{r}
states_pop <- read.csv('https://raw.githubusercontent.com/jhumms/DATA607/main/state_populations.csv')
colnames(states_pop) <- tolower(colnames(states_pop))
murders_state <- left_join(state, states_pop, by='state')
murders_state
```
##### mutate()
and, if that weren't enough, you can even make aggregations across columns very easily!
```{r}
murders_state$population <- as.numeric(murders_state$population)
murders_state %>%
mutate(murder_rate_by_pop = (state_totals / population) *100) %>%
arrange(desc(murder_rate_by_pop))
```
### TidyVerse Vignette EXTEND
#### {.tabset}
##### Assignment Overview
Dplyr has many functions useful for data transformation. The possibilities are endless! This extension vignette covers examples using the the functions noted in the tabs above. For more handy dplyr functions, check out this [cheatsheet](https://github.com/rstudio/cheatsheets/blob/master/data-transformation.pdf)!
Joshua's vignette and his dataset caught my attention. It looks like the ```murders``` dataset contains 83 observations with 5 variables: City, State, 2014 murder count (X2014_murders), 2015 murder count (X2015_murders), and the difference between the two years (change). He also added ```states_pop``` which has 51 observations and 4 variables.
Note, the reactable^[https://glin.github.io/reactable/] package has been also used for the extended vignette for the purpose of narrowing the amount of data displayed on the page upon rendering to HTML. The usage of the reactable function will not be explained, please see the reference below for further detail.
**Instructions**
Your task here is to Extend an Existing Example. Using one of your classmate’s examples (as created above), extend his or her example with additional annotated code. (15 points)
You should clone the provided repository. Once you have code to submit, you should make a pull request on the shared repository. You should also update the README.md file with your example.
After you’ve extended your classmate's vignette, please submit your GitHub handle name in the submission link provided below. This will let your instructor know that your work is ready to be peer-graded.
You should complete your submission on the schedule stated in the course syllabus.
##### glimpse(), slice(), distinct()
Using dplyr's ```glimpse()```^[https://www.rdocumentation.org/packages/dplyr/versions/0.3/topics/glimpse] function, we can take a look at the data as well as their types. There are 83 rows and 5 columns.
```{r}
glimpse(murders)
```
The dplyr ```slice()```^[https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/slice] function is useful if you want only a 'slice' of the data where you specify an index row range. For example, below we have selected row indexes from 2 to 7.
```{r}
murders %>%
slice(2:7)
```
Using dplyr's ```distinct()``` ^[https://dplyr.tidyverse.org/reference/distinct_all.html?q=distinct] function, we can see the list of distinct states in the ```murder``` dataset. Looks like there are 34 distinct cases, 33 states as well as the nation's capital noted as D.C..
```{r}
murders %>%
distinct(state) %>%
reactable(bordered = TRUE, striped = TRUE,
highlight = TRUE, filterable = TRUE, showPageSizeOptions = TRUE,
showPagination = TRUE, pageSizeOptions = c(5, 10, 20),
defaultPageSize = 5)
```
##### top_n()
Next, suppose you wanted the top 10 cities with the most murders in the year 2015. In order to extract the top 10 cities, we would use the ```top_n()```^[https://dplyr.tidyverse.org/reference/top_n.html] function as shown below. Note, if a variable is not specified in the function as we have below with X2015_murders, then the ```top_n()``` function will automatically extract the top n specifed by the last column in the dataset.
```{r}
murders %>%
top_n(10, X2015_murders) %>%
reactable(bordered = TRUE, striped = TRUE,
highlight = TRUE, filterable = TRUE, showPageSizeOptions = TRUE,
showPagination = TRUE, pageSizeOptions = c(5, 10, 20),
defaultPageSize = 5)
```
Lets say now, instead of the top ten, you actually want the bottom 5 cities with the least murders from the year 2014. We can still use the ```top_n()``` function, the only difference will be is that we will add a minus (-) sign to the input.
```{r}
murders %>%
top_n(-5, X2014_murders) %>%
reactable(bordered = TRUE, striped = TRUE,
highlight = TRUE, filterable = TRUE,
defaultPageSize = 10)
```
##### mutate(), transmute(), arrange()
Which cities had the greatest percent change in murder counts? The dataset already comes with a change column that give us the murders 2015 - murders 2014 value, however we would like to see this in a percentage. We can perform such calculation and add it as a new column using the ```mutate()```^[https://www.rdocumentation.org/packages/plyr/versions/1.8.6/topics/mutate] function.
Note, the ```round()```^[https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/Round] from baseR is used to round our percent_change value to the 2 decimal places by wrapping our percent change function and specifying the two decimal places.
```{r}
murders %>%
mutate(percent_change = round(((X2015_murders-X2014_murders)/X2014_murders)*100,2)) %>%
reactable(bordered = TRUE, striped = TRUE,
highlight = TRUE, filterable = TRUE, showPageSizeOptions = TRUE,
showPagination = TRUE, pageSizeOptions = c(5, 10, 20),
defaultPageSize = 5)
```
Using our ```top_n()``` function as well as leveraging the pipe operator multiple times, we can see which top 5 cities had the highest percent increases, and arrange them in descending order using the ```arrange()```^[https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/arrange] function.
```{r}
murders %>%
mutate(percent_change = round(((X2015_murders-X2014_murders)/X2014_murders)*100,2)) %>%
top_n(5,percent_change) %>%
arrange(desc(percent_change)) %>%
reactable(bordered = TRUE, striped = TRUE,
highlight = TRUE, filterable = TRUE,defaultPageSize = 5)
```
Alternatively, suppose you only wanted the percent_change column and wanted to leave the rest of the columns behind. In this case, the ```transmute()```^[https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/mutate] function comes in handy. Note, ```mutate()``` adds new variables and preserves existing variables whereas the ```transmute()``` function drops existing variables.
```{r}
murders %>%
transmute(percent_change = round(((X2015_murders-X2014_murders)/X2014_murders)*100,2)) %>%
reactable(bordered = TRUE, striped = TRUE,
highlight = TRUE, filterable = TRUE, showPageSizeOptions = TRUE,
showPagination = TRUE, pageSizeOptions = c(5, 10, 20),
defaultPageSize = 5)
```
##### full_join(), inner_join(), anti_join()
Next, we will experiment with the ```join()```^[https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/join] function first touched upon by Josh using ```left_join()```. As seen below, ```full_join()``` returns all rows and all columns from both ```murders``` and ```states_pop```. Where there are not matching values, the function returns NA for the ones missing. Note, the ```reactable()``` displays the NA values as empty cells. For example, take a look a the state of Arkansas, this state is not in the ```murders``` table, but is in ```states_pop``` and its values are empty with the exception of the values from the original ```states_pop``` table.
*Note, D.C. in the ```murders``` table needed to be changed to DC to match the format in the ```states_pop``` table.
```{r}
murders[5,2]='DC'
```
```{r}
full_join(murders, states_pop, by = 'state') %>%
reactable(bordered = TRUE, striped = TRUE,
highlight = TRUE, filterable = TRUE, showPageSizeOptions = TRUE,
showPagination = TRUE, pageSizeOptions = c(10, 25,50),
defaultPageSize = 10)
```
The ```inner_join()``` function returns all rows from ```murders``` where there are matching values in ```states_pop```, and all columns from ```murders``` and ```states_pop```. If there are multiple matches between ```murders``` and ```states_pop()```, all combination of the matches are returned.
```{r}
inner_join(murders, states_pop, by = 'state') %>%
reactable(bordered = TRUE, striped = TRUE,
highlight = TRUE, filterable = TRUE, showPageSizeOptions = TRUE,
showPagination = TRUE, pageSizeOptions = c(10, 25,50),
defaultPageSize = 10)
```
Next, we'll check what states do not have murders reported in the ```murders``` table using ```anti_join()``` and the ```states_pop table```. Based on the below, we can see that there are 17 states that do not have murders reported on the ```murders``` table. The ```anti_join()``` function returned all rows from the ```states_pop``` table where there are not matching values in ```murders```, keeping just columns from the ```states_pop``` table.
```{r}
anti_join(states_pop, murders, by = 'state') %>%
reactable(bordered = TRUE, striped = TRUE,
highlight = TRUE, filterable = TRUE, showPageSizeOptions = TRUE,
showPagination = TRUE, pageSizeOptions = c(10, 25,50),
defaultPageSize = 10)
```
##### select(), rename()
Assuming the population data is from 2015, we can see how much of the population was murdered by state. Using the ```select()```^[https://dplyr.tidyverse.org/reference/select.html] function, we can select only the columns we want to retain. ```select()``` can be used with either the column index numbers or the column names. Below is an example using the column index numbers.
```{r}
murders_2015 <- murders %>%
group_by(state) %>%
summarise(sum_2015_murders = sum(X2015_murders)) %>%
arrange(desc(sum_2015_murders))
reactable(murders_2015,bordered = TRUE, striped = TRUE,
highlight = TRUE, filterable = TRUE, showPageSizeOptions = TRUE,
showPagination = TRUE, pageSizeOptions = c(10, 25,50),
defaultPageSize = 10)
```
Furthermore, the ```rename()``` ^[https://dplyr.tidyverse.org/reference/rename.html] function can be used to rename the columns.
```{r}
inner_join(murders_2015, states_pop, by = 'state') %>%
select(c(1,2,4)) %>%
mutate(murder_percent = (sum_2015_murders/population)*100) %>%
rename("State" = "state", "2015 Murders" = "sum_2015_murders",
"Population" = "population", "Percentage of Population Murdered"= "murder_percent") %>%
reactable(bordered = TRUE, striped = TRUE,
highlight = TRUE, filterable = TRUE, showPageSizeOptions = TRUE,
showPagination = TRUE, pageSizeOptions = c(10, 25,50),
defaultPageSize = 10)
```
Lastly, as a New Jersey native and resident, I was curious to see the data for New Jersey. To do so, I used the ```filter()```^[https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/filter] function to filter out my home state.
```{r}
inner_join(murders_2015, states_pop, by = 'state') %>%
select(c(1,2,4)) %>%
mutate(murder_percent = (sum_2015_murders/population)*100) %>%
rename("State" = "state", "2015 Murders" = "sum_2015_murders",
"Population" = "population", "Percentage of Population Murdered"= "murder_percent") %>%
filter(State == "New Jersey")%>%
reactable(bordered = TRUE, striped = TRUE,
highlight = TRUE)
```