-
Notifications
You must be signed in to change notification settings - Fork 31
/
Esteban_Aramayo_Tidyverse_Across.Rmd
160 lines (110 loc) · 5.17 KB
/
Esteban_Aramayo_Tidyverse_Across.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
---
title: "Tidyverse Create project – Using across() function"
author: "Esteban Aramayo"
date: "`r Sys.Date()`"
output:
html_document:
toc: true
#number_sections: true
toc_float: true
collapsed: false
theme: united
highlight: tango
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
### Project summary
In this project I will show how to compute summary statistics across multiple columns in R using the **dplyr's across()** function. In **dplyr's release 1.0.4 (2021-02-02)** the **across() function** makes **summarise(across())** and **mutate(across())** perform much faster than ever.
### Why use the across() function
In certain situations one needs to calculate summary statistics like mean/median/std or some other function on multiple columns. Usually we calculate summary statistics by manually doing it for each variable (column) one by one. In a dataset with many columns, it can be very tedious to calculate the summary statistics on such columns, especially if we want to combine the calculated statistics with the **summarise()** and **mutate()** functions.
The dplyr's across() function allows us to apply the same statistical function to multiple variables at the same time and it allows us to combine them with the **summarise()** and **mutate()** functions.
```{r, message=FALSE, warning=FALSE}
library(tidyverse)
if (!require("DT")) install.packages('DT')
library(DT)
```
### Load & view the dataset
For this project we will use the "Nifty 500 fundamental statistics" dataset, which includes Top 500 companies listed on the National Stock Exchange (NSE) in India.
The original dataset can be downloaded from:
https://www.kaggle.com/dhimananubhav/nifty-500-fundamental-statistics?select=nifty_500_stats.csv
```{r}
url <- 'https://raw.githubusercontent.com/esteban-data-enthusiast/DATA607/main/datasets/nifty_500_stats.csv'
my_col_types <- cols(
index = col_integer(),
company = col_character(),
industry = col_character(),
symbol = col_character(),
market_cap = col_double(),
current_value = col_double(),
high_52week = col_double(),
low_52week = col_double(),
book_value = col_double(),
price_earnings = col_double(),
dividend_yield = col_double(),
roce = col_double(),
roe = col_double(),
sales_growth_3yr = col_double()
)
ds <- read_delim(file = url, delim = ";",
col_names = TRUE, col_types = my_col_types)
```
Let's take a peak at our dataset
```{r echo=FALSE}
DT::datatable(ds,
options = list(pageLength = 5, autoWidth = TRUE),
width = 1000, height = 650, rownames = FALSE)
```
### Let's use the across() function
Let's calculate the mean per industry for all numeric fields at once.
```{r}
# Use the .names argument to control the output names
num_vars_means <- ds %>%
select(industry:sales_growth_3yr ) %>%
group_by(industry) %>%
summarise(across(where(is.numeric), mean, .names = "mean_{col}"))
```
Let's show the calculated means per numeric field per industry.
```{r echo=FALSE}
DT::datatable(num_vars_means,
options = list(pageLength = 5, autoWidth = TRUE),
width = 1000, height = 650, rownames = FALSE)
```
### Conclusions
The dplyr's across() function is a very powerful one, which allows to calculate statistical functions on multiple columns at once. As a bonus, it also allows us to automatically name the calculated fields without having to do it manually.
Also, instead of storing the calculated fields in a separate dataframe, we could also use the mutate() function to append those new fields to the original dataframe as new columns.
## Extending the Vignette
*Author: Claire Meyer*
From here, we can extend the vignette to leverage ggplot's functionality and compare these means. First, let's use a basic scatterplot to compare mean ROE and Mean Market Cap, and then compare to the total data:
```{r scatter}
num_vars_means %>%
ggplot(aes(x=mean_market_cap,y=mean_dividend_yield)) +
geom_point(aes(color=factor(industry)))
ds %>%
ggplot(aes(x=market_cap,y=dividend_yield)) +
geom_point(aes(color=factor(industry)))
```
We can also play with a geom_label or geom_text plot, to more easily see which industry mean falls where, though this has it's challenges (label sizing, namely):
```{r label}
num_vars_means %>%
ggplot(aes(x=mean_market_cap,y=mean_dividend_yield)) +
geom_label(aes(color=factor(industry),label=industry))
num_vars_means %>%
ggplot(aes(x=mean_market_cap,y=mean_dividend_yield)) +
geom_text(aes(color=factor(industry),label=industry))
```
We can also look at a histogram of market caps in all the original data, again with a view by industry, though at this number of industries it's a bit hard to discern:
```{r all-data}
ds %>%
mutate(market_cap_k = market_cap/1000) %>%
ggplot(aes(x=market_cap_k,fill=factor(industry))) +
geom_histogram(binwidth=100)
```
Finally, if we want to more visually size each industry, we can do so with geom_count().
```{r counts}
ds %>%
ggplot(aes(x=category,y=industry)) + geom_count()
```