---
title: "Data visualisation (computer session)"
author: "Daniele Rotolo"
date: "Introductory Data Science for Innovation (995N1) -- Week 3, 11 October 2021"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Objectives
+ To become familiar with `ggplot` (we will use a dataset on firms' publishing activity and R&D expenditures)
+ To explore a few network layout algorithms
## Working with `ggplot`
We will rely on a sample of data from [Camerani et al. (2018)](https://www.sussex.ac.uk/webteam/gateway/file.php?name=2018-21-swps-camerani-et-al.pdf&site=25). This sample includes 391 firms in the Pharmaceutical and Healthcare sector listed in the [2014 EU Industrial R&D Investment Scoreboard](https://ec.europa.eu/jrc/en/publication/eur-scientific-and-technical-research-reports/2014-eu-industrial-rd-investment-scoreboard). The dataset includes a range of variables:
+ `ID`: a firm's unique identifier
+ `isocountrycode`: the country where a firm's headquarters is located
+ `rd2011` to `rd2015`: a firm's R&D expenditure from 2011 to 2015
+ `ns2011` to `ns2015`: a firm's net sales from 2011 to 2015
+ `emp2011` to `emp2015`: a firm's employees from 2011 to 2015
+ `pubs.2011` to `pubs.2015`: a firm's number of publications from 2011 to 2015
We first load the packages we need to visualise the data, and then load the data itself. Please note that the working directory will be the directory where you save the `.Rmd` file.
```{r echo=TRUE, message=FALSE, warning=FALSE}
rm(list=ls())
library(tidyverse)
library(GGally)
library(gghighlight)
library(patchwork)
my_data <- read_csv("scoreboard_firms_pharma_healthcare.csv")
```
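If `read_csv()` cannot find the file, the working directory is usually the cause. The sketch below is a minimal, self-contained illustration and not part of the workshop data: it writes a small hypothetical file (`toy_firms.csv`, with made-up values) to a temporary directory, checks that the file exists, and then inspects what was read.

```r
library(readr)

# Hypothetical toy file in a temporary directory, standing in for the workshop CSV
toy_path <- file.path(tempdir(), "toy_firms.csv")
write_csv(
  data.frame(ID = 1:3,
             isocountrycode = c("US", "GB", "DE"),
             rd2015 = c(10.5, 3.2, 7.8)),
  toy_path
)

# Confirm that the file is where R expects it before reading
file.exists(toy_path)

# Read the file and inspect its columns and dimensions
toy_firms <- read_csv(toy_path, show_col_types = FALSE)
names(toy_firms)
dim(toy_firms)
```

With the real data, the same check would be `file.exists("scoreboard_firms_pharma_healthcare.csv")`, and `getwd()` shows which directory R is looking in.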
We start by examining the distribution of some variables in the dataset. Since `isocountrycode` is categorical, a bar chart of counts (one bar per country) is the appropriate choice, while for the remaining numeric variables we can plot a density function.
```{r echo=TRUE, fig.height=4, fig.width=8, message=FALSE, warning=FALSE}
# Bar chart: one bar per country, with height equal to the number of firms
ggplot(data = my_data, aes(isocountrycode)) +
  geom_bar(color = "white", fill = "blue", alpha = 0.4) +
  ggtitle("Firms by country") +
  xlab("country") + ylab("number of firms")
```
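The counts behind such a chart can also be tabulated directly, and the bars can be ordered by frequency rather than alphabetically. The following is a minimal sketch on made-up country codes (the `toy_firms` data frame is hypothetical); `count()` comes from `dplyr` and `fct_infreq()` from `forcats`, both loaded with the `tidyverse`.

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Hypothetical country codes standing in for `isocountrycode`
toy_firms <- data.frame(isocountrycode = c("US", "US", "GB", "JP", "US", "GB"))

# Tabulate firms per country, most frequent first
toy_counts <- toy_firms %>%
  count(isocountrycode, sort = TRUE)
toy_counts

# Order the bars by frequency instead of alphabetically
p_ordered <- ggplot(toy_firms, aes(fct_infreq(isocountrycode))) +
  geom_bar() +
  xlab("country") + ylab("number of firms")
p_ordered
```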
We now explore the remaining variables on R&D expenditure, net sales, employees, and publications in a given year. As an example, we select the year 2015.
```{r echo=TRUE, fig.height=4, fig.width=8, message=FALSE, warning=FALSE}
ggplot(data = my_data, aes(rd2015)) +
  geom_density(color = "white", fill = "blue", alpha = 0.4) +
  ggtitle("Density plot of 2015 R&D investment")
```
The distribution is highly skewed. We can make it easier to read by displaying R&D investment on a log scale with `scale_x_log10()`.
```{r echo=TRUE, fig.height=4, fig.width=8, message=FALSE, warning=FALSE}
ggplot(data = my_data, aes(rd2015)) +
  geom_density(color = "white", fill = "blue", alpha = 0.4) +
  scale_x_log10() +
  ggtitle("Density plot of 2015 R&D investment") +
  xlab("R&D investment 2015 (log scale)")
```
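There are two common ways to obtain a log view, and they differ only in what the axis shows. The sketch below uses made-up values (`toy_rd` is hypothetical) to contrast them, and also shows why zeros need care: a log scale cannot display zero, which is why `pubs + 1` is plotted later in this session.

```r
library(ggplot2)

# Hypothetical R&D values spanning several orders of magnitude
toy_rd <- data.frame(rd = c(0.5, 2, 10, 50, 400, 3000))

# Option 1: keep the data as they are and put the x axis on a log10 scale;
# axis labels stay in the original units
p_scale <- ggplot(toy_rd, aes(rd)) +
  geom_density() +
  scale_x_log10()

# Option 2: transform the variable itself; the axis then shows log10 units
p_transform <- ggplot(toy_rd, aes(log10(rd))) +
  geom_density()

# Zero has no finite logarithm, so zeros drop off a log axis
log10(0)      # -Inf
log10(0 + 1)  # 0
```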
> **Exercise 1:** Reproduce the density plot for the variable `pubs.2015` (5 minutes).
To use the data for all years, we need to transform our data into a tidy format. As an example, we focus on firms' R&D investment.
```{r echo=TRUE, fig.height=4, fig.width=8, message=FALSE, warning=FALSE}
# Wide to long: one row per firm and year
my_data_rd <- my_data %>%
  select(ID, rd2011, rd2012, rd2013, rd2014, rd2015) %>%
  pivot_longer(-ID, names_to = "year", values_to = "rd")

head(my_data_rd)
```
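What `pivot_longer()` does is easier to see on a very small table. The following sketch uses made-up values for two hypothetical firms. `names_prefix` is an optional argument, not used in the chunk above, that strips the `rd` prefix so that `year` contains only the year.

```r
library(tidyr)

# Hypothetical wide table: one row per firm, one column per year
toy_wide <- data.frame(
  ID = c(1, 2),
  rd2011 = c(10, 20),
  rd2012 = c(11, 21)
)

# Long (tidy) table: one row per firm and year
toy_long <- pivot_longer(
  toy_wide,
  cols = -ID,
  names_to = "year",
  names_prefix = "rd",
  values_to = "rd"
)
toy_long
```

Two firms observed over two years give four rows, with `year` holding the strings `"2011"` and `"2012"`.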
```{r echo=TRUE, fig.height=4, fig.width=8, message=FALSE, warning=FALSE}
ggplot(data = my_data_rd, aes(rd)) +
  geom_density(color = "white", fill = "blue", alpha = 0.4) +
  scale_x_log10() +
  ggtitle("Density plot of R&D investment (2011-2015)") +
  xlab("R&D (log scale)")
```
The tidy structure allows us to explore our data by year and to generate a legend automatically in `ggplot2`.
```{r echo=TRUE, fig.height=4, fig.width=8, message=FALSE, warning=FALSE}
ggplot(data = my_data_rd, aes(rd, fill = year)) +
  geom_density(color = "white", position = "identity", alpha = 0.4) +
  scale_x_log10() +
  ggtitle("Density plot of R&D investment (2011-2015)") +
  xlab("R&D (log scale)")
```
> **Exercise 2**: Reproduce the density plot of the number of publications for each year (5 minutes).
We can now explore relationships between variables. To do so, we need to transform the entire dataset into a tidy format.
```{r echo=TRUE, fig.height=4, fig.width=8, message=FALSE, warning=FALSE}
head(my_data)

# For each variable: wide to long, then strip the prefix so that year is e.g. "2011"
my_data_rd <- my_data %>%
  select(ID, rd2011:rd2015) %>%
  pivot_longer(-ID, names_to = "year", values_to = "rd") %>%
  mutate(year = gsub("rd", "", year))

my_data_ns <- my_data %>%
  select(ID, ns2011:ns2015) %>%
  pivot_longer(-ID, names_to = "year", values_to = "ns") %>%
  mutate(year = gsub("ns", "", year))

my_data_emp <- my_data %>%
  select(ID, emp2011:emp2015) %>%
  pivot_longer(-ID, names_to = "year", values_to = "emp") %>%
  mutate(year = gsub("emp", "", year))

# fixed = TRUE treats "pubs." literally (a bare "." would match any character)
my_data_pub <- my_data %>%
  select(ID, pubs.2011:pubs.2015) %>%
  pivot_longer(-ID, names_to = "year", values_to = "pubs") %>%
  mutate(year = gsub("pubs.", "", year, fixed = TRUE))

# Join the four long tables by firm and year, then add each firm's country
my_data_tidy <- my_data_rd %>%
  full_join(my_data_ns, by = c("ID", "year")) %>%
  full_join(my_data_emp, by = c("ID", "year")) %>%
  full_join(my_data_pub, by = c("ID", "year")) %>%
  full_join(my_data %>% select(ID, isocountrycode), by = "ID")

head(my_data_tidy)
```
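As an alternative to pivoting each variable separately and joining the results, `pivot_longer()` can split the column names into a variable part and a year part in a single step. This is a sketch on a hypothetical toy table, not the approach used above; it assumes that every column name follows the pattern of lowercase letters, an optional dot, and a four-digit year (as in `rd2011` or `pubs.2011`).

```r
library(tidyr)

# Hypothetical wide table with two firms, two years, and three of the variables
toy_wide <- data.frame(
  ID = c(1, 2),
  isocountrycode = c("US", "GB"),
  rd2011 = c(10, 20), rd2012 = c(11, 21),
  ns2011 = c(100, 200), ns2012 = c(110, 210),
  pubs.2011 = c(5, 0), pubs.2012 = c(6, 1)
)

# ".value" tells pivot_longer() that the first captured group names the
# output column (rd, ns, pubs); the second captured group becomes `year`
toy_tidy <- pivot_longer(
  toy_wide,
  cols = -c(ID, isocountrycode),
  names_to = c(".value", "year"),
  names_pattern = "^([a-z]+)\\.?(\\d{4})$"
)
toy_tidy
```

The result has one row per firm and year, with `rd`, `ns`, and `pubs` as separate columns.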
We can use the `GGally` package to explore relationships between variables by year using the new tidy data structure.
```{r echo=TRUE, fig.height=4, fig.width=8, message=FALSE, warning=FALSE}
ggpairs(my_data_tidy, aes(color = year),
        columns = c("rd", "ns", "emp", "pubs"))
```
We can focus on the relationship between a firm's R&D investment and publication activity. We can also scale the size of the points by the number of employees and color them by country. We will need to simplify the country variable first. Note that we plot `pubs + 1` because zero cannot be shown on a log scale.
```{r echo=TRUE, fig.height=6, fig.width=8, message=FALSE, warning=FALSE}
# Keep US, CN, JP, DE, and GB as they are; label every other country "Other"
my_data_tidy <- my_data_tidy %>%
  mutate(country = ifelse(isocountrycode != "US" &
                            isocountrycode != "CN" &
                            isocountrycode != "JP" &
                            isocountrycode != "DE" &
                            isocountrycode != "GB", "Other", isocountrycode))

ggplot(data = my_data_tidy, aes(x = rd, y = pubs+1)) +
  geom_point(aes(color = country, size = emp)) +
  scale_size(range = c(0, 3)) +
  geom_smooth() +
  scale_x_log10() +
  scale_y_log10() +
  ggtitle("R&D investment and publications (2011-2015)") +
  xlab("R&D (log scale)") +
  ylab("Number of publications (log scale)") +
  theme(legend.position = "bottom")
```
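The chain of `!=` comparisons used to build `country` can be written more compactly with `%in%`. The sketch below uses hypothetical country codes. One difference is worth noting: in this version a missing code is labelled "Other", whereas the chained comparison returns `NA` for it.

```r
# Country codes to keep as they are (same five as in the chunk above)
keep <- c("US", "CN", "JP", "DE", "GB")

# Hypothetical codes, including a missing value
toy_codes <- c("US", "FR", "GB", "IT", NA)

# Codes in `keep` are retained; everything else, including NA, becomes "Other"
toy_country <- ifelse(toy_codes %in% keep, toy_codes, "Other")
toy_country
```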
We can use the `gghighlight` package to identify firms with fewer than 10,000 employees...
```{r echo=TRUE, fig.height=6, fig.width=8, message=FALSE, warning=FALSE}
ggplot(data = my_data_tidy, aes(x = rd, y = pubs+1)) +
  geom_point(aes(color = country, size = emp)) +
  scale_size(range = c(0, 3)) +
  geom_smooth() +
  scale_x_log10() +
  scale_y_log10() +
  ggtitle("R&D investment and publications (2011-2015) - <10,000 employees") +
  xlab("R&D (log scale)") +
  ylab("Number of publications (log scale)") +
  gghighlight(emp < 10000, keep_scales = TRUE)
```
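Highlighting differs from filtering: `filter()` removes the observations that fail the condition, while `gghighlight()` keeps them in the chart, drawn in grey, so that the highlighted group can still be read against the full sample. A minimal sketch on made-up data (the `toy_firms` data frame is hypothetical):

```r
library(ggplot2)
library(dplyr)
library(gghighlight)

# Hypothetical firms with made-up values
toy_firms <- data.frame(
  rd   = c(5, 20, 80, 300, 1200),
  pubs = c(1, 4, 10, 60, 250),
  emp  = c(200, 3000, 12000, 40000, 90000)
)

# filter(): firms with 10,000 or more employees are removed from the data
toy_small <- toy_firms %>%
  filter(emp < 10000)
p_filter <- ggplot(toy_small, aes(rd, pubs)) +
  geom_point()

# gghighlight(): all firms stay in the chart; those failing the condition are greyed out
p_highlight <- ggplot(toy_firms, aes(rd, pubs)) +
  geom_point() +
  gghighlight(emp < 10000)
```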
... or firms with 10,000-50,000 employees...
```{r echo=TRUE, fig.height=6, fig.width=8, message=FALSE, warning=FALSE}
ggplot(data = my_data_tidy, aes(x = rd, y = pubs+1)) +
  geom_point(aes(color = country, size = emp)) +
  scale_size(range = c(0, 3)) +
  geom_smooth() +
  scale_x_log10() +
  scale_y_log10() +
  ggtitle("R&D investment and publications (2011-2015) - 10,000-50,000 employees") +
  xlab("R&D (log scale)") +
  ylab("Number of publications (log scale)") +
  gghighlight(emp >= 10000 & emp <= 50000, keep_scales = TRUE)
```
... or firms with more than 50,000 employees.
```{r echo=TRUE, fig.height=6, fig.width=8, message=FALSE, warning=FALSE}
ggplot(data = my_data_tidy, aes(x = rd, y = pubs+1)) +
  geom_point(aes(color = country, size = emp)) +
  scale_size(range = c(0, 3)) +
  geom_smooth() +
  scale_x_log10() +
  scale_y_log10() +
  ggtitle("R&D investment and publications (2011-2015) - >50,000 employees") +
  xlab("R&D (log scale)") +
  ylab("Number of publications (log scale)") +
  gghighlight(emp > 50000, keep_scales = TRUE)
```
We can combine all these charts using the `patchwork` package.
```{r echo=TRUE, fig.height=8, fig.width=10, message=FALSE, warning=FALSE}
# Base chart with all firms
g1 <- ggplot(data = my_data_tidy, aes(x = rd, y = pubs+1)) +
  geom_point(aes(color = country, size = emp)) +
  scale_size(range = c(0, 3)) +
  geom_smooth() +
  scale_x_log10() +
  scale_y_log10() +
  ggtitle("R&D investment and number of publications (2011-2015)") +
  xlab("R&D (log scale)") +
  ylab("Number of publications (log scale)") +
  theme(legend.position = "bottom")

# Three variants of the base chart, each highlighting one firm-size band
g2 <- g1 +
  theme(legend.position = "none",
        plot.title = element_text(size = 7)) +
  ggtitle("<10,000 employees") +
  gghighlight(emp < 10000, keep_scales = TRUE)

g3 <- g1 +
  theme(legend.position = "none",
        plot.title = element_text(size = 7)) +
  ggtitle("10,000-50,000 employees") +
  gghighlight(emp >= 10000 & emp <= 50000, keep_scales = TRUE)

g4 <- g1 +
  theme(legend.position = "none",
        plot.title = element_text(size = 7)) +
  ggtitle(">50,000 employees") +
  gghighlight(emp > 50000, keep_scales = TRUE)

# g1 on top, the three highlighted charts in a row below, panels tagged A to D
g1 / (g2 + g3 + g4) + plot_annotation(tag_levels = 'A')
```
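The layout operators in `patchwork` can be tried on any set of plots. The sketch below uses the built-in `mtcars` data purely for illustration (it is not part of the workshop data) and also shows `plot_layout(guides = "collect")`, which is not used above and gathers identical legends into a single one.

```r
library(ggplot2)
library(patchwork)

# Three small plots that share the same color legend
p1 <- ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) + geom_point()
p2 <- ggplot(mtcars, aes(hp, mpg, color = factor(cyl))) + geom_point()
p3 <- ggplot(mtcars, aes(disp, mpg, color = factor(cyl))) + geom_point()

# `/` stacks plots vertically, `|` places them side by side
layout_a <- p1 / (p2 | p3)

# Collect the identical legends into one and tag the panels A, B, C
layout_b <- layout_a +
  plot_layout(guides = "collect") +
  plot_annotation(tag_levels = "A")
layout_b
```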
The `facet_wrap()` function is a very helpful tool to produce multiple charts on the basis of a categorical variable. We can produce a chart for each country (recall that we grouped countries into CN, DE, GB, JP, US, and Other).
```{r echo=TRUE, fig.height=8, fig.width=10, message=FALSE, warning=FALSE}
ggplot(data = my_data_tidy, aes(x = rd, y = pubs+1)) +
  geom_point(aes(color = country, size = emp)) +
  scale_size(range = c(0, 3)) +
  geom_smooth() +
  scale_x_log10() +
  scale_y_log10() +
  ggtitle("R&D investment and number of publications (2011-2015)") +
  xlab("R&D (log scale)") +
  ylab("Number of publications (log scale)") +
  theme(legend.position = "bottom") +
  facet_wrap(~country)
```
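Faceting works with any categorical variable, including one derived from a numeric column. As a sketch under stated assumptions, the firm-size bands highlighted earlier could be turned into a variable with `case_when()` and then passed to `facet_wrap()`. The `toy_firms` data frame and the `size_band` column are hypothetical and use made-up values.

```r
library(dplyr)
library(ggplot2)

# Hypothetical firms, including values on the band boundaries and a missing value
toy_firms <- data.frame(
  rd   = c(5, 20, 80, 300, 1200),
  pubs = c(1, 4, 10, 60, 250),
  emp  = c(500, 10000, 50000, 50001, NA)
)

# Conditions are checked in order; a missing `emp` matches none and stays NA
toy_firms <- toy_firms %>%
  mutate(size_band = case_when(
    emp < 10000  ~ "<10,000",
    emp <= 50000 ~ "10,000-50,000",
    emp > 50000  ~ ">50,000"
  ))

# One panel per size band
p_bands <- ggplot(toy_firms, aes(x = rd, y = pubs + 1)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  facet_wrap(~size_band)
```

Unlike the highlighted charts, each panel here contains only the firms in that band, so the rest of the sample is not visible as context.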
Similarly, we can produce a chart by year.
```{r echo=TRUE, fig.height=8, fig.width=10, message=FALSE, warning=FALSE}
ggplot(data = my_data_tidy, aes(x = rd, y = pubs+1)) +
  geom_point(aes(color = country, size = emp)) +
  scale_size(range = c(0, 3)) +
  geom_smooth() +
  scale_x_log10() +
  scale_y_log10() +
  ggtitle("R&D investment and number of publications (2011-2015)") +
  xlab("R&D (log scale)") +
  ylab("Number of publications (log scale)") +
  theme(legend.position = "bottom") +
  facet_wrap(~year)
```
> **Exercise 3**: Produce a chart that compares R&D investment and number of publications for UK firms (10 minutes).