---
title: "Getting Data From the Web"
author: "Frédérique Bone"
date: "08/02/2021"
output: html_document
---
We are going to collect data from the Sussex Innovation Center member directory webpage to build a dataset of start-ups.
This exercise will involve building up code from a few examples, in an iterative manner.
Let's first load the libraries that we need.
```{r setup, include=FALSE}
library(tidyverse)
library(rvest)
```
Using read_html, we are going to load the first page of the firm data.
Using html_nodes, we then select all the containers in which the firms are located; each firm sits in its own 'article' element.
This collects every article on the page, one per firm.
```{r}
SIC_item <- read_html("https://www.sinc.co.uk/member-directory-list") %>%
  html_nodes("article")
```
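To see what html_nodes returns without touching the live site, the sketch below runs the same call on a small inline HTML string. The markup and the firm names in it are made up for illustration and are only assumed to resemble the structure of the directory page.

```{r}
library(rvest)

# Made-up markup standing in for the directory page (an assumption, not the real site)
toy_page <- read_html('
  <html><body>
    <article><h2>Firm A</h2><p>Makes sensors.</p><ul><li>Brighton</li></ul></article>
    <article><h2>Firm B</h2><p>Builds apps.</p><ul><li>Croydon</li></ul></article>
  </body></html>')

# html_nodes() returns every element matching the CSS selector, as a node set
toy_items <- html_nodes(toy_page, "article")

length(toy_items)   # number of <article> containers found
class(toy_items)    # a node set, which can be indexed like a list
```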
Create a variable holding the blurb of the first firm, using double square brackets to access the first item in the list.
```{r}
firm_blurb <- html_node(SIC_item[[1]], "p") %>%
  html_text()
print(firm_blurb)
```
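The double square brackets pick a single node out of the node set, and html_node then returns only the first match inside it. A minimal sketch on invented markup shows this, along with the `trim = TRUE` argument of `html_text()`, which strips the surrounding whitespace that scraped text often carries.

```{r}
library(rvest)

# Invented articles; the first one has two paragraphs and padded text
toy_items <- read_html('
  <article><h2>Firm A</h2><p>  Makes sensors.  </p><p>Second paragraph.</p></article>
  <article><h2>Firm B</h2><p>Builds apps.</p></article>') %>%
  html_nodes("article")

# [[1]] extracts the first article as a single node
toy_first <- toy_items[[1]]

# html_node() keeps only the first <p>; trim = TRUE removes the padding
toy_blurb <- html_node(toy_first, "p") %>%
  html_text(trim = TRUE)
print(toy_blurb)
```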
Use the same code as in the previous chunk to extract the 'firm_name' and the 'firm_location' (spin-out location).
```{r}
firm_name <- html_node(SIC_item[[1]], "h2") %>%
  html_text()
print(firm_name)
firm_location <- html_node(SIC_item[[1]], "li") %>%
  html_text()
print(firm_location)
```
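Not every article is guaranteed to contain all three elements. As far as rvest's documented behaviour goes, html_node returns a missing node when nothing matches and html_text turns that into NA, so the extraction does not stop with an error. The sketch below uses an invented article with no 'li' element to illustrate this.

```{r}
library(rvest)

# Invented article without a location list (an assumption for illustration)
toy_article <- read_html('<article><h2>Firm C</h2><p>No location listed.</p></article>') %>%
  html_node("article")

toy_name <- html_node(toy_article, "h2") %>%
  html_text()

# No <li> inside this article, so the result is a missing value
toy_location <- html_node(toy_article, "li") %>%
  html_text()

is.na(toy_location)
```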
We are going to compile all of the variables into a one-row dataframe:
(i) first by combining the three variables created into a single vector
(ii) second by transforming it into a dataframe (using the transpose function t(), as data.frame() would otherwise turn the vector into a single column)
```{r}
SIC_vec <- c(firm_name, firm_blurb, firm_location)
SIC_vec <- data.frame(t(SIC_vec))
colnames(SIC_vec) <- c("firm_name", "firm_blurb", "firm_location")
SIC <- SIC_vec
head(SIC, n=2)
```
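The effect of the transpose can be checked on a toy vector. This is a small sketch with invented values; it only compares the dimensions obtained with and without t().

```{r}
toy_vec <- c("Firm A", "Makes sensors.", "Brighton")

# Without t(): the vector becomes one column with three rows
toy_col <- data.frame(toy_vec)
dim(toy_col)

# With t(): the vector becomes a 1 x 3 matrix, so data.frame() gives one row
toy_row <- data.frame(t(toy_vec))
colnames(toy_row) <- c("firm_name", "firm_blurb", "firm_location")
dim(toy_row)
```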
Create a for loop to collect the data for the remaining firms on the page, reusing the code from all previous chunks
(i) Create a for loop with i taking values between 2 and the length of SIC_item
(ii) create a vector for each of the items
(iii) transform the vector into a dataframe with the right column names
(iv) add the new dataframe to the SIC dataframe using bind_rows
```{r}
# Start at the second article: the first firm is already in SIC
for (i in seq_along(SIC_item)[-1]) {
  # Blurb (same code as the firm_blurb chunk)
  firm_blurb <- html_node(SIC_item[[i]], "p") %>%
    html_text()
  # Name and location (same code as the firm_name chunk)
  firm_name <- html_node(SIC_item[[i]], "h2") %>%
    html_text()
  firm_location <- html_node(SIC_item[[i]], "li") %>%
    html_text()
  # One-row dataframe with the right column names
  SIC_vec <- c(firm_name, firm_blurb, firm_location)
  SIC_vec <- data.frame(t(SIC_vec))
  colnames(SIC_vec) <- c("firm_name", "firm_blurb", "firm_location")
  # Append the new row to SIC
  SIC <- bind_rows(SIC, SIC_vec)
}
```
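The loop can also be avoided. As a hedged alternative (not the approach used in this exercise), html_node applied to the whole node set returns one result per article, so each column can be built in a single call. The markup below is invented, and the second article deliberately lacks a location.

```{r}
library(rvest)
library(tibble)

# Invented articles standing in for SIC_item
toy_items <- read_html('
  <article><h2>Firm A</h2><p>Makes sensors.</p><ul><li>Brighton</li></ul></article>
  <article><h2>Firm B</h2><p>Builds apps.</p></article>') %>%
  html_nodes("article")

# One html_node() call per column; missing elements become NA
toy_SIC <- tibble(
  firm_name     = toy_items %>% html_node("h2") %>% html_text(),
  firm_blurb    = toy_items %>% html_node("p") %>% html_text(),
  firm_location = toy_items %>% html_node("li") %>% html_text()
)
toy_SIC
```

Because every column has one entry per article, the rows stay aligned even when an element is missing.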
For next time, using all the code above, try to create a for loop over the different pages of the website:
(i) Create a for loop with i taking values between 1 and the number of pages of the website (check first how many pages there are on the website)
(ii) Get the data for each page by adjusting the URL so that it depends on the page number (using "paste0(url, page)")
(iii) create a vector and then a dataframe for each of the items on the page
(iv) add the new dataframe to the SIC dataframe using bind_rows
(v) view the dataframe created
```{r}
# 14 is the page count found on the website; re-check it before running
for (i in 1:14) {
  # Build the URL of page i and collect its articles
  URL <- paste0("https://www.sinc.co.uk/member-directory-list?page=", i)
  SIC_item <- read_html(URL) %>%
    html_nodes("article")
  # seq_along() skips the inner loop safely if a page has no articles
  for (j in seq_along(SIC_item)) {
    # Blurb
    firm_blurb <- html_node(SIC_item[[j]], "p") %>%
      html_text()
    # Name and location
    firm_name <- html_node(SIC_item[[j]], "h2") %>%
      html_text()
    firm_location <- html_node(SIC_item[[j]], "li") %>%
      html_text()
    # One-row dataframe with the right column names
    SIC_vec <- c(firm_name, firm_blurb, firm_location)
    SIC_vec <- data.frame(t(SIC_vec))
    colnames(SIC_vec) <- c("firm_name", "firm_blurb", "firm_location")
    # Append the new row to SIC
    SIC <- bind_rows(SIC, SIC_vec)
  }
}
```
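One possible way to keep the page loop short is to wrap the per-page extraction in a function and bind all pages in one step. The sketch below is an illustration under stated assumptions: `parse_firms()` is a hypothetical helper, not part of rvest, and the pages are invented HTML strings so that the code runs offline.

```{r}
library(rvest)
library(dplyr)

# Hypothetical helper: turns one parsed page into a dataframe of firms
parse_firms <- function(page) {
  items <- html_nodes(page, "article")
  tibble::tibble(
    firm_name     = items %>% html_node("h2") %>% html_text(),
    firm_blurb    = items %>% html_node("p") %>% html_text(),
    firm_location = items %>% html_node("li") %>% html_text()
  )
}

# paste0() is vectorised, so all page URLs can be built at once
toy_urls <- paste0("https://www.sinc.co.uk/member-directory-list?page=", 1:3)

# Invented pages standing in for the result of reading each URL
toy_pages <- list(
  '<article><h2>Firm A</h2><p>Makes sensors.</p><ul><li>Brighton</li></ul></article>',
  '<article><h2>Firm B</h2><p>Builds apps.</p><ul><li>Croydon</li></ul></article>'
)

# Parse each page, then bind all the per-page dataframes in one step
toy_all <- toy_pages %>%
  lapply(read_html) %>%
  lapply(parse_firms) %>%
  bind_rows()
toy_all
```

With the live site, `toy_pages` would be replaced by the URLs in `toy_urls`; pausing briefly between requests with `Sys.sleep()` is a common courtesy to the server.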
Save the dataset as a CSV file
```{r}
write_csv(SIC, "Sussex_innovation_center_firms.csv")
```
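A quick way to check the saved file is to read it back and compare it with the original. The sketch below does this on an invented dataframe written to a temporary file, so it does not touch the real output file.

```{r}
library(readr)

# Invented data with the same columns as SIC
toy_SIC <- data.frame(
  firm_name     = c("Firm A", "Firm B"),
  firm_blurb    = c("Makes sensors.", "Builds apps."),
  firm_location = c("Brighton", NA)
)

# Write to a temporary file, then read it back with all columns as text
toy_path <- tempfile(fileext = ".csv")
write_csv(toy_SIC, toy_path)
toy_back <- read_csv(toy_path, col_types = cols(.default = col_character()))

dim(toy_back)
```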