---
title: "Getting Data From the Web"
author: "Frédérique Bone"
date: "08/02/2021"
output: html_document
---
We are going to collect data from the Sussex Innovation Center webpage to build a dataset of start-ups.
This exercise builds the code up iteratively, starting from a few examples.
Let's first load the libraries that we need.
```{r setup, include=FALSE}
library(tidyverse)
library(rvest)
```
Using read_html, we are going to load the first page of the firm data.
Using html_nodes, we then select all the containers in which the firms are located; each firm sits in an 'article' element.
This collects every article that contains firm data.
```{r}
# Download the first directory page and keep every <article> container (one per firm)
SIC_item <- read_html("https://www.sinc.co.uk/member-directory-list") %>%
  html_nodes("article")
```
Create a variable holding the blurb of the first firm, using double square brackets to access the first item in the list.
```{r}
# Take the first <p> inside the first article and keep only its text
firm_blurb <- html_node(SIC_item[[1]], "p") %>%
  html_text()
print(firm_blurb)
```
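To see how html_nodes, double square brackets and html_node fit together without depending on the live website, here is a minimal sketch on a small inline page. The HTML, the firm names and the `toy_` variables are made up for illustration and are not taken from the real directory.

```r
library(rvest)

# Hypothetical stand-in for the directory page: two <article> containers
toy_page <- read_html('<html><body>
  <article><h2>Firm A</h2><p>Blurb A</p></article>
  <article><h2>Firm B</h2><p>Blurb B</p></article>
</body></html>')

# html_nodes() returns every match, here one node per article
toy_items <- html_nodes(toy_page, "article")
length(toy_items)

# [[1]] picks the first article; html_node() returns the first <p> inside it
toy_blurb <- html_node(toy_items[[1]], "p") %>%
  html_text()
print(toy_blurb)
```

The same pattern applies to SIC_item: html_nodes gives the whole set of containers, and html_node looks for a single element inside one of them.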
Using the same code as in the previous chunk (lines 30-35), extract the 'firm_name' and the 'firm_location' (spin-out location).
```{r}
```
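One possible shape for this step is sketched below on a made-up article. The selectors "h2" and ".location" are assumptions chosen for the toy HTML only; the real page may use different tags or classes, so inspect it in the browser before reusing them.

```r
library(rvest)

# Hypothetical article; the tags and the class name are assumptions
toy_item <- read_html('<article>
  <h2>Firm A</h2>
  <p>Blurb A</p>
  <span class="location">Brighton</span>
</article>') %>%
  html_node("article")

# Same pattern as for the blurb, with a different CSS selector each time
toy_name <- html_node(toy_item, "h2") %>%
  html_text()
toy_location <- html_node(toy_item, ".location") %>%
  html_text()

print(toy_name)
print(toy_location)
```

Only the selector changes between the three variables; the html_node and html_text calls stay the same.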
We are going to compile all of the variables into a one-row dataframe:
(i) first by combining the three variables created into a single vector
(ii) second by transforming it into a dataframe (using the transpose function, as a vector is by default turned into a column rather than a row)
```{r}
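# (i) combine the three variables into a single character vector
SIC_vec <- c(firm_name, firm_blurb, firm_location)
# (ii) transpose so the vector becomes one row, then convert it to a dataframe
SIC_vec <- data.frame(t(SIC_vec))
colnames(SIC_vec) <- c("firm_name", "firm_blurb", "firm_location")
# SIC will accumulate one row per firm
SIC <- SIC_vec
head(SIC, n = 2)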
```
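The effect of the transpose can be checked with base R alone. The sketch below uses made-up values in place of the scraped ones and compares the shape of the dataframe with and without t().

```r
# Hypothetical values standing in for one scraped firm
toy_vec <- c("Firm A", "Blurb A", "Brighton")

# Without t(): the vector becomes one column with three rows
toy_col <- data.frame(toy_vec)
dim(toy_col)

# With t(): the vector becomes a 1 x 3 matrix, so the dataframe has one row
toy_row <- data.frame(t(toy_vec))
colnames(toy_row) <- c("firm_name", "firm_blurb", "firm_location")
dim(toy_row)
```

The first dataframe has three rows and one column, the second has one row and three columns, which is the layout needed to stack firms on top of each other.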
Create a for loop to collect the rest of the data on the page, reusing the code from all the previous chunks:
(i) Create a for loop with i taking values between 2 and the length of SIC_item
(ii) Create a vector for each of the items
(iii) Transform the vector into a dataframe with the right column names
(iv) Add the new dataframe to the SIC dataframe using bind_rows
```{r}
```
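The four steps above could be sketched as follows, again on a made-up page rather than the live one. The helper `item_to_row`, the `toy_` names and the selectors "h2" and ".location" are hypothetical and only match the toy HTML.

```r
library(rvest)
library(dplyr)

# Hypothetical page with three firms
toy_page <- read_html('<html><body>
  <article><h2>Firm A</h2><p>Blurb A</p><span class="location">Brighton</span></article>
  <article><h2>Firm B</h2><p>Blurb B</p><span class="location">Falmer</span></article>
  <article><h2>Firm C</h2><p>Blurb C</p><span class="location">Croydon</span></article>
</body></html>')
toy_items <- html_nodes(toy_page, "article")

# Hypothetical helper: steps (ii) and (iii) for a single article
item_to_row <- function(item) {
  item_vec <- c(
    html_text(html_node(item, "h2")),         # assumed selector for the name
    html_text(html_node(item, "p")),          # blurb, as in the exercise
    html_text(html_node(item, ".location"))   # assumed selector for the location
  )
  item_df <- data.frame(t(item_vec), stringsAsFactors = FALSE)
  colnames(item_df) <- c("firm_name", "firm_blurb", "firm_location")
  item_df
}

# The first firm starts the dataframe, as in the earlier chunk
toy_SIC <- item_to_row(toy_items[[1]])

# (i) loop over items 2..n; seq_along()[-1] is empty when there is only one item
for (i in seq_along(toy_items)[-1]) {
  # (iv) append the new one-row dataframe
  toy_SIC <- bind_rows(toy_SIC, item_to_row(toy_items[[i]]))
}

print(toy_SIC)
```

Using seq_along(toy_items)[-1] instead of 2:length(toy_items) is a small safety choice: on a page with a single article, 2:1 would count backwards and run the loop twice.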
For next time, using all the code above, try to create a for loop over the different pages of the website:
(i) Create a for loop with i taking values between 1 and the number of pages of the website (check first how many pages there are on the website)
(ii) Get the data for each page by adjusting the URL so that it depends on the page number (using "paste0(url, page)")
(iii) Create a vector, then a dataframe, for each of the items on the page
(iv) Add the new dataframe to the SIC dataframe using bind_rows
(v) View the dataframe created
```{r}
```
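Step (ii) can be tried out without downloading anything. The sketch below only builds the page URLs; the "?page=" suffix and the page count of 3 are assumptions made for illustration, so check the real pagination links and the real number of pages on the website first.

```r
# Assumption: pages are reached by appending a page number to a base URL.
# The "?page=" query string is hypothetical, not confirmed for this website.
url <- "https://www.sinc.co.uk/member-directory-list?page="
n_pages <- 3  # hypothetical; read the real count off the website

page_urls <- character(0)
for (page in 1:n_pages) {
  # paste0() glues the base URL and the page number with no separator
  page_url <- paste0(url, page)
  page_urls <- c(page_urls, page_url)
  # In the full exercise, read_html(page_url) and the item loop would go here
}

print(page_urls)
```

Once the URL pattern is confirmed, the body of this loop is where read_html and the per-item loop from the previous step would be placed.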