Scraping data
The current Yelp dataset covers very few large metro areas in the US, so we decided to scrape our own data from Yelp pages.
We are using these papers as references -
[1] D. Jurafsky, V. Chahuneau, B. R. Routledge and N. A. Smith, "Narrative framing of consumer sentiment in online restaurant reviews," First Monday, vol. 19, no. 4, 2014.
[2] Chahuneau, Gimpel, Routledge, Scherlis and Smith, "Word Salad: Relating food prices and descriptions," EMNLP 2012.
https://homes.cs.washington.edu/~nasmith/papers/chahuneau+gimpel+routledge+scherlis+smith.emnlp12.pdf
>>From the paper >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
We crawled Allmenus.com (www.allmenus.com)
to gather menus for restaurants in seven
U.S. cities: Boston, Chicago, Los Angeles, New
York, Philadelphia, San Francisco, and Washington,
D.C. Each menu includes a list of item names
with optional text descriptions and prices. Most Allmenus
restaurant pages contain a link to the corresponding
page on Yelp (www.yelp.com) with
metadata and user reviews for the restaurant, which
we also collected.
The metadata consist of many fields for each
restaurant, which can be divided into three categories:
location (city, neighborhood, transit stop),
services available (take-out, delivery, wifi, parking,
etc.), and ambience (good for groups, noise level,
attire, etc.). Also, the category of food and a price
range ($ to $$$$, indicating the price of a typical
meal at the restaurant) are indicated. The user reviews
include a star rating on a scale of 1 to 5.
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
So we are going to use Helena to scrape Yelp reviews for the restaurants whose menus we have already constructed.
To scrape the reviews, go to -
Restaurant > click on a top dish > get around 20 reviews for each dish
For each review, let's try to follow the same format that Yelp follows -
{
// string, 22 character unique review id - generate our own - at the end of the whole scraping task
"review_id": "zdSx_SD6obEhz9VrW9uAWA",
// have to scrape this even though we may not use it
"user_name": "Larry H.",
// string, 22 character business id, maps to business in business.json - generate our own
"business_id": "tnhfDv5Il8EaGSXZGiuQGg",
//scrape from page
"business_name" : "Pike Place Chowder",
// integer, star rating - this scrape needs to be cleaned
"stars": 4,
// string, date formatted YYYY-MM-DD
"date": "2016-03-09",
// string, the review itself
"text": "Great place to hang out after work: the prices are decent, and the ambience is fun. It's a bit loud, but very lively. The staff is friendly, and the food is good. They have a good selection of drinks."
}
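A minimal sketch of cleaning one scraped review into the format above. It assumes the scraper hands back the rating as text such as "4 star rating" and the date as "3/9/2016"; both raw formats, the function names, and the row keys are assumptions, so adjust them to whatever Helena actually outputs. The two ID fields are left empty because we generate them at the end of the scraping task.

```python
import re
from datetime import datetime


def clean_stars(raw):
    """Pull the first digit 1-5 out of a scraped rating string."""
    match = re.search(r"[1-5]", raw)
    if match is None:
        raise ValueError(f"no star rating found in {raw!r}")
    return int(match.group())


def clean_date(raw, fmt="%m/%d/%Y"):
    """Convert a scraped date (assumed M/D/YYYY) to YYYY-MM-DD."""
    return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")


def clean_review(row):
    """Build a review record from one scraped row (a dict of raw strings)."""
    return {
        "review_id": None,      # filled in at the end of the scraping task
        "user_name": row["user_name"].strip(),
        "business_id": None,    # filled in once business IDs exist
        "business_name": row["business_name"].strip(),
        "stars": clean_stars(row["stars"]),
        "date": clean_date(row["date"]),
        "text": row["text"].strip(),
    }
```

For example, clean_review({"user_name": "Larry H.", "business_name": "Pike Place Chowder", "stars": "4 star rating", "date": "3/9/2016", "text": "Great place."}) gives "stars": 4 and "date": "2016-03-09".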
Businesses - write these down manually, then combine all businesses to create an ID for each
{
// string, 22 character unique string business id - generate our own
"business_id": "tnhfDv5Il8EaGSXZGiuQGg",
// string, the business's name
"name": "Garaje",
// string, the city
"city": "San Francisco",
// string, 2 character state code, if applicable
"state": "CA",
// float, star rating, rounded to half-stars
"stars": 4.5,
// an array of strings of business categories
"categories": [
"Mexican",
"Burgers",
"Gastropubs"
]
}
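One way to generate our own 22-character IDs, as a rough sketch: 16 random bytes encoded as URL-safe base64 come out to exactly 22 characters once the padding is stripped, which matches the look of the Yelp IDs above. The helper names new_id and assign_business_ids are made up for illustration, and keying on the business name alone assumes no two of our restaurants share a name.

```python
import base64
import uuid


def new_id():
    """Return a random 22-character URL-safe string (16 random bytes in base64)."""
    raw = uuid.uuid4().bytes  # 16 bytes
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def assign_business_ids(business_names):
    """Map each distinct business name to one generated ID."""
    ids = {}
    for name in business_names:
        if name not in ids:
            ids[name] = new_id()
    return ids
```

After combining all businesses, ids = assign_business_ids(names) gives a lookup table, and each review can then be tagged with ids[review["business_name"]]. The same new_id() helper could also produce the review IDs.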
Business menus
{
// from the previous business json
"business_id": "tnhfDv5Il8EaGSXZGiuQGg",
// string, a single menu item (one record per item)
"menu_item": "cake",
// integer flag for popularity (1 = popular)
"is_popular": 1
}
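A small sketch of turning one restaurant's menu into records of this shape, one record per item. The function name and arguments are hypothetical, and using 0 for items that are not popular is our assumption (the notes only show the value 1).

```python
def menu_records(business_id, menu_items, popular_items):
    """Build one menu record per item, flagging the popular ones."""
    # Compare case-insensitively so "Cake" and "cake" count as the same dish.
    popular = {item.lower() for item in popular_items}
    return [
        {
            "business_id": business_id,
            "menu_item": item,
            "is_popular": int(item.lower() in popular),
        }
        for item in menu_items
    ]
```

For example, menu_records("tnhfDv5Il8EaGSXZGiuQGg", ["cake", "fish tacos"], ["Cake"]) marks "cake" with is_popular 1 and "fish tacos" with 0.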
After running the model -
{
// string, 22 character unique review id - generate our own - at the end of the whole scraping task
"review_id": "zdSx_SD6obEhz9VrW9uAWA",
// have to scrape this even though we may not use it
"user_name": "Larry H.",
// string, 22 character business id, maps to business in business.json - generate our own
"business_id": "tnhfDv5Il8EaGSXZGiuQGg",
//scrape from page
"business_name" : "Pike Place Chowder",
// integer, star rating - this scrape needs to be cleaned
"stars": 4,
// string, date formatted YYYY-MM-DD
"date": "2016-03-09",
// string, the review itself
"text": "Great place to hang out after work: the prices are decent, and the ambience is fun. It's a bit loud, but very lively. The staff is friendly, and the food is good. They have a good selection of drinks.",
// named entities found in the review text
"Entities": ["item1", "item2", "item3", ...]
}
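The model itself is not described in these notes, so the sketch below only shows how its output could be merged into a review record. find_menu_mentions is a hypothetical stand-in that does plain case-insensitive matching against known menu items; it is not the actual model, and add_entities is likewise an illustrative name.

```python
import re


def find_menu_mentions(text, menu_items):
    """Stand-in for the model: return the menu items mentioned in the text."""
    found = []
    for item in menu_items:
        # Whole-word, case-insensitive match on the item name.
        pattern = r"\b" + re.escape(item) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(item)
    return found


def add_entities(review, menu_items):
    """Return a copy of the review with an "Entities" list attached."""
    enriched = dict(review)
    enriched["Entities"] = find_menu_mentions(review["text"], menu_items)
    return enriched
```

The original review dict is left untouched, so the raw scraped data and the model output can be stored separately if needed.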