Getting started
============================
Create and activate a Python virtual environment, then install the
requirements:
$ python3 -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt
Running a Tradeshow Scrape
============================
The generic spider at scraper/spiders/generic_tradeshow_spider.py should
be able to handle most scraping requests that we get.
If you need to scrape something that follows the same format as sites like
http://s23.a2zinc.net/clients/lrp/hrtechnologyconference2017/Public/exhibitors.aspx?Index=All
or
http://events.pennwell.com/DTECH2018/Public/exhibitors.aspx?_ga=2.91461086.575732828.1507662078-248451487.1507662078
then you should be able to use this as-is by running the wrapper script:
./scrape.sh <your-start-url>
This script will run the "tradeshow" spider and output CSV to a file named
tradeshow-scrape.csv. If tradeshow-scrape.csv already exists, it will be
overwritten with each run.
Otherwise, you may need to create a custom spider, as in
nrf2018_custom_spider.py.
Scrapy Basics
============================
Callback functions
----------------------------
Each Rule should have a callback function
created for it - this is what will be executed on the HTML of a resulting
page when scrapy follows a link.
Items
----------------------------
For each type of item you need to capture information about, add an Item
to scraper/items.py. This is where you define the fields you want to
store data in for each item and how to populate those fields. Your
Rule callback functions should invoke these items.
Running a scraper
============================
$ cd scraper
$ scrapy crawl nameofyourspider \
--output=location-of-output-file --output-format=[csv,jl,etc]
Example:
$ scrapy crawl hrtech2017 --output-format=csv --output=hrtech2017.csv
For more details and tutorials, see the Scrapy documentation:
https://doc.scrapy.org/