Use config files to crawl web data and load it into a database.
Project Name | Author | Create Date | OS | Language |
---|---|---|---|---|
Template Crawler | @TauWu | 2018-07-19 | Linux (Debian-based) | Python 3 |
- Code
    - Transplant Module `Downloader` from spider_anjuke
    - Transplant Module `Proxies` from spider_anjuke
    - Transplant Module `Parser` from spider_anjuke
    - Transplant Module `Common` from spider_anjuke
    - Create Module `Config Parser`
    - Create Module `Common Crawler`
    - Transplant Module `Test`
- Full Test
- ETL Project
    - Extract data from Redis
    - Transform and clean data
    - Load and save data
- Log Module
    - Log into files.
    - Log into database.
- Async Func
    - Use async functions to process ziroom price images concurrently.
- Mail Monitor (a minimal mail sketch follows this list)
    - Package the email module.
    - Send mail when the crawler finishes.
    - Send mail when the ETL finishes.
- Mail Reporter
    - Send mail with an xlsx file via cron.
- Data Warehouse
- Data Mining
- Data API
- Price Trend Prediction
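The Mail Monitor item boils down to a thin wrapper around smtplib; here is a minimal sketch, where the SMTP host, sender address and credentials are placeholders rather than the project's real settings.

```python
# mail_monitor_sketch.py -- illustrative; SMTP host and credentials are placeholders.
import smtplib
from email.mime.text import MIMEText

def send_notice(subject: str, body: str, to_addr: str) -> None:
    """Send a plain-text notification mail, e.g. when the crawler or ETL finishes."""
    msg = MIMEText(body, "plain", "utf-8")
    msg["Subject"] = subject
    msg["From"] = "crawler@example.com"                      # placeholder sender
    msg["To"] = to_addr

    with smtplib.SMTP("smtp.example.com", 25) as server:     # placeholder SMTP server
        server.login("crawler@example.com", "password")      # placeholder credentials
        server.sendmail(msg["From"], [to_addr], msg.as_string())

# Hypothetical usage at the end of a crawl run:
# send_notice("Crawler finished", "ziroom crawler finished", "admin@example.com")
```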
# Python interpreter.
apt-get install python3
# pip, used to install third-party modules.
apt-get install python3-pip
# Save Hash Map data.
apt-get install redis-server
# Save structured data.
apt-get install mysql-server
# Use tesseract-ocr to convert images to strings.
apt-get install tesseract-ocr
# A friendlier library for making HTTP requests.
pip3 install requests
# Connect to MySQL and control it.
pip3 install PyMySQL
# Read and write config files.
pip3 install configparser
# Parse HTML files into an element tree.
pip3 install lxml
# Provide a random User-Agent for each request.
pip3 install fake-useragent
# Connect to Redis server and control it.
pip3 install redis
# Edit images.
pip3 install pillow
# Use tesseract-ocr from Python.
pip3 install pytesseract
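After installing, a quick import-and-connect check like the hypothetical snippet below can confirm everything is in place; the hosts and credentials are assumptions, adjust them to your own Redis and MySQL setup.

```python
# deps_check.py -- hypothetical sanity check, not part of this repo.
# Confirms the third-party modules import and that local Redis/MySQL answer.
import configparser
import lxml.etree
import pymysql
import pytesseract
import redis
import requests
from fake_useragent import UserAgent
from PIL import Image

r = redis.Redis(host="127.0.0.1", port=6379)              # assumed local Redis
print("redis ping:", r.ping())

conn = pymysql.connect(host="127.0.0.1", user="root",      # assumed local MySQL account
                       password="", database="mysql")
print("mysql connected:", conn.open)
conn.close()
```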
Different websites expose their data in different ways. For instance, some sites publish the house info list through externally exposed HTTP APIs, while others do not, because their pages are rendered from templates. The crawler therefore has to support several request/parse types to cover these cases. The crawler types are enumerated below.

Type No. | Crawler method |
---|---|
1 | GET request to an HTTP API, parsing the JSON object with json. |
2 | GET request to a webpage, parsing the HTML content with lxml. |
3 | POST request to an HTTP API, parsing the JSON object with json. |
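The minimal sketch below illustrates the three types with requests, json and lxml; the URLs, parameters and XPath are placeholders rather than endpoints used by this project.

```python
# crawler_types_demo.py -- illustrative only; URLs and XPath are placeholders.
import json
import requests
from lxml import etree

# Type 1: GET an HTTP API and parse the JSON body.
resp = requests.get("https://example.com/api/houses?page=1", timeout=10)
houses = json.loads(resp.text)

# Type 2: GET a template-rendered webpage and parse the HTML with lxml.
resp = requests.get("https://example.com/rent/page/1", timeout=10)
tree = etree.HTML(resp.text)
titles = tree.xpath("//div[@class='house-title']/text()")

# Type 3: POST an HTTP API and parse the JSON body.
resp = requests.post("https://example.com/api/search",
                     data={"city": "sh", "page": 1}, timeout=10)
results = resp.json()
```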
- _output
- _test
- .vscode
- config
- constant
- database
- do
- log
- module
- util
- .gitignore
- crawler_main.py
- etl_main.py
- LICENSE
- README.md
- start_crawler.sh
- start_etl.sh
- stop_all.sh
- `_output`: Intermediate temporary files, for instance xlsx files and image files.
- `_test`: Test code used while developing or debugging; this folder is not pushed to git.
- `.vscode`: Visual Studio Code config files; this folder is not pushed to git.
- `config`: Crawler, ETL and sys config files. The file sys.cfg is not pushed to git.
- `constant`: Constant values, configs, dicts and so on.
- `database`: SQL files.
- `do`: Try to do it!
    - crawler.py => runs the crawler.
    - etl.py => runs the ETL project.
- `log`: Log files, with subfolders named after the different projects.
- `module`: Modules used by `do`.
    - config => Reads the config files in ./config and returns them.
    - database => Builds and executes SQL statements.
    - parser => Parses the requested data one by one with lxml or json and saves it into the Redis server (a rough sketch follows this folder list).
        - detail.py => Parses a detail webpage or HTTP API JSON and updates the dict in Redis.
        - list.py => Parses a list webpage or HTTP API JSON and updates the dict in Redis.
        - extra.py => Extra functions for a specific crawler; for instance, ziroom shows the house price as an image, which needs extra processing.
    - redis => Redis scanner; scans the redis-server to get the request key list.
    - request => Requests the list or detail URLs through the multi-proxy ProxiesRequest module and yields the content to the parser module.
- `util`: Base tools used by `do` and `module`.
    - common => Common extensions.
        - date.py => Classes Time and DateTime.
        - logger.py => Class LogBase.
        - timeout.py => Function set_timeout.
        - tools.py => Tool functions.
    - config => Config reader and writer module.
    - database => Database connector and executor.
    - redis => Redis connector and executor.
    - web => Multi-proxy request module and tests.
    - xlsx => Reads and writes xlsx files.
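As a rough illustration of the list-parser flow described above (the XPath, field names and Redis key layout are assumptions, not the project's actual schema):

```python
# list_parser_sketch.py -- illustrative; XPath and key names are assumptions.
import json
import redis
from lxml import etree

r = redis.Redis(host="127.0.0.1", port=6379)  # assumed local Redis

def parse_list_page(html: str):
    """Parse one list page and push each house dict into Redis."""
    tree = etree.HTML(html)
    for item in tree.xpath("//li[@class='house-item']"):      # placeholder XPath
        house = {
            "title": item.xpath("string(.//h3)").strip(),
            "detail_url": item.xpath("string(.//a/@href)"),
        }
        # One hash field per house, keyed by its detail URL (assumed layout).
        r.hset("house:list", house["detail_url"], json.dumps(house, ensure_ascii=False))
```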
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
- Here is the crawl difficulty of the different websites, in descending order: ziroom > lianjia > qk.
- To get a rent price from ziroom, you have to download a temporary image and then extract the digits from it with an OCR tool. The ziroom page gives the index of each digit, and you have to join them yourself (a minimal OCR sketch follows these notes).
- TODO: This project uses `gevent` to download webpages and `async def` / `await` requests to download images. It could be quicker to turn the mixed `gevent` + `async def` asynchronous mode into pure `gevent` mode or pure `async def` mode, because switching between the two wastes time on thread exchanges.
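A minimal sketch of the ziroom OCR step mentioned above, using pillow and pytesseract; the image URL and digit indices are made-up examples, and ziroom's real image format may differ.

```python
# ziroom_price_ocr_sketch.py -- illustrative; URL and indices are assumptions.
from io import BytesIO

import pytesseract
import requests
from PIL import Image

# Download the temporary price image (placeholder URL).
img_bytes = requests.get("https://example.com/price_digits.png", timeout=10).content
img = Image.open(BytesIO(img_bytes))

# OCR the digit strip; restrict tesseract to digits for better accuracy.
digits = pytesseract.image_to_string(
    img, config="--psm 7 -c tessedit_char_whitelist=0123456789"
).strip()

# The page gives the index of each digit in the strip; join them to form the price.
offsets = [3, 0, 2, 1]                         # assumed example indices from the page
price = "".join(digits[i] for i in offsets)
print(price)
```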
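On the last point, a minimal `async def` / `await` sketch of concurrent image downloads; it wraps the blocking requests call in a thread pool, which is just one possible approach and may not match the project's actual client.

```python
# async_img_download_sketch.py -- illustrative; URLs are placeholders.
import asyncio
import requests

async def fetch_img(url: str) -> bytes:
    """Download one image without blocking the event loop."""
    loop = asyncio.get_event_loop()
    return await loop.run_in_executor(None, lambda: requests.get(url, timeout=10).content)

async def main(urls):
    imgs = await asyncio.gather(*(fetch_img(u) for u in urls))
    for url, data in zip(urls, imgs):
        print(url, len(data), "bytes")

if __name__ == "__main__":
    urls = ["https://example.com/p1.png", "https://example.com/p2.png"]  # placeholders
    asyncio.get_event_loop().run_until_complete(main(urls))
```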