Template Crawler

Description

Uses config files to crawl web data and load it into a database.

Basic Info

Project Name	Author	CreateDate	OS	Language
Template Crawler	@TauWu	2018-07-19	Linux (based on Debian)	Python 3

TODOs

Future TODOs

Data Warehouse
Data Mining
Data API
Price Trend Prediction

Flow Chart

Requirements

Software Needs

# Python interpreter.
apt-get install python3

# Pip tools to install 3rd-party modules.
apt-get install python3-pip

# Save Hash Map data.
apt-get install redis-server

# Save structured data.
apt-get install mysql-server

# Use tesseract-ocr to convert img to string.
apt-get install tesseract-ocr

Module Needs

# Friendlier HTTP client for sending requests.
pip3 install requests

# Connect to MySQL and control it.
pip3 install PyMySQL

# Read and write config files (already part of the Python 3 standard library; the PyPI package is a backport).
pip3 install configparser

# Parse HTML files into an element tree.
pip3 install lxml

# Provide a random User-Agent when requesting a host.
pip3 install fake-useragent

# Connect to Redis server and control it.
pip3 install redis

# Edit images.
pip3 install pillow

# Use tesseract-ocr in python.
pip3 install pytesseract
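
After installing, a quick check that every module above can be imported may save some debugging later. The sketch below is not part of the project; the mapping from pip package names to import names reflects how these packages are normally imported, and the function name `find_missing` is made up for illustration.

```python
import importlib.util

# pip package name -> name used in `import` statements
REQUIRED_MODULES = {
    "requests": "requests",
    "PyMySQL": "pymysql",
    "configparser": "configparser",
    "lxml": "lxml",
    "fake-useragent": "fake_useragent",
    "redis": "redis",
    "pillow": "PIL",
    "pytesseract": "pytesseract",
}


def find_missing(modules=REQUIRED_MODULES):
    """Return the pip package names whose import name cannot be found."""
    return [
        package
        for package, import_name in modules.items()
        if importlib.util.find_spec(import_name) is None
    ]


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing modules: pip3 install " + " ".join(missing))
    else:
        print("All required modules are importable.")
```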

Crawler Types

Different websites expose their data in different ways. For instance, some sites serve the house info list through externally exposed HTTP APIs; however, other sites don't provide them, because their pages are rendered from templates. Thus, the crawler provides different types to solve these problems. Here is the enumeration of crawler types.

Type No.	Crawler method
1	`GET` request to an HTTP API, parsing the JSON object with json.
2	`GET` request to a webpage, parsing the HTML content with lxml.
3	`POST` request to an HTTP API, parsing the JSON object with json.
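
The three types in the table could be dispatched roughly as in the sketch below. It is an illustration only, not the project's code: the names `CrawlerType`, `build_request` and `parse_body` are hypothetical, requests are built with the standard library but never sent, and the standard-library `html.parser` stands in for lxml so the sketch has no third-party dependency.

```python
import json
from enum import IntEnum
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import Request


class CrawlerType(IntEnum):
    GET_JSON = 1   # GET an HTTP API, parse JSON
    GET_HTML = 2   # GET a webpage, parse HTML
    POST_JSON = 3  # POST to an HTTP API, parse JSON


class TitleParser(HTMLParser):
    """Collects the text of <title>; a stand-in for an lxml query."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def build_request(crawler_type, url, params=None):
    """Build (but do not send) the request for the given crawler type."""
    crawler_type = CrawlerType(crawler_type)
    query = urlencode(params or {})
    if crawler_type is CrawlerType.POST_JSON:
        # Type 3 sends the parameters in the request body.
        return Request(url, data=query.encode("utf-8"), method="POST")
    # Types 1 and 2 send the parameters in the query string.
    full_url = f"{url}?{query}" if query else url
    return Request(full_url, method="GET")


def parse_body(crawler_type, body):
    """Parse a response body according to the crawler type."""
    crawler_type = CrawlerType(crawler_type)
    if crawler_type is CrawlerType.GET_HTML:
        parser = TitleParser()
        parser.feed(body)
        return {"title": parser.title}
    return json.loads(body)
```

Keeping request building and body parsing separate means a new type number only needs one new branch in each function.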

Code Directory Structure

_output
_test
.vscode
config
constant
database
do
log
module
util
.gitignore
crawler_main.py
etl_main.py
LICENSE
README.md
start_crawler.sh
start_etl.sh
stop_all.sh

_output

The folder for intermediate temporary files, for instance xlsx files and img files.

_test

Test code used while developing or debugging. This folder won't be pushed to git.

.vscode

Visual Studio Code config files. This folder won't be pushed to git.

config

Contains the crawler, etl and sys config files. File sys.cfg won't be pushed to git.
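
As a rough idea of how such a config file could be read with configparser, here is a minimal sketch. The section names, keys and values are invented for illustration and do not come from the project's actual config files.

```python
import configparser

# Hypothetical config content; the real files in ./config may look different.
SAMPLE_CFG = """
[crawler]
name = example_site
type = 2
list_url = http://example.com/list?page={page}

[database]
host = localhost
port = 3306
"""


def load_config(text):
    """Parse INI-style text into a plain {section: {key: value}} dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


cfg = load_config(SAMPLE_CFG)
# configparser returns every value as a string, so convert explicitly.
crawler_type = int(cfg["crawler"]["type"])
first_page = cfg["crawler"]["list_url"].format(page=1)
```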

constant

Constant values, configs, dicts and so on.

database

SQL files here.

do

Try to do it!

crawler.py => Runs the crawler.
etl.py => Runs the etl project.

log

Log files are saved here, in subfolders named after the different projects.

module

Contains the modules used by do.

config => Reads config files in ./config and returns their contents.
database => Creates and executes SQL strings.
parser => Parses the requested data one by one with the lxml or json module, then saves it to the Redis server.
- detail.py => Parses a detail webpage or HTTP API JSON and updates the dict in Redis.
- list.py => Parses a list webpage or HTTP API JSON and updates the dict in Redis.
- extra.py => Extra functions for this crawler. For instance, ziroom shows the house price as a picture, which needs extra handling.
redis => Redis scanner; scans the redis-server to get the list of request keys.
request => Multi-proxy request module; requests the ordered list URLs or detail URLs through the ProxiesRequest module and yields the content to the parser module.
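
The hand-off described above (request yields content, parser turns it into dicts, the dicts are merged into Redis) might look like the sketch below. All function names, the key prefix and the JSON layout are assumptions made for illustration, and a plain dict stands in for the Redis server so the example runs without one.

```python
import json


def request_pages(urls, fetch):
    """Yield (url, body) pairs; `fetch` is injected so no network is needed."""
    for url in urls:
        yield url, fetch(url)


def parse_list(body):
    """Parse a JSON list response into {house_id: fields} (hypothetical layout)."""
    return {str(item["id"]): item for item in json.loads(body)["data"]}


def save_to_store(store, prefix, records):
    """Merge each record into the hash stored under '<prefix>:<house_id>'."""
    for house_id, fields in records.items():
        store.setdefault(f"{prefix}:{house_id}", {}).update(fields)


def run(urls, fetch, store, prefix="house"):
    """Request each URL, parse the body, and merge the results into the store."""
    for _url, body in request_pages(urls, fetch):
        save_to_store(store, prefix, parse_list(body))
    return store
```

Because later pages update the same key, fields found by the list parser and the detail parser end up in one record per house.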

util

Contains the base tools used by do and module.

common => Common extensions here.
- date.py => Class Time and DateTime.
- logger.py => Class LogBase.
- timeout.py => Function set_timeout.
- tools.py => Tool functions.
config => Config reader and writer module.
database => Database connector and executor.
redis => Redis connector and executor.
web => Multi-Proxy-Request module and test.
xlsx => Read and write xlsx files.
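
To give a feel for the kind of helper that timeout.py lists, here is one possible shape for a `set_timeout` decorator. The project's real signature and behaviour are not documented here, so the parameters `seconds` and `default` and the thread-based approach are assumptions.

```python
import functools
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def set_timeout(seconds, default=None):
    """Return `default` if the decorated call takes longer than `seconds`."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            pool = ThreadPoolExecutor(max_workers=1)
            future = pool.submit(func, *args, **kwargs)
            try:
                return future.result(timeout=seconds)
            except FutureTimeout:
                # The worker thread is not killed; it finishes in the background.
                return default
            finally:
                pool.shutdown(wait=False)

        return wrapper

    return decorator


@set_timeout(0.05, default="timed out")
def slow_request():
    time.sleep(0.3)  # simulates a host that responds too slowly
    return "done"
```

Note that this variant only stops waiting for the result; the slow call itself keeps running until it returns.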

LICENSE

                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

Tips

Crawl difficulty of the different websites, from hardest to easiest: ziroom > lianjia > qk.
To get the rent price from ziroom, you need to download a temporary img and then read the digit string from it with an OCR tool. The ziroom website gives the index of each digit, and you join the indexed digits to form the price.
TODO: This project uses gevent to download webpages and async def / await requests to download imgs. It could be quicker to switch from the mixed gevent + async mode to gevent only or async only, because of the overhead wasted on switching between threads.
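
The ziroom price tip above can be sketched as follows. The digit strip and index list are made-up sample values, and the function names are hypothetical; in the real flow the raw text would presumably come from an OCR call such as `pytesseract.image_to_string` on the downloaded img.

```python
def clean_ocr_digits(raw_text):
    """Keep only ASCII digits, dropping spaces and newlines that OCR may add."""
    return "".join(ch for ch in raw_text if ch in "0123456789")


def join_price(digit_strip, indexes):
    """Pick one digit per index from the strip and join them into the price."""
    return int("".join(digit_strip[i] for i in indexes))


# Sample values only: OCR output of the digit image and the indexes from the page.
strip = clean_ocr_digits(" 86520 39147\n")
price = join_price(strip, [2, 1, 4, 4])
```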
