StreamSpider

A web spider built on the Apache Storm platform.

Environment

Features

Incremental scraping and analysis

Define allowed URL patterns

Customize the scraping strategy for a given pattern:

  • limitation
  • reset interval
  • expire time
  • parallelism

Update settings dynamically

The system re-fetches settings once the cached copy expires, so settings can be updated dynamically.
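As a minimal, illustrative sketch of that idea (not the project's actual code), a settings holder could cache the `url_pattern_setting_{pattern}` hash and re-read it from Redis after the cache period elapses. The class name, the Jedis client, and the millisecond cache period are assumptions:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import redis.clients.jedis.Jedis;

// Hypothetical settings cache: re-reads the url_pattern_setting_{pattern}
// hash once the cached copy is older than cacheMillis, so edits made in
// Redis take effect without restarting the topology.
public class SettingsCache {
    private static class Entry {
        final Map<String, String> fields;
        final long loadedAt;
        Entry(Map<String, String> fields, long loadedAt) {
            this.fields = fields;
            this.loadedAt = loadedAt;
        }
    }

    private final Jedis jedis;
    private final long cacheMillis;
    private final Map<String, Entry> cache = new ConcurrentHashMap<>();

    public SettingsCache(Jedis jedis, long cacheMillis) {
        this.jedis = jedis;
        this.cacheMillis = cacheMillis;
    }

    public Map<String, String> settingsFor(String pattern) {
        long now = System.currentTimeMillis();
        Entry entry = cache.get(pattern);
        if (entry == null || now - entry.loadedAt > cacheMillis) {
            // Cache miss or expired: reload the hash from Redis.
            entry = new Entry(jedis.hgetAll("url_pattern_setting_" + pattern), now);
            cache.put(pattern, entry);
        }
        return entry.fields;
    }
}
```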

Topology

The topology has one spout (URLReader) and five bolts: URLFilter, Downloader, URLParser, HTMLSaver, and URLSaver.

URLReader: pops URLs from the Redis waiting list.

URLFilter: determines which URLs will be downloaded.

This bolt acts as the controller and is in charge of the following (a minimal sketch follows the list):

  • Handling repeated URLs
  • Tracking per-pattern download counts and ignoring a pattern once its limitation is exceeded
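The following is a hedged sketch of that decision logic only, not the project's actual bolt code; the class name, the in-memory seen-URL set, and the explicit reset method are assumptions:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative filter decision: drop URLs already seen, and drop URLs whose
// pattern has exceeded its download limitation for the current interval.
public class UrlFilterLogic {
    private final Set<String> seen = ConcurrentHashMap.newKeySet();
    private final Map<String, Integer> countsPerPattern = new ConcurrentHashMap<>();

    public boolean shouldDownload(String url, String pattern, int limitation) {
        if (!seen.add(url)) {
            return false;                  // repeated URL
        }
        int count = countsPerPattern.merge(pattern, 1, Integer::sum);
        return count <= limitation;        // ignore the pattern once its limit is exceeded
    }

    // Invoked when the pattern's reset interval elapses.
    public void resetCounts() {
        countsPerPattern.clear();
    }
}
```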

Downloader: downloads the page at each URL.

URLParser: parses URLs out of the downloaded page.

HTMLSaver: saves the page HTML to a message queue (MQ).

URLSaver: pushes candidate URLs onto the Redis waiting list.
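For orientation only, the components above could be wired roughly as follows with Storm's TopologyBuilder. The component names come from this section, but the stream groupings, constructor arguments, and package names (older Storm versions use backtype.storm) are assumptions, not the project's actual wiring:

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

// Hedged sketch of the spout/bolt wiring described above; the real project
// may use different groupings, parallelism hints, or constructor arguments.
public class StreamSpiderTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        builder.setSpout("url-reader", new URLReader());
        builder.setBolt("url-filter", new URLFilter()).shuffleGrouping("url-reader");
        builder.setBolt("downloader", new Downloader()).shuffleGrouping("url-filter");
        builder.setBolt("url-parser", new URLParser()).shuffleGrouping("downloader");
        builder.setBolt("html-saver", new HTMLSaver()).shuffleGrouping("downloader");
        builder.setBolt("url-saver", new URLSaver()).shuffleGrouping("url-parser");

        Config conf = new Config();
        StormSubmitter.submitTopology("StreamSpider", conf, builder.createTopology());
    }
}
```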

Configuration

The following settings must be (or can be) configured; a seeding example follows the list.

urls_to_download (Redis list, required): the waiting list of absolute URLs.

allowed_url_patterns (Redis sorted set, required; priority from highest score (5) to lowest score (1), read with ZREVRANGEBYSCORE): the URL patterns allowed to be downloaded.

url_pattern_setting_{pattern} (Redis hash, optional):

  • limitation: download count limit within one interval
  • interval: duration after which the count is reset
  • expire: settings cache time
  • parallelism: maximum number of workers working on this pattern (host)
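As a hedged illustration of how these keys could be populated, the Jedis client can seed them as below. The key names come from this section, but the example URL, pattern, values, and units are placeholders, not values taken from the project:

```java
import redis.clients.jedis.Jedis;

// Illustrative seeding of the Redis keys described above; the pattern, URL,
// and setting values are made-up placeholders.
public class SeedConfig {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Waiting list of absolute URLs to crawl.
            jedis.lpush("urls_to_download", "http://example.com/");

            // Allowed URL patterns; higher score = higher priority (read with ZREVRANGEBYSCORE).
            jedis.zadd("allowed_url_patterns", 5, "http://example\\.com/.*");

            // Optional per-pattern settings hash: url_pattern_setting_{pattern}.
            String key = "url_pattern_setting_http://example\\.com/.*";
            jedis.hset(key, "limitation", "100");  // download count limit per interval
            jedis.hset(key, "interval", "3600");   // reset period for the count (unit assumed)
            jedis.hset(key, "expire", "300");      // settings cache time (unit assumed)
            jedis.hset(key, "parallelism", "4");   // max workers on this pattern (host)
        }
    }
}
```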

TODO

  • ignore non-text pages (binary files)
  • consume faster
