Fast asynchronous web crawler based on the Tornado framework.
Tested on:
- macOS
- Debian-based Linux
- Unix

Requirements:
- Python >= 3.5
- Redis server
To install the required Python packages, run from the project root:
$ python setup.py install
or, for an editable development install:
$ python setup.py develop
Linux installation may require additional steps to build pycurl with SSL support. If so, you may find it more convenient to install the dependencies before running setup.py:
$ pip install -r requirements.txt
If it fails at some stage because of a missing system package, install the system package manually with your standard package manager (apt, yum, etc.), then repeat the above command.
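For example, on a Debian-based system the following is usually enough to build pycurl against OpenSSL (package names may differ on other distributions):
$ sudo apt-get install libcurl4-openssl-dev libssl-dev
$ export PYCURL_SSL_LIBRARY=openssl
$ pip install --no-cache-dir pycurl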
To run all the unit tests:
$ python -m unittest discover tests/ -p 'test_*.py'
The default configuration file is named default.conf. To override any of its predefined settings, put the desired key/value pairs into local.conf or pass an alternative value as a command-line option. The rule is simple (see the example after this list):
- Options from local.conf override those from default.conf.
- Options from the command line override those from both configuration files.
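A hypothetical illustration, assuming the configuration files use simple key = value pairs and a workers setting that matches the --workers option: if default.conf contains workers = 10 and local.conf contains workers = 20, the crawler starts 20 workers, while running
$ torspider --workers=50
overrides both files for that single run.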
To see all available command-line options, run:
$ torspider --help
See also the detailed options description.
To provide initial URLs, edit the seeds.conf file. If you do not provide any seeds, the program will only be able to continue the previous session. If there was no previous session, or you start torspider with the --clear-tasks option, there won't be any tasks for the workers.
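A minimal sketch of a seeds.conf, assuming the common one-URL-per-line layout (the URLs below are placeholders; check the bundled seeds.conf for the exact format):
http://example.com/
http://example.org/news/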
The plugins.json file contains the list of available plugins. To disable a plugin, set its enabled property to false. The config hash contains plugin-specific configuration data.
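A hypothetical plugins.json entry: only the enabled property and the config hash are documented here, so the plugin name, the surrounding list structure, and the keys inside config are illustrative assumptions:
[
    {
        "name": "example_plugin",
        "enabled": false,
        "config": {
            "some_option": "some value"
        }
    }
]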
To run with default settings:
$ torspider
With extra logging:
$ torspider --logging=debug
With 50 asynchronous workers:
$ torspider --workers=50
To stop after processing 5000 pages:
$ torspider --max-pages=5000
To clear all data from previous session:
$ torspider --clear-tasks
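The options can be combined. For example, assuming standard Tornado-style option parsing, the following starts 50 workers with debug logging and stops after 5000 pages:
$ torspider --workers=50 --max-pages=5000 --logging=debug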
The program is pluggable. Briefly, its primary responsibility is to traverse the network following a set of rules; what to do with the results is entirely up to plugins. In one scenario every page is parsed and saved to a database; another scenario requires extracting specific information from a page, such as stock prices.
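As a purely illustrative sketch (this is not the torspider plugin interface, which is described in the Wiki article referenced below), the snippet shows the kind of page-processing logic a plugin might encapsulate, using only the Python standard library to pull links out of a fetched page; the names LinkExtractor and process_page are made up for this example:

from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values from the anchor tags of a fetched page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def process_page(url, html):
    """What a plugin might do with a crawled page: extract links
    (or prices, titles, etc.) and hand the result to storage."""
    parser = LinkExtractor()
    parser.feed(html)
    # A real plugin would typically save the result to a database here.
    return {"url": url, "links": parser.links}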
See Wiki article.
- More entry points for plugins, e.g. let them control where to go from a page.
- More configurable settings, first of all request headers.
- Blacklist of domains / addresses.
- Pauses between successive requests to the same domain.
- Additional content types.
- Monitoring tools.
- Sphinx-compatible documentation.