Exoskeleton

Machine learning and other applications can require downloading thousands, sometimes hundreds of thousands, of files.

Using a high-speed connection carries the risk of unintentionally running a denial-of-service attack on the servers that provide those files and webpages.

Exoskeleton is a Python framework that helps you build a crawler or scraper that avoids putting too much load on the connection and instead runs continuously and fault-tolerantly until all files are downloaded.
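
The idea of trading speed for a gentle, fault-tolerant run can be illustrated with a minimal sketch. It is not exoskeleton's implementation: the function `process_politely`, its parameters, and the default wait of five seconds are all hypothetical names and values chosen for illustration.

```python
import time
from typing import Callable, Iterable, List


def process_politely(urls: Iterable[str],
                     fetch: Callable[[str], bytes],
                     wait_seconds: float = 5.0,
                     sleep: Callable[[float], None] = time.sleep
                     ) -> List[str]:
    """Fetch URLs one by one, pausing after each request.

    Returns the URLs that failed, so they can be retried later
    instead of aborting the whole run.
    All names here are illustrative, not part of exoskeleton.
    """
    failed: List[str] = []
    for url in urls:
        try:
            fetch(url)
        except Exception:
            # Note the failure and carry on with the next URL.
            failed.append(url)
        # Pause so the remote server is not flooded with requests.
        sleep(wait_seconds)
    return failed
```

Passing `fetch` and `sleep` in as parameters keeps the sketch independent of any particular HTTP library and makes it easy to try out without network access.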

Its main functionalities are:

- Managing the download queue and document data within a MariaDB database.
- Avoiding processing the same URL more than once.
- Working through the queue by either
  - downloading files to disk,
  - storing the page source code in a database table,
  - storing the page text,
  - or making PDF copies of webpages.
- Managing already downloaded files:
  - Storing multiple versions of a specific file.
  - Assigning labels to downloads, so they can be found and grouped easily.
- Sending progress reports to the admin.
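
To make the duplicate handling and labeling more concrete, here is a small in-memory sketch. Exoskeleton itself keeps this state in MariaDB; the class `TinyQueue`, its methods, and the choice of SHA-256 as a lookup key are assumptions made for this illustration and do not describe the framework's internals.

```python
import hashlib
from typing import Dict, Iterable, List, Optional, Set


class TinyQueue:
    """In-memory stand-in for a download queue.

    Hypothetical helper for illustration, not exoskeleton's API.
    """

    def __init__(self) -> None:
        self.pending: List[str] = []
        self.labels: Dict[str, Set[str]] = {}

    @staticmethod
    def url_key(url: str) -> str:
        # A hash of the URL serves as a fixed-length lookup key.
        return hashlib.sha256(url.strip().encode('utf-8')).hexdigest()

    def add(self, url: str,
            labels: Optional[Iterable[str]] = None) -> bool:
        """Queue the URL unless it is already known.

        Labels are attached in either case.
        Returns True if the URL was newly added.
        """
        key = self.url_key(url)
        is_new = key not in self.labels
        if is_new:
            self.labels[key] = set()
            self.pending.append(url)
        self.labels[key].update(labels or ())
        return is_new
```

Adding the same URL twice leaves a single entry in `pending`, while labels passed with the second call are still recorded for that URL.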

Documentation

How To Use Exoskeleton

Example Uses

Downloading an Archive: A fairly complex use case that requires some custom SQL. This is the actual project that triggered the development of exoskeleton.

Technical Documentation

Example

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import logging

import exoskeleton

logging.basicConfig(level=logging.DEBUG)

# Create a bot
# exoskeleton makes reasonable assumptions about
# parameters left out, like:
# - host = localhost
# - port = 3306 (MariaDB standard)
# - ...
exo = exoskeleton.Exoskeleton(
    project_name='Bot',
    database_settings={'database': 'exoskeleton',
                       'username': 'exoskeleton',
                       'passphrase': ''},
    # True to stop once the queue is empty. Otherwise the bot keeps
    # checking the queue for new tasks:
    bot_behavior={'stop_if_queue_empty': True},
    filename_prefix='bot_',
    chrome_name='chromium-browser',
    target_directory='/home/myusername/myBot/'
)

exo.add_file_download('https://www.ruediger-voigt.eu/examplefile.txt')
# => Will be saved in the target directory. The filename will be the
#    chosen prefix followed by the database id and .txt.

exo.add_file_download(
    'https://www.ruediger-voigt.eu/examplefile.txt',
    {'example-label', 'foo'})
# => Duplicate will be recognized and not added to the queue,
#    but the labels will be associated with the file in the
#    database.


exo.add_file_download(
    'https://www.ruediger-voigt.eu/file_does_not_exist.pdf')
# => Nonexistent file: will be marked, but will not stop the bot.

# Save a page's code into the database:
exo.add_save_page_code('https://www.ruediger-voigt.eu/')

# Use Chromium or Google Chrome to generate a PDF of the website:
exo.add_page_to_pdf('https://github.com/RuedigerVoigt/exoskeleton')

# Work through the queue:
exo.process_queue()
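
The comments in the example state that a downloaded file is named after the chosen prefix, the database id, and the file extension. A rough sketch of that naming rule could look as follows. The function `build_filename` is hypothetical, and taking the extension from the URL path is an assumption; the framework may determine it differently.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def build_filename(prefix: str, database_id: int, url: str) -> str:
    """Combine prefix, database id, and the extension found in the URL.

    Illustrative helper only, not a function of exoskeleton.
    """
    # Only the path part of the URL is inspected, so query strings
    # and fragments do not end up in the extension.
    extension = PurePosixPath(urlparse(url).path).suffix
    return f"{prefix}{database_id}{extension}"
```

With the prefix `bot_` from the example and an assumed database id of 1, `build_filename('bot_', 1, 'https://www.ruediger-voigt.eu/examplefile.txt')` returns `bot_1.txt`.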
