Vanguard kit

A convenient way to calculate the edit distance between html files to scrape with confidence

Scraping can be hard because websites change continuously. What if there were a way to detect those changes before scraping a site? Vanguard is a toolkit that calculates the edit distance between two HTML files using the Zhang-Shasha algorithm. This package is based on zss.

Installation

OS X & Linux:

From PyPI:

$ pip3 install vanguardkit

From source:

$ git clone https://github.com/dany2691/vanguard-kit.git
$ cd vanguard-kit
$ python3 setup.py install

Usage example

With vanguard, it is possible to convert HTML content into a tree (graph) of nodes. The create_html_tree function does this; it returns an instance of the VanguardNode class, which inherits from the zss.Node class:

from vanguardkit import create_html_tree

with open("target_website.html") as target_website:
    html_tree = create_html_tree(target_website)
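
To illustrate what converting HTML into a tree of nodes means, here is a minimal sketch that uses only the Python standard library. It is not how vanguard builds its trees: TagNode, TreeBuilder, build_tree and render are hypothetical names, and the sketch ignores text, attributes and malformed markup.

```python
from html.parser import HTMLParser


class TagNode:
    """Hypothetical stand-in for a tree node: a tag label plus ordered children."""

    def __init__(self, label):
        self.label = label
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a TagNode tree from start and end tags only."""

    # Void elements never get a closing tag, so they must not stay on the stack.
    VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.root = TagNode("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = TagNode(tag)
        self.stack[-1].children.append(node)
        if tag not in self.VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop only when the closing tag matches the innermost open tag.
        if len(self.stack) > 1 and self.stack[-1].label == tag:
            self.stack.pop()


def build_tree(html_text):
    """Parse an HTML string and return the root of the resulting tree."""
    builder = TreeBuilder()
    builder.feed(html_text)
    builder.close()
    return builder.root


def render(node, depth=0):
    """Return the tree as an indented outline, one tag per entry."""
    lines = ["  " * depth + node.label]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


tree = build_tree("<html><body><div><p>Hi</p></div><footer></footer></body></html>")
print("\n".join(render(tree)))
```

The printed outline nests html, body, div, p and footer under a synthetic document root, which is the kind of structure a tree edit distance operates on.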

It is also possible to build the tree from a specific part of an HTML file.

By tag:

with open("target_website.html") as target_website:
    html_tree = create_html_tree(
        html_file=target_website,
        specific_tag="footer"
    )

By tag and class:

with open("target_website.html") as target_website:
    html_tree = create_html_tree(
        html_file=target_website,
        specific_tag="div",
        class_="main-div"
    )

By tag and id:

with open("target_website.html") as target_website:
    html_tree = create_html_tree(
        html_file=target_website,
        specific_tag="div",
        id="super-div"
    )

Calculating distance

As mentioned above, vanguard uses the Zhang-Shasha algorithm, which computes the edit distance between two given trees. The zss package does this behind the scenes; vanguard only provides a way to convert HTML files into trees.

from vanguard_kit import create_html_tree, calcuate_html_tree_distance

with open("stored_target_website.html") as stored_file:
    with open("current_target_website.html") as current_file:
        previous_tree = create_html_tree(stored_file)
        current_tree = create_html_tree(current_file)
        print(calcuate_html_tree_distance(previous_tree, current_tree))
        # Prints 1

Because the VanguardNode class implements the __sub__ dunder method, the edit distance can also be calculated with the subtraction operator:

from vanguard_kit import create_html_tree

with open("stored_target_website.html") as stored_file:
    with open("current_target_website.html") as current_file:
        previous_tree = create_html_tree(stored_file)
        current_tree = create_html_tree(current_file)
        print(previous_tree - current_tree)
        # Prints 1

Therefore, the following expression evaluates to True:

calcuate_html_tree_distance(previous_tree, current_tree) == previous_tree - current_tree
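
The equivalence between the function call and the subtraction operator can be sketched without the library. The example below is an assumption-laden illustration, not vanguard's or zss's code: SketchNode, tree_distance, forest_distance and forest_size are hypothetical names, and the distance is computed with a plain memoized recursion over ordered forests with unit costs rather than with the more efficient Zhang-Shasha dynamic program. It is only suitable for small trees.

```python
from functools import lru_cache


def forest_size(forest):
    """Count every node in a forest of (label, children) tuples."""
    return sum(1 + forest_size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(left, right):
    """Ordered edit distance between two forests, with unit costs."""
    if not left or not right:
        # Only insertions or only deletions remain.
        return forest_size(left) + forest_size(right)
    left_label, left_children = left[-1]
    right_label, right_children = right[-1]
    # Delete the rightmost root on the left; its children take its place.
    delete = 1 + forest_distance(left[:-1] + left_children, right)
    # Insert the rightmost root of the right forest.
    insert = 1 + forest_distance(left, right[:-1] + right_children)
    # Match the two rightmost roots, paying 1 if the labels differ.
    match = (
        int(left_label != right_label)
        + forest_distance(left_children, right_children)
        + forest_distance(left[:-1], right[:-1])
    )
    return min(delete, insert, match)


class SketchNode:
    """Hypothetical node class used for illustration; not VanguardNode."""

    def __init__(self, label, *children):
        self.label = label
        self.children = children

    def as_tuple(self):
        # Hashable form so that the recursion can be memoized.
        return (self.label, tuple(child.as_tuple() for child in self.children))

    def __sub__(self, other):
        # Subtraction delegates to the same function a caller could use directly.
        return tree_distance(self, other)


def tree_distance(first, second):
    """Edit distance between two SketchNode trees."""
    return forest_distance((first.as_tuple(),), (second.as_tuple(),))


stored = SketchNode("body", SketchNode("div", SketchNode("p")), SketchNode("footer"))
current = SketchNode("body", SketchNode("div", SketchNode("p")))

print(tree_distance(stored, current))  # 1: the footer was removed
print(stored - current == tree_distance(stored, current))  # True
```

Because __sub__ simply forwards to the distance function, both spellings always give the same result; the operator form is only a convenience.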

Development setup

This project uses Poetry for dependency resolution. It's a kind of mix between pip and virtualenv. Follow the instructions below to set up the development environment.

First, install Poetry, then clone the repository and install the dependencies:

$ pip install poetry
$ git clone https://github.com/dany2691/vanguard-kit.git
$ cd vanguard-kit
$ poetry install

To run the test suite, inside the vanguard-kit directory:

$ poetry run pytest test/ -vv

Meta

Daniel Omar Vergara Pérez - @__danvergara__ - daniel.omar.vergara@gmail.com -- github.com/danvergara

Valery Briz - @valerybriz -- github.com/valerybriz

Contributing

  1. Fork it (https://github.com/BentoBox-Project/vanguard-kit)
  2. Create your feature branch (git checkout -b feature/fooBar)
  3. Commit your changes (git commit -am 'Add some fooBar')
  4. Push to the branch (git push origin feature/fooBar)
  5. Create a new Pull Request