A convenient way to calculate the edit distance between html files to scrape with confidence
Sometimes, scraping becomes a hard task, because the web sites are in continous changing. What about if there was a way to prevent those changes before scrape a site? Vanguard is a tool kit that provides a way to calculate the edit distance between two html files by the Zhang-Shasha algorithm. This package is based on zss.
OS X & Linux:
From PYPI
$ pip3 install vanguardkit
from the source
$ git clone https://github.com/dany2691/vanguard-kit.git
$ cd vanguard-kit
$ python3 setup.py install
With vanguard, it is possible to convert html content into a tree (graph) of nodes. The create_html_tree function is the responsible to do that, it returns an instance of the VanguardNode class that inherits from the zss.Node class:
from vanguardkit import create_html_tree
with open("target_website.html") as target_website:
thml_tree = create_html_tree(target_website)
It is possible to segment specific parts of an html file.
By tag:
with open("target_website.html") as target_website:
html_tree = create_html_tree(
html_file=target_website,
specific_tag="footer"
)
By tag and class:
with open("target_website.html") as target_website:
html_tree = create_html_tree(
html_file=target_website,
specific_tag="div",
class_="main-div"
)
By tag and id:
with open("target_website.html") as target_website:
html_tree = create_html_tree(
html_file=target_website,
specific_tag="div",
id="super-div"
)
As previously said, the used algorithm is the Zhang-Shasha, that computes the edit distance between the two given trees. Ths is possible with the zss package behind the scenes; vanguard only provides a way to convert html files into trees.
from vanguard_kit import create_html_tree, calcuate_html_tree_distance
with open("stored_target_website.html") as stored_file:
with open("current_target_website.html") as current_file:
previous_tree = create_html_tree(stored_file)
current_tree = create_html_tree(current_file)
print(calcuate_html_tree_distance(previous_tree, current_tree))
# Prints 1
Due to the VanguardNode class implements the sub dunder method, the next way to calculate the edit distance is possible:
from vanguard_kit import create_html_tree, calcuate_html_tree_distance
with open("stored_target_website.html") as stored_file:
with open("current_target_website.html") as current_file:
previous_tree = create_html_tree(stored_file)
current_tree = create_html_tree(current_file)
print(previous_tree - current_tree)
# Prints 1
Then, the next statement returns True:
calcuate_html_tree_distance(previous_tree, current_tree) == previous_tree - current_tree
This project uses Poetry for dependecy resolution. It's a kind of mix between pip and virtualenv. Follow the next instructions to setup the development enviroment.
First of all, install Poetry:
$ pip install poetry
$ git clone https://github.com/dany2691/vanguard-kit.git
$ cd vanguard_kit
$ poetry install
To run the test-suite, inside the pybundler directory:
$ poetry run pytest test/ -vv
Daniel Omar Vergara Pérez – @__danvergara __ – daniel.omar.vergara@gmail.com -- github.com/danvergara
Valery Briz - @valerybriz -- github.com/valerybriz
- Fork it (https://github.com/BentoBox-Project/vanguard-kit)
- Create your feature branch (
git checkout -b feature/fooBar
) - Commit your changes (
git commit -am 'Add some fooBar'
) - Push to the branch (
git push origin feature/fooBar
) - Create a new Pull Request