GitHub Action
Broken-Links-Crawler
This action checks all links on a website. It detects broken links, i.e. links that return an HTTP error code such as 403 or 404. Any broken link causes this action to fail; check the logs to see which links are affected.
Based on this work: https://github.com/healeycodes/Broken-Link-Crawler
Demo: https://github.com/ScholliYT/devportfolio/actions?query=workflow%3A%22Site+Reliability%22
Required The url of the website to check. You can provide a comma separated list of urls if you would like to crawl multiple websites.
Optional Comma separated list of url prefixes to include. You may only want to crawl urls that use "https://".
Optional Comma separated list of url prefixes to exclude (default mailto:,tel:). Some sites do not respond properly to bots and you might want to exclude those known sites to prevent a failed build.
Optional Comma separated list of url suffixes to include. You may only want to crawl urls that end with ".html".
Optional Comma separated list of url suffixes to exclude. You may want to skip images by using ".jpg,.gif,.png".
Optional Comma separated list of url substrings to include. You may only want to crawl urls that are hosted on your primary domain, so you could use "mydomain.com", which would include all of your subdomains regardless of the protocol you use.
Optional Comma separated list of url substrings to exclude. You may want to skip specific domains for external links that you do not want to fetch from, such as "knownadprovider.com".
Optional The string to use for the web agent when crawling pages.
The default is:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36
Optional Set logging verbosity level: true/false/yes/no/on/off/debug/info/warning/error/critical (default false).
Optional Maximum seconds for request retries (default 30).
Optional Maximum request retry count (default 4).
Optional Maximum levels deep to check, -1 = unlimited depth (default -1).
Optional By default, the crawler will use a HEAD request first, then follow up with a GET request if the response is HTML and the url represents a page on the same site as the root url. By setting this value to 'true', the crawler will always use a GET request for any onsite url. (default false).
Optional By default, the crawler will open as many connections as it needs to make the requests. By setting this value to a number greater than zero (0), the crawler will not exceed that number of connections per host. This can be useful when crawling a site that enforces rate limits. (default unlimited).
Example usage:

uses: ScholliYT/Broken-Links-Crawler-Action@v2.1.1
with:
  website_url: 'https://github.com/ScholliYT/Broken-Links-Crawler-Action'
  exclude_url_prefix: 'mailto:,https://www.linkedin.com,https://linkedin.com'
  verbose: 'true'
  max_retry_time: 30
  max_retries: 5
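For orientation, the step above can be dropped into a complete workflow file. The sketch below is an assumption about how such a workflow might look: the workflow name is taken from the demo linked above, while the triggers and runner image are illustrative rather than prescribed by this action.

name: Site Reliability
on:
  push:
  schedule:
    - cron: '0 6 * * 1'   # assumption: also re-check links weekly
jobs:
  check-links:
    runs-on: ubuntu-latest
    steps:
      - uses: ScholliYT/Broken-Links-Crawler-Action@v2.1.1
        with:
          website_url: 'https://github.com/ScholliYT/Broken-Links-Crawler-Action'
          exclude_url_prefix: 'mailto:,https://www.linkedin.com,https://linkedin.com'
          verbose: 'true'
          max_retry_time: 30
          max_retries: 5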
The easiest way to run this action locally is to use Docker. Just build a new image and pass the correct environment variables to it.
docker build --tag broken-links-crawler-action:latest .
docker run -e INPUT_WEBSITE_URL="https://github.com/ScholliYT/Broken-Links-Crawler-Action" -e INPUT_VERBOSE="true" -e INPUT_MAX_RETRY_TIME=30 -e INPUT_MAX_RETRIES=5 -e INPUT_EXCLUDE_URL_PREFIX="mailto:,https://www.linkedin.com,https://linkedin.com" broken-links-crawler-action:latest
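If the command line gets unwieldy, the same variables can also be passed through Docker's --env-file option. The file name below is hypothetical:

# crawler.env (hypothetical file name)
INPUT_WEBSITE_URL=https://github.com/ScholliYT/Broken-Links-Crawler-Action
INPUT_VERBOSE=true
INPUT_MAX_RETRY_TIME=30
INPUT_MAX_RETRIES=5
INPUT_EXCLUDE_URL_PREFIX=mailto:,https://www.linkedin.com,https://linkedin.com

docker run --env-file crawler.env broken-links-crawler-action:latest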
Several utilities are used for testing the code. To install all of the required dependencies, please use the following command:
pip install -r test-dependences.txt
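If you want to keep these packages out of your global Python installation, one option (not required by the project) is to install them into a virtual environment first:

python -m venv .venv                  # create an isolated environment (directory name is arbitrary)
source .venv/bin/activate             # activate it (on Windows: .venv\Scripts\activate)
pip install -r test-dependences.txt   # then install the test dependencies as above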
To execute quality checks automatically during a commit, make sure that the git hook script is set up by running the following command:
pre-commit install
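Once installed, the hooks run automatically on every git commit; you can also trigger them manually across the whole repository:

pre-commit run --all-files   # run every configured hook against all files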
You can run the full suite of automated tests on your local machine by using the "act" tool to simulate the GitHub workflow that executes on each commit; more information here:
https://github.com/nektos/act
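As a sketch, a typical act session might look like this (which jobs are available depends on the workflow files in the repository):

act -l     # list the jobs act finds under .github/workflows
act push   # run the workflows that are triggered by a push event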
You can run linting and code formatting checks using the flake8 command:
flake8 . --count --show-source --max-complexity=10 --statistics
You can run type checks using the mypy command:
mypy
To run the unit tests, use the pytest command like this:
pytest -m "not integrationtest" --cov=deadseeker --cov-fail-under=95 --cov-branch --cov-report=term-missing
To generate an html report that visually displays uncovered lines, use this version:
pytest -m "not integrationtest" --cov=deadseeker --cov-fail-under=95 --cov-branch --cov-report=term-missing --cov-report=html
To run the integration tests, use the pytest command like this:
pytest -m "integrationtest"
Mutation testing changes the real code (creating a 'mutant') and runs all of the tests to make sure that at least one test fails. This ensures that your tests are actually effective at testing the code, and it can also reveal unnecessary implementation code that should be refactored. To run mutation testing, use the mutmut run command. For details on missed mutants, run the mutmut html command to generate an html report of the missed mutants.
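A typical mutation-testing session might therefore look like the following; mutmut results is an extra mutmut subcommand used here for a quick summary, not something the project mandates:

mutmut run       # mutate the code and run the test suite against each mutant
mutmut results   # print a summary of surviving (missed) mutants
mutmut html      # generate an html report detailing the missed mutants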