Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
Updated Nov 20, 2024 - Java
🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation
Dockerized Web Curator Tool with Heritrix 3 and pywb
Single Docker container running Heritrix 3, picking up jobs from a directory.
Parse a Heritrix crawl.log into an XML sitemap
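
To illustrate the crawl.log-to-sitemap idea in the last entry, here is a minimal sketch in Python. It is not the code of the listed repository. It assumes the usual whitespace-separated Heritrix 3 crawl.log layout, where the second field is the fetch status, the fourth is the URI, and the seventh is the MIME type. The function name `crawl_log_to_sitemap` and the filtering rules (HTTP 200, HTML only, no duplicates) are hypothetical choices made for this example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def crawl_log_to_sitemap(lines):
    """Return a sitemap XML string built from crawl.log lines.

    Assumed field positions (0-based, whitespace-separated):
    1 = fetch status code, 3 = URI, 6 = MIME type.
    """
    seen = set()
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for line in lines:
        fields = line.split()
        if len(fields) < 7:
            continue  # skip blank or truncated lines
        status, uri, mime = fields[1], fields[3], fields[6]
        # Keep only successfully fetched HTML pages.
        if status != "200" or not mime.startswith("text/html"):
            continue
        # Drop non-HTTP entries (for example dns: lookups) and duplicates.
        if not uri.startswith(("http://", "https://")) or uri in seen:
            continue
        seen.add(uri)
        url_el = ET.SubElement(urlset, "url")
        ET.SubElement(url_el, "loc").text = uri
    return ET.tostring(urlset, encoding="unicode")
```

A caller could pass an open file object directly, for example `crawl_log_to_sitemap(open("crawl.log", encoding="utf-8"))`, and write the returned string to `sitemap.xml`. Real logs may need extra handling, such as redirects or per-URL last-modified dates, which this sketch leaves out.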