Chronicrawl is an experimental web crawler for web archiving. The goal is to explore some ideas around budget-based continuous crawling and mixing of browser-based crawling with traditional link extraction.
Currently unmaintained. It may or may not be revisited in future. Chronicrawl is very rough around the edges but (barely) usable for basic crawling functions. It is likely not compatible with the latest version of Chromium.
Currently it:
- keeps the crawl state in an embedded Sqlite SQL database (still experimenting with db options; it currently uses a fairly portable subset of SQL and will likely target both an embedded and a clustered database)
- fetches robots.txt and discovers URLs via sitemaps and links
- discovers subresources by parsing HTML and also loading in Headless Chromium when script tags are detected
- periodically revisits resources (both fine-grained manual and basic automatic scheduling using a content change heuristic)
- writes WARC records (with both server not modified and identical digest dedupe)
- shows a primitive UI for exploring the state of the crawl and examining the content analysis
- replays archived content using Pywb
but many serious limitations still need to be addressed:
- the main crawl loop is single-threaded
- error handling is incomplete
- there's no real prioritisation system yet
- only a little effort has been put into performance so far
- it only speaks HTTP/1.0 without keep-alive
- essential options like URL scoping are missing
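The "basic automatic scheduling using a content change heuristic" in the feature list is not spelled out here. As a rough illustration only, the sketch below shows one common approach: compare a digest of the newly fetched body with the previous one, revisit sooner when it changed and back off when it did not. The class name `RevisitSketch`, the halve/double rule and the `MIN`/`MAX` bounds are all assumptions made for this example, not Chronicrawl's actual heuristic.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.time.Duration;
import java.util.Base64;

/** Hypothetical adaptive revisit scheduling; not taken from Chronicrawl's source. */
public class RevisitSketch {
    // Assumed bounds on how often a resource may be revisited.
    public static final Duration MIN = Duration.ofHours(1);
    public static final Duration MAX = Duration.ofDays(30);

    /** Returns a Base64-encoded SHA-256 digest of the response body. */
    public static String digest(byte[] body) {
        try {
            byte[] hash = MessageDigest.getInstance("SHA-256").digest(body);
            return Base64.getEncoder().encodeToString(hash);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /**
     * Halves the interval when the digest changed, doubles it when it did not,
     * and clamps the result to [MIN, MAX].
     */
    public static Duration nextInterval(Duration current, String previousDigest, String newDigest) {
        boolean changed = !newDigest.equals(previousDigest);
        Duration next = changed ? current.dividedBy(2) : current.multipliedBy(2);
        if (next.compareTo(MIN) < 0) return MIN;
        if (next.compareTo(MAX) > 0) return MAX;
        return next;
    }

    public static void main(String[] args) {
        String first = digest("<html>v1</html>".getBytes(StandardCharsets.UTF_8));
        String second = digest("<html>v2</html>".getBytes(StandardCharsets.UTF_8));

        Duration interval = Duration.ofDays(1);
        interval = nextInterval(interval, first, second);  // content changed: visit sooner
        System.out.println("after a change: " + interval);
        interval = nextInterval(interval, second, second); // unchanged: back off
        System.out.println("after no change: " + interval);
    }
}
```

The same digest comparison is also the kind of check an identical-digest dedupe could rely on when deciding whether to write a full WARC record, though how Chronicrawl does this is not described here.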
Requirements:
- Java 11 or later
- Chromium or Chrome (currently mandatory, may be optional in future)
- Pywb (optional)
To compile, install Apache Maven and then run:
mvn package
Then start the crawler with:
java -jar target/chronicrawl-*-with-dependencies.jar
See Config.java for the full list of configuration options. They can be set as environment variables, system properties or read from a properties file using the -c option.
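To make the three configuration sources concrete, here is a minimal sketch of how a single option might be resolved across them. It is not the code in Config.java: the class `OptionLookup`, the precedence order (properties file, then system property, then environment variable) and the reuse of the environment variable names `PYWB` and `PYWB_PORT` as property keys are assumptions made for this example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Map;
import java.util.Properties;

/** Hypothetical option lookup; the real logic lives in Config.java and may differ. */
public class OptionLookup {
    private final Map<String, String> env;
    private final Properties systemProps;
    private final Properties fileProps;

    public OptionLookup(Map<String, String> env, Properties systemProps, Properties fileProps) {
        this.env = env;
        this.systemProps = systemProps;
        this.fileProps = fileProps;
    }

    /** Parses properties-file syntax (key=value lines) from any reader. */
    public static Properties load(Reader reader) throws IOException {
        Properties props = new Properties();
        props.load(reader);
        return props;
    }

    /**
     * Assumed precedence: properties file first, then system property,
     * then environment variable, then the supplied default.
     */
    public String get(String name, String defaultValue) {
        String value = fileProps.getProperty(name);
        if (value == null) value = systemProps.getProperty(name);
        if (value == null) value = env.get(name);
        return value != null ? value : defaultValue;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the contents of a file passed with the -c option.
        Properties file = load(new StringReader("PYWB_PORT=8081\n"));
        OptionLookup options = new OptionLookup(System.getenv(), System.getProperties(), file);

        System.out.println("PYWB_PORT = " + options.get("PYWB_PORT", "(unset)"));
        System.out.println("PYWB = " + options.get("PYWB", "(unset)"));
    }
}
```

Passing the three sources into the constructor, rather than reading `System.getenv()` inside `get`, keeps the lookup easy to exercise with fixed values.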
Chronicrawl can optionally run an instance of Pywb for replay. To enable this, specify the path to the pywb main executable:
PYWB=/usr/bin/pywb PYWB_PORT=8081 java -jar ...
Copyright 2020 National Library of Australia and contributors
Licensed under the Apache License, Version 2.0