Chronicrawl

Chronicrawl is an experimental web crawler for web archiving. The goal is to explore some ideas around budget-based continuous crawling and mixing of browser-based crawling with traditional link extraction.

Status

Currently unmaintained. May or may not be revisited in future. Chronicrawl is very rough around the edges but (barely) usable for basic crawling functions. It's likely not compatible with the latest version of Chromium.

Currently it:

keeps the crawl state in an embedded SQLite database (still experimenting with db options; it currently uses a fairly portable subset of SQL and will likely target both an embedded and a clustered database)
fetches robots.txt and discovers URLs via sitemaps and links
discovers subresources by parsing HTML and also loading in Headless Chromium when script tags are detected
periodically revisits resources (both fine-grained manual and basic automatic scheduling using a content change heuristic)
writes WARC records (with both server-not-modified and identical-digest deduplication)
shows a primitive UI for exploring the state of the crawl and examining the content analysis
replays archived content using Pywb

but many serious limitations still need to be addressed:

the main crawl loop is single-threaded
error handling is incomplete
there's no real prioritisation system yet
only a little effort has been put into performance so far
it only speaks HTTP/1.0 without keep-alive
essential options like URL scoping are missing
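
To illustrate the two kinds of WARC deduplication mentioned in the feature list above, here is a minimal sketch of how a crawler might decide which record type to write for a revisited URL. It is not taken from the Chronicrawl source: the class name, the method names, the enum and the choice of SHA-1 are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch: none of these names come from the Chronicrawl source.
class DedupeSketch {
    enum RecordType { RESPONSE, REVISIT_NOT_MODIFIED, REVISIT_IDENTICAL_DIGEST }

    /** Hex-encoded SHA-1 of the payload. Using SHA-1 here is an assumption. */
    static String digest(byte[] payload) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(payload)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    /**
     * Decides which kind of record to write for a fetch.
     * previousDigest is the digest stored from the last visit, or null on a first visit.
     */
    static RecordType classify(int status, byte[] payload, String previousDigest) {
        if (status == 304) {
            // The server answered a conditional request with "Not Modified".
            return RecordType.REVISIT_NOT_MODIFIED;
        }
        if (previousDigest != null && previousDigest.equals(digest(payload))) {
            // A full response arrived, but its body matches the previous capture.
            return RecordType.REVISIT_IDENTICAL_DIGEST;
        }
        return RecordType.RESPONSE;
    }

    public static void main(String[] args) {
        byte[] body = "<html>hello</html>".getBytes(StandardCharsets.UTF_8);
        String stored = digest(body);
        System.out.println(classify(200, body, null));            // first visit
        System.out.println(classify(200, body, stored));          // unchanged body
        System.out.println(classify(304, new byte[0], stored));   // conditional GET hit
    }
}
```

Checking the status code first means the payload is only hashed when a body was actually transferred.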

Requirements

Java 11 or later
Chromium or Chrome (currently mandatory, may be optional in future)
Pywb (optional)

Usage

To compile, install Apache Maven and then run:

mvn package
java -jar target/chronicrawl-*-with-dependencies.jar

Configuration

See Config.java for the full list of configuration options. They can be set as environment variables, system properties or read from a properties file using the -c option.
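
As a rough illustration of how three configuration sources can be layered, the sketch below resolves one option by checking environment variables, then system properties, then a properties file. It is not Chronicrawl's Config.java: the class and method names are hypothetical, the precedence order is an assumption, and reusing the environment variable spelling PYWB_PORT as a properties key is also an assumption.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;

// Hypothetical sketch: this is not how Config.java is known to work.
class LayeredConfigSketch {
    /** Parses properties-file text, e.g. the contents of a file passed with -c. */
    static Properties load(Reader reader) {
        Properties props = new Properties();
        try {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /**
     * Returns the first value found for key, checking the environment,
     * then system properties, then the properties file (assumed order).
     */
    static String lookup(String key, Map<String, String> env,
                         Properties system, Properties file, String fallback) {
        String value = env.get(key);
        if (value == null) value = system.getProperty(key);
        if (value == null) value = file.getProperty(key);
        return value != null ? value : fallback;
    }

    public static void main(String[] args) {
        // In real use the reader would wrap the properties file on disk.
        Properties file = load(new StringReader("PYWB_PORT=8081\n"));
        String port = lookup("PYWB_PORT", System.getenv(),
                System.getProperties(), file, "unset");
        System.out.println("PYWB_PORT = " + port);
    }
}
```

Passing the environment and system properties in as arguments keeps the lookup easy to test without changing the real process environment.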

Pywb integration

Chronicrawl can optionally run an instance of Pywb for replay. To enable this, specify the path to the pywb main executable:

PYWB=/usr/bin/pywb PYWB_PORT=8081 java -jar ...

License

Licensed under the Apache License, Version 2.0
