buren/wayback_archiver

WaybackArchiver

Post URLs to Wayback Machine (Internet Archive), using a crawler, from Sitemap(s), or a list of URLs.

The Wayback Machine is a digital archive of the World Wide Web [...] The service enables users to see archived versions of web pages across time ...
- Wikipedia

Installation

Install the gem:

$ gem install wayback_archiver

Or add this line to your application's Gemfile:

gem 'wayback_archiver'

And then execute:

$ bundle

Usage

Strategies:

  • auto (the default) - Tries, in order:
    1. Sitemap(s) declared in /robots.txt
    2. Common sitemap locations (/sitemap-index.xml, /sitemap.xml, etc.)
    3. Crawling as a fallback (using the excellent spidr gem)
  • sitemap - Parses Sitemap(s); supports index files and gzip
  • crawl - Crawls the site and posts the URLs it finds
  • urls (or url) - Posts the given URL(s)
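
The fallback order of the auto strategy can be sketched in plain Ruby. This is only an illustration of the "first source that yields URLs wins" idea, not the gem's actual code: resolve_urls and the three lambdas are hypothetical stand-ins, and the URLs are made up.

```ruby
# Hypothetical sketch of the "auto" fallback order; NOT the gem's implementation.
# Each finder is a callable returning an array of URLs (possibly empty).
def resolve_urls(finders)
  finders.each do |name, finder|
    urls = finder.call
    # Stop at the first source that yields at least one URL.
    return [name, urls] unless urls.empty?
  end
  [nil, []]
end

# Ruby hashes keep insertion order, so the finders are tried top to bottom.
finders = {
  robots_txt: -> { [] },                                  # no Sitemap entry in /robots.txt
  common_locations: -> { ['http://example.com/about'] },  # e.g. found via /sitemap.xml
  crawl: -> { ['http://example.com/'] }                   # last resort
}

strategy, urls = resolve_urls(finders)
puts "#{strategy}: #{urls.join(', ')}"
```

Because the second finder returns a URL, the crawl step is never reached in this run.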

Ruby

First require the gem

require 'wayback_archiver'

Examples:

Auto

# auto is the default
WaybackArchiver.archive('example.com')

# or explicitly
WaybackArchiver.archive('example.com', strategy: :auto)

Crawl

WaybackArchiver.archive('example.com', strategy: :crawl)

Send a single URL

WaybackArchiver.archive('example.com', strategy: :url)

Send multiple URLs

WaybackArchiver.archive(%w[example.com www.example.com], strategy: :urls)

Send all URL(s) found in Sitemap

WaybackArchiver.archive('example.com/sitemap.xml', strategy: :sitemap)

# works with Sitemap index files too
WaybackArchiver.archive('example.com/sitemap-index.xml.gz', strategy: :sitemap)

Specify concurrency

WaybackArchiver.archive('example.com', strategy: :auto, concurrency: 10)

Specify max number of URLs to be archived

WaybackArchiver.archive('example.com', strategy: :auto, limit: 10)

Each archive strategy can receive a block that will be called for each URL

WaybackArchiver.archive('example.com', strategy: :auto) do |result|
  if result.success?
    puts "Successfully archived: #{result.archived_url}"
  else
    puts "Error (HTTP #{result.code}) when archiving: #{result.archived_url}"
  end
end
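
Building on the block form above, results can be collected into a summary instead of printed one by one. The following is a minimal sketch: ArchiveTally and FakeResult are hypothetical names, and FakeResult only imitates the three methods used above (success?, archived_url, code) so the sketch runs without the gem or network access. The Mutex is a precaution based on the assumption that the block may be called from several threads when concurrency is above 1.

```ruby
# Hypothetical helper that tallies results; not part of the gem.
class ArchiveTally
  attr_reader :archived, :failed

  def initialize
    @archived = []
    @failed = []
    @lock = Mutex.new # assumption: the block may run on multiple threads
  end

  # Meant to be called from the block passed to WaybackArchiver.archive.
  def record(result)
    @lock.synchronize do
      if result.success?
        @archived << result.archived_url
      else
        @failed << [result.archived_url, result.code]
      end
    end
    self
  end

  def summary
    "#{archived.size} archived, #{failed.size} failed"
  end
end

# Stand-in for the gem's result object (assumed interface only).
FakeResult = Struct.new(:archived_url, :code, :ok) do
  def success?
    ok
  end
end

tally = ArchiveTally.new
# With the gem loaded, this would instead be:
#   WaybackArchiver.archive('example.com') { |result| tally.record(result) }
tally.record(FakeResult.new('http://example.com', '200', true))
tally.record(FakeResult.new('http://example.com/missing', '404', false))
puts tally.summary
```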

Use your own adapter for posting found URLs

WaybackArchiver.adapter = ->(url) { puts url } # anything that responds to #call
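
Since the adapter only needs to respond to #call, a small class works as well as a lambda. Below is a sketch of a dry-run adapter that records URLs instead of posting them. DryRunAdapter is a hypothetical name, and what the gem expects #call to return is not stated here, so returning the URL is an assumption.

```ruby
# Hypothetical dry-run adapter: collects URLs instead of posting them.
class DryRunAdapter
  attr_reader :urls

  def initialize
    @urls = []
  end

  # The only method the adapter contract requires.
  # Return value is an assumption; the gem may expect something else.
  def call(url)
    @urls << url
    url
  end
end

adapter = DryRunAdapter.new
# Assign only when the gem is loaded, so the sketch also runs standalone.
WaybackArchiver.adapter = adapter if defined?(WaybackArchiver)

adapter.call('http://example.com')
puts adapter.urls.inspect
```

Keeping the collected URLs on the adapter makes it easy to inspect what would have been sent before switching back to the default adapter.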

CLI

Usage:

wayback_archiver [<url>] [options]

Print full usage instructions

wayback_archiver --help

Examples:

Auto

# auto is the default
wayback_archiver example.com

# or explicitly
wayback_archiver example.com --auto

Crawl

wayback_archiver example.com --crawl

Send a single URL

wayback_archiver example.com --url

Send multiple URLs

wayback_archiver example.com www.example.com --urls

Crawl multiple URLs

wayback_archiver example.com www.example.com --crawl

Send all URL(s) found in Sitemap

wayback_archiver example.com/sitemap.xml

# works with Sitemap index files too
wayback_archiver example.com/sitemap-index.xml.gz

Most options

wayback_archiver example.com www.example.com --auto --concurrency=10 --limit=100 --log=output.log --verbose

View archive: https://web.archive.org/web/*/http://example.com (replace http://example.com with your desired domain).

Configuration

ℹ️ By default wayback_archiver doesn't respect robots.txt files, see this Internet Archive blog post for more information.

Available settings (the values below are the defaults):

WaybackArchiver.concurrency = 1
WaybackArchiver.user_agent = WaybackArchiver::USER_AGENT
WaybackArchiver.respect_robots_txt = WaybackArchiver::DEFAULT_RESPECT_ROBOTS_TXT
WaybackArchiver.logger = Logger.new(STDOUT)
WaybackArchiver.max_limit = -1 # unlimited
WaybackArchiver.adapter = WaybackArchiver::WaybackMachine # must implement #call(url)

For a more verbose log, configure the logger like this:

WaybackArchiver.logger = Logger.new(STDOUT).tap do |logger|
  logger.progname = 'WaybackArchiver'
  logger.level = Logger::DEBUG
end

Pro tip: If you're using the gem in a Rails app you can set WaybackArchiver.logger = Rails.logger.

Docs

You can find the docs online on RubyDoc.

This gem is documented using yard. To generate the docs locally, run it from the root of this repository:

yard # Generates documentation to doc/

Contributing

Contributions, feedback and suggestions are very welcome.

  1. Fork it
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request

License

MIT License
