Sawood Alam edited this page May 11, 2017 · 1 revision

When OpenWayback is built from source using mvn package, the build includes several executable scripts that are useful for tasks such as indexing. The packaged utility scripts are listed below, each with its usage output. They are also available in the Docker image.

bdb-client

$ bdb-client
Usage: DBPATH DBNAME -w
	Read lines from STDIN, inserting into BDBJE at
 DBPATH named DBNAME, creating DB if needed.
Usage: DBPATH DBNAME -r [PREFIX]
	Dump lines from BDBJE at path DBPATH named DBNAME
 to STDOUT. If PREFIX is specified, only output records
 beginning with PREFIX, otherwise output all records

bin-search

$ bin-search
Usage: PREFIX FILE1 [FILE2] ...

cdx-indexer

$ cdx-indexer
USAGE:

cdx-indexer [-format FORMAT|-identity] FILE
cdx-indexer [-format FORMAT|-identity] FILE CDXFILE

Create a CDX format index from ARC or WARC file
FILE at CDXFILE or to STDOUT.
With -identity, perform no url canonicalization.
With -format, output CDX in format FORMAT.
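
A common pattern is to index an archive file and then sort the result so it can be searched by key prefix (for example with bin-search, described above). The following is a minimal sketch under stated assumptions: build_sorted_cdx is a hypothetical helper name, the file names in the example call are placeholders, and sorting in the C locale is assumed to be the ordering that prefix lookups expect.

```shell
#!/bin/sh
# Sketch: index one ARC/WARC file and write a byte-order-sorted CDX file.
# build_sorted_cdx is a hypothetical helper name, not part of OpenWayback.
build_sorted_cdx() {
    archive="$1"                    # path to an ARC or WARC file
    cdx_out="$2"                    # path of the sorted CDX file to write
    tmp_cdx="${cdx_out}.unsorted"   # intermediate, unsorted index

    # cdx-indexer FILE CDXFILE: write the index to CDXFILE instead of STDOUT.
    cdx-indexer "$archive" "$tmp_cdx" || return 1

    # Sort in the C locale so lines are ordered by raw byte value.
    LC_ALL=C sort "$tmp_cdx" > "$cdx_out" || return 1
    rm -f "$tmp_cdx"
}

# Example call with assumed file names (PREFIX is a placeholder key prefix):
#   build_sorted_cdx example.warc.gz example.cdx
#   bin-search PREFIX example.cdx
```

Writing to an intermediate file keeps a failed indexing run from leaving a partial file at the final CDX path.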

cdx-sample

$ cdx-sample
Need path to CDX argument 1

 USAGE: ./cdx-sample PATH NUM
Create a split file for use with Wayback hadoop indexing code on STDOUT.
Finds approximate offsets at host boundaries for file at PATH, producing
a split file with NUM parts, which indicates the number of reduce tasks.

create-test-arc

$ create-test-arc
USAGE: srcDir tgtDir [arc_prefix]

location-client

$ location-client
USAGE: 
	[lookup|add|remove|sync] ...

	 lookup LOCATION-DB-URL ARC
		emit all known URLs for arc ARC

	 add LOCATION-DB-URL ARC URL
		inform locationDB that ARC is located at URL

	 remove LOCATION-DB-URL ARC URL
		remove reference to ARC at URL in locationDB

	 sync LOCATION-DB-URL DIR DIR-URL
		scan directory DIR, and submit all ARC files therein
		to locationDB at url DIR-URL/ARC

	 get-mark LOCATION-DB-URL
		emit an identifier for the current marker in the 
		locationDB log. These identifiers can be used with the
		mark-range operation.

	 mark-range LOCATION-DB-URL START END
		emit to STDOUT one line with the name of all ARC files
		added to the locationDB between marks START and END

	 add-stream LOCATION-DB-URL
		read lines from STDIN formatted like:
			NAME<SPACE>URL
		and for each line, inform locationDB that file NAME is
		located at URL
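
The add-stream operation expects one NAME<SPACE>URL line per file on STDIN. As a rough illustration, such lines could be generated from a directory listing like this. The helper name emit_location_lines, the directory, and both URLs in the example call are assumptions made for the sketch, and file names containing spaces would not fit the line format.

```shell
#!/bin/sh
# emit_location_lines is a hypothetical helper, not part of OpenWayback.
# It prints one "NAME URL" line per *.arc.gz / *.warc.gz file in a directory.
emit_location_lines() {
    dir="$1"        # local directory to scan
    base_url="$2"   # URL prefix under which the same files are served
    for path in "$dir"/*.arc.gz "$dir"/*.warc.gz; do
        [ -f "$path" ] || continue    # skip glob patterns that matched nothing
        name=$(basename "$path")
        # Strip one trailing slash from base_url to avoid a double slash.
        printf '%s %s\n' "$name" "${base_url%/}/$name"
    done
}

# Example with assumed locations:
#   emit_location_lines /data/arcs http://files.example.org/arcs \
#       | location-client add-stream http://localhost:8080/locationdb
```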

url-client

$ url-client

warc-header

$ warc-header
USAGE: tgtWarc fieldsSrc id
	tgtWarc is the path to the target WARC.gz
	fieldsSrc is the path to the text of the record
		make sure each line is terminated by \r\n
		and that the file ends with a blank, \r\n terminiated line
	id is the XXX in:
		Content-Description: Made from XXX by org.archive.wayback.util.WARCHeader
		of the header record... header...
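
Because fieldsSrc must use \r\n line endings and end with a blank \r\n-terminated line, it can be easier to generate the file than to edit it by hand. The sketch below shows one way to do that. The helper name write_crlf_fields, the field values, and the file names in the example call are all assumptions for illustration.

```shell
#!/bin/sh
# write_crlf_fields is a hypothetical helper, not part of OpenWayback.
# It writes each argument as one CRLF-terminated line, then a blank CRLF line.
write_crlf_fields() {
    out="$1"
    shift
    : > "$out"                           # create or truncate the output file
    for line in "$@"; do
        printf '%s\r\n' "$line" >> "$out"
    done
    printf '\r\n' >> "$out"              # closing blank line, also CRLF
}

# Example with assumed field values and file names:
#   write_crlf_fields fields.txt 'software: example-crawler/1.0' 'operator: Example Archive'
#   warc-header target.warc.gz fields.txt example-id
```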

zipline-manifest

$ zipline-manifest
Usage: ZIPLINES_PATH

zl-bin-search

$ zl-bin-search
USAGE:

zl-bin-search [-format FORMAT] [-max MAX_BLOCKS] SUMMARY LOCATION KEY

Search a ziplined compressed CDX format index for key
KEY to STDOUT. SUMMARY and LOCATION are paths to the
block summary and file location files.
With -format, output CDX in format FORMAT.
With -max, limit search at most MAX_BLOCKS blocks.