Scripts and templates used for born-digital transfers at the Bentley Historical Library. These utilities are primarily used to assist in the transfer of materials from removable media using the BHL's Removable Media Workstations (RMWs) and RipStation.
- Python 3: Required to run all utilities
- Pillow: To adjust images of removable media
- FFmpeg: To validate audio and video files and to create DIPs for audio CDs
- HandBrake CLI: To create DIPs for video DVDs
- bulk_extractor: To scan for PII
- rsync: To copy files on non-Windows machines
- Brunnhilde: To generate reports (non-Windows machines only)
pip install git+https://github.com/bentley-historical-library/bhl_born_digital_utils.git
A tracking template used for born-digital transfers at the Bentley Historical Library. Many of the utilities detailed below rely on the bhl_inventory.csv. See README for bhl_inventory.csv.
This script serves as the entry point to various born-digital transfer utilities. A summary of the script's usage and available actions is below, followed by detailed instructions for each utility. All utilities require, at minimum, an accession number, corresponding to a directory with transferred removable media items, and an action.
Usage: bhl_bd_utils.py ACCESSION_NUMBER action [options]
Action | Description |
---|---|
-c, --create_transfer | Create a RMW transfer |
-e, --empty | Check for empty folders and files |
-m, --missing | Check for missing barcodes and folders |
-o, --osfiles | Check for and delete system files and directories |
-s, --structure | Check RipStation output structure |
-u, --unhide | Unhide folders (Windows workstations only) |
-b, --bulkextractor | Run bulk_extractor |
--copy | Copy accession from RMW |
--move_separations | Move separations |
--av_media | Separate AV media |
--rename_files | Rename files with invalid characters |
--dips | Make DIPs for audio CDs and video DVDs |
--split_transfer | Split transfer into smaller chunks |
--brunnhilde | Run Brunnhilde |
Many of the born-digital transfer utilities make use of a .bhl_bd_utils configuration file in the current user's home directory. If a configuration file does not exist, the utility will prompt you to create one. The available settings are detailed below.
Setting | Description |
---|---|
input | Directory where accession directories can be found. This should be a local directory. |
logs | Directory where various program logs will be stored |
destination | Directory where accessions will be copied to from the RMW |
separations | Directory where separations will be moved |
webcam_dir | Directory where images are saved by the RMW's webcam |
handbrake | Full path to an installed HandBrake CLI executable |
handbrake_preset | Full path to a HandBrake preset JSON file |
ffmpeg | Full path to an installed FFmpeg executable |
bulk_extractor | Full path to an installed bulk_extractor executable |
brunnhilde | Directory where Brunnhilde reports will be stored |
Some of these settings can be overridden from the command line. For example, an -i flag can be passed along with the path to a directory to override the configured input directory. Below are the optional arguments that can be passed to bhl_bd_utils.py and the configuration default that each one overrides.
Argument | Help |
---|---|
-i, --input | Override input |
-d, --destination | Override destination |
-l, --logs | Override logs |
--separations_dir | Override separations |
By default, bhl_bd_utils.py assumes that the scripts are being run on transfers in a local RMW directory. In some cases, these scripts may be run on a transfer after it has been copied to the remote destination directory. Passing --base remote will override the default behavior and use the configured destination directory as the input directory.
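
The way the input directory is chosen can be sketched as follows. This is a minimal illustration, not the utility's actual code: the function name `resolve_input_dir`, the dictionary representation of the configuration, and the assumption that an explicit -i value takes precedence over --base remote are all hypothetical.

```python
def resolve_input_dir(config, base="local", input_override=None):
    """Pick the directory that holds accession directories.

    `config` is assumed to be a dict with "input" and "destination" keys,
    mirroring the settings table above.
    """
    if input_override:
        # Assumption: an explicit -i/--input value wins over everything else.
        return input_override
    if base == "remote":
        # --base remote: treat the configured destination as the input.
        return config["destination"]
    return config["input"]
```

With a configuration of `{"input": "/rmw/transfers", "destination": "/network/accessions"}`, calling `resolve_input_dir(config, base="remote")` would return the destination path.
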
Creates accession and barcode directories, along with bhl_metadata and bhl_notices directories, within the configured input directory.
Requirements:
- This script uses Pillow to adjust images of removable media.
- bhl_notice uses the JsBarcode CDN to render a Codabar barcode in the HTML document.
bhl_bd_utils.py ACCESSION_NUMBER -c/--create_transfer [--metadata_off] [--notices_off]
Argument | Help |
---|---|
ACCESSION_NUMBER | The accession number |
-c, --create_transfer | Create a RMW transfer |
--metadata_off | Turn off creating bhl_metadata directory inside barcode folders |
--notices_off | Turn off creating bhl_notices inside accession folder |
Acknowledgments: This utility is developed based on CollectionSetup.exe by Matt Adair.
Checks for empty directories and 0-byte files in a source directory and prints the results to the terminal.
bhl_bd_utils.py ACCESSION_NUMBER -e/--empty
Argument | Help |
---|---|
ACCESSION_NUMBER | The accession number |
-e, --empty | Check for empty folders and files |
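
As a rough sketch of this kind of check (the function name `find_empty` is hypothetical and this is not necessarily how the utility is written), a directory tree can be walked once to collect both empty directories and 0-byte files:

```python
import os


def find_empty(source_dir):
    """Return (empty_dirs, empty_files) found anywhere under source_dir."""
    empty_dirs, empty_files = [], []
    for root, dirs, files in os.walk(source_dir):
        # A directory with no subdirectories and no files is empty.
        if not dirs and not files:
            empty_dirs.append(root)
        for name in files:
            path = os.path.join(root, name)
            if os.path.getsize(path) == 0:
                empty_files.append(path)
    return empty_dirs, empty_files
```
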
Parses the bhl_inventory.csv and the subdirectories of a source directory, compares the results, and lists (1) barcodes that are in the bhl_inventory.csv but not in the source directory and (2) subdirectories in the source directory that are not accounted for in the bhl_inventory.csv.
bhl_bd_utils.py ACCESSION_NUMBER -m/--missing
Argument | Help |
---|---|
ACCESSION_NUMBER | The accession number |
-m, --missing | Check for missing barcodes and folders |
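
The comparison amounts to two set differences. The sketch below assumes the inventory has a column named `barcode` (the actual column name is not stated here) and uses a hypothetical function name. It does not skip non-barcode directories such as bhl_notices, so those would be reported as unaccounted for.

```python
import csv
import os


def compare_inventory(source_dir, barcode_column="barcode"):
    """Return (missing_from_disk, missing_from_inventory) as sorted lists."""
    inventory_path = os.path.join(source_dir, "bhl_inventory.csv")
    with open(inventory_path, newline="", encoding="utf-8") as f:
        inventoried = {
            row[barcode_column].strip()
            for row in csv.DictReader(f)
            if row.get(barcode_column)
        }
    # Only directories count as transferred items; loose files are ignored.
    on_disk = {entry.name for entry in os.scandir(source_dir) if entry.is_dir()}
    return sorted(inventoried - on_disk), sorted(on_disk - inventoried)
```
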
Checks for and deletes operating system files and directories in a source path. Operating system files checked include Thumbs.db, .DS_Store, Desktop DB, and Desktop DF. Operating system directories checked include .Trashes, .Spotlight-V100, and .fseventsd. The script will print all found files and directories to the terminal to confirm deletion. Optional arguments can turn off deleting Thumbs.db, .DS_Store, Desktop DB/DF, and directories (.Trashes, .Spotlight-V100, and .fseventsd).
bhl_bd_utils.py ACCESSION_NUMBER -o/--osfiles [--thumbsdb_off] [--dsstore_off] [--desktopdbdf_off] [--dirs_off]
Argument | Help |
---|---|
ACCESSION_NUMBER | The accession number |
-o, --osfiles | Check for and delete system files |
--thumbsdb_off | Turn off deleting Thumbs.db files |
--dsstore_off | Turn off deleting .DS_Store files |
--desktopdbdf_off | Turn off deleting Desktop DB and Desktop DF files |
--dirs_off | Turn off deleting .Trashes, .Spotlight-V100, and .fseventsd folders |
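
The search step can be sketched as below. The names `find_system_items`, `skip_files`, and `skip_dirs` are hypothetical stand-ins for the "off" options, matching is assumed to be exact and case-sensitive, and the sketch only reports what it finds so that deletion can be confirmed separately.

```python
import os

SYSTEM_FILES = {"Thumbs.db", ".DS_Store", "Desktop DB", "Desktop DF"}
SYSTEM_DIRS = {".Trashes", ".Spotlight-V100", ".fseventsd"}


def find_system_items(source_dir, skip_files=(), skip_dirs=False):
    """Return (files, dirs) under source_dir that match the system names."""
    wanted_files = SYSTEM_FILES - set(skip_files)
    found_files, found_dirs = [], []
    for root, dirs, files in os.walk(source_dir):
        if not skip_dirs:
            for name in [d for d in dirs if d in SYSTEM_DIRS]:
                found_dirs.append(os.path.join(root, name))
                # The whole directory is reported, so do not descend into it.
                dirs.remove(name)
        found_files.extend(
            os.path.join(root, f) for f in files if f in wanted_files
        )
    return found_files, found_dirs
```
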
Checks RipStation output structure, including checking to ensure that .mp4 and .wav DIPs have been made for video DVDs and audio CDs, respectively, that photos of removable media exist when applicable, and that .wav and .mp4 files are valid. Optional arguments can turn off validating .wav and .mp4 files.
Requirements: This script uses ffmpeg to validate .wav and .mp4 files.
bhl_bd_utils.py ACCESSION_NUMBER -s/--structure [--validation_off]
Argument | Help |
---|---|
ACCESSION_NUMBER | The accession number |
-s, --structure | Check RipStation output structure |
--validation_off | Turn off validating .wav (audio CDs) and .mp4 (video DVDs) |
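
The exact FFmpeg invocation used for validation is not described here. One common approach, shown as an assumption rather than as the utility's method, is to decode the file to a null output and treat any reported error as a failure. The function names are hypothetical.

```python
import subprocess


def build_validation_command(path, ffmpeg="ffmpeg"):
    """Build a command that decodes `path` and discards the output."""
    # -v error limits logging to errors; "-f null -" writes nowhere.
    return [ffmpeg, "-v", "error", "-i", path, "-f", "null", "-"]


def is_valid(path, ffmpeg="ffmpeg"):
    """Return True if FFmpeg decodes the file without reporting errors."""
    result = subprocess.run(
        build_validation_command(path, ffmpeg),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0 and not result.stderr.strip()
```

On Windows, the `ffmpeg` argument would be the full path from the ffmpeg configuration setting.
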
Unhide hidden sub-directories in a directory. Note: This removes the Windows -H (hidden) and -S (system) attributes from directories, and as such is only applicable on Windows machines.
bhl_bd_utils.py ACCESSION_NUMBER -u/--unhide
Argument | Help |
---|---|
ACCESSION_NUMBER | The accession number |
-u, --unhide | Unhide folders |
This script will run bulk_extractor from the command line. Currently, scanning for exif metadata generated from images is turned off, and bulk_extractor will use its accts scanner to search for PII such as Social Security numbers, credit card numbers, telephone numbers, and email addresses. Reports generated by bulk_extractor will be stored in the configured logs directory, in subdirectories for each piece of removable media scanned. To save time and resources and to avoid false positives, bulk_extractor is not run on audio CDs or video DVDs. Following scanning, all empty reports are deleted, leaving only reports that had one or more hits. On Windows machines, the utility uses the bulk_extractor configuration setting, corresponding to an exact path to a bulk_extractor executable, and on other operating systems assumes that bulk_extractor is available on the system path.
bhl_bd_utils.py ACCESSION_NUMBER -b/--bulkextractor
Argument | Help |
---|---|
ACCESSION_NUMBER | The accession number |
-b, --bulkextractor | Run bulk_extractor |
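
The cleanup step after scanning can be sketched as follows. This assumes that an "empty" report is a zero-byte file, which may not match how the utility decides a report is empty; the function name is hypothetical.

```python
import os


def delete_empty_reports(report_dir):
    """Delete zero-byte files under report_dir and return their paths."""
    removed = []
    for root, _dirs, files in os.walk(report_dir):
        for name in files:
            path = os.path.join(root, name)
            # Assumption: a report with no hits is a zero-byte file.
            if os.path.getsize(path) == 0:
                os.remove(path)
                removed.append(path)
    return removed
```
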
This utility copies a directory using robocopy (on Windows) or rsync (on other operating systems). Its primary purpose is to copy accessions from a removable media workstation to a network storage location. It uses the configured defaults for input and destination, which can optionally be overridden by passing an -i and/or -d argument. This utility will create a log file of the form [accession_number]_[timestamp].txt in the configured logs directory.
bhl_bd_utils.py ACCESSION_NUMBER --copy
Argument | Help |
---|---|
ACCESSION_NUMBER | The accession number |
--copy | Copy accession from RMW |
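
A minimal sketch of choosing between the two copy tools is below. The specific flags (`/E` and `/LOG:` for robocopy, `-a` and `--log-file=` for rsync) and the timestamp format are assumptions chosen for illustration; the utility may pass different options. The function names are hypothetical.

```python
import platform
from datetime import datetime


def log_file_name(accession, when=None):
    """Build a log name of the form [accession_number]_[timestamp].txt."""
    when = when or datetime.now()
    # Assumption: the timestamp format is not specified in this document.
    return f"{accession}_{when.strftime('%Y%m%d%H%M%S')}.txt"


def build_copy_command(source, destination, log_path, system=None):
    """Return a robocopy command on Windows and an rsync command elsewhere."""
    system = system or platform.system()
    if system == "Windows":
        # /E copies subdirectories, including empty ones.
        return ["robocopy", source, destination, "/E", f"/LOG:{log_path}"]
    # The trailing slash makes rsync copy the directory's contents.
    return ["rsync", "-a", f"--log-file={log_path}",
            source.rstrip("/") + "/", destination]
```
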
This utility moves separated directories to the configured separations directory. The script parses the bhl_inventory.csv in a given source directory to identify media that has been marked as separated by a separation column with a value of y, and then moves the barcode directory from within the source directory to an [accession]_separations directory in the given destination. It optionally takes a --separations_dir argument to override the configured separations directory.
bhl_bd_utils.py ACCESSION_NUMBER --move_separations
Argument | Help |
---|---|
ACCESSION_NUMBER | The accession number |
--move_separations | Move separations |
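
The steps above can be sketched as follows. The `separation` column and its `y` value come from the description; the `barcode` column name, the case-insensitive match, and the function name are assumptions.

```python
import csv
import os
import shutil


def move_separations(source_dir, separations_dir, accession):
    """Move barcode directories marked as separated; return moved barcodes."""
    inventory_path = os.path.join(source_dir, "bhl_inventory.csv")
    with open(inventory_path, newline="", encoding="utf-8") as f:
        separated = [
            row["barcode"]  # assumed column name
            for row in csv.DictReader(f)
            if (row.get("separation") or "").strip().lower() == "y"
        ]
    target = os.path.join(separations_dir, f"{accession}_separations")
    moved = []
    for barcode in separated:
        src = os.path.join(source_dir, barcode)
        if os.path.isdir(src):
            os.makedirs(target, exist_ok=True)
            shutil.move(src, os.path.join(target, barcode))
            moved.append(barcode)
    return moved
```
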
This utility moves audio-formatted CDs and video-formatted DVDs into their own directory so that AV content can be processed using Archivematica's automation-tools and data content can be sent to Archivematica's backlog. The script parses the bhl_inventory.csv in a given source directory to identify media with a media_type that begins with audio or video and then moves those barcode directories into a new directory named [accession]_audiovisual in the source directory's parent directory. For example, given a source directory of /path/to/source/1234, the script will move audiovisual media into /path/to/source/1234_audiovisual.
bhl_bd_utils.py ACCESSION_NUMBER --av_media [-a/-accession ACCESSION]
Argument | Help |
---|---|
ACCESSION_NUMBER | The accession number |
--av_media | Separate AV media |
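
The two rules described above, which media count as audiovisual and where they go, can be illustrated with a short sketch. The function names are hypothetical, and treating the media_type match as case-insensitive is an assumption.

```python
from pathlib import Path


def is_av(media_type):
    """Return True when media_type begins with 'audio' or 'video'."""
    return media_type.strip().lower().startswith(("audio", "video"))


def av_directory(source_dir):
    """Return the sibling [accession]_audiovisual directory for source_dir."""
    source = Path(source_dir)
    return source.with_name(source.name + "_audiovisual")
```

For a source directory of /path/to/source/1234, `av_directory` returns /path/to/source/1234_audiovisual, matching the example above.
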
This utility replaces 'invalid' characters in a filename with an underscore. The script is based heavily on Archivematica's sanitize_names.py. The rename_files utility allows several more characters in a filename than Archivematica does, as its primary use case is to resolve issues with running bagit.py on directories that contain files with certain invalid characters. The utility prints out a list of files that will be renamed and asks for confirmation. This script should be used sparingly, and only after bagging attempts on Windows and Linux filesystems have failed.
bhl_bd_utils.py ACCESSION_NUMBER --rename_files
Argument | Help |
---|---|
ACCESSION_NUMBER | The accession number |
--rename_files | Rename files |
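
The replacement itself can be sketched with a regular expression. The set of allowed characters below is purely illustrative; the actual list used by the utility is not given in this document, and the function name is hypothetical.

```python
import re

# Hypothetical allowed set: letters, digits, dot, underscore, hyphen,
# space, and parentheses. Everything else is treated as invalid.
INVALID_CHARACTERS = re.compile(r"[^A-Za-z0-9._\- ()]")


def sanitize_name(name):
    """Replace each invalid character in a filename with an underscore."""
    return INVALID_CHARACTERS.sub("_", name)
```

A real run would first collect the proposed renames, print them, and only rename after confirmation, as described above.
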
This utility makes access derivatives (DIPs) for audio CDs and video DVDs to be uploaded to the Bentley Digital Media Library. Its primary use case is to make DIPs in batch for media transferred using the RipStation, but it can also be used for media transferred using the RMWs. The utility parses the bhl_inventory.csv in a given accession directory to identify all audio CDs and video DVDs that (1) have been successfully transferred and (2) do not have an existing access derivative. The utility then does the following depending on the media type:
- Audio CD: Concatenates all of the individual .wav tracks from an audio CD into a single [barcode].wav file using FFmpeg. On Windows machines, the utility uses the ffmpeg configuration setting, corresponding to an exact path to an FFmpeg executable, and on other operating systems assumes that ffmpeg is available on the system path.
- Video DVD: Uses the HandBrake CLI to scan an .iso disc image and make an .mp4 for each title found on the disc. The utility uses the handbrake_preset configuration, which corresponds to the exact path to a HandBrake preset JSON file, to specify the settings for encoding mp4s. If there are multiple disc images (e.g., for media imaged on an RMW using FTK Imager), it will first make a temporary concatenated .iso before generating access derivatives. On Windows machines, the utility uses the handbrake configuration setting, corresponding to an exact path to a HandBrakeCLI executable, and on other operating systems assumes that HandBrakeCLI is available on the system path.
bhl_bd_utils.py ACCESSION_NUMBER --dips
Argument | Help |
---|---|
ACCESSION_NUMBER | The accession number |
--dips | Make DIPs |
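
How the audio tracks are concatenated is not specified beyond the use of FFmpeg. One possibility, shown here as an assumption, is FFmpeg's concat demuxer, which reads a text file listing the inputs. The function names are hypothetical, and the tracks are assumed to be passed in playback order.

```python
def concat_list_text(track_paths):
    """Build the contents of an FFmpeg concat list file."""
    lines = []
    for path in track_paths:
        # Single quotes inside a quoted path are escaped as '\''.
        escaped = str(path).replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"


def build_concat_command(list_file, output_wav, ffmpeg="ffmpeg"):
    """Build a command that joins the listed tracks without re-encoding."""
    return [ffmpeg, "-f", "concat", "-safe", "0",
            "-i", list_file, "-c", "copy", output_wav]
```

The list text would be written to a temporary file, and the resulting command run with `subprocess.run`, producing a single [barcode].wav.
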
This utility splits a transfer into multiple smaller chunks. This is especially useful for transfers with more than 10,000 files, which can cause problems in Archivematica. The utility counts the number of files in each item within a transfer and then moves items into directories of fewer than the split size, which defaults to 5,000 files and can be modified by passing a --split_size parameter. The utility keeps individual items whole (i.e., it will not move some subdirectories from one item into one chunk and other subdirectories into another chunk). As a result, it is best suited to transfers of many small-to-medium size items, rather than a transfer of one large item (e.g., a single hard drive with tens of thousands of files). The chunk directories are created inside the transfer directory and have a three-digit sequence appended to their names. For example, a transfer 172345 with 12,234 files would be split into about 3 chunks: 172345-001, 172345-002, and 172345-003.
bhl_bd_utils.py ACCESSION_NUMBER --split_transfer [--split_size INT]
Argument | Help |
---|---|
ACCESSION_NUMBER | The accession number |
--split_transfer | Split transfer into smaller chunks |
--split_size | Maximum file count for each chunk (optional; defaults to 5,000) |
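
The grouping logic can be sketched as a simple greedy pass over the items. This is an illustration under assumptions, not the utility's code: the function name is hypothetical, items are taken in the order given, and an item that alone exceeds the split size is placed in its own chunk.

```python
def plan_chunks(item_counts, transfer_name, split_size=5000):
    """Group whole items into chunks of at most split_size files.

    item_counts maps an item (barcode directory) name to its file count.
    """
    chunks, current, current_total = [], [], 0
    for item, count in item_counts.items():
        # Start a new chunk when adding this item would exceed the limit.
        if current and current_total + count > split_size:
            chunks.append(current)
            current, current_total = [], 0
        current.append(item)
        current_total += count
    if current:
        chunks.append(current)
    # Chunk names are the transfer name plus a three-digit sequence.
    return {
        f"{transfer_name}-{number:03d}": items
        for number, items in enumerate(chunks, start=1)
    }
```

Because items are never split, the number of chunks depends on how the file counts are distributed across items, which is why the example above says "about 3 chunks".
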
This utility runs Brunnhilde to generate reports on the contents of a given transfer. The utility is configured to run Brunnhilde with the -z (decompress and scan archived files) and -n (skip virus scan) options. The outputs of this scan include a CSV output from Siegfried, a tree report of the transfer's directory structure, and an HTML report with aggregate statistics for the transfer including detailed information about file formats, unidentified files, last modified dates, and duplicate files. Reports will be output to the configured brunnhilde directory and will be copied to the transfer's metadata\submissionDocumentation directory once Brunnhilde has finished.
This utility can only be run from a Linux or macOS machine.
bhl_bd_utils.py ACCESSION_NUMBER --brunnhilde
Argument | Help |
---|---|
ACCESSION_NUMBER | The accession number |
--brunnhilde | Run Brunnhilde |