Utility to generate Bags for objects using Islandora's REST interface, either from a command-line tool or via a batch-oriented queue. In addition, Islandora Bagger provides its own REST interface that allows population of the queue. Specific content is added to the Bag's data directory and bag-info.txt file using plugins. Bags are compliant with version 1.0 of the BagIt specification. If you want to allow your Islandora users to initiate the creation of Bags, install the Islandora Bagger Integration module.
This utility is for Islandora 8.x-1.x. For creating Bags for Islandora 7.x, use Islandora Fetch Bags.
- PHP 7.2 or higher
- composer
- Clone this git repository.
- cd islandora_bagger
- php composer.phar install (or equivalent on your system, e.g., ./composer install)
- If you want to try Islandora Bagger's REST interface, you can install the Symfony Local Web Server. Note that this server is part of the Symfony binary, which Islandora Bagger does not otherwise require.
Even though each Bag is created using options defined in its own configuration file (see next section), Islandora Bagger uses several application-wide configuration options defined in the parameters section of config/services.yaml.
You probably don't need to change app.queue.path and app.location.log.path since these specify default locations for some data files. However, if you are providing the ability for users to download serialized Bags, you will need to change the app.bag.download.prefix parameter to the hostname/path to prepend to each Bag's filename, as described in the "Making Bags downloadable" section below.
The command to generate a Bag takes two required parameters, --settings and --node. Assuming the configuration file is named sample_config.yml, and the Drupal node ID you want to generate a Bag from is 112, the command would look like this:
./bin/console app:islandora_bagger:create_bag --settings=sample_config.yml --node=112
A third parameter, --extra, is explained in the "Passing settings via the command line" section below.
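When the same configuration file is used for several nodes, the command above can be driven from a script. The following is a minimal Python sketch, not part of Islandora Bagger: the function names `create_bag_command` and `create_bags` are hypothetical, and it assumes the script runs from the islandora_bagger directory so that ./bin/console resolves.

```python
import subprocess


def create_bag_command(settings_path, node_id):
    """Return the argument list for one create_bag invocation."""
    return [
        "./bin/console",
        "app:islandora_bagger:create_bag",
        f"--settings={settings_path}",
        f"--node={node_id}",
    ]


def create_bags(settings_path, node_ids, run=subprocess.run):
    """Run create_bag once per node ID and return the commands that were run.

    `run` is injectable so the loop can be exercised without executing anything.
    """
    commands = []
    for node_id in node_ids:
        cmd = create_bag_command(settings_path, node_id)
        run(cmd, check=True)  # check=True raises if the command exits non-zero
        commands.append(cmd)
    return commands
```

Calling `create_bags("sample_config.yml", [112, 113])` would run the documented command once for each node ID.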
For each Bag it creates, Islandora Bagger requires a configuration file in YAML format:
####################
# General settings #
####################
# Required.
drupal_base_url: 'http://localhost:8000'
drupal_basic_auth: ['admin', 'islandora']
# Register creation of this Bag with Islandora Bagger Integration. Default is false.
register_bags_with_islandora: true
# Required. How to name the Bag directory (or file if serialized). One of 'nid' or 'uuid'.
bag_name: nid
# Optional. Template for the Bag name. The % is replaced by the nid or uuid (depending on
# the value of "bag_name") in the name of the Bag directory (or file if serialized). If absent,
# the bare value of the nid or uuid is used.
# bag_name_template: sfu_aip_%
# Both temp_dir and output_dir are required.
temp_dir: /tmp/islandora_bagger_temp
output_dir: /tmp
# Required. Whether or not to zip up the Bag. One of 'false', 'zip', or 'tgz'.
serialize: zip
# Required. Whether or not to log Bag creation. Set log output path in config/packages/{environment}/monolog.yaml.
log_bag_creation: true
# Optional. Static bag-info.txt tags. No plugin needed. You can use any combination
# of tag name / value here, as long as you separate tags from values using a colon (:).
bag-info:
  Contact-Name: Mark Jordan
  Contact-Email: bags@sfu.ca
  Source-Organization: Simon Fraser University
  Foo: Bar
# Optional. Whether or not to include the Payload-Oxum tag in bag-info.txt. Defaults to true.
# include_payload_oxum: false
# Optional. Which hash algorithm(s) to use.
# One of md5, sha1, sha224, sha256, sha384, sha512, sha3224, sha3256, sha3384, sha3512,
# or a list of values. Default is sha512.
# hash_algorithm: md5
# hash_algorithm: [md5, sha1, sha256]
# Optional. Timeout to use for Guzzle requests, in seconds. Default is 60.
# http_timeout: 120
# Optional. Whether or not to verify the Certificate Authority in Guzzle requests
# against websites that implement HTTPS. Used on Mac OSX if Islandora Bagger is
# interacting with websites running HTTPS. Default is true. Note that if you set
# verify_ca to false, you are disabling certificate verification, so Islandora Bagger
# cannot confirm the identity of the remote website. Use at your own risk.
# verify_ca: false
# Optional. Whether or not to delete the settings file upon successful creation
# of the Bag. Default is false.
# delete_settings_file: true
# Optional. Whether or not to log the serialized Bag's location so Islandora can
# retrieve the Bag's download URL. Default is false.
# log_bag_location: true
############################
# Plugin-specific settings #
############################
# Required. Register plugins to populate bag-info.txt and the data directory.
# Plugins are executed in the order they are listed here.
plugins: ['AddBasicTags', 'AddMedia', 'AddNodeJson', 'AddNodeJsonld', 'AddMediaJson', 'AddMediaJsonld', 'AddFileFromTemplate', 'AddFedoraTurtle', 'AddNodeCsv']
# Used by the 'AddFedoraTurtle' plugin.
fedora_base_url: 'http://localhost:8080/fcrepo/rest/'
# Used by the 'AddMedia' plugin. These are the Drupal taxonomy term IDs
# from the "Islandora Media Use" vocabulary. Use an empty list (e.g., [])
# to include all media.
drupal_media_tags: ['/taxonomy/term/16']
# Used by the 'AddMedia' plugin. Indicates whether the Bag should contain a file
# named 'media_use_summary.tsv' that lists all the media files plus the taxonomy
# name corresponding to the 'drupal_media_tags' list. Default is false.
include_media_use_list: true
# Used by the 'AddMedia' plugin. Include this option to save media files in the
# specified subdirectories within the Bag's data directory. Include the trailing /.
# media_file_directories: 'foo/bar/baz/'
# Used by the 'AddFileFromTemplate' plugin.
# template_path can be absolute or relative to the Islandora Bagger directory.
template_path: 'templates/mods.twig'
# templated_output_filename will be assigned to the file generated from the template,
# which will be added to the Bag's data directory. You may include a subdirectory
# or subdirectories as part of the filename.
templated_output_filename: 'metadata/MODS.xml'
# Used by the 'AddNodeCsv' plugin.
# csv_output_filename will be assigned to the CSV file, which will be added to
# the Bag's data directory. You may include a subdirectory or subdirectories
# as part of the filename.
csv_output_filename: 'metadata.csv'
####################
# Post-Bag scripts #
####################
# post_bag_scripts: ["php /tmp/test.php", "python /path/to/script.py"]
The resulting Bag (shown here as an unserialized directory) would look something like this:
/tmp/112
├── bag-info.txt
├── bagit.txt
├── data
│ ├── IMG_1410.JPG
│ ├── media.json
│ ├── media.jsonld
│ ├── node.json
│ ├── node.jsonld
│ ├── metadata
│ │ └── MODS.xml
│ ├── metadata.csv
│ ├── media_use_summary.tsv
│ └── node.turtle.rdf
├── manifest-sha1.txt
└── tagmanifest-sha1.txt
Since the Drupal node's ID is not included in the configuration file, the same file can be used for multiple Bags. It is called a 'per-Bag' configuration file because it is used each time Islandora Bagger creates a Bag.
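The naming options in the configuration file (bag_name, bag_name_template, and serialize) together determine what the output is called. The sketch below illustrates that naming rule as the configuration comments describe it. It is not Islandora Bagger's code: the function name `bag_name` is hypothetical, and the assumption that a serialized Bag's extension equals the serialize value is inferred from the 4.zip example elsewhere in this document.

```python
def bag_name(identifier, template=None, serialize=False):
    """Return the expected Bag directory or file name.

    identifier: the node's nid or uuid, depending on the bag_name setting.
    template:   the optional bag_name_template value, e.g. 'sfu_aip_%'.
    serialize:  False, 'zip', or 'tgz' (extension mapping is an assumption).
    """
    if template is None:
        name = str(identifier)
    else:
        # The % placeholder is replaced by the nid or uuid.
        name = template.replace("%", str(identifier))
    if serialize in ("zip", "tgz"):
        name = f"{name}.{serialize}"
    return name
```

For example, `bag_name(112, template="sfu_aip_%")` gives `sfu_aip_112`, and `bag_name(4, serialize="zip")` gives `4.zip`.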
In some cases, you may want to define configuration options in config/services.yml that are normally defined in the per-Bag configuration file. The most common reasons to do this are 1) to keep sensitive data such as login credentials out of the per-Bag configuration files and 2) to centralize commonly used options in one place rather than repeat them in each per-Bag configuration file.
To do this, define the options from the per-Bag configuration file in config/services.yml and prepend app. to their keys. For example, to define drupal_base_url and drupal_basic_auth in config/services.yml, do the following:
- Comment them out or remove them from the per-Bag file:
# Required.
# drupal_base_url: 'http://localhost:8000'
# drupal_basic_auth: ['admin', 'islandora']
- Define them in the parameters section of config/services.yml and prepend app. to each option key:
parameters:
    app.queue.path: '%kernel.project_dir%/var/islandora_bagger.queue'
    app.location.log.path: '%kernel.project_dir%/var/islandora_bagger.locations'
    # The hostname/path to where users can download serialized bags. This string
    # will be prepended to the Bag's filename.
    app.bag.download.prefix: 'http://example.com/bags/'
    # These options are usually defined in the per-Bag config file.
    app.drupal_base_url: 'http://localhost:8000'
    app.drupal_basic_auth: ['admin', 'islandora']
A couple of things to note about this:
- If the options are defined in both places, the options in the per-Bag file override their counterparts in config/services.yml. This way, you can define commonly used options in config/services.yml but override them on a per-Bag basis.
- Options defined in config/services.yml are not accessible to post-Bag scripts.
You can pass settings to Islandora Bagger on the command line using the optional --extra parameter:
./bin/console app:islandora_bagger:create_bag --settings=sample_config.yml --node=112 --extra='{"serialize": "tgz", "hash_algorithm": "md5"}'
The value of this parameter is a serialized JSON object containing key:value pairs of settings. Key:value pairs passed in this way will be added to the config settings and will also override settings in the config file and in 'config/services.yml'.
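The three sources of settings therefore form a precedence chain: application-wide parameters are overridden by the per-Bag file, which is in turn overridden by --extra. The following Python sketch only illustrates that ordering; it is not how Islandora Bagger is implemented, and the function name `effective_settings` and the dictionary inputs are hypothetical stand-ins for the parsed YAML files.

```python
import json

APP_PREFIX = "app."


def effective_settings(service_params, per_bag_settings, extra_json=None):
    """Merge settings in increasing order of precedence.

    service_params:   the 'parameters' mapping from the services file,
                      with keys such as 'app.drupal_base_url'.
    per_bag_settings: the parsed per-Bag configuration file.
    extra_json:       the JSON string given to --extra, if any.
    """
    settings = {}
    # Lowest precedence: application-wide options, with the prefix removed.
    for key, value in service_params.items():
        if key.startswith(APP_PREFIX):
            settings[key[len(APP_PREFIX):]] = value
    # The per-Bag file overrides the application-wide values.
    settings.update(per_bag_settings)
    # --extra overrides both and may add new keys.
    if extra_json:
        settings.update(json.loads(extra_json))
    return settings
```

With `app.http_timeout: 60` in the services file, `http_timeout: 120` in the per-Bag file, and `--extra='{"hash_algorithm": "md5"}'`, the merged result has `http_timeout` of 120 and `hash_algorithm` of md5.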
Islandora Bagger can also initiate the creation of Bags via a simple REST interface. It does this by 1) receiving a POST request containing the node ID of the Islandora object to be bagged in an "Islandora-Node-ID" header and 2) receiving a YAML configuration file as the body of the request. Using this data, it adds the request to a queue (see below), which is then processed at a later time. The REST interface also provides the ability to GET a Bag's download URL.
Note that requests to the REST interface do not generate Bags directly; they only populate a queue as described below.
To use the REST API to add a Bag-creation job to the queue:
- Run symfony server:start
- Prepare a YAML configuration file for posting to the REST API.
- Run curl -v -X POST -H "Islandora-Node-ID: 4" --data-binary "@sample_config.yml" http://127.0.0.1:8001/api/createbag
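The same request can be sent from a script. Here is a minimal sketch using only Python's standard library; the function name `build_createbag_request` is hypothetical, and the endpoint default simply mirrors the curl example. It builds the request without sending it, so the caller decides when to call `urllib.request.urlopen`.

```python
import urllib.request


def build_createbag_request(node_id, config_bytes,
                            endpoint="http://127.0.0.1:8001/api/createbag"):
    """Build a POST request carrying the YAML config as the body.

    node_id:      the Drupal node ID, sent in the Islandora-Node-ID header.
    config_bytes: the raw contents of the per-Bag YAML file.
    """
    return urllib.request.Request(
        endpoint,
        data=config_bytes,
        method="POST",
        # urllib normalizes the header name's capitalization internally;
        # HTTP header names are case-insensitive.
        headers={"Islandora-Node-ID": str(node_id)},
    )


# Usage (requires the web server to be running):
# with open("sample_config.yml", "rb") as fh:
#     request = build_createbag_request(4, fh.read())
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```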
To use the REST API to get a serialized Bag's location for download:
- Make sure your configuration file's serialize setting is either "zip" or "tgz", and the log_bag_location setting is true.
- Create a Bag using the command line or via a REST POST request.
- Start the web server, as above, if not already started.
- Run curl -v -H "Islandora-Node-ID: 4" http://127.0.0.1:8001/api/createbag. Your response will be a JSON string containing the node ID, the Bag's location, and an ISO8601 timestamp of when the Bag was created, e.g.:
{"nid":"4","location":"http:\/\/example.com\/bags\/4.zip","created":"2019-05-06T19:31:33-0700"}
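A client can decode this response with a few lines of code. The sketch below assumes the response always has the three keys shown in the example (nid, location, created) and that the timestamp uses the same format as the sample; the function name `parse_bag_location` is hypothetical.

```python
import json
from datetime import datetime


def parse_bag_location(response_text):
    """Return (nid, location, created) from the GET response body.

    The escaped slashes (\\/) in the JSON are decoded to plain slashes.
    """
    record = json.loads(response_text)
    # %z accepts UTC offsets written without a colon, such as -0700.
    created = datetime.strptime(record["created"], "%Y-%m-%dT%H:%M:%S%z")
    return record["nid"], record["location"], created
```

Note that the nid is returned as a string, as it appears in the sample response.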
A couple of things to note about this REST API:
- The API lacks credential-based authentication. Using Symfony's firewall to restrict access to the API by IP address should provide sufficient security.
- Islandora Bagger's REST interface accepts POST requests that contain a request body (in this case, the YAML configuration file). Some HTTP clients (Guzzle, for example) convert requests that are redirected (e.g., in response to a 301) to GET. If this happens, the request body is lost and the resulting YAML configuration files will be empty. If you are running Islandora Bagger in a web server environment that returns HTTP response codes in the 3xx range, such as inside an Apache Alias directive, your HTTP client must be configured not to convert redirected POST requests to GET requests. Guzzle's documentation on this behavior is useful.
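Clients other than Guzzle can show the same behavior. As an illustration only, Python's urllib also re-issues some redirected POST requests as GET without the body. One defensive option, sketched below with the hypothetical names `NoRedirectHandler` and `build_no_redirect_opener`, is to refuse to follow redirects at all, so a 3xx response surfaces as an error instead of silently queuing an empty configuration file.

```python
import urllib.request


class NoRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Redirect handler that never follows a redirect."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None tells urllib not to build a follow-up request,
        # so the 3xx response is raised as an HTTPError by the opener.
        return None


def build_no_redirect_opener():
    """Return an opener in which redirects are not followed."""
    # Passing a subclass of HTTPRedirectHandler replaces the default one.
    return urllib.request.build_opener(NoRedirectHandler)


# Usage: opener = build_no_redirect_opener(); opener.open(request)
```

The better fix is usually to post directly to the final URL so that no redirect occurs.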
As described in the previous section, the location of each Bag is available via a GET request to Islandora Bagger's REST interface. If you want to use this information to provide a way to download Bags from Islandora Bagger, follow these steps:
- In the Bag-specific configuration file:
  - Make sure the serialize option is set to zip or tgz (only serialized Bags can be downloaded).
  - Make sure the log_bag_location option is set to true.
  - Make sure the directory specified in the output_dir option is exposed to the web.
- In Islandora Bagger's config/services.yml file:
  - Make sure the app.bag.download.prefix parameter contains the hostname/path leading to the directory specified in the configuration file's output_dir option.
GET requests to the REST API will now return location values that contain URLs that combine the path specified in app.bag.download.prefix with the serialized Bag's filename.
This is insecure, since anyone who can guess the path to a Bag will have access to it. Please join the discussion at this issue if you have a suggestion on implementing more robust security on Bag downloads.
Another approach is to use a post-Bag script (see below) to copy the Bag to a location from where it can be downloaded, and to email the user with the location.
Islandora Bagger implements a simple processing queue, which is populated mainly by REST requests to generate Bags. However, the queue can be populated by any process (manually, scripted, etc.). Islandora Bagger processes the queue by inspecting each entry in first-in, first-out order and, for each entry, running the app:islandora_bagger:create_bag command, which creates the Bag by fetching the files and other data from the Islandora instance as defined in that entry's configuration file.
The queue is a simple tab-delimited text file that contains one entry per line. The three fields in each entry are 1) the node ID, 2) the full path to the YAML configuration file, and 3) an ISO8601 timestamp, e.g.:
2073 /home/mark/Documents/hacking/islandora_bagger/var/islandora_bagger.2073.yaml 2020-09-14T19:01:46-0700
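Because the format is this simple, any script can read the queue. The following Python sketch parses one entry into its three fields; the names `QueueEntry` and `parse_queue_line` are hypothetical, and it assumes the fields are separated by single tab characters and that the timestamp matches the format in the example.

```python
from datetime import datetime
from typing import NamedTuple


class QueueEntry(NamedTuple):
    node_id: str          # 1) the node ID
    config_path: str      # 2) full path to the YAML configuration file
    queued_at: datetime   # 3) the ISO8601 timestamp


def parse_queue_line(line):
    """Split one tab-delimited queue line into a QueueEntry."""
    node_id, config_path, timestamp = line.rstrip("\n").split("\t")
    queued_at = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S%z")
    return QueueEntry(node_id, config_path, queued_at)
```

A line with more or fewer than three tab-separated fields raises a ValueError, which is a reasonable signal that the queue file is malformed.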
To process the queue, run the following command:
./bin/console app:islandora_bagger:process_queue --queue=var/islandora_bagger.queue
where the value of the --queue option is the path to the queue file. This command can be executed as needed, or from within a scheduled job managed by cron. It iterates through the queue in first-in, first-out order. Once processed, an entry is removed from the queue. You can also optionally specify how many queue entries to process by including the --entries option, e.g., ./bin/console app:islandora_bagger:process_queue --queue=var/islandora_bagger.queue --entries=100
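The first-in, first-out behavior with an optional entry limit can be pictured with a short sketch. This is an illustration of the semantics described above, not the process_queue command's actual code; the function name `process_queue` and its `create_bag` callback are hypothetical.

```python
def process_queue(entries, create_bag, max_entries=None):
    """Process entries oldest-first and return the entries left in the queue.

    entries:     queue entries in the order they were added.
    create_bag:  callable invoked once per processed entry.
    max_entries: optional cap, playing the role of the --entries option.
    """
    if max_entries is None:
        limit = len(entries)
    else:
        limit = min(max_entries, len(entries))
    for entry in entries[:limit]:
        create_bag(entry)
    # Processed entries are removed; the rest stay queued for the next run.
    return entries[limit:]
```

Calling it with three entries and `max_entries=2` processes the first two and leaves the third in the queue.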
Since the queue file is just a plain tab-separated value file, you can look at its contents in a variety of ways (opening it in a text editor, using cat, etc.). Islandora Bagger offers two other ways of inspecting the queue:
- via the console command app:islandora_bagger:get_queue (e.g., ./bin/console app:islandora_bagger:get_queue --queue=var/islandora_bagger.queue --output_format=json)
- via the REST interface (e.g., curl -v http://127.0.0.1:8000/api/queue)
In both cases, the output is a serialized JSON object containing each item in the queue. The console command can also print the raw queue if the --output_format option has a value of "csv".
Customizing the generated Bags is done via values in the configuration file and via plugins.
Items in the "General settings" section of the configuration file provide some simple options for customizing Bags, e.g.:
- whether the Bag is named using the node's ID or its UUID
- whether the Bag is serialized (i.e., zipped)
- what tags are included in the bag-info.txt file. Tags specified in the general settings' bag-info option are static in that they are simple strings. To include tags that are dynamically generated, you must use a plugin.
Apart from the static tags mentioned in the previous section, all file content and additional tags are added to the Bag using plugins. Plugins are registered in the plugins section of the configuration file.
The following plugins are bundled with Islandora Bagger:
- AddBasicTags: Adds the Internal-Sender-Identifier bag-info.txt tag using the Drupal URL for the node as its value, and the Bagging-Date tag using the current date as its value.
- AddNodeJson: Adds the Drupal JSON representation of the node, specifically, the response to a request to /node/1234?_format=json.
- AddNodeJsonld: Adds the Drupal JSON-LD representation of the node, specifically, the response to a request to /node/1234?_format=jsonld.
- AddFedoraTurtle: Adds the Fedora Turtle RDF representation of the node.
- AddMedia: Adds media files, such as the Original File, Preservation Master, etc., to the Bag. The specific files added are identified by the relevant tags from the "Islandora Media Use" vocabulary listed in the drupal_media_tags configuration option.
- AddMediaJson: Adds the Drupal JSON representation of the node's media list, specifically, the response to a request to /node/1234/media?_format=json.
- AddMediaJsonld: Adds the Drupal JSON-LD representation of the node's media list, specifically, the response to a request to /node/1234/media?_format=jsonld.
- AddFileFromTemplate: Adds a file generated from a Twig template using data from the node's JSON. Within the template, the data is represented as a PHP array. Basic sample MODS and DC templates are included.
- AddFile: Adds files listed in the files_to_add configuration option, e.g., files_to_add: ['/tmp/file1.txt', '/tmp/file2.txt'].
- AddNodeCsv: Adds a CSV file generated from node field data.
- AddFetch: Adds a fetch.txt file to the Bag, using URLs listed in the fetch_urls configuration option, e.g., fetch_urls: ['http://example.com/path/to/file.htm', 'https://someother.url.com/about'].
- AddPremis: Adds a serialized version of the PREMIS Turtle RDF for the node generated by the Islandora PREMIS module, if it is enabled in the host Drupal.
- Sample: An example plugin for developers.
Each plugin is a PHP class that extends the base AbstractIbPlugin class. The Sample.php plugin illustrates what you can (and must) do within a plugin. Plugins are located in the islandora_bagger/src/Plugin directory, and must implement an execute() method. Within that method, you have access to the Bag object, the Bag temporary directory, the node's ID, and the node's JSON representation from Drupal. You also have access to all values in the configuration file via the $this->settings associative array.
To use a custom plugin, simply register its class name in the plugins list in your configuration file.
The post_bag_scripts option in the configuration file allows you to specify a list of scripts to run after the Bag has been successfully created. These scripts can send email messages, copy Bag files to alternate locations, and perform other tasks. You can include any script, in any language, with the following constraints:
- the scripts must exist on the same system where Islandora Bagger is running, and must be executable by the user running the app:islandora_bagger:create_bag command
- the scripts are only executed if the Bag was successfully created
- the scripts are executed in the order they appear in the list
- you should always include the script's interpreter (php, python, etc.) and the full path to the script
- all scripts are passed three arguments: 1) the current node ID, 2) the Bag's output directory (or if the Bag was serialized, the path to the Bag file), and 3) the path to the YAML configuration file
- the results of the scripts are logged, including their exit codes
In the YAML configuration file, you can define any options needed by your scripts. For example, if your script /opt/utils/send_bag_notice.py requires an email address to send its notice to, you can include that option's value in your configuration file, as long as the script can parse YAML files:
####################
# Post-Bag scripts #
####################
post_bag_scripts: ["python /opt/utils/send_bag_notice.py"]
recipient_email: preservation@example.ca
Then within your script, you would have access to the value of recipient_email. Within your scripts, you have access to all options used by Islandora Bagger's app:islandora_bagger:create_bag command, and you can define any additional options you need as long as they don't have the same key names as existing values.
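A post-Bag script along these lines might start as follows. This is a sketch under stated assumptions, not the send_bag_notice.py script itself: the helper names are hypothetical, and to stay within the standard library it uses a naive line scan that only finds simple top-level key: value options. A real script would more likely use a YAML parsing library.

```python
import sys


def parse_post_bag_args(argv):
    """Return (node_id, bag_path, config_path) from the three arguments
    that every post-Bag script is passed."""
    node_id, bag_path, config_path = argv[1:4]
    return node_id, bag_path, config_path


def read_top_level_option(config_text, key):
    """Naively look up a top-level scalar option in YAML text.

    Only handles unindented 'key: value' lines; returns None if absent.
    """
    for line in config_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#") or line[:1].isspace():
            continue  # skip blanks, comments, and nested (indented) lines
        name, sep, value = stripped.partition(":")
        if sep and name.strip() == key:
            return value.strip().strip("'\"")
    return None


def main(argv):
    node_id, bag_path, config_path = parse_post_bag_args(argv)
    with open(config_path, encoding="utf-8") as fh:
        recipient = read_top_level_option(fh.read(), "recipient_email")
    # Placeholder for the real notification logic.
    print(f"Would notify {recipient}: Bag for node {node_id} is at {bag_path}")


if __name__ == "__main__" and len(sys.argv) >= 4:
    main(sys.argv)
```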
- Add more error and exception handling.
- Add more logging.
- Add tests.
See CONTRIBUTING.md.