bdbag: Configuration Guide

Some components of the bdbag software are configured via JSON-formatted configuration files.

There are two global configuration files: bdbag.json and keychain.json. Skeleton versions of these files with simple default values are automatically created in the current user's home directory the first time a bag is created or opened.

Additionally, three JSON-formatted configuration files can be passed as arguments to bdbag in order to supply input for certain bag creation and update functions. These files are known as bag-info metadata, ro metadata, and remote-file-manifest configurations.

bdbag.json

The file bdbag.json is a global configuration file that allows the user to specify a set of parameters to be used as defaults when performing various bag manipulation functions.

The format of bdbag.json is a single JSON object containing a set of JSON child objects (used as configuration sub-sections) which control various default behaviors of the software.

Object: root

This is the parent object for the entire configuration.

| Parameter | Description |
| --- | --- |
| bdbag_config_version | The version number of the configuration file. In general, it matches the release version number of bdbag. |
| bag_config | This object contains all bag-related configuration parameters. |
| fetch_config | This object contains all fetch-related configuration parameters. |
| resolver_config | This object contains all implementation-specific resolver configuration parameters. |
| identifier_resolvers | This is a global list of identifier "meta" resolvers. It can be overridden on a per-resolver basis via the individual configuration blocks for each resolver in resolver_config. |

Object: bag_config

This object contains all bag-related configuration parameters.

| Parameter | Description |
| --- | --- |
| bag_algorithms | This is an array of strings representing the default checksum algorithms to use for bag manifests, if not otherwise specified. Valid values are "md5", "sha1", "sha256", and "sha512". |
| bag_archiver | This is a string representing the default archiving format to use if not otherwise specified. Valid values are "zip", "tar", and "tgz". |
| bag_metadata | This is a list of simple JSON key-value pairs that will be written as-is to bag-info.txt. |
| bag_processes | This is a numeric value representing the default number of concurrent processes to use when calculating checksums. |
| bagit_spec_version | The version of the bagit specification that created bags will conform to. Valid values are "0.97" or "1.0". |
| bag_archive_idempotent | A boolean value indicating that idempotent mode should be used by default when creating and archiving new bags. |
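
The valid values listed above can be checked before a configuration is used. The following is a minimal sketch, assuming the bag_config object has already been loaded into a Python dict; check_bag_config is a hypothetical helper written for this guide and is not part of bdbag.

```python
# Valid values copied from the bag_config parameter descriptions.
VALID_ALGORITHMS = {"md5", "sha1", "sha256", "sha512"}
VALID_ARCHIVERS = {"zip", "tar", "tgz"}
VALID_SPEC_VERSIONS = {"0.97", "1.0"}


def check_bag_config(bag_config):
    """Return a list of problems found in a bag_config dict (empty if none).

    Hypothetical helper for illustration only; it is not part of bdbag.
    """
    problems = []
    for algorithm in bag_config.get("bag_algorithms", []):
        if algorithm not in VALID_ALGORITHMS:
            problems.append("unsupported checksum algorithm: %r" % (algorithm,))
    archiver = bag_config.get("bag_archiver")
    if archiver is not None and archiver not in VALID_ARCHIVERS:
        problems.append("unsupported archive format: %r" % (archiver,))
    version = bag_config.get("bagit_spec_version")
    if version is not None and version not in VALID_SPEC_VERSIONS:
        problems.append("unsupported bagit_spec_version: %r" % (version,))
    # bag-info.txt only supports string values, so flag anything else.
    for key, value in bag_config.get("bag_metadata", {}).items():
        if not isinstance(value, str):
            problems.append("bag_metadata[%r] is not a string" % (key,))
    return problems


sample_bag_config = {
    "bag_algorithms": ["md5", "sha256"],
    "bag_metadata": {"Contact-Name": "mdarcy"},
    "bag_processes": 1,
    "bagit_spec_version": "0.97",
}
print(check_bag_config(sample_bag_config))
```

An empty list means that none of the checked parameters fall outside the documented values; parameters the sketch does not know about are ignored.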
Object: fetch_config

The fetch_config object contains a set of child objects, each keyed by a transport protocol scheme and containing the configuration parameters for that scheme's transport handler.

Custom Transports

There is a default set of transport handlers installed with bdbag. In addition, bdbag supports externally implemented transport handlers that can be plugged in (i.e., declared as run-time imports) via the fetch_config configuration object in the bdbag.json config file. This requires developers to perform some integration tasks.

Custom Transports: Implementation

Developers should create a class deriving from bdbag.fetch.transports.base_transport.BaseFetchTransport and implement three required functions:

  • __init__(self, config, keychain, **kwargs): The class constructor. Derived classes should first call super(<derived class name>, self).__init__(config, keychain, **kwargs) which sets the config, keychain, and kwargs variables as class member variables with the same names.

  • fetch(self, url, output_path, **kwargs): This method should implement the logic required to transfer a file referenced by url to the local path referenced by output_path. The **kwargs argument is an extensible argument dictionary that the framework may populate with extra data. For example, an integer argument size may be present (if it can be found in fetch.txt for a given fetch entry), representing the expected size of the remote file in bytes.

  • cleanup(self): This method should implement any transport-specific release of resources. Note that this function is called only once per transport, at the end of an entire bag fetch, and not once per file.
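
The three methods above can be sketched as follows. This is an illustrative example only: the foo scheme, the FooTransport class, and its local-copy behavior are made up, and the base class is replaced by a small stand-in so that the sketch runs without bdbag installed. The return value of fetch is also an assumption, since this guide does not specify one.

```python
import os
import shutil
import tempfile
from urllib.parse import urlsplit


class BaseFetchTransport(object):
    """Stand-in for bdbag.fetch.transports.base_transport.BaseFetchTransport.

    Defined here only so the sketch runs without bdbag installed; a real
    handler would import the base class from bdbag instead.
    """

    def __init__(self, config, keychain, **kwargs):
        self.config = config
        self.keychain = keychain
        self.kwargs = kwargs


class FooTransport(BaseFetchTransport):
    """Hypothetical handler for a made-up 'foo' scheme that copies local files."""

    def __init__(self, config, keychain, **kwargs):
        super(FooTransport, self).__init__(config, keychain, **kwargs)
        self.files_fetched = 0

    def fetch(self, url, output_path, **kwargs):
        # Assumption: foo:///some/path refers to a file on the local disk.
        source_path = urlsplit(url).path
        parent = os.path.dirname(output_path)
        if parent:
            os.makedirs(parent, exist_ok=True)
        shutil.copyfile(source_path, output_path)
        # 'size' is only present when fetch.txt supplied a length for the entry.
        expected = kwargs.get("size")
        if expected is not None and os.path.getsize(output_path) != int(expected):
            raise IOError("size mismatch for %s" % url)
        self.files_fetched += 1
        # Assumption: returning the output path signals success.
        return output_path

    def cleanup(self):
        # Called once at the end of the whole bag fetch, not once per file.
        self.files_fetched = 0


# Small demonstration using a temporary directory.
work_dir = tempfile.mkdtemp()
source = os.path.join(work_dir, "source.txt")
with open(source, "w") as handle:
    handle.write("hello")
transport = FooTransport({"handler": "my.custom.FooTransport"}, None)
result = transport.fetch(
    "foo://" + source, os.path.join(work_dir, "data", "copy.txt"), size=5
)
fetched_before_cleanup = transport.files_fetched
transport.cleanup()
```

Keeping per-fetch state (here, a simple counter) on the instance works because the framework caches one handler instance for the whole bag fetch and calls cleanup only at the end.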

Custom Transports: Configuration

Configure the usage of the external transport via the fetch_config object of the bdbag.json configuration file. The fetch_config object consists of child configuration objects, each keyed by a lowercase string representing the URL protocol scheme that is being configured. When configuring an external handler, the following applies:

  • There is a single required top-level string parameter with the key name handler, which maps to the fully-qualified name of the class implementing the required methods of the bdbag.fetch.transports.base_transport.BaseFetchTransport base class. At runtime, the bdbag fetch framework attempts to load this class via the importlib.import_module machinery. If loading succeeds, the class is instantiated and the instance is cached for the duration of the bag fetch operation. Subsequently, whenever a URL with a protocol scheme matching that of the installed handler is encountered in fetch.txt, that handler's fetch method is invoked.

  • There is also an optional parameter, allow_keychain, which controls whether the bdbag keychain is passed to the handler code during the __init__ call. If allow_keychain is missing, or is set to a value that does not evaluate to the Python boolean True, the keychain argument passed to __init__ will be None. In general, if the custom handler code has its own mechanism for managing credentials, this parameter may be omitted. If the handler intends to use the bdbag keychain that is in context for the current user and fetch operation, this parameter must be present and evaluate to True.

  • The remainder of the protocol scheme handler configuration object can consist of any valid JSON; the entire object value assigned to the scheme key will be passed as the config parameter to the __init__ method of the custom handler.

For example, given the following fetch_config section:

{
    "fetch_config": {
        "s3": {
            "handler":"my.custom.S3Transport",
            "max_read_retries": 5,
            "read_chunk_size": 10485760,
            "read_timeout_seconds": 120
        },
        "foo": {
            "handler":"my.custom.FooTransport",
            "allow_keychain": true,
            "my_foo_complex_config": {
                "bar":[
                    "a","b","c"
                ],
                "baz":{
                    "xyz":123
                }
            }
        }
    }
}

For the scheme foo, the following object will be passed as the config parameter to the __init__ method of my.custom.FooTransport upon class instantiation:

{
    "handler":"my.custom.FooTransport",
    "allow_keychain": true,
    "my_foo_complex_config": {
        "bar":[
            "a","b","c"
        ],
        "baz":{
            "xyz":123
        }
    }
}
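
The loading behavior described in this section can be approximated with the sketch below. It is not bdbag's actual implementation: load_handler is a hypothetical function, the FooTransport class is a placeholder, and the sketch treats only a JSON true value as enabling allow_keychain, whereas the framework's own evaluation may accept more values. The module my.custom is registered by hand purely so the example can run without a real package on disk.

```python
import importlib
import sys
import types
from urllib.parse import urlsplit


def load_handler(url, fetch_config, keychain):
    """Instantiate the handler configured for the scheme of url, or return None.

    Hypothetical helper for illustration only; it is not part of bdbag.
    """
    scheme = urlsplit(url).scheme.lower()
    config = fetch_config.get(scheme)
    if not isinstance(config, dict) or "handler" not in config:
        return None
    # Split "package.module.ClassName" into module path and class name.
    module_name, _, class_name = config["handler"].rpartition(".")
    handler_class = getattr(importlib.import_module(module_name), class_name)
    # Simplification: only a literal True enables keychain propagation here.
    allowed = config.get("allow_keychain") is True
    # The entire scheme object is passed through as the config argument.
    return handler_class(config, keychain if allowed else None)


class FooTransport(object):
    """Placeholder handler that just records what it was given."""

    def __init__(self, config, keychain, **kwargs):
        self.config = config
        self.keychain = keychain


# Demo-only trick: register a fake "my.custom" module so import_module finds it.
package = types.ModuleType("my")
package.__path__ = []
module = types.ModuleType("my.custom")
module.FooTransport = FooTransport
package.custom = module
sys.modules["my"] = package
sys.modules["my.custom"] = module

fetch_config = {
    "foo": {
        "handler": "my.custom.FooTransport",
        "allow_keychain": True,
        "my_foo_complex_config": {"bar": ["a", "b", "c"], "baz": {"xyz": 123}},
    }
}
keychain = [{"uri": "foo://host/", "auth_type": "cookie"}]
handler = load_handler("foo://host/file.txt", fetch_config, keychain)
print(handler.config["my_foo_complex_config"]["baz"]["xyz"])
```

Because the whole scheme object is handed to __init__, a custom handler can read its own keys (such as my_foo_complex_config) without bdbag needing to know about them.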
Default Transports: Configuration

Currently, only the default http, https and s3 transport handlers have configuration objects that control their behavior.

| Parameter | Description |
| --- | --- |
| http | Configuration for the http fetch handler. |
| https | Configuration for the https fetch handler. |
| s3 | Configuration for the s3 fetch handler. |

Object: fetch_config:http

This object contains configuration parameters for the http fetch handler.

| Parameter | Description |
| --- | --- |
| session_config | Session configuration parameters for the requests HTTP client library. The parameters mainly control retry logic. |
| http_cookies | Configuration parameters for automatic loading and merging of HTTP cookie files. |
| allow_redirects | A boolean indicating whether redirects should automatically be followed. |
| redirect_status_codes | An array of integers representing the HTTP status codes used for determining redirection. Defaults to [301, 302, 303, 307, 308]. |

Object: fetch_config:http:session_config

Session configuration parameters for the requests HTTP client library. The parameters mainly control retry logic, which is provided by the urllib3 library that requests wraps. For more information, see the urllib3 documentation on retries.

| Parameter | Description |
| --- | --- |
| retry_backoff_factor | The exponential backoff factor for all retry attempts. Defaults to 1.0. |
| retry_connect | The number of connect attempts to retry. Defaults to 5. |
| retry_read | The number of read attempts to retry. Defaults to 5. |
| retry_status_forcelist | A list of HTTP response codes that will force an automatic retry. Defaults to [500, 502, 503, 504]. |

Object: fetch_config:http:http_cookies

Configuration parameters for automatic loading and merging of HTTP cookie files. These cookie files must follow the Mozilla/Netscape/cURL/Wget cookie file format.

| Parameter | Description |
| --- | --- |
| scan_for_cookie_files | A boolean value that enables or disables the cookie scan feature globally. Defaults to true (enabled). |
| search_paths | An array of base directory paths from which to recursively search with search_paths_filter for file_names to use as input. Defaults to the system-dependent expansion of ~. |
| search_paths_filter | An fnmatch.filter pattern that can be used to filter specific subdirectories of each path specified in search_paths. Defaults to .bdbag. |
| file_names | An array of input cookie filenames or fnmatch.filter patterns to match cookie filenames against. Defaults to ["*cookies.txt"]. |
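
How these four parameters interact can be pictured with a short sketch. The exact traversal that bdbag performs is not spelled out in this guide, so the version below is an assumption: it walks each search path, keeps only directories whose own name matches search_paths_filter, and collects files in those directories that match any file_names pattern. The function name find_cookie_files is hypothetical.

```python
import fnmatch
import os
import tempfile


def find_cookie_files(http_cookies):
    """Return cookie file paths selected by an http_cookies config dict.

    Hypothetical helper; the traversal rules are an assumption, not bdbag's code.
    """
    if not http_cookies.get("scan_for_cookie_files", True):
        return []
    search_paths = http_cookies.get("search_paths") or [os.path.expanduser("~")]
    dir_filter = http_cookies.get("search_paths_filter", ".bdbag")
    patterns = http_cookies.get("file_names", ["*cookies.txt"])
    found = set()
    for base in search_paths:
        for dirpath, _dirnames, filenames in os.walk(base):
            # Only look inside directories whose name matches the filter.
            if not fnmatch.fnmatch(os.path.basename(dirpath), dir_filter):
                continue
            for pattern in patterns:
                for name in fnmatch.filter(filenames, pattern):
                    found.add(os.path.join(dirpath, name))
    return sorted(found)


# Demonstration against a throwaway directory tree instead of the home directory.
scan_root = tempfile.mkdtemp()
for relative in (".bdbag/cookies.txt", ".bdbag/site-cookies.txt", "other/cookies.txt"):
    path = os.path.join(scan_root, *relative.split("/"))
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "w").close()

found_files = find_cookie_files({"search_paths": [scan_root]})
print(found_files)
```

In this demonstration the file under other/ is skipped because its directory does not match the default .bdbag filter.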
Object: fetch_config:https

This object contains configuration parameters for the https fetch handler. The https fetch handler configuration is identical to the http fetch handler configuration, with the following addition:

| Parameter | Description |
| --- | --- |
| bypass_ssl_cert_verification | Either a boolean value (true or false), or an array of string values consisting of URL patterns to be used in a simple substring match against the target URLs found in a bag's fetch.txt file. For example, "bypass_ssl_cert_verification": ["https://raw.githubusercontent.com/fair-research/bdbag/"] will match a fetch.txt entry with a URL of "https://raw.githubusercontent.com/fair-research/bdbag/master/test/test-data/test-http/test-fetch-http.txt". Defaults to false. |

NOTE:

Setting bypass_ssl_cert_verification to true is NOT RECOMMENDED, because it bypasses SSL certificate validation for ALL HTTPS requests. bdbag will then accept any TLS certificate presented by a remote server and will ignore hostname mismatches and expired certificates, which makes the application vulnerable to man-in-the-middle (MitM) attacks.
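
The two accepted forms of bypass_ssl_cert_verification can be illustrated with a small sketch of the substring match described above. The function name should_bypass_cert_verification is hypothetical and the code is not taken from bdbag.

```python
def should_bypass_cert_verification(url, setting=False):
    """Decide whether certificate verification would be skipped for url.

    Hypothetical helper mirroring the documented behavior: a boolean applies
    to every URL, while a list is treated as substring patterns.
    """
    if isinstance(setting, bool):
        return setting
    return any(pattern in url for pattern in setting)


target = (
    "https://raw.githubusercontent.com/fair-research/bdbag/"
    "master/test/test-data/test-http/test-fetch-http.txt"
)
patterns = ["https://raw.githubusercontent.com/fair-research/bdbag/"]
print(should_bypass_cert_verification(target, patterns))
print(should_bypass_cert_verification("https://example.org/file.txt", patterns))
```

Listing specific URL patterns keeps verification enabled for every other host, which is considerably safer than setting the parameter to true.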

Object: fetch_config:s3

This object contains configuration parameters for the s3 fetch handler.

| Parameter | Description |
| --- | --- |
| max_read_retries | Maximum number of socket read retries. Defaults to 5. |
| read_chunk_size | Number of bytes to consume per read attempt. Defaults to 10485760 bytes (10 MiB). |
| read_timeout_seconds | Timeout in seconds per read attempt. Defaults to 120. |

Object: resolver_config

This object contains all implementation-specific resolver configuration parameters, keyed by resolver scheme. The current default handler schemes are ark, minid, doi, and ga4ghdos. Each scheme can have an array of resolver configuration blocks, each of which can be mapped to a different resolver namespace prefix.

| Parameter | Description |
| --- | --- |
| handler | This is the fully-qualified Python class name of a class derived from bdbag.fetch.resolvers.base_resolver.BaseResolverHandler and implementing the required functions. The bdbag resolver code will attempt to locate and instantiate this class at runtime. |
| prefix | This is an optional parameter that maps the handler resolution to only instances that contain the specific prefix found in the identifier. |
| identifier_resolvers | This is the same parameter as the global identifier_resolvers array. If found at this level, it will override the global setting for this scheme/prefix combination. |
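
The relationship between scheme, prefix, and the two levels of identifier_resolvers can be sketched as below. This is an interpretation rather than bdbag's code: select_resolver_block is a hypothetical function, identifiers are assumed to look like scheme:prefix/suffix (with an optional leading slash, as in ark:/57799/...), and choosing the longest matching prefix is an assumption.

```python
def select_resolver_block(identifier, config):
    """Return (handler, identifier_resolvers) for an identifier.

    Hypothetical helper; the matching rules are assumptions made for
    illustration and may differ from what bdbag actually does.
    """
    scheme, _, rest = identifier.partition(":")
    rest = rest.lstrip("/")
    blocks = config.get("resolver_config", {}).get(scheme.lower(), [])
    best = None
    for block in blocks:
        prefix = block.get("prefix")
        if prefix and rest.startswith(prefix):
            # Prefer a prefixed block over an unprefixed one, longest prefix wins.
            if best is None or not best.get("prefix") or len(prefix) > len(best["prefix"]):
                best = block
        elif not prefix and best is None:
            # A block without a prefix acts as the fallback for the scheme.
            best = block
    if best is None:
        return None, []
    # A block-level list overrides the global identifier_resolvers list.
    resolvers = best.get("identifier_resolvers") or config.get("identifier_resolvers", [])
    return best.get("handler"), resolvers


demo_config = {
    "identifier_resolvers": ["n2t.net", "identifiers.org"],
    "resolver_config": {
        "ark": [
            {"prefix": None},
            {
                "handler": "bdbag.fetch.resolvers.ark_resolver.MinidResolverHandler",
                "identifier_resolvers": ["n2t.net"],
                "prefix": "57799",
            },
        ]
    },
}
print(select_resolver_block("ark:/57799/b9abc", demo_config))
print(select_resolver_block("ark:/12345/xyz", demo_config))
```

In the second call no prefixed block matches, so the unprefixed block is used and the global identifier_resolvers list applies.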

Below is a sample bdbag.json file:

{
  "bag_config": {
    "bag_algorithms": [
      "md5",
      "sha256"
    ],
    "bag_metadata": {
      "BagIt-Profile-Identifier": "https://raw.githubusercontent.com/fair-research/bdbag/master/profiles/bdbag-profile.json",
      "Contact-Name": "mdarcy",
      "Contact-Orcid": "0000-0003-2280-917X"
    },
    "bag_processes": 1,
    "bagit_spec_version": "0.97"
  },
  "bdbag_config_version": "1.5.0",
  "fetch_config": {
    "http": {
      "session_config": {
        "retry_backoff_factor": 1.0,
        "retry_connect": 5,
        "retry_read": 5,
        "retry_status_forcelist": [
          500,
          502,
          503,
          504
        ]
      },
      "http_cookies": {
        "file_names": [
            "*cookies.txt"
        ],
        "scan_for_cookie_files": true,
        "search_paths": [
            "/home/mdarcy"
        ],
        "search_paths_filter": ".bdbag"
      }
    },
    "s3": {
      "max_read_retries": 5,
      "read_chunk_size": 10485760,
      "read_timeout_seconds": 120
    }
  },
  "identifier_resolvers": [
    "n2t.net",
    "identifiers.org"
  ],
  "resolver_config": {
    "ark": [
      {
        "identifier_resolvers": [
          "n2t.net",
          "identifiers.org"
        ],
        "prefix": null
      },
      {
        "handler": "bdbag.fetch.resolvers.ark_resolver.MinidResolverHandler",
        "identifier_resolvers": [
          "n2t.net",
          "identifiers.org"
        ],
        "prefix": "57799"
      },
      {
        "handler": "bdbag.fetch.resolvers.ark_resolver.MinidResolverHandler",
        "identifier_resolvers": [
          "n2t.net",
          "identifiers.org"
        ],
        "prefix": "99999/fk4"
      }
    ],
    "doi": [
      {
        "handler": "bdbag.fetch.resolvers.doi_resolver.DOIResolverHandler",
        "identifier_resolvers": [
          "n2t.net",
          "identifiers.org"
        ],
        "prefix": "10.23725/"
      }
    ],
    "ga4ghdos": [
      {
        "handler": "bdbag.fetch.resolvers.dataguid_resolver.DataGUIDResolverHandler",
        "identifier_resolvers": [
          "n2t.net"
        ],
        "prefix": "dg.4503/"
      }
    ],
    "minid": [
      {
        "handler": "bdbag.fetch.resolvers.ark_resolver.MinidResolverHandler",
        "identifier_resolvers": [
          "n2t.net",
          "identifiers.org"
        ]
      }
    ]
  }
}

keychain.json

The file keychain.json is used to specify the authentication mechanisms and credentials for the various URLs that might be encountered while trying to resolve (download) the files listed in a bag's fetch.txt file.

The format of keychain.json is a JSON array containing a list of JSON objects, each of which specifies a set of parameters used to configure the authentication method and credentials to use for a specified base URL.

Parameters

| Parameter | Description |
| --- | --- |
| uri | This is the base URI used to specify when authentication should be used. When a URI reference is encountered in fetch.txt, an attempt will be made to match it against all base URIs specified in keychain.json and if a match is found, the request will be authenticated before file retrieval is attempted. |
| auth_uri | This is the authentication URI used to establish an authenticated session for the specified uri. This is currently assumed to be an HTTP(s) protocol URL. |
| auth_type | This is the authentication type used by the server specified by uri or auth_uri (if present). |
| auth_params | This is a child object containing authentication-type specific parameters used in session establishment. It will generally contain credential information such as a username and password, a cookie value, or client certificate parameters. It can also contain other parameters required for authentication with the given auth_type mechanism; for example the HTTP method (i.e., GET or POST) to use with HTTP Basic Auth. |

Below is a sample keychain.json file:

[
    {
        "uri": "https://some.host.com/somefiles/",
        "auth_uri": "https://some.host.com/authenticate",
        "auth_type": "http-form",
        "auth_params": {
            "username": "me",
            "password": "mypassword",
            "username_field": "username",
            "password_field": "password"
        }
    },
    {
        "uri": "https://some.host.com/somefiles/",
        "auth_uri": "https://some.host.com/authenticate",
        "auth_type": "http-basic",
        "auth_params": {
            "auth_method":"POST",
            "username": "me",
            "password": "mypassword"
        }
    },
    {
        "uri": "https://some.host.com/somefiles/",
        "auth_type": "cookie",
        "auth_params": {
            "cookies": [ "a_cookie_name=zxyfw1231_secret"]
        }
    },
    {
        "uri": "https://some.host.com/somefiles/",
        "auth_type": "bearer-token",
        "auth_params": {
            "token": "<token>",
            "allow_redirects_with_token": true
        }
    },
    {
        "uri": "ftp://some.host.com/somefiles/",
        "auth_type": "ftp-basic",
        "auth_params": {
            "username": "anonymous",
            "password": "bdbag@users.noreply.github.com"
        }
    },
    {
        "uri": "s3://mybucket",
        "auth_type": "aws-credentials",
        "auth_params": {
            "key": "foo",
            "secret": "bar"
        }
    }, 
    {
        "uri": "gs://gcs-bdbag-integration-testing/",
        "auth_type": "gcs-credentials",
        "auth_params": {
            "project_id": "bdbag-204999", 
            "allow_requester_pays": true
        }
    },
    {
        "uri": "gs://bdbag-dev/",
        "auth_type": "gcs-credentials",
        "auth_params": {
            "project_id": "bdbag-204999",
            "service_account_credentials_file": "/home/bdbag/bdbag-204400-41babdd46e24.json"
        }
    },
    {
        "uri": "globus://my_endpoint/my_files/",
        "auth_type": "globus_transfer",
        "auth_params": {
            "local_endpoint": "b06c5a10-0b17-11e7-a73f-22000bf2d559",
            "transfer_token": "AQBXNMizAAAAAAADPIg9SoyPk_dm0BOFcWT7pe-52fQKv2Je6zi-hEvJ5xkfXw8rLaL9mVg8RtOY-vy4qrQd"
        }
    }
]
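
The base URI matching described in the parameter table can be pictured with the sketch below. It assumes that a keychain entry applies when the fetch URL starts with the entry's uri, compared case-insensitively; that rule and the function name find_keychain_entries are assumptions for illustration, not bdbag's documented behavior.

```python
def find_keychain_entries(url, keychain):
    """Return the keychain entries whose base uri is a prefix of url.

    Hypothetical helper; prefix matching is an assumption about how the
    base URI comparison works.
    """
    matches = []
    for entry in keychain:
        base_uri = entry.get("uri")
        if base_uri and url.lower().startswith(base_uri.lower()):
            matches.append(entry)
    return matches


# Entries adapted from the sample keychain.json above (placeholder credentials).
demo_keychain = [
    {
        "uri": "https://some.host.com/somefiles/",
        "auth_type": "cookie",
        "auth_params": {"cookies": ["a_cookie_name=zxyfw1231_secret"]},
    },
    {
        "uri": "s3://mybucket",
        "auth_type": "aws-credentials",
        "auth_params": {"key": "foo", "secret": "bar"},
    },
]

for match in find_keychain_entries("s3://mybucket/path/to/object", demo_keychain):
    print(match["auth_type"])
```

A URL that matches no entry yields an empty list, in which case the request would presumably be attempted without authentication.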

remote-file-manifest

A remote-file-manifest configuration file is used by bdbag during bag creation and update as a way to include files in a bag that are not necessarily present on the local system, and therefore cannot be hashed. The file is processed by bdbag and the data is used to generate both payload manifest entries and fetch.txt entries in the resulting bag.

The remote-file-manifest is structured as a JSON array containing a list of JSON objects that have the following attributes:

  • url: The url where the file can be located or dereferenced from. This value MUST be present.
  • length: The length of the file in bytes. This value MUST be present.
  • filename: The filename (or path), relative to the bag 'data' directory as it will be referenced in the bag manifest(s) and fetch.txt files. This value MUST be present.
  • AT LEAST one (and ONLY one of each) of the following algorithm:checksum key-value pairs:
    • md5:<md5 hex value>
    • sha1:<sha1 hex value>
    • sha256:<sha256 hex value>
    • sha512:<sha512 hex value>
  • Other legal JSON keys and values of arbitrary complexity MAY be included, as long as the basic requirements of the structure (as described above) are fulfilled.

Below is a sample remote-file-manifest configuration file:

[
    {
        "url":"https://raw.githubusercontent.com/fair-research/bdbag/master/profiles/bdbag-profile.json",
        "length":699,
        "filename":"bdbag-profile.json",
        "sha256":"eb42cbc9682e953a03fe83c5297093d95eec045e814517a4e891437b9b993139"
    },
    {
        "url":"ark:/88120/r8059v",
        "length": 632860,
        "filename": "minid_v0.1_Nov_2015.pdf",
        "sha256": "cacc1abf711425d3c554277a5989df269cefaa906d27f1aaa72205d30224ed5f"
    }
]
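
To show how one remote-file-manifest entry relates to the bag files it feeds, the sketch below validates an entry against the requirements listed above and derives whitespace-separated fetch.txt and payload manifest lines from it. The function names are hypothetical, and the sketch does not handle the percent-encoding that the BagIt specification requires for unusual characters in file paths.

```python
CHECKSUM_KEYS = ("md5", "sha1", "sha256", "sha512")


def check_entry(entry):
    """Raise ValueError if a remote-file-manifest entry lacks required keys."""
    missing = [key for key in ("url", "length", "filename") if key not in entry]
    if missing:
        raise ValueError("missing required key(s): " + ", ".join(missing))
    if not any(key in entry for key in CHECKSUM_KEYS):
        raise ValueError("at least one checksum key is required")


def to_bag_lines(entry):
    """Return (fetch_line, manifest_lines) derived from one entry.

    Hypothetical helper for illustration only; it is not part of bdbag.
    """
    check_entry(entry)
    # Payload paths are relative to the bag root, under the data directory.
    payload_path = "data/" + entry["filename"]
    fetch_line = "%s %s %s" % (entry["url"], entry["length"], payload_path)
    manifest_lines = {
        key: "%s %s" % (entry[key], payload_path)
        for key in CHECKSUM_KEYS
        if key in entry
    }
    return fetch_line, manifest_lines


demo_entry = {
    "url": "https://raw.githubusercontent.com/fair-research/bdbag/master/profiles/bdbag-profile.json",
    "length": 699,
    "filename": "bdbag-profile.json",
    "sha256": "eb42cbc9682e953a03fe83c5297093d95eec045e814517a4e891437b9b993139",
}
fetch_line, manifest_lines = to_bag_lines(demo_entry)
print(fetch_line)
print(manifest_lines["sha256"])
```

Each checksum key present in the entry would contribute a line to the corresponding manifest file (for example, a sha256 value to the sha256 payload manifest).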

bag-info metadata

A bag-info metadata configuration file consists of a single JSON object containing a set of JSON key-value pairs that will be written as-is to the bag's bag-info.txt file. NOTE: per the bagit specification, strings are the only supported value type in bag-info.txt.

Below is a sample bag-info metadata configuration file:

{
    "BagIt-Profile-Identifier": "https://raw.githubusercontent.com/fair-research/bdbag/master/profiles/bdbag-profile.json",
    "External-Description": "Simple bdbag test",
    "Arbitrary-Metadata-Field": "This is completely arbitrary"
}
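
Since only string values are supported, it can be useful to check a metadata object before passing it to bdbag. The sketch below is a hypothetical helper that rejects non-string values and renders the pairs as label and value lines; the exact formatting bdbag applies when writing bag-info.txt (line wrapping, ordering) is not covered here.

```python
def bag_info_lines(metadata):
    """Render a bag-info metadata dict as 'Label: value' lines.

    Hypothetical helper for illustration only; it is not part of bdbag.
    """
    lines = []
    for label, value in metadata.items():
        if not isinstance(value, str):
            raise TypeError("value for %r must be a string" % (label,))
        lines.append("%s: %s" % (label, value))
    return lines


demo_metadata = {
    "External-Description": "Simple bdbag test",
    "Arbitrary-Metadata-Field": "This is completely arbitrary",
}
for line in bag_info_lines(demo_metadata):
    print(line)
```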

ro metadata

A Research Object metadata configuration file consists of a single JSON object containing a set of JSON key-object pairs, where the key is a /-delimited relative file path and the object is any arbitrarily complex JSON content. This format allows bdbag to process all RO metadata as an aggregation which can then be serialized into individual JSON file components relative to the bag's metadata directory.

NOTE: while this documentation refers to this configuration file as a ro metadata file, the contents of this configuration file only have to conform to the bagit-ro conventions if bagit-ro compatibility is the goal. Otherwise, this mechanism can be used as a generic way to create any number of arbitrary JSON (or JSON-LD) metadata files as bagit tagfiles.

Below is a sample ro metadata configuration file:

{
  "manifest.json": {
    "@context": [ "https://w3id.org/bundle/context" ],
    "@id": "../",
    "createdOn": "2018-02-08T12:23:00Z",
    "aggregates": [
      { "uri": "../data/CTD_chem_gene_ixn_types.csv",
        "mediatype": "text/csv"
      },
      { "uri": "../data/CTD_chemicals.csv",
        "mediatype": "text/csv"
      },
      { "uri": "../data/CTD_pathways.csv",
        "mediatype": "text/csv"
      }
    ],
    "annotations": [
      { "about": "../data/CTD_chem_gene_ixn_types.csv",
        "content": "annotations/CTD_chem_gene_ixn_types.csv.jsonld"
      }
    ]
  },
  "annotations/CTD_chem_gene_ixn_types.csv.jsonld": {
    "@context": {
      "schema": "http://schema.org/",
      "object": "schema:object",
      "TypeName": {
        "@type": "schema:name",
        "@id": "schema:name"
      },
      "Code": {
        "@type": "schema:code",
        "@id": "schema:code"
      },
      "Description": {
        "@type": "schema:description",
        "@id": "schema:description"
      },
      "ParentCode": {
        "@type": "schema:code",
        "@id": "schema:parentItem"
      },
      "results": {
        "@id": "schema:object",
        "@type": "schema:object",
        "@container": "@set"
      }
    }
  }
}
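
The serialization step described at the start of this section can be sketched as follows: each key becomes a file path under the bag's metadata directory and each value is written out as JSON. The function name write_ro_metadata is hypothetical, and the sketch performs no validation of the paths or of bagit-ro conformance.

```python
import json
import os
import tempfile


def write_ro_metadata(ro_metadata, bag_path):
    """Write each key/object pair to <bag_path>/metadata/<key> as JSON.

    Hypothetical helper for illustration only; it is not part of bdbag.
    """
    written = []
    for relative_path, content in ro_metadata.items():
        # Keys are /-delimited, so split them to build an OS-specific path.
        target = os.path.join(bag_path, "metadata", *relative_path.split("/"))
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "w") as handle:
            json.dump(content, handle, indent=2)
        written.append(target)
    return written


demo_ro_metadata = {
    "manifest.json": {"@id": "../", "aggregates": []},
    "annotations/example.jsonld": {"@context": {"schema": "http://schema.org/"}},
}
demo_bag = tempfile.mkdtemp()
written_files = write_ro_metadata(demo_ro_metadata, demo_bag)
print([os.path.relpath(path, demo_bag) for path in written_files])
```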