Adds metadata source specification:
The source specification defines how to structure a collection
of metadata records that together form the source material for
a catalog instance. It separates metadata source files and formats
from tooling, ensuring that users can provide and maintain a
metadata collection without depending on datalad-catalog tools, while
providing a validated structure from which automated tools can generate
datalad-catalog-compatible records to be rendered.

This commit adds the specification as part of the project docs.
Future commits should update the 'Pipeline description' section of the
docs to suggest the use of tools that understand the metadata source
specification, and should also remove or update the 'Metadata formats'
section of the docs accordingly.
jsheunis committed Jul 23, 2024
1 parent c0ed29f commit 002c8f3
Showing 3 changed files with 148 additions and 0 deletions.
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -66,6 +66,7 @@ Index
overview
installation
usage
metadata_source_spec
pipeline_description
metadata_formats
catalog_schema
138 changes: 138 additions & 0 deletions docs/source/metadata_source_spec.rst
@@ -0,0 +1,138 @@
Metadata source specification
*****************************

This metadata source specification defines how to structure a collection of metadata records
that together form the source material for a ``datalad-catalog`` catalog instance.

The specification benefits both users and developers because it separates metadata formats
from the tooling that processes them:

- users can create and maintain such specification-compliant metadata collections without
  having to employ ``datalad-catalog`` tooling
- both generic and format-specific tooling can be developed and deployed, either as part of
  ``datalad-catalog`` or as custom extensions, to transform specification-compliant metadata
  collections into a state renderable by a catalog


High-level design
=================

The metadata source specification supports:

1. **Per-catalog versioned customizations**: the top-level functional unit of the source
   specification is a catalog instance, which can be customized via a versioned configuration
   file as defined in the section :doc:`catalog_config`. This means a specification-compliant
   collection of records can specify the (version-specific) "look and feel" of a catalog,
   in addition to its displayed content.
2. **Multi-dataset, multi-version records**: the source specification has a filesystem layout
   with a directory for each unique dataset identifier, which in turn has a subdirectory for
   each unique version identifier of a given dataset. This ensures a modular setup within which
   records for multiple versions of the same dataset can coexist.
3. **Multi-format metadata records**: the specification places no restrictions on the number
   and type of metadata records in a collection for a given dataset version, since in reality
   metadata often originate from a variety of sources and exist in a variety of formats.
   The transformation of different record formats into ``datalad-catalog``-compatible records
   is conveniently shifted into the tooling domain, and is not part of the specification itself.


The specification
=================

The following filesystem layout and record naming scheme should be adhered to for
a given collection of records:

.. code-block::

   .
   ├── config/
   │   └── <config-version-id>/
   │       └── config.json
   └── records/
       └── <dataset-id>/
           ├── config.json
           └── <dataset-version-id>/
               └── <format-id>

``config/``
-----------

This directory should contain the catalog-level configuration file(s), one per version,
with the name ``config.json``.

``<config-version-id>``
-----------------------

This directory name specifies the version of the configuration file,
and should have a unique string value.

``records/``
------------

All metadata records for all versions of all datasets should be placed in the appropriate
relative location within this directory.
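
The layout under ``records/`` can be traversed with a few lines of code. The following is
a minimal sketch, not part of ``datalad-catalog``: the names ``RecordFile`` and
``iter_records`` are hypothetical, and it assumes that every directory directly under
``records/`` is a dataset directory and every directory below that is a version directory.
Files that sit directly in a dataset directory (such as a dataset-level ``config.json``)
are not treated as records.

```python
from pathlib import Path
from typing import Iterator, NamedTuple


class RecordFile(NamedTuple):
    """One metadata record file and the identifiers taken from its location."""

    dataset_id: str
    version_id: str
    path: Path


def iter_records(collection_root: Path) -> Iterator[RecordFile]:
    """Yield every file found at records/<dataset-id>/<dataset-version-id>/."""
    records_dir = collection_root / "records"
    # Directories directly under records/ are assumed to be dataset directories.
    for dataset_dir in sorted(p for p in records_dir.iterdir() if p.is_dir()):
        # Only subdirectories count as versions; plain files at this level
        # (e.g. a dataset-level config.json) are skipped.
        for version_dir in sorted(p for p in dataset_dir.iterdir() if p.is_dir()):
            for record in sorted(p for p in version_dir.iterdir() if p.is_file()):
                yield RecordFile(dataset_dir.name, version_dir.name, record)
```

A tool could call ``iter_records(Path("path/to/collection"))`` and hand each yielded file
to a format-specific reader, keeping traversal and format handling separate.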


``<dataset-id>/``
-----------------

All metadata records for all versions of *a specific dataset* should be placed in this
directory. ``<dataset-id>`` should be a unique string identifying the dataset, avoiding
white space and special characters.
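
The specification does not enumerate which characters count as "special". As an
illustration only, the sketch below assumes a conservative allow-list of ASCII letters,
digits, dots, underscores and hyphens; the function name ``is_safe_id`` and the allow-list
itself are assumptions, not part of the specification. The same check could be applied to
``<dataset-version-id>`` values.

```python
import re

# Assumed allow-list: ASCII letters, digits, dot, underscore and hyphen.
# The specification only says to avoid white space and special characters.
_SAFE_ID = re.compile(r"[A-Za-z0-9._-]+")


def is_safe_id(identifier: str) -> bool:
    """Return True if the identifier is usable as a directory name under this assumption."""
    # "." and ".." match the pattern but are reserved path components.
    if identifier in {".", ".."}:
        return False
    return _SAFE_ID.fullmatch(identifier) is not None
```

Under this assumption, identifiers such as ``myDatasetA`` or ``v0.1.1`` pass, while an
identifier containing a space or a path separator is rejected.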


``<dataset-version-id>/``
-------------------------

All metadata records for *a specific version* of *a specific dataset* should be placed
in this directory. ``<dataset-version-id>`` should be a unique string identifying the version,
avoiding white space and special characters.

``<format-id>``
---------------

This should be the unique filename of a single record. The filename should contain identifying
characters that can be parsed in order to match the file's format with a specific reader or
processing tool. There is no restriction on the number of files contained in a given
``<dataset-version-id>`` directory, as long as each filename is unique within it.
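
One way a tool might match a record file to a reader is a lookup keyed on the filename.
The sketch below is a hypothetical illustration: the ``READERS`` registry, ``find_reader``
and ``read_json`` are invented names, and which filenames map to which readers is an
assumption, since the specification leaves this to tooling.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Optional

# A reader takes the path of a record file and returns its parsed content.
Reader = Callable[[Path], dict]


def read_json(path: Path) -> dict:
    """Parse a JSON record file into a dictionary."""
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


# Hypothetical registry: the filenames and their readers are assumptions.
READERS: Dict[str, Reader] = {
    "datacite.json": read_json,
    "dataset_description.json": read_json,
}


def find_reader(record: Path) -> Optional[Reader]:
    """Return the reader registered for this filename, or None if there is none."""
    return READERS.get(record.name)
```

Returning ``None`` for unregistered filenames lets a tool skip or report files it does not
understand instead of failing, which fits a collection that may hold records in any format.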


An example
==========

This is an example record collection:

.. code-block::

   .
   ├── config/
   │   ├── v1/
   │   │   └── config.json
   │   └── v2/
   │       └── config.json
   └── records/
       ├── myDatasetA/
       │   ├── v0.1.1/
       │   │   └── datacite.json
       │   └── v0.1.2/
       │       ├── studyminimeta.yaml
       │       └── datacite.json
       └── myDatasetB/
           ├── config.json
           └── latest/
               ├── dataset_description.json
               ├── tabby.tsv
               ├── data-package.json
               ├── LICENSE
               └── citations.cff

.. note::

   **TO DO**: Construct and point to an actual specification-compliant collection of records


.. note::

   **TO DO**: Point to the toolset description of how such a collection can be transformed
   into a set of ``datalad-catalog``-compatible records
9 changes: 9 additions & 0 deletions docs/source/pipeline_description.rst
@@ -1,6 +1,15 @@
Pipeline Description
********************

.. warning::

   This section describes a functioning but outdated view of generating a catalog
   entry from a DataLad dataset using ``datalad-metalad`` extractors and
   ``datalad-catalog`` translators. This will soon be updated to suggest a
   metadata ingestion pipeline using the :doc:`metadata_source_spec` and
   dedicated toolset.


The DataLad ecosystem provides a complete set of free and open source tools
that, together, provide full control over dataset access and distribution,
version control, provenance tracking, metadata addition, extraction, and