Adds metadata source specification #484

Open
wants to merge 1 commit into
base: main
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -66,6 +66,7 @@ Index
overview
installation
usage
metadata_source_spec
pipeline_description
metadata_formats
catalog_schema
138 changes: 138 additions & 0 deletions docs/source/metadata_source_spec.rst
@@ -0,0 +1,138 @@
Metadata source specification
*****************************

This metadata source specification defines how to structure a collection of metadata records
that together form the source material for a ``datalad-catalog`` catalog instance.

The specification benefits both users and developers because it separates metadata formats
from the tooling that processes them:

- users can create and maintain specification-compliant metadata collections without
  having to employ ``datalad-catalog`` tooling
- both generic and format-specific tooling can be developed and deployed, either as part of
  ``datalad-catalog`` or as custom extensions, to transform specification-compliant metadata
  collections into a state renderable by a catalog


High-level design
=================

The metadata source specification supports:

1. **Per-catalog versioned customizations**: the top-level functional unit of the source
   specification is a catalog instance, which can be customized via a versioned configuration
   file as defined in the section :doc:`catalog_config`. This means a specification-compliant
   collection of records can specify the (version-specific) "look and feel" of a catalog,
   in addition to its displayed content.
2. **Multi-dataset, multi-version records**: the source specification has a filesystem layout
   with a directory for each unique dataset identifier, which in turn has a subdirectory for
   each unique version identifier of a given dataset. This ensures a modular setup within which
   records for multiple versions of the same dataset can coexist.
3. **Multi-format metadata records**: the specification places no restrictions on the number
   and type of metadata records in a collection for a given dataset version, since in reality
   metadata often originate from a variety of sources and exist in a variety of formats.
   The transformation of different record formats into ``datalad-catalog``-compatible records
   is conveniently shifted into the tooling domain, and is not part of the specification itself.


The specification
=================

The following filesystem layout and record naming scheme should be adhered to for
a given collection of records:

.. code-block::

   .
   ├── config/
   │   └── <config-version-id>/
   │       └── config.json
Comment on lines +47 to +49
Member Author

One thing I'm uncertain about here, wrt versioned configs, is how the ingestion pipeline will know which config version to use to create the catalog entries. It will have to be parameterized somehow, but ideally the agent that created the metadata collection should be the one to specify which config version to use. I.e. that argument should be part of the collection somehow?

Comment on lines +47 to +49
Member Author

Another point about the config: it can also include a logo path (specified relative to the location of the config, within the context of the environment running the datalad-catalog code). For the purposes of the collection, this logo will have to be provided either as an image file in the collection itself (likely alongside the config.json file) or as a downloadable URL. Thoughts?

   └── records/
       └── <dataset-id>/
           ├── config.json
           └── <dataset-version-id>/
               └── <format-id>


``config/``
-----------

This directory should contain the catalog-level configuration file(s). Each version of the
configuration lives in its own ``<config-version-id>/`` subdirectory and is named ``config.json``.
Comment on lines +60 to +61
Member Author

Technically, datalad-catalog can also read YAML config files. Should we allow all possibilities (.json, .yml, .yaml), or just specify a single option?


``<config-version-id>``
-----------------------

This directory name specifies the version of the configuration file,
and should have a unique string value.

``records/``
------------

All metadata records for all versions of all datasets should be placed in the appropriate
relative location within this directory.


``<dataset-id>/``
-----------------

All metadata records for all versions of *a specific dataset* should be placed in this
directory. ``<dataset-id>`` should be a unique string identifying the dataset, avoiding
white space and special characters.


``<dataset-version-id>/``
-------------------------

All metadata records for *a specific version* of *a specific dataset* should be placed
in this directory. ``<dataset-version-id>`` should be a unique string identifying the version,
avoiding white space and special characters.
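
The identifier constraint described for ``<dataset-id>`` and ``<dataset-version-id>`` could be checked with something like the sketch below. The specification does not define which characters count as "special", so the allowed set here (ASCII letters, digits, ``.``, ``_`` and ``-``) and the helper name ``is_valid_id`` are assumptions made for illustration only.

```python
import re

# Assumption: the specification does not define "special characters"; this
# sketch allows only ASCII letters, digits, '.', '_' and '-'.
_ID_PATTERN = re.compile(r"[A-Za-z0-9._-]+")


def is_valid_id(identifier: str) -> bool:
    """Return True if the identifier could serve as a directory name here."""
    # "." and ".." match the pattern but are reserved path components.
    if identifier in {".", ".."}:
        return False
    # fullmatch rejects empty strings, white space, and path separators.
    return _ID_PATTERN.fullmatch(identifier) is not None
```

Under this assumed rule, identifiers such as ``myDatasetA`` or ``v0.1.1`` pass, while ``my dataset`` or ``a/b`` do not.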

``<format-id>``
---------------

This should be the unique filename of a single record, containing identifying characters that
can be parsed in order to match the file's format with a specific reader or processing
tool. There is no restriction on the number of files contained in a given ``<dataset-version-id>``
directory; each filename just needs to be unique within that directory.
Comment on lines +94 to +97
Member Author

It just occurred to me that it might not always be individual files, e.g. a tabby collection might be included here as a directory containing all the related tabby files?
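
To make the "match the file format with a reader" idea concrete, here is a minimal sketch of a name-based lookup. Everything in it is hypothetical: the ``KNOWN_RECORD_NAMES`` table, the reader labels, and the ``match_reader`` helper are not part of ``datalad-catalog`` and only illustrate one way tooling might interpret ``<format-id>``. Because it looks only at the final path component, it would work equally for a file or a directory entry.

```python
from pathlib import PurePath
from typing import Optional

# Hypothetical table mapping record names to reader labels; neither the names
# nor the labels are defined by the specification or by datalad-catalog.
KNOWN_RECORD_NAMES = {
    "datacite.json": "datacite",
    "studyminimeta.yaml": "studyminimeta",
    "dataset_description.json": "bids",
}


def match_reader(record_path: str) -> Optional[str]:
    """Return the reader label for a record path, or None if unrecognized."""
    # Only the last path component (the <format-id>) is used for matching.
    name = PurePath(record_path).name
    return KNOWN_RECORD_NAMES.get(name)
```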



An example
==========

This is an example record collection:

.. code-block::

   .
   ├── config/
   │   ├── v1/
   │   │   └── config.json
   │   └── v2/
   │       └── config.json
   └── records/
       ├── myDatasetA/
       │   ├── v0.1.1/
       │   │   └── datacite.json
       │   └── v0.1.2/
       │       ├── studyminimeta.yaml
       │       └── datacite.json
       └── myDatasetB/
           ├── config.json
           └── latest/
               ├── dataset_description.json
               ├── tabby.tsv
               ├── data-package.json
               ├── LICENSE
               └── citations.cff
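
A collection laid out like this could be traversed with a few lines of standard-library Python. The sketch below is not part of ``datalad-catalog``; the function name ``iter_records`` is hypothetical, and it assumes that every subdirectory of ``records/`` is a dataset and every subdirectory of a dataset is a version.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_records(collection_root: Path) -> Iterator[Tuple[str, str, Path]]:
    """Yield (dataset_id, version_id, record_path) for each record found."""
    records_dir = collection_root / "records"
    for dataset_dir in sorted(p for p in records_dir.iterdir() if p.is_dir()):
        # Files directly inside <dataset-id>/ (such as config.json) are not
        # version directories, so only subdirectories are visited.
        for version_dir in sorted(p for p in dataset_dir.iterdir() if p.is_dir()):
            # Each entry of a version directory is treated as one record.
            for record in sorted(version_dir.iterdir()):
                yield dataset_dir.name, version_dir.name, record
```

Each yielded tuple carries the dataset and version identifiers taken from the directory names, so downstream tooling would not need to parse them out of the record itself.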


.. note::

   **TO DO**: Construct and point to an actual specification-compliant collection of records


.. note::

   **TO DO**: Point to the toolset description of how such a collection can be transformed
   into a set of ``datalad-catalog``-compatible records
9 changes: 9 additions & 0 deletions docs/source/pipeline_description.rst
@@ -1,6 +1,15 @@
Pipeline Description
********************

.. warning::

   This section describes a functioning but outdated view of generating a catalog
   entry from a DataLad dataset using ``datalad-metalad`` extractors and
   ``datalad-catalog`` translators. This will soon be updated to suggest a
   metadata ingestion pipeline using the :doc:`metadata_source_spec` and
   dedicated toolset.


The DataLad ecosystem provides a complete set of free and open source tools
that, together, provide full control over dataset access and distribution,
version control, provenance tracking, metadata addition, extraction, and