Adds metadata source specification #484
base: main
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,138 @@ | ||
Metadata source specification | ||
***************************** | ||
|
||
This metadata source specification defines how to structure a collection of metadata records | ||
that together form the source material for a ``datalad-catalog`` catalog instance. | ||
|
||
The specification benefits both users and developers in that it separates metadata formats | ||
from the tooling that processes them: | ||
|
||
- users can create and maintain such specification-compliant metadata collections without | ||
having to employ ``datalad-catalog`` tooling | ||
- both generic and format-specific tooling can be developed and deployed, either as part of | ||
``datalad-catalog`` or as custom extensions, to transform specification-compliant metadata | ||
collections into a state renderable by a catalog | ||
|
||
|
||
High-level design | ||
================= | ||
|
||
The metadata source specification supports: | ||
|
||
1. **Per-catalog versioned customizations**: the top-level functional unit of the source | ||
specification is a catalog instance, which can be customized via a versioned configuration | ||
file as defined in the section :doc:`catalog_config`. This means a specification-compliant | ||
collection of records can specify the (version-specific) "look and feel" of a catalog, | ||
in addition to its displayed content. | ||
2. **Multi-dataset, multi-version records**: the source specification has a filesystem layout | ||
with a directory for each unique dataset identifier, which in turn has a subdirectory for | ||
each unique version identifier of a given dataset. This ensures a modular setup within which | ||
records for multiple versions of the same dataset can coexist. | ||
3. **Multi-format metadata records**: the specification places no restrictions on the number | ||
and type of metadata records in a collection for a given dataset version, since in reality | ||
metadata often originate from a variety of sources and exist in a variety of formats. | ||
The transformation of different record formats into ``datalad-catalog``-compatible records | ||
is conveniently shifted into the tooling domain, and is not part of the specification itself. | ||
|
||
|
||
The specification | ||
================= | ||
|
||
The following filesystem layout and record naming scheme should be adhered to for | ||
a given collection of records: | ||
|
||
.. code-block:: | ||
|
||
   . | ||
   ├── config/ | ||
   │   └── <config-version-id>/ | ||
   │       └── config.json | ||
Comment on lines +47 to +49
Another point about the config: it can also include a logo path (specified relative to the location of the config, within the context of the environment running the |
||
   └── records/ | ||
       └── <dataset-id>/ | ||
           ├── config.json | ||
           └── <dataset-version-id>/ | ||
               └── <format-id> | ||
|
||
|
||
``config/`` | ||
----------- | ||
|
||
This directory should contain the catalog-level configuration file(s), one per version, | ||
with the name ``config.json``. | ||
Comment on lines +60 to +61
Technically, |
||
|
||
``<config-version-id>`` | ||
----------------------- | ||
|
||
This directory name specifies the version of the configuration file, | ||
and should have a unique string value. | ||
|
||
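The ``config/<config-version-id>/config.json`` layout described above could be read along these lines. This is only a sketch: the helper names ``list_config_versions`` and ``load_config`` are hypothetical and not part of ``datalad-catalog``, and the config version is passed in explicitly because the specification does not say how it should be selected.

```python
import json
from pathlib import Path


def list_config_versions(collection_root):
    """Return the names of all <config-version-id> directories holding a config.json."""
    config_dir = Path(collection_root) / "config"
    if not config_dir.is_dir():
        return []
    return sorted(
        child.name
        for child in config_dir.iterdir()
        if child.is_dir() and (child / "config.json").is_file()
    )


def load_config(collection_root, version_id):
    """Load config/<version_id>/config.json as a dictionary."""
    # Assumption: the caller chooses the version; the specification does not
    # define how a pipeline should pick one.
    config_file = Path(collection_root) / "config" / version_id / "config.json"
    with config_file.open(encoding="utf-8") as f:
        return json.load(f)
```

A caller might first list the available versions and then load one of them by name.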
``records/`` | ||
------------ | ||
|
||
All metadata records for all versions of all datasets should be placed in the appropriate | ||
relative location within this directory. | ||
|
||
|
||
``<dataset-id>/`` | ||
----------------- | ||
|
||
All metadata records for all versions of *a specific dataset* should be placed in this | ||
directory. ``<dataset-id>`` should be a unique string identifying the dataset, avoiding | ||
white space and special characters. | ||
|
||
|
||
``<dataset-version-id>/`` | ||
------------------------- | ||
|
||
All metadata records for *a specific version* of *a specific dataset* should be placed | ||
in this directory. ``<dataset-version-id>`` should be a unique string identifying the version, | ||
avoiding white space and special characters. | ||
|
||
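The requirement that ``<dataset-id>`` and ``<dataset-version-id>`` avoid white space and special characters could be checked with a small helper. The specification does not define which characters count as "special", so the allowed set below (ASCII letters, digits, dot, underscore, hyphen) is an assumption, and ``is_valid_id`` is a hypothetical name.

```python
import re

# Assumption: "special characters" means anything other than ASCII letters,
# digits, dot, underscore and hyphen. The specification does not define this set.
_SAFE_ID = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


def is_valid_id(identifier):
    """Return True if the identifier has no white space or (assumed) special characters."""
    return _SAFE_ID.fullmatch(identifier) is not None
```

Under this assumption, identifiers such as ``myDatasetA`` and ``v0.1.2`` pass, while ``my dataset`` and ``a/b`` do not.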
``<format-id>`` | ||
--------------- | ||
|
||
This should be a unique filename of a single record, with identifying characters that | ||
can be parsed in order to match the specific file format with a specific reader or processing | ||
tool. There is no restriction on the number of files contained in a given ``<dataset-version-id>`` | ||
directory, as long as each filename is unique. | ||
Comment on lines +94 to +97
It just occurred to me that it might not always be individual files, e.g. a tabby collection might be included here as a directory containing all the related tabby files? |
||
|
||
|
||
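Matching a ``<format-id>`` to a reader could work roughly as follows. This is a sketch under stated assumptions: the filename patterns and reader labels in ``READERS`` are purely illustrative and are defined neither by the specification nor by ``datalad-catalog``; ``match_reader`` is a hypothetical helper.

```python
from fnmatch import fnmatchcase
from pathlib import Path

# Hypothetical mapping from filename patterns to reader labels; neither the
# patterns nor the labels come from the specification or from datalad-catalog.
READERS = [
    ("datacite.json", "datacite"),
    ("studyminimeta.yaml", "studyminimeta"),
    ("dataset_description.json", "bids_dataset_description"),
    ("*.cff", "citation_file_format"),
    ("tabby*", "tabby"),
]


def match_reader(record_path):
    """Return the label of the first reader whose pattern matches the record name, else None."""
    name = Path(record_path).name
    for pattern, reader in READERS:
        # fnmatchcase compares case-sensitively on every platform.
        if fnmatchcase(name, pattern):
            return reader
    return None
```

Because only the final path component is inspected, the same lookup would also apply to a record that is a directory (for example one named ``tabby``), should the specification allow that.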
An example | ||
========== | ||
|
||
This is an example record collection: | ||
|
||
.. code-block:: | ||
|
||
   . | ||
   ├── config/ | ||
   │   ├── v1/ | ||
   │   │   └── config.json | ||
   │   └── v2/ | ||
   │       └── config.json | ||
   └── records/ | ||
       ├── myDatasetA/ | ||
       │   ├── v0.1.1/ | ||
       │   │   └── datacite.json | ||
       │   └── v0.1.2/ | ||
       │       ├── studyminimeta.yaml | ||
       │       └── datacite.json | ||
       └── myDatasetB/ | ||
           ├── config.json | ||
           └── latest/ | ||
               ├── dataset_description.json | ||
               ├── tabby.tsv | ||
               ├── data-package.json | ||
               ├── LICENSE | ||
               └── citations.cff | ||
|
||
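For a collection laid out like the example above, the records could be enumerated with a short directory walk. This is a minimal sketch, not ``datalad-catalog`` tooling: ``iter_records`` is a hypothetical helper, and treating every direct child of a version directory (file or directory) as one record entry is an assumption.

```python
from pathlib import Path


def iter_records(collection_root):
    """Yield (dataset_id, version_id, entry_path) for every record entry in the collection."""
    records_dir = Path(collection_root) / "records"
    for dataset_dir in sorted(p for p in records_dir.iterdir() if p.is_dir()):
        # Files directly inside a dataset directory (such as a dataset-level
        # config.json) are not version directories and are skipped here.
        for version_dir in sorted(p for p in dataset_dir.iterdir() if p.is_dir()):
            # Assumption: each direct child of a version directory is one
            # record entry, regardless of whether it is a file or a directory.
            for entry in sorted(version_dir.iterdir()):
                yield dataset_dir.name, version_dir.name, entry
```

Each yielded triple identifies the dataset, the dataset version, and the path of one record entry, which downstream tooling could then hand to a format-specific reader.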
|
||
.. note:: | ||
|
||
**TO DO**: Construct and point to an actual specification-compliant collection of records | ||
|
||
|
||
.. note:: | ||
|
||
**TO DO**: Point to the toolset description of how such a collection can be transformed | ||
into a set of ``datalad-catalog``-compatible records |
One thing I'm uncertain about here, wrt versioned configs, is how the ingestion pipeline will know which config version to use to create the catalog entries. It will have to be parameterized somehow, but ideally the agent that created the metadata collection should be the one to specify which config version to use. I.e. that argument should be part of the collection somehow?