This document is work in progress (and not yet a reference).
Persistent identifiers (PIDs) play an important role in achieving FAIR data. This text is about the different use cases for PIDs in NFDI4Cat and how each case is solved technically.
According to the architecture document, different repositories are considered for the NFDI4Cat data infra structure:
- Repo4Cat, the central global repository for publishing data and collaborating on data.
- Local repositories hosted at various institutions, e.g. BasCat or LARA
Since Researchers expect DOIs for their data publications, the data sets published in Repo4Cat should get a DOIs from DataCite as PID.
However, DOIs are not a good fit for earlier stages of research due to the metadata requirements and associated costs. Nevertheless researchers want to share research artifacts already in the early stage with collaborators and having PIDs already at this stage would be very useful. Therefore, NFDI4Cat will in addition to DOIs use a handle-based solution PID4Cat (see below) to address the need for PIDs in the early phase when data are shared in private with collaborators. Repo4Cat seeks to support this private sharing between collaborators.
NFDI4cat has selected Dataverse as software for Repo4Cat, the global data portal. Dataverse supports two types of PIDs: DOIs and handles. It is however not possible to use handles and DOIs at the same time in a single Dataverse installation.
To address this limitations, two possible solutions are discussed:
- [preferred] Use one Dataverse instance and PID4Cat-handles as primary PID. For data sets being published DOIs will be minted (and linked to the handle).
- Use two Dataverse instances, one for published data that is configured to use DOIs and another one for early-stage data that is configured to use PID4-Cat-handles.
For the handle-PIDs in Dataverse, it is suggested
-
To generate individual PIDs for all files in a dataset.
Configuration
- :DataFilePIDFormat = independent
-
To generate the ID-part via a procedure stored in the DB. What scheme is used does not matter much, but clashes with PID4Cat-handles or other Dataverse (BasCat) must be avoided. Via a stored procedure UUID4 (or UUID7) identifiers could be created. An alternative is to add a prefix to the numeric ID used by default. The numeric ID may be base32 encoded to shorten the ID length. In this case the start ID should maybe not set to 1 but to 1001 to make the numeric part stick out. For example:
- r4c-9OH (prefix and base32 encoded integer number)
- "9OH" is 10001 using base-32 alphabet
- r4c-10001 (prefix & integer number)
Preferred is the first scheme, because it is shorter.
It is suggested to not use RFC4648 base32 (letters A to Z and numbers 2 to 7) but another, more human-friendly alphabet, preferably z-base-32. The z-base-32 encoding avoids human reading problems by excluding 0, l, v, 2 (number 0 may be confused with letter o, letter l with number 1 or letter i, letter v is close to u or r especially in handwriting, also number 2 is close to z in handwriting). The z-base-32 alphabet is optimized for lowercase letters so should be represented in this form.
Configuration:
- :IdentifierGenerationStyle = storedProcGenerated
- r4c-9OH (prefix and base32 encoded integer number)
Related Dataverse documentation:
PIDs like the well known DOIs require a certain set of metadata (see DataCite Schema). DOIs are based on handles, which in principle allow to store additional data in the record. However, DOI metadata are stored in a different service and have to be requested via the DataCite API which is a different service than the handle-system used for resolving a DOI. The reason for DataCite`s approach is performance(?).
For NFDI4Cat a simpler system that reduces complexity is more attractive. Therefore, PID4Cat handles are similar to ePIC-handles in that they use the handle record itself to store metadata. This has the advantage that these metadata are available directly from the resolver.
PID4Cat comes with a well-defined schema for how and which metadata to store in the handle record. The schema is defined in LinkML and developed in this repo (NFDI4Cat/PID4Cat). The schema does only require a minimum of information; it does not disclose anything about the resource except its type. Moreover, it is not required to specify an owner/creator but only a curator. Therefore, PID4Cat-handles do not mandate to put resources into context if this is undesired for whatever reasons.
The PID4Cat schema is mapped to the handle record as follows:
Table Example record for a PID4Cat handle with the suffix "lik-dfi345", URL: https://hdl.handle.net/20.1000/lik-dfi345
The PID4CatRecord column contains the slot name of the LinkML model that is written to the respective line of the handle record; this column is not part of the handle record.
Index | Type | Timestamp | Data | PID4CatRecord |
---|---|---|---|---|
1 | URL | 2024-01-01 10:47:38Z | https://pid4cat.example.org/lik-dfi345 | landing pageURL |
2 | STATUS | 2024-02-19 13:40:02Z | REGISTERED | status |
3 | REC_VER | 2024-02-19 13:40:02Z | 20240219v0 | record_version |
4 | SCH_VER | 2024-01-01 10:47:38Z | 1.0.0 | pid_schema_version |
5 | RIGHTS | 2024-01-01 10:47:38Z | CC0-1.0 | dc_rights |
6 | 2024-01-01 10:47:38Z | datafuzzi@example.org | curation_contact | |
7 | IMFO | 2024-01-01 10:47:38Z | {json} | resource_info |
8 | RELATED | 2024-02-19 13:40:02Z | {json} | related_identifiers |
9 | CHANGES | 2024-02-19 13:40:02Z | {json} | change_log |
In a future version, the non-standard values in Type-column may be replaced by references to type declarations in a datatype registry (DTR). Such DTRs are still under development and not yet widely used.
Since PID4Cat is a linkML-model we have all tools at hand to create records or an API. For example, we can use the pydantic-model created from the PID4cat schema to create the json-objects for the PID record above, for example the resource_info json-object:
from linkml_runtime.dumpers import json_dumper
from pid4cat_model.datamodel import pid4cat_model_pydantic as p4c
pid1_ressource_info = p4c.ResourceInfo(
label="Resource label",
description="Resource description",
resource_category=p4c.ResourceCategory.SAMPLE,
rdf_url="https://example.org/resource/pid1",
rdf_type="TURTLE",
schema_url="https://example.org/resource_schema",
schema_type="XSD",
)
print(json_dumper.dumps(pid1_ressource_info, inject_type=False))
which will print the json-object to be stored under index 7 in the handle-record:
{
"label": "Resource label",
"description": "Resource description",
"resource_category": "SAMPLE",
"rdf_url": "https://example.org/resource",
"rdf_type": "TURTLE",
"schema_url": "https://example.org/schema",
"schema_type": "XSD"
}
The pydantic model can also be used to create web APIs with other Python-based tools like django-ninja or fastapi.
This text is based on the proposal "PID-suffix, Version 2" made in NFDI4Cat-openproject on 2023-12-07
Two handle-suffix schemes are supported:
- UUID4 (uuid pid4cat)
- Can be either supplied from the PID-applicant when minting a new PID or can be generated by the gateway (both ways should be possible). UUID7 should also be accepted.
- namespace_of_agent/local_ID (name-spaced pid4cat)
- The handle-gateway should only validate the uniqueness of
namespace_of_agent/local_ID
-combinations. It is suggested to use 3 characters from a human-friendly base32-alphabet as namespaces (e.g. z-base-32, see above). This gives 3**32 prefixes and if ever necessary we can switch to 4 chars. - The uniqueness of the
local_ID
has to be taken care of by the namespace owner. - The
local_ID
can be freely chosen by the namespace owner but is limited to ASCII letters [a-zA-Z], numbers [0-9] and symbols.-
. OK or too restrictive? - The request for becoming the owner of a new namespace should be made by email.
- The handle-gateway should only validate the uniqueness of
Examples for PID4Cat handles (non-resolvable):
-
UUID4 pid4cat:
https://hdl.handle.net/21.T11978/pid4cat/7e82d892-6acf-41a8-9c91-df826f67a806
-
name-spaced pid4cat in the example-namespace "
k3a
":https://hdl.handle.net/21.T11978/pid4cat/k3a/123-456
- The corresponding regex in Python syntax is:
re.compile(r"21\.T11978/pid4cat/(?P<namespace>k3a)/(?P<local_ID>\d{3}-\d{3})")
...which extracts "k3a" as group "namespace" and "123-456" as group "local_ID".
Note, the
pid4cat
-part in the above URLs/suffixes has no functional meaning. It is only present to make pid4cat-handles easily recognizable. We should consider registeringpid4cat
as a prefix on bioregistry.io just like we registered the voc4cat: prefix.
The role of the handle-server-gateway (HSG) is to restrict and manage write access to the handle-server and to add PID4Cat-specific validation for the handles. The HSG provides an API only. Create and updating PID4Cat-handles will exclusively managed via the HSG. Suggested URLs for the HSG are https://pid4cat.nfdi4cat.org or https://pid.nfdi4cat.org
Note: Dataverse will not use the HSG but interface the handle-server directly.
The HSG controls which users are allowed to create and update handle records. Except for sysadmins, HSG users will only be allowed to create and update handles for named-spaced PIDs. All PID name-spaces will have at least one name-space admin user. Institutions that like to manage a PID4Cat-PID-namespace need to register a PID-namespace admin with NFDI4Cat.
Suggested minimal API of the HSG:
[unedited copy from internal OpenProject]
Here is a tentative minimal API (to be discussed). API access is limited to special users "namespace-owners" (read-write), and "viewers" (read-only). Anonymous users have no access.
- Regarding permissions and the minimal API we like to get your feedback. Is this OK? Do you see gaps or need more functionality?
https://pid.nfdi4cat.org/<api-version>/<name-space> [GET,PUT]
https://pid.nfdi4cat.org/
is the base url of the gateway server.api-version
(e.g.v1.0.0
) is a version specifier for the API (it is recommended to version APIs)name-space
is the first part of the suffix of name-spaced PID.- Special notes on create [PUT]:
- HSG should set "record_version" automatically on create
- ...and some more validation & consistency checks
https://pid.nfdi4cat.org/<api-version>/<name-space>/<local-ID-of-pid> [GET,PUT]
where <name-space>/<local-ID-of-pid>
form the suffix of the handle to get or update. For the examples above it is k3a/123-456
and 7e82d892-6acf-41a8-9c91-df826f67a806
, respectively.
- Special notes on updates [PUT]:
- The data are supplied as PID4Cat-schema-conform json object in the body of the request.
- each handle record update needs an update of the changelog (append)
- HSG should set/increment "record_version" automatically on updates
- ...and some more validation & consistency checks
We could allow [DELETE] but we should not perform any true delete of a handle. Instead the record should be modified to state "OBSOLETE" (see PID4CatStatus enum in schema).
The API may be extended to make the information in the handle record more accessible, for example
- Routes to retrieve all PIDs by category [GET]
- Routes to retrieve all PIDs by status [GET]
For linked-data applications permanent IRIs are required. Several services exists that provide a stable base URL and a redirect service that can be configured to redirect to the actual resource location.
Such services should support serving different data to humans and to machines. One way to do this is via content-negotiation. Depending on the request that is send to the redirect service, the client is redirected to human-oriented representation of the resource of to a machine-oriented one.
NFDI4Cat has selected w3id.org for this service. Comparable services are purl.org or pida.org which work in the same way.
In principle, it would also be possible to use handle-based PIDs for the terms in ontologies. However, the additional features and the more reliable redundant network of the handle-system are of minor importance for this application.