From 2b7de8b408c0bb701b56ae94f98135e19bcf7102 Mon Sep 17 00:00:00 2001
From: Balduin Landolt <33053745+BalduinLandolt@users.noreply.github.com>
Date: Mon, 4 Nov 2024 18:28:42 +0100
Subject: [PATCH 1/8] move old documentation for disambiguity
---
docs/data/{datamodel.md => current-datamodel.md} | 2 +-
docs/index.md | 2 +-
mkdocs.yml | 6 +++---
3 files changed, 5 insertions(+), 5 deletions(-)
rename docs/data/{datamodel.md => current-datamodel.md} (99%)
diff --git a/docs/data/datamodel.md b/docs/data/current-datamodel.md
similarity index 99%
rename from docs/data/datamodel.md
rename to docs/data/current-datamodel.md
index cd1c7c85..1372fb6a 100644
--- a/docs/data/datamodel.md
+++ b/docs/data/current-datamodel.md
@@ -1,4 +1,4 @@
-# Data Model
+# Current Data Model
All metadata are modelled according to the model as described in the following.
diff --git a/docs/index.md b/docs/index.md
index af6cb9f5..d5e22889 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -10,7 +10,7 @@ as well as the code of the [DSP Metadata Browser](https://meta.dasch.swiss).
If you are interested in viewing the metadata in human-readable form,
you can visit the [DSP Metadata Browser](https://meta.dasch.swiss).
-If you are interested in re-using our metadata, you can find extensive documentation [here](data/datamodel.md).
+If you are interested in re-using our metadata, you can find extensive documentation [here](data/current-datamodel.md).
The metadata itself can be found [here](https://github.com/dasch-swiss/dsp-meta/tree/main/data/json)
or requested over the API as described [here](data/api.md).
diff --git a/mkdocs.yml b/mkdocs.yml
index 27e3f62a..f5544ebf 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -4,7 +4,7 @@ nav:
- DSP-META: index.md
- Consuming Metadata:
- Metadata API: data/api.md
- - Data Model: data/datamodel.md
+ - Current Data Model: data/current-datamodel.md
- Adding Metadata: adding-metadata.md
- Code Documentation:
- Overview: code/overview.md
@@ -33,8 +33,8 @@ theme:
name: Switch to light mode
features:
- search.suggest
- - navigation.tabs
- - navigation.sections
+ # - navigation.tabs
+ # - navigation.sections
markdown_extensions:
- admonition
From 92ead29623a243c867a71879cdd126b89e9846b5 Mon Sep 17 00:00:00 2001
From: Balduin Landolt <33053745+BalduinLandolt@users.noreply.github.com>
Date: Tue, 5 Nov 2024 16:56:51 +0100
Subject: [PATCH 2/8] set up new model based on the old one
---
docs/data/provisional-datamodel.md | 601 +++++++++++++++++++++++++++++
mkdocs.yml | 1 +
2 files changed, 602 insertions(+)
create mode 100644 docs/data/provisional-datamodel.md
diff --git a/docs/data/provisional-datamodel.md b/docs/data/provisional-datamodel.md
new file mode 100644
index 00000000..b5218120
--- /dev/null
+++ b/docs/data/provisional-datamodel.md
@@ -0,0 +1,601 @@
+# Provisional Data Model
+
+!!! warning
+ This document does _not_ represent the current state of the metadata model.
+ It is a working document for planned upcoming changes to the metadata model.
+
+!!! note
+ This model is an idealized version of the metadata model.
+ With the current implementation that is entirely separate from the DSP,
+ it is not feasible to implement metadata on the record level.
+ Such a system may be implemented in the archive in the future,
+ but for now, we will keep the metadata on the dataset level.
+ A separate, simplified model for applying some of these changes,
+ while remaining compatible with the current implementation,
+ should be created alongside this model.
+
+## Overview
+
+The metadata model is a hierarchical structure of metadata elements.
+
+```mermaid
+
+flowchart TD
+ hyper-project[Hyper-Project /<br>Uber-Project /<br>Meta-Project /<br>Compound Project] -->|1-n| project[Project /<br>Research Project]
+ project -->|1-n| dataset[Dataset]
+ dataset -->|1-n| record[Record /<br>Resource]
+ project -->|0-n| collection[Collection]
+ collection --> collection
+ hyper-project -->|0-n| collection
+ collection --> record
+```
+
+- A `Compound Project` is optional and collects one or more `Research Projects`.
+ It is typically of institutional nature,
+ not directly tied to a specific funding grant,
+ and may be long-lived.
+ Examples are EKWS/CAS, BEOL or LIMC.
+- A `Research Project` is the main entity of the metadata model.
+ It corresponds to a `project` in the DSP.
+ It is typically tied to a specific funding grant,
+ and hence has a limited lifetime of ~3-5 years;
+ multiple funding rounds and a longer lifetime are possible.
+ A `Research Project` is part of 0-1 `Compound Projects`
+ and has 1-n `Datasets` and 0-n `Collections`.
+- A `Dataset` is a collection of `Records` within a `Research Project`.
+ It is mostly meant for system-internal and technical use,
+ and should not have particular semantics or a "historical meaning" in the context of the project.
+ A `Dataset` is part of exactly 1 `Research Project`
+ and contains 1-n `Records`.
+- A `Collection` is also a collection of `Records` within a `Research Project`.
+ It is meant for semantic grouping of `Records` within a `Research Project`,
+ and may have a "historical meaning" in the context of the project.
+ Examples may be physical collections such as a person's "Nachlass" in an archive,
+ or groupings of records based on a specific research question within a project.
+ A `Collection` is part of at least 1 `Research Project`, `Compound Project` or `Collection`,
+ but can be part of multiple. It may either contain 0-n `Collections` or 1-n `Records`.
+- A `Record` is a single resource within a `Dataset`.
+ It represents a single entity, and the smallest unit that can meaningfully have an identifier.
+ It maps to a `knora-base:Resource` (DSP-API) or an `Asset` (SIPI/Ingest) in the DSP.
+ A `Record` is part of exactly 1 `Dataset` and may be part of 0-n `Collections`.
+
+Additionally, there are the entities `Person` and `Organization`.
+They are independent of the `Research Project` hierarchy
+and may be related to various entities within the hierarchy.
+
+
+## Top Level
+
+A set of metadata consists of the following top-level elements:
+
+- Compound Project
+- Project
+- Dataset
+- Collection
+- Record
+- Person
+- Organization
+
+Each of these elements is an entity identified by a unique identifier.
+Other elements can refer to these entities by their identifier.
+
+Any other metadata element may itself be a complex object,
+but it is always part of one of the top-level elements.
+Such elements do not have an identifier,
+but are identified by their position in the hierarchy.
+
+| Field | Type | Cardinality |
+| ----------------- | --------------- | ----------- |
+| `$schema` | string | 0-1 |
+| `compoundProject` | compoundProject | 0-1 |
+| `project` | project | 1 |
+| `datasets` | dataset[] | 1-n |
+| `collections` | collection[] | 0-n |
+| `records` | record[] | 0-n |
+| `persons` | person[] | 0-n |
+| `organizations` | organization[] | 0-n |
+
+
+## Types
+
+### Entity Types
+
+#### Compound Project
+
+| Field | Type | Cardinality | Restrictions | Remarks |
+| ------------------------ | ------------------- | ----------- | ------------------------------------------------------------ | ------------------ |
+| `__type` | string | 1 | Literal 'CompoundProject' | |
+| `name` | string | 1 | | |
+| `url` | url | 1 | | |
+| `howToCite` | string | 1 | | Needed? |
+| `projects` | id[] | 1-n | String containing the identifier of a project | |
+| `description` | lang_string | 0-1 | | Optional? |
+| `contactPoint` | id | 0-1 | String containing the identifier of a person or organization | Optional? |
+| `keywords` | lang_string[] | 0-n | | Needed? |
+| `disciplines` | lang_string / url[] | 0-n | | Needed? |
+| `temporalCoverage` | lang_string / url[] | 0-n | | Needed? |
+| `spatialCoverage` | url[] | 0-n | | Needed? |
+| `funders` | id[] | 0-n | String containing the identifier of a person or organization | Needed? |
+| `publications` | publication[] | 0-n | | Needed? |
+| `grants` | grant[] | 0-n | | Needed? |
+| `alternativeNames` | lang_string[] | 0-n | | Needed? |
+| `consistingInstitutions` | id[] | 0-n | String containing the identifier of an organization | Makes sense? Name? |
+
+!!! question
+ This raises the question of how to deal with multiple projects in a compound project.
+ We probably want to keep one entry per project,
+ so this leaves us with either duplicating the compound project metadata for each project,
+ or having compound project metadata separately and only linking it from the project.
+ The latter seems preferable,
+ but then the question arises who gets to edit the compound project metadata.
+ For a first implementation, we could simply duplicate the metadata for each project,
+ and later factor it out.
+
+!!! important
+ The properties for `Compound Project` were invented by me on the fly.
+ That does not mean they are correct or useful.
+
+
+#### Project
+
+| Field | Type | Cardinality | Restrictions | Remarks |
+| -------------------- | ------------------- | ----------- | ------------------------------------------------------------ | --------------------- |
+| `__type` | string | 1 | Literal "Project" | |
+| `shortcode` | string | 1 | 4 char hexadecimal | |
+| `status` | string | 1 | Literal "Ongoing" or "Finished" | |
+| `name` | string | 1 | | |
+| `description` | lang_string | 1 | | |
+| `startDate` | date | 1 | String of format "YYYY-MM-DD" | |
+| `teaserText` | string | 1 | | |
+| `url` | url | 1 | | |
+| `howToCite` | string | 1 | | |
+| `datasets` | id[] | 1-n | String containing the identifier of a dataset | |
+| `keywords` | lang_string[] | 1-n | | |
+| `disciplines` | lang_string / url[] | 1-n | | |
+| `temporalCoverage` | lang_string / url[] | 1-n | | |
+| `spatialCoverage` | url[] | 1-n | | |
+| `funders` | id[] | 1-n | String containing the identifier of a person or organization | Does this make sense? |
+| `endDate` | date | 0-1 | String of format "YYYY-MM-DD" | |
+| `secondaryURL` | url | 0-1 | | |
+| `dataManagementPlan` | dmp | 0-1 | | |
+| `contactPoint` | id | 0-1 | String containing the identifier of a person or organization | |
+| `publications` | publication[] | 0-n | | |
+| `grants` | grant[] | 0-n | | Does this make sense? |
+| `alternativeNames` | lang_string[] | 0-n | | |
+
+!!! question
+ If we can have copyright/license on dataset level,
+ do we want to have it on project level as well?
+
+!!! question
+ Do we still need funders if we have grants?
+
+!!! question
+ What about projects that do not have funding?
+
+
+#### Dataset
+
+| Field | Type | Cardinality | Restrictions | Remarks |
+| ------------------- | ----------------- | ----------- | ------------------------------------------------------- | ----------------------------------- |
+| `__id` | string | 1 | | |
+| `__type` | string | 1 | Literal "Dataset" | |
+| `title` | string | 1 | | |
+| `accessConditions` | string | 1 | Literal "open", "restricted" or "closed" | change to proper terms |
+| `howToCite` | string | 1 | | |
+| `status` | string | 1 | Literal "In Planning", "Ongoing", "On hold", "Finished" | not aligned with project status |
+| `abstract` | lang_string / url | 1-n | | naming: maybe 'description'? |
+| `typeOfData` | string[] | 1-n | Literal "XML", "Text", "Image", "Video", "Audio" | does this still make sense? |
+| `licenses` | license[] | 1-n | | should be computed from the records |
+| `copyright` | string[] | 1-n | | computed along with license |
+| `languages` | lang_string[] | 1-n | | does this make sense? |
+| `attributions` | attribution[] | 1-n | | can this be calculated? |
+| `datePublished` | date | 0-1 | | |
+| `dateCreated` | date | 0-1 | | |
+| `dateModified` | date | 0-1 | | |
+| `distribution` | url | 0-1 | | does this make sense? |
+| `alternativeTitles` | lang_string[] | 0-n | | |
+| `urls` | url[] | 0-n | | |
+| `additional` | lang_string / url | 0-n | | |
+
+!!! question
+ Do we consider datasets something merely "internal"?
+ If so, do metadata on datasets even make sense at all? Should we even "expose" datasets publicly?
+
+!!! question
+ Do we need to store the license on the dataset level,
+ or can we compute it from the records?
+ If we store it on the dataset level,
+ how do we deal with datasets that contain records with different licenses?
+
+!!! question
+ Do we need to store the language on the dataset level,
+ or can we compute it from the records?
+ If we store it on the dataset level,
+ how do we deal with datasets that contain records in different languages?
+
+!!! question
+ Do we need to store the attribution on the dataset level,
+ or can we compute it from the records?
+ If we store it on the dataset level,
+ how do we deal with datasets that contain records with different attributions?
+
+!!! question
+ Do we need a reference to the records in the dataset?
+
+
+#### Collection
+
+| Field | Type | Cardinality | Restrictions | Remarks |
+| ------------------ | ----------------- | ----------- | ------------------------------------------------ | -------------------------------------------------------- |
+| `__id` | string | 1 | | |
+| `__type` | string | 1 | Literal 'Collection' | |
+| `name` | string | 1 | | |
+| `accessConditions` | string | 1 | Literal "open", "restricted" or "closed" | copied from dataset; change to proper terms |
+| `provenance` | string | 0-1 | | |
+| `datePublished` | date | 0-1 | | copied from dataset; do we still need those? |
+| `dateCreated` | date | 0-1 | | copied from dataset; do we still need those? |
+| `dateModified` | date | 0-1 | | copied from dataset; do we still need those? |
+| `distribution` | url | 0-1 | | copied from dataset; does this make sense? |
+| `records` | id[] | 0-n | Record IDs | can be 0 in case it points to a collection |
+| `collections` | id[] | 0-n | Collection IDs | |
+| `alternativeNames` | lang_string[] | 0-n | | |
+| `keywords` | lang_string[] | 0-n | | does this make sense? |
+| `urls` | url[] | 0-n | | copied from dataset; |
+| `additional` | lang_string / url | 0-n | | copied from dataset; |
+| `description` | string / url | 1-n | | |
+| `typeOfData` | string[] | 1-n | Literal "XML", "Text", "Image", "Video", "Audio" | copied from dataset; does this still make sense? |
+| `licenses` | license[] | 1-n | | copied from dataset; should be computed from the records |
+| `copyright` | string[] | 1-n | | computed along with license |
+| `languages` | lang_string[] | 1-n | | copied from dataset; does this make sense? |
+| `attributions` | attribution[] | 1-n | | copied from dataset; can this be calculated? |
+
+
+!!! important
+ The properties for `Collection` were invented by me on the fly.
+ That does not mean they are correct or useful.
+
+
+!!! question
+ Do we need a reference to the records in the collection?
+
+
+#### Record
+
+| Field | Type | Cardinality | Restrictions | Remarks |
+| ------------------ | ----------- | ----------- | ------------------------------------------------ | -------------------------------------------------------- |
+| `__id` | string | 1 | | |
+| `__type` | string | 1 | Literal 'Record' | |
+| `pid` | id | 1 | | or `ARK`? |
+| `label` | lang_string | 1 | | do we want this, or does it go too far? |
+| `accessConditions` | string | 1 | Literal "open", "restricted" or "closed" | copied from dataset; change to proper terms |
+| `license` | license | 1 | | copied from dataset; should be computed from the records |
+| `copyright` | string | 1 | | computed along with license |
+| `attribution` | attribution | 1 | | do we want this, or does it go too far? |
+| `provenance` | string | 0-1 | | do we want this, or does it go too far? |
+| `datePublished` | date | 0-1 | | copied from dataset; do they make sense? |
+| `dateCreated` | date | 0-1 | | copied from dataset; do they make sense? |
+| `dateModified` | date | 0-1 | | copied from dataset; do they make sense? |
+| `typeOfData` | string | 0-1 | Literal "XML", "Text", "Image", "Video", "Audio" | copied from dataset; wanted? what values? |
+
+!!! important
+ The properties for `Record` were invented by me on the fly.
+ That does not mean they are correct or useful.
+
+!!! question
+ How granular do we want to be with the metadata on the record level?
+
+!!! question
+ If we have copyright, what is the purpose of attribution?
+
+
+#### Person
+
+| Field | Type | Cardinality | Restrictions | Remarks |
+| ---------------- | -------- | ----------- | -------------------------------------- | ------- |
+| `__id` | string | 1 | | |
+| `__type` | string | 1 | Literal 'Person' | |
+| `givenNames` | string[] | 1-n | | |
+| `familyNames` | string[] | 1-n | | |
+| `jobTitles` | string[] | 0-n | | |
+| `affiliations` | id[] | 0-n | Organization IDs | |
+| `address` | address | 0-1 | | |
+| `email` | string | 0-1 | | |
+| `secondaryEmail` | string | 0-1 | | |
+| `authorityRefs` | url[] | 0-n | References to external authority files | |
+
+
+#### Organization
+
+| Field | Type | Cardinality | Restrictions | Remarks |
+| ----------------- | ----------- | ----------- | -------------------------------------- | ------- |
+| `__id` | string | 1 | | |
+| `__type` | string | 1 | Literal 'Organization' | |
+| `name` | string | 1 | | |
+| `url` | url | 1 | | |
+| `address` | address | 0-1 | | |
+| `email` | string | 0-1 | | |
+| `alternativeName` | lang_string | 0-1 | | |
+| `authorityRefs` | url[] | 0-n | References to external authority files | |
+
+
+### Value Types
+
+#### String with Language Tag (`lang_string`)
+
+Object with an ISO language code as key and a string as value.
+
+```json
+{
+ "en": "Lorem ipsum in English.",
+ "de": "Lorem ipsum auf Deutsch."
+}
+```
+
+
+#### Date
+
+String with the format `YYYY-MM-DD`.
+
+
+#### URL
+
+An object representing a URL.
+Depending on the `type` field,
+the URL may be a generic URL
+or a more specific link, like a PID
+or a reference to a resource in an external authority file.
+
+
+| Field | Type | Cardinality | Restrictions |
+| -------- | ------ | ----------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
+| `__type` | string | 1 | Literal 'URL' |
+| `type` | string | 1 | Literal 'URL', 'Geonames', 'Pleiades', 'Skos', 'Periodo', 'Chronontology', 'GND', 'VIAF', 'Grid', 'ORCID', 'Creative Commons', 'DOI', 'ARK' |
+| `url` | string | 1 | |
+| `text` | string | 0-1 | |
+
+!!! question
+ Can we model different types of URLs in a more sensible way?
+
+
+#### Data Management Plan (`dmp`)
+
+| Field | Type | Cardinality | Restrictions |
+| ----------- | ------- | ----------- | ---------------------------- |
+| `__type` | string | 1 | Literal 'DataManagementPlan' |
+| `available` | boolean | 0-1 | |
+| `url` | url | 0-1 | |
+
+
+!!! question
+ Does the model for `Data Management Plan` still make sense?
+ Could it be a string?
+ Is "available" useful information?
+ How do we ensure that either `available` or `url` is set?
+
+
+#### Publication
+
+| Field | Type | Cardinality | Restrictions |
+| ------ | ------ | ----------- | ------------ |
+| `text` | string | 1 | |
+| `url` | url | 0-1 | |
+
+
+#### Address
+
+| Field | Type | Cardinality | Restrictions |
+| ------------ | ------ | ----------- | ----------------- |
+| `__type` | string | 1 | Literal 'Address' |
+| `street` | string | 1 | |
+| `postalCode` | string | 1 | |
+| `locality` | string | 1 | |
+| `country` | string | 1 | |
+| `canton` | string | 0-1 | |
+| `additional` | string | 0-1 | |
+
+
+#### License
+
+| Field | Type | Cardinality | Restrictions |
+| --------- | ------ | ----------- | ----------------- |
+| `__type` | string | 1 | Literal 'License' |
+| `license` | url | 1 | |
+| `date` | date | 1 | |
+| `details` | string | 0-1 | |
+
+!!! question
+ Is this model up to date with our current understanding of licenses?
+ Is `details` ever used?
+ What is the purpose of `date` here?
+ How does it relate to a copyright statement?
+
+
+#### Attribution
+
+| Field | Type | Cardinality | Restrictions | Remarks |
+| -------- | ------ | ----------- | ------------------------- | --------------------------- |
+| `__type` | string | 1 | Literal 'Attribution' | |
+| `agent` | id | 1 | Person or Organization ID | Or can this only be person? |
+| `roles` | string[] | 1-n | | |
+
+
+#### Grant
+
+| Field | Type | Cardinality | Restrictions |
+| --------- | ------ | ----------- | -------------------------- |
+| `__type` | string | 1 | Literal 'Grant' |
+| `funders` | id[] | 1-n | Person or Organization IDs |
+| `number` | string | 0-1 | |
+| `name` | string | 0-1 | |
+| `url` | url | 0-1 | |
+
+
+## Entity-Relationship Diagram
+
+```mermaid
+erDiagram
+ compoundProject |o--|{ project : projects
+ project ||--|{ dataset : datasets
+ project ||--|| person : contactPoint
+ project ||--|| organization : contactPoint
+ project ||--|{ person : funders
+ project ||--|{ organization : funders
+ project |o--|{ collection : collections
+ dataset ||--|{ record : records
+ collection |o--o{ collection : collections
+ collection |o--o{ record : records
+ person ||--|{ organization : affiliations
+
+ compoundProject {
+ string __type "1; Literal 'CompoundProject'"
+ string name "1"
+ url url "1"
+ string howToCite "1"
+ lang_string description "0-1"
+ id contactPoint "0-1"
+ id[] projects "1-n; Project IDs"
+ lang_string[] keywords "0-n"
+ lang_string_or_url[] disciplines "0-n"
+ lang_string_or_url[] temporalCoverage "0-n"
+ url[] spatialCoverage "0-n"
+ id[] funders "0-n; Person or Organization IDs"
+ publication[] publications "0-n"
+ grant[] grants "0-n"
+ lang_string[] alternativeNames "0-n"
+ id[] consistingInstitutions "0-n; Organization IDs"
+ }
+
+ project {
+ string __type "1; Literal 'Project'"
+ string shortcode "1"
+ string status "1; Literal 'Ongoing' or 'Finished'"
+ string name "1"
+ lang_string description "1"
+ date startDate "1"
+ string teaserText "1"
+ url url "1"
+ string howToCite "1"
+ id[] datasets "1-n; Dataset IDs"
+ lang_string[] keywords "1-n"
+ lang_string_or_url[] disciplines "1-n"
+ lang_string_or_url[] temporalCoverage "1-n"
+ url[] spatialCoverage "1-n"
+ id[] funders "1-n; Person or Organization IDs"
+ date endDate "0-1"
+ url secondaryURL "0-1"
+ dmp dataManagementPlan "0-1"
+ id contactPoint "0-1"
+ publication[] publications "0-n"
+ grant[] grants "0-n"
+ lang_string[] alternativeNames "0-n"
+ }
+
+ dataset {
+ string __id "1"
+ string __type "1; Literal 'Dataset'"
+ string title "1"
+ string accessConditions "1; Literal 'open', 'restricted' or 'closed'"
+ string howToCite "1"
+ string status "1; Literal 'In Planning', 'Ongoing', 'On hold', 'Finished'"
+ lang_string_or_url[] abstract "1-n"
+ string[] typeOfData "1-n; Literal 'XML', 'Text', 'Image', 'Video', 'Audio'"
+ license[] licenses "1-n"
+ string[] copyright "1-n"
+ lang_string[] languages "1-n"
+ attribution[] attributions "1-n"
+ date datePublished "0-1"
+ date dateCreated "0-1"
+ date dateModified "0-1"
+ url distribution "0-1"
+ lang_string[] alternativeTitles "0-n"
+ url[] urls "0-n"
+ lang_string_or_url[] additional "0-n"
+ }
+
+ collection {
+ string __id "1"
+ string __type "1; Literal 'Collection'"
+ string name "1"
+ string accessConditions "1; Literal 'open', 'restricted' or 'closed'"
+ string provenance "0-1"
+ date datePublished "0-1"
+ date dateCreated "0-1"
+ date dateModified "0-1"
+ url distribution "0-1"
+ id[] records "0-n; Record IDs"
+ id[] collections "0-n; Collection IDs"
+ lang_string[] alternativeNames "0-n"
+ lang_string[] keywords "0-n"
+ url[] urls "0-n"
+ lang_string_or_url[] additional "0-n"
+ lang_string_or_url[] description "1-n"
+ string[] typeOfData "1-n; Literal 'XML', 'Text', 'Image', 'Video', 'Audio'"
+ license[] licenses "1-n"
+ string[] copyright "1-n"
+ lang_string[] languages "1-n"
+ attribution[] attributions "1-n"
+ }
+
+ record {
+ string __id "1"
+ string __type "1; Literal 'Record'"
+ string pid "1"
+ lang_string label "1"
+ string accessConditions "1; Literal 'open', 'restricted' or 'closed'"
+ license license "1"
+ string copyright "1"
+ attribution attribution "1"
+ string provenance "0-1"
+ date datePublished "0-1"
+ date dateCreated "0-1"
+ date dateModified "0-1"
+ string typeOfData "0-1; Literal 'XML', 'Text', 'Image', 'Video', 'Audio'"
+ }
+
+ person {
+ string __id "1"
+ string __type "1; Literal 'Person'"
+ string[] givenNames "1-n"
+ string[] familyNames "1-n"
+ string[] jobTitles "0-n"
+ id[] affiliations "0-n; Organization IDs"
+ address address "0-1"
+ string email "0-1"
+ string secondaryEmail "0-1"
+ url[] authorityRefs "0-n"
+ }
+
+ organization {
+ string __id "1"
+ string __type "1; Literal 'Organization'"
+ string name "1"
+ url url "1"
+ address address "0-1"
+ string email "0-1"
+ lang_string alternativeName "0-1"
+ url[] authorityRefs "0-n"
+ }
+```
+
+
+
+## Change Log
+
+### Changes
+
+- Made `Grant` a value type and removed it from the top level.
+- Added entity `compoundProject` to the top level.
+- Added entity `collection` to the top level.
+- Added entity `record` to the top level.
+- Added `copyright` to `dataset`.
+
+### Implementation/migration Notes
+
+- inline grant in project
+- add/remove entities and properties accordingly
+
+
+### Mapping Old -> New
+
+TODO: Add mapping from old to new model.
diff --git a/mkdocs.yml b/mkdocs.yml
index f5544ebf..f8ced1ff 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -5,6 +5,7 @@ nav:
- Consuming Metadata:
- Metadata API: data/api.md
- Current Data Model: data/current-datamodel.md
+ - Provisional Data Model: data/provisional-datamodel.md
- Adding Metadata: adding-metadata.md
- Code Documentation:
- Overview: code/overview.md
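
Review note on the provisional data model added in this patch: the tables describe several constraints (4-character hexadecimal `shortcode`, `YYYY-MM-DD` dates, `lang_string` objects, and fields that hold identifiers of other top-level entities). The following is a minimal sketch of how such constraints could be checked, not the project's actual validation code. All function names, the example IDs, and the assumption that every referenced entity carries an `__id` in one of the top-level lists are illustrative.

```python
import re
from datetime import date

# Top-level lists whose members carry an `__id` in the provisional model.
ENTITY_LISTS = ("datasets", "collections", "records", "persons", "organizations")


def is_shortcode(value: str) -> bool:
    """Return True for a 4-character hexadecimal string."""
    return re.fullmatch(r"[0-9A-Fa-f]{4}", value) is not None


def is_valid_date(value: str) -> bool:
    """Return True if value is a real calendar date written as YYYY-MM-DD."""
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", value) is None:
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def is_lang_string(value: object) -> bool:
    """Return True for a non-empty object mapping language codes to strings."""
    return (
        isinstance(value, dict)
        and len(value) > 0
        and all(isinstance(k, str) and isinstance(v, str) for k, v in value.items())
    )


def index_entities(metadata: dict) -> dict:
    """Map every `__id` found in the top-level entity lists to its entity."""
    index = {}
    for key in ENTITY_LISTS:
        for entity in metadata.get(key, []):
            index[entity["__id"]] = entity
    return index


def dangling_references(metadata: dict) -> list:
    """List (field, id) pairs on the project that point at no known entity."""
    index = index_entities(metadata)
    project = metadata["project"]
    problems = []
    for field in ("datasets", "funders"):
        for ref in project.get(field, []):
            if ref not in index:
                problems.append((field, ref))
    contact = project.get("contactPoint")
    if contact is not None and contact not in index:
        problems.append(("contactPoint", contact))
    return problems


# Hypothetical minimal metadata set; all IDs and values are made up.
example = {
    "project": {
        "__type": "Project",
        "shortcode": "0ABC",
        "startDate": "2024-11-05",
        "datasets": ["ds-1"],
        "funders": ["org-1"],
        "contactPoint": "person-1",  # intentionally not defined below
    },
    "datasets": [{"__id": "ds-1", "__type": "Dataset"}],
    "organizations": [{"__id": "org-1", "__type": "Organization"}],
}
problems = dangling_references(example)
```

In this sketch only the references held by `project` are resolved; extending it to `collection.records` or `person.affiliations` would follow the same pattern.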
From 888e444767f45c5be5e22a8d027278c6b52fb0e9 Mon Sep 17 00:00:00 2001
From: Balduin Landolt <33053745+BalduinLandolt@users.noreply.github.com>
Date: Tue, 5 Nov 2024 17:10:36 +0100
Subject: [PATCH 3/8] remove broken markdown linting rule
---
.markdownlint.yml | 8 +++++---
docs/data/provisional-datamodel.md | 2 +-
2 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/.markdownlint.yml b/.markdownlint.yml
index a4c058ac..c467d2fe 100644
--- a/.markdownlint.yml
+++ b/.markdownlint.yml
@@ -1,7 +1,7 @@
# Config file for https://github.com/igorshubovych/markdownlint-cli
# MD007/ul-indent - Unordered list indentation
-MD007:
+MD007:
# Whether to indent the first level of the list
start_indented: false
# By how many spaces every next level must be indented. The default of 2 is not compatible with mkdocs!
@@ -14,7 +14,7 @@ MD009: false
MD012: false
# MD013/line-length - Line length
-MD013:
+MD013:
line_length: 120
heading_line_length: 120
code_block_line_length: 120
@@ -30,8 +30,10 @@ MD013:
# Stern length checking
stern: false
+MD018: false
+
# MD033/no-inline-html - Inline HTML
-MD033:
+MD033:
allowed_elements: [br, center]
# MD041/first-line-heading/first-line-h1 - First line in a file should be a top-level heading
diff --git a/docs/data/provisional-datamodel.md b/docs/data/provisional-datamodel.md
index b5218120..cb8ff6ec 100644
--- a/docs/data/provisional-datamodel.md
+++ b/docs/data/provisional-datamodel.md
@@ -596,6 +596,6 @@ erDiagram
- add/remove entities and properties accordingly
-### Mapping Old -> New
+### Mapping Old -> New
TODO: Add mapping from old to new model.
From 4b57f6a986213cb0cc1e1e4aab3eb3bdf8a33b65 Mon Sep 17 00:00:00 2001
From: Balduin Landolt <33053745+BalduinLandolt@users.noreply.github.com>
Date: Tue, 5 Nov 2024 17:12:14 +0100
Subject: [PATCH 4/8] maybe fix linting issue?
---
.markdownlint.yml | 2 --
docs/data/provisional-datamodel.md | 2 +-
2 files changed, 1 insertion(+), 3 deletions(-)
diff --git a/.markdownlint.yml b/.markdownlint.yml
index c467d2fe..8960ebe8 100644
--- a/.markdownlint.yml
+++ b/.markdownlint.yml
@@ -30,8 +30,6 @@ MD013:
# Stern length checking
stern: false
-MD018: false
-
# MD033/no-inline-html - Inline HTML
MD033:
allowed_elements: [br, center]
diff --git a/docs/data/provisional-datamodel.md b/docs/data/provisional-datamodel.md
index cb8ff6ec..880d558c 100644
--- a/docs/data/provisional-datamodel.md
+++ b/docs/data/provisional-datamodel.md
@@ -596,6 +596,6 @@ erDiagram
- add/remove entities and properties accordingly
-### Mapping Old -> New
+### Mapping (Old to New)
TODO: Add mapping from old to new model.
From 7bce4ce356bf20335ca0b0ec3edd85d5752020a8 Mon Sep 17 00:00:00 2001
From: Balduin Landolt <33053745+BalduinLandolt@users.noreply.github.com>
Date: Tue, 5 Nov 2024 17:27:51 +0100
Subject: [PATCH 5/8] Update provisional-datamodel.md
---
docs/data/provisional-datamodel.md | 134 ++++++++++++++++++++++++++++-
1 file changed, 133 insertions(+), 1 deletion(-)
diff --git a/docs/data/provisional-datamodel.md b/docs/data/provisional-datamodel.md
index 880d558c..c65e51be 100644
--- a/docs/data/provisional-datamodel.md
+++ b/docs/data/provisional-datamodel.md
@@ -104,6 +104,7 @@ but are identified by their position in the hierarchy.
| Field | Type | Cardinality | Restrictions | Remarks |
| ------------------------ | ------------------- | ----------- | ------------------------------------------------------------ | ------------------ |
+| `__id` | string | 1 | | |
| `__type` | string | 1 | Literal 'CompoundProject' | |
| `name` | string | 1 | | |
| `url` | url | 1 | | |
@@ -448,6 +449,7 @@ erDiagram
person ||--|{ organization : affiliations
compoundProject {
+ string __id "1"
string __type "1; Literal 'CompoundProject'"
string name "1"
url url "1"
@@ -598,4 +600,134 @@ erDiagram
### Mapping (Old to New)
-TODO: Add mapping from old to new model.
+#### Compound Project
+
+- `compoundProject.__id`: new
+- `compoundProject.__type`: new
+- `compoundProject.name`: new
+- `compoundProject.url`: new
+- `compoundProject.howToCite`: new
+- `compoundProject.description`: new
+- `compoundProject.contactPoint`: new
+- `compoundProject.keywords`: new
+- `compoundProject.disciplines`: new
+- `compoundProject.temporalCoverage`: new
+- `compoundProject.spatialCoverage`: new
+- `compoundProject.funders`: new
+- `compoundProject.publications`: new
+- `compoundProject.grants`: new
+- `compoundProject.alternativeNames`: new
+- `compoundProject.consistingInstitutions`: new
+
+This entity is new and does not have a direct mapping from the old model.
+All values need to be defined and added manually.
+
+#### Project
+
+- `project.__type`: unchanged
+- `project.shortcode`: unchanged
+- `project.status`: unchanged
+- `project.name`: unchanged
+- `project.description`: unchanged
+- `project.startDate`: unchanged
+- `project.teaserText`: unchanged
+- `project.url`: unchanged
+- `project.howToCite`: unchanged
+- `project.datasets`: unchanged
+- `project.keywords`: unchanged
+- `project.disciplines`: unchanged
+- `project.temporalCoverage`: unchanged
+- `project.spatialCoverage`: unchanged
+- `project.funders`: unchanged
+- `project.endDate`: unchanged
+- `project.secondaryURL`: unchanged
+- `project.dataManagementPlan`: unchanged
+- `project.contactPoint`: unchanged
+- `project.publications`: unchanged
+- `project.grants`: inlined from top level to project
+- `project.alternativeNames`: unchanged
+
+#### Dataset
+
+- `dataset.__id`: unchanged
+- `dataset.__type`: unchanged
+- `dataset.title`: unchanged
+- `dataset.accessConditions`: unchanged
+- `dataset.howToCite`: unchanged
+- `dataset.status`: unchanged
+- `dataset.abstract`: unchanged
+- `dataset.typeOfData`: unchanged
+- `dataset.licenses`: unchanged
+- `dataset.copyright`: newly added
+- `dataset.languages`: unchanged
+- `dataset.attributions`: unchanged
+- `dataset.datePublished`: unchanged
+- `dataset.dateCreated`: unchanged
+- `dataset.dateModified`: unchanged
+- `dataset.distribution`: unchanged
+- `dataset.alternativeTitles`: unchanged
+- `dataset.urls`: unchanged
+- `dataset.additional`: unchanged
+
+#### Collection
+
+- `collection.__id`: new
+- `collection.__type`: new
+- `collection.name`: new
+- `collection.accessConditions`: new
+- `collection.provenance`: new
+- `collection.datePublished`: new
+- `collection.dateCreated`: new
+- `collection.dateModified`: new
+- `collection.distribution`: new
+- `collection.records`: new
+- `collection.collections`: new
+- `collection.alternativeNames`: new
+- `collection.keywords`: new
+- `collection.urls`: new
+- `collection.additional`: new
+- `collection.description`: new
+- `collection.typeOfData`: new
+- `collection.licenses`: new
+- `collection.copyright`: new
+- `collection.languages`: new
+- `collection.attributions`: new
+
+#### Record
+
+- `record.__id`: new
+- `record.__type`: new
+- `record.pid`: new
+- `record.label`: new
+- `record.accessConditions`: new
+- `record.license`: new
+- `record.attribution`: new
+- `record.provenance`: new
+- `record.datePublished`: new
+- `record.dateCreated`: new
+- `record.dateModified`: new
+- `record.typeOfData`: new
+
+#### Person
+
+- `person.__id`: unchanged
+- `person.__type`: unchanged
+- `person.givenNames`: unchanged
+- `person.familyNames`: unchanged
+- `person.jobTitles`: unchanged
+- `person.affiliations`: unchanged
+- `person.address`: unchanged
+- `person.email`: unchanged
+- `person.secondaryEmail`: unchanged
+- `person.authorityRefs`: unchanged
+
+#### Organization
+
+- `organization.__id`: unchanged
+- `organization.__type`: unchanged
+- `organization.name`: unchanged
+- `organization.url`: unchanged
+- `organization.address`: unchanged
+- `organization.email`: unchanged
+- `organization.alternativeName`: unchanged
+- `organization.authorityRefs`: unchanged
From 88323d1a20dcd9b59e80d820a72034161538d06e Mon Sep 17 00:00:00 2001
From: Balduin Landolt <33053745+BalduinLandolt@users.noreply.github.com>
Date: Tue, 5 Nov 2024 17:28:19 +0100
Subject: [PATCH 6/8] revert unrelated changes
---
.markdownlint.yml | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/.markdownlint.yml b/.markdownlint.yml
index 8960ebe8..a4c058ac 100644
--- a/.markdownlint.yml
+++ b/.markdownlint.yml
@@ -1,7 +1,7 @@
# Config file for https://github.com/igorshubovych/markdownlint-cli
# MD007/ul-indent - Unordered list indentation
-MD007:
+MD007:
# Whether to indent the first level of the list
start_indented: false
# By how many spaces every next level must be indented. The default of 2 is not compatible with mkdocs!
@@ -14,7 +14,7 @@ MD009: false
MD012: false
# MD013/line-length - Line length
-MD013:
+MD013:
line_length: 120
heading_line_length: 120
code_block_line_length: 120
@@ -31,7 +31,7 @@ MD013:
stern: false
# MD033/no-inline-html - Inline HTML
-MD033:
+MD033:
allowed_elements: [br, center]
# MD041/first-line-heading/first-line-h1 - First line in a file should be a top-level heading
From 3ae0b5c19c9630e7d4012f20d364f314fe6ebea8 Mon Sep 17 00:00:00 2001
From: Balduin Landolt <33053745+BalduinLandolt@users.noreply.github.com>
Date: Tue, 5 Nov 2024 18:11:24 +0100
Subject: [PATCH 7/8] Update .markdownlint.yml
---
.markdownlint.yml | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/.markdownlint.yml b/.markdownlint.yml
index a4c058ac..3a503fca 100644
--- a/.markdownlint.yml
+++ b/.markdownlint.yml
@@ -1,7 +1,7 @@
# Config file for https://github.com/igorshubovych/markdownlint-cli
# MD007/ul-indent - Unordered list indentation
-MD007:
+MD007:
# Whether to indent the first level of the list
start_indented: false
# By how many spaces every next level must be indented. The default of 2 is not compatible with mkdocs!
@@ -14,7 +14,7 @@ MD009: false
MD012: false
# MD013/line-length - Line length
-MD013:
+MD013:
line_length: 120
heading_line_length: 120
code_block_line_length: 120
@@ -30,8 +30,11 @@ MD013:
# Stern length checking
stern: false
+MD024:
+ siblings_only: true
+
# MD033/no-inline-html - Inline HTML
-MD033:
+MD033:
allowed_elements: [br, center]
# MD041/first-line-heading/first-line-h1 - First line in a file should be a top-level heading
From 4973ce74cae675a5c6116d2d375b7c26b27c9674 Mon Sep 17 00:00:00 2001
From: Balduin Landolt <33053745+BalduinLandolt@users.noreply.github.com>
Date: Wed, 6 Nov 2024 18:17:47 +0100
Subject: [PATCH 8/8] changes according to discussion
---
docs/data/provisional-datamodel.md | 408 ++++++++++-------------------
1 file changed, 137 insertions(+), 271 deletions(-)
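
Reviewer note (not part of the commit): the cardinalities in the new Umbrella Project table can be read as a simple check. The sketch below is an illustration only. The helper `check_cardinalities`, the example identifiers, and the representation of a `lang_string` as a language-keyed dict are assumptions, not existing dsp-meta code.

```python
# Hypothetical helper, not part of dsp-meta: checks a dict against the
# cardinalities listed in the Umbrella Project table of this patch.
UMBRELLA_PROJECT_FIELDS = {
    "__id": "1",
    "__type": "1",
    "name": "1",
    "projects": "1-n",
    "description": "0-1",
    "alternativeNames": "0-n",
    "url": "0-1",
    "contactPoint": "0-1",
    "institutionalPartner": "0-n",
}


def check_cardinalities(entity, spec):
    """Return a list of human-readable cardinality violations."""
    errors = []
    for field, card in spec.items():
        value = entity.get(field)
        if card in ("1", "0-1"):
            # Single-valued fields must not be lists; "1" is also required.
            if isinstance(value, list):
                errors.append(f"{field}: expected a single value, got a list")
            elif value is None and card == "1":
                errors.append(f"{field}: required field is missing")
        else:
            # "1-n" and "0-n" fields are lists; a missing list counts as empty.
            if value is None:
                value = []
            if not isinstance(value, list):
                errors.append(f"{field}: expected a list")
            elif card == "1-n" and len(value) == 0:
                errors.append(f"{field}: needs at least one entry")
    # Report fields that the table does not define.
    unknown = sorted(set(entity) - set(spec))
    errors.extend(f"{name}: unknown field" for name in unknown)
    return errors


# Made-up example data; identifiers and the lang_string shape are assumed.
example = {
    "__id": "umbrella-001",
    "__type": "UmbrellaProject",
    "name": "Example Umbrella Project",
    "projects": ["project-0001"],
    "description": {"en": "A made-up example."},
}
print(check_cardinalities(example, UMBRELLA_PROJECT_FIELDS))
```

Because only `__id`, `__type`, `name` and `projects` are mandatory, a minimal umbrella project passes the check with four fields, which matches the stated goal of keeping this entity flexible.
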
diff --git a/docs/data/provisional-datamodel.md b/docs/data/provisional-datamodel.md
index c65e51be..ca104cb6 100644
--- a/docs/data/provisional-datamodel.md
+++ b/docs/data/provisional-datamodel.md
@@ -21,7 +21,7 @@ The metadata model is a hierarchical structure of metadata elements.
```mermaid
flowchart TD
- hyper-project[Hyper-Project /
Uber-Project /
Meta-Project /
Compound Project] -->|1-n| project[Project /
Research Project]
+ hyper-project[Umbrella Project] -->|1-n| project[Research Project]
project -->|1-n| dataset[Dataset]
dataset -->|1-n| record[Record /
Resource]
project -->|0-n| collection[Collection]
@@ -30,7 +30,7 @@ flowchart TD
collection --> record
```
-- A `Compound Project` is optional and collects one or more `Research Projects`.
+- An `Umbrella Project` is optional and collects one or more `Research Projects`.
It is typically of institutional nature,
not directly tied to a specific funding grant,
and may be long-lived.
@@ -40,7 +40,7 @@ flowchart TD
It is typically tied to a specific funding grant,
and hence has a limited lifetime of ~3-5 years;
multiple funding rounds and a longer lifetime are possible.
- A `Research Project` is part of 0-1 `Compound Project`,
+ A `Research Project` is part of 0-1 `Umbrella Project`,
it has 1-n `Datasets` and 0-n `Collections`.
- A `Dataset` is a collection of `Records` within a `Research Project`.
It is mostly meant for system-internal and technical use,
@@ -52,7 +52,7 @@ flowchart TD
and may have a "historical meaning" in the context of the project.
Examples may be physical collections such as p person's "Nachlass" in an archive,
or groupings of records based on a specific research question within a project.
- A `Collection` is part of at least 1 `Research Project`, `Compound Project` or `Collection`,
+ A `Collection` is part of at least 1 `Research Project`, `Umbrella Project` or `Collection`,
but can be part of multiple. It may either contain 0-n `Collections` or 1-n `Records`.
- A `Record` is a single resource within a `Dataset`.
It represents a single entity, and the smallest unit that can meaningfully have an identifier.
@@ -68,7 +68,7 @@ and may be related to various entities within the hierarchy.
A set of metadata consists of the following top-level elements:
-- Compound Project
+- Umbrella Project
- Project
- Dataset
- Collection
@@ -87,7 +87,7 @@ but are identified by their position in the hierarchy.
| Field | Type | Cardinality |
| ----------------- | --------------- | ----------- |
| `$schema` | string | 0-1 |
-| `compoundProject` | compoundProject | 0-1 |
+| `umbrellaProject` | umbrellaProject | 0-1 |
| `project` | project | 1 |
| `datasets` | dataset[] | 1-n |
| `collections` | collection[] | 0-n |
@@ -96,77 +96,89 @@ but are identified by their position in the hierarchy.
| `organizations` | organization[] | 0-n |
+!!! question
+ Do we consider "permissions" as metadata?
+ (Not as they are in the DSP, but as they will be in the archive;
+ that is: "open", "restricted", "embargo", "metadata only".)
+ If so, this should be added on each level, I suppose.
+
+
## Types
### Entity Types
-#### Compound Project
-
-| Field | Type | Cardinality | Restrictions | Remarks |
-| ------------------------ | ------------------- | ----------- | ------------------------------------------------------------ | ------------------ |
-| `__id` | string | 1 | | |
-| `__type` | string | 1 | Literal 'CompoundProject' | |
-| `name` | string | 1 | | |
-| `url` | url | 1 | | |
-| `howToCite` | string | 1 | | Needed? |
-| `projects` | id[] | 1-n | String containing the identifier of a project | |
-| `description` | lang_string | 0-1 | | Optional? |
-| `contactPoint` | id | 0-1 | String containing the identifier of a person or organization | Optional? |
-| `keywords` | lang_string[] | 0-n | | Needed? |
-| `disciplines` | lang_string / url[] | 0-n | | Needed? |
-| `temporalCoverage` | lang_string / url[] | 0-n | | Needed? |
-| `spatialCoverage` | url[] | 0-n | | Needed? |
-| `funders` | id[] | 0-n | String containing the identifier of a person | Needed? |
-| `publications` | publication[] | 0-n | | Needed? |
-| `grants` | grant[] | 0-n | | Needed? |
-| `alternativeNames` | lang_string[] | 0-n | | Needed? |
-| `consistingInstitutions` | id[] | 0-n | String containing the identifier of an organization | Makes sense? Name? |
+#### Umbrella Project
+
+| Field | Type | Card. | Restrictions |
+| ---------------------- | ------------- | ----- | ------------------------------------------------------------ |
+| `__id` | string | 1 | |
+| `__type` | string | 1 | Literal 'UmbrellaProject' |
+| `name` | string | 1 | |
+| `projects` | id[] | 1-n | String containing the identifier of a project |
+| `description` | lang_string | 0-1 | |
+| `alternativeNames` | lang_string[] | 0-n | |
+| `url` | url | 0-1 | |
+| `contactPoint` | id | 0-1 | String containing the identifier of a person or organization |
+| `institutionalPartner` | id[] | 0-n | String containing the identifier of an organization |
!!! question
- This opens up the questions of how to deal with multiple projects in a compound project.
+ This raises the question of how to deal with multiple projects in an umbrella project.
We probably want to keep one entry per project,
- so this leaves us with either duplicating the compound project metadata for each project,
- or having compound project metadata separately and only linking it from the project.
+ so this leaves us with either duplicating the umbrella project metadata for each project,
+ or having umbrella project metadata separately and only linking it from the project.
The latter seems preferable,
- but then the question arises who gets to edit the compound project metadata.
+ but then the question arises who gets to edit the umbrella project metadata.
For a first implementation, we could simply duplicate the metadata for each project,
and later factor it out.
-!!! important
- The properties for `Compound Project` were invented by me on the fly.
- That does not mean they are correct or useful.
+!!! question
+ What is the best name for `institutionalPartner`?
+ AI suggested:
+ - Affiliated Institution
+ - Associated Body
+ - Supporting Organization
+ - Institutional Partner
+
+!!! question
+ How do we capture the time aspect of the data provenance and genesis in this context? Should this be here?
+ Concretely, an umbrella project is often like a "timeline" of projects, or the "history" of a series of projects.
+
+To make the model of this entity as flexible as possible,
+most of the fields are optional.
#### Project
-| Field | Type | Cardinality | Restrictions | Remarks |
-| -------------------- | ------------------- | ----------- | ------------------------------------------------------------ | --------------------- |
-| `__type` | string | 1 | Literal "Project" | |
-| `shortcode` | string | 1 | 4 char hexadecimal | |
-| `status` | string | 1 | Literal "Ongoing" or "Finished" | |
-| `name` | string | 1 | | |
-| `description` | lang_string | 1 | | |
-| `startDate` | date | 1 | String of format "YYYY-MM-DD" | |
-| `teaserText` | string | 1 | | |
-| `url` | url | 1 | | |
-| `howToCite` | string | 1 | | |
-| `datasets` | id[] | 1-n | String containing the identifier of a dataset | |
-| `keywords` | lang_string[] | 1-n | | |
-| `disciplines` | lang_string / url[] | 1-n | | |
-| `temporalCoverage` | lang_string / url[] | 1-n | | |
-| `spatialCoverage` | url[] | 1-n | | |
-| `funders` | id[] | 1-n | String containing the identifier of a person or organization | Does this make sense? |
-| `endDate` | date | 0-1 | String of format "YYYY-MM-DD" | |
-| `secondaryURL` | url | 0-1 | | |
-| `dataManagementPlan` | dmp | 0-1 | | |
-| `contactPoint` | id | 0-1 | String containing the identifier of a person or organization | |
-| `publications` | publication[] | 0-n | | |
-| `grants` | grant[] | 0-n | | Does this make sense? |
-| `alternativeNames` | lang_string[] | 0-n | | |
+| Field | Type | Cardinality | Restrictions |
+| -------------------- | ------------------- | ----------- | ------------------------------------------------------------ |
+| `__type` | string | 1 | Literal "Project" |
+| `shortcode` | string | 1 | 4 char hexadecimal |
+| `status` | string | 1 | Literal "Ongoing" or "Finished" |
+| `name` | string | 1 | |
+| `description` | lang_string | 1 | |
+| `startDate` | date | 1 | String of format "YYYY-MM-DD" |
+| `teaserText` | string | 1 | |
+| `url` | url | 1 | |
+| `howToCite` | string | 1 | |
+| `datasets` | id[] | 1-n | String containing the identifier of a dataset |
+| `keywords` | lang_string[] | 1-n | |
+| `disciplines` | lang_string / url[] | 1-n | |
+| `temporalCoverage` | lang_string / url[] | 1-n | |
+| `spatialCoverage` | url[] | 1-n | |
+| `funders` | id[] | 1-n | String containing the identifier of a person or organization |
+| `attributions` | attribution[] | 1-n | |
+| `endDate` | date | 0-1 | String of format "YYYY-MM-DD" |
+| `secondaryURL` | url | 0-1 | |
+| `dataManagementPlan` | dmp | 0-1 | |
+| `contactPoint` | id | 0-1 | String containing the identifier of a person or organization |
+| `publications` | publication[] | 0-n | |
+| `grants` | grant[] | 0-n | |
+| `alternativeNames` | lang_string[] | 0-n | |
!!! question
If we can have copyright/license on dataset level,
- do we want to have it on project level as well?
+ do we want to have it on project level as well?
+ In any case, it should be computed from the datasets/records.
!!! question
Do we still need funders if we have grants?
@@ -174,34 +186,33 @@ but are identified by their position in the hierarchy.
!!! question
What about projects that do not have funding?
+!!! question
+ Do we want my proposed `attributions` field on project?
+
+!!! question
+ Should we have an `abstract` field in the project, like we used to have in the dataset?
+
#### Dataset
-| Field | Type | Cardinality | Restrictions | Remarks |
-| ------------------- | ----------------- | ----------- | ------------------------------------------------------- | ----------------------------------- |
-| `__id` | string | 1 | | |
-| `__type` | string | 1 | Literal "Dataset" | |
-| `title` | string | 1 | | |
-| `accessConditions` | string | 1 | Literal "open", "restricted" or "closed" | change to proper terms |
-| `howToCite` | string | 1 | | |
-| `status` | string | 1 | Literal "In Planning", "Ongoing", "On hold", "Finished" | not aligned with project status |
-| `abstract` | lang_string / url | 1-n | | naming: maybe 'description'? |
-| `typeOfData` | string[] | 1-n | Literal "XML", "Text", "Image", "Video", "Audio" | does this still make sense? |
-| `licenses` | license[] | 1-n | | should be computed from the records |
-| `copyright` | string[] | 1-n | | computed along with license |
-| `languages` | lang_string[] | 1-n | | does this make sense? |
-| `attributions` | attribution[] | 1-n | | can this be calculated? |
-| `datePublished` | date | 0-1 | | |
-| `dateCreated` | date | 0-1 | | |
-| `dateModified` | date | 0-1 | | |
-| `distribution` | url | 0-1 | | does this make sense? |
-| `alternativeTitles` | lang_string[] | 0-n | | |
-| `urls` | url[] | 0-n | | |
-| `additional` | lang_string / url | 0-n | | |
+| Field | Type | Cardinality | Restrictions | Remarks |
+| -------------- | ------------- | ----------- | ------------------------------------------------ | ------------------------------------------------------- |
+| `__id` | string | 1 | | |
+| `__type` | string | 1 | Literal "Dataset" | |
+| `title` | string | 1 | | may be auto-generated? |
+| `typeOfData` | string[] | 1-n | Literal "XML", "Text", "Image", "Video", "Audio" | does this still make sense? should it be cardinality 1? |
+| `licenses` | license[] | 1-n | | should be computed from the records |
+| `copyright` | string[] | 1-n | | computed along with license |
+| `attributions` | attribution[] | 1-n | | can this be computed? |
+| `howToCite` | string | 0-1 | | still wanted? |
+| `description` | lang_string | 0-1 | | |
+| `dateCreated` | date | 0-1 | | |
-!!! question
- Do we conssider datasets something merely "internal"?
- If so, do metadata on datasets even make sense at all? Should we even "expose" datasets publicly?
+!!! note
+ If we think of a dataset as something internal,
+ we should limit the metadata to what is necessary for the system to work.
+ Additionally, we may want to have some minimal descriptive metadata for the dataset
+ (e.g. for the use case where a project grabs a box of archival material once a year and digitizes it).
!!! question
Do we need to store the license on the dataset level,
@@ -224,6 +235,17 @@ but are identified by their position in the hierarchy.
!!! question
Do we need a reference to the records in the dataset?
+!!! question
+ Does `dateCreated` suffice here? There were more date properties in the old model.
+
+Datasets are for internal use;
+they serve to partition the data into manageable chunks.
+This is done both by type of data (RDF vs. assets), and by size.
+
+In some cases, there may be a "logical" grouping constituting a dataset,
+e.g. if data is digitized in a batch and there is a temporal separation between the batches.
+In these cases, the project may make use of the descriptive metadata of the dataset.
+But normally, the dataset is just a technical entity, and should not carry semantic information.
#### Collection
@@ -232,11 +254,13 @@ but are identified by their position in the hierarchy.
| `__id` | string | 1 | | |
| `__type` | string | 1 | Literal 'Collection' | |
| `name` | string | 1 | | |
-| `accessConditions` | string | 1 | Literal "open", "restricted" or "closed" | copied from dataset; change to proper terms |
+| `description` | string / url | 1-n | | |
+| `typeOfData` | string[] | 1-n | Literal "XML", "Text", "Image", "Video", "Audio" | copied from dataset; does this still make sense? |
+| `licenses` | license[] | 1-n | | copied from dataset; should be computed from the records |
+| `copyright` | string[] | 1-n | | computed along with license |
+| `languages` | lang_string[] | 1-n | | copied from dataset; does this make sense? |
+| `attributions` | attribution[] | 1-n | | copied from dataset; can this be calculated? |
| `provenance` | string | 0-1 | | |
-| `datePublished` | date | 0-1 | | copied from dataset; do we still need those? |
-| `dateCreated` | date | 0-1 | | copied from dataset; do we still need those? |
-| `dateModified` | date | 0-1 | | copied from dataset; do we still need those? |
| `distribution` | url | 0-1 | | copied from dataset; does this make sense? |
| `records` | id[] | 0-n | Record IDs | can be 0 in case it points to a collection |
| `collections` | id[] | 0-n | Collection IDs | |
@@ -244,17 +268,6 @@ but are identified by their position in the hierarchy.
| `keywords` | lang_string[] | 0-n | | does this make sense? |
| `urls` | url[] | 0-n | | copied from dataset; |
| `additional` | lang_string / url | 0-n | | copied from dataset; |
-| `description` | string / url | 1-n | | |
-| `typeOfData` | string[] | 1-n | Literal "XML", "Text", "Image", "Video", "Audio" | copied from dataset; does this still make sense? |
-| `licenses` | license[] | 1-n | | copied from dataset; should be computed from the records |
-| `copyright` | string[] | 1-n | | computed along with license |
-| `languages` | lang_string[] | 1-n | | copied from dataset; does this make sense? |
-| `attributions` | attribution[] | 1-n | | copied from dataset; can this be calculated? |
-
-
-!!! important
- The properties for `Compound Project` were invented by me on the fly.
- That does not mean they are correct or useful.
!!! question
@@ -279,10 +292,6 @@ but are identified by their position in the hierarchy.
| `dateModified` | date | 0-1 | | copied from dataset; do they make sense? |
| `typeOfData` | string | 0-1 | Literal "XML", "Text", "Image", "Video", "Audio" | copied from dataset; wanted? what values? |
-!!! important
- The properties for `Record` were invented by me on the fly.
- That does not mean they are correct or useful.
-
!!! question
How granular do we want to be with the metadata on the record level?
@@ -436,7 +445,7 @@ or a reference to a resource in an external authority file.
```mermaid
erDiagram
- compoundProject |o--|{ project : projects
+ umbrellaProject |o--|{ project : projects
project ||--|{ dataset : datasets
project ||--|| person : contactPoint
project ||--|| organization : contactPoint
@@ -448,30 +457,23 @@ erDiagram
collection |o--o{ record : records
person ||--|{ organization : affiliations
- compoundProject {
+ umbrellaProject {
string __id "1"
- string __type "1; Literal 'CompoundProject'"
+ string __type "1; Literal 'UmbrellaProject'"
string name "1"
- url url "1"
- string howToCite "1"
- lang_string description "0-1"
- id contactPoint "0-1"
id[] projects "1-n; Project IDs"
- lang_string[] keywords "0-n"
- lang_string_or_url[] disciplines "0-n"
- lang_string_or_url[] temporalCoverage "0-n"
- url[] spatialCoverage "0-n"
- id[] funders "0-n; Person or Organization IDs"
- publication[] publications "0-n"
- grant[] grants "0-n"
+ lang_string description "0-1"
lang_string[] alternativeNames "0-n"
- id[] consistingInstitutions "0-n; Organization IDs"
+ url url "0-1"
+ id contactPoint "0-1"
+ id[] institutionalPartner "0-n; Organization IDs"
}
project {
+ string __id "1"
string __type "1; Literal 'Project'"
string shortcode "1"
- string status "1; Literal 'Ongoing' or 'Finished'"
+ string status "1; Literal 'Ongoing', 'Finished'"
string name "1"
lang_string description "1"
date startDate "1"
@@ -484,6 +486,7 @@ erDiagram
lang_string_or_url[] temporalCoverage "1-n"
url[] spatialCoverage "1-n"
id[] funders "1-n; Person or Organization IDs"
+ attribution[] attributions "1-n"
date endDate "0-1"
url secondaryURL "0-1"
dmp dataManagementPlan "0-1"
@@ -497,22 +500,13 @@ erDiagram
string __id "1"
string __type "1; Literal 'Dataset'"
string title "1"
- string accessConditions "1; Literal 'open', 'restricted' or 'closed'"
- string howToCite "1"
- string status "1; Literal 'In Planning', 'Ongoing', 'On hold', 'Finished'"
- lang_string_or_url[] abstract "1-n"
string[] typeOfData "1-n; Literal 'XML', 'Text', 'Image', 'Video', 'Audio'"
license[] licenses "1-n"
string[] copyright "1-n"
- lang_string[] languages "1-n"
attribution[] attributions "1-n"
- date datePublished "0-1"
+ string howToCite "0-1"
+ lang_string description "0-1"
date dateCreated "0-1"
- date dateModified "0-1"
- url distribution "0-1"
- lang_string[] alternativeTitles "0-n"
- url[] urls "0-n"
- lang_string_or_url[] additional "0-n"
}
collection {
@@ -584,150 +578,22 @@ erDiagram
## Change Log
-### Changes
- Make `Grant` a value type and remove it from the top level.
-- Added entity `compoundProject` to the top level.
+- Added entity `umbrellaProject` to the top level.
- Added entity `collection` to the top level.
- Added entity `record` to the top level.
- Added `copyright` to `dataset`.
-
-### Implementation/migration Notes
-
-- inline grant in project
-- add/remove entities and properties accordingly
-
-
-### Mapping (Old to New)
-
-#### Compound Project
-
-- `compoundProject.__id` : new
-- `compoundProject.__type` : new
-- `compoundProject.name`: new
-- `compoundProject.url`: new
-- `compoundProject.howToCite`: new
-- `compoundProject.description`: new
-- `compoundProject.contactPoint`: new
-- `compoundProject.keywords`: new
-- `compoundProject.disciplines`: new
-- `compoundProject.temporalCoverage`: new
-- `compoundProject.spatialCoverage`: new
-- `compoundProject.funders`: new
-- `compoundProject.publications`: new
-- `compoundProject.grants`: new
-- `compoundProject.alternativeNames`: new
-- `compoundProject.consistingInstitutions`: new
-
-This entity is new and does not have a direct mapping from the old model.
-All values need to be defined and added manually.
-
-#### Project
-
-- `project.__type`: unchanged
-- `project.shortcode`: unchanged
-- `project.status`: unchanged
-- `project.name`: unchanged
-- `project.description`: unchanged
-- `project.startDate`: unchanged
-- `project.teaserText`: unchanged
-- `project.url`: unchanged
-- `project.howToCite`: unchanged
-- `project.datasets`: unchanged
-- `project.keywords`: unchanged
-- `project.disciplines`: unchanged
-- `project.temporalCoverage`: unchanged
-- `project.spatialCoverage`: unchanged
-- `project.funders`: unchanged
-- `project.endDate`: unchanged
-- `project.secondaryURL`: unchanged
-- `project.dataManagementPlan`: unchanged
-- `project.contactPoint`: unchanged
-- `project.publications`: unchanged
-- `project.grants`: inlined from top level to project
-- `project.alternativeNames`: unchanged
-
-#### Dataset
-
-- `dataset.__id`: unchanged
-- `dataset.__type`: unchanged
-- `dataset.title`: unchanged
-- `dataset.accessConditions`: unchanged
-- `dataset.howToCite`: unchanged
-- `dataset.status`: unchanged
-- `dataset.abstract`: unchanged
-- `dataset.typeOfData`: unchanged
-- `dataset.licenses`: unchanged
-- `dataset.copyright`: newly added
-- `dataset.languages`: unchanged
-- `dataset.attributions`: unchanged
-- `dataset.datePublished`: unchanged
-- `dataset.dateCreated`: unchanged
-- `dataset.dateModified`: unchanged
-- `dataset.distribution`: unchanged
-- `dataset.alternativeTitles`: unchanged
-- `dataset.urls`: unchanged
-- `dataset.additional`: unchanged
-
-#### Collection
-
-- `collection.__id`: new
-- `collection.__type`: new
-- `collection.name`: new
-- `collection.accessConditions`: new
-- `collection.provenance`: new
-- `collection.datePublished`: new
-- `collection.dateCreated`: new
-- `collection.dateModified`: new
-- `collection.distribution`: new
-- `collection.records`: new
-- `collection.collections`: new
-- `collection.alternativeNames`: new
-- `collection.keywords`: new
-- `collection.urls`: new
-- `collection.additional`: new
-- `collection.description`: new
-- `collection.typeOfData`: new
-- `collection.licenses`: new
-- `collection.copyright`: new
-- `collection.languages`: new
-- `collection.attributions`: new
-
-#### Record
-
-- `record.__id`: new
-- `record.__type`: new
-- `record.pid`: new
-- `record.label`: new
-- `record.accessConditions`: new
-- `record.license`: new
-- `record.attribution`: new
-- `record.provenance`: new
-- `record.datePublished`: new
-- `record.dateCreated`: new
-- `record.dateModified`: new
-- `record.typeOfData`: new
-
-#### Person
-
-- `person.__id`: unchanged
-- `person.__type`: unchanged
-- `person.givenNames`: unchanged
-- `person.familyNames`: unchanged
-- `person.jobTitles`: unchanged
-- `person.affiliations`: unchanged
-- `person.address`: unchanged
-- `person.email`: unchanged
-- `person.secondaryEmail`: unchanged
-- `person.authorityRefs`: unchanged
-
-#### Organization
-
-- `organization.__id`: unchanged
-- `organization.__type`: unchanged
-- `organization.name`: unchanged
-- `organization.url`: unchanged
-- `organization.address`: unchanged
-- `organization.email`: unchanged
-- `organization.alternativeName`: unchanged
-- `organization.authorityRefs`: unchanged
+- Changed type of `abstract`/`description` in `dataset` to `lang_string`.
+- Changed `abstract`/`description` in `dataset` from multi-valued (1-n) to single-valued.
+- Changed cardinality of `howToCite` in `dataset` to 0-1.
+- Changed cardinality of `description` in `dataset` to 0-1.
+- Removed `accessConditions` from `dataset`.
+- Removed `status` from `dataset`.
+- Renamed `abstract` to `description` in `dataset`.
+- Removed `languages` from `dataset`.
+- Removed `datePublished` and `dateModified` from `dataset`.
+- Removed `distribution` from `dataset`.
+- Removed `additional` from `dataset`.
+- Removed `alternativeTitles` from `dataset`.
+- Removed `urls` from `dataset`.