super.json documentation #485

Closed
CPernet opened this issue Aug 14, 2024 · 9 comments

Comments

@CPernet

CPernet commented Aug 14, 2024

Hi guys,

I could not find documentation about the metadata/super.json.
How do you fill it for multiple datasets?

{dataset1},{dataset2}
or
{{dataset1},{dataset2}}
did not work for me

thx

@mslw
Collaborator

mslw commented Aug 14, 2024

AFAIK super.json is just to say which dataset opens as a "main page" (or "landing page") i.e. which ID and version get filled in if you only go to your base URL.

So I think the literal answer is "there can be just one". The next question is, what is your intended purpose of declaring multiple datasets? If it's listing these datasets on the landing page, then the catalog's solution is to add those as subdatasets of your superdataset (they will then be shown in its Datasets tab).
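
To illustrate why neither attempted form could work, here is a minimal Python sketch. It assumes super.json holds a single JSON object carrying the ID and version of the home dataset; the key names `dataset_id` and `dataset_version` are an assumption based on the description above, not copied from the catalog source. Under that assumption, both multi-dataset spellings already fail at the JSON parsing stage:

```python
import json

# Hypothetical file contents; the key names are an assumption for illustration.
single = '{"dataset_id": "meta", "dataset_version": "v1"}'
side_by_side = (
    '{"dataset_id": "a", "dataset_version": "v1"},'
    '{"dataset_id": "b", "dataset_version": "v1"}'
)
nested_braces = "{" + side_by_side + "}"


def is_valid_json(text):
    """Return True if text parses as exactly one JSON document."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


results = {
    "single": is_valid_json(single),
    "side_by_side": is_valid_json(side_by_side),
    "nested_braces": is_valid_json(nested_braces),
}
print(results)
```

Only the single-object form parses, which is consistent with "there can be just one".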

@CPernet
Author

CPernet commented Aug 14, 2024

Hi Michał,

yes the goal is to list many datasets, some having sub-datasets too. For instance

- dataset1
- dataset2
|-- subdataset2-1
|-- subdataset2-2

so If I understand your answer, I need to create my metadataset

-metadataset
|- dataset1
|- dataset2
  |-- subdataset2-1
  |-- subdataset2-2

and the super.json only has to list metadataset

the next question is how to make sub-datasets? (then nesting becomes obvious) I see in the example type: 'redirect' but nothing in the documentation itself and uncertain about that

@mslw
Collaborator

mslw commented Aug 14, 2024

Yes, in terms of catalog inputs, the metadataset needs to have a subdatasets property (listing dataset1 and dataset2 in your example). I am afraid the "Datasets" tab might not support showing the dataset tree (subdataset2-[1,2]); the only way to see them might be clicking through to the dataset2 page.

@CPernet
Author

CPernet commented Aug 14, 2024

ok I see the subdataset, but I'm pretty sure @jsheunis or @mih showed me an example of catalogue with multiple datasets at the landing page -- hopefully that is still possible

I can already move on with my subdatasets anyway - thx

@CPernet
Author

CPernet commented Aug 14, 2024

well actually not obvious how to set the id ?? for instance

{"type":"dataset","name":"superdataset","dataset_id":"PN000003",...}
{"type":"subdataset","name":"subdataset1","dataset_id":"PN000003",...}
{"type":"file","dataset_id":"PN000003",...}
{"type":"subdataset","name":"subdataset2","dataset_id":"PN000003",...}
{"type":"file","dataset_id":"PN000003",...}

this will not work since file should point to the subdataset_id, unless there is a subdataset_id we can specify, like:

{"type":"dataset","name":"superdataset","dataset_id":"PN000003",...}
{"type":"subdataset","name":"subdataset1","dataset_id":"PN000003","subdataset_id":"PN000003-1",...}
{"type":"file","dataset_id":"PN000003","subdataset_id":"PN000003-1",...}
{"type":"subdataset","name":"subdataset2","dataset_id":"PN000003","subdataset_id":"PN000003-2",...}
{"type":"file","dataset_id":"PN000003","subdataset_id":"PN000003-2",...}

Is there a list or file where I can look up which values are allowed in the schemas?

thx

@jsheunis
Member

Hey @CPernet

I think there's some confusion about the super-dataset sub-dataset relationship, and the main page of the catalog. Like @mslw said, the super.json file only points to the id and version of the dataset that should be displayed as the homepage of the catalog. This can be set via the command line using:

datalad catalog-set --catalog my-cat --dataset-id mydataset1234 --dataset-version myversion6578 home

This does nothing with regard to subdatasets showing on the main page. The subdatasets showing up in the "Datasets" tab on any dataset page (the main page or any other) are all taken from the metadata of that specific dataset (i.e. not from super.json).

So if your metadataset is set as your catalog's homepage, then the homepage will show the subdatasets of the metadataset, i.e. dataset1 and dataset2, in the "Datasets" tab.
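
The relationship described above can be modelled in a few lines of Python. This is only an illustrative sketch, not the catalog's actual implementation: the function name `home_subdatasets`, the in-memory `records` list, and the key names in `super_entry` are all assumptions made for the example.

```python
def home_subdatasets(super_entry, records):
    """Return the subdataset IDs listed by the dataset that super_entry points to.

    super_entry stands in for super.json (hypothetical key names); records
    stands in for the dataset-level metadata items added to the catalog.
    """
    for rec in records:
        if (
            rec.get("type") == "dataset"
            and rec.get("dataset_id") == super_entry["dataset_id"]
            and rec.get("dataset_version") == super_entry["dataset_version"]
        ):
            # The list comes from the home dataset's own metadata only.
            return [sub["dataset_id"] for sub in rec.get("subdatasets", [])]
    return []


super_entry = {"dataset_id": "metadataset", "dataset_version": "v1"}
records = [
    {
        "type": "dataset",
        "dataset_id": "metadataset",
        "dataset_version": "v1",
        "subdatasets": [
            {"dataset_id": "dataset1", "dataset_version": "v1", "dataset_path": "dataset1"},
            {"dataset_id": "dataset2", "dataset_version": "v1", "dataset_path": "dataset2"},
        ],
    },
    {
        "type": "dataset",
        "dataset_id": "dataset2",
        "dataset_version": "v1",
        "subdatasets": [
            {"dataset_id": "subdataset2-1", "dataset_version": "v1", "dataset_path": "subdataset2-1"},
        ],
    },
]

listed = home_subdatasets(super_entry, records)
print(listed)
```

In this model, subdataset2-1 does not appear on the homepage because it is listed in the metadata of dataset2, not in that of the metadataset.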

Getting those two datasets listed as subdatasets of the metadataset is done via the catalog-add command. Any metadata added via this command has to conform to the catalog schema, and you can see exactly which fields are available and/or required here: https://github.com/datalad/datalad-catalog/tree/main/datalad_catalog/catalog/schema

(I'm sorry about the broken documentation links here: http://docs.datalad.org/projects/catalog/en/latest/catalog_schema.html, I've created an issue for this and will fix it soon: #488)

Specifically: https://github.com/datalad/datalad-catalog/blob/main/datalad_catalog/catalog/schema/jsonschema_dataset.json#L233 shows what format the subdataset property of a metadata item should have if adding a subdataset to that dataset.
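
Before adding a record, a quick local sanity check of its subdataset entries can catch obvious omissions. The sketch below is not schema validation: it assumes each entry needs the keys `dataset_id`, `dataset_version` and `dataset_path`, which should be verified against the linked schema file, and the helper name `subdataset_problems` is hypothetical.

```python
# Assumed minimal key set for one subdataset entry; check the schema file
# linked above for the authoritative list of fields.
ASSUMED_KEYS = ("dataset_id", "dataset_version", "dataset_path")


def subdataset_problems(record):
    """Return human-readable problems found in record['subdatasets']."""
    problems = []
    for index, sub in enumerate(record.get("subdatasets", [])):
        missing = [key for key in ASSUMED_KEYS if key not in sub]
        if missing:
            problems.append(f"subdatasets[{index}] is missing: {', '.join(missing)}")
    return problems


good = {
    "type": "dataset",
    "dataset_id": "metadataset",
    "subdatasets": [
        {"dataset_id": "dataset1", "dataset_version": "v1", "dataset_path": "dataset1"},
    ],
}
bad = {
    "type": "dataset",
    "dataset_id": "metadataset",
    "subdatasets": [{"dataset_id": "dataset1"}],
}

print(subdataset_problems(good))
print(subdataset_problems(bad))
```

A proper check would validate the whole record against the JSON schema files in the repository; this only covers the shape of the subdataset list.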

Lastly, yes, there are metadata files that are specified as redirects, although this is not strictly necessary. This functionality was added as part of PR #430 (see the PR description for a detailed explanation and how to use it). So I would suggest not going that route from the start: just add dataset-level metadata (including a list of its subdatasets) as usual, and only use the script mentioned in the PR later if you want the redirect functionality.

@mslw
Collaborator

mslw commented Aug 14, 2024

(typed at the same time, so I'll post anyway as it seems complementary)

Try something more like:

Meta-dataset (this would be selected in super.json):

{
  "type": "dataset",
  "dataset_id": "meta",
  "subdatasets": [
    {
      "dataset_id": "PN000003",
      "dataset_version": "some",
      "dataset_path": "PN000003"
    },
    {
      "dataset_id": "PN000004",
      "dataset_version": "some",
      "dataset_path": "PN000004"
    }
  ]
}

And each (sub)dataset (PN000003, PN000004 and their children) as a dataset in its own right:

{
  "type": "dataset",
  "dataset_id": "PN000003",
  "subdatasets": [
    {
      "dataset_id": "PN000003-1",
      "dataset_version": "some",
      "dataset_path": "PN000003-1"
    }
  ]
}

(typed by hand, so there will probably be missing properties or errors but you get the point).
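
For a larger tree, records of this shape could be generated rather than typed by hand. The following Python sketch makes several assumptions: the input format (`id` / `children` dicts) and the function name `build_records` are invented for illustration, the subdataset path is simply set to the child's ID as in the hand-typed example, and any further properties the schema requires are left out.

```python
import json


def build_records(node, version="some"):
    """Yield one dataset record for node, then records for all nested children."""
    children = node.get("children", [])
    record = {
        "type": "dataset",
        "dataset_id": node["id"],
        "dataset_version": version,
    }
    if children:
        # Each dataset lists only its direct children as subdatasets.
        record["subdatasets"] = [
            {
                "dataset_id": child["id"],
                "dataset_version": version,
                "dataset_path": child["id"],  # assumption: path equals ID
            }
            for child in children
        ]
    yield record
    for child in children:
        yield from build_records(child, version)


# Hypothetical dataset tree mirroring the structure discussed in this thread.
tree = {
    "id": "meta",
    "children": [
        {"id": "PN000003", "children": [{"id": "PN000003-1"}, {"id": "PN000003-2"}]},
        {"id": "PN000004"},
    ],
}

records = list(build_records(tree))
# One JSON object per line, i.e. one metadata item per line.
lines = [json.dumps(record) for record in records]
print("\n".join(lines))
```

Every node becomes a dataset with its own `dataset_id`, so no separate subdataset ID is needed; the parent-child link lives only in the parent's `subdatasets` list.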

@CPernet
Author

CPernet commented Aug 15, 2024

thx guys, what I'd prefer is that the home page lists the datasets separately, without landing on any one in particular, but if I understand correctly that is not how this works, i.e. create a fake meta data of the repo that lists all the datasets as children is the way to do it

@jsheunis
Member

create a fake meta data of the repo that lists all the datasets as children is the way to do it

Exactly. Here are some examples:

CPernet closed this as completed Sep 13, 2024