
Align NetCDF data product attributes with ACDD Metadata Standards #259

Open
ladsmund opened this issue Jun 19, 2024 · 1 comment

We need to update our current processing pipeline to align with the Attribute Convention for Data Discovery (ACDD) 1.3 guidelines. This will improve the consistency, discoverability, and interoperability of our datasets.

The convention defines a subset of Highly Recommended attributes, which we should prioritize.

In addition, I suggest we maintain a `source` attribute, and maybe a `product_version` attribute, for reproducibility and to determine the need for reprocessing.

https://wiki.esipfed.org/Attribute_Convention_for_Data_Discovery_1-3#Index_by_Attribute_Name

| Attribute | Current status | Target | Example |
| --- | --- | --- | --- |
| `id` | Hash string based on `stid`. This conflicts with the convention, since it is not unique across datasets. | Unique string per dataset: `stid` + level + sample rate. Could be an explicit string or a UUID. | `dk.geus.promice.station.daily.QAS_Lv3`, `dk.geus.promice.site.daily.QAS_L` |
| `title` | - | A short phrase or sentence describing the dataset. | Automatic weather station measurements from QAS_Lv3. Daily average |
| `summary` | - | A paragraph describing the dataset. | ... |
| `date_created` | The datetime when the script was executed. | The datetime when the script was executed. | `2024-06-19T05:12:55.594009` |
| `source` | - | Necessary information for reproducing the dataset, including versions of pypromice, data sources, and configurations. | `{'pypromice': 1.3.6, 'aws-l0': 2e1aa426246, 'aws-metadata': 132201a1}` |
| `product_version` | - | Version identifier that can be used to determine if reprocessing is necessary. Might be redundant with `source`. | |
| `institution` | GEUS | GEUS | Geological Survey of Denmark and Greenland (GEUS) |
| `date_issued` | Same as `date_created` | | |
| `date_modified` | Same as `date_created` | | |
| `processing_level` | A textual representation of the processing level. | Maybe fine as-is. | Level 2 |
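To make the target concrete, here is a minimal sketch of assembling the global-attribute mapping described in the table. The function name `acdd_global_attrs`, the parameter names, and the id scheme are all illustrative assumptions, not existing pypromice code; the attribute names and example values come from the table above.

```python
from datetime import datetime, timezone

def acdd_global_attrs(stid: str, level: str, sample: str) -> dict:
    """Assemble ACDD-style global attributes for a data product.

    Hypothetical helper; the id scheme follows the examples in the
    table (stid + level + sample rate), not a settled convention.
    """
    return {
        "id": f"dk.geus.promice.station.{sample}.{stid}v{level}",
        "title": (
            f"Automatic weather station measurements from {stid}v{level}. "
            f"{sample.capitalize()} average"
        ),
        "summary": "A paragraph describing the dataset.",
        # Datetime when the script was executed, in ISO 8601 form.
        "date_created": datetime.now(timezone.utc).isoformat(),
        "institution": "Geological Survey of Denmark and Greenland (GEUS)",
        "processing_level": f"Level {level}",
        # Versions of pypromice and input data sources (illustrative).
        "source": "{'pypromice': 1.3.6, 'aws-l0': 2e1aa426246}",
    }

attrs = acdd_global_attrs("QAS_L", "3", "daily")
# attrs["id"] → "dk.geus.promice.station.daily.QAS_Lv3"
```

A dict like this could then be written as global attributes on the output NetCDF file (e.g. via `xarray.Dataset.attrs`).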
@ladsmund (Contributor, Author) commented Jun 25, 2024:

We should also consider:

  1. IDs for source-level datasets such as `tx` and `raw`.
  2. IDs for different levels. Station datasets can be stored at both level 2 and level 3. Maybe level 3 could be implicit since it is the official output level.
  3. Making IDs unique for each iteration of a dataset. This makes it possible to precisely refer to the actual data used for analysis and processing. We should use dataset IDs extensively in our pipeline to determine whether an output has already been processed. We can use information about the input datasets, pypromice version, etc., to make the iteration ID deterministic.

uuid3 is a hash-based UUID generator: it derives a 128-bit number, designed to be globally unique, from a namespace and an input string. The output depends solely on the namespace and input string, and will always return the same value for the same input. A benefit of using a hash function for the IDs is that it controls and limits the format of the ID string. This might be especially relevant for point (3).
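A minimal sketch of this idea, using the standard library `uuid` module. The namespace value and the `dataset_id` helper are assumptions for illustration; the key layout (`stid` + level + sample + iteration) follows the points above.

```python
import uuid

# Hypothetical project namespace; any fixed UUID works as a namespace.
PROMICE_NAMESPACE = uuid.uuid3(uuid.NAMESPACE_DNS, "promice.dk")

def dataset_id(stid: str, level: str, sample: str, iteration: str = "") -> str:
    """Deterministic dataset ID: same inputs always yield the same UUID.

    `iteration` can encode input-dataset versions, pypromice version,
    etc., so each iteration of a dataset gets its own stable ID.
    """
    key = f"{stid}.{level}.{sample}.{iteration}"
    return str(uuid.uuid3(PROMICE_NAMESPACE, key))

# Same inputs give the same ID; any change in level/iteration gives a new one.
assert dataset_id("QAS_L", "l3", "daily") == dataset_id("QAS_L", "l3", "daily")
assert dataset_id("QAS_L", "l3", "daily") != dataset_id("QAS_L", "l2", "daily")
```

The fixed 36-character UUID format is what limits and controls the ID string, and the determinism is what lets the pipeline check whether a given output iteration has already been processed.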

#252 (comment)
