diff --git a/CHANGELOG.md b/CHANGELOG.md index 240ea3d8..1bf9fcd2 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -5,6 +5,8 @@ Log of changes in the versions ## v0.13.0 - scale and offset is now implemented in the package is should not longer be defined in a convention. - bugfix normalization extension +- bugfix exporting xr.DataArray built with the toolbox to netCDF +- support the use of IRIs to describe metadata ## v0.12.2 diff --git a/README.md b/README.md index e2e7f635..654a0fcc 100644 --- a/README.md +++ b/README.md @@ -12,13 +12,13 @@ HDF5 to achieve a sustainable data lifecycle which follows the [FAIR (Findable, Accessible, Interoperable, Reusable)](https://www.nature.com/articles/sdata201618) principles. It specifically supports the five main steps of -1. Planning (defining a internal layout for HDF5 a metadata convention for attribute usage) +1. Planning (defining an internal layout for HDF5 and a metadata convention or ontology for attribute usage) 2. Collecting data (creating HDF5 files or converting to HDF5 files from other sources) 3. Analyzing and processing data (Plotting, deriving data, ...) 4. Sharing data (publishing, archiving, ... e.g. to databases like [mongoDB](https://www.mongodb.com/) or repositories like [Zenodo](https://zenodo.org/)) 5. Reusing data (Searching data in databases, local file structures or online repositories - like [Zenodo](https://zenodo.org)). + like [Zenodo](https://zenodo.org), discovering metadata based on persistent identifiers such as IRIs). ## Quickstart diff --git a/docs/colab/quickstart.ipynb b/docs/colab/quickstart.ipynb index 7da53d82..c3b043ae 100644 --- a/docs/colab/quickstart.ipynb +++ b/docs/colab/quickstart.ipynb @@ -8,7 +8,7 @@ }, "outputs": [], "source": [ - "# !pip install h5rdmtoolbox==0.10.0" + "# !pip install h5rdmtoolbox" ] }, { @@ -19,33 +19,21 @@ "\n", "- Decide to use HDF5 as your core file format\n", "- Define important attributes and their usage in a metadata convention (e.g. 
a YAML file)\n", - "- Publish your convention on a repository like [Zenodo](https://zenodo.org/)" + "- Publish your convention on a repository like [Zenodo](https://zenodo.org/)\n", + "\n", + "We assume these steps have already been done, so we'll use an existing convention published on Zenodo:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "using(\"h5rdmtoolbox-tutorial-convention\")" - ] - }, - "execution_count": 2, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ "import h5rdmtoolbox as h5tbx\n", "\n", "# Assume we published a convention here: https://zenodo.org/record/8281285\n", - "cv = h5tbx.conventions.from_zenodo(doi='10156750')\n", - "\n", - "# enable the convention:\n", - "h5tbx.use(cv)" + "cv = h5tbx.conventions.from_zenodo(doi='10156750')" ] }, { @@ -54,6 +42,8 @@ "source": [ "# 2. Collecting\n", "\n", + "We start by writing data to an HDF5 file. The syntax is almost identical to `h5py`, with a few extra features wrapped around it.\n", + "\n", "- Fill an HDF5 file with the required data and mandatory metadata\n", "- Data may come in various sources, e.g. from a measurement, a simulation or a database\n", "- HDF5 is best for multidimensional data, but can also be used for 1D data\n", @@ -61,10 +51,38 @@ " or the datasets and groups\n" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Before we can start writing data to the file, we must enable the convention. 
This changes the behaviour of methods like `create_dataset`, which now require parameters such as `units`:" + ] + }, { "cell_type": "code", "execution_count": 3, "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "using(\"h5rdmtoolbox-tutorial-convention\")" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# enable the convention:\n", + "h5tbx.use(cv)" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, "outputs": [], "source": [ "filename = 'my_file.hdf'\n", @@ -82,7 +100,35 @@ " data=np.random.normal(10, 2, 1000),\n", " standard_name='x_velocity',\n", " units='m/s',\n", - " attach_scale='time')" + " attach_scale='time')\n", + " h5['u'].dims[0].attach_scale(h5['time'])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Refining the metadata\n", + "\n", + "We have already provided quite a bit of metadata with the file. Although attributes like \"units\" are quite self-explanatory, let's associate them with persistent identifiers. This makes the metadata *FAIR*:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "with h5tbx.File(filename) as h5:\n", + " h5.u.attrs.iri['units'] = 'http://qudt.org/schema/qudt/Unit'\n", + " h5.attrs.iri['contact'] = 'http://w3id.org/nfdi4ing/metadata4ing#ContactPerson'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the next step, we will see one practical effect of assigning IRIs to the metadata." ] }, { @@ -96,7 +142,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 6, "metadata": {}, "outputs": [ { @@ -290,43 +336,27 @@ "\n", "
<xarray.DataArray 'u' (time: 1000)>\n", - "4.344 11.49 10.58 12.29 6.109 9.985 8.054 ... 3.853 10.36 10.43 8.935 11.1 10.36\n", + "13.58 6.293 11.32 8.755 9.991 7.003 12.74 ... 8.022 9.582 12.16 9.29 8.712 11.78\n", "Coordinates:\n", " * time (time) float64 0.0 0.001001 0.002002 0.003003 ... 0.998 0.999 1.0\n", "Attributes:\n", " standard_name: x_velocity\n", - " units: m/s