Merge branch 'develop' of github.com:ecmwf/anemoi-datasets into develop
b8raoult committed May 30, 2024
2 parents 5926ae9 + d8f9cae commit 07a5d69
Showing 7 changed files with 67 additions and 19 deletions.
19 changes: 11 additions & 8 deletions docs/building/handling-missing-values.rst
@@ -2,16 +2,19 @@
Handling missing values
#########################

When handling data for machine learning models, missing values (NaNs)
When handling data for machine learning models, missing values (`NaNs`)
can pose a challenge, as models require complete data to operate
effectively and may crash otherwise. Ideally, we anticipate having
complete data in all fields. However, there are scenarios where NaNs
naturally occur, such as with variables only relevant on land or at sea
(such as sea surface temperature (`sst`), for example). In such cases,
the default behavior is to reject data with NaNs as invalid. To
accommodate NaNs and accurately compute statistics based on them, you
can include the `allow_nans` key in the configuration. Here's an example
of how to implement it:
complete data in all fields.

However, there are scenarios where `NaNs` naturally occur, such as with
variables only relevant on land or at sea. This happens for sea surface
temperature (`sst`), for example. In such cases, the default behavior is
to reject data with `NaNs` as invalid. To accommodate `NaNs` and
accurately compute statistics based on them, you can include the
``allow_nans`` key in the configuration.

Here's an example of how to implement it:

.. literalinclude:: yaml/nan.yaml
   :language: yaml
11 changes: 11 additions & 0 deletions docs/building/sources.rst
@@ -4,6 +4,17 @@
Sources
#########

A source is a software component that, given a list of dates and
variables, returns the corresponding fields.

A `source` is responsible for reading data from where it is stored and
converting it to a set of fields. A `source` is also responsible for
handling the metadata of the data, such as the variable names, and
more.

Examples of sources include ECMWF’s MARS archive and collections of
GRIB or NetCDF files.

The following `sources` are currently available:

.. toctree::
28 changes: 20 additions & 8 deletions docs/building/statistics.rst
@@ -8,17 +8,20 @@
it is created. These statistics are intended to be used to normalise the
data during training.

The statistics are stored in the `statistics` attribute of the dataset.
The computed statistics include:
The statistics are stored in the :doc:`statistics attribute
<../using/statistics>` of the dataset. The computed statistics include
`minimum, maximum, mean, standard deviation`.

- Minimum
- Maximum
- Mean
- Standard deviation
************************
Statistics dates range
************************

By default, the statistics are not computed on the whole dataset, but
on a subset of dates. The subset is defined using the following
algorithm:
on a subset of dates. This is usually done to avoid any data leakage
from the validation and test sets to the training set.

The dates subset used to compute the statistics is defined using the
following algorithm:

- If the dataset covers 20 years or more, the last 3 years are
  excluded.
@@ -51,3 +54,12 @@ Example configuration gathering statistics using only 2020 data :
   statistics:
     start: 2020
     end: 2020
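
The effect of restricting the statistics to a date range can be sketched
with NumPy. This is only an illustration, not the anemoi-datasets
implementation: the synthetic daily dataset and the names ``dates``,
``values`` and ``mask`` are hypothetical, and it assumes that ``start:
2020`` and ``end: 2020`` select the whole of 2020, inclusive.

```python
import numpy as np

# Hypothetical daily dataset covering 2019-2021 (values are synthetic).
dates = np.arange("2019-01-01", "2022-01-01", dtype="datetime64[D]")
values = np.arange(dates.size, dtype=float)

# Keep only the dates in 2020 (assumed meaning of start: 2020 / end: 2020).
start = np.datetime64("2020-01-01")
end = np.datetime64("2020-12-31")
mask = (dates >= start) & (dates <= end)

# Statistics are computed on the subset only.
subset = values[mask]
mean, stdev = subset.mean(), subset.std()

# The whole dataset is then normalised with the subset statistics.
normalised = (values - mean) / stdev
```

Dates outside the selected range still get normalised, but they do not
influence the mean and standard deviation used to do so.
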
**************************
Data with missing values
**************************

If the dataset contains missing values (known as `NaNs`), an error will
be raised when trying to compute the statistics. To allow `NaNs` in the
dataset, you can set the ``allow_nans`` key as described :doc:`here
</building/handling-missing-values>`.
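
The behaviour described above can be illustrated with a minimal NumPy
sketch. It is not the anemoi-datasets implementation: the
``compute_statistics`` helper and its ``allow_nans`` argument are
hypothetical names chosen to mirror the configuration key, and the
statistic names in the returned dictionary are assumptions.

```python
import numpy as np


def compute_statistics(values, allow_nans=False):
    """Return minimum, maximum, mean and standard deviation of ``values``.

    Hypothetical helper: NaNs are rejected unless ``allow_nans`` is True,
    in which case the NaN-aware NumPy reductions ignore them.
    """
    values = np.asarray(values, dtype=float)
    if np.isnan(values).any() and not allow_nans:
        raise ValueError("NaNs found in data; set allow_nans=True to accept them")
    return {
        "minimum": np.nanmin(values),
        "maximum": np.nanmax(values),
        "mean": np.nanmean(values),
        "stdev": np.nanstd(values),
    }


# Sea surface temperature is undefined over land, hence the NaN.
sst = np.array([290.0, np.nan, 292.0, 294.0])
stats = compute_statistics(sst, allow_nans=True)
```

Calling ``compute_statistics(sst)`` without ``allow_nans=True`` raises a
``ValueError`` in this sketch, mirroring the default rejection of data
with `NaNs`.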
10 changes: 9 additions & 1 deletion docs/cli/compare.rst
@@ -1,7 +1,15 @@
compare
=======

Use this command to compatre two datasets:
Use this command to compare two datasets.

The command will run a quick comparison of the two datasets and output a summary of the differences.

.. warning::

   This command will not compare the data in the datasets, only some of the metadata.
   Subsequent versions of this command may include more detailed comparisons.


.. argparse::
   :module: anemoi.datasets.__main__
2 changes: 1 addition & 1 deletion docs/cli/copy.rst
@@ -16,7 +16,7 @@ The chunk pattern for the source dataset has been defined for good reasons, and

.. warning::

   When resuming the copying process (using ``--resume``), calling the script with the same arguments for --block-size and --rechunk is recommended.
   When resuming the copying process (using ``--resume``), calling the script with the same arguments for ``--block-size`` and ``--rechunk`` is recommended.
   Using different values for these arguments to resume copying the same dataset may lead to unexpected behavior.


15 changes: 15 additions & 0 deletions docs/cli/inspect.rst
@@ -2,9 +2,24 @@ inspect
=======


Anemoi datasets are stored in a zarr format and can be located on a local file system or on a remote server.
The `inspect` command is used to inspect the contents of a dataset.
This command will output the metadata of the dataset, including the variables, dimensions, and attributes.

.. code:: console

   $ anemoi-datasets inspect dataset.zarr

which will output something like the following. The output should be self-explanatory.

.. literalinclude:: ../building/yaml/building1.txt
   :language: console

*********************
Command line usage
*********************

.. argparse::
   :module: anemoi.datasets.__main__
   :func: create_parser
1 change: 0 additions & 1 deletion pyproject.toml
@@ -108,7 +108,6 @@ optional-dependencies.remote = [
urls.Documentation = "https://anemoi-datasets.readthedocs.io/"
urls.Homepage = "https://github.com/ecmwf/anemoi-datasets/"
urls.Issues = "https://github.com/ecmwf/anemoi-datasets/issues"

# Changelog = "https://github.com/ecmwf/anemoi-datasets/CHANGELOG.md"
urls.Repository = "https://github.com/ecmwf/anemoi-datasets/"
scripts.anemoi-datasets = "anemoi.datasets.__main__:main"