diff --git a/docs/building/handling-missing-values.rst b/docs/building/handling-missing-values.rst
index adc07cba..7ee2f127 100644
--- a/docs/building/handling-missing-values.rst
+++ b/docs/building/handling-missing-values.rst
@@ -2,16 +2,19 @@
  Handling missing values
 #########################
 
-When handling data for machine learning models, missing values (NaNs)
+When handling data for machine learning models, missing values (`NaNs`)
 can pose a challenge, as models require complete data to operate
 effectively and may crash otherwise. Ideally, we anticipate having
-complete data in all fields. However, there are scenarios where NaNs
-naturally occur, such as with variables only relevant on land or at sea
-(such as sea surface temperature (`sst`), for example). In such cases,
-the default behavior is to reject data with NaNs as invalid. To
-accommodate NaNs and accurately compute statistics based on them, you
-can include the `allow_nans` key in the configuration. Here's an example
-of how to implement it:
+complete data in all fields.
+
+However, there are scenarios where `NaNs` naturally occur, such as with
+variables only relevant on land or at sea. This happens for sea surface
+temperature (`sst`), for example. In such cases, the default behavior is
+to reject data with `NaNs` as invalid. To accommodate `NaNs` and
+accurately compute statistics based on them, you can include the
+``allow_nans`` key in the configuration.
+
+Here's an example of how to implement it:
 
 .. literalinclude:: yaml/nan.yaml
    :language: yaml
diff --git a/docs/building/sources.rst b/docs/building/sources.rst
index 99fb8415..3e5e8aee 100644
--- a/docs/building/sources.rst
+++ b/docs/building/sources.rst
@@ -4,6 +4,17 @@
  Sources
 #########
 
+A source is a software component that, given a list of dates and
+variables, returns the corresponding fields.
+
+A `source` is responsible for reading data from the underlying archive
+or files and converting it to a set of fields. It is also responsible
+for handling the metadata of the data, such as the variable names, and
+more.
+
+Examples of sources are ECMWF’s MARS archive, a collection of GRIB or
+NetCDF files, etc.
+
 The following `sources` are currently available:
 
 .. toctree::
diff --git a/docs/building/statistics.rst b/docs/building/statistics.rst
index 2d6915dd..4c561ee3 100644
--- a/docs/building/statistics.rst
+++ b/docs/building/statistics.rst
@@ -8,17 +8,20 @@
 it is created. These statistics are intended to be used to normalise the
 data during training.
 
-The statistics are stored in the `statistics` attribute of the dataset.
-The computed statistics include:
+The statistics are stored in the :doc:`statistics attribute
+<../using/statistics>` of the dataset. The computed statistics include
+`minimum, maximum, mean, standard deviation`.
 
-- Minimum
-- Maximum
-- Mean
-- Standard deviation
+************************
+ Statistics dates range
+************************
 
 By defaults, the statistics are not computed on the whole dataset, but
-on a subset of dates. The subset is defined using the following
-algorithm:
+on a subset of dates. This is usually done to avoid any data leakage
+from the validation and test sets to the training set.
+
+The subset of dates used to compute the statistics is defined using the
+following algorithm:
 
 - If the dataset covers 20 years or more, the last 3 years are
   excluded.
@@ -51,3 +54,12 @@ Example configuration gathering statistics using only 2020 data :
    statistics:
      start: 2020
      end: 2020
+
+**************************
+ Data with missing values
+**************************
+
+If the dataset contains missing values (known as `NaNs`), an error will
+be raised when trying to compute the statistics. To allow `NaNs` in the
+dataset, you can set the `allow_nans` key as described :doc:`here
+`.
diff --git a/docs/cli/compare.rst b/docs/cli/compare.rst
index be4d0252..f8604293 100644
--- a/docs/cli/compare.rst
+++ b/docs/cli/compare.rst
@@ -1,7 +1,15 @@
 compare
 =======
 
-Use this command to compatre two datasets:
+Use this command to compare two datasets.
+
+The command runs a quick comparison of the two datasets and outputs a summary of the differences.
+
+.. warning::
+
+    This command does not compare the data in the datasets, only some of the metadata.
+    Subsequent versions of this command may include more detailed comparisons.
+
 
 .. argparse::
     :module: anemoi.datasets.__main__
diff --git a/docs/cli/copy.rst b/docs/cli/copy.rst
index 67394267..9413c1ec 100644
--- a/docs/cli/copy.rst
+++ b/docs/cli/copy.rst
@@ -16,7 +16,7 @@ The chunk pattern for the source dataset has been defined for good reasons, and
 
 .. warning::
 
-    When resuming the copying process (using ``--resume``), calling the script with the same arguments for --block-size and --rechunk is recommended.
+    When resuming the copying process (using ``--resume``), calling the script with the same arguments for ``--block-size`` and ``--rechunk`` is recommended.
     Using different values for these arguments to resume copying the same dataset may lead to unexpected behavior.
 
 
diff --git a/docs/cli/inspect.rst b/docs/cli/inspect.rst
index d07ad201..1c8876fb 100644
--- a/docs/cli/inspect.rst
+++ b/docs/cli/inspect.rst
@@ -2,9 +2,24 @@
 inspect
 =======
 
+Anemoi datasets are stored in Zarr format and can be located on a local file system or on a remote server.
+The `inspect` command is used to inspect the contents of a dataset.
+This command will output the metadata of the dataset, including the variables, dimensions, and attributes.
+
+.. code:: console
+
+    $ anemoi-datasets inspect dataset.zarr
+
+
+which will output something like the following. The output should be self-explanatory.
+
 .. literalinclude:: ../building/yaml/building1.txt
     :language: console
 
+*********************
+ Command line usage
+*********************
+
 .. argparse::
     :module: anemoi.datasets.__main__
     :func: create_parser
diff --git a/pyproject.toml b/pyproject.toml
index 034579ec..e3d15ec4 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -108,7 +108,6 @@ optional-dependencies.remote = [
 urls.Documentation = "https://anemoi-datasets.readthedocs.io/"
 urls.Homepage = "https://github.com/ecmwf/anemoi-datasets/"
 urls.Issues = "https://github.com/ecmwf/anemoi-datasets/issues"
-
 # Changelog = "https://github.com/ecmwf/anemoi-datasets/CHANGELOG.md"
 urls.Repository = "https://github.com/ecmwf/anemoi-datasets/"
 scripts.anemoi-datasets = "anemoi.datasets.__main__:main"