Merge from develop
sandorkertesz committed Jun 3, 2024
2 parents 40b76b8 + 922a9b3 commit 2a9c1f7
Showing 32 changed files with 514 additions and 368 deletions.
21 changes: 16 additions & 5 deletions .pre-commit-config.yaml
@@ -13,19 +13,19 @@ repos:


- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.4.0
rev: v4.6.0
hooks:
- id: check-yaml # Check YAML files for syntax errors only
args: [--unsafe, --allow-multiple-documents]
- id: debug-statements # Check for debugger imports and py37+ breakpoint()
- id: end-of-file-fixer # Ensure files end in a newline
- id: trailing-whitespace # Trailing whitespace checker
# - id: no-commit-to-branch # Prevent committing to main / master
- id: no-commit-to-branch # Prevent committing to main / master
- id: check-added-large-files # Check for large files added to git
- id: check-merge-conflict # Check for files that contain merge conflict

- repo: https://github.com/psf/black-pre-commit-mirror
rev: 24.1.1
rev: 24.4.2
hooks:
- id: black
args: [--line-length=120]
@@ -41,7 +41,7 @@


- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.3.0
rev: v0.4.6
hooks:
- id: ruff
exclude: '(dev/.*|.*_)\.py$'
@@ -51,7 +51,6 @@
- --exit-non-zero-on-fix
- --preview


- repo: https://github.com/sphinx-contrib/sphinx-lint
rev: v0.9.1
hooks:
@@ -62,9 +61,21 @@
rev: v0.0.14
hooks:
- id: rstfmt
exclude: 'cli/.*' # Because we use argparse

- repo: https://github.com/b8raoult/pre-commit-docconvert
rev: "0.1.4"
hooks:
- id: docconvert
args: ["numpy"]

- repo: https://github.com/b8raoult/optional-dependencies-all
rev: "0.0.2"
hooks:
- id: optional-dependencies-all
args: ["--inplace", "--all-key", "all", "--exclude-keys", "dev,docs"]

- repo: https://github.com/tox-dev/pyproject-fmt
rev: "2.1.3"
hooks:
- id: pyproject-fmt
19 changes: 11 additions & 8 deletions docs/building/handling-missing-values.rst
@@ -2,16 +2,19 @@
Handling missing values
#########################

When handling data for machine learning models, missing values (NaNs)
When handling data for machine learning models, missing values (`NaNs`)
can pose a challenge, as models require complete data to operate
effectively and may crash otherwise. Ideally, we anticipate having
complete data in all fields. However, there are scenarios where NaNs
naturally occur, such as with variables only relevant on land or at sea
(such as sea surface temperature (`sst`), for example). In such cases,
the default behavior is to reject data with NaNs as invalid. To
accommodate NaNs and accurately compute statistics based on them, you
can include the `allow_nans` key in the configuration. Here's an example
of how to implement it:
complete data in all fields.

However, there are scenarios where `NaNs` naturally occur, such as with
variables only relevant on land or at sea. This happens for sea surface
temperature (`sst`), for example. In such cases, the default behavior is
to reject data with `NaNs` as invalid. To accommodate `NaNs` and
accurately compute statistics based on them, you can include the
``allow_nans`` key in the configuration.

Here's an example of how to implement it:

.. literalinclude:: yaml/nan.yaml
:language: yaml
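
For reference, here is a minimal sketch of what such a configuration
fragment might look like. This is a hedged example: the variable names
(`sst`, `ci`) are illustrative, and the exact placement of the
``allow_nans`` key is defined by the recipe schema.

.. code:: yaml

   # illustrative fragment; key placement may differ in the real schema
   statistics:
     allow_nans: [sst, ci]
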
11 changes: 11 additions & 0 deletions docs/building/sources.rst
@@ -4,6 +4,17 @@
Sources
#########

The source is a software component that, given a list of dates and
variables, will return the corresponding fields.

A `source` is responsible for reading data from its origin and
converting it to a set of fields. A `source` is also responsible for
handling the metadata of the data, such as the variable names, and
more.

Examples of sources are ECMWF's MARS archive, a collection of GRIB or
NetCDF files, etc.
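
As an illustration, a recipe's ``input`` section selects one of these
sources. The sketch below is hedged: `mars` is one of the available
sources, but the request keys shown (``param``, ``levtype``, ``grid``)
are assumptions for illustration.

.. code:: yaml

   # illustrative only; see each source's page for its actual options
   input:
     mars:
       param: [2t, msl]
       levtype: sfc
       grid: [0.25, 0.25]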

The following `sources` are currently available:

.. toctree::
10 changes: 5 additions & 5 deletions docs/building/sources/accumulations.rst
@@ -24,9 +24,9 @@ dataset is unknown, the package assumes that the fields to use are
accumulated since the beginning of the forecast, over a 6h period.

The user can specify the desired accumulation period with the
``accumulation_period`` parameter. If its value is a single interger,
the source will attempt to accumulate the variables over that period.
This does not always mean that the data used is accumulated from the
``accumulation_period`` parameter. If its value is a single integer, the
source will attempt to accumulate the variables over that period. This
does not always mean that the data used is accumulated from the
beginning of the forecast, but the most recent data available will be
used:

@@ -44,13 +44,13 @@ used:
If the ``accumulation_period`` value is a pair of integers `[step1,
step2]`, the algorithm is different. The source will compute the
accumulation between the `step1` and `step2` previous forecast that
valiate at the given date at `step2`. For example, if the accumulation
validate at the given date at `step2`. For example, if the accumulation
period is `[6, 12]`, and the valid date is 2020-10-10 18:00, the source
will use the forecast of 2020-10-10 06:00 and the steps 6h and 12h.

Please note that ``accumulation_period=6`` and ``accumulation_period=[0,
6]`` are not equivalent. In the first case, the source can return an
accumulation bwteen step 1h and step 7h if it is the most appropriate
accumulation between step 1h and step 7h if it is the most appropriate
data available, while in the second case, the source will always return
the accumulation between step 0h and step 6h, if available.
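
As a sketch, a recipe fragment requesting the accumulation between
steps 6h and 12h might look like the following. This is hedged:
``accumulation_period`` is the parameter described above, while the
variable name and the surrounding keys are assumptions for
illustration.

.. code:: yaml

   # illustrative fragment; the surrounding keys are assumptions
   input:
     accumulations:
       param: [tp]
       accumulation_period: [6, 12]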

2 changes: 2 additions & 0 deletions docs/building/sources/yaml/hindcasts.yaml
@@ -0,0 +1,2 @@
---
# TODO
@@ -21,6 +21,6 @@ data_sources:
stream: oper

input:
perturbations:
center: ${data_sources.center_source}
recentre:
centre: ${data_sources.center_source}
members: ${data_sources.members_source}
47 changes: 41 additions & 6 deletions docs/building/statistics.rst
@@ -8,23 +8,58 @@
it is created. These statistics are intended to be used to normalise the
data during training.

The statistics are stored in the :doc:`statistics attribute
<../using/statistics>` of the dataset. The computed statistics include
`minimum, maximum, mean, standard deviation`.

************************
Statistics dates range
************************

By default, the statistics are not computed on the whole dataset, but
on a subset of dates. The subset is defined using the following
algorithm:
on a subset of dates. This is usually done to avoid any data leakage
from the validation and test sets to the training set.

The subset of dates used to compute the statistics is defined using
the following algorithm:

- If the dataset covers 20 years or more, the last 3 years are
excluded.
- If the dataset covers 10 years or more, the last year is excluded.
- Otherwise, 80% of the dataset is used.

You can override this behaviour by setting the `start` or `end`
parameters in the `statistics` config.
You can override this behaviour by setting either the `start`
parameter or the `end` parameter in the `statistics` config.

Example configuration gathering statistics from 2000 to 2020:

.. code:: yaml

   statistics:
     start: 2000
     end: 2020

..
   TODO: List the statistics that are computed

Example configuration gathering statistics from the beginning of the
dataset period to 2020:

.. code:: yaml

   statistics:
     end: 2020

Example configuration gathering statistics using only 2020 data:

.. code:: yaml

   statistics:
     start: 2020
     end: 2020

**************************
Data with missing values
**************************

If the dataset contains missing values (known as `NaNs`), an error will
be raised when trying to compute the statistics. To allow `NaNs` in the
dataset, you can set the `allow_nans` option as described :doc:`here
</building/handling-missing-values>`.
21 changes: 15 additions & 6 deletions docs/cli/compare.rst
@@ -1,9 +1,18 @@
#########
compare
#########
compare
=======

Use this command to compatre two datasets:
Use this command to compare two datasets.

.. code:: bash
The command will run a quick comparison of the two datasets and output a summary of the differences.

% anemoi-datasets compare dataset1.zarr dataset2.zarr
.. warning::

This command will not compare the data in the datasets, only some of the metadata.
Subsequent versions of this command may include more detailed comparisons.
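
For example, comparing two local datasets (the file names are
illustrative):

.. code:: console

   $ anemoi-datasets compare dataset1.zarr dataset2.zarr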


.. argparse::
:module: anemoi.datasets.__main__
:func: create_parser
:prog: anemoi-datasets
:path: compare
30 changes: 25 additions & 5 deletions docs/cli/copy.rst
@@ -1,7 +1,27 @@
######
copy
######
copy
====

.. code:: bash

% anemoi-datasets copy dataset1.zarr dataset2.zarr
Copying a dataset from one location to another can be error-prone and time-consuming.
This command-line script allows for incremental copying.
When the copying process fails, it can be resumed.
It can be used to copy files from a local directory to a remote server, or from a remote server to a local directory, as long as there is a Zarr backend to read and write the data.

The script uses multiple threads to make the process faster.
However, it is important to consider that making parallel requests to the same server may not be ideal, for instance if the server internally uses a limited number of threads to handle requests.

The option to rechunk the data is available, which can be useful when the data is stored on a platform that does not support having many small files or many files in the same directory.
However, keep in mind that rechunking has a huge impact on the performance when reading the data:
the chunk pattern for the source dataset has been defined for good reasons, and changing it is very likely to have a negative impact on the performance.

.. warning::

When resuming the copying process (using ``--resume``), calling the script with the same arguments for ``--block-size`` and ``--rechunk`` is recommended.
Using different values for these arguments to resume copying the same dataset may lead to unexpected behavior.
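
As an illustration, a possible sequence of invocations is shown below.
The paths are placeholders and the ``--block-size`` value is an
arbitrary example; only the flags themselves are taken from this page.

.. code:: console

   $ anemoi-datasets copy --block-size 100 source.zarr destination.zarr
   # if the transfer is interrupted, resume with the same options
   $ anemoi-datasets copy --resume --block-size 100 source.zarr destination.zarr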


.. argparse::
:module: anemoi.datasets.__main__
:func: create_parser
:prog: anemoi-datasets
:path: copy
16 changes: 9 additions & 7 deletions docs/cli/create.rst
@@ -1,9 +1,11 @@
########
create
########
create
======

Use this command to create a dataset from a recipe file:
Use this command to create a dataset from a recipe file.
The syntax of the recipe file is described in :doc:`building datasets <../building/introduction>`.
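
A typical invocation (the file names are illustrative):

.. code:: console

   $ anemoi-datasets create recipe.yaml dataset.zarr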

.. code:: bash
% anemoi-datasets create recipe.yaml dataset.zarr
.. argparse::
:module: anemoi.datasets.__main__
:func: create_parser
:prog: anemoi-datasets
:path: create
30 changes: 25 additions & 5 deletions docs/cli/inspect.rst
@@ -1,7 +1,27 @@
#########
inspect
#########
inspect
=======

.. code:: bash

% anemoi-datasets inspect dataset.zarr
Anemoi datasets are stored in the Zarr format and can be located on a local file system or on a remote server.
The `inspect` command is used to inspect the contents of a dataset.
This command will output the metadata of the dataset, including the variables, dimensions, and attributes.

.. code:: console

   $ anemoi-datasets inspect dataset.zarr

This will output something like the following. The output should be
self-explanatory.

.. literalinclude:: ../building/yaml/building1.txt
:language: console

*********************
Command line usage
*********************

.. argparse::
:module: anemoi.datasets.__main__
:func: create_parser
:prog: anemoi-datasets
:path: inspect
33 changes: 19 additions & 14 deletions docs/cli/introduction.rst
@@ -1,23 +1,28 @@
##############
Introduction
##############
Introduction
============

When you install the `anemoi-datasets` package, this will also install
command line tool called ``anamois-datasets`` this can be used to manage
the zarr datasets.
When you install the `anemoi-datasets` package, this will also install a command-line tool
called ``anemoi-datasets``, which can be used to manage Zarr datasets.

The tools can provide help with the ``--help`` options:
The tool provides help with the ``--help`` option:

.. code:: bash
.. code-block:: bash
% anamoi-datasets --help
% anemoi-datasets --help
The commands are:

.. toctree::
:maxdepth: 1
:maxdepth: 1

create
inspect
copy
compare
compare
copy
create
inspect
scan

.. argparse::
:module: anemoi.datasets.__main__
:func: create_parser
:prog: anemoi-datasets
:nosubcommands:
10 changes: 10 additions & 0 deletions docs/cli/scan.rst
@@ -0,0 +1,10 @@
scan
====

Use this command to scan for GRIB files.
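
A hypothetical invocation (the positional path argument is an
assumption for illustration; the actual options are listed below):

.. code:: console

   $ anemoi-datasets scan /path/to/grib/files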

.. argparse::
:module: anemoi.datasets.__main__
:func: create_parser
:prog: anemoi-datasets
:path: scan