Merge from develop
sandorkertesz committed Jun 3, 2024
2 parents 40b76b8 + 922a9b3 commit 2a9c1f7
Showing 32 changed files with 514 additions and 368 deletions.
21 changes: 16 additions & 5 deletions .pre-commit-config.yaml
@@ -13,19 +13,19 @@ repos:


- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.4.0
rev: v4.6.0
hooks:
- id: check-yaml # Check YAML files for syntax errors only
args: [--unsafe, --allow-multiple-documents]
- id: debug-statements # Check for debugger imports and py37+ breakpoint()
- id: end-of-file-fixer # Ensure files end in a newline
- id: trailing-whitespace # Trailing whitespace checker
# - id: no-commit-to-branch # Prevent committing to main / master
- id: no-commit-to-branch # Prevent committing to main / master
- id: check-added-large-files # Check for large files added to git
- id: check-merge-conflict # Check for files that contain merge conflict

- repo: https://github.com/psf/black-pre-commit-mirror
rev: 24.1.1
rev: 24.4.2
hooks:
- id: black
args: [--line-length=120]
@@ -41,7 +41,7 @@


- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.3.0
rev: v0.4.6
hooks:
- id: ruff
exclude: '(dev/.*|.*_)\.py$'
@@ -51,7 +51,6 @@
- --exit-non-zero-on-fix
- --preview


- repo: https://github.com/sphinx-contrib/sphinx-lint
rev: v0.9.1
hooks:
@@ -62,9 +61,21 @@
rev: v0.0.14
hooks:
- id: rstfmt
exclude: 'cli/.*' # Because we use argparse

- repo: https://github.com/b8raoult/pre-commit-docconvert
rev: "0.1.4"
hooks:
- id: docconvert
args: ["numpy"]

- repo: https://github.com/b8raoult/optional-dependencies-all
rev: "0.0.2"
hooks:
- id: optional-dependencies-all
args: ["--inplace", "--all-key", "all", "--exclude-keys", "dev,docs"]

- repo: https://github.com/tox-dev/pyproject-fmt
rev: "2.1.3"
hooks:
- id: pyproject-fmt
19 changes: 11 additions & 8 deletions docs/building/handling-missing-values.rst
@@ -2,16 +2,19 @@
Handling missing values
#########################

When handling data for machine learning models, missing values (NaNs)
When handling data for machine learning models, missing values (`NaNs`)
can pose a challenge, as models require complete data to operate
effectively and may crash otherwise. Ideally, we anticipate having
complete data in all fields. However, there are scenarios where NaNs
naturally occur, such as with variables only relevant on land or at sea
(such as sea surface temperature (`sst`), for example). In such cases,
the default behavior is to reject data with NaNs as invalid. To
accommodate NaNs and accurately compute statistics based on them, you
can include the `allow_nans` key in the configuration. Here's an example
of how to implement it:
complete data in all fields.

However, there are scenarios where `NaNs` naturally occur, such as with
variables only relevant on land or at sea. This happens for sea surface
temperature (`sst`), for example. In such cases, the default behavior is
to reject data with `NaNs` as invalid. To accommodate `NaNs` and
accurately compute statistics based on them, you can include the
``allow_nans`` key in the configuration.

Here's an example of how to implement it:

.. literalinclude:: yaml/nan.yaml
:language: yaml
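
For reference, here is a minimal sketch of what such a configuration
fragment might look like. This is a hedged example: the variable names
(`sst`, `ci`) are illustrative, and the exact placement of the
``allow_nans`` key is defined by the recipe schema.

.. code:: yaml

   # illustrative fragment; key placement may differ in the real schema
   statistics:
     allow_nans: [sst, ci]
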
11 changes: 11 additions & 0 deletions docs/building/sources.rst
@@ -4,6 +4,17 @@
Sources
#########

The source is a software component that, given a list of dates and
variables, will return the corresponding fields.

A `source` is responsible for reading data from its origin and
converting it to a set of fields. A `source` is also responsible for
handling the metadata of the data, such as the variable names, and
more.

Examples of sources are ECMWF's MARS archive, a collection of GRIB or
NetCDF files, etc.
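
As an illustration, a recipe's ``input`` section selects one of these
sources. The sketch below is hedged: `mars` is one of the available
sources, but the request keys shown (``param``, ``levtype``, ``grid``)
are assumptions for illustration.

.. code:: yaml

   # illustrative only; see each source's page for its actual options
   input:
     mars:
       param: [2t, msl]
       levtype: sfc
       grid: [0.25, 0.25]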

The following `sources` are currently available:

.. toctree::
10 changes: 5 additions & 5 deletions docs/building/sources/accumulations.rst
@@ -24,9 +24,9 @@ dataset is unknown, the package assumes that the fields to use are
accumulated since the beginning of the forecast, over a 6h period.

The user can specify the desired accumulation period with the
``accumulation_period`` parameter. If its value is a single interger,
the source will attempt to accumulate the variables over that period.
This does not always mean that the data used is accumulated from the
``accumulation_period`` parameter. If its value is a single integer, the
source will attempt to accumulate the variables over that period. This
does not always mean that the data used is accumulated from the
beginning of the forecast, but the most recent data available will be
used:

@@ -44,13 +44,13 @@ used:
If the ``accumulation_period`` value is a pair of integers `[step1,
step2]`, the algorithm is different. The source will compute the
accumulation between the `step1` and `step2` previous forecast that
valiate at the given date at `step2`. For example, if the accumulation
validate at the given date at `step2`. For example, if the accumulation
period is `[6, 12]`, and the valid date is 2020-10-10 18:00, the source
will use the forecast of 2020-10-10 06:00 and the steps 6h and 12h.

Please note that ``accumulation_period=6`` and ``accumulation_period=[0,
6]`` are not equivalent. In the first case, the source can return an
accumulation bwteen step 1h and step 7h if it is the most appropriate
accumulation between step 1h and step 7h if it is the most appropriate
data available, while in the second case, the source will always return
the accumulation between step 0h and step 6h, if available.
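
As a sketch, a recipe fragment requesting the accumulation between
steps 6h and 12h might look like the following. This is hedged:
``accumulation_period`` is the parameter described above, while the
variable name and the surrounding keys are assumptions for
illustration.

.. code:: yaml

   # illustrative fragment; the surrounding keys are assumptions
   input:
     accumulations:
       param: [tp]
       accumulation_period: [6, 12]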

2 changes: 2 additions & 0 deletions docs/building/sources/yaml/hindcasts.yaml
@@ -0,0 +1,2 @@
---
# TODO
@@ -21,6 +21,6 @@ data_sources:
stream: oper

input:
perturbations:
center: ${data_sources.center_source}
recentre:
centre: ${data_sources.center_source}
members: ${data_sources.members_source}
47 changes: 41 additions & 6 deletions docs/building/statistics.rst
@@ -8,23 +8,58 @@
it is created. These statistics are intended to be used to normalise the
data during training.

The statistics are stored in the :doc:`statistics attribute
<../using/statistics>` of the dataset. The computed statistics include
`minimum, maximum, mean, standard deviation`.

************************
Statistics dates range
************************

By default, the statistics are not computed on the whole dataset, but
on a subset of dates. The subset is defined using the following
algorithm:
on a subset of dates. This is usually done to avoid any data leakage
from the validation and test sets to the training set.

The subset of dates used to compute the statistics is defined using
the following algorithm:

- If the dataset covers 20 years or more, the last 3 years are
excluded.
- If the dataset covers 10 years or more, the last year is excluded.
- Otherwise, 80% of the dataset is used.

You can override this behaviour by setting the `start` or `end`
parameters in the `statistics` config.
You can override this behaviour by setting either the `start`
parameter or the `end` parameter in the `statistics` config.

Example configuration gathering statistics from 2000 to 2020:

.. code:: yaml

   statistics:
     start: 2000
     end: 2020

..
   TODO: List the statistics that are computed

Example configuration gathering statistics from the beginning of the
dataset period to 2020:

.. code:: yaml

   statistics:
     end: 2020

Example configuration gathering statistics using only 2020 data:

.. code:: yaml

   statistics:
     start: 2020
     end: 2020

**************************
Data with missing values
**************************

If the dataset contains missing values (known as `NaNs`), an error will
be raised when trying to compute the statistics. To allow `NaNs` in the
dataset, you can set the `allow_nans` option as described :doc:`here
</building/handling-missing-values>`.
21 changes: 15 additions & 6 deletions docs/cli/compare.rst
@@ -1,9 +1,18 @@
#########
compare
#########
compare
=======

Use this command to compatre two datasets:
Use this command to compare two datasets.

.. code:: bash
The command will run a quick comparison of the two datasets and output a summary of the differences.

% anemoi-datasets compare dataset1.zarr dataset2.zarr
.. warning::

This command will not compare the data in the datasets, only some of the metadata.
Subsequent versions of this command may include more detailed comparisons.
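
For example, comparing two local datasets (the file names are
illustrative):

.. code:: console

   $ anemoi-datasets compare dataset1.zarr dataset2.zarr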


.. argparse::
:module: anemoi.datasets.__main__
:func: create_parser
:prog: anemoi-datasets
:path: compare
30 changes: 25 additions & 5 deletions docs/cli/copy.rst
@@ -1,7 +1,27 @@
######
copy
######
copy
====

.. code:: bash

% anemoi-datasets copy dataset1.zarr dataset2.zarr
Copying a dataset from one location to another can be error-prone and time-consuming.
This command-line script allows for incremental copying.
When the copying process fails, it can be resumed.
It can be used to copy files from a local directory to a remote server, or from a remote server to a local directory, as long as there is a Zarr backend to read and write the data.

The script uses multiple threads to make the process faster.
However, it is important to consider that making parallel requests to the same server may not be ideal, for instance if the server internally uses a limited number of threads to handle requests.

The option to rechunk the data is available, which can be useful when the data is stored on a platform that does not support having many small files or many files in the same directory.
However, keep in mind that rechunking has a huge impact on the performance when reading the data:
the chunk pattern for the source dataset has been defined for good reasons, and changing it is very likely to have a negative impact on the performance.

.. warning::

When resuming the copying process (using ``--resume``), calling the script with the same arguments for ``--block-size`` and ``--rechunk`` is recommended.
Using different values for these arguments to resume copying the same dataset may lead to unexpected behavior.
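
As an illustration, a possible sequence of invocations is shown below.
The paths are placeholders and the ``--block-size`` value is an
arbitrary example; only the flags themselves are taken from this page.

.. code:: console

   $ anemoi-datasets copy --block-size 100 source.zarr destination.zarr
   # if the transfer is interrupted, resume with the same options
   $ anemoi-datasets copy --resume --block-size 100 source.zarr destination.zarr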


.. argparse::
:module: anemoi.datasets.__main__
:func: create_parser
:prog: anemoi-datasets
:path: copy
16 changes: 9 additions & 7 deletions docs/cli/create.rst
@@ -1,9 +1,11 @@
########
create
########
create
======

Use this command to create a dataset from a recipe file:
Use this command to create a dataset from a recipe file.
The syntax of the recipe file is described in :doc:`building datasets <../building/introduction>`.
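
A typical invocation (the file names are illustrative):

.. code:: console

   $ anemoi-datasets create recipe.yaml dataset.zarr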

.. code:: bash
% anemoi-datasets create recipe.yaml dataset.zarr
.. argparse::
:module: anemoi.datasets.__main__
:func: create_parser
:prog: anemoi-datasets
:path: create
30 changes: 25 additions & 5 deletions docs/cli/inspect.rst
@@ -1,7 +1,27 @@
#########
inspect
#########
inspect
=======

.. code:: bash

% anemoi-datasets inspect dataset.zarr
Anemoi datasets are stored in the Zarr format and can be located on a local file system or on a remote server.
The `inspect` command is used to inspect the contents of a dataset.
This command will output the metadata of the dataset, including the variables, dimensions, and attributes.

.. code:: console

   $ anemoi-datasets inspect dataset.zarr

This will output something like the following. The output should be
self-explanatory.

.. literalinclude:: ../building/yaml/building1.txt
:language: console

*********************
Command line usage
*********************

.. argparse::
:module: anemoi.datasets.__main__
:func: create_parser
:prog: anemoi-datasets
:path: inspect
33 changes: 19 additions & 14 deletions docs/cli/introduction.rst
@@ -1,23 +1,28 @@
##############
Introduction
##############
Introduction
============

When you install the `anemoi-datasets` package, this will also install
command line tool called ``anamois-datasets`` this can be used to manage
the zarr datasets.
When you install the `anemoi-datasets` package, this will also install a command-line tool
called ``anemoi-datasets``, which can be used to manage Zarr datasets.

The tools can provide help with the ``--help`` options:
The tool provides help with the ``--help`` option:

.. code:: bash
.. code-block:: bash
% anamoi-datasets --help
% anemoi-datasets --help
The commands are:

.. toctree::
:maxdepth: 1
:maxdepth: 1

create
inspect
copy
compare
compare
copy
create
inspect
scan

.. argparse::
:module: anemoi.datasets.__main__
:func: create_parser
:prog: anemoi-datasets
:nosubcommands:
10 changes: 10 additions & 0 deletions docs/cli/scan.rst
@@ -0,0 +1,10 @@
scan
====

Use this command to scan for GRIB files.
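
A hypothetical invocation (the positional path argument is an
assumption for illustration; the actual options are listed below):

.. code:: console

   $ anemoi-datasets scan /path/to/grib/files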

.. argparse::
:module: anemoi.datasets.__main__
:func: create_parser
:prog: anemoi-datasets
:path: scan