Use new rst formatter
b8raoult committed May 26, 2024
1 parent 40d6fd4 commit fd8aa56
Showing 53 changed files with 897 additions and 1,096 deletions.
11 changes: 6 additions & 5 deletions .pre-commit-config.yaml
@@ -57,11 +57,12 @@ repos:
     hooks:
       - id: sphinx-lint

-  # For now, we use it. But it does not support a lot of sphinx features
-  # - repo: https://github.com/dzhu/rstfmt
-  #   rev: v0.0.14
-  #   hooks:
-  #     - id: rstfmt
+  - repo: https://github.com/LilSpazJoekp/docstrfmt
+    rev: v1.6.1
+    hooks:
+      - id: docstrfmt
+        language_version: python3
+        types_or: [rst] # Don't touch python docstrings.

   - repo: https://github.com/b8raoult/pre-commit-docconvert
     rev: "0.1.4"
22 changes: 10 additions & 12 deletions docs/building/filters.rst
@@ -1,22 +1,20 @@
 .. _filters:

-#########
- Filters
-#########
+Filters
+=======

 .. warning::

-   This is still a work in progress. Some of the filters may be renamed
-   later.
+    This is still a work in progress. Some of the filters may be renamed later.

 Filters are used to modify the data or metadata in a dataset.

 .. toctree::
-   :maxdepth: 1
+    :maxdepth: 1

-   filters/select
-   filters/rename
-   filters/rotate_winds
-   filters/unrotate_winds
-   filters/noop
-   filters/empty
+    filters/select
+    filters/rename
+    filters/rotate_winds
+    filters/unrotate_winds
+    filters/noop
+    filters/empty
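
Filters appear in a recipe as steps of a ``pipe`` (see :ref:`building-introduction`): the first step provides fields, and each following filter modifies them. A minimal sketch, not taken from this diff; the source options and the ``rename`` mapping are assumptions for illustration only:

.. code-block:: yaml

    input:
      pipe:
        - mars:
            param: [2t, 10u, 10v]   # assumed source options, for illustration
        - rename:
            # hypothetical mapping: expose 2t under the name t2m in the dataset
            param:
              2t: t2m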
9 changes: 4 additions & 5 deletions docs/building/filters/empty.rst
@@ -1,6 +1,5 @@
-#######
- empty
-#######
+empty
+=====

-The ``empty`` filter is for debugging purposes. It always returns an
-empty set of fields.
+The ``empty`` filter is for debugging purposes. It always returns an empty set of
+fields.
8 changes: 3 additions & 5 deletions docs/building/filters/noop.rst
@@ -1,6 +1,4 @@
-######
- noop
-######
+noop
+====

-The ``noop`` filter is for debugging purposes. It returns its input
-unchanged.
+The ``noop`` filter is for debugging purposes. It returns its input unchanged.
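
A sketch of the debugging pattern this enables; the surrounding recipe fragment is assumed for illustration and is not part of this diff. Substituting ``noop`` for a filter step passes the fields through unchanged, which helps isolate whether that step is causing a problem:

.. code-block:: yaml

    input:
      pipe:
        - mars:
            param: [2t]   # illustrative source
        # a rename/interpolation step would normally go here; replace it with
        # noop to pass fields through unchanged while debugging
        - noop: {}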
5 changes: 2 additions & 3 deletions docs/building/filters/rename.rst
@@ -1,3 +1,2 @@
-########
- rename
-########
+rename
+======
5 changes: 2 additions & 3 deletions docs/building/filters/rotate_winds.rst
@@ -1,3 +1,2 @@
-##############
- rotate_winds
-##############
+rotate_winds
+============
5 changes: 2 additions & 3 deletions docs/building/filters/select.rst
@@ -1,3 +1,2 @@
-########
- select
-########
+select
+======
5 changes: 2 additions & 3 deletions docs/building/filters/unrotate_winds.rst
@@ -1,3 +1,2 @@
-###############
- unrotate_wind
-###############
+unrotate_wind
+=============
26 changes: 12 additions & 14 deletions docs/building/handling-missing-dates.rst
@@ -1,24 +1,22 @@
-########################
- Handling missing dates
-########################
+Handling missing dates
+======================

 By default, the package will raise an error if there are missing dates.

-Missing dates can be handled by specifying a list of dates in the
-configuration file. The dates should be in the same format as the dates
-in the time series. The missing dates will be filled ``np.nan`` values.
+Missing dates can be handled by specifying a list of dates in the configuration file.
+The dates should be in the same format as the dates in the time series. The missing
+dates will be filled ``np.nan`` values.

 .. literalinclude:: yaml/missing_dates.yaml
-   :language: yaml
+    :language: yaml

-*Anemoi* will ignore the missing dates when computing the
-:ref:`statistics <gathering_statistics>`.
+*Anemoi* will ignore the missing dates when computing the :ref:`statistics
+<gathering_statistics>`.

-You can retrieve the list indices corresponding to the missing dates by
-accessing the ``missing`` attribute of the dataset object.
+You can retrieve the list indices corresponding to the missing dates by accessing the
+``missing`` attribute of the dataset object.

 .. literalinclude:: ../using/code/missing_.py
-   :language: python
+    :language: python

-If you access a missing index, the dataset will throw a
-``MissingDateError``.
+If you access a missing index, the dataset will throw a ``MissingDateError``.
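
The ``yaml/missing_dates.yaml`` file referenced above is not part of this diff. A sketch of what such a configuration might look like, under the assumption that the list of missing dates sits alongside the ``dates`` section of the recipe; the key name and placement are assumptions for illustration:

.. code-block:: yaml

    dates:
      start: 2020-01-01 00:00:00
      end: 2020-12-31 18:00:00
      frequency: 6h
      # Assumed key name: the dates below use the same format as the time series
      # and will be stored as missing (filled with NaNs) in the dataset.
      missing:
        - 2020-03-01 12:00:00
        - 2020-03-01 18:00:00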
25 changes: 11 additions & 14 deletions docs/building/handling-missing-values.rst
@@ -1,17 +1,14 @@
-#########################
- Handling missing values
-#########################
+Handling missing values
+=======================

-When handling data for machine learning models, missing values (NaNs)
-can pose a challenge, as models require complete data to operate
-effectively and may crash otherwise. Ideally, we anticipate having
-complete data in all fields. However, there are scenarios where NaNs
-naturally occur, such as with variables only relevant on land or at sea
-(such as sea surface temperature (`sst`), for example). In such cases,
-the default behavior is to reject data with NaNs as invalid. To
-accommodate NaNs and accurately compute statistics based on them, you
-can include the `allow_nans` key in the configuration. Here's an example
-of how to implement it:
+When handling data for machine learning models, missing values (NaNs) can pose a
+challenge, as models require complete data to operate effectively and may crash
+otherwise. Ideally, we anticipate having complete data in all fields. However, there are
+scenarios where NaNs naturally occur, such as with variables only relevant on land or at
+sea (such as sea surface temperature (`sst`), for example). In such cases, the default
+behavior is to reject data with NaNs as invalid. To accommodate NaNs and accurately
+compute statistics based on them, you can include the `allow_nans` key in the
+configuration. Here's an example of how to implement it:

 .. literalinclude:: yaml/nan.yaml
-   :language: yaml
+    :language: yaml
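
The ``yaml/nan.yaml`` file referenced above is not included in this diff. A minimal sketch of the idea: only the ``allow_nans`` key comes from the text; its exact placement in the recipe and the variable list are assumptions for illustration:

.. code-block:: yaml

    # Assumed placement: accept NaNs for variables that are only defined at sea,
    # so that statistics are computed over the valid points only.
    allow_nans: [sst]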
168 changes: 76 additions & 92 deletions docs/building/introduction.rst
@@ -1,154 +1,138 @@
 .. _building-introduction:

-##############
- Introduction
-##############
+Introduction
+============

-The `anemoi-datasets` package allows you to create datasets for training
-data-driven weather models. The datasets are built using a `recipe`
-file, which is a YAML file that describes sources of meteorological
-fields as well as the operations to perform on them, before they are
-written to a zarr file. The input of the process is a range of dates and
-some options to control the layout of the output. Statistics will be
-computed as the dataset is build, and stored in the metadata, with other
-information such as the the locations of the grid points, the list of
-variables, etc.
+The `anemoi-datasets` package allows you to create datasets for training data-driven
+weather models. The datasets are built using a `recipe` file, which is a YAML file that
+describes sources of meteorological fields as well as the operations to perform on them,
+before they are written to a zarr file. The input of the process is a range of dates and
+some options to control the layout of the output. Statistics will be computed as the
+dataset is build, and stored in the metadata, with other information such as the the
+locations of the grid points, the list of variables, etc.

 .. figure:: ../schemas/recipe.png
-   :alt: Building datasets
-   :align: center
+    :alt: Building datasets
+    :align: center

-**********
- Concepts
-**********
+Concepts
+--------

 date
-   Throughout this document, the term `date` refers to a date and time,
-   not just a date. A training dataset is covers a continuous range of
-   dates with a given frequency. Missing dates are still part of the
-   dataset, but the data are missing and marked as such using NaNs.
-   Dates are always in UTC, and refer to date at which the data is
-   valid. For accumulations and fluxes, that would be the end of the
-   accumulation period.
+    Throughout this document, the term `date` refers to a date and time, not just a
+    date. A training dataset is covers a continuous range of dates with a given
+    frequency. Missing dates are still part of the dataset, but the data are missing and
+    marked as such using NaNs. Dates are always in UTC, and refer to date at which the
+    data is valid. For accumulations and fluxes, that would be the end of the
+    accumulation period.

 variable
-   A `variable` is meteorological parameter, such as temperature, wind,
-   etc. Multilevel parameters are treated as separate variables, one for
-   each level. For example, temperature at 850 hPa and temperature at
-   500 hPa will be treated as two separate variables (`t_850` and
-   `t_500`).
+    A `variable` is meteorological parameter, such as temperature, wind, etc. Multilevel
+    parameters are treated as separate variables, one for each level. For example,
+    temperature at 850 hPa and temperature at 500 hPa will be treated as two separate
+    variables (`t_850` and `t_500`).

 field
-   A `field` is a variable at a given date. It is represented by a array
-   of values at each grid point.
+    A `field` is a variable at a given date. It is represented by a array of values at
+    each grid point.

 source
-   The `source` is a software component that given a list of dates and
-   variables will return the corresponding fields. A example of source
-   is ECMWF's MARS archive, a collection of GRIB or NetCDF files, a
-   database, etc. See :ref:`sources` for more information.
+    The `source` is a software component that given a list of dates and variables will
+    return the corresponding fields. A example of source is ECMWF's MARS archive, a
+    collection of GRIB or NetCDF files, a database, etc. See :ref:`sources` for more
+    information.

 filter
-   A `filter` is a software component that takes as input the output of
-   a source or the output of another filter can modify the fields and/or
-   their metadata. For example, typical filters are interpolations,
-   renaming of variables, etc. See :ref:`filters` for more information.
+    A `filter` is a software component that takes as input the output of a source or the
+    output of another filter can modify the fields and/or their metadata. For example,
+    typical filters are interpolations, renaming of variables, etc. See :ref:`filters`
+    for more information.

-************
- Operations
-************
+Operations
+----------

-In order to build a training dataset, sources and filters are combined
-using the following operations:
+In order to build a training dataset, sources and filters are combined using the
+following operations:

 join
-   The join is the process of combining several sources data. Each
-   source is expected to provide different variables at the same dates.
+    The join is the process of combining several sources data. Each source is expected
+    to provide different variables at the same dates.

 pipe
-   The pipe is the process of transforming fields using filters. The
-   first step of a pipe is typically a source, a join or another pipe.
-   The following steps are filters.
+    The pipe is the process of transforming fields using filters. The first step of a
+    pipe is typically a source, a join or another pipe. The following steps are filters.

 concat
-   The concatenation is the process of combining different sets of
-   operation that handle different dates. This is typically used to
-   build a dataset that spans several years, when the several sources
-   are involved, each providing a different period.
+    The concatenation is the process of combining different sets of operation that
+    handle different dates. This is typically used to build a dataset that spans several
+    years, when the several sources are involved, each providing a different period.

-Each operation is considered as a :ref:`source <sources>`, therefore
-operations can be combined to build complex datasets.
+Each operation is considered as a :ref:`source <sources>`, therefore operations can be
+combined to build complex datasets.

-*****************
- Getting started
-*****************
+Getting started
+---------------

 First recipe
-============
+~~~~~~~~~~~~

-The simplest `recipe` file must contain a ``dates`` section and an
-``input`` section. The latter must contain a `source` In that case, the
-source is ``mars``
+The simplest `recipe` file must contain a ``dates`` section and an ``input`` section.
+The latter must contain a `source` In that case, the source is ``mars``

 .. literalinclude:: yaml/building1.yaml
-   :language: yaml
+    :language: yaml

 To create the dataset, run the following command:

-.. code:: console
+.. code-block:: console

-   $ anemoi-datasets create recipe.yaml dataset.zarr
+    $ anemoi-datasets create recipe.yaml dataset.zarr

-Once the build is complete, you can inspect the dataset using the
-following command:
+Once the build is complete, you can inspect the dataset using the following command:

-.. code:: console
+.. code-block:: console

-   $ anemoi-datasets inspect dataset.zarr
+    $ anemoi-datasets inspect dataset.zarr

 .. literalinclude:: yaml/building1.txt
-   :language: console
+    :language: console

 Adding a second source
-======================
+~~~~~~~~~~~~~~~~~~~~~~

-To add a second source, you need to use the ``join`` operation. In that
-example, we add pressure level variables to the previous example:
+To add a second source, you need to use the ``join`` operation. In that example, we add
+pressure level variables to the previous example:

 .. literalinclude:: yaml/building2.yaml
-   :language: yaml
+    :language: yaml

 This will build the following dataset:

 .. literalinclude:: yaml/building2.txt
-   :language: console
+    :language: console

 .. note::

-   Please note that the pressure levels parameters are named
-   `param_level`. This is the default behaviour. See
-   :ref:`remapping_option` for more information.
+    Please note that the pressure levels parameters are named `param_level`. This is the
+    default behaviour. See :ref:`remapping_option` for more information.

 Adding some forcing variables
-=============================
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-When training a data-driven models, some forcing variables may be
-required such as the solar radiation, the time of day, the day in the
-year, etc.
+When training a data-driven models, some forcing variables may be required such as the
+solar radiation, the time of day, the day in the year, etc.

-These are provided by the ``forcings`` source. In that example, we add a
-few of them. The `template` option is used to point to another source,
-in that case the first instance of ``mars``. This source is used to get
-information about the grid points, as some of the forcing variables are
-grid dependent.
+These are provided by the ``forcings`` source. In that example, we add a few of them.
+The `template` option is used to point to another source, in that case the first
+instance of ``mars``. This source is used to get information about the grid points, as
+some of the forcing variables are grid dependent.

 .. literalinclude:: yaml/building3.yaml
-   :language: yaml
+    :language: yaml

 This will build the following dataset:

 .. literalinclude:: yaml/building3.txt
-   :language: console
+    :language: console

-See :ref:`forcing_variables` for more information about forcing
-variables.
+See :ref:`forcing_variables` for more information about forcing variables.
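
The ``building*.yaml`` files referenced above are not included in this diff. For orientation, a sketch of a recipe that combines the pieces described in this section: a ``dates`` section, a ``join`` of two ``mars`` sources (surface and pressure-level variables), and a ``forcings`` source pointing back to the first ``mars`` instance via ``template``. The parameter names, grid, and the exact ``template`` reference syntax are assumptions for illustration:

.. code-block:: yaml

    dates:
      start: 2020-01-01 00:00:00
      end: 2020-12-31 18:00:00
      frequency: 6h

    input:
      join:
        - mars:
            # surface variables (illustrative list)
            param: [2t, 10u, 10v, msl]
            grid: [0.25, 0.25]
        - mars:
            # pressure-level variables; stored as param_level, e.g. t_850
            param: [t, q]
            level: [850, 500]
        - forcings:
            # points back to the first mars source to obtain the grid
            template: ${input.join.0.mars}
            param: [cos_latitude, sin_latitude, cos_julian_day, insolation]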