
WARNING: this page describes a pre-release (alpha) version of EAT. Functionality, names of executables and APIs, and the arguments they take can all still change before the final public release.

Installing

EAT works on Linux, Windows and Mac.

We recommend installing EAT with Anaconda. On many systems Anaconda is already installed: try running conda --version. If that fails on Linux, you may need to load an anaconda module first: try module load anaconda or module load anaconda3. If that still does not give you a working conda command, you can install Miniconda, which does not require root/administrator privileges.

Before using Anaconda for the first time, you may need to initialize it for shell interaction (this is not needed if you let the installer do this): execute conda init. Optionally, you can add an argument to specify which shell you will be using (default: cmd.exe/powershell on Windows, bash on Linux). This needs to be done just once. After initializing conda, restart your shell by logging out and back in.

There are three ways to install EAT:

  • install the prebuilt EAT package:
    conda create -n eat -c bolding-bruggeman -c conda-forge eatpy
    This creates a new conda environment named eat and installs EAT into it. This is the recommended way to install on Linux and Windows workstations, unless you want to customize the build process.
  • build EAT yourself and then install it, using conda-provided packages (compilers, MPI, NetCDF, BLAS, LAPACK). This allows you to customize the build process, which for instance makes it possible to include custom biogeochemical models when building the 1D water column model (GOTM-FABM).
  • build EAT yourself and then install it, using system-provided packages (compilers, MPI, NetCDF, BLAS, LAPACK). This is the best option if you will run EAT on a system that requires the use of a preinstalled MPI library, which is typical for HPC clusters. You can still customize the build process as in the previous option.

To install the prebuilt EAT, the above conda create ... command is all you need. Instructions for building and installing yourself are given in the next section.

Building and installing manually

To obtain the EAT code, build it, and then install it:

git clone --recursive https://github.com/BoldingBruggeman/eat.git
cd eat
conda env create -f <ENVIRONMENT_YML>
conda activate eat
source ./install

In the above, <ENVIRONMENT_YML> is the name of your environment file: environment.yml if using conda-provided packages (compilers, MPI, NetCDF, BLAS, LAPACK), environment-min.yml if you want to use system-provided packages instead.

The commands above compile EAT with GOTM and FABM in their default configuration. You can customize the build process by editing install. For instance, you can:

  • include external codebases with additional biogeochemical models in FABM by changing the line with cmake ${REPO_DIR} to something like: cmake ${REPO_DIR} -DFABM_INSTITUTES="<NAME1>;<NAME2>;..." -DFABM_<NAME1>_BASE=<DIR1> -DFABM_<NAME2>_BASE=<DIR2> ...
  • force the use of a specific Fortran compiler by adding -DCMAKE_Fortran_COMPILER=<EXE> to the line with cmake ${REPO_DIR}

Staying up to date

To update to the latest source code, rebuild, and reinstall, make sure you are in the eat directory and execute:

git pull
git submodule update --init --recursive
conda env update -f <ENVIRONMENT_YML>
conda activate eat
source ./install

Uninstalling

If you want to remove EAT from your system, you can do so by removing its Anaconda environment:

conda env remove -n eat

Installing without Anaconda

If you do not want to use conda at all, you will need to install all dependencies manually: the packages that conda would otherwise provide (Fortran compiler, MPI, NetCDF, BLAS, LAPACK), as well as the Python packages listed in the environment files.

You can then follow the instructions for building and installing manually, skipping the conda commands.

Data assimilation with a 1D water column model

Here we use the one-dimensional model GOTM-FABM, which describes physical and biogeochemical processes in marine and freshwater water columns. This is included in EAT.

To begin, you need a working GOTM setup and observations that you would like to assimilate. These observations can be:

  • profiles of depth-explicit variables (e.g., temperature, salinity, chlorophyll)
  • depth-explicit model variables observed at one particular depth level (e.g., sea surface temperature or chlorophyll)
  • depth-independent model variables (e.g., air-sea fluxes, benthic variables)

You can combine any number of these observations.

If you do not have a GOTM setup with observations, you can try out EAT using the included example for the Northern North Sea. Its setup directory is located at tests/nns_annual. This example includes sea surface temperature observations from the SST CCI.

How to run

To add EAT-based data assimilation to a GOTM setup, you create a new Python file in the setup directory. Below we will assume this file is named run.py. As a starting point, it could contain:

import eatpy
experiment = eatpy.models.GOTM()
filter = eatpy.PDAF(eatpy.pdaf.FilterType.ESTKF)
experiment.add_observations("temp[-1]", "cci_sst.dat")
experiment.run(filter)

This does the following:

  • import eatpy provides access to all EAT routines
  • experiment = eatpy.models.GOTM() tells EAT that the ensemble members (models) will be instances of GOTM
  • filter = eatpy.PDAF(eatpy.pdaf.FilterType.ESTKF) configures the data assimilation filter. The eatpy.pdaf.FilterType.ESTKF specifies that the Error Subspace Transform Kalman filter is to be used.
  • experiment.add_observations("temp[-1]", "cci_sst.dat") adds observations of sea surface temperature (GOTM's temp variable at the last depth level, -1); these will be assimilated. Here they are provided in a file named cci_sst.dat, which must use one of the predefined observation file formats.
  • experiment.run(filter) starts the data assimilation experiment

To run the data assimilation experiment, we will need to start this script simultaneously with an ensemble of GOTM-FABM models:

  • First activate the EAT Python environment: conda activate eat.
  • EAT is MPI-based. To start the run script together with an ensemble of GOTM models:
    mpiexec -n 1 python run.py : -n 10 eat-gotm --separate_restart_file
    This creates an ensemble of 10 members (that's the -n 10). --separate_restart_file specifies that ensemble members differ in their initial state (the restart file); these member-specific restart files must be generated first, as discussed below under "Generating restart files". Ensemble members can also differ in their configuration (e.g. physical or biogeochemical parameters); this is specified by adding --separate_gotm_yaml.

If you run on an HPC cluster, you typically will want to put these commands in a job submission script, so that the simulations run on the compute nodes rather than on the login nodes. An example of a SLURM script that does this:

#!/bin/bash
#SBATCH --account=tra21_seamless
#SBATCH --partition=g100_usr_prod
#SBATCH --ntasks 11
#SBATCH --time=01:00:00
#SBATCH --output=%j.out
#SBATCH --error=%j.err

conda activate eat
export PYTHONUNBUFFERED=TRUE
srun -n1 python run.py : -n10 eat-gotm --separate_restart_file

Note that the allocated number of tasks (#SBATCH --ntasks 11) must match the total number of processes started by srun (eleven in total: 1 for the run script, 10 for eat-gotm).

Configuring the ensemble

Ensemble members can differ in their initial state or in their configuration (physical or biogeochemical parameters, forcing files, etc.). The model eat-gotm is told which of these two options apply with command line arguments:

  • --separate_restart_file specifies that ensemble members differ in their initial state (restart.nc). GOTM will then look for one restart file per ensemble member; these need to be named restart_<NNNN>.nc, where <NNNN> is 0001 for the first ensemble member, 0002 for the second, etc.
  • --separate_gotm_yaml specifies that ensemble members differ in their configuration (gotm.yaml). GOTM will then look for one configuration file per ensemble member; these need to be named gotm_<NNNN>.yaml, where <NNNN> is 0001 for the first ensemble member, 0002 for the second, etc.
    If you want different ensemble members to use different biogeochemical parameterizations, you can point each member to a different FABM configuration file (by default they all use fabm.yaml). The path to the FABM configuration file is set in gotm.yaml, as variable yaml_file under section fabm. Thus, you need to specify --separate_gotm_yaml and have a gotm_<NNNN>.yaml file in place for each member as described above. In each gotm_<NNNN>.yaml file, you then set fabm/yaml_file to a member-specific value, e.g., fabm_<NNNN>.yaml (an example is given under "Generating yaml configuration files" below).

Generating restart files

If you want to use different initial states per ensemble member (--separate_restart_file), you first need a restart file (restart.nc) for your stand-alone simulation. If you do not have this file yet, you can generate it by running eat-gotm --generate_restart_file from your GOTM setup directory (note: no mpiexec is used; eat-gotm runs in serial).

After you have the initial restart file, you can generate perturbations of it for each ensemble member. EAT makes it easy to do this from Python:

import eatpy
import numpy as np
N = 100    # ensemble size
with eatpy.models.gotm.RestartEnsemble("restart.nc", N) as f:
    for name, values in f.template:
        if name not in ("h", "z", "zi", "lon", "lat"):
            scale_factors = np.random.lognormal(sigma=0.1, size=(N,) + values.shape)
            f[name] = values * scale_factors

In this example, independent scale factors are applied to every variable that is part of the model state (and at every depth, for depth-explicit variables). These scale factors are drawn from a log-normal distribution with standard deviation 0.1 (in ln units). After these lines execute (upon exiting the with construct), restart files restart_0001.nc to restart_0100.nc will be written for all 100 ensemble members.

To execute this example, copy the lines into a new Python file (<NAME>.py), into a Python or IPython console, or into a Jupyter notebook. The code assumes the original restart file is present in your working directory; if this is not the case, replace "restart.nc" in the above with the full path to the restart file.

WARNING: when a GOTM simulation finishes, the initial restart file(s) are overwritten with the state of the model at the end of the simulation. Thus, you typically want to store your generated restart files separately, for instance in a subdirectory within your GOTM setup. You then copy these to your setup directory every time you want to start a simulation.
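For instance, if you keep the generated files in a subdirectory of the setup directory named restarts (a hypothetical name), a few lines of Python suffice to copy them back before each run; this is a minimal sketch, not part of EAT itself:

import glob
import os
import shutil

# copy the previously generated member-specific restart files (restart_0001.nc, ...)
# from the "restarts" subdirectory into the current GOTM setup directory
for path in glob.glob(os.path.join("restarts", "restart_*.nc")):
    shutil.copy(path, ".")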

Generating yaml configuration files

If you want to use different gotm.yaml configuration files per ensemble member (--separate_gotm_yaml), you can generate these from a master gotm.yaml file. EAT makes it easy to do this from Python:

import eatpy
import numpy as np
N = 100    # ensemble size
with eatpy.models.gotm.YAMLEnsemble("gotm.yaml", N) as f:
    f["surface/u10/scale_factor"] = np.random.lognormal(sigma=0.1, size=N)
    f["surface/v10/scale_factor"] = np.random.lognormal(sigma=0.1, size=N)

This uses gotm.yaml as a template and generates from it 100 different files, one per ensemble member (gotm_<NNNN>.yaml). The values of two parameters (surface/u10/scale_factor, surface/v10/scale_factor) are each drawn from a log-normal probability distribution. The standard deviation of this distribution is 0.1 (ln units) for both parameters; its mean is not specified and therefore defaults to 0 in ln units.

With a few more lines, the same approach can be used to also specify a different biogeochemical parameterization (fabm.yaml file) for each ensemble member:

import eatpy
import numpy as np
N = 100    # ensemble size
with eatpy.models.gotm.YAMLEnsemble("gotm.yaml", N) as f:
    f["surface/u10/scale_factor"] = np.random.lognormal(sigma=0.1, size=N)
    f["surface/v10/scale_factor"] = np.random.lognormal(sigma=0.1, size=N)
    f["fabm/yaml_file"] = [f"fabm_{i + 1:04}.yaml" for i in range(N)]
with eatpy.models.gotm.YAMLEnsemble("fabm.yaml", N) as f:
    scale_factors = np.random.lognormal(sigma=0.2, size=N)
    f["instances/P1/parameters/sum"] = scale_factors * f["instances/P1/parameters/sum"]

This creates gotm_0001.yaml to gotm_0100.yaml, each pointing to a different FABM configuration file (fabm_0001.yaml to fabm_0100.yaml) with a member-specific scale factor applied to parameter instances/P1/parameters/sum.

Observations

You can add any number of observed variables with lines like

experiment.add_observations("temp", "temp_profs.dat")
experiment.add_observations("temp[-1]", "cci_sst.dat")

The first line adds observations of a depth-explicit model variable, temperature (temp). These observations are thus specific to particular depths, which will be recorded in the observation file (temp_profs.dat). This is appropriate when assimilating Argo float or CTD observations.

In the second line, the variable temp is followed by a depth index ([-1]). As usual in Python, this index is 0-based, with negative indices counting down from the end of the array. Thus 0 specifies the first depth index, -1 the last depth index. Since GOTM orders its depth axis from bottom to surface, an index of 0 indicates the value closest to the bottom and an index of -1 the value closest to the surface. Thus, temp[-1] is the model equivalent of the sea surface temperature.

Every observed variable (e.g., temperature, chlorophyll, oxygen concentration) must be described in a separate file. The format of this file depends on whether the variable is depth-independent (e.g., benthic detritus) or depth-dependent. Variables that are depth-explicit in the model but indexed at a particular depth level - temp[-1] in the above example - use the depth-independent format.

Depth-independent variables

Each line of the observation file must contain tab-separated values for the time (GMT), the observed variable, and its standard deviation. For instance:

2015-05-01 12:00:00	11.12	0.32

This could describe an observed sea surface temperature of 11.12 °C (s.d. 0.32 °C) on 1 May 2015, at noon.

The hash mark (#) can be used to denote comments. All characters from # to the end of the line are ignored. Any whitespace at the very start or end of a line is ignored. Empty lines (or lines with only comments) are skipped.
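If your observations are currently stored in a different form (e.g., NumPy arrays), a minimal sketch of writing them in this format could look as follows; the file name, times, values and standard deviations below are hypothetical:

import datetime

# hypothetical sea surface temperature observations (°C) and their standard deviations
times = [datetime.datetime(2015, 5, 1, 12, 0), datetime.datetime(2015, 5, 2, 12, 0)]
values = [11.12, 11.30]
sds = [0.32, 0.32]

with open("cci_sst.dat", "w") as f:
    f.write("# time (GMT)\tsst (degrees Celsius)\ts.d.\n")  # comment line, ignored by EAT
    for t, v, sd in zip(times, values, sds):
        f.write(f"{t:%Y-%m-%d %H:%M:%S}\t{v}\t{sd}\n")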

Depth-dependent variables

Each line of the observation file must contain tab-separated values for the time (GMT), the depth at which the variable was observed (in meters above mean sea level), the observed variable, and its standard deviation. For instance:

2015-05-01 12:00:00	-15.	11.12	0.32

This could describe an observed temperature of 11.12 °C (s.d. 0.32 °C) at a depth of 15 m below mean sea level, on 1 May 2015, at noon. Note that as in GOTM, the depth axis is directed upwards, with a reference depth equal to mean sea level. The observed depth will therefore nearly always be negative (i.e., below mean sea level); the only exception is when sampling near the surface during high tide.

The hash mark (#) can be used to denote comments. All characters from # to the end of the line are ignored. Any whitespace at the very start or end of a line is ignored. Empty lines (or lines with only comments) are skipped.

Restricting the period over which to assimilate

You can direct EAT to only use observations from a specific time period, without needing to modify your observation files. This can be useful if you want to include a period during the simulation in which the model runs freely, for instance, during an initial spin-up phase, or in forecast-only mode after a period of data assimilation. To configure this, pass start and stop arguments when creating the eatpy.models.GOTM object:

experiment = eatpy.models.GOTM(start=datetime.datetime(2020, 1, 1), stop=datetime.datetime(2021, 1, 1))

This will only use observations from the year 2020. Note that observations that lie outside GOTM's simulated period (time/start and time/stop set in gotm.yaml) will always be ignored. Also, the above requires an additional import datetime at the start of your run script.
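Putting this together with the run script shown earlier, a complete run.py that assimilates only observations from 2020 could look like:

import datetime
import eatpy

experiment = eatpy.models.GOTM(start=datetime.datetime(2020, 1, 1), stop=datetime.datetime(2021, 1, 1))
filter = eatpy.PDAF(eatpy.pdaf.FilterType.ESTKF)
experiment.add_observations("temp[-1]", "cci_sst.dat")
experiment.run(filter)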

The data assimilation method

The data assimilation method is configured when you initialize the filter with a line like

filter = eatpy.PDAF(eatpy.pdaf.FilterType.ESTKF)

eatpy.PDAF takes the following arguments:

  • filtertype: (1) SEIK, (2) EnKF, (3) LSEIK, (4) ETKF, (5) LETKF, (6) ESTKF, (7) LESTKF, (8) LEnKF, (9) NETF, (10) LNETF, (12) PF, (13) 3DVAR. You can either specify the filtertype as an integer value, or as a member of eatpy.pdaf.FilterType.
  • subtype: subtype of filter algorithm, defaulting to 0. The interpretation of this argument depends on filtertype:
    • SEIK: (0) ensemble forecast, new formulation; (1) ensemble forecast, old formulation; (2) fixed error space basis; (3) fixed state covariance matrix; (4) SEIK with ensemble transformation
    • LSEIK: (0) ensemble forecast; (2) fixed error space basis; (3) fixed state covariance matrix; (4) LSEIK with ensemble transformation
    • ETKF: (0) ETKF using T-matrix like SEIK; (1) ETKF following Hunt et al. (2007). There are no fixed basis/covariance cases, as these are equivalent to SEIK subtypes 2/3.
    • LETKF: (0) LETKF using T-matrix like SEIK; (1) LETKF following Hunt et al. (2007). There are no fixed basis/covariance cases, as these are equivalent to LSEIK subtypes 2/3.
    • ESTKF: (0) standard ESTKF. There are no fixed basis/covariance cases, as these are equivalent to SEIK subtypes 2/3.
    • LESTKF: (0) standard LESTKF. There are no fixed basis/covariance cases, as these are equivalent to LSEIK subtypes 2/3.
    • LEnKF: (0) standard form of EnKF with covariance localization
    • NETF: (0) standard NETF
    • LNETF: (0) standard LNETF
    • 3DVAR: (0) 3DVAR incremental with control variable transform; (1) ensemble 3DVAR using LESTKF for ensemble transformation; (4) ensemble 3DVAR using ESTKF for ensemble transformation; (6) hybrid 3DVAR using LESTKF for ensemble transformation; (7) hybrid 3DVAR using ESTKF for ensemble transformation
  • incremental: whether to perform incremental analysis
  • forget: Forgetting factor (usually >0 and <=1). This controls inflation of the ensemble.
  • type_forget: Type of forgetting factor in SEIK/LSEIK/ETKF/LETKF/ESTKF/LESTKF: (0) fixed; (1) global adaptive; (2) local adaptive (LSEIK/LETKF/LESTKF only)
  • type_trans: Type of ensemble transformation
    • SEIK/LSEIK and ESTKF/LESTKF: (0) use deterministic omega; (1) use random orthonormal omega orthogonal to (1,...,1)^T; (2) use product of (0) with random orthonormal matrix with eigenvector (1,...,1)^T
    • ETKF/LETKF: (0) use deterministic symmetric transformation; (2) use product of (0) with random orthonormal matrix with eigenvector (1,...,1)^T
  • type_sqrt: type of transformation matrix square root (0: symmetric, 1: Cholesky decomposition)
  • type_opt: optimization method (solver) for 3D-Var (1: LBFGS, 2: CG+, 3: plain CG)
  • rank_analysis_enkf: maximum rank for inversion of HPH^T (if 0, HPH^T is inverted by solving the representer equation; if set to >= ensemble size, it is reset to ensemble size - 1)
  • beta_3dvar: Hybrid weight for hybrid 3DVAR [only used for filtertype 13 with subtype 6/7]
  • screen: amount of screen output: (0) at initialization only; (1) every assimilation cycle; (2) as (1), with timings added

Currently, the following filtertype and subtype combinations have been tested:

  • Error Subspace Transform Kalman filter (ESTKF): filtertype=6, subtype=0
  • 3DVAR incremental with control variable transform: filtertype=13, subtype=0. Extra settings are required as described in the next section.
  • ensemble 3DVAR using ESTKF for ensemble transformation: filtertype=13, subtype=4. Extra settings are required as described in the next section.
  • hybrid 3DVAR using ESTKF for ensemble transformation: filtertype=13, subtype=7. Extra settings are required as described in the next section.
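For example, to use the ESTKF with mild ensemble inflation, you could lower the forgetting factor. The sketch below assumes these settings are accepted as keyword arguments by eatpy.PDAF; the value 0.95 is arbitrary:

filter = eatpy.PDAF(eatpy.pdaf.FilterType.ESTKF, forget=0.95)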

3DVar

To use PDAF with 3D-Var, you need to provide custom routines for forward and adjoint covariance transformation (or their ensemble counterparts). This is done through a plugin, implemented in a stand-alone Python file with the following template:

from typing import Mapping, Any
import numpy as np
import eatpy

class Cvt(eatpy.pdaf.CvtHandler):
    def initialize(self, variables: Mapping[str, Any], ensemble_size: int):
        # If you use parameterized or hybrid 3D-Var, you typically need the square
        # root of the background covariance matrix (V) in the cvt and cvt_adj methods.
        # Here you could set this matrix V up once, for instance, by reading it from file.
        # You also must set the length of the control vector (dim_cvec);
        # If you are using V in the cvt and cvt_adj methods as shown below, dim_cvec
        # equals the number of columns of V: self.dim_cvec = self.V.shape[1]
        self.V = ...
        self.dim_cvec = ...

    def cvt(self, iter: int, state: np.ndarray, v: np.ndarray) -> np.ndarray:
        """Forward covariance transformation for parameterized 3D-Var

        Args:
           iter: iteration of the optimization (1 for the first)
           state: state vector with shape (nstate,)
           v: control vector with shape (self.dim_cvec,)

        Returns:
           state vector increment with shape (nstate,)
        """
        return self.V @ v

    def cvt_adj(self, iter: int, state: np.ndarray, Vv: np.ndarray) -> np.ndarray:
        """Adjoint covariance transformation for parameterized 3D-Var

        Args:
           iter: iteration of the optimization (1 for the first)
           state: state vector with shape (nstate,)
           Vv: state vector increment with shape (nstate,)

        Returns:
           control vector with shape (self.dim_cvec,)
        """
        return self.V.T @ Vv

    def cvt_ens(self, iter: int, state: np.ndarray, v: np.ndarray) -> np.ndarray:
        """Forward covariance transformation for ensemble 3D-Var

        Args:
           iter: iteration of the optimization (1 for the first)
           state: ensemble state matrix with shape (nensemble, nstate)
           v: control vector with shape (self.dim_cvec_ens,)
              self.dim_cvec_ens defaults to the number of ensemble members (nensemble),
              but this can be overridden by setting it explicitly from initialize.

        Returns:
           state vector increment with shape (nstate,)
        """
        if iter == 1:
            # Unbiased ensemble-based estimator of the square root of the background covariance matrix
            self.V_ens = (state - state.mean(axis=0)).T / np.sqrt(v.size - 1)
        return self.V_ens @ v

    def cvt_adj_ens(self, iter: int, state: np.ndarray, Vv: np.ndarray) -> np.ndarray:
        """Adjoint covariance transformation for ensemble 3D-Var

        Args:
           iter: iteration of the optimization (1 for the first)
           state: ensemble state matrix with shape (nensemble, nstate)
           Vv: state vector increment with shape (nstate,)

        Returns:
           control vector with shape (self.dim_cvec_ens,)
        """
        return self.V_ens.T @ Vv

For more information on what these routines should do, see the PDAF documentation: cvt, cvt_adj, cvt_ens, cvt_adj_ens.
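As a concrete (hypothetical) illustration, a plug-in for parameterized 3D-Var that loads a precomputed square root of the background covariance matrix from a NumPy file named V.npy could look like this; only cvt and cvt_adj are implemented, as the ensemble counterparts are not needed for parameterized 3D-Var (subtype 0):

from typing import Mapping, Any
import numpy as np
import eatpy

class Cvt(eatpy.pdaf.CvtHandler):
    def initialize(self, variables: Mapping[str, Any], ensemble_size: int):
        # V.npy is a hypothetical file containing the square root of the background
        # covariance matrix, with shape (nstate, dim_cvec)
        self.V = np.load("V.npy")
        self.dim_cvec = self.V.shape[1]

    def cvt(self, iter: int, state: np.ndarray, v: np.ndarray) -> np.ndarray:
        # control vector -> state vector increment
        return self.V @ v

    def cvt_adj(self, iter: int, state: np.ndarray, Vv: np.ndarray) -> np.ndarray:
        # state vector increment -> control vector
        return self.V.T @ Vv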

To make EAT use your plugin, you need to add it to your data assimilation experiment. If your plugin is coded in a file named cvt.py, you can add it like this in your run script:

import cvt
experiment.add_plugin(cvt.Cvt())

This must be done before you add observations to the experiment.

Creating plug-ins

You can customize your EAT data assimilation experiment by creating plug-ins. These are snippets of Python code that have full (read and write) access to the model state and observations during the data assimilation cycle. Thus, these plug-ins can see and modify the forecasted and analyzed model state. They can for instance:

  • apply transforms (log, logit, etc.) to selected variables
  • save the model state before and after the data assimilation step
  • prevent selected model variables from being changed by the filter
  • check whether variables lie within specified ranges
  • calculate and report ensemble statistics

An example of a relatively simple plug-in that prints statistics of temperature across the ensemble to screen:

from typing import Mapping, Any
import datetime
import numpy as np
import eatpy

class TestPlugin(eatpy.Plugin):
    def initialize(self, variables: Mapping[str, Any], ensemble_size: int):
        self.logger.info(f"The model state consists of the following variables: {', '.join(variables)}")
        self.temp = variables["temp"]

    def before_analysis(self, time: datetime.datetime, state: np.ndarray, iobs: np.ndarray, obs: np.ndarray, filter: eatpy.Filter):
        temp = self.temp["data"]  # shape: nmember x nz
        sd = np.std(temp, axis=0) # depth-explicit sd over the ensemble
        self.logger.info(f"{time} before analysis:")
        self.logger.info(f"  Temperature range: {temp.min()} - {temp.max()}")
        self.logger.info(f"  Temperature s.d. across the ensemble: {sd.min()} - {sd.max()}")

    def after_analysis(self, state: np.ndarray):
        temp = self.temp["data"]  # shape: nmember x nz
        sd = np.std(temp, axis=0)
        self.logger.info("after analysis:")
        self.logger.info(f"  Temperature range: {temp.min()} - {temp.max()}")
        self.logger.info(f"  Temperature s.d. across the ensemble: {sd.min()} - {sd.max()}")

The above methods in short:

  • initialize is called upon start-up with information about the variables that are part of the model state. This information includes metadata such as the variable's long name ("long_name") and units ("units"), as well as array slices for the variable in the originally forecasted state ("model_data"). After all plug-ins have initialized, it will also contain the state as presented to the filter ("data"). These array slices are 2D arrays for depth-dependent variables (shape: number of ensemble members × number of depth levels) and 1D arrays for depth-independent variables (shape: number of ensemble members). Since "data" is not available until after initialization, many plug-ins store the information for variables of interest (self.temp = variables["temp"] in the above), so that they can later access the variable data.
  • before_analysis is called after all ensemble members (GOTM instances) have updated their state by time-integrating to the time of the next observation(s). This forecast state is supplied by the state argument (shape: <NENSEMBLE>, <NSTATE>). The observed values are supplied as obs, along with their mapping to indices into the state array (iobs). Finally, the filter object gives access to PDAF settings.
  • after_analysis is called after the filter has updated the model state. This analysis state is supplied by the state argument. Between calls to initialize, before_analysis and after_analysis, you can store any information you like as part of your plug-in. This allows you, for instance, to store a copy of the forecast state or the observations for use during after_analysis, where they can be used to compute derived variables such as the innovation vector, as sketched below.
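For example, a plug-in that saves the forecast and the observations in before_analysis, and then reports the mean innovation and the mean state increment in after_analysis, could look like the following sketch (using the same before_analysis signature as the example above):

from typing import Mapping, Any
import datetime
import numpy as np
import eatpy

class Innovation(eatpy.Plugin):
    def initialize(self, variables: Mapping[str, Any], ensemble_size: int):
        pass  # nothing to set up for this plug-in

    def before_analysis(self, time: datetime.datetime, state: np.ndarray, iobs: np.ndarray, obs: np.ndarray, filter: eatpy.Filter):
        # keep copies of the forecast state and the observations for use after the analysis
        self.forecast = state.copy()
        self.iobs = iobs.copy()
        self.obs = obs.copy()

    def after_analysis(self, state: np.ndarray):
        # innovation: observations minus the ensemble-mean forecast at the observed locations
        innovation = self.obs - self.forecast[:, self.iobs].mean(axis=0)
        # increment: change in the ensemble-mean state due to the analysis
        increment = state.mean(axis=0) - self.forecast.mean(axis=0)
        self.logger.info(f"  mean innovation: {innovation.mean()}")
        self.logger.info(f"  mean state increment: {increment.mean()}")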

Plug-ins can either be included in your run script or be placed in a separate Python file (.py). Placing them in a separate file allows them to be used in multiple experiments. If the plug-in is included in the run script, it would be added to the experiment like this:

experiment.add_plugin(TestPlugin())

If the plug-in instead is placed in a separate file named plugin.py, you would add it like this:

import plugin
experiment.add_plugin(plugin.TestPlugin())

In either case, the add_plugin call should come before adding any observations to the experiment.

If you use multiple plugins, their before_analysis methods will be called in the order they were added to the experiment. This matters because individual plugins may modify the model state or the observations; the next plug-in in line then sees that modified state. The order of the plug-in calls is reversed for after_analysis. This means you could use an "inner" plugin (added last in the run script) to e.g. log-transform a state variable (i.e., the forecast state and observations are transformed in before_analysis; the transform is then reversed for the analysis state in after_analysis), while all other plugins see the untransformed values of the variable.
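For instance, using two of the standard plug-ins described in the next section:

# Spread is added first (the "outer" plug-in) and the Log transform last (the "inner"
# plug-in): Spread's before_analysis runs before the state is log-transformed, so it
# reports untransformed values, while the filter itself operates on log-transformed
# P1_chl. The reverse transform is applied before the analysis state is handed back
# to the other plug-ins and the model.
experiment.add_plugin(eatpy.plugins.report.Spread())
experiment.add_plugin(eatpy.plugins.transform.Log("P1_chl"))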

Example plug-ins

EAT includes several plug-ins:

  • eatpy.plugins.output.NetCDF writes the forecast and analysis states for all ensemble members to NetCDF. To instantiate this plugin, use eatpy.plugins.output.NetCDF("<FILEPATH>"), where <FILEPATH> is the location of the NetCDF file that you want to save the states to (the file name, optionally preceded by a directory path).
  • eatpy.plugins.transform.Log log-transforms one or more model variables, and any associated observations, before they are presented to the PDAF filter algorithm. The reverse transform is then applied to the updated state returned by the filter, so that the model always sees the untransformed values. To instantiate this plugin, use eatpy.plugins.transform.Log("<VARIABLE1>"[, "<VARIABLE2>"[, ...]]), where <VARIABLE1>, <VARIABLE2> etc. are the names of model variables you want to transform.
  • eatpy.plugins.select.Select selects a subset of state variables that will be updated by the data assimilation step. You can either name the variables to include (in which case all variables are excluded by default), or the variables to exclude (in which case all variables are included by default). The former is done with eatpy.plugins.select.Select(include=("<VARIABLE1>", "<VARIABLE2>", ...)), the latter with eatpy.plugins.select.Select(exclude=("<VARIABLE1>", "<VARIABLE2>", ...)). You can also use wildcards, e.g., eatpy.plugins.select.Select(include="P?_*").
  • eatpy.plugins.report.Spread reports the root-mean-square deviation for each forecasted variable.
  • eatpy.plugins.check.Finite verifies that the model state and observations that will be presented to the filter only contain finite values (as opposed to not-a-number or infinity).

To use one of the above standard plug-ins, for instance Select, add the following to your run script:

experiment.add_plugin(eatpy.plugins.select.Select(include=("var1", "var2", ..., )))

This must be done before adding observations.

You can build on the code from these examples to develop custom functionality. For instance, a plugin that limits the state update to a subset of variables, and log-transforms some of those, could look like:

from typing import Mapping, Any
import datetime
import numpy as np
import eatpy

# state variables that the DA scheme is allowed to manipulate
kept_vars = ["temp", "salt", "P1_chl", "P2_chl", "P3_chl", "P4_chl"]

# state variables that need to be log-transformed
logtransformed_vars = ["P1_chl", "P2_chl", "P3_chl", "P4_chl"]

class MyPlugin(eatpy.Plugin):
    def initialize(self, variables: Mapping[str, Any], ensemble_size: int):
        self.logvars = []
        for name in list(variables):
            if name not in kept_vars:
                del variables[name]
            elif name in logtransformed_vars:
                self.logvars.append(variables[name])

    def before_analysis(self, time: datetime.datetime, state: np.ndarray, iobs: np.ndarray, obs: np.ndarray, obs_sds: np.ndarray, filter: eatpy.Filter):
        for info in self.logvars:
            start, stop = info["start"], info["stop"]
            affected_obs = (iobs >= start) & (iobs < stop)
            if affected_obs.any():
                obs[affected_obs] = np.log10(obs[affected_obs])
            info["data"][...] = np.log10(info["data"])

    def after_analysis(self, state: np.ndarray):
        for info in self.logvars:
            info["data"][...] = 10. ** info["data"]