Ouranosinc · LamAdr · Aug 7, 2024 · Jun 17, 2024
diff --git a/.zenodo.json b/.zenodo.json
@@ -140,6 +140,9 @@
  "affiliation": "Santander Meteorology Group, Instituto de Física de Cantabria (CSIC-UC), Santander, Spain",
  "orcid": "0000-0001-9053-2542"
  },
+ {
+ "name": "Lamarche, Adrien"
+ },
  {
  "name": "Wang, Hui-Min",
  "affiliation": "National University of Singapore, Singapore, Singapore",

diff --git a/AUTHORS.rst b/AUTHORS.rst
@@ -44,4 +44,5 @@ Contributors
 * Dante Castro <dante.castro@hereon.de> `@profesorpaiche <https://github.com/profesorpaiche>`_
 * Sascha Hofmann <sascha.hofmann@lobelia.earth> `@saschahofmann <https://github.com/saschahofmann>`_
 * Javier Diez-Sierra <javier.diez@unican.es> `@JavierDiezSierra <https://github.com/JavierDiezSierra>`_
-* Hui-Min Wang `@Hem-W <https://github.com/Hem-W>`
+* Hui-Min Wang `@Hem-W <https://github.com/Hem-W>`_
+* Adrien Lamarche `@LamAdr <https://github.com/LamAdr>`_
diff --git a/CHANGELOG.rst b/CHANGELOG.rst
@@ -4,12 +4,13 @@ Changelog
 
 v0.52.0 (unreleased)
 --------------------
-Contributors to this version: David Huard (:user:`huard`), Trevor James Smith (:user:`Zeitsperre`), Hui-Min Wang (:user:`Hem-W`), Éric Dupuis (:user:`coxipi`), Sarah Gammon (:user:`SarahG-579462`).
+Contributors to this version: David Huard (:user:`huard`), Trevor James Smith (:user:`Zeitsperre`), Hui-Min Wang (:user:`Hem-W`), Éric Dupuis (:user:`coxipi`), Sarah Gammon (:user:`SarahG-579462`), Adrien Lamarche (:user:`LamAdr`).
 
 New features and enhancements
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 * ``xclim.sdba.nbutils.quantile`` and its child functions are now faster. If the module `fastnanquantile` is installed, it is used as the backend for the computation of quantiles and yields even faster results. (:issue:`1255`, :pull:`1513`).
-* New multivariate bias adjustment class `MBCn`, giving a faster and more accurate implementation of the 'MBCn' algorithm (:issue:`1551`, :pull:`1580`).
+* New multivariate bias adjustment class ``MBCn``, giving a faster and more accurate implementation of the 'MBCn' algorithm. (:issue:`1551`, :pull:`1580`).
+* New multivariate bias adjustment classes ``OTC`` and ``dOTC``. (:pull:`1787`).
 
 Bug fixes
 ^^^^^^^^^

diff --git a/docs/notebooks/sdba.ipynb b/docs/notebooks/sdba.ipynb
@@ -344,6 +344,7 @@
  "metadata": {},
  "source": [
  "### Third example : Multi-method protocol - Hnilica et al. 2017\n",
+ "\n",
  "In [their paper of 2017](https://doi.org/10.1002/joc.4890), Hnilica, Hanel and Puš present a bias-adjustment method based on the principles of Principal Components Analysis.\n",
  "\n",
  "The idea is simple: use principal components to define coordinates on the reference and on the simulation, and then transform the simulation data from the latter to the former. Spatial correlation can thus be conserved by taking different points as the dimensions of the transform space. The method was demonstrated in the article by bias-adjusting precipitation over different drainage basins.\n",
@@ -439,9 +440,10 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "### Fourth example : Multivariate bias-adjustment (Cannon, 2018)\n",
+ "### Fourth example : Dynamical Optimal Transport Correction - Robin et al. 2019\n",
+ "Robin, Vrac, Naveau and Yiou presented the dOTC multivariate bias correction method in a [2019 paper](https://hess.copernicus.org/articles/23/773/2019/).\n",
  "\n",
- "This section replicates the \"MBCn\" algorithm described by [Cannon (2018)](https://doi.org/10.1007/s00382-017-3580-6). The method relies on some univariate algorithm, an adaption of the N-pdf transform of [Pitié et al. (2005)](https://ieeexplore.ieee.org/document/1544887/) and a final reordering step.\n",
+ "Here, we use optimal transport to find mappings between reference, simulated historical and simulated future data. Following these mappings, the future simulation is corrected by applying the temporal evolution of the model data to the reference.\n",
  "\n",
  "In the following, we use the Adjusted and Homogenized Canadian Climate Dataset ([AHCCD](https://open.canada.ca/data/en/dataset/9c4ebc00-3ea4-4fe0-8bf2-66cfe1cddd1d)) and CanESM2 data as reference and simulation, respectively, and correct both `pr` and `tasmax` together."
  ]
@@ -452,12 +454,14 @@
  "metadata": {},
  "outputs": [],
  "source": [
+ "import numpy as np\n",
+ "\n",
  "from xclim.core.units import convert_units_to\n",
  "from xclim.testing import open_dataset\n",
  "\n",
- "dref = open_dataset(\n",
- " \"sdba/ahccd_1950-2013.nc\", chunks={\"location\": 1}, drop_variables=[\"lat\", \"lon\"]\n",
- ").sel(time=slice(\"1981\", \"2010\"))\n",
+ "dref = open_dataset(\"sdba/ahccd_1950-2013.nc\", drop_variables=[\"lat\", \"lon\"]).sel(\n",
+ " time=slice(\"1981\", \"2010\")\n",
+ ")\n",
  "\n",
  "# Fix the standard name of the `pr` variable.\n",
  "# This allows the convert_units_to below to infer the correct CF transformation (precip rate to flux)\n",
@@ -468,12 +472,218 @@
  " tasmax=convert_units_to(dref.tasmax, \"K\"),\n",
  " pr=convert_units_to(dref.pr, \"kg m-2 s-1\"),\n",
  ")\n",
- "dsim = open_dataset(\n",
- " \"sdba/CanESM2_1950-2100.nc\", chunks={\"location\": 1}, drop_variables=[\"lat\", \"lon\"]\n",
- ")\n",
+ "dsim = open_dataset(\"sdba/CanESM2_1950-2100.nc\", drop_variables=[\"lat\", \"lon\"])\n",
  "\n",
  "dhist = dsim.sel(time=slice(\"1981\", \"2010\"))\n",
  "dsim = dsim.sel(time=slice(\"2041\", \"2070\"))\n",
+ "dref"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Here we correct precipitation multiplicatively so that it cannot become negative. In this context, small precipitation values can lead to huge aberrations, a problem that can be mitigated with `adapt_freq_thresh`. We also need to stack our variables into a `DataArray` before feeding them to `dOTC`.\n",
+ "\n",
+ "Since precipitation is treated multiplicatively, we have no choice but to use \"std\" for the `cov_factor` argument (the default), which means the rescaling of model data to the observed data scale is done independently for every variable. When all variables are additive, it is recommended to use the \"cholesky\" `cov_factor`, in which case the rescaling is done in a multivariate fashion."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "ref = dref.where(dref.location == \"Amos\", drop=True).squeeze()\n",
+ "hist = dhist.where(dhist.location == \"Amos\", drop=True).squeeze()\n",
+ "sim = dsim.where(dsim.location == \"Amos\", drop=True).squeeze()\n",
+ "\n",
+ "ref = sdba.processing.stack_variables(ref)\n",
+ "hist = sdba.processing.stack_variables(hist)\n",
+ "sim = sdba.processing.stack_variables(sim)\n",
+ "\n",
+ "# This function has random components\n",
+ "np.random.seed(0)\n",
+ "\n",
+ "# Contrary to most algorithms in sdba, dOTC has no `train` method\n",
+ "scen = sdba.adjustment.dOTC.adjust(\n",
+ " ref,\n",
+ " hist,\n",
+ " sim,\n",
+ " # Since this bias correction method is multivariate,\n",
+ " # `kind` must be specified per variable\n",
+ " kind={\"pr\": \"*\"},\n",
+ " adapt_freq_thresh={\"pr\": \"3.5e-4 kg m-2 s-1\"}, # Likewise specified per variable\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Some analysis figures\n",
+ "\n",
+ "# Unstack variables (the location was already selected above)\n",
+ "ref = sdba.processing.unstack_variables(ref)\n",
+ "hist = sdba.processing.unstack_variables(hist)\n",
+ "sim = sdba.processing.unstack_variables(sim)\n",
+ "scen = sdba.processing.unstack_variables(scen)\n",
+ "\n",
+ "fig = plt.figure(figsize=(10, 10))\n",
+ "gs = plt.matplotlib.gridspec.GridSpec(2, 2, fig)\n",
+ "ax_pr = plt.subplot(gs[0, 0])\n",
+ "ax_tasmax = plt.subplot(gs[0, 1])\n",
+ "ax_scatter = plt.subplot(gs[1, :])\n",
+ "\n",
+ "# Precipitation\n",
+ "hist.pr.plot(ax=ax_pr, color=\"c\", label=\"Simulation (past)\")\n",
+ "ref.pr.plot(ax=ax_pr, color=\"b\", label=\"Reference\", alpha=0.5)\n",
+ "sim.pr.plot(ax=ax_pr, color=\"y\", label=\"Simulation (future)\")\n",
+ "scen.pr.plot(ax=ax_pr, color=\"r\", label=\"Corrected\", alpha=0.5)\n",
+ "ax_pr.set_title(\"Precipitation\")\n",
+ "\n",
+ "# Maximum temperature\n",
+ "hist.tasmax.plot(ax=ax_tasmax, color=\"c\")\n",
+ "ref.tasmax.plot(ax=ax_tasmax, color=\"b\", alpha=0.5)\n",
+ "sim.tasmax.plot(ax=ax_tasmax, color=\"y\")\n",
+ "scen.tasmax.plot(ax=ax_tasmax, color=\"r\", alpha=0.5)\n",
+ "ax_tasmax.set_title(\"Maximum temperature\")\n",
+ "\n",
+ "# Scatter\n",
+ "ref.plot.scatter(x=\"tasmax\", y=\"pr\", ax=ax_scatter, color=\"b\", edgecolors=\"k\", s=20)\n",
+ "scen.plot.scatter(x=\"tasmax\", y=\"pr\", ax=ax_scatter, color=\"r\", edgecolors=\"k\", s=20)\n",
+ "sim.plot.scatter(x=\"tasmax\", y=\"pr\", ax=ax_scatter, color=\"y\", edgecolors=\"k\", s=20)\n",
+ "hist.plot.scatter(x=\"tasmax\", y=\"pr\", ax=ax_scatter, color=\"c\", edgecolors=\"k\", s=20)\n",
+ "ax_scatter.set_title(\"Variables distribution\")\n",
+ "\n",
+ "# Example mapping\n",
+ "max_time = scen.pr.idxmax().data\n",
+ "max_idx = np.where(scen.time.data == max_time)[0][0]\n",
+ "\n",
+ "scen_x = scen.tasmax.isel(time=max_idx)\n",
+ "scen_y = scen.pr.isel(time=max_idx)\n",
+ "sim_x = sim.tasmax.isel(time=max_idx)\n",
+ "sim_y = sim.pr.isel(time=max_idx)\n",
+ "\n",
+ "ax_scatter.scatter(scen_x, scen_y, color=\"r\", edgecolors=\"k\", s=30, linewidth=1)\n",
+ "ax_scatter.scatter(sim_x, sim_y, color=\"y\", edgecolors=\"k\", s=30, linewidth=1)\n",
+ "\n",
+ "prop = dict(arrowstyle=\"-|>,head_width=0.3,head_length=0.8\", facecolor=\"black\", lw=1)\n",
+ "ax_scatter.annotate(\"\", xy=(scen_x, scen_y), xytext=(sim_x, sim_y), arrowprops=prop)\n",
+ "\n",
+ "ax_pr.legend()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from scipy.stats import gaussian_kde\n",
+ "\n",
+ "fig = plt.figure(figsize=(10, 5))\n",
+ "gs = plt.matplotlib.gridspec.GridSpec(1, 2, fig)\n",
+ "\n",
+ "tasmax = plt.subplot(gs[0, 0])\n",
+ "pr = plt.subplot(gs[0, 1])\n",
+ "\n",
+ "sim_t = sim.tasmax.to_numpy()\n",
+ "scen_t = scen.tasmax.to_numpy()\n",
+ "stack = np.vstack([sim_t, scen_t])\n",
+ "z = gaussian_kde(stack)(stack)\n",
+ "idx = z.argsort()\n",
+ "sim_t, scen_t, z = sim_t[idx], scen_t[idx], z[idx]\n",
+ "tasmax.scatter(sim_t, scen_t, c=z, s=1, cmap=\"viridis\")\n",
+ "tasmax.set_title(\"Tasmax\")\n",
+ "tasmax.set_ylabel(\"scen tasmax\")\n",
+ "tasmax.set_xlabel(\"sim tasmax\")\n",
+ "\n",
+ "sim_p = sim.pr.to_numpy()\n",
+ "scen_p = scen.pr.to_numpy()\n",
+ "stack = np.vstack([sim_p, scen_p])\n",
+ "z = gaussian_kde(stack)(stack)\n",
+ "idx = z.argsort()\n",
+ "sim_p, scen_p, z = sim_p[idx], scen_p[idx], z[idx]\n",
+ "pr.scatter(sim_p, scen_p, c=z, s=1, cmap=\"viridis\")\n",
+ "pr.set_title(\"Pr\")\n",
+ "pr.set_ylabel(\"scen pr\")\n",
+ "pr.set_xlabel(\"sim pr\")\n",
+ "\n",
+ "fig.suptitle(\"Correlations between input and output per variable\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This last plot shows the correlation between input and output per variable. Here we see a relatively strong correlation for all variables, meaning they are all taken into account when finding the optimal transport mappings. This is because we're using the default `transform = 'max_distance'` argument. Were the data not transformed, the distances along the precipitation dimension would be very small relative to the temperature distances. Precipitation values would then be spread around at very low cost and have virtually no effect on the result. See this in action with `transform = None`.\n",
+ "\n",
+ "The chunks we see in the tasmax data are artefacts of the `bin_width`."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Fifth example : Multivariate bias-adjustment with multiple steps (Cannon, 2018)\n",
+ "\n",
+ "This section replicates the \"MBCn\" algorithm described by [Cannon (2018)](https://doi.org/10.1007/s00382-017-3580-6). The method relies on some univariate algorithm, an adaption of the N-pdf transform of [Pitié et al. (2005)](https://ieeexplore.ieee.org/document/1544887/) and a final reordering step.\n",
+ "\n",
+ "As in the dOTC example, we use the Adjusted and Homogenized Canadian Climate Dataset ([AHCCD](https://open.canada.ca/data/en/dataset/9c4ebc00-3ea4-4fe0-8bf2-66cfe1cddd1d)) and CanESM2 data as reference and simulation, respectively, and correct both `pr` and `tasmax` together. This time, we chunk our data with Dask."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "dref[\"pr\"] = dref.pr.chunk({\"location\": 1})\n",
+ "dref[\"tasmax\"] = dref.tasmax.chunk({\"location\": 1})\n",
+ "\n",
+ "dhist[\"pr\"] = dhist.pr.chunk({\"location\": 1})\n",
+ "dhist[\"tasmax\"] = dhist.tasmax.chunk({\"location\": 1})\n",
+ "\n",
+ "dsim[\"pr\"] = dsim.pr.chunk({\"location\": 1})\n",
+ "dsim[\"tasmax\"] = dsim.tasmax.chunk({\"location\": 1})"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "##### Perform an initial univariate adjustment."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# additive for tasmax\n",
+ "QDMtx = sdba.QuantileDeltaMapping.train(\n",
+ " dref.tasmax, dhist.tasmax, nquantiles=20, kind=\"+\", group=\"time\"\n",
+ ")\n",
+ "# Adjust both hist and sim, we'll feed both to the Npdf transform.\n",
+ "scenh_tx = QDMtx.adjust(dhist.tasmax)\n",
+ "scens_tx = QDMtx.adjust(dsim.tasmax)\n",
+ "\n",
+ "# remove == 0 values in pr:\n",
+ "dref[\"pr\"] = sdba.processing.jitter_under_thresh(dref.pr, \"0.01 mm d-1\")\n",
+ "dhist[\"pr\"] = sdba.processing.jitter_under_thresh(dhist.pr, \"0.01 mm d-1\")\n",
+ "dsim[\"pr\"] = sdba.processing.jitter_under_thresh(dsim.pr, \"0.01 mm d-1\")\n",
+ "\n",
+ "# multiplicative for pr\n",
+ "QDMpr = sdba.QuantileDeltaMapping.train(\n",
+ " dref.pr, dhist.pr, nquantiles=20, kind=\"*\", group=\"time\"\n",
+ ")\n",
+ "# Adjust both hist and sim, we'll feed both to the Npdf transform.\n",
+ "scenh_pr = QDMpr.adjust(dhist.pr)\n",
+ "scens_pr = QDMpr.adjust(dsim.pr)\n",
  "\n",
  "# Stack variables : Dataset -> DataArray with `multivar` dimension\n",
  "dref, dhist, dsim = (sdba.stack_variables(da) for da in (dref, dhist, dsim))"
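
The dOTC description in the notebook cell above (find mappings between reference, historical simulation and future simulation, then apply the model's temporal evolution to the reference) can be illustrated with a deliberately simplified one-variable sketch. This is not the xclim implementation: it assumes equal-size samples, uses the fact that 1-D optimal transport between equal-size samples pairs sorted values, and ignores `cov_factor`, `bin_width`, `transform` and the random components. The function names `sorted_pairing` and `dotc_1d_sketch` are hypothetical.

```python
def sorted_pairing(source, target):
    """Optimal 1-D matching of equal-size samples: pair values by rank.

    Returns a list `match` where match[i] is the index in `target`
    paired with source[i].
    """
    src_order = sorted(range(len(source)), key=source.__getitem__)
    tgt_order = sorted(range(len(target)), key=target.__getitem__)
    match = [0] * len(source)
    for s, t in zip(src_order, tgt_order):
        match[s] = t
    return match


def dotc_1d_sketch(ref, hist, sim):
    """Toy, deterministic 1-D illustration of the dOTC idea (additive case)."""
    # 1. Map each reference value to a historical simulated value.
    ref_to_hist = sorted_pairing(ref, hist)
    # 2. Map each historical simulated value to a future simulated value:
    #    this is the temporal evolution of the model.
    hist_to_sim = sorted_pairing(hist, sim)
    # 3. Apply that evolution to the reference to estimate future "observations".
    ref_future = [
        r + (sim[hist_to_sim[h]] - hist[h]) for r, h in zip(ref, ref_to_hist)
    ]
    # 4. Map the future simulation onto the estimated future observations.
    sim_to_ref_future = sorted_pairing(sim, ref_future)
    return [ref_future[j] for j in sim_to_ref_future]


ref = [1.0, 2.0, 3.0]
hist = [2.0, 4.0, 6.0]
sim = [7.0, 3.0, 5.0]  # the model warms by +1 at every rank
scen = dotc_1d_sketch(ref, hist, sim)
print(scen)
```

The corrected series keeps the time ordering of `sim` while taking its values from the reference shifted by the model's change. In the multivariate case the rank pairing no longer applies, which is where an optimal transport solver comes in.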

diff --git a/docs/references.bib b/docs/references.bib
@@ -2152,3 +2152,54 @@ @article{droogers2002
  url = {https://www.scopus.com/inward/record.uri?eid=2-s2.0-0036464359&doi=10.1023%2fA%3a1015508322413&partnerID=40&md5=7322aaa4c6874878f5b1dab3c73c1718},
  type = {Article}
 }
+
+@article{robin_2019,
+ author = {Robin, Y. and Vrac, M. and Naveau, P. and Yiou, P.},
+ title = {Multivariate stochastic bias corrections with optimal transport},
+ journal = {Hydrology and Earth System Sciences},
+ volume = {23},
+ year = {2019},
+ number = {2},
+ pages = {773--786},
+ url = {https://hess.copernicus.org/articles/23/773/2019/},
+ doi = {10.5194/hess-23-773-2019}
+}
+
+@misc{robin_2021,
+ title = {{SBCK}: {Statistical} {Bias} {Correction} {Kit}},
+ copyright = {GPL-3},
+ shorttitle = {{SBCK}},
+ url = {https://github.com/yrobink/SBCK-python},
+ urldate = {2024-07-03},
+ author = {Robin, Yoann},
+ year = {2021},
+}
+
+@article{higham_1988,
+ title = {Computing a nearest symmetric positive semidefinite matrix},
+ journal = {Linear Algebra and its Applications},
+ volume = {103},
+ pages = {103--118},
+ year = {1988},
+ issn = {0024-3795},
+ doi = {10.1016/0024-3795(88)90223-6},
+ url = {https://www.sciencedirect.com/science/article/pii/0024379588902236},
+ author = {Nicholas J. Higham},
+ abstract = {The nearest symmetric positive semidefinite matrix in the Frobenius norm to an arbitrary real matrix A is shown to be (B + H)/2, where H is the symmetric polar factor of $B = (A + A^{T})/2$. In the 2-norm a nearest symmetric positive semidefinite matrix, and its distance $\delta_2(A)$ from A, are given by a computationally challenging formula due to Halmos. We show how the bisection method can be applied to this formula to compute upper and lower bounds for $\delta_2(A)$ differing by no more than a given amount. A key ingredient is a stable and efficient test for positive definiteness, based on an attempted Choleski decomposition. For accurate computation of $\delta_2(A)$ we formulate the problem as one of zero finding and apply a hybrid Newton-bisection algorithm. Some numerical difficulties are discussed and illustrated by example.}
+}
+
+@article{knol_1989,
+ title = "Least-squares approximation of an improper correlation matrix by a proper one",
+ abstract = "An algorithm is presented for the best least-squares fitting correlation matrix approximating a given missing value or improper correlation matrix. The proposed algorithm is based upon a solution for Mosier's oblique Procrustes rotation problem offered by ten Berge and Nevels. A necessary and sufficient condition is given for a solution to yield the unique global minimum of the least-squares function. Empirical verification of the condition indicates that the occurrence of non-optimal solutions with the proposed algorithm is very unlikely. A possible drawback of the optimal solution is that it is a singular matrix of necessity. In cases where singularity is undesirable, one may impose the additional nonsingularity constraint that the smallest eigenvalue of the solution be δ, where δ is an arbitrary small positive constant. Finally, it may be desirable to weight the squared errors of estimation differentially. A generalized solution is derived which satisfies the additional nonsingularity constraint and also allows for weighting. The generalized solution can readily be obtained from the standard “unweighted singular” solution by transforming the observed improper correlation matrix in a suitable way.",
+ keywords = "Missing value correlation, indefinite correlation matrix, IR-85889, tetrachoric correlation, constrained least-squares approximation",
+ author = "Knol, {Dirk L.} and {ten Berge}, {Jos M.F.}",
+ year = "1989",
+ doi = "10.1007/BF02294448",
+ language = "Undefined",
+ volume = "54",
+ pages = "53--61",
+ journal = "Psychometrika",
+ issn = "0033-3123",
+ publisher = "Springer",
+ number = "1",
+}
diff --git a/pyproject.toml b/pyproject.toml
@@ -110,7 +110,7 @@ docs = [
  "sphinxcontrib-bibtex",
  "sphinxcontrib-svg2pdfconverter[Cairosvg]"
 ]
-extras = ["fastnanquantile"]
+extras = ["fastnanquantile", "POT"]
 all = ["xclim[dev]", "xclim[docs]", "xclim[extras]"]
 
 [project.scripts]
@@ -162,7 +162,7 @@ values = [
 
 [tool.codespell]
 skip = 'xclim/data/*.json,docs/_build,docs/notebooks/xclim_training/*.ipynb,docs/references.bib,__pycache__,*.gz,*.nc,*.png,*.svg,*.whl'
-ignore-words-list = "absolue,astroid,bloc,bui,callendar,degreee,environnement,hanel,inferrable,lond,nam,nd,ressources,socio-economic,sie,vas"
+ignore-words-list = "absolue,astroid,bloc,bui,callendar,degreee,environnement,hanel,inferrable,lond,nam,nd,ot,ressources,socio-economic,sie,vas"
 
 [tool.coverage.run]
 relative_files = true
@@ -176,6 +176,7 @@ pep621_dev_dependency_groups = ["all", "dev", "docs"]
 [tool.deptry.package_module_name_map]
 "scikit-learn" = "sklearn"
 "pyyaml" = "yaml"
+"POT" = "ot"
 
 [tool.deptry.per_rule_ignores]
 DEP001 = ["SBCK"]
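
The `"POT" = "ot"` mapping is needed because the distribution is published as `POT` but imported as `ot`. Since the diff puts it in the optional `extras` group, code that relies on it may want to fail with a clear message when it is missing. A minimal sketch of such a guard, assuming nothing about how xclim actually handles the import (`has_module` and `require_pot` are hypothetical helper names):

```python
from importlib.util import find_spec


def has_module(module_name: str) -> bool:
    """Return True if a top-level module can be imported, without importing it."""
    return find_spec(module_name) is not None


def require_pot():
    """Import and return the `ot` module, or raise a helpful error."""
    if not has_module("ot"):
        raise ImportError(
            "The optimal transport adjustments need the POT package, "
            "which is imported as `ot`. Install it with `pip install POT` "
            "or `pip install xclim[extras]`."
        )
    import ot  # the import name differs from the distribution name

    return ot


print(has_module("ot"))
```

Checking with `find_spec` first keeps the failure message tied to the distribution name that users have to install, not the module name that appears in the traceback.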