Add codes to extract SoilGrids soil texture data and derive ensemble … #3406

Qianyuxuan · 2025-01-02T19:09:49Z

Add codes to extract SoilGrids soil texture data and derive ensemble soil parameter files

Description

Add a script "soilgrids_texture_extraction.R" that extracts and saves three types of soil texture data from SoilGrids250m, in parallel, for one or more user-defined lat/long site locations.
Add a script "soil_params_ensemble.R" that estimates soil parameters from the soil texture data and writes the ensemble parameter file paths into settings.
Modify "soil_utils.R" and "extract_soil_nc.R" to fix bugs in generating soil parameter files.
Modify "write.configs.SIPNET.R" to read soil water holding capacity and write it into the parameter settings.

Motivation and Context

Provide ensemble soil parameter files based on SoilGrids soil texture data for SDA.

Review Time Estimate

Immediately
Within one week
When possible

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My change requires a change to the documentation.
My name is in the list of CITATION.cff
I agree that PEcAn Project may distribute my contribution under any or all of
- the same license as the existing code,
- and/or the BSD 3-clause license.
I have updated the CHANGELOG.md.
I have updated the documentation accordingly.
I have read the CONTRIBUTING document.
I have added tests to cover my changes.
All new and existing tests passed.


mdietze · 2025-01-02T21:11:05Z

models/sipnet/R/write.configs.SIPNET.R

+  #working on reading soil file (only working for soilWHC)
+  if (length(settings$run$inputs$soilinitcond$path) > 0) {
+    soil_ICs_num <- length(settings$run$inputs$soilinitcond$path)
+    soil_ICs_path <- settings$run$inputs$soilinitcond$path[[sample(1:soil_ICs_num, 1)]]


I'm guessing the logic of this bit is that if an ensemble input is passed in then you choose one at random? Personally, I'd prefer to throw an error at this point, as by this point in the workflow ensemble inputs should have already been converted to specific inputs for each run. If they're still ensemble at this point it's an error and we should stop. Proceeding on with a random choice means (a) we don't catch an error and (b) we don't know/record which ensemble member mapped to which run.

Yeah, I followed the logic used for choosing one member from the ensemble initial conditions here: https://github.com/PecanProject/pecan/blob/develop/models/sipnet/R/write.configs.SIPNET.R#L546. So are you suggesting that we randomly choose and write one ensemble member in the script "soil_params_ensemble.R" and then read that specific path in "write.configs.SIPNET.R"?

So your soil_params_ensemble should generate a bunch of ensembles and load them ALL into the overall settings object. What I'm saying is that write.configs.SIPNET should only end up seeing ONE of those ensembles because run.write.configs should do the joint sampling of all the different types of ensembles (met, soils, IC, etc), record which ensemble member was assigned which inputs, and then ensure that the settings object reaching write.configs.SIPNET is only for a specific ensemble member. If you're finding that write.configs.SIPNET is being handed all of the soil ensemble members then there's something wrong upstream in the pecan.xml (e.g. not including soils in the section) or in run.write.configs

@mdietze It seems that "run.write.configs" does not play a role in our current SDA workflow, but I did find a way to sample the soil ensemble members through the code in "ensemble.R" by providing a "soilinitcond" tag under "ensemble" in the XML file: https://github.com/PecanProject/pecan/blob/develop/modules/uncertainty/R/ensemble.R#L401-L406. However, that only applies to a fresh run and does not seem to apply to a restart. The current restart components only include met inputs and parameters, as shown at https://github.com/PecanProject/pecan/blob/develop/modules/uncertainty/R/ensemble.R#L422C5-L424. What is your suggestion on that?

write.ensemble.configs is what's called by run.write.configs, so if that's what SDA is calling too we should be fine. But as noted earlier, it's important for the code to confirm that it's only receiving one ensemble member for each input, and to throw an error if it's not rather than resampling.

Restart is grabbing inputs and soils should be part of inputs (as should phenology, initial conditions, etc), not just met. You should verify this is working correctly, as it is critical that ensemble sampling is preserved (not repeated) both iteratively within an SDA run and across SDA runs (e.g., when running a forecast today that starts from yesterday's forecast). Resampling things (params, inputs, IC, etc) that have already been sampled is going to seriously mess up the covariance structures in our products.

@mdietze Based on the current argument passing for restart in "sda.enkf_MultiSite.R": https://github.com/PecanProject/pecan/blob/develop/modules/assim.sequential/R/sda.enkf_MultiSite.R#L358-L368, it seems that for inputs we currently grab only met, not soil, phenology, or IC (note that only "met" is specified in "input.ens.gen"). There is also a comment: "#TODO: incorporate Phyllis's restart work. #sample all inputs specified in the settings$ensemble not just met". So I guess that is what we should add here?
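The check requested earlier in this thread (stop if more than one soil ensemble member reaches write.configs, rather than sampling one at random) could look roughly like the sketch below. The helper name `get_single_soil_path` is hypothetical and not part of the PR; only the `settings$run$inputs$soilinitcond$path` structure is taken from the diff above.

```r
# Hypothetical helper (not in the PR): return the single soil IC path,
# or stop if ensemble members were not resolved upstream.
get_single_soil_path <- function(settings) {
  paths <- settings$run$inputs$soilinitcond$path

  # No soil input supplied: nothing to read
  if (length(paths) == 0) {
    return(NULL)
  }

  # More than one path means the ensemble was never mapped to a specific run
  if (length(paths) > 1) {
    stop(
      "write.configs received ", length(paths),
      " soil ensemble members; expected exactly one. ",
      "Ensemble inputs should be sampled and recorded upstream."
    )
  }

  paths[[1]]
}

# Usage with a mock settings list holding one resolved path
settings_ok <- list(run = list(inputs = list(soilinitcond = list(path = "soil_ens_3.nc"))))
soil_ICs_path <- get_single_soil_path(settings_ok)
```

Failing here keeps the mapping between ensemble member and run in one place (the upstream sampler), which is the concern raised about covariance structure.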

mdietze · 2025-01-02T21:15:48Z

models/sipnet/R/write.configs.SIPNET.R

+      soilWHC_0.8 <- unlist(soil_IC_list$vals["volume_fraction_of_water_in_soil_at_saturation"])[which(soil_IC_list$dims$depth == 0.8)] * 40
+      soilWHC_1.5 <- unlist(soil_IC_list$vals["volume_fraction_of_water_in_soil_at_saturation"])[which(soil_IC_list$dims$depth == 1.5)] * 100
+      # Calculate the soilWHC for the whole soil profile in cm
+      soilWHC_total <- soilWHC_0.025 + soilWHC_0.1 + soilWHC_0.225 + soilWHC_0.45 + soilWHC_0.8 + soilWHC_1.5


This bit is hard-coding the specific depths associated with a specific product and will fail if different depths are used (e.g. from a different product). The logic here should be easy to generalize to extracting the vector of soilWHC's and the vector of depths, then calculating the layer thicknesses, then performing a vector sum
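A minimal sketch of that generalization follows. It assumes the depth dimension stores the bottom of each layer in meters, with the first layer starting at 0; the function name `soil_whc_total_cm` and the example values are hypothetical. Note that the diff above appears to use layer midpoints (e.g. 0.8, 1.5), so the depths would need to be stored as bottoms for this to apply.

```r
# Hypothetical helper: total water holding capacity (cm) for a soil profile.
# whc_frac:       volumetric water fraction at saturation per layer (0-1)
# layer_bottom_m: depth of the bottom of each layer in meters (assumption)
soil_whc_total_cm <- function(whc_frac, layer_bottom_m) {
  stopifnot(length(whc_frac) == length(layer_bottom_m))

  # Sort layers from shallowest to deepest
  ord <- order(layer_bottom_m)

  # Layer thickness in cm; the first layer's top is 0 by definition
  thickness_cm <- diff(c(0, layer_bottom_m[ord])) * 100

  # Vector sum of fraction * thickness over all layers
  sum(whc_frac[ord] * thickness_cm)
}

# Usage with a mock object shaped like soil_IC_list in the diff
soil_IC_list <- list(
  vals = list(volume_fraction_of_water_in_soil_at_saturation = rep(0.4, 6)),
  dims = list(depth = c(0.05, 0.15, 0.30, 0.60, 1.00, 2.00))
)
whc <- unlist(soil_IC_list$vals["volume_fraction_of_water_in_soil_at_saturation"])
soilWHC_total <- soil_whc_total_cm(whc, soil_IC_list$dims$depth)
```

Because the thicknesses come from `diff()`, the same code works for any number of layers and any product's depth scheme.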

mdietze · 2025-01-02T21:23:27Z

models/sipnet/R/write.configs.SIPNET.R

-      #LitterWHC
-      #param[which(param[, 1] == "litterWHC"), 2] <- unlist(soil_IC_list$vals["volume_fraction_of_water_in_soil_at_saturation"])[1]*100
-    }
-    if("soil_hydraulic_conductivity_at_saturation"%in%names(soil_IC_list$vals)){


Modified code seems to lose the litter drainage rate. Also not sure why the litterWHC was commented out, but it doesn't seem like we want to lose that either.

mdietze · 2025-01-02T21:26:09Z

modules/data.land/R/extract_soil_nc.R

-                                      "soil_albedo","1"
+                                      "soil_albedo","1",
+                                      "slpotwp","1", #intermediate variable, meaningless and unit "1" is used for convenience
+                                      "slpotcp","1", #intermediate variable, meaningless and unit "1" is used for convenience


If these two are meaningless intermediate values, why are we adding them to the output?

Same applies to slcpd below, which is used to calculate soil_thermal_capacity, and slden, which seems to be an unused duplicate of soil_bulk_density. Indeed, in my original code all 4 variables were dropped from the output, and then someone later commented this out. I suspect the commented out part needs to be restored:

pecan/modules/data.land/R/soil_utils.R

Line 238 in 882766d

#mysoil$slcpd <- NULL

Yeah I agree we should restore the change. I added them to fix the bug due to the missing definition of these variables in "extract_soil_nc.R" but they are only meaningful internally.
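Restoring the drop could be as simple as the sketch below, which removes the four variables named in this thread from a list before it is written out. The mock `mysoil` contents are made up for illustration; only the variable names come from the discussion.

```r
# Mock soil parameter list; values are illustrative only
mysoil <- list(
  soil_bulk_density     = 1300,
  soil_thermal_capacity = 1.2e6,
  slpotwp               = -1.5,
  slpotcp               = -0.3,
  slcpd                 = 2.1e6,
  slden                 = 1300
)

# Intermediate variables that are only meaningful internally
intermediate <- c("slpotwp", "slpotcp", "slcpd", "slden")

# Drop whichever of them are present before writing the output
mysoil[intersect(names(mysoil), intermediate)] <- NULL
```

With these removed from the list, extract_soil_nc.R would not need unit definitions for them.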

mdietze · 2025-01-02T21:40:00Z

modules/data.land/R/soil_params_ensemble.R

+      samples <- MCMCpack::rdirichlet(10000, alpha) # Generate samples
+      sim_q5 <- apply(samples, 2, quantile, probs = 0.05, na.rm = TRUE)
+      sim_q50 <-apply(samples, 2, quantile, probs = 0.50, na.rm = TRUE)
+      sim_q95 <-apply(samples, 2, quantile, probs = 0.95, na.rm = TRUE)


would be simpler to do these three in one line by specifying probs = c(0.05, 0.5, 0.95).
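For illustration, a single `apply()` call with a vector of `probs` returns a matrix with one row per quantile. The sketch below draws Dirichlet samples with base R (normalized gamma draws) so that it runs without MCMCpack; the `alpha` values are hypothetical.

```r
set.seed(42)
n_draws <- 10000
alpha <- c(40, 20, 40)  # hypothetical Dirichlet parameters

# Dirichlet draws via normalized gamma variates (stand-in for MCMCpack::rdirichlet)
gam <- matrix(
  rgamma(n_draws * length(alpha), shape = rep(alpha, each = n_draws)),
  ncol = length(alpha)
)
samples <- gam / rowSums(gam)

# One call: rows are the 5%, 50%, and 95% quantiles, columns are the components
sim_q <- apply(samples, 2, quantile, probs = c(0.05, 0.5, 0.95), na.rm = TRUE)

# Individual quantiles can still be pulled out by row name if needed
sim_q5  <- sim_q["5%", ]
sim_q50 <- sim_q["50%", ]
sim_q95 <- sim_q["95%", ]
```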

mdietze · 2025-01-02T21:47:47Z

modules/data.land/R/soil_params_ensemble.R

+  if (alpha0 <= 0) {
+    stop("Estimated alpha0 is non-positive, which is invalid.")
+  }
+  alphas <- means * alpha0


Would be good to provide more comments in the code about the logic of what's being done here (both this specific line and the more general approach)
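As an example of the kind of commentary that would help, here is a commented sketch of one common way to arrive at `alphas <- means * alpha0`. How the PR actually estimates `alpha0` is not shown in this excerpt, so the method-of-moments step and the function name `estimate_dirichlet_alpha` are assumptions.

```r
# Hypothetical sketch: method-of-moments estimate of Dirichlet parameters.
# For X ~ Dirichlet(alpha) with alpha0 = sum(alpha):
#   E[X_i]   = alpha_i / alpha0
#   Var[X_i] = E[X_i] * (1 - E[X_i]) / (alpha0 + 1)
# so each component gives alpha0 = m_i * (1 - m_i) / v_i - 1.
estimate_dirichlet_alpha <- function(means, variances) {
  # Fractions must sum to 1 to be valid Dirichlet means
  means <- means / sum(means)

  # One alpha0 estimate per component, averaged into a single concentration
  alpha0 <- mean(means * (1 - means) / variances - 1)

  if (alpha0 <= 0) {
    stop("Estimated alpha0 is non-positive, which is invalid.")
  }

  # The means set the shape of the distribution; alpha0 sets how tight it is
  means * alpha0
}

# Usage: means and variances consistent with alpha = c(20, 10, 20)
m <- c(0.4, 0.2, 0.4)
v <- m * (1 - m) / 51
alphas <- estimate_dirichlet_alpha(m, v)
```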

mdietze · 2025-01-02T21:52:33Z

modules/data.land/R/soil_params_ensemble.R

+         # Generate the ensemble soil texture data based on the ensemble size (ens_n) defined in the settings
+         samples <- MCMCpack::rdirichlet(10000, alpha_est)
+         colnames(samples) <-c("fraction_of_sand_in_soil","fraction_of_clay_in_soil","fraction_of_silt_in_soil")
+         samples_ens_new <-list(samples[sample(1:10000, ens_n), ]) %>% setNames(depths)


I'm not following why you take more samples than you need, then subsample down to the number needed. Why not just ask rdirichlet to generate ens_n samples in the first place?
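Since Dirichlet draws are independent, drawing `ens_n` samples directly is statistically equivalent to drawing 10000 and subsampling. A sketch of the direct version follows, using a base-R stand-in (`rdirichlet_base`, a hypothetical helper) so it runs without MCMCpack; in the PR the equivalent call would presumably be `MCMCpack::rdirichlet(ens_n, alpha_est)`. The values of `ens_n` and `alpha_est` are made up.

```r
# Hypothetical base-R stand-in for MCMCpack::rdirichlet
rdirichlet_base <- function(n, alpha) {
  k <- length(alpha)
  # Gamma draws: column j uses shape alpha[j]
  gam <- matrix(rgamma(n * k, shape = rep(alpha, each = n)), nrow = n, ncol = k)
  # Normalizing each row to sum to 1 yields Dirichlet draws
  gam / rowSums(gam)
}

set.seed(1)
ens_n <- 25                 # ensemble size (illustrative)
alpha_est <- c(20, 10, 20)  # estimated Dirichlet parameters (illustrative)

# Draw exactly as many texture samples as there are ensemble members
samples_ens <- rdirichlet_base(ens_n, alpha_est)
colnames(samples_ens) <- c(
  "fraction_of_sand_in_soil",
  "fraction_of_clay_in_soil",
  "fraction_of_silt_in_soil"
)
```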

mdietze · 2025-01-02T21:57:36Z

modules/data.land/R/soil_utils.R

+    } else {
+      silt[i] <- 100. - sand[i] - clay[i]
+    }
+  }


simpler would have been silt = pmax(100. - sand - clay, 0)
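A quick illustration of the vectorized form, assuming the branch of the loop not shown in the diff sets silt to 0 when sand + clay exceeds 100 (the example values are made up):

```r
sand <- c(40, 70, 55)
clay <- c(20, 35, 45)

# Element-wise remainder, floored at 0 where sand + clay exceeds 100
silt <- pmax(100 - sand - clay, 0)
```

This replaces the per-element loop and if/else with one expression.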

mdietze · 2025-01-02T22:01:28Z

modules/data.land/R/soil_params_ensemble.R

+##'  @importFrom magrittr %>%
+##'  
+
+soil_params_ensemble <- function(settings,sand,clay,silt,outdir,write_into_settings=TRUE){


Conceptually, it would be nice to have a cleaner separation between things that are specific to one product and things that are general. Right now this function is specific to SoilGrids. I think you might want to either generalize the function a bit more (e.g. so that it would still work with a different soil map with different layer thicknesses and different data structures) or modify the function name to make it clear that it is only for this one product.

mdietze · 2025-01-02T22:03:56Z

modules/data.land/R/soil_params_ensemble.R

+      "fraction_of_clay_in_soil",
+      "fraction_of_silt_in_soil"))
+
+   # Substitute the depth range with the middle depth value in meter


Why middle? I'm pretty sure earlier soil code was working with the bottoms of each layer (with the assumption that the first layer's top is always at 0 by definition). Using the middles will generate ambiguity about layer thicknesses.
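To make the point concrete, a sketch of converting depth-range labels to layer bottoms in meters is below. It assumes labels of the form "0-5cm", as used for the SoilGrids depth intervals; the variable names are hypothetical.

```r
# Depth-range labels (assumed format: "<top>-<bottom>cm")
depth_ranges <- c("0-5cm", "5-15cm", "15-30cm", "30-60cm", "60-100cm", "100-200cm")

# Keep the bottom of each range and convert cm to m
layer_bottom_m <- as.numeric(sub("^[0-9]+-([0-9]+)cm$", "\\1", depth_ranges)) / 100

# With bottoms, thicknesses follow unambiguously (first layer's top is 0)
layer_thickness_m <- diff(c(0, layer_bottom_m))
```

Midpoints alone (0.025, 0.1, ...) do not determine thicknesses without also knowing where the profile ends, which is the ambiguity raised above.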

infotroph · 2025-01-06T13:33:20Z

modules/data.land/R/soilgrids_texture_extract.R

+  clay_data_url <-
+    "/vsicurl?max_retry=30&retry_delay=60&list_dir=no&url=https://files.isric.org/soilgrids/latest/data/clay/clay_"
+  silt_data_url <-
+    "/vsicurl?max_retry=30&retry_delay=60&list_dir=no&url=https://files.isric.org/soilgrids/latest/data/silt/silt_"


Maybe a bigger change than you're interested in doing here, but have you looked into grabbing these from the soilgrids Web Coverage Service? I noticed for the SOC function that the vsicurl approach winds up pulling the entire ~80MB map for every layer.

I think this is on purpose. I think at this point Cherry and Dongchen have found that grabbing the entire map is faster than trying to get an API to grab each point one-at-a-time. We're grabbing >6400 points for model execution and all of North America for spatial downscaling.

infotroph · 2025-01-06T13:53:39Z

modules/data.land/R/soil_params_ensemble.R

+  }
+
+  # A function to reformat the nested list as inputs to "soil2netcdf" function
+  reformat_soil_list <- function(samples_all_depth) {


Nit: these helper functions are long enough it's hard to scan for where their definitions stop and the parent function resumes. I'd find it more readable if they were defined at the bottom of this file outside of soil_params_ensemble().

Add codes to extract SoilGrids soil texture data and derive ensemble …

bbb795d

…soil parameter files

github-actions bot added Modules Models labels Jan 2, 2025

mdietze requested changes Jan 2, 2025


infotroph reviewed Jan 6, 2025



