diff --git a/how-tos/find-data/find-r.qmd b/how-tos/find-data/find-r.qmd index 82e937de4..19ad26ac7 100644 --- a/how-tos/find-data/find-r.qmd +++ b/how-tos/find-data/find-r.qmd @@ -4,31 +4,244 @@ execute: eval: false --- +*This is a work in progress, and may be split up into different modules/tutorials +as we continue to work on it.* + Here are our recommended approaches for finding data with R, from the command line or a notebook. -To find data in R, we'll also use the `earthaccess` python package - we can do so from R using the [`reticulate`](https://rstudio.github.io/reticulate/) package ([cheatsheet](https://www.rstudio.com/resources/cheatsheets/)). Note below that we import the python library as an R object we name `earthaccess`, as well as the `earthaccess$` syntax for accessing functions from the `earthaccess` library. The `granules` object has a list of JSON dictionaries with some extra dictionaries. +### Using the web interface + +See [**How do I find data using Earthdata Search?**](earthdata_search.md) - in that tutorial, we found the dataset *ECCO Sea Surface Height - Monthly Mean 0.5 Degree (Version 4 Release 4)*, with the shortname `ECCO_L4_SSH_05DEG_MONTHLY_V4R4`. + +### Searching programmatically + +The NASA cloud data is searchable programmatically via two methods - NASA's own +search service, and the NASA Spatio-Temporal Asset Catalog (STAC) API. To find +data in R, we'll mainly rely on the +[rstac](https://brazil-data-cube.github.io/rstac/) package. This enables us +interact with the NASA [STAC](https://stacspec.org/en) API to search for our +data, and at the same time learn about STAC more generally. This will be useful +as it is a common standard for distributing spatial data. + +We will also search for data using the +[NASA Earthdata search API](https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html), +which is the service that powers the +[Earthdata web search portal](https://search.earthdata.nasa.gov/search). + +For both of these services, the +[earthdatalogin](https://boettiger-lab.github.io/earthdatalogin/) package will +help us get set up and provide a few handy functions to smooth some rough edges. + +### Authentication for NASA Earthdata + +An Earthdata Login account is required to access data from the NASA Earthdata +system. Please visit to register and manage +your Earthdata Login account. This account is free to create and only takes a +moment to set up. + +Once you have created your Earthdata Login account, you can use the +[earthdatalogin](https://github.com/boettiger-lab/earthdatalogin/) R package to help you manage your authentication within R. + +There is some functionality currently only available in the development version +of earthdatalogin, so we will install it from GitHub using the [pak](https://pak.r-lib.org/) package: + +``` r +install.packages("pak") +pak::pak("boettiger-lab/earthdatalogin") +``` + +The easiest and most portable method of access is using the `netrc` basic +authentication protocol for HTTP. Call `edl_netrc()` to set this up given your +username and password. The easiest way to store your credentials for use in +`edl_netrc()` is to set them as environment variables in your `~/.Renviron` file. +The usethis package has a handy function to find and open this file for editing: ```r +# install.packages("usethis") +usethis::edit_r_environ() +``` + +Then add the following lines to the file, save the file, and restart R. + +``` +EARTHDATA_USER="your_user_name" +EARTHDATA_PASSWORD="your_password" +``` + +Now, when you call `edl_netrc()`, it will consult these environment variables to +create a `.netrc` file that will be used to authenticate with the NASA Earthdata +services. If you don't have your credentials saved as environment variables, +`earthdatalogin` will provide its own credentials, but you may experience rate +limits more readily. You can also manually type in your credentials to the `username` +and `password` arguments to `edl_netrc()` but this is not recommended as it +is too easy to accidentally share these in your code. + +```{r} +library(earthdatalogin) + +edl_netrc() +``` + +Once `edl_netrc()` has been called in your R session, then most spatial +packages in R can seamlessly access NASA Earthdata over HTTP links. + +### Finding data in NASA STACs with rstac + +All of the NASA STAC catalogues can be viewed here: +. + +We will use the rstac package to first browse the collections in the PO DAAC +catalogue (POCLOUD): + +```{r} ## In R ## load R libraries -library(tidyverse) # install.packages("tidyverse") -library(reticulate) # install.packages("reticulate") +# install.packages("pak") +# pak::pak(c("tidyverse", "rstac", "boettiger-lab/earthdatalogin")) + +library(rstac) +library(earthdatalogin) + +po_collections <- stac("https://cmr.earthdata.nasa.gov/stac/POCLOUD/") |> + collections() |> + get_request() + +po_collections +``` + +This only gives us the first 10 collections in the catalogue; to get the rest +we can use the `collections_fetch()` function: + +```{r} +all_po_collections <- collections_fetch(po_collections) +all_po_collections + +length(all_po_collections$collections) + +head(all_po_collections$collections) + +# Just look at the titles: +sapply(all_po_collections$collections, `[[`, "title") + +# Get shortnames from the 'id' field and search for a match: +grep("ECCO_L4_SSH", sapply(all_po_collections$collections, `[[`, "id"), value = TRUE) +``` + +Once we have searched through the collections, we can choose the one we are +interested in and query it to get the items (granules) we are interested in: + +```{r} +start <- "2015-01-01" +end <- "2015-12-31" + +items <- stac("https://cmr.earthdata.nasa.gov/stac/POCLOUD") |> + stac_search(collections = "ECCO_L4_SSH_05DEG_MONTHLY_V4R4", + datetime = paste(start,end, sep = "/")) |> + post_request() |> + items_fetch() + +items +``` + +There are 13 'items' representing the same 13 granules [we found when we searched using EarthData Search](https://search.earthdata.nasa.gov/search/granules?p=C1990404799-POCLOUD&pg%5B0%5D%5Bv%5D=f&pg%5B0%5D%5Bgsk%5D=-start_date&q=ECCO%20monthly%20SSH&qt=2015-01-01T00:00:00.000Z,2015-12-31T23:59:59.999Z&ff=Available%20in%20Earthdata%20Cloud&tl=1713291266.503!3!!&long=0.0703125). + +We can see more details of the items (granules) by printing the `features` list +item: -## load python library -earthaccess <- reticulate::import("earthaccess") +```{r} +items$features + +# And the urls: +edl_stac_urls(items) +``` -## use earthaccess to access data # https://nsidc.github.io/earthaccess/tutorials/search-granules/ -granules <- earthaccess$search_data( - doi = "10.5067/SLREF-CDRV3", - temporal = reticulate::tuple("2017-01", "2017-02") # with an earthaccess update, this can be simply c() or list() +### Finding data in NASA EarthData Search using earthdatalogin + +Once we know the shortname of a collection (usually by looking in the EarthData Search portal), we can supply it to `edl_search()` to get the metadata and file urls of the individual granules: + +```{r} +granules <- edl_search( + short_name = "ECCO_L4_SSH_05DEG_MONTHLY_V4R4", + temporal = c(start, end), + parse_results = FALSE ) -## Granules found: 72 +granules -## exploring -granules # this is the result of the get request. +granules[[1]] -class(granules) # "list" -## granules <- reticulate::py_to_r(granules) # Object to convert is not a Python object +# See the granule titles +sapply(granules, `[`, "title") +# Note these are the same urls obtained via the rstac workflow demonstrated above +granule_urls <- edl_extract_urls(granules) +granule_urls ``` + +#### Accessing data using the `{terra}` package + +We can read any of these urls using `terra::rast()`. We supply the `vsi = TRUE` +argument, which prepends `"/vsicurl/"` to the url, indicating that the file +should be opened as a "virtual" remote dataset. This allows random partial +reading of files without prior download of the entire file. This will vastly speed up +most operations as only the subset of the data required is ever actually downloaded. + +```{r} +library(terra) + +rast <- terra::rast(granule_urls[1], vsi = TRUE) + +# This does not come with an extent and CRS embedded so we can supply it manually +granules[[1]]$boxes +ext(rast) <- c(-180, 180, -90, 90) +crs(rast) <- "EPSG:4326" + +plot(rast[["SSH"]]) + +# We notice that this plot is upside-down - it is likely that the NetCDF file +# does not conform properly to the conventions. But we can flip it: +rast <- flip(rast, direction = "vertical") +plot(rast) +``` + +If we want to crop the raster, we can define an extent and crop it to that area. +Because we previously called `edl_netrc()`, we are not only authenticated with +the server so we can access the data, but we have also set up `terra` and `stars` +to us the underlying `GDAL` library to access the data in such a way that only \the subset of the data we have requested is actually downloaded. + +```{r} +crop_box <- rast(extent = c(-150, -120, 35, 60), crs = "EPSG:4326") + +rast_cropped <- crop(rast, crop_box) + +plot(rast_cropped[["SSH"]]) +``` + +#### Accessing data using the `{stars}` package + +The `read_*` functions in stars do not have the `vsi` argument, but we can do +the same thing simply by prepending `"/vsicurl/"` to the url ourselves. Here +we will use the `read_mdim()` function. + +```{r} +library(stars) +ssh_stars <- read_mdim(paste0("/vsicurl/", granule_urls[1])) + +plot(ssh_stars) +``` + +We can again crop this using the same bounding box as we did with terra: + +```{r} +st_crop( + ssh_stars, + st_bbox(c(xmin = -150, xmax = -120, ymin = 35, ymax = 60)) +) |> + plot() +``` + + + +### Accessing data using gdalcubes + +*Coming soon!*