Skip to content

Commit

Permalink
Merge pull request #320 from NASA-Openscapes/r-how-to
Browse files Browse the repository at this point in the history
Start to update how to find data in R
  • Loading branch information
jules32 authored Apr 17, 2024
2 parents ef56723 + 21dfc60 commit 4d1210a
Showing 1 changed file with 227 additions and 14 deletions.
241 changes: 227 additions & 14 deletions how-tos/find-data/find-r.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -4,31 +4,244 @@ execute:
eval: false
---

*This is a work in progress, and may be split up into different modules/tutorials
as we continue to work on it.*

Here are our recommended approaches for finding data with R, from the command line or a notebook.

To find data in R, we'll also use the `earthaccess` python package - we can do so from R using the [`reticulate`](https://rstudio.github.io/reticulate/) package ([cheatsheet](https://www.rstudio.com/resources/cheatsheets/)). Note below that we import the python library as an R object we name `earthaccess`, as well as the `earthaccess$` syntax for accessing functions from the `earthaccess` library. The `granules` object has a list of JSON dictionaries with some extra dictionaries.
### Using the web interface

See [**How do I find data using Earthdata Search?**](earthdata_search.md) - in that tutorial, we found the dataset *ECCO Sea Surface Height - Monthly Mean 0.5 Degree (Version 4 Release 4)*, with the shortname `ECCO_L4_SSH_05DEG_MONTHLY_V4R4`.

### Searching programmatically

The NASA cloud data is searchable programmatically via two methods - NASA's own
search service, and the NASA Spatio-Temporal Asset Catalog (STAC) API. To find
data in R, we'll mainly rely on the
[rstac](https://brazil-data-cube.github.io/rstac/) package. This enables us
interact with the NASA [STAC](https://stacspec.org/en) API to search for our
data, and at the same time learn about STAC more generally. This will be useful
as it is a common standard for distributing spatial data.

We will also search for data using the
[NASA Earthdata search API](https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html),
which is the service that powers the
[Earthdata web search portal](https://search.earthdata.nasa.gov/search).

For both of these services, the
[earthdatalogin](https://boettiger-lab.github.io/earthdatalogin/) package will
help us get set up and provide a few handy functions to smooth some rough edges.

### Authentication for NASA Earthdata

An Earthdata Login account is required to access data from the NASA Earthdata
system. Please visit <https://urs.earthdata.nasa.gov> to register and manage
your Earthdata Login account. This account is free to create and only takes a
moment to set up.

Once you have created your Earthdata Login account, you can use the
[earthdatalogin](https://github.com/boettiger-lab/earthdatalogin/) R package to help you manage your authentication within R.

There is some functionality currently only available in the development version
of earthdatalogin, so we will install it from GitHub using the [pak](https://pak.r-lib.org/) package:

``` r
install.packages("pak")
pak::pak("boettiger-lab/earthdatalogin")
```

The easiest and most portable method of access is using the `netrc` basic
authentication protocol for HTTP. Call `edl_netrc()` to set this up given your
username and password. The easiest way to store your credentials for use in
`edl_netrc()` is to set them as environment variables in your `~/.Renviron` file.
The usethis package has a handy function to find and open this file for editing:

```r
# install.packages("usethis")
usethis::edit_r_environ()
```

Then add the following lines to the file, save the file, and restart R.

```
EARTHDATA_USER="your_user_name"
EARTHDATA_PASSWORD="your_password"
```

Now, when you call `edl_netrc()`, it will consult these environment variables to
create a `.netrc` file that will be used to authenticate with the NASA Earthdata
services. If you don't have your credentials saved as environment variables,
`earthdatalogin` will provide its own credentials, but you may experience rate
limits more readily. You can also manually type in your credentials to the `username`
and `password` arguments to `edl_netrc()` but this is not recommended as it
is too easy to accidentally share these in your code.

```{r}
library(earthdatalogin)
edl_netrc()
```

Once `edl_netrc()` has been called in your R session, then most spatial
packages in R can seamlessly access NASA Earthdata over HTTP links.

### Finding data in NASA STACs with rstac

All of the NASA STAC catalogues can be viewed here:
<https://radiantearth.github.io/stac-browser/#/external/cmr.earthdata.nasa.gov/stac/>.

We will use the rstac package to first browse the collections in the PO DAAC
catalogue (POCLOUD):

```{r}
## In R
## load R libraries
library(tidyverse) # install.packages("tidyverse")
library(reticulate) # install.packages("reticulate")
# install.packages("pak")
# pak::pak(c("tidyverse", "rstac", "boettiger-lab/earthdatalogin"))
library(rstac)
library(earthdatalogin)
po_collections <- stac("https://cmr.earthdata.nasa.gov/stac/POCLOUD/") |>
collections() |>
get_request()
po_collections
```

This only gives us the first 10 collections in the catalogue; to get the rest
we can use the `collections_fetch()` function:

```{r}
all_po_collections <- collections_fetch(po_collections)
all_po_collections
length(all_po_collections$collections)
head(all_po_collections$collections)
# Just look at the titles:
sapply(all_po_collections$collections, `[[`, "title")
# Get shortnames from the 'id' field and search for a match:
grep("ECCO_L4_SSH", sapply(all_po_collections$collections, `[[`, "id"), value = TRUE)
```

Once we have searched through the collections, we can choose the one we are
interested in and query it to get the items (granules) we are interested in:

```{r}
start <- "2015-01-01"
end <- "2015-12-31"
items <- stac("https://cmr.earthdata.nasa.gov/stac/POCLOUD") |>
stac_search(collections = "ECCO_L4_SSH_05DEG_MONTHLY_V4R4",
datetime = paste(start,end, sep = "/")) |>
post_request() |>
items_fetch()
items
```

There are 13 'items' representing the same 13 granules [we found when we searched using EarthData Search](https://search.earthdata.nasa.gov/search/granules?p=C1990404799-POCLOUD&pg%5B0%5D%5Bv%5D=f&pg%5B0%5D%5Bgsk%5D=-start_date&q=ECCO%20monthly%20SSH&qt=2015-01-01T00:00:00.000Z,2015-12-31T23:59:59.999Z&ff=Available%20in%20Earthdata%20Cloud&tl=1713291266.503!3!!&long=0.0703125).

We can see more details of the items (granules) by printing the `features` list
item:

## load python library
earthaccess <- reticulate::import("earthaccess")
```{r}
items$features
# And the urls:
edl_stac_urls(items)
```

## use earthaccess to access data # https://nsidc.github.io/earthaccess/tutorials/search-granules/
granules <- earthaccess$search_data(
doi = "10.5067/SLREF-CDRV3",
temporal = reticulate::tuple("2017-01", "2017-02") # with an earthaccess update, this can be simply c() or list()
### Finding data in NASA EarthData Search using earthdatalogin

Once we know the shortname of a collection (usually by looking in the EarthData Search portal), we can supply it to `edl_search()` to get the metadata and file urls of the individual granules:

```{r}
granules <- edl_search(
short_name = "ECCO_L4_SSH_05DEG_MONTHLY_V4R4",
temporal = c(start, end),
parse_results = FALSE
)
## Granules found: 72
granules
## exploring
granules # this is the result of the get request.
granules[[1]]
class(granules) # "list"
## granules <- reticulate::py_to_r(granules) # Object to convert is not a Python object
# See the granule titles
sapply(granules, `[`, "title")
# Note these are the same urls obtained via the rstac workflow demonstrated above
granule_urls <- edl_extract_urls(granules)
granule_urls
```

#### Accessing data using the `{terra}` package

We can read any of these urls using `terra::rast()`. We supply the `vsi = TRUE`
argument, which prepends `"/vsicurl/"` to the url, indicating that the file
should be opened as a "virtual" remote dataset. This allows random partial
reading of files without prior download of the entire file. This will vastly speed up
most operations as only the subset of the data required is ever actually downloaded.

```{r}
library(terra)
rast <- terra::rast(granule_urls[1], vsi = TRUE)
# This does not come with an extent and CRS embedded so we can supply it manually
granules[[1]]$boxes
ext(rast) <- c(-180, 180, -90, 90)
crs(rast) <- "EPSG:4326"
plot(rast[["SSH"]])
# We notice that this plot is upside-down - it is likely that the NetCDF file
# does not conform properly to the conventions. But we can flip it:
rast <- flip(rast, direction = "vertical")
plot(rast)
```

If we want to crop the raster, we can define an extent and crop it to that area.
Because we previously called `edl_netrc()`, we are not only authenticated with
the server so we can access the data, but we have also set up `terra` and `stars`
to us the underlying `GDAL` library to access the data in such a way that only \the subset of the data we have requested is actually downloaded.

```{r}
crop_box <- rast(extent = c(-150, -120, 35, 60), crs = "EPSG:4326")
rast_cropped <- crop(rast, crop_box)
plot(rast_cropped[["SSH"]])
```

#### Accessing data using the `{stars}` package

The `read_*` functions in stars do not have the `vsi` argument, but we can do
the same thing simply by prepending `"/vsicurl/"` to the url ourselves. Here
we will use the `read_mdim()` function.

```{r}
library(stars)
ssh_stars <- read_mdim(paste0("/vsicurl/", granule_urls[1]))
plot(ssh_stars)
```

We can again crop this using the same bounding box as we did with terra:

```{r}
st_crop(
ssh_stars,
st_bbox(c(xmin = -150, xmax = -120, ymin = 35, ymax = 60))
) |>
plot()
```



### Accessing data using gdalcubes

*Coming soon!*

0 comments on commit 4d1210a

Please sign in to comment.