forked from ciesin-geospatial/TOPSTSCHOOL-water
-
Notifications
You must be signed in to change notification settings - Fork 0
/
wsim-gldas-acquisition.qmd
296 lines (213 loc) · 19.1 KB
/
wsim-gldas-acquisition.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
---
title: "Acquiring and Pre-Processing the WSIM-GLDAS Dataset"
author:
- "Josh Brinks"
- "Elaine Famutimi"
date: "April 6, 2024"
bibliography: wsim-gldas-references.bib
---
## Overview
In this lesson, you will acquire the data set called **Water Security Indicator Model - Global Land Data Assimilation System(WSIM-GLDAS)** from the [NASA Socioeconomic Data and Applications Center (SEDAC)](https://sedac.ciesin.columbia.edu/) website and will also retrieve a global administrative boundary data set called [geoBoundaries](https://www.geoboundaries.org/api.html) data directly from an application programming interface (API). You will learn how to subset the data for a region of interest and save the result data as a new file.
## Learning Objectives
After completing this lesson, you should be able to:
- Retrieve WSIM-GLDAS data from SEDAC.
- Retrieve Administrative Boundary using the geoBoundaries API.
- Subset WSIM-GLDAS data for a region and time period of interest.
- Visualize geospatial data to highlight precipitation deficit patterns.
- Write a pre-processed NetCDF-formatted file to disk.
## Introduction
The water cycle is the constant process of circulation of water on, above, and under the Earth's surface [@NOAA2019]. Human activities produce greenhouse gas emissions, land use changes, dam and reservoir development, and groundwater extraction which have affected the natural water cycle in recent decades [@intergovernmentalpanelonclimatechange2023]. The influence of these human activities on the water cycle have consequential impacts on oceanic, groundwater, and land processes, influencing phenomena such as droughts and floods [@Zhou2016].
Precipitation deficits can lead to drought, characterized by prolonged periods of little to no rainfall and resulting water shortages. Droughts often trigger environmental stresses and can create cycles of reinforcement, impacting both ecosystems and people [@Rodgers2023]. For example, while California frequently experiences drought, the combination of prolonged dry spells and sustained high temperatures prevented the replenishment of cool fresh water to the Klamath river, which led to severe water shortages in 2003 and again from 2012 to 2014. Additonally, the Central Valley is an agricultural area that grows almonds, one of California’s most important crops, with the state producing 80% of the world’s almonds. These severe droughts, coupled with competition for limited fresh water resources, resulted in declining populations of [Chinook salmon](https://www.fisheries.noaa.gov/species/chinook-salmon) due to heat stress and gill rot disease disrupting the food supply for Klamath basin tribal groups [@guillen2002; @Bland2014].
![](images/watercycle_rc.png)[^1]
[^1]: Photo Credit, Dennis Cain/NWS.
::: column-margin
::: {.callout-tip style="color: #5a7a2b;"}
## Data Science Review
A [raster](https://docs.qgis.org/2.18/en/docs/gentle_gis_introduction/raster_data.html) is a type of geographic data in image format which has numerical information stored in each pixel. (Rasters are often referred to as grids because of their regularly-shaped matrix data structure.) Rasters can store many types of information, and they usually have dimensions that include latitude, longitude, and time. NetCDF is one format for raster data; others include Geotiff, ASCII and many more. Several raster formats like NetCDF can store multiple raster layers, or a "raster stack," which can useful for storing and analyzing a series of rasters.
:::
:::
The **Water Security (WSIM-GLDAS) Monthly Grids, v1 (1948 - 2014)** dataset can be download from the [NASA SEDAC](https://sedac.ciesin.columbia.edu/data/set/water-wsim-gldas-v1) website [@isciences2022]. The dataset abstract describes these data saying that WSIM-GLDAS “identifies and characterizes surpluses and deficits of freshwater, and the parameters determining these anomalies, at monthly intervals over the period January 1948 to December 2014.”
Downloads are organized by a combination of thematic variables (composite surplus/deficit, temperature, PETmE, runoff, soil moisture, precipitation) and integration periods (a temporal aggregation) (1, 3, 6, 12 months). Each variable-integration combination consists of a **NetCDF raster** (.nc) file ( with a time dimension that contains a raster layer for each of the 804 months between January, 1948 and December, 2014. Some variables also contain multiple attributes each with their own time series. Hence, this is a large file that can take a lot of time to download and may cause computer memory issues on certain systems. This is considered BIG data.
## Acquiring the Data
::: {.callout-tip style="color: #5a7a2b;"}
## Data Science Review
The **Water Security (WSIM-GLDAS) Monthly Grids dataset** used in this lesson is hosted by [NASA's Socioeconomic Data and Applications Center (SEDAC](https://sedac.ciesin.columbia.edu/)), one of several [Distributed Active Archive Centers (DAACs)](https://www.earthdata.nasa.gov/eosdis/daacs). SEDAC hosts a variety of data products including geospatial population data, human settlements and infrastructure, exposure and vulnerability to climate change, and satellite-based data on land use, air, and water quality. In order to download data hosted by SEDAC, you are required to have a free NASA EarthData account. You can create an account here: [NASA EarthData](https://urs.earthdata.nasa.gov/users/new).
:::
For this lesson, we will work with the WSIM-GLDAS data set **Composite Anomaly Twelve-Month Return Period** NetCDF file. This represents the variable "Composite Anomaly" for the integration period of twelve-month. Let's download the file directly from the SEDAC website. The [data set documentation](https://sedac.ciesin.columbia.edu/downloads/docs/water/water-wsim-gldas-v1-documentation.pdf) describes the composite variables as key features of WSIM-GLDAS which combine “the return periods of multiple water parameters into composite indices of overall water surpluses and deficits [@isciences2022a]”. The composite anomaly files represent these model outputs in terms of the rarity of their return period, or how often they occur. Please go ahead and download the file.
- First, go to the SEDAC website at <https://sedac.ciesin.columbia.edu/>. You can explore the website by themes, data sets, or collections. We will use the search bar at the top to search for "water security wsim". Find and click on the Water Security (WSIM-GLDAS) Monthly Grids, v1 (1948–2014) data set. Take a moment to review the dataset's overview and documentation pages.
- When you're ready, click on the Data Download tab. You will be asked to sign in using your NASA EarthData account.
- Find the Composite Class, and find and click on the Variable **Composite Anomaly Twelve-Month Return Period**.
## Reading the Data
::: column-margin
::: {.callout-tip style="color: #5a7a2b;"}
## Data Science Review
This lesson uses the [`stars`](https://r-spatial.github.io/stars/), [`sf`](https://r-spatial.github.io/sf/), [`lubridate`](https://lubridate.tidyverse.org/), and [cubelyr](https://cran.r-project.org/web/packages/cubelyr/index.html) packages. Make sure they are installed before you begin working with the code in this document. If you'd like to learn more about the functions used in this lesson you can use the help guides on their package websites.
:::
:::
Once you have downloaded the file to your local computer, install and load the R packages required for this exercise. This is accomplished by defining the list of packages and assigning them to the new variable called “packages_to_check”. Next we loop (iterate) through each of the packages in the list to see if they are already installed. If they are we continue to the next item, and if they aren’t then we go ahead and install them.
```{r eval=FALSE}
packages_to_check <- c("stars", "sf", "lubridate", "cubelyr")
# Check and install packages
for (package_name in packages_to_check) {
if (!package_name %in% rownames(installed.packages())) {
install.packages(package_name)
cat(paste("Package", package_name, "installed.\n"))
} else {
cat(paste("Package", package_name, "is already installed.\n"))
}
library(package_name, character.only = TRUE)
}
```
Once you've completed the download and placed the .nc file into your working directory read the file with the `stars::read_stars()` function.
```{r}
# read in the 12 month integration WSIM-GLDAS file with stars
wsim_gldas_anoms <- stars::read_stars("composite_12mo.nc", proxy = TRUE)
# check the basic info
print(wsim_gldas_anoms)
```
Initializing R (reading) the file with the *argument* `proxy = TRUE` allows you to inspect the basic elements of the file without reading the whole file into memory. Multidimensional raster datasets can be extremely large and bring your computing environment to a halt if you have memory limitations.
Now we can use the *print* command to view basic information. The output tells us we have 5 attributes (deficit, deficit_cause, surplus, surplus_cause, both) and 3 dimensions. The first 2 dimensions are the spatial extents (x/y–longitude/latitude) and time is the 3rd dimension.
This means that the total number of individual raster layers in this NetCDF is 4020 (5 attributes x 804 time steps/months). Again, BIG data.
## Attribute Selection
The WSIM-GLDAS data is quite large with many variables available. We can manage this large file by selecting a single variable; in this case “deficit” (drought). Read the data back in; this time with `proxy = FALSE` and only selecting the deficit layer.
```{r}
# read in just the deficit layer using proxy = false
wsim_gldas_anoms <- stars::read_stars("composite_12mo.nc", sub = 'deficit', proxy = FALSE)
```
## Time Selection
Specifying a temporal range of interest will make the file size smaller and therefore more manageable. We’ll select every year for the range 2000-2014. This can be accomplished by generating a sequence for every year between December 2000 and December 2014, and then passing that list of dates to `filter`.
```{r}
# generate a vector of dates for subsetting
keeps<-seq(lubridate::ymd("2000-12-01"),
lubridate::ymd("2014-12-01"),
by = "year")
# filter the dataset using the vector of dates
wsim_gldas_anoms <- dplyr::filter(wsim_gldas_anoms, time %in% keeps)
# re-check the basic info
print(wsim_gldas_anoms)
```
Now we're down to a single attribute ("deficit") with 15 time steps. We can take a look at the first 6 years by passing the object through `slice` and then into `plot`.
```{r warning=FALSE}
wsim_gldas_anoms |>
# slice out the first 6 time steps
dplyr::slice(index = 1:6, along = "time") |>
# plot them with some test breaks for the legend
plot(key.pos = 1, breaks = c(0, -5, -10, -20, -30, -50), key.lab = "Deficit")
```
Although we have now reduced the data to a single attribute with a restricted time of interest, we can take it a step further and limit the spatial extent to a country or state of interest.
## Spatial Selection
::: column-margin
::: {.callout-tip style="color: #5a7a2b;"}
## Data Science Review
[GeoJSON](https://geojson.org/) is a format for encoding, storing and sharing geographic data. It's commonly used in web mapping applications to represent geographic features with their associated attributes. GeoJSON files are easily readable by both humans and computers, making them a popular choice for sharing geographic data over the internet.
:::
:::
We can spatially crop the raster stack with a few different methods. Options include using a bounding box in which the outer geographic coordinates (xmin, ymin, xmax, ymax) are specified, or using another raster object, or a vectorized boundary like a shapefile or GeoJSON to set the clipping extent.
::: column-margin
::: {.callout-tip style="color: #5a7a2b;"}
## Data Science Review
Built by the community and [William & Mary geoLab](https://github.com/wmgeolab), the geoBoundaries Global Database of Political Administrative Boundaries is an online, open license (CC BY 4.0 / ODbL) resource of information on administrative boundaries (i.e., state, county) for every country in the world. Since 2016, this project has tracked approximately 1 million spatial units within more than 200 entities, including all UN member states.
:::
:::
In this example we use a vector boundary to accomplish the geoprocessing task of clipping the data to an administrative or political unit. First we acquire the data in GeoJSON format for the United States from the geoBoundaries API. (Note it is also possible to download the vectorized boundaries directly from <https://www.geoboundaries.org/> in lieu of using the API).
To use the geoBoundaries’ API, the root URL below is modified to include a 3 letter code from the International Standards Organization used to identify countries (ISO3), and an administrative level for the data request. Administrative levels correspond to geographic units such as the Country (administrative level 0), the State/Province (administrative level 1), the County/District (administrative level 2) and so on:
"https://www.geoboundaries.org/api/current/gbOpen/**ISO3**/**LEVEL**/"
For this example we adjust the bolded components of the sample URL address below to specify the country we want using the ISO3 Character Country Code for the United States (**USA**) and the desired Administrative Level of State (**ADM1**).
```{r}
# make the request to geoboundarie's website for the USA boundaries
usa <- httr::GET("https://www.geoboundaries.org/api/current/gbOpen/USA/ADM1/")
```
In the line of code above, we used a function called httr:GET to obtain metadata from the URL. We assign the result to a new variable called “usa”. Next we will examine the `content`.
```{r}
# parse the content into a readable format
usa <- httr::content(usa)
# look at the labels for available information
names(usa)
```
The parsed content contains 32 components. Item 29 is a direct link to the GeoJSON file (gjDownloadURL) representing the vector boundary data. Next we will obtain those vectors and visualize the results.
```{r}
# directly read in the geojson with sf from the geoboundaries server
usa <- sf::st_read(usa$gjDownloadURL)
# check the visuals
plot(sf::st_geometry(usa))
```
Upon examination, shown in the image above, one sees that it includes all US states and overseas territories. For this demonstration, we can simplify it to the contiguous United States. (Of course, it could also be simplified to other areas of interest simply by adapting the code below.)
We first create a list of the geographies we wish to remove and assign them to a variable called “drops”. Next, we reassign our “usa” variable to include only those geographies in the continental US and finally, we plot the results.
```{r}
# create a vector of territories we don't want in our CONUSA boundary
drops<-
c("Alaska", "Hawaii",
"American Samoa",
"Puerto Rico",
"Commonwealth of the Northern Mariana Islands",
"Guam",
"United States Virgin Islands")
# select all the states and territories not in the above list
usa<-usa[!(usa$shapeName %in% drops),]
# check the visuals
plot(sf::st_geometry(usa))
```
We can take this a step further and select a single state for analysis. Here we use a slightly different method by creating a new variable called “texas” by calling the state out by name.
```{r}
# extract just texas from the CONUSA boundary
texas <- usa[usa$shapeName == "Texas",]
# check the visuals
plot(sf::st_geometry(texas))
```
From here we can clip the WSIM-GLDAS raster stack by indexing it with the stored boundary of Texas.
::: {.callout-tip style="color: #5a7a2b;"}
## Drought in the News
Texas experienced a severe drought in 2011 that caused rivers to dry up and lakes to reach historic low levels [@StateImpact]. The drought was further exacerbated by high temperatures related to climate change in February of 2013. Climate experts discovered that the drought was produced by “La Niña”, a weather pattern that causes the surface temperature of the Pacific Ocean to be cooler than normal. This, in turn, creates drier and warmer weather in the southern United States. La Niña can occur for a year or more, and returns once every few years [@NOAA2023].
It is estimated that the drought cost farmers and ranchers about \$8 billion in losses. Furthermore, the dry conditions fueled a series of wildfires across the state in early September of 2011, the most devastating of which occurred in Bastrop County, where 34,000 acres and 1,300 homes were destroyed [@Roeseler2011].
:::
```{r}
# crop the wsim-gldas file to the extent of te texas boundary
wsim_gldas_anoms_tex <- wsim_gldas_anoms[texas]
```
Finally, we visualize the last time-step in the WSIM-GLDAS dataset (15/December, 2014) and render it with an overlay of the Texas boundary, using the following steps.
```{r warning = FALSE}
# slive out the first 15 timesteps of wsim-gldas and plot them
wsim_gldas_anoms_tex |>
dplyr::slice(index = 15, along = "time") |>
plot(reset = FALSE, breaks = c(0, -5, -10, -20, -30, -50))
# add the texas boundary on top
plot(sf::st_geometry(texas),
add = TRUE,
lwd = 3,
fill = NA,
border = 'purple')
```
At this point, you may want to ask, does the data look plausible? That is, are the values being rendered in your map of interest? This simple check is helpful to make sure your subsetting has worked as expected. (You will want to use other methods to systematically evaluate the data.) If the results are acceptable, the subsetted dataset may be written to disk as a NetCDF file, and saved for future modules.
```{r}
# write the processed wsim-gldas file to disk as a netcdf
stars::write_mdim(wsim_gldas_anoms_tex, "wsim_gldas_tex.nc")
```
The size of the pre-processed dataset is 1.6 MB compared to the original dataset of 1.7 GB. This is much more manageable in cloud environments, workshops, and git platforms.
If you want to share an image of the plot that you created you can save it to your computer as a .png file. You can open the `png()` device, producing the plot, and closing the device with `dev.off()`:
```{r eval=FALSE}
# Save the map plot as a PNG file
# Specify file name and dimensions
png("map_plot.png", width = 4, height = 4, units="in", res=300)
wsim_gldas_anoms_tex |>
dplyr::slice(index = 15, along = "time") |>
plot(reset = FALSE, breaks = c(0, -5, -10, -20, -30, -50))
plot(sf::st_geometry(texas),
add = TRUE,
lwd = 3,
fill = NA,
border = 'purple')
#close png() device
dev.off()
```
Once you run this code you can find the file in the file location… This allows you to share your findings.
## In this Lesson, You Learned...
Congratulations! Now you should be able to:
- Navigate the SEDAC website to find and download datasets.\
- Access administrative boundaries from geoBoundaries data using API.
- Temporally subset a NetCDF raster stack using R packages such as dplyr and lubridate.
- Crop a NetCDF raster stack with a spatial boundary.
- Write a subsetted dataset to disk and create an image to share results.
## Lesson 2
In the next lesson, we will create more advanced visualizations and extract data of interest.
[Lesson 2: WSIM-GLDAS Visualizations and Data Extraction](https://ciesin-geospatial.github.io/TOPSTSCHOOL-module-1-water/wsim-gldas-vis.html){.btn .btn-primary .btn role="button"}
# References