---
title: "Water Security Indicator Model - Global Land Data Assimilation System (WSIM-GLDAS) Dataset Exploration and Visualizations"
author: 
  - "Joshua Brinks"
  - "Elaine Famutimi"
date: "April 6, 2024"
bibliography: wsim-gldas-references.bib
---

## Overview

In our previous lesson, *Acquiring and Pre-Processing the Water Security Indicator Model - Global Land Data Assimilation System* (WSIM-GLDAS) Dataset, we downloaded components of WSIM-GLDAS from SEDAC, subset the data using vector boundaries from geoBoundaries, performed a visual check, and wrote the file to disk. In this lesson, we will extend our work with WSIM-GLDAS by introducing an additional integration (temporal aggregation) period, calculating simple summary statistics, integrating WSIM-GLDAS with the Gridded Population of the World (GPW) data, and developing more complex visualizations.

## Learning Objectives

After completing this lesson, you should be able to:

-   Subset the WSIM-GLDAS raster data for a region and time period of interest.
-   Perform visual exploration with histograms.
-   Integrate gridded population with WSIM-GLDAS data to perform analyses and construct visualizations.
    -   Make choropleth maps visualizing WSIM-GLDAS data by administrative vector boundaries.
    -   Summarize WSIM-GLDAS and population raster data using zonal statistics.

## Introduction

::: column-margin
::: {.callout-tip style="color: #5a7a2b;"}
## Coding Review

This lesson uses the [`stars`](https://r-spatial.github.io/stars/), [`sf`](https://r-spatial.github.io/sf/), [`dplyr`](https://dplyr.tidyverse.org/), [`lubridate`](https://lubridate.tidyverse.org/), [`exactextractr`](https://isciences.gitlab.io/exactextractr/), [`ggplot2`](https://ggplot2.tidyverse.org/), [`terra`](https://rspatial.org/pkg/), [`data.table`](https://rdatatable.gitlab.io/data.table/), [`httr`](https://httr.r-lib.org/), and [`kableExtra`](https://cran.r-project.org/web/packages/kableExtra/index.html) packages. If you'd like to learn more about the functions used in this lesson, you can use the help guides on their package websites.
:::
:::

## Load Data

We’ll begin with the **WSIM-GLDAS Composite Anomaly Twelve-Month Return Period** file from SEDAC (an anomaly is defined as a rare observation) and subset it spatially to the continental United States (CONUSA), which reduces our memory footprint. We can further reduce memory overhead by reading in only `deficit`, the variable we want to analyze, rather than the entire NetCDF file with all of its variables.

```{r warning=FALSE}
# read in the wsim-gldas layer from SEDAC
wsim_gldas <- stars::read_stars("composite_12mo.nc", proxy = FALSE, sub = 'deficit')
# check the basic info.
print(wsim_gldas)
```

For this exercise, we want to explore droughts in the continental United States for the years 2000-2014. In the 12-month return period dataset, each monthly time step is an average of the previous 12 months (a moving average). Therefore, to view a snapshot of drought for a given calendar year, we use the December time step, which covers the 12 months of only that calendar year. We can create a dataset of annual snapshots, starting with December 2000 and ending with December 2014, using the `ymd()` function of the *lubridate* package and the `filter()` function of the *dplyr* package, as demonstrated below.

```{r warning=FALSE}
# generate a vector of dates for subsetting
keeps<-seq(lubridate::ymd("2000-12-01"),
           lubridate::ymd("2014-12-01"), 
           by = "year")

#change data type to POSIXct
keeps <- as.POSIXct(keeps)

# filter using that vector
wsim_gldas <- dplyr::filter(wsim_gldas, time %in% keeps)
print(wsim_gldas)
```
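
To see why the December time step can stand in for a calendar year, the sketch below builds a trailing 12-month mean from made-up monthly values (not WSIM-GLDAS data) with base R's `stats::filter()`. The `demo_` names are hypothetical and exist only for this illustration.

```{r}
# toy monthly series: year 1 holds 1..12, year 2 is a constant 10
demo_monthly <- c(1:12, rep(10, 12))
# trailing 12-month mean; sides = 1 uses the current value and the 11 before it
demo_trailing <- stats::filter(demo_monthly, rep(1 / 12, 12), sides = 1)
# the December values (positions 12 and 24) each cover exactly one calendar year
as.numeric(demo_trailing[c(12, 24)])
# compare with the plain calendar-year means
c(mean(demo_monthly[1:12]), mean(demo_monthly[13:24]))
# one December per year from 2000 through 2014
length(seq(as.Date("2000-12-01"), as.Date("2014-12-01"), by = "year"))
```

Any other month in the trailing series mixes values from two calendar years, which is why only the December steps are kept.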

Next, we can clip the WSIM-GLDAS dataset using the USA country boundary from geoBoundaries. As in lesson 1, we acquire the boundary vector data using the geoBoundaries API.

```{r warning=FALSE}
#directly acquire the boundary from geoBoundaries API
# request the usa data and metadata
usa <- httr::GET("https://www.geoboundaries.org/api/current/gbOpen/USA/ADM1/")
# parse the content of the request to find the geojson download link
usa <- httr::content(usa)
# read in the geojson directly from geoBoundaries 
usa <- sf::st_read(usa$gjDownloadURL)

# remove everything not part of CONUSA
drops<-
  c("Alaska", "Hawaii", 
    "American Samoa",
    "Puerto Rico",
    "Commonwealth of the Northern Mariana Islands", 
    "Guam", 
    "United States Virgin Islands")
usa<-usa[!(usa$shapeName %in% drops),]

# clip wsim-gldas to the CONUSA boundary, then rename
# the time dimension to something more friendly
wsim_gldas<-wsim_gldas[usa] |>
  stars::st_set_dimensions("time", values = as.character(seq(2000,2014)))
```

In the preprocessing step of the data science lifecycle, it is important to periodically check the results. Now we’ll verify the pre-processing steps with the `print()` function. Note that the raster object now contains 15 timesteps from December 2000 to December 2014.

```{r}
# check the basic information again
print(wsim_gldas)
```

You will want to review the printout to make sure it looks okay.

-   Does it contain the variables you were expecting?

-   Do the values for the variables seem plausible?
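
One way to answer these questions in code is to summarize the cell values directly. The sketch below uses a small made-up vector in place of real raster values. The expectation that return periods fall between -60 and 60 is an assumption based on the palette breaks used in this lesson, and the `demo_` names are hypothetical.

```{r}
# made-up cell values standing in for a raster layer (NA = no data)
demo_values <- c(-60, -47, -12.5, -3.2, 0.4, NA)
# smallest and largest observed values
range(demo_values, na.rm = TRUE)
# share of cells with no data
mean(is.na(demo_values))
# are all non-missing values inside the assumed plausible range?
all(abs(demo_values) <= 60, na.rm = TRUE)
```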

Other basic descriptive analyses are useful to verify and understand your data. One of these is to produce a frequency distribution (also known as a histogram), which is reviewed below.

## Annual CONUSA Time Series

The statistical properties reviewed in the previous step are useful for exploratory data analysis, but we should also inspect the data’s spatial characteristics. We can start our visual exploration of annual drought in the CONUSA by creating a map visualization depicting the deficit return period for each of the years in the subset dataset we loaded in the previous step.

```{r warning = FALSE, message = FALSE}
# load the base data of the usa boundary
ggplot2::ggplot(usa)+
  # plot the stars wsim object
  stars::geom_stars(data = wsim_gldas)+
  # set equal coordinates for axes
  ggplot2::coord_equal()+
  # create multiple panels for each time step
  ggplot2::facet_wrap(~time)+
  # plot the usa boundary with just the outline
  ggplot2::geom_sf(fill = NA)+
  # add the palette for wsim-gldas 
  ggplot2::scale_fill_stepsn(
    colors = c(
    '#9B0039',
    # -50 to -40
    '#D44135',
    # -40 to -20
    '#FF8D43',
    # -20 to -10
    '#FFC754',
    # -10 to -5
    '#FFEDA3',
    # -5 to -3
    '#FFF4C7',
    # -3 to 0
    '#FFFFFF'), 
    # set the breaks on the palette
    breaks = c(-60, -50, -40, -20, -10,-5,-3, 0))+
  # add plot labels
  ggplot2::labs(
    title="Annual Mean Deficit Anomalies for the CONUSA",
    subtitle = "Using Observed 12 Month Integrated WSIM-GLDAS Data for 2000-2014",
    fill = "Deficit Return Period"
  )+
  # set the minimal theme
  ggplot2::theme_minimal()+
  # turn off some extra graphical elements
  ggplot2::theme(
    axis.title.x=ggplot2::element_blank(),
    axis.text.x=ggplot2::element_blank(),
    axis.ticks.x=ggplot2::element_blank(),
    axis.title.y=ggplot2::element_blank(),
    axis.text.y=ggplot2::element_blank(),
    axis.ticks.y=ggplot2::element_blank())
```

This visualization shows several significant drought events (as indicated by return-period values) throughout 2000-2014: the Southeast in 2000, the Southwest in 2002, the majority of the western third of the country in 2007, Texas-Oklahoma in 2011, Montana-Wyoming-Colorado in 2012, and the entirety of the California coast in 2014. The droughts of 2011 and 2012 are particularly severe and widespread, with return periods greater than 50 years covering multiple states. Based on historical norms, we should expect droughts this strong only once every 50-60 years!

## Monthly Time Series

We can get a more detailed look at these drought events by using the 1-month composite WSIM-GLDAS dataset and clipping the data to a smaller spatial extent. Let’s examine the 2014 California drought.

::: {.callout-tip style="color: #5a7a2b;"}
## Drought in the News

The California drought of 2012-2014 was the worst in 1,200 years [@WHOI2014]. This drought caused problems for homeowners, and even conflicts between farmers and wild salmon! Governor Jerry Brown declared a drought emergency and called on residents to reduce water use by 20%. Even so, water use went up by 8% in May of 2014 compared to 2013 in places like coastal California and Los Angeles. Due to the water shortages, the state voted to fine water-wasters up to $500. The drought also affected residents differently based on economic status. For example, in El Dorado County, a rural area east of Sacramento, residents were taking bucket showers, and rural residents reported that the wells they rely on for fresh water were drying up. The federal government eventually announced $9.7 million in emergency drought aid for those areas [@Sanders2014].

:::

To limit the amount of computing memory required for the operation, we will first clear items from the in-memory workspace and then load a smaller composite file. We’ll start by removing the 12-month composite object.

```{r}
# remove the large wsim object from the environment
rm(wsim_gldas)
```

Now let’s load into the in-memory workspace the composite 1-month file from SEDAC.

```{r}
# clear the memory
gc()
# read in the 1 month wsim-gldas data
wsim_gldas_1mo <- stars::read_stars("composite_1mo.nc", sub = 'deficit', proxy = FALSE)
# check the basic info
print(wsim_gldas_1mo)
```

Once again, we'll subset the time dimension for our period of interest. This time we want every month for 2014.

```{r}
# generate a vector of dates for subsetting
keeps<-seq(lubridate::ymd("2014-01-01"),
           lubridate::ymd("2014-12-01"), 
           by = "month")
#change data type to POSIXct
keeps <- as.POSIXct(keeps)
# filter using that vector
wsim_gldas_1mo <- dplyr::filter(wsim_gldas_1mo, time %in% keeps)
# check the info
print(wsim_gldas_1mo)
```

Now we have 12 rasters with monthly data for 2014. Let's zoom in on California and see how this drought progressed over the course of the year.

```{r warning = FALSE, message = FALSE}
# isolate only the california border
california<-usa[usa$shapeName=="California",]
# subset wsim-gldas to the extent of the california boundary
wsim_gldas_california <- wsim_gldas_1mo[california]
# give the time dimension pretty labels
wsim_gldas_california <-
  wsim_gldas_california |>
    stars::st_set_dimensions("time", values = month.name)

# monthly plots of california
# load the base california data
ggplot2::ggplot(california)+
  # load the wsim stars object
  stars::geom_stars(data = wsim_gldas_california)+
  # equal aspect ratio for lat/long axes
  ggplot2::coord_equal()+
  # create a subplot for each time step
  ggplot2::facet_wrap(~time)+
  # only plot the outline of california
  ggplot2::geom_sf(fill = NA)+
  # set the wsim palette
  ggplot2::scale_fill_stepsn(
    colors = c(
    '#9B0039',
    # -50 to -40
    '#D44135',
    # -40 to -20
    '#FF8D43',
    # -20 to -10
    '#FFC754',
    # -10 to -5
    '#FFEDA3',
    # -5 to -3
    '#FFF4C7',
    # -3 to 0
    '#FFFFFF'), 
    # set the wsim breaks
    breaks = c(-60, -50, -40, -20, -10,-5,-3, 0))+
  # add labels
  ggplot2::labs(
    title="Deficit Anomalies for California",
    subtitle = "Using Observed Monthly WSIM-GLDAS Data for 2014",
    fill = "Deficit Return Period"
  )+
  # start with a base minimal theme
  ggplot2::theme_minimal()+
  # remove additional elements to make a cleaner map
  ggplot2::theme(
    axis.title.x=ggplot2::element_blank(),
    axis.text.x=ggplot2::element_blank(),
    axis.ticks.x=ggplot2::element_blank(),
    axis.title.y=ggplot2::element_blank(),
    axis.text.y=ggplot2::element_blank(),
    axis.ticks.y=ggplot2::element_blank())
```

This series of maps shows a startling picture. California faced massive water deficits throughout the state in January and February. This was followed by water deficits in the western half of the state in May-August. Although northern and eastern California saw some relief by September, southwest California continued to see deficits through December.

## Monthly Histograms

::: column-margin
::: {.callout-tip style="color: #5a7a2b;"}
A [data frame](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/data.frame) is a data structure used for storing tabular data. It organizes data in rows and columns. Each column can have a different type of data (numeric, character, factor, etc.), and rows represent individual observations or cases. Data frames provide a convenient way to work with structured data, making them essential for data analysis and statistics projects.
:::
:::

We explore the data further by creating a frequency distribution (also called a histogram) of the deficit anomalies for any given spatial extent; here we are still looking at the distributions in California. We extract the data from the raster time series and create a data frame of values that are easier to manipulate into a histogram. [R data frames](https://www.w3schools.com/r/r_data_frames.asp) are data displayed in table format, which can be plotted on graphs or charts.

```{r}
# extract the raster values into a data frame
deficit_hist <-
  wsim_gldas_california |>
  as.data.frame()

# remove the NA values
deficit_hist<-stats::na.omit(deficit_hist)

# create the histogram
ggplot2::ggplot(deficit_hist, ggplot2::aes(deficit))+
  # style the bars
  ggplot2::geom_histogram(binwidth = 6, fill = "#325d88")+
  # subplot for each timestep
  ggplot2::facet_wrap(~time)+
  # use the minimal theme
  ggplot2::theme_minimal()
```

This starts to quantify what our eyes were telling us with the time series maps. Whereas the maps show where the deficits occur, the frequency distribution shows how many raster cells fall within each range of deficit return periods. The number of raster cells under a 60-year deficit (return period) is very high in most months, far exceeding any other value in the range.
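
The counting behind a histogram can be sketched in base R with `cut()` and `table()`. The values below are made up, and the bin edges (6 units wide, starting at -60) are an assumption that will not necessarily match the edges `geom_histogram()` chooses. The `demo_` names are hypothetical.

```{r}
# made-up deficit return periods for eight raster cells
demo_cells <- c(-60, -60, -60, -58, -41, -22, -7, -1)
# assign each cell to a bin that is 6 units wide
demo_bins <- cut(demo_cells, breaks = seq(-60, 0, by = 6), include.lowest = TRUE)
# count the cells in each bin; these counts are the bar heights
table(demo_bins)
```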

## Zonal Summaries

The previous section looked more closely at the 2014 California drought, examining the state as a whole. Although we can get a sense of what’s happening in different cities or counties by looking at the maps, the maps do not provide quantitative summaries of those local areas.

Zonal statistics are one way to summarize the cells of a raster layer that lie within the boundary of another data layer (which may be in either raster or vector format). For example, aggregating deficit return periods with another raster depicting land cover type, or with a vector boundary (shapefile) of countries, states, or counties, will produce descriptive statistics by that new layer. These statistics include the sum, mean, median, standard deviation, and range.
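
As a minimal sketch of the idea, the base R code below groups made-up cell values by a made-up zone label and summarizes each group. Real zonal statistics also have to work out which cells fall in which zone; here that assignment is simply given. The `demo_` names are hypothetical.

```{r}
# five made-up raster cells, each already assigned to zone A or B
demo_grid <- data.frame(
  zone = c("A", "A", "B", "B", "B"),
  deficit = c(-60, -40, -10, -5, -3))
# mean deficit return period per zone
demo_zone_mean <- tapply(demo_grid$deficit, demo_grid$zone, mean)
demo_zone_mean
# median deficit return period per zone
tapply(demo_grid$deficit, demo_grid$zone, median)
```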

For this lesson, we begin by calculating the mean deficit return period for each California county. First, we retrieve a vector dataset of US counties from the geoBoundaries API. Because geoBoundaries does not attribute which counties belong to which states, we use a spatial operation called an intersection to select only those counties in California.

```{r}
# get the adm2 (county) level data for the USA
# request the adm2 usa information
cali_counties <- httr::GET("https://www.geoboundaries.org/api/current/gbOpen/USA/ADM2/")
# parse the content to find the direct download link
cali_counties <- httr::content(cali_counties)
# read in the geojson directly from geoboundaries
cali_counties <- sf::st_read(cali_counties$gjDownloadURL)

# geoBoundaries does not list which counties belong to which state so you need to run an intersection
# intersect the usa adm2 data with just the california boundary
cali_counties<-sf::st_intersection(cali_counties, california)
# plot the result
plot(sf::st_geometry(cali_counties))
```

The output of that intersection looks as expected. As noted above, a visual and/or tabular check on your data layers is always a good idea. If you expect 58 counties in California, you should see 58 counties in the result of the intersection. Be on the lookout for too few counties (such as an island area that is in one layer but not the other) or too many (such as counties that intersect with a neighboring state).
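
A tabular version of this check can be sketched with toy geometries. The squares below are hypothetical stand-ins for counties and a state, not geoBoundaries data, and filtering on area is only one possible way to drop neighbors that merely touch the border.

```{r warning=FALSE}
# helper that builds a unit square whose lower-left corner is at (x0, 0)
demo_square <- function(x0) {
  sf::st_polygon(list(rbind(
    c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1), c(x0, 1), c(x0, 0))))
}
# three side-by-side squares standing in for counties
demo_counties <- sf::st_sf(
  name = c("one", "two", "three"),
  geometry = sf::st_sfc(demo_square(0), demo_square(1), demo_square(2)))
# a "state" that covers only the first two squares
demo_state <- sf::st_sf(
  state = "demo",
  geometry = sf::st_sfc(sf::st_polygon(list(rbind(
    c(0, 0), c(2, 0), c(2, 1), c(0, 1), c(0, 0))))))
# intersect, then keep only the pieces that have real area
demo_hits <- sf::st_intersection(demo_counties, demo_state)
demo_hits <- demo_hits[as.numeric(sf::st_area(demo_hits)) > 0, ]
# compare the number of features with the number you expect
nrow(demo_hits)
demo_hits$name
```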

::: column-margin
::: {.callout-tip style="color: #5a7a2b;"}
## Coding Review

The [exactextractr](https://github.com/isciences/exactextractr) [@Baston2023] R package summarizes raster values over groupings, or zones, also known as zonal statistics. Zonal statistics help in assessing the statistical characteristics of a certain region.

The [terra](https://cran.r-project.org/web/packages/terra/index.html) R package processes raster geospatial data, offering functionalities such as data manipulation, spatial analysis, modeling, and visualization, with a focus on efficiency and scalability.
:::
:::

We will perform our zonal statistics using the `exactextractr` package [@Baston2023]. It is a fast, accurate, and flexible zonal statistics tool for the R programming language, but it currently has no default methods for the `stars` package, so we'll switch to `terra` for this portion of the lesson.

```{r}
# create a terra collection with the wsim-gldas 1 month file from SEDAC
wsim_gldas_1mo<-terra::sds("composite_1mo.nc")
# pull out just the deficit layer
wsim_gldas_1mo<-wsim_gldas_1mo["deficit"]
# create a sequence of dates to keep
keeps<-seq(lubridate::ymd("2014-01-01"), lubridate::ymd("2014-12-01"), by = "month")
# subset the terra object for only those time steps
wsim_gldas_1mo<-wsim_gldas_1mo[[terra::time(wsim_gldas_1mo) %in% keeps]]
# label the time steps
names(wsim_gldas_1mo) <- keeps
# check the info
print(wsim_gldas_1mo)
```

```{r warning=FALSE, message=FALSE}
# run the extraction
cali_county_summaries<-
  exactextractr::exact_extract(
    # the raster data
    wsim_gldas_1mo, 
    # the boundary data
    cali_counties, 
    # the calculation
    'mean', 
    # don't show progress
    progress = FALSE)
# give the timestep prettier labels
names(cali_county_summaries)<-lubridate::month(keeps, label = TRUE, abbr = FALSE)
```
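
The `'mean'` operation in *exactextractr* weights each cell by the fraction of the cell that the polygon covers. The sketch below illustrates that idea with base R's `weighted.mean()` on made-up numbers. It is a conceptual illustration rather than the package's internal code, and the `demo_` names are hypothetical.

```{r}
# made-up deficit values for three cells that touch a county
demo_cell_values <- c(-60, -40, -10)
# fraction of each cell that lies inside the county (1 = fully inside)
demo_coverage <- c(1, 0.5, 0.25)
# coverage-weighted mean: partially covered cells count for less
demo_weighted <- weighted.mean(demo_cell_values, demo_coverage)
demo_weighted
# unweighted mean for comparison
mean(demo_cell_values)
```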

*exactextractr* returns summary statistics in the same order as the input boundary file, so we can bind the output to the California county boundaries for visualization. We also make a version to view as a table to inspect the raw data. We can take a quick look at the first 10 counties to see their mean deficit return period for January-June.

```{r}
# bind the extracted means with the california boundary
cali_counties<-cbind(cali_counties, cali_county_summaries)
# prepare a version to create a table
cali_county_table<-cbind(County=cali_counties$shapeName,
                         round(cali_county_summaries))
# create the table with only the first 10 rows and 7 columns
kableExtra::kbl(cali_county_table[c(1:10),c(1:7)]) |>
    kableExtra::kable_styling(
      bootstrap_options = c("striped", "hover", "condensed"))
```

This confirms the widespread distribution of high deficit values (all the bright red) in our exploratory maps. The data is currently in wide format, which makes for easy viewing of a time series, but more advanced programmatic visualization typically requires data to be in a normalized, or long, format (more on that later).

## County Choropleths

Now that we've inspected the raw data we can make a choropleth out of the mean deficit return period data.

```{r warning=FALSE, message=FALSE}
# plot the data for a check using only the 12 monthly summaries in columns 11 through 23
plot(cali_counties[c(11:23)],
     pal = c(
    '#9B0039',
    # -50 to -40
    '#D44135',
    # -40 to -20
    '#FF8D43',
    # -20 to -10
    '#FFC754',
    # -10 to -5
    '#FFEDA3',
    # -5 to -3
    '#FFF4C7',
    # -3 to 0
    '#FFFFFF'), 
    breaks = c(-61, -50, -40, -20, -10,-5,-3, 5), 
    key.pos = 1)
title("Deficit Return Period", cex.main = 2, line =-77, adj =0)
```

Due to the widespread water deficits in the raw data, the mean values do not appear much different from the raw deficit raster layer. However, thematic (also called choropleth) maps can make it easier for users to survey the landscape by visualizing familiar geographies (like counties) that place themselves and their lived experiences alongside the data.

While this paints a striking picture of widespread water deficits, how many people are affected by this drought? Although the land area appears rather large, if one is not familiar with the distribution of population and urban centers in California it can be difficult to get a sense of the direct human impact. (This is partly because more populous locations are usually represented by smaller land areas and the less populous locations are usually represented by large administrative boundaries containing much more land area. Normalizing a given theme by land area may be something an analyst wants to do but we cover another approach below.)

## Integrating Population Data

**Gridded Population of the World** (GPW) is a dataset collection in SEDAC that models the distribution of the global human population as counts and densities in a raster format [@CIESIN2018]. We will take full advantage of exactextractr to integrate across WSIM-GLDAS, geoBoundaries, and GPW. To begin, we need to download the 15 arc-minute 2015 population *density* layer from GPWv4. This most closely matches our time period (2014) and the resolution of WSIM-GLDAS. Although it may seem more intuitive to use GPW's population *count* data layers, you can achieve more accurate results (especially along coastlines) by using population density in conjunction with land area estimates derived from exactextractr.

::: {.callout-tip style="color: #5a7a2b;"}
## Data Review

The Gridded Population of the World Version 4 is available in multiple target metrics (e.g. counts, density), time periods (2000, 2005, 2010, 2015, 2020), and spatial resolutions (30 sec, 2.5 min, 15 min, 30 min, 60 min). Read more about GPW at the [collection home page on SEDAC](https://sedac.ciesin.columbia.edu/data/collection/gpw-v4). GPW is one of four global population datasets available in raster format. These datasets vary in the degree to which they use additional information as ancillary variables to model the spatial distribution of population from the administrative units (vector polygons) in which they originate. A review of these datasets and their underlying models is found in a paper by Leyk and colleagues [@leyk2019]. Fitness-for-use is an important principle in determining the best dataset to use for a specific analysis. The question we ask here (what is the population exposure to different levels of water deficit in California?) uses spatially coarse inputs and concerns a place with high-quality data inputs, so GPW is a good choice for this analysis. Users with vector-format census data (at county or sub-county level) could also adapt this approach for those data. In the case of California, the US Census data and GPW will produce nearly identical estimates because GPW is based on the census inputs.
:::

Load in the population density layers.

```{r}
# read in GPW with terra
pop_dens<-terra::rast("gpw_v4_population_density_rev11_2015_15_min.tif")
# check the basic information
print(pop_dens)
```

For this example we’ll classify the WSIM-GLDAS deficit return period raster layer into eight categories. Binning the data will make it easier to manage the output and interpret the results.

```{r}
# set the class breaks row-wise (from, to, new label)
# e.g. the first row states: "all return periods from 0 to 5 will now be labeled 0"
m <- c(0, 5, 0,
       -3, 0, -3,
       -5, -3, -5,
       -10, -5, -10,
       -20, -10, -20,
       -40, -20, -40,
       -50, -40, -50,
       -65, -50, -60)
# convert the vector into a matrix
rclmat <- matrix(m, ncol=3, byrow=TRUE)
# classify the data
wsim_gldas_1mo_class <-
  terra::classify(wsim_gldas_1mo, rclmat, include.lowest = TRUE)
```
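
To see how a reclassification matrix behaves, the sketch below applies a shortened version of the matrix to a tiny made-up raster. The four cell values and the `demo_` names are hypothetical. With `terra::classify()` defaults, each interval is open on the left and closed on the right.

```{r}
# a 1 x 4 toy raster with made-up return periods
demo_raster <- terra::rast(nrows = 1, ncols = 4, vals = c(-55, -15, -2, 3))
# (from, to, new label) rows, a subset of the classes used in the lesson
demo_rcl <- matrix(c(-65, -50, -60,
                     -20, -10, -20,
                      -3,   0,  -3,
                       0,   5,   0),
                   ncol = 3, byrow = TRUE)
# classify the toy raster
demo_classified <- terra::classify(demo_raster, demo_rcl, include.lowest = TRUE)
# each cell now carries the label of the interval it fell in
as.vector(terra::values(demo_classified))
```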

In our previous example, we used *exactextractr*’s built-in `'mean'` function, but we can pass other custom functions to *exactextractr* that will carry out several operations at once as well. The following code could be combined into a single function passed to *exactextractr*, but it is presented here as multiple functions in order to follow along more easily. You can read more about *exactextractr* arguments in the package [help guide](https://cran.r-project.org/web/packages/exactextractr/exactextractr.pdf). The key arguments to be aware of are the calls to:

1.  `weights = pop_dens`: pairs each WSIM-GLDAS cell’s deficit return period with the corresponding population density value.
2.  `coverage_area = TRUE`: calculates the corresponding area of the WSIM-GLDAS raster cell that is covered by the California boundary.

```{r}
# run the extraction
pop_by_rp <-
  exactextractr::exact_extract(wsim_gldas_1mo_class, california, function(df) {
    df <- data.table::setDT(df)
    }, 
  # convert output to data frame
  summarize_df = TRUE, 
  # specify the weights we're using
  weights = pop_dens, 
  # return the coverage area (m^2)
  coverage_area = TRUE,
  # return the boundary's shapeISO label with the output data
  include_cols = 'shapeISO', 
  # don't show progress
  progress = FALSE)
```

This returns a `data.frame` with a row for every raster cell in the WSIM-GLDAS layer that is overlapped by the California boundary. Let's take a look at the first 6 rows.  

```{r}
head(pop_by_rp)
```
  *   `shapeISO`: The label of the polygon boundary where the cell is located. This was passed on from the California geojson boundary as specified in the `include_cols = 'shapeISO'` argument. In this instance, it’s not very helpful because we used the state-level California boundary, but if we passed the ADM2 boundary with counties it would provide the name of the county where the cell is located. 
  *   `2014-01-01` to `2014-12-01`: The next 12 columns list the deficit return period classification value for the cell in each of the 12 months corresponding to the time dimension of the `wsim_gldas_1mo` raster layer.
  *   `weight`: The `weight` column lists the corresponding population density value (persons per km\^2) for that WSIM-GLDAS cell. The WSIM-GLDAS and GPW raster layers have the same projection and resolution. Therefore, because they are perfectly aligned, each WSIM return period cell has a corresponding GPW population weight "right on top of it".
  *   `coverage_area`: The total area (m\^2) of the WSIM-GLDAS cell that is covered by the California boundary layer. Given the total area of the WSIM cell that is covered, and the GPW persons per unit area weight, we can calculate the number of people estimated to be living within this cell under this WSIM deficit return period.
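
The arithmetic implied by the `weight` and `coverage_area` columns can be sketched with made-up numbers. Because the area is in square meters and the density is in persons per square kilometer, the area is divided by 1e6 before multiplying. The values and `demo_` names are hypothetical.

```{r}
# made-up area of one cell that falls inside the boundary (m^2)
demo_coverage_area <- 2.5e8
# made-up population density for that cell (persons per km^2)
demo_weight <- 40
# convert m^2 to km^2, then multiply by density to estimate people in the cell
demo_people <- demo_coverage_area / 1e6 * demo_weight
demo_people
```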
  
We will need to perform a few more processing steps to prepare this `data.frame` for a time series visualization integrating all of the data. We will use the `melt()` function from *data.table* to transform the data from wide format to long format in order to produce a visualization in *ggplot2*. Specifically, we need to "melt" the 12 month columns (`2014-01-01` to `2014-12-01`) into 2 new columns: 1) the WSIM-GLDAS deficit return period value and 2) the month it came from.

::: column-margin
::: {.callout-tip style="color: #5a7a2b;"}
## Coding Review

Converting data from wide to long or long to wide formats is a key component of data processing, but it can be confusing. To read more about melting/pivoting longer (wide to long) and casting/pivoting wider (long to wide), check out the *data.table* vignette [Efficient reshaping using data.tables](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-reshape.html) and the *tidyr* [`pivot_longer`](https://tidyr.tidyverse.org/reference/pivot_longer.html) and [`pivot_wider`](https://tidyr.tidyverse.org/reference/pivot_wider.html) reference pages.
:::
:::

```{r}
# convert the dataset from wide to long (melt)
pop_by_rp <-
  data.table::melt(
    # data we're melting
    pop_by_rp,
    # the columns that need to stay columns/ids
    id.vars = c("shapeISO", "coverage_area", "weight"),
    # name for the month columns we're melting into a single column
    variable.name = "month",
    # name for the values being converted into a single column
    value.name = "return_period")

# check the first rows of melted data
head(pop_by_rp)
```
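
The same reshaping is easier to follow on a tiny table. The sketch below melts a made-up two-cell, two-month table; the column `cell`, the values, and the `demo_` names are hypothetical.

```{r}
# wide format: one row per cell, one column per month
demo_wide <- data.table::data.table(
  cell = c("a", "b"),
  `2014-01-01` = c(-60, -20),
  `2014-02-01` = c(-40, -10))
# long format: one row per cell-month combination
demo_long <- data.table::melt(
  demo_wide,
  id.vars = "cell",
  variable.name = "month",
  value.name = "return_period")
demo_long
```

Two cells and two months produce four rows, each holding a single return period value.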

Each row lists the land area (`coverage_area`) of the cell that is covered by the boundary, the population density value (`weight`) for the cell, the month the deficit return period corresponds to, and the deficit return period class for the cell.

Next, we’ll summarize the data by return-period class.

```{r}
# create new column with population totals for each month and return period combination
# coverage_area is in m^2 and weight is in persons per km^2, so divide by 1e6 to get persons
pop_by_rp <-
  pop_by_rp[, .(pop_rp = round(sum(coverage_area * weight) / 1e6)), by = .(month, return_period)]
# some cells do not have deficit return period values and result in NaN--just remove
pop_by_rp <- na.omit(pop_by_rp)
# check the first rows
head(pop_by_rp)
```

Now we have a row for every unique combination of month and return period class, along with the total population in that combination. We can calculate the fraction of the total population represented by each return period class with 2 more lines.

```{r}
# create a new column with the total population for that month (will be the same for each month but needed to create wsim class fraction)
pop_by_rp[, total_pop := sum(pop_rp), by = month]
# calculate the fraction of that month-wsim class combination relative to the total population
pop_by_rp[, pop_frac := pop_rp / total_pop][, total_pop := NULL]
# check the first rows
head(pop_by_rp)
```
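
The grouped calculation can be checked on a small made-up table. The sketch below computes each class's share of its month's total in a single step; the population numbers and `demo_` names are hypothetical.

```{r}
# made-up population totals for two months and two return period classes
demo_pop <- data.table::data.table(
  month = c("Jan", "Jan", "Feb", "Feb"),
  return_period = c(-60, -20, -60, -20),
  pop_rp = c(900, 100, 600, 400))
# divide each row by the total for its month
demo_pop[, pop_frac := pop_rp / sum(pop_rp), by = month]
# the fractions within each month should add up to 1
demo_pop[, .(check = sum(pop_frac)), by = month]
```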

Before plotting, we'll make the month labels more legible, convert the WSIM-GLDAS return period class into a factor, and set the WSIM-GLDAS class palette.

::: column-margin
::: {.callout-tip style="color: #5a7a2b;"}
## Coding Review

Factors are the most common way to handle categorical data in R. Although converting your categorical variables into factors is not always the best choice, in many instances (especially plotting with *ggplot2*) the benefits will outweigh any annoyances. To learn more about factors in R, check out Hadley Wickham's chapter on factors in [**R for Data Science 2nd Edition**.](https://r4ds.hadley.nz/factors.html)
:::
:::

```{r warning=FALSE}
# ggplot is easier with factors
pop_by_rp$return_period<-
  factor(pop_by_rp$return_period, 
         levels = c("0", "-3", "-5", "-10", "-20", "-40", "-50", "-60"))
# create the palette to pass to ggplot
leg_colors<-c(
    '#9B0039',
    # -50 to -40
    '#D44135',
    # -40 to -20
    '#FF8D43',
    # -20 to -10
    '#FFC754',
    # -10 to -5
    '#FFEDA3',
    # -5 to -3
    '#fffdc7',
    # -3 to 0
    '#FFF4C7',
    # 0-3
    "#FFFFFF")

# pretty month labels
pop_by_rp[,month:=lubridate::month(pop_by_rp$month, label = TRUE)]
```
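
Setting the levels explicitly matters because the level order controls the order in which *ggplot2* draws and stacks the classes. The base R sketch below contrasts default levels with explicit ones on a few made-up class labels; the `demo_` names are hypothetical.

```{r}
# made-up return period classes stored as text
demo_classes_chr <- c("-3", "-60", "0", "-20")
# default levels are sorted as text, not by drought severity
demo_default <- factor(demo_classes_chr)
levels(demo_default)
# explicit levels run from least to most severe
demo_ordered <- factor(demo_classes_chr, levels = c("0", "-3", "-20", "-60"))
levels(demo_ordered)
# the integer codes follow the explicit level order
as.integer(demo_ordered)
```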

Now we can put it all together into a visualization.

```{r}
# add the base data
ggplot2::ggplot(pop_by_rp, 
                # set plot wide aesthetics
                ggplot2::aes(x = month, 
                             y = pop_frac,
                             group = return_period, 
                             fill = return_period))+
  # determine the overlay and positioning of each wsim-class' bar
  ggplot2::geom_bar(stat = "identity", 
                    position = "stack", 
                    color = "darkgrey")+
  # set palettes on the wsim class groups
  # rev() the order of the colors so we can have -60 on the bottom
  ggplot2::scale_fill_manual(values = rev(leg_colors))+
  # limit y axis to 0/1
  ggplot2::ylim(0,1)+
  # labels
  ggplot2::labs(title = "Monthly Fraction of Population Under Water Deficits in California During 2014",
                subtitle = "Categorized by Intensity of Deficit Return Period",
                x = "",
                y = "Fraction of Population*",
                caption = "*Population derived from Gridded Population of the World (2015)",
                color = "Return Period", fill = "Return Period", group = "Return Period", alpha = "Return Period")+
  # use basic theme
  ggplot2::theme_minimal()
```

This figure really illustrates the human impact of the 2014 drought. Nearly 100% of the population was under a 60+ year deficit in January followed by 66% in May and approximately 40% for the remainder of the summer. That is a devastating drought!

## Congratulations! In this Lesson You Learned How To...

-   Identify hot spots of drought and select these hotspots for further analysis.
-   Summarize data by county using the exactextractr tool.
-   Integrate WSIM-GLDAS deficit, GPW population, and geoBoundaries administrative data to create complex time series visualizations.

## Lesson 3
 
In this lesson we explored the California drought of 2014. In our next lesson, we will examine near real-time flood data in California using the MODIS data product.

[Lesson 3: Moderate Resolution Imaging Spectroradiometer (MODIS) Near-Real Time (NRT) flood data](https://ciesin-geospatial.github.io/TOPSTSCHOOL-module-1-water/lance-modis-nrt-global-flood-mcdwd-f3.html){.btn .btn-primary role="button"}

# References