# Data variety {#ch-03}
```{r include=FALSE}
list_pkgs <- c("tidyverse", "MODISTools", "sf", "rgdal", "spData", "sp", "raster", "rasterVis", "RColorBrewer", "lubridate", "Metrics", "conflicted")
new_pkgs <- list_pkgs[!(list_pkgs %in% installed.packages()[, "Package"])]
if (length(new_pkgs) > 0) install.packages(new_pkgs)
sapply(list_pkgs, library, character.only = TRUE)
# Preferences definition for conflicting packages when building bookdown
library(conflicted)
conflict_prefer("select", "dplyr")
conflict_prefer("filter", "dplyr")
conflict_prefer("levelplot", "rasterVis")
conflict_prefer("origin", "raster")
conflict_prefer("extract", "raster")
conflict_prefer("partial", "pdp")
conflict_prefer("mse", "Metrics")
```
## Introduction
### Overview
In this chapter, we will look at data variety in environmental sciences. We will start with the data from one eddy covariance measurement site in Switzerland encountered in Chapter 1, and complement it with remote sensing data. Specifically, we will download remote sensing data that quantifies vegetation greenness via the normalised difference vegetation index (NDVI). We are interested in remote sensing data because we expect ecosystem photosynthesis (gross primary production, GPP), measured at the eddy covariance tower, to covary with NDVI. Such covariation can provide powerful information for modelling. For example, we can train a machine learning model with the observed covariation between measured GPP and NDVI at eddy covariance sites and use that model to predict GPP from data obtained in space (NDVI is available for the whole globe, while GPP can only be measured locally). You will be doing this modelling in exercise 13. Here, we will focus on working with the spatial aspects of the data (since remote sensing data is inherently spatial) and introduce the tool set for working with geo-spatial data.
To get familiar with spatial data types, we will use a variety of data from freely accessible sites. We can obtain abiotic and biotic conditions at the tower locations. We will spatially plot the eddy covariance towers you have already encountered in the first two chapters and extract values for these locations from other data. By doing this, we can show that these sites span a gradient of temperature and land cover.
### Learning objectives
After this learning unit, you will be able to ...
* Explain the possible sources of data in environmental sciences.
* Understand the range of operations applied to data from basic to complex.
* Read various types of data in R and apply basic operations.
* Understand the structure of spatial data, including rasters and shapefiles.
* Apply a range of operations to prepare data for analyses.
### Key points from the lecture
Data operations range in complexity. Here they are listed from simplest to most complex:
* Statistical operations
* Similarity metrics
* Ordinations
* Clustering
* Classifications
* Regressions
* Neural network deep learning
Examples of methods for computing differences between observations described by a set of features:
* _Euclidean distance_ is used for the simple quantification of variation.
* _PCA_ is a valuable tool to explore variation and reduce dimensionality.
* _Hamming distance_ measures the difference between strings.
* _Jaccard index_ describes the similarity between two or more binary data sets.
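As a minimal illustration, three of these metrics can be computed directly in base R (the values and strings below are made up for demonstration):

```{r}
# Two observations described by three numeric features
a <- c(1, 2, 3)
b <- c(4, 6, 3)

# Euclidean distance: square root of the summed squared differences
sqrt(sum((a - b)^2)) # 5

# Hamming distance between two equal-length strings:
# the number of positions at which the characters differ
s1 <- strsplit("karolin", "")[[1]]
s2 <- strsplit("kathrin", "")[[1]]
sum(s1 != s2) # 3

# Jaccard index between two binary vectors:
# size of the intersection over size of the union
x <- c(1, 1, 0, 1, 0)
y <- c(1, 0, 0, 1, 1)
sum(x & y) / sum(x | y) # 0.5
```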
Machine learning is loosely defined as any learning done by a computer. It can be divided into two main categories:
* _Unsupervised learning_: without explanatory data; uses clustering to obtain a categorical output.
* _Supervised learning_: with explanatory data; uses either classification, which also yields a categorical output, or regression, when a numerical output is required.
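A minimal sketch of both flavours using base R and the built-in `iris` dataset; the choice of `kmeans()` and `lm()` here is illustrative, not prescriptive:

```{r}
# Unsupervised: k-means clustering of the iris measurements (species labels are not used)
set.seed(42)
clusters <- kmeans(iris[, 1:4], centers = 3)
table(clusters$cluster) # categorical output: one cluster assignment per observation

# Supervised regression: predict petal length from sepal length
fit <- lm(Petal.Length ~ Sepal.Length, data = iris)
coef(fit) # numerical output: intercept and slope
```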
Data errors are divided into errors of accuracy and precision. If data is inaccurate, it strays from the true value. If data is imprecise, individual data points vary under the same conditions. Systematic errors and inaccurate data are much harder to correct.
Errors can arise throughout the data science workflow.
* _Inherent error_ refers to the error present in the source documents and data.
* _Operational error_ is created after data collection.
Data can be cleaned from errors by looking for:
* _Non-uniform data_ where the data is not uniformly or normally distributed.
* _Missing data_ where there is no value for the observed variable, often denoted as _NA_.
* _Duplicated data_ where a dataset contains multiple records of identical data.
* _Outliers_, which are data points that are significantly different from, or lie abnormally far from, most other data points.
* _Measurement errors_, the discrepancy between a measured value and the true value.
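A toy sketch of how some of these issues can be detected in base R (the vector below is invented for demonstration):

```{r}
# A small toy vector with a missing value, a duplicate, and an outlier
x <- c(2.1, 2.3, NA, 2.3, 2.2, 9.9)

sum(is.na(x))      # count missing values: 1
sum(duplicated(x)) # count duplicated values: 1

# Flag outliers as points more than 1.5 interquartile ranges
# outside the first and third quartiles
q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
iqr <- diff(q)
x[!is.na(x) & (x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)] # 9.9
```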
Quantifying uncertainty is extremely important, as undetectable errors will persist even after data cleaning. Examples of uncertainty quantification metrics include:
* _Confidence interval (CI)_ is a range of values that the chosen parameter of the data variable will fall within with a certain probability or degree of certainty.
* _Prediction interval_ is the estimated interval within which observations will occur based on the previously observed data and with a certain degree of confidence.
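Both intervals can be obtained in R from a fitted linear model with `predict()`; the sketch below uses synthetic data:

```{r}
# Confidence vs. prediction interval for a simple linear fit on synthetic data
set.seed(1)
df <- data.frame(x = 1:20)
df$y <- 2 * df$x + rnorm(20, sd = 2)
fit <- lm(y ~ x, data = df)
new <- data.frame(x = 10)
predict(fit, new, interval = "confidence") # interval for the mean response
predict(fit, new, interval = "prediction") # wider: interval for a new observation
```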
Error propagation is used for datasets containing different uncertainties to determine the overall uncertainty. For example, if the variables in a dataset come from measurements, these values carry uncertainty due to limitations in the measuring process, such as the precision of the instrument. These uncertainties add up across the dataset and the analysis as variables are combined within a function. Often this uncertainty is expressed in terms of the standard deviation _σ_.
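For independent variables, a common rule is that standard deviations add in quadrature for a sum, and relative uncertainties add in quadrature for a product. A quick sketch with made-up uncertainties:

```{r}
# Error propagation for independent variables.
# For a sum z = a + b, the standard deviations add in quadrature:
sigma_a <- 0.3
sigma_b <- 0.4
sigma_sum <- sqrt(sigma_a^2 + sigma_b^2)
sigma_sum # 0.5

# For a product z = a * b, the *relative* uncertainties add in quadrature:
a <- 10
b <- 5
sigma_prod <- a * b * sqrt((sigma_a / a)^2 + (sigma_b / b)^2)
sigma_prod
```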
## Tutorial
### Overview
In this practical, we will be downloading and getting to know spatial data. There are three types of spatial data:
* *points*: Spatial points are data points consisting of x and y coordinates representing a specific geographic location.
* *shapefiles*: Shapefiles are a format for storing geospatial data in the form of points, lines or polygons. They contain geographic locations and attribute information on the features stored within them.
* *rasters*: Rasters are gridded data stored in pixels or cells. Each grid cell represents a section of the geographic area the raster covers and contains a value of the variable of interest.
This practical will start with a little introduction on downloading remote sensing data for our tower sites and seeing how this information can be used for analyses. Then we will get into the nitty-gritty aspects of the aforementioned spatial data types.
### MODIS remote download
We will start this tutorial with an example of using remote sensing data. We will focus on a Fluxnet site already presented in the first practical: CH-Lae. The Lägern site is located on a mountain in the Swiss Plateau north-west of Zürich and is surrounded by managed mixed deciduous mountain forests. The forest is highly diverse and dominated by beech. In the following figures, you can get an impression of the tower and its surroundings.
```{r, echo=FALSE, out.width="32%", out.height="49%", fig.show='hold', fig.cap="Images of Fluxnet sites and measurement station at the Lägern site."}
knitr::include_graphics(c("./figures/site1.jpg",
"./figures/site2.jpg",
"./figures/site3.jpg"))
```
As always, we start by loading `tidyverse` and our fluxnet site data.
```{r warning=F, message=F}
library(tidyverse)
df_sites <- read_csv("./data/fluxnet_site_info_reduced.csv")
```
Since we will only be working with the Lägern site for this section, we can extract its latitude and longitude for further use.
```{r}
lon_lae <- df_sites %>%
filter(site == "CH-Lae") %>%
pull(lon)
lat_lae <- df_sites %>%
filter(site == "CH-Lae") %>%
pull(lat)
```
In this first analysis, we want to complement temporal data measured by the Fluxnet tower with remote sensing data. _Remote sensing_ is the detection or monitoring of physical characteristics of the Earth's surface by measuring reflected and emitted radiation at a distance. Data is collected in the form of images by special cameras, typically carried on aircraft or satellites. With this method, a large amount of data at large spatial scales becomes easily accessible, making it a useful approach for monitoring ecosystems. However, remotely acquired data requires validation against ground measurements, which can be done, for example, using the eddy covariance towers from the previous practicals.
One of these satellite remote sensing instruments is MODIS (Moderate Resolution Imaging Spectroradiometer). Terra MODIS and Aqua MODIS view the entire Earth's surface every 1 to 2 days, acquiring data in 36 spectral bands, or groups of wavelengths.
MODIS provides Atmosphere, Ocean, Cryosphere, and Land data series. The `MODISTools` package in R provides a programmatic interface to the [MODIS Land Products Subsets web services](https://modis.gsfc.nasa.gov/), which allows easy downloads of MODIS time series (of single pixels or small regions of interest) directly into your R workspace or onto your computer.
* Using the `mt_...()` functions of MODISTools, we can check which products are available in MODIS, what the product we are searching for is called, and what temporal and spatial resolutions are available.
* `mt_products()` lists all available products from MODIS with their temporal and spatial resolution. Products are parent categories of variables measured by the satellites, such as vegetation indices.
* `mt_bands()` requires a product chosen from the products list and lists the available data variables (bands), which represent the actually measured wavelengths.
* Finally, `mt_dates()` lists all dates for which data from the chosen product and band are available.
We will make use of this by loading NDVI data around our towers directly from MODIS.
```{r}
library(MODISTools)
products <- mt_products() %>% as_tibble()
bands <- mt_bands(product = "MOD13Q1") %>% as_tibble()
dates <- mt_dates(product = "MOD13Q1", lat = lat_lae, lon = lon_lae) %>% as_tibble()
```
Next, we will get the NDVI data for six years for a square of 2 km × 2 km around the tower site, using the function `mt_subset()`. Its parameters specify the subset to download within the chosen product (here 'MOD13Q1'): the location as latitude and longitude, and the chosen band, which is at 250 m spatial and 16-day temporal resolution. The time period must be a subset of the available dates and is provided as a start and end date; we take the beginning of 2009 to the end of 2014. `km_lr` and `km_ab` define the kilometres to the left and right of the location and the kilometres above and below it, respectively. Since these values are rounded to the nearest integer, we choose the minimum of 1, corresponding to a 2 km × 2 km square. `site_name` is used when writing data to file; `internal` specifies whether the data should be returned as an internal object; `progress` defines whether the download progress is shown.
```{r eval = F}
df_ndvi <- mt_subset(product = "MOD13Q1", # the chosen product
lat = lat_lae, # desired lat/lon
lon = lon_lae,
band = "250m_16_days_NDVI", # chosen band defining spatial and temporal scale
start = "2009-01-01", # start date: 1st Jan 2009
end = "2014-12-19", # end date: 19th Dec 2014
km_lr = 1, # kilometers left & right of the chosen location (lat/lon above)
km_ab = 1, # kilometers above and below the location
site_name = "CH-Lae", # the site name we want to give the data
internal = TRUE,
progress = FALSE
) %>%
as_tibble()
```
```{r echo=F}
df_ndvi <- readRDS("./data/df_ndvi.rds")
```
```{r paged.print=TRUE}
head(df_ndvi) %>% select(2:7) # Only displaying 6 of 21 features
```
This data frame appears complicated, but we can transform the data into a useful spatial format. For that it needs to be converted to a raster. This is a key spatial data type and will be described in more detail in the section Raster below.
The `MODISTools` packages provides a handy function, called `mt_to_raster()`, for converting such data frames produced by `mt_subset()` into a raster.
```{r echo=F, message=FALSE, warning=FALSE}
library(sf)
library(rgdal)
library(raster)
raster_ndvi <- mt_to_raster(df_ndvi, reproject=TRUE)
```
By plotting the data we can verify that we loaded a square of NDVI data around our tower for every 16-day time step.
We plot four images in the winter season.
```{r}
plot(raster_ndvi[[1:4]], zlim=c(-0.1, 0.95))
```
We can derive a variety of statistics and, for instance, calculate the minimum and maximum of these rasters.
```{r}
min_winter <- min(minValue(raster_ndvi[[1:4]]))
max_winter <- max(maxValue(raster_ndvi[[1:4]]))
cat(" Minimum Winter: ", min_winter, "\n",
"Maximum Winter: ", max_winter)
```
The same can be done for the summer season.
```{r}
plot(raster_ndvi[[13:16]], zlim=c(-0.1, 0.95))
```
We again compute minimum and maximum.
```{r}
min_summer <- min(minValue(raster_ndvi[[13:16]]))
max_summer <- max(maxValue(raster_ndvi[[13:16]]))
cat(" Minimum Summer: ", min_summer, "\n",
"Maximum Summer: ", max_summer)
```
From the two plots, we can clearly distinguish summer and winter time steps due to the differences in NDVI. Summer NDVI is higher since our tower site is located in the Northern hemisphere, where many vegetation types lose their leaves during winter. The larger range in winter might be due to conifers, which do not lose their needles during this season.
We are also interested in the mean across all (spatial) pixels for each date. For this we can collapse the pixel dimension into a single mean value for each date. This is done using `group_by()` in combination with `summarise()`, which were explained in Chapter \@ref(ch-02).
```{r, warning=F, message=F}
library(lubridate) # lubridate is loaded to work with dates
## first determine the scale factor that is to be applied to the NDVI values;
## NDVI is stored as scaled integers for practical reasons: it reduces disk space
scale_factor <- bands %>%
  filter(band == "250m_16_days_NDVI") %>%
  pull(scale_factor) %>%
  as.numeric()
df_ndvi_spatialmean <- df_ndvi %>%
  mutate(calendar_date = ymd(calendar_date)) %>% # parse the date strings into Date objects
  group_by(calendar_date) %>% # group the data by date
  summarise(mean = mean(value), min = min(value), max = max(value)) %>% # calculate mean, min and max across pixels
  mutate(mean = mean * scale_factor, min = min * scale_factor, max = max * scale_factor) # apply the scale factor
```
We can now plot our NDVI time series across our years (2009 to 2014) using the mean NDVI values from the rasters calculated.
```{r}
df_ndvi_spatialmean %>%
ggplot(aes(x = calendar_date)) +
geom_ribbon(aes(ymin = min, ymax = max), fill = "grey70") +
geom_line(aes(y = mean))
```
After aggregating the data to the mean NDVI across all 81 pixels for each date, we have reduced the data frame to a single (time) dimension. It now has a structure that can easily be combined with the time series data from the eddy covariance tower (each site is a point in space).
```{r message=F}
# this can now be combined to the data frame with flux data that also has dates along rows
ddf_ch_lae <- read_csv("./data/ddf_ch_lae.csv")
```
```{r, message=F}
# now we can combine the two data frames along dates.
ddf_ch_lae_ndvi <- ddf_ch_lae %>%
dplyr::rename(date = TIMESTAMP) %>%
dplyr::mutate(year = year(date)) %>%
dplyr::filter(year %in% 2009:2014) %>% # NDVI data was downloaded only for these years
left_join(df_ndvi_spatialmean %>% rename(date = calendar_date), by = "date") %>%
dplyr::select(-year)
```
NDVI data is provided every 16 days, but the eddy towers provide continuous measurements. In order to predict values at other time steps, we can fit a cubic smoothing spline to our daily values using `smooth.spline()`. In a nutshell, this is a way to fit a curve between two data points using cubic functions.
Cubic smoothing splines embody a curve fitting technique that blends the ideas of cubic splines and curvature minimization to create an effective data modeling tool for noisy data. Traditional interpolating cubic splines represent the tabulated data as a piece-wise continuous curve which passes through each value in the data table. The curve spanning each data interval is represented by a cubic polynomial, with the requirement that the endpoints of adjacent cubic polynomials match in location and in their first and second derivatives. For more, have a look at this [blogpost](https://towardsdatascience.com/numerical-interpolation-natural-cubic-spline-52c1157b98ac).
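Before applying this to the tower data, a minimal self-contained sketch of `smooth.spline()` on synthetic data (a noisy sine curve) may help; the grid sizes and noise level below are arbitrary:

```{r}
# Fit a cubic smoothing spline to noisy observations of a sine curve
set.seed(0)
x <- seq(0, 2 * pi, length.out = 40)
y <- sin(x) + rnorm(40, sd = 0.2)
sm <- smooth.spline(x, y)

# The fitted spline can be evaluated at any x, not just the observed ones
x_fine <- seq(0, 2 * pi, length.out = 200)
y_fine <- predict(sm, x_fine)$y

plot(x, y)                           # noisy data points
lines(x_fine, y_fine, col = "red")   # smooth curve through them
```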
```{r}
ddf_ch_lae_ndvi <- ddf_ch_lae_ndvi %>%
mutate(date_dec = decimal_date(date)) # adds a column with the date converted into a decimal of its year
df_nona <- ddf_ch_lae_ndvi %>%
drop_na() # removes any rows containing NAs
out_spline <- smooth.spline(df_nona$date_dec, df_nona$mean, spar=0.1 ) # fits a cubic smoothing spline to df
vals_spline <- predict(out_spline, ddf_ch_lae_ndvi$date_dec )$y # applies the cubic smoothing spline to our data
ddf_ch_lae_ndvi <- ddf_ch_lae_ndvi %>%
mutate(ndvi_splined = vals_spline) # adds the spline values for our data to ddf_ch_lae_ndvi
```
Now, we can plot this relationship.
```{r warning=F}
ddf_ch_lae_ndvi %>%
ggplot() +
geom_point(aes(x = date, y = mean)) +
geom_line(aes(x = date, y = ndvi_splined), color = "red")
```
Note how the red line is smoothed and closely follows the black data points.
To see how well this data correlates with measurements from the fluxnet tower, we look at the relationship of GPP (gross primary productivity) vs. NDVI (Normalized Difference Vegetation Index) * PPFD (photosynthetic photon flux density).
```{r}
ddf_ch_lae_ndvi <- ddf_ch_lae_ndvi %>%
mutate(ppfd_abs = ndvi_splined * PPFD_IN)
```
In order to investigate whether NDVI is associated with GPP and PPFD, we can fit a linear model.
```{r}
linmod1 <- lm(GPP_NT_VUT_REF ~ PPFD_IN, data = ddf_ch_lae_ndvi)
summary(linmod1)
```
The summary contains the following information:
* Call: this is the formula R uses to fit the data.
* Residuals: residuals are essentially the difference between the actual observed response values and the response values that the model predicts.
* Coefficients: the coefficients are the two unknown constants that represent the intercept and slope terms in the linear model. Besides the estimates of the two unknown constants (-5.078 and 15.412), the standard error of the estimates and measures of statistical significance (t value and Pr(>|t|)) are given. The coefficient standard error measures the average amount that the coefficient estimates vary from the actual average value of our response variable. The coefficient t-value is a measure of how many standard deviations our coefficient estimate is away from zero. The larger the absolute value of t, the smaller gets the probability of observing any value larger than |t| by chance, which is expressed by Pr(>|t|). The signif. codes or p-values (stars behind the coefficients) are associated to each estimate and categorize the probabilities Pr(>|t|) by significance thresholds.
* Residual standard error: the Residual Standard Error is the average amount that the response will deviate from the true regression line. It is a measure of the quality of a linear regression fit.
* Multiple R-squared : the R^2 quantifies what proportion of the variance found in the response variable can be explained by the predictor variable. It is a measure of how well the model is fitting the actual data.
* F-statistic: the F-statistic is a good indicator of whether there is a relationship between our predictor and the response variable. The further the F-statistic is from 1, the stronger the evidence for a relationship. However, how much larger than 1 the F-statistic has to be in order to ascertain a relationship depends on the number of data points and the number of predictors.
```{r}
linmod2 <- lm(GPP_NT_VUT_REF ~ ppfd_abs, data = ddf_ch_lae_ndvi)
summary(linmod2)
```
R² increased from 0.5229 to 0.603 when considering NDVI * PPFD instead of just PPFD. Hence, for modelling the relationship between GPP and PPFD, it also matters whether leaves are actually out and green. NDVI * PPFD approximates the amount of *absorbed* incoming light, not just the incoming light, and is therefore a physiologically more relevant quantity.
### Points on the globe
Documenting the geographic coordinates of data allows us to assign a precise location to each observation, which can be important for further analysis, for example when comparing locations. Coordinates come with a latitude (defining the position N or S of the equator) and a longitude (defining the position E or W of the prime meridian). With the coordinates, we can plot data points, remembering that latitude corresponds to the y-axis and longitude to the x-axis.
In this section, we look at the position of the Fluxnet sites. The Fluxnet network measures land-atmosphere exchanges of greenhouse gases and energy for sites across the globe using the eddy covariance technique. High-frequency measurements of vertical wind velocity and a scalar mixing ratio (CO2, H2O, temperature, etc.) provide estimates of the net exchange of the scalar. Eddy covariance is currently the standard method to measure fluxes of trace gases between ecosystems and the atmosphere (sources: [Fluxnet](https://fluxnet.org/about/); Nature article by @Pastorello2020 on the [Fluxnet Dataset](https://www.nature.com/articles/s41597-020-0534-3.pdf)).
At the beginning of this tutorial we loaded the dataset containing information on sites in Europe. We treat site locations as geographical positions, so we extract the name, longitude, and latitude of our sites.
```{r}
head(df_sites)
```
Our dataset _'df_sites'_ contains two sites in Greenland and Siberia. We'll focus on a smaller spatial domain, restricted to "mainland" Europe. Therefore, we exclude sites RU-Cok and DK-ZaH. We will also remove DE-Obe since this was removed in previous chapters.
```{r}
df_sites <- df_sites %>%
filter(!(site %in% c("RU-Cok", "DK-ZaH", "DE-Obe")))
```
Now we can plot our data points without any further adjustments.
```{r}
ggplot() +
geom_point(data = df_sites, aes(x = lon, y = lat), color = "red") +
labs(title = "Selected FLUXNET sites", x = "Longitude", y = "Latitude")+
theme_bw()
```
We can make an educated guess where the points are but this plot does not tell us much by itself without a map as a reference.
To visualize the location of our points we plot them on a map. We use the packages **sf** [@R-sf] and **SpData** [@R-SpData] to get the world map and crop it to Europe.
```{r message=FALSE, warning=FALSE}
library(sf)
library(spData)
ggplot() +
geom_sf(data = world) +
geom_point(data = df_sites, aes(x = lon, y = lat), color = "red") +
labs(x = "Longitude", y = "Latitude") +
coord_sf(xlim = c(-30,40), ylim = c(35,80), expand = FALSE) +
theme_bw()
```
We can carry on using the data points as they are (for plotting or analysis) but for more detailed spatial analysis we will transform them into a _SpatialPoints_ format.
_SpatialPoints_ consist of a matrix with _n_ rows and 2 columns, one for each coordinate (latitude & longitude). _n_ is the number of points in the data. These points also have a so-called _'projection string'_ or _'crs'_ indicating the coordinate reference system in which coordinates of points are expressed.
There are different ways (formulas) to project the earth (an ellipsoid) onto a 2-dimensional map. We can only perform calculations if the same method (projection, coordinate reference system, CRS) is used. Having points in different coordinate reference systems can make data look skewed. It is essential to know which coordinate reference system your data is in. The most commonly used one is **WGS84 (EPSG: 4326)**. There are useful resources to help you find the correct [spatial reference](https://spatialreference.org/ref/epsg/4326/).
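As a sketch of working with coordinate reference systems (using the **sf** package loaded in this chapter's setup; the coordinates below approximate the CH-Lae site, and EPSG:3035 is a standard equal-area projection for Europe):

```{r}
library(sf)

# Create a point in WGS84 (EPSG:4326): coordinates are longitude/latitude in degrees
pt_wgs84 <- st_sfc(st_point(c(8.36, 47.48)), crs = 4326)

# Reproject to the ETRS89 / LAEA Europe system (EPSG:3035):
# coordinates are now in metres, so distances and areas can be computed directly
pt_laea <- st_transform(pt_wgs84, crs = 3035)

st_coordinates(pt_wgs84) # degrees
st_coordinates(pt_laea)  # metres
```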
To transform points into _SpatialPoints_, we use the function `SpatialPoints` in the R package **sp** [@Bivand2013;@Pebesma2005].
```{r message=FALSE, warning=FALSE}
library(sp)
sp_sites <- SpatialPoints(dplyr::select(df_sites, -site)) # remove character column
```
Below you can see how the base-R function `plot()` plots the _SpatialPoints_ directly.
```{r warning=F, message=F}
plot(sp_sites, col = "red", pch = 16)
```
If, however, we want to make the points' locations visually meaningful, we have to provide a map as a background. To load a shapefile in R, we will use the package **rgdal** [@R-rgdal]. The shapefile of Europe we load here will also be used in the next section, Shapefiles.
```{r}
library(rgdal)
europe_shape <- readOGR(dsn="./data/shapefiles", layer="europe_map") # imports the shapefile of the European map
plot(europe_shape, col="grey") # plots the European map as the background
plot(sp_sites, col = "red", add = TRUE, pch = 16) # plots the points (here tower locations) on to the map
```
Now we can see the tower locations as red circles plotted on a grey European map with country borders, which puts the data points in a meaningful and comprehensible visual context.
### Shapefiles
Above we looked at points on the globe and transformed our data to _SpatialPoints_. This was already a first example of the forms a shapefile can come in. Shapefiles store geographic information, including location and any additional information, in shape objects. Shape objects in R are defined by the _SpatialPoints_, _SpatialLines_, and _SpatialPolygons_ classes of the **sp** package. The corresponding _SpatialPointsDataFrame_, _SpatialLinesDataFrame_, and _SpatialPolygonsDataFrame_ classes allow storing shape objects together with a data frame. The number of rows in the data frame corresponds to the number of points (number of rows in the coordinate matrix), lines (size of the list of Lines), or polygons (size of the list of Polygons). This allows us to associate a vector of variables (a row of the data frame) with each shape object.
In the next step, we load a shapefile containing the countries of the world, provided by [Natural Earth](https://www.naturalearthdata.com), as a reference with which we can contextualize our spatial data. Since we are only looking at towers in Europe, we have cropped the shapefile to Europe for you.
```{r warning=F, message=F}
library(rgdal)
europe_shape <- readOGR(dsn="./data/shapefiles", layer="europe_map")
```
We can look at what class _europe_shape_ is:
```{r}
class(europe_shape)
```
Above we mentioned that a _SpatialPolygonsDataFrame_ contains both a list of _SpatialPolygons_ and a data frame with information on each polygon. Here is how we can access the information in the data frame (the `@data` slot). We will start by displaying the header of some of the 94 features for now.
```{r}
europe_shape@data %>% select(c(NAME, SOVEREIGNT, SUBREGION, POP_EST))%>% head()
```
Now, we want to plot our towers on this map of Europe. First, we need to extract the projection from _europe_shape_ using `proj4string()` and add it to our _SpatialPoints_, thereby ensuring they are in the same reference system when plotting them together.
```{r message=F}
geo.proj <- sp::proj4string(europe_shape)
pts <- sp::SpatialPoints(sp_sites, proj4string = sp::CRS(geo.proj))
```
By using the _SpatialPoints_, now in the correct coordinate reference system, and the data from _europe_shape_, we can bind the country information to our tower sites in the data frame _df_sites_. The function `over()` from the package **sp** retrieves the indexes or attributes from the specified spatial object (_europe_shape_) at the spatial locations of a specified object (_pts_).
```{r paged.print=TRUE}
df_sites_country_info <- sp::over(pts, europe_shape) %>% # retrieves info from 'europe_shape' at the 'pts' locations
as_tibble() %>% # makes a tibble
bind_cols(df_sites, .) # bind the information gathered to the df_sites
head(df_sites_country_info) %>% select(1:6)
```
The header above shows the new data frame _df_sites_country_info_, consisting of the three columns ('site', 'lat', 'lon') from _df_sites_ followed by 91 columns containing information from _europe_shape_ at the spatial locations of _pts_ (the tower sites).
For the next part, we will be using a European map with country borders. For this, we will make an object called _shp_df_ using the column _SOVEREIGNT_, which we saw when looking at `europe_shape@data`. The package **broom** [@R-broom] converts objects from R into 'tidy tibbles'; this allows us to preserve only the data needed for further analysis and plotting.
```{r message=F}
shp_df <- broom::tidy(europe_shape, region = "SOVEREIGNT") # transforms the information in 'SOVEREIGNT' into a tidy df
head(shp_df)
```
The data frame above is the tidied version of the information in the _SOVEREIGNT_ column of _europe_shape_. It contains the coordinates, country (sovereignty) and order of each of the polygons from the _SpatialPolygonsDataFrame_. Plotting the information in this data frame looks as follows:
```{r message=F}
ggplot() +
geom_polygon(data = shp_df, aes(x = long, y = lat, group = group, fill = id), colour = "black") +
theme_bw() +
theme(legend.position = "none")
```
As we saw in the earlier plot of the tower locations, only some countries in Europe contain towers. Therefore, we want to crop our map to only include those countries.
We first create a vector listing the country each tower is located in. We obtain this vector from the data frame _df_sites_country_info_. Then, we filter out only those countries with a tower in them, and plot the resulting countries and tower locations.
```{r}
# character vector of the country each tower is located in, from column 'SOVEREIGNT'
countries_with_site <- df_sites_country_info %>% pull(SOVEREIGNT) %>% as.character()
# resulting character vector (without NAs)
countries_with_site %>% na.omit()
# country with towers filtered
shp_df_sub <- shp_df %>%
filter(id %in% countries_with_site)
# plot result
ggplot() +
geom_polygon(data = shp_df_sub, aes(x = long, y = lat, group = group, fill = id), colour = "black") +
geom_point(data = df_sites, aes(x = lon, y = lat)) +
theme_bw()
```
**Checkpoint**
You just learnt how to plot _europe_shape_ coloured by country ("SOVEREIGNT"); now plot it coloured by region in Europe.
Start by finding the column that gives you this information in `europe_shape@data`.
**Solution**
```{r message=F, warning=F}
# take a look at the first 10 column names
names(europe_shape@data)[1:10]
# make a df of the data needed from europe_shape
shp_df_region <- broom::tidy(europe_shape, region = "SUBREGION")
head(shp_df_region)
# plot europe coloured by region
ggplot() +
geom_polygon(data = shp_df_region, aes(x = long, y = lat, group = group, fill = id), colour = "black") +
geom_point(data = df_sites, aes(x = lon, y = lat)) +
theme_bw()
```
We now have a map showing the towers' locations as black circles within the different regions in Europe.
### Rasters
In this section we will learn how to use rasters. A raster is a matrix or grid of cells each of which contains information in the form of a value.
A raster object consists primarily of:
* A grid of cells;
* A coordinate reference system (CRS) for the grid and its cells so that we know the location to which the grid refers;
* A variable of interest for which each cell in the grid has a value, and;
* Other information relating to the CRS, projection, resolution, etc.
We will load and check some rasters and get familiar with functions to analyse them.
Since our ultimate goal is to predict productivity, we will consider different factors that might explain productivity, for example landcover and temperature.
To start off we'll look at some landcover data. With this data, we can investigate the surroundings of the Fluxnet towers. This data gives us the different classes of physical coverage of the Earth's surface, such as forests, grasslands, croplands, lakes, etc. Landcover types affect the fluxes measured by a tower through the vegetation that characterizes them, by influencing the radiation budget through albedo, or through water availability. Although we have some information from the overview file of the towers, we can check whether we can confirm this information. We will load landcover data from GlobCover, provided by the [European Space Agency (ESA)](http://due.esrin.esa.int/page_globcover.php). This project produces landcover maps using observations of surface reflectance from the 300 m MERIS sensor on board the ENVISAT satellite mission as its input.
```{r}
# library(raster) -> we have already done this but remember you need this to load rasters!
raster_landcover <- raster("./data/Globcover_EU.tif")
```
This landcover data consists of different categories. These GlobCover categories are based on the Land Cover Classification System (LCCS), which was developed by the Food and Agriculture Organization of the United Nations (FAO) to provide a consistent framework for the classification and mapping of land cover. They include various types of forest, shrub- and grasslands, cropland, and artificial surfaces.
Before we look at the data, we first reduce the spatial extent that the raster covers. We crop it to a rectangle around the site location of CH-Lae (+/- 2 degrees in longitudinal direction, +/- 1 degree in latitudinal direction).
```{r}
bounding_box <- extent(lon_lae-2, lon_lae+2, lat_lae-1, lat_lae+1)
raster_landcover_crop <- crop(raster_landcover, bounding_box)
```
We can plot our raster as the region around our tower. To indicate the position of our tower, we add a red dot.
```{r eval = F}
plot(raster_landcover_crop, legend=FALSE, xlab="longitude", ylab="latitude")
points(lon_lae, lat_lae, pch = 16, col = "red")
```
If we want to conduct analysis using two different rasters, we have to make sure they have the same resolution and are aligned on the same grid. We can read a second raster and compare the grids. We load spatio-temporal temperature rasters available from [CHELSA](https://chelsa-climate.org) to complement our flux tower measurements for further analysis.
The temperature raster data from CHELSA provides free high-resolution climate data (temperature and precipitation) for the past and the future. The past data we use here are downscaled temperature and precipitation estimates from the ERA-Interim climatic reanalysis (forecast models and data assimilation systems reanalysing archived observations). While the temperature algorithm is based on statistical downscaling of atmospheric temperatures, the precipitation algorithm incorporates orographic predictors including wind fields, valley exposition, and boundary layer height, with subsequent bias correction. The data we use consist of January temperature and precipitation data over Europe, averaged across the years 2006 to 2012. To compare this new raster with the previous one, we need to crop it to the same region around the tower. Cropping rasters can be done using the R package **raster** [@R-raster].
```{r}
library(raster)
raster_chelsa <- raster("./data/Chelsa_t_mean_2006-2012.tif")
raster_chelsa_crop <- crop(raster_chelsa, bounding_box)
```
We now plot our raster.
```{r}
pal <- colorRampPalette(c("purple","blue","cyan","green","yellow","red"))
plot(raster_chelsa_crop, col = pal(20), xlab="longitude", ylab="latitude")
points(lon_lae, lat_lae, pch = 16, col = "red")
```
We see that the temperature units are not in a format that is straightforward to read. To change the unit from Kelvin x10 to degrees Celsius, we have to divide the data by 10 and subtract 273.15.
```{r}
raster_chelsa_crop <- raster_chelsa_crop/10-273.15
```
We plot our raster again, with this temperature transformation.
```{r}
pal <- colorRampPalette(c("purple","blue","cyan","green","yellow","red"))
plot(raster_chelsa_crop, col = pal(20), xlab="longitude", ylab="latitude")
points(lon_lae, lat_lae, pch = 16, col = "red")
```
We have seen some of the basic functions in the package **raster**. Now we will load one more package to work with raster files: **rasterVis** [@R-rasterVis]. This package contains methods for enhanced visualization of and interaction with raster data. It implements visualization methods for quantitative and categorical data, both for univariate and multivariate rasters, and also provides methods to display spatiotemporal rasters and vector fields. To enhance the visualisation of our raster data, we will use the colour palettes from the package **RColorBrewer** [@R-RColorBrewer].
```{r message=F}
library(rasterVis)
library(RColorBrewer)
mapTheme <- rasterTheme(region = rev(brewer.pal(8,"RdYlBu")))
plt <- levelplot(raster_chelsa_crop, margin=FALSE, par.settings = mapTheme)
plt
```
**Checkpoint**
Create your own small raster with 100 grid cells. Give it values, a projection (e.g. '+proj=utm +zone=48 +datum=WGS84') and plot it.
If you chose the example projection provided, add _europe_shape_ to your plot.
**Solution**
```{r message=F, warning=F}
# create a raster
# short compact version
myraster <- raster(ncol=10, nrow=10)
# OR
# longer version
myraster <- raster()
ncol(myraster) <- 10
nrow(myraster) <- 10
# add values to the raster, this can be done in many ways here are two examples:
values(myraster) <- 1:ncell(myraster)
# OR
values(myraster) <- runif(ncell(myraster))
# add a projection
projection(myraster) <- "+proj=utm +zone=48 +datum=WGS84"
# plot it
plot(myraster)
# Since the raster has a projection we can visualise Europe on this new raster.
plot(europe_shape, add=TRUE)
```
#### Aggregating
A **raster grid** is uniquely defined by an _origin_, a point that is one of the intersections of the grid lines, and its _resolution_, the side length of each cell. In the R package **raster**, the origin is defined as the point closest to (0,0) that is still an intersection of grid lines. The functions `origin()` and `res()` from the **raster** package return the origin and resolution of a raster object. As an example, let us check the resolution and origin of the rasters we loaded before.
```{r}
cat(" Resolution Raster Landcover: ", res(raster_landcover), "\n",
"Resolution Raster Chelsa: ", res(raster_chelsa))
```
```{r}
cat(" Origin Raster Landcover: ", origin(raster_landcover), "\n",
"Origin Raster Chelsa: ", origin(raster_chelsa))
```
Landcover has a much higher resolution. We can re-grid the landcover raster to match the grid of the CHELSA raster. Aggregating means resampling an input raster to a coarser resolution based on a specified aggregation strategy. We determine the aggregation factor by dividing the resolution of the CHELSA raster by that of the landcover raster. We do this for both longitude and latitude to make sure they match.
```{r}
res_landcover <- res(raster_landcover)
res_chelsa <- res(raster_chelsa)
factor_agg_lon <- res_chelsa[1] / res_landcover[1]
factor_agg_lat <- res_chelsa[2] / res_landcover[2]
cat(" Longitudinal Aggregation Factor: ", factor_agg_lon, "\n",
"Latitudinal Aggregation Factor: ", factor_agg_lat)
```
We observed that the landcover raster has a higher resolution than the temperature raster; from the aggregation factor we deduced it is about 3 times higher. So we need to transform the landcover raster to make sure it has the same grid (same _origin_ and _resolution_) as the temperature raster. This will allow us to combine the two in further analysis. We achieve this using `aggregate()` from the R package **raster** to obtain a landcover raster at approximately 1 km x 1 km resolution.
The aggregate function takes three main arguments.
* **_x_** : the raster object to aggregate,
* **_fact_** : aggregation factor, i.e. the number of current cells that will make up the side of one of the new cells, and,
* **_fun_** : the function used to summarize the values of the fact^2 (in this case 3^2 = 9) cells being aggregated.
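The mechanics of these three arguments can be illustrated on a small toy raster (values chosen arbitrarily for this sketch):

```{r}
library(raster)
# toy 6 x 6 raster; each aggregated cell will summarise fact^2 = 9 input cells
r_toy <- raster(ncol = 6, nrow = 6, xmn = 0, xmx = 6, ymn = 0, ymx = 6)
values(r_toy) <- rep(c(1, 1, 2, 2, 3, 3), times = 6)
# fun = modal keeps the most frequent value, suitable for categorical data
r_toy_agg <- aggregate(r_toy, fact = 3, fun = modal)
res(r_toy)     # 1 1
res(r_toy_agg) # 3 3
```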
We aggregate the landcover raster to roughly 1 km using the aggregation factor calculated above.
```{r}
raster_landcover_crop_agg <- aggregate(raster_landcover_crop, fact = factor_agg_lon, fun = modal)
```
We check the resolutions and origins of the two rasters and plot the new landcover raster.
```{r}
cat(" Resolution Landcover: ", res(raster_landcover_crop_agg), "\n",
"Resolution Chelsa: ", res(raster_chelsa), "\n",
"Origin Landcover: ", origin(raster_landcover_crop_agg), "\n",
"Origin Chelsa: ", origin(raster_chelsa), "\n")
plot(raster_landcover_crop_agg, legend=FALSE, xlab="longitude", ylab="latitude")
```
We can see that the resolutions are now the same but the origin still differs. To get matching origins we must do another step: align the rasters.
#### Aligning
Aligning rasters is done using the function `resample()`. There are two main resampling methods, _Nearest Neighbour_ and _Bilinear Interpolation_; which method to use depends on the input data and how it will be used after the operation.
**_Nearest Neighbour_** is best used for categorical data like land-use classification or slope classification. The values that go into the grid stay exactly the same: a 2 comes out as a 2, and 99 comes out as 99. The value of the output cell is determined by the nearest cell centre on the input grid. Nearest Neighbour can be used on continuous data, but the results can be blocky.
**_Bilinear Interpolation_** uses a weighted average of the four nearest cell centres. The closer an input cell centre is to the output cell centre, the higher the influence of its value on the output cell value. This means that the output value can differ from the nearest input value, but it always stays within the range of the input values. Since the values can change, bilinear interpolation is not recommended for categorical data. Instead, it should be used for continuous data like elevation and raw slope values.
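The difference between the two methods can be seen on small toy rasters (arbitrary values, purely for illustration):

```{r}
library(raster)
# continuous toy raster and a coarser target grid
r_fine <- raster(ncol = 10, nrow = 10, xmn = 0, xmx = 10, ymn = 0, ymx = 10)
values(r_fine) <- 1:ncell(r_fine)
r_coarse <- raster(ncol = 5, nrow = 5, xmn = 0, xmx = 10, ymn = 0, ymx = 10)
# nearest neighbour only reuses input values; bilinear averages the 4 nearest centres
r_ngb <- resample(r_fine, r_coarse, method = "ngb")
r_bil <- resample(r_fine, r_coarse, method = "bilinear")
all(values(r_ngb) %in% values(r_fine)) # TRUE: no new values are created
```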
```{r eval = F}
raster_landcover_crop_resampl <- resample(raster_landcover_crop, raster_chelsa_crop, method="ngb")
plot(raster_landcover_crop_resampl, legend=FALSE, xlab="longitude", ylab="latitude")
```
```{r include=FALSE}
raster_landcover_crop_resampl <- resample(raster_landcover_crop, raster_chelsa_crop, method="ngb")
```
![](./figures/04_output-crop.png)
Now, we can check whether the two rasters are aligned.
```{r}
cat(" Resolution Landcover: ", res(raster_landcover_crop_resampl), "\n",
"Resolution Chelsa: ", res(raster_chelsa), "\n",
"Origin Landcover: ", origin(raster_landcover_crop_resampl), "\n",
"Origin Chelsa: ", origin(raster_chelsa), "\n")
```
As we can see in the output of the code above, both the resolution and the origin are identical. This means our rasters are now aligned. Remember, the _resolution_ is the size of the cells or pixels in a raster, and the _origin_ is the grid intersection closest to _(0,0)_, i.e. the point you would reach if you moved towards _(0,0)_ in steps of the x and y resolution of the raster being analysed.
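This definition of the origin can be demonstrated on a toy raster whose grid is deliberately shifted (values here are arbitrary):

```{r}
library(raster)
# a grid shifted 0.3 units in x: grid lines sit at 0.3, 1.3, 2.3, ...
r_shift <- raster(ncol = 10, nrow = 10, xmn = 0.3, xmx = 10.3, ymn = 0, ymx = 10)
origin(r_shift) # x-origin is 0.3, y-origin is 0
```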
#### Point extraction
In this section, we will extract values from our rasters at specific point locations, namely the locations of our towers. The function to extract values is aptly named `extract()`. First, we specify the raster from which the values should be extracted and then the exact sites at which the values should be extracted. Here, we combine all of this into a single data frame.
```{r message=F}
df_sites_temp <- read_csv("./data/fluxnet_site_info_reduced.csv")
df_sites_temp <- extract(raster_chelsa, sp_sites, sp = TRUE) %>%
as_tibble() %>%
right_join(df_sites_temp, by = c("lon", "lat")) %>%
dplyr::rename(temp_chelsa = Chelsa_t_mean_2006.2012)
```
In the lectures, we have seen that we must always be aware of the limitations of modelled data. Therefore, we do not want to rely on the CHELSA temperature data alone; it is important to consider different models and compare them. Hence, we will look at another climate model: [WorldClim](https://worldclim.org/). The WorldClim website provides a database of global weather and climate data at high spatial resolution. The data can be downloaded for historic ("near current") and future climate and weather conditions. We will be using the historical monthly weather data. Due to the size of the rasters downloaded from WorldClim, we have done this already and prepared a raster of the mean annual temperature for the years 2006 to 2012. Through this we lose some detail, but for our analysis it suffices.
Next, we load and plot the provided rasters showing the mean annual temperatures for the WorldClim data for the years 2006-2012. WorldClim is a set of global climate layers (gridded climate data) with a spatial resolution between 1-340 km^2.
We want to extract values from WorldClim at the same tower sites as we used previously, and add them to the _df_sites_temp_ data frame.
This will give us a tidy data frame with the site, latitude and longitude, and the CHELSA and WorldClim temperatures at each tower location.
```{r}
raster_worldclim <- raster("./data/WC_mean_t_2006-2012.tif")
df_sites_temp <- extract(raster_worldclim, sp_sites, sp = TRUE) %>%
as_tibble() %>%
right_join(df_sites_temp, by = c("lon", "lat")) %>%
dplyr::rename(temp_wc = WC_mean_t_2006.2012)
head(df_sites_temp)
```
We obtain the monthly mean temperature of January 2006 from FLUXNET data.
```{r message=F}
library(lubridate)
load("./data/ddf_allsites_nested_joined.RData")
df_sites_temp <- ddf_allsites_nested_joined %>%
dplyr::select(site = siteid, data) %>%
unnest(data) %>%
dplyr::select(site, TIMESTAMP, TA_F) %>%
mutate(month = month(TIMESTAMP), year = year(TIMESTAMP)) %>%
filter(month == 1, year == 2006) %>%
group_by(site) %>%
summarise(temp_fluxnet = mean(TA_F)) %>%
# add temp data extracted from spatial files (chelsa and worldclim)
left_join(dplyr::select(df_sites_temp, site, lon, lat, temp_wc, temp_chelsa),
by = "site")
```
We compare the climate models by showing the correlations with the towers' data. The dotted line shows what the linear model would look like if the values of our model and the tower data matched perfectly. The red line shows the actual linear model of the relationship between the climate model and the tower data. The closer the red line is to the dotted line, the better the model is at predicting the temperatures measured by the towers.
```{r message=F, warning=F}
df_sites_temp$temp_chelsa <- df_sites_temp$temp_chelsa/10 -273.15
# we remove the sites we removed before
df_sites_temp <- df_sites_temp %>%
filter(!(site %in% c("RU-Cok", "DK-ZaH")))
df_sites_temp %>%
ggplot(aes(x = temp_fluxnet, y = temp_chelsa), xlim=c(-10,5)) +
geom_point() +
geom_abline(intercept=0, slope=1, linetype="dotted") +
geom_smooth(method='lm', color="red", size=0.5, se=FALSE) +
theme_classic()
```
We compute the mean squared error (MSE). The MSE is the average of the squared errors of all data points relative to a fitted line or model. Put simply, it measures the differences between the actual data points and the points estimated by the line or model; the differences are squared so that negative deviations also contribute positively. In a model you want to minimise the MSE, as a small MSE means the model's predictions are close to the actual data. The R package **Metrics** [@R-Metrics] has a function `mse()` to calculate the mean squared error.
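As a quick sanity check on made-up numbers, `mse()` is simply the mean of the squared differences:

```{r}
actual    <- c(1, 2, 3)
predicted <- c(1.5, 1.5, 3.5)
mean((actual - predicted)^2)    # 0.25
Metrics::mse(actual, predicted) # identical result
```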
```{r}
library(Metrics)
mse(df_sites_temp$temp_fluxnet, df_sites_temp$temp_chelsa)
```
```{r message=F}
df_sites_temp %>%
ggplot(aes(x = temp_fluxnet, y = temp_wc), xlim=c(-10,5)) +
geom_point() +
geom_abline(intercept=0, slope=1, linetype="dotted") +
geom_smooth(method='lm', color="red", size=0.5, se=FALSE) +
theme_classic()
```
We compute the mean squared error.
```{r}
mse(df_sites_temp$temp_fluxnet, df_sites_temp$temp_wc)
```
We see that CHELSA correlates better with the temperature values at the towers, since its mean squared error is smaller. We also see that WorldClim, compared to CHELSA, tends to predict colder temperatures.
It is important to remember that CHELSA and WorldClim are both 'modelled data' and as such come with errors and uncertainties. When using environmental layers for analyses, the data source should be well researched and possible uncertainty or errors carefully considered to understand the weaknesses and limitations of the data.
The main differences between CHELSA and WorldClim come down to the input data of their models. As mentioned earlier, CHELSA predicts its variables based on downscaled ERA-Interim data (a global atmospheric reanalysis) and implements bias corrections.
WorldClim's predicted variables are based on point data from weather stations, which is interpolated based on elevation, longitude and latitude. Due to this dependence on weather station data, the accuracy of WorldClim depends more strongly on station density. For a more detailed comparison of the two climate datasets and why such models must be used with care, refer to the @atmos12050543 paper [here](https://www.mdpi.com/2073-4433/12/5/543).
**Checkpoint**
Above, we compared the CHELSA temperatures with the tower temperatures. Now we want to see visually which of the two datasets has the higher mean for each tower location.
To achieve this, plot the map of Europe and add the tower locations as points. Then colour the points so that locations where the CHELSA mean is higher get one colour and locations where the tower temperature is higher get another.
**Solution**
```{r message=F, warning=F}
# There are more ways to do this, but we used the function ifelse().
# It takes 3 arguments: a condition, what to return if the condition is TRUE, and what to return if it is FALSE.
# So here if CHELSA temperatures are higher we want the point to be purple otherwise orange:
cols_chelsa_flux <- ifelse(df_sites_temp$temp_chelsa > df_sites_temp$temp_fluxnet,"purple","orange")
# We could also do different shapes if we wanted...
pch_chelsa_flux <- ifelse(df_sites_temp$temp_chelsa > df_sites_temp$temp_fluxnet, 19 ,17)
## Plot the result using base R:
# plot(europe_shape)
# points(df_sites_temp$lon, df_sites_temp$lat, pch = pch_chelsa_flux, col = cols_chelsa_flux)
## Plot the result using ggplot:
ggplot() +
geom_polygon(data = europe_shape, aes(x = long, y = lat, group = group), fill=NA, colour = "black") +
geom_point(data = df_sites_temp, aes(x = lon, y = lat), color = cols_chelsa_flux, shape = pch_chelsa_flux, size=2) +
labs(x = "Longitude", y = "Latitude") +
theme_classic()+
coord_quickmap()
```
From the resulting map of Europe, we can deduce that for the purple points (circles), the mean CHELSA temperature is higher than the measured tower temperature, while for the orange points (triangles), the tower measured a higher temperature than CHELSA predicted. That there appear to be more purple circles suggests CHELSA tends to predict higher temperatures than are measured by the fluxnet towers.
#### Correlations with GPP
We already had a look at GPP in Chapter \@ref(ch-02). In this section we want to correlate CHELSA temperature and landcover with GPP.
In models we often take multiple variables and correlate them with another variable (here GPP). By checking how well they correlate, we can assess whether any of the variables make good so-called _predictors_. Below we will explore how well landcover and temperature correlate with GPP.
We start by extracting the mean GPP at the tower sites. To do this, we group our data by _siteid_ (the tower) and year. This way we obtain a GPP total for each year and then an average across all years for each site. We will also consider the median to see if this changes the correlation.
To compare GPP and landcover, we will use both the landcover from the raster used above and the landcover given in the fluxnet data for each tower site. There is a column in the fluxnet data set called `umd_land_cover`, which refers to the University of Maryland landcover classification scheme.
```{r message=F}
GPP_ann <- ddf_allsites_nested_joined %>%
unnest(data) %>%
mutate(year = year(TIMESTAMP)) %>%
group_by(siteid, year) %>%
summarise(gpp_ann = sum(GPP_NT_VUT_REF)) %>%
ungroup() %>%
group_by(siteid) %>%
summarise(gpp_meanann = mean(gpp_ann), gpp_medianann = median(gpp_ann)) %>%
dplyr::select(siteid, gpp_meanann, gpp_medianann)
GPP_ann <- cbind(GPP_ann, 'umd_land_cover' = ddf_allsites_nested_joined$umd_land_cover)
# for the correlations we remove the sites we don't have in the other datasets
GPP_ann <- GPP_ann %>%
filter(!siteid %in% c("DE-Obe","DK-ZaH", "RU-Cok"))
```
We then extract landcover data from the GlobCover raster at the tower sites.
```{r message=F, warning=F}
df_sites_landcover <- extract(raster_landcover, sp_sites, sp = TRUE)
df_sites_landcover <- as_tibble(df_sites_landcover)
```
To make the correlations, we create a data frame with all the extracted temperature, landcover and GPP data from the tower sites.
```{r}
landcover <- df_sites_landcover$Globcover_EU
chelsa <- df_sites_temp$temp_chelsa
df_corr_ann <- cbind(landcover, chelsa, GPP_ann)
df_corr_ann <- as.data.frame(df_corr_ann)
head(df_corr_ann)
```
We will start by correlating the CHELSA temperatures with GPP. We do this by building a simple linear model using the function `lm()`. We do this for both the mean and the median.
```{r}
# linear model of temperature and mean annual GPP:
reg_mean_1 <- lm(gpp_meanann~chelsa, data = df_corr_ann)
summary(reg_mean_1)
# linear model of temperature and median annual GPP:
reg_median_1 <- lm(gpp_medianann~chelsa, data = df_corr_ann)
summary(reg_median_1)
```
To visualise the correlation we plot our temperature data with the two linear models.
```{r message=F, warning=F}
p_mean <- df_corr_ann %>%
ggplot(aes(x = chelsa, y = gpp_meanann)) +
geom_point() +
geom_smooth(method='lm', color="red", size=0.5, se=FALSE) +
xlab("Chelsa temperature") +
ylab("GPP mean [µmolCO2 m-2 s-1]") +
theme_classic()
p_mean
p_median <- df_corr_ann %>%
ggplot(aes(x = chelsa, y = gpp_medianann)) +
geom_point() +
geom_smooth(method='lm', color="blue", size=0.5, se=FALSE) +
xlab("Chelsa temperature") +
ylab("GPP median [µmolCO2 m-2 s-1]") +
theme_classic()
p_median
```
We can see a slight positive correlation between both the mean and median annual GPP and the CHELSA temperatures. To see if this varies across the year, we will also investigate the correlation between mean seasonal GPP and CHELSA temperatures. The same steps apply as for the mean annual GPP, except that the data is grouped into seasons according to the solstices and equinoxes.
```{r}
WS <- as.Date("2012-12-21", format = "%Y-%m-%d") # Winter Solstice
SE <- as.Date("2012-3-21", format = "%Y-%m-%d") # Spring Equinox
SS <- as.Date("2012-6-20", format = "%Y-%m-%d") # Summer Solstice
AE <- as.Date("2012-9-23", format = "%Y-%m-%d") # Autumn Equinox
GPP_seasonal <- ddf_allsites_nested_joined %>%
unnest(data) %>%
mutate(site = siteid) %>%
group_by(site) %>%
mutate(season = case_when(TIMESTAMP >= WS | TIMESTAMP < SE ~ "Winter",
TIMESTAMP >= SE & TIMESTAMP < SS ~ "Spring",
TIMESTAMP >= SS & TIMESTAMP < AE ~ "Summer",
TIMESTAMP >= AE & TIMESTAMP < WS ~ "Autumn")) %>%
group_by(site, season) %>%
summarise(gpp_season = mean(GPP_NT_VUT_REF)) %>%
ungroup()
# we remove the towers we don't have in the other datasets for the correlations
GPP_seasonal <- GPP_seasonal %>%
filter(!site %in% c("DE-Obe","DK-ZaH", "RU-Cok"))
# here we aren't looking at landcover, so we only add the CHELSA temperatures to the correlation data frame.
chelsa_season <- df_sites_temp[,c(1,6)]
df_corr_season <- merge(GPP_seasonal, chelsa_season, by = "site")
df_corr_season <- as.data.frame(df_corr_season)
head(df_corr_season)
```
Once the data frame for the seasonal correlations is ready, we order the data by season and plot the results.
```{r}
# re-order by season as they should appear in plots
order_by_season <- c("Spring", "Summer","Autumn", "Winter")
df_corr_season <- arrange(transform(df_corr_season, season=factor(season, levels=order_by_season)), season)
p_mean_season <- df_corr_season %>%
ggplot(aes(x = temp_chelsa, y = gpp_season)) +
geom_point() +
geom_smooth(method='lm', color="red", size=0.5, se=FALSE) +
xlab("Chelsa temperature") +
ylab("GPP mean [µmolCO2 m-2 s-1]") +
facet_wrap(vars(season)) +
theme_classic()
p_mean_season
```
From the graph above, we can see that the correlation between GPP and temperature differs among the seasons. Summer temperatures show a negative correlation with GPP. Spring, Autumn and Winter temperatures, on the other hand, are slightly positively correlated with GPP. Spring and Summer seem to show more variability in GPP at different temperatures.
Back to landcover and GPP. The landcover column from the GlobCover raster contains numbers, even though it is actually categorical data. Each number represents a specific landcover type, so we can add a column with the corresponding landcover category name.
```{r}
df_corr_ann <- df_corr_ann %>%
mutate(landcover_cat = case_when(landcover == 14 ~ "cropland",
landcover == 20 ~ "mosaic cropland",
landcover == 50 ~ "deciduous forest",
landcover == 70 ~ "evergreen forest",
landcover == 90 ~ "needleleaved forest",
landcover == 110 ~ "mosaic forest",
landcover == 120 ~ "mosaic forest/shrub/grass",
landcover == 130 ~ "mosaic shrubland",
landcover == 140 ~ "mosaic grassland",
landcover == 150 ~ "sparse vegetation")) %>%
droplevels()
```
Again, we can plot it to visualise the correlation. However, since landcover is categorical data we will make a boxplot instead.
A boxplot is a figure that gives a graphical representation of the variability or spread of the data, providing information beyond its central tendency. A boxplot shows the _median_ (the mid-point of the data, or 50th percentile, drawn as a bold line in the middle of the box), the _interquartile range_ (the _box_, whose edges are the 25th and 75th percentiles), the whiskers (the lines extending from the box), and any outliers as points beyond the whiskers.
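These summary statistics can also be computed directly; a small illustration with arbitrary numbers:

```{r}
x_toy <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
quantile(x_toy, c(0.25, 0.5, 0.75)) # 25th percentile, median, 75th percentile
boxplot.stats(x_toy)$out            # values flagged as outliers (here 100)
```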
We make a boxplot for mean GPP and the GlobCover landcover raster, as well as mean GPP and the provided fluxnet tower site landcover for comparison.
```{r warning=F}
ggplot(data = df_corr_ann, aes(x = landcover_cat, y = gpp_meanann, group = as.factor(landcover_cat)))+
geom_boxplot() +
labs(x = "Landcover Category (raster)", y = "GPP mean [µmolCO2 m-2 s-1]") +
theme_classic() +
theme(axis.text.x = element_text(angle = 90, size = 10))
ggplot(data = df_corr_ann, aes(x = umd_land_cover, y = gpp_meanann, group = as.factor(umd_land_cover)))+
geom_boxplot() +
labs(x = "Landcover Category (tower - umd)", y = "GPP mean [µmolCO2 m-2 s-1]") +
theme_classic() +
theme(axis.text.x = element_text(angle = 90, size = 10))
```
We don't see a particularly strong correlation with the two plotted landcover categorisations, as the mean GPP varies depending on the landcover type. Thus we need to consider other predictors and more complex models to find better predictors of GPP.
### Key points from the tutorial
**Integrating remote sensing data**: We extracted NDVI in the area around our eddy flux towers using the **MODISTools** package and converted it into raster format.
* We then collapsed the NDVI measured within an area around the eddy flux towers to calculate the mean across years.
* We merged this temporal data with our eddy flux data.
* We applied a linear regression to test whether adding NDVI data to our model of GPP and PPFD_IN improves its accuracy.