---
title: "Cleaning geographic data"
output: html_document
---
```{r global_options, include=FALSE}
knitr::opts_chunk$set(fig.width=12, fig.height=8, eval = FALSE,
echo=TRUE, warning=FALSE, message=FALSE,
tidy = TRUE, collapse = TRUE,
results = 'hold')
```
## Background
Biases and issues with data quality are a central problem hampering the use of publicly available species occurrence data in ecology and biogeography. Biases are hard to address, but improving data quality by identifying erroneous records is possible. Major problems are: data entry errors leading to erroneous occurrence records, imprecise geo-referencing (mostly of pre-GPS records), and missing metadata specifying data entry and coordinate precision. Manual data cleaning based on expert knowledge can detect most of these issues, but it is only feasible at small taxonomic or geographic scales and is difficult to reproduce. Automated cleaning procedures are a more scalable alternative.
## Learning objectives
After this exercise you will be able to:
1. Visualize the data and identify potential problems
2. Use *CoordinateCleaner* to automatically flag problematic records
3. Use GBIF-provided metadata to improve coordinate quality, tailored to your downstream analyses
## Input data needed
- the .csv files with geo-referenced species occurrences from GBIF
## Exercises
Here we will use the data downloaded during the first exercise, look at common problems, and use automated flagging software and GBIF metadata to identify potentially problematic records. For this tutorial we will exclude those records, but in general we highly recommend double-checking records to avoid losing data and introducing new biases. You can find a similar tutorial using illustrative data [here](https://azizka.github.io/CoordinateCleaner/articles/Tutorial_Cleaning_GBIF_data_with_CoordinateCleaner.html). As in the first exercise, the necessary R functions for each question are given in brackets. Get help for any function with ?FUNCTIONNAME.
1. Load your occurrence data downloaded from GBIF in the first exercise and limit the data to columns with potentially useful information (`read_csv`, `names`, `select`).
2. Visualize the coordinates on a map (`borders`, `ggplot`, `geom_point`).
3. Clean the coordinates based on available metadata. A good starting point is to plot continuous variables as histograms and look at the values of discrete variables. Remove unsuitable records and plot again (`table`, `filter`).
4. Apply automated flagging to identify potentially problematic records (`clean_coordinates`, `plot`).
5. Visualize the difference between the uncleaned and cleaned dataset (`plot`).
6. Redo the workflow for the taxonomically broader occurrence data.
## Possible questions for your research project
* How many records were potentially problematic?
* What were the major issues?
* Were any of the records you collected in the field flagged as problematic? If so, what happened?
# Tutorial
The following suggestions for data cleaning are not comprehensive, nor a one-size-fits-all solution. You might want to change, omit, or add steps depending on your research question and scale. *Remember:* what counts as 'good data' depends entirely on the type of downstream analyses and their spatial scale. The cleaning here might be a good starting point for continental-scale biogeographic analyses.
## Library setup
You will need the following R libraries for this exercise; just copy the code chunk into your R console to load them. You might need to install some of them separately (see the snippet after this chunk).
```{r}
library(tidyverse)
library(rgbif)
library(sp)
library(countrycode)
library(CoordinateCleaner)
```
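If any of these packages are missing on your machine, something along these lines installs them (a convenience sketch, not part of the original exercise):
```{r, eval = FALSE}
# install any of the required packages that are not yet available
pkgs <- c("tidyverse", "rgbif", "sp", "countrycode", "CoordinateCleaner")
missing <- pkgs[!pkgs %in% installed.packages()[, "Package"]]
if (length(missing) > 0) install.packages(missing)
```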
## 1. Load your occurrence data downloaded from GBIF
GBIF provides a large amount of information for each record, leading to a huge data.frame with many columns. However, some of this information is only available for a few records, so for most analyses most of the columns can be dropped. Here we will only retain information that identifies each record and information that is important for cleaning the data.
```{r}
dat <- read_tsv("example_data/day2_data_quality_and_sampling_bias/raw_occurrences_malvaceae_sa.csv",
                guess_max = 25000, quote = "")

names(dat) # a lot of columns

dat <- dat %>%
  dplyr::select(species, decimalLongitude, decimalLatitude, countryCode, individualCount,
                gbifID, family, taxonRank, coordinateUncertaintyInMeters, year,
                basisOfRecord, occurrenceStatus) %>% # you might find other columns useful depending on your downstream analyses
  # convert the country code from ISO2 to ISO3, as CoordinateCleaner expects later on
  mutate(countryCode = countrycode(countryCode, origin = "iso2c", destination = "iso3c"))
```
## 2. Visualize the coordinates on a map
Visualizing the data on a map can be extremely helpful to understand potential problems and to identify problematic records.
```{r, tidy = FALSE}
world.inp <- map_data("world")

ggplot() +
  geom_map(data = world.inp,
           map = world.inp,
           aes(x = long, y = lat, map_id = region),
           fill = "grey80") +
  xlim(min(dat$decimalLongitude, na.rm = T),
       max(dat$decimalLongitude, na.rm = T)) +
  ylim(min(dat$decimalLatitude, na.rm = T),
       max(dat$decimalLatitude, na.rm = T)) +
  geom_point(data = dat,
             aes(x = decimalLongitude, y = decimalLatitude),
             size = 1) +
  coord_fixed() +
  theme_bw() +
  theme(axis.title = element_blank())
```
## 3. Clean the coordinates based on available metadata
As you can see, there are some unexpected occurrence locations outside the current distribution range. We will find out the reasons for this in a minute. In this specific case we could relatively easily get rid of a large number of problematic records based on prior knowledge (we are only interested in records in South America), but we usually do not have this kind of knowledge when dealing with larger datasets, so we will try to get rid of those records in different ways. GBIF records often include metadata that can help locate problems. First we will remove records without coordinates, coordinates with very low precision, and unsuitable data sources. We will remove all records with a precision below 100 km, as this represents the grain size of many macro-ecological analyses; the number is somewhat arbitrary, and you should choose it based on your downstream analyses. We also exclude fossils, as we are interested in recent distributions, and records of unknown source, as we might deem them not reliable enough.
```{r}
# remove records without coordinates
dat_cl <- dat %>%
  filter(!is.na(decimalLongitude)) %>%
  filter(!is.na(decimalLatitude))

# remove records with low coordinate precision
hist(dat_cl$coordinateUncertaintyInMeters / 1000, breaks = 30)

dat_cl <- dat_cl %>%
  filter(coordinateUncertaintyInMeters / 1000 <= 100 | is.na(coordinateUncertaintyInMeters))

# remove unsuitable data sources, especially fossils
table(dat_cl$basisOfRecord)

dat_cl <- filter(dat_cl, basisOfRecord == "HUMAN_OBSERVATION" | basisOfRecord == "OBSERVATION" |
                   basisOfRecord == "PRESERVED_SPECIMEN" | is.na(basisOfRecord))
```
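The 100 km threshold above is a judgment call. A quick check like the following (an optional diagnostic, not part of the original workflow) counts how many records would survive a range of candidate thresholds, which can help you choose one suited to your downstream analyses:
```{r}
# how many records would be retained at different uncertainty thresholds (in km)?
# records without uncertainty metadata are kept in each case
sapply(c(10, 50, 100, 250), function(km) {
  sum(dat$coordinateUncertaintyInMeters / 1000 <= km |
        is.na(dat$coordinateUncertaintyInMeters))
})
```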
In the next step we will remove records with suspicious individual counts. GBIF includes a few absence records (individual count = 0 or occurrenceStatus = "ABSENT") as well as records with suspiciously high individual counts, which might indicate inappropriate data or data entry problems.
```{r}
# individual count
table(dat_cl$individualCount)

dat_cl <- dat_cl %>%
  filter(individualCount > 0 | is.na(individualCount)) %>% # remove absence records
  filter(individualCount < 99 | is.na(individualCount)) %>% # remove suspiciously high counts
  filter(occurrenceStatus == "PRESENT")
```
We might also want to exclude very old records, as they are more likely to be unreliable. For instance, records from before the Second World War are often very imprecise, especially if they were geo-referenced based on political entities. Additionally, old records may come from areas where the species has since gone extinct (for example due to land-use change).
```{r}
# age of records
table(dat_cl$year)

dat_cl <- dat_cl %>%
  filter(year > 1945) # remove records from before the Second World War
```
If available, we can also use external information in the cleaning process. For instance, in the case of the Bombacoideae, we can exclude all North American and European records, since the group does not occur there natively.
```{r}
dat_cl <- dat_cl %>%
filter(decimalLatitude < 23) %>%
filter(decimalLongitude > -120)
```
On top of the geographic cleaning, we also want to make sure to only include species-level records and records from the right taxon. Taxonomic issues such as misspelled names or synonyms can be a severe problem. We will not cover taxonomic cleaning here, but check out the [taxize R package](https://ropensci.org/tutorials/taxize_tutorial.html) or the [taxonomic name resolution service](http://tnrs.iplantcollaborative.org/) for that; a minimal name-resolution sketch follows the next chunk.
```{r}
table(dat_cl$family) #that looks good
table(dat_cl$taxonRank) # We will only include records identified to species level or below
dat_cl <- dat_cl%>%
filter(taxonRank %in% c("SPECIES", "SUBSPECIES", "VARIETY") | is.na(taxonRank))
```
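As a pointer, here is a minimal sketch of what automated name resolution with taxize could look like (hypothetical usage, not part of this tutorial's workflow; check the taxize documentation for the current interface):
```{r, eval = FALSE}
library(taxize)

# resolve a handful of species names against the Global Names Resolver;
# mismatches between submitted and matched names hint at spelling issues
resolved <- gnr_resolve(sort(unique(dat_cl$species))[1:10])
head(resolved[, c("submitted_name", "matched_name", "score")])
```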
We excluded almost `r round((nrow(dat) - nrow(dat_cl)) / nrow(dat) * 100, 0)`% of the initial data points based on metadata! Most of them were removed due to incomplete identification.
## 4. Apply automated flagging to identify potentially problematic records
To identify additional problems, we will run the automated flagging algorithm of the CoordinateCleaner package. The `clean_coordinates` function is a wrapper around a large set of automated cleaning steps that flag errors common to biological collections, including: sea coordinates, zero coordinates, coordinate-country mismatches, coordinates assigned to country and province centroids, coordinates within city areas, outlier coordinates, and coordinates assigned to biodiversity institutions. You can switch each test on or off using logical flags, modify the sensitivity of most individual tests using the ".rad" arguments, and provide custom gazetteers using the ".ref" arguments (see the sketch after the next chunk). See `?clean_coordinates` for help. The country-coordinate mismatch test needs ISO3 country codes, which is why we converted them with `countrycode` when loading the data. Since we work in a coastal area, we use a buffered land reference to avoid flagging records close to the sea.
```{r, warning = F, results = 'hold', collapse = T}
# flag problems
dat_cl <- data.frame(dat_cl)

flags <- clean_coordinates(x = dat_cl, lon = "decimalLongitude", lat = "decimalLatitude",
                           countries = "countryCode",
                           species = "species",
                           tests = c("capitals", "centroids", "equal", "gbif",
                                     "zeros", "countries", "seas"),
                           seas_ref = buffland) # most tests are on by default
```
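For illustration, here is a sketch of how the sensitivity arguments can be adjusted; the radii below are arbitrary example values, not recommendations (see `?clean_coordinates` for the defaults):
```{r, eval = FALSE}
# tighten the capitals test to a 5 km radius and widen the
# centroids test to 2 km (example values only)
flags_custom <- clean_coordinates(x = dat_cl,
                                  lon = "decimalLongitude", lat = "decimalLatitude",
                                  countries = "countryCode", species = "species",
                                  tests = c("capitals", "centroids"),
                                  capitals_rad = 5000,
                                  centroids_rad = 2000)
```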
Here an additional `r sum(!flags$.summary)` records were flagged! A look at the test summary and the plot reveals the major issues.
```{r}
summary(flags) # number of records flagged by each test
plot(flags, lon = "decimalLongitude", lat = "decimalLatitude")
```
### For marine species
*Only for marine groups*
The `clean_coordinates` function with the settings above will flag all records occurring in the sea as potentially problematic. If your taxa of interest include marine species, you may want to keep these occurrences. One option is to run `clean_coordinates` without the seas test. However, you may be interested in ONLY retaining records that occur in the sea and REMOVING all records from land (the reverse of the operation for terrestrial species). To do so, you can reverse the results of the seas test as follows:
```{r}
flags <- flags %>%
  # reverse the result of the seas test: records at sea now pass, records on land fail
  mutate(.sea = !.sea) %>%
  # .summary aggregates all tests, so recompute it: a record is clean only if it
  # passes every test (all columns starting with ".", except .summary itself)
  mutate(.summary = rowSums(!dplyr::select(., starts_with("."), -.summary)) == 0)

plot(flags, lon = "decimalLongitude", lat = "decimalLatitude")
```
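As an aside, each test is also available as a standalone `cc_*` function. A minimal sketch for keeping only marine records with `cc_sea` (assuming its `value = "flagged"` mode, which returns TRUE for records that pass, i.e. records on land):
```{r, eval = FALSE}
# cc_sea flags records at sea, so the "flagged" vector is TRUE on land;
# inverting it keeps only the marine records
land <- cc_sea(dat_cl, lon = "decimalLongitude", lat = "decimalLatitude",
               value = "flagged")
marine_records <- dat_cl[!land, ]
```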
After inspecting the flagged records, we can safely remove them for this tutorial.
```{r}
dat_cl <- dat_cl[flags$.summary, ]
```
## 5. Visualize the difference between the uncleaned and cleaned dataset
```{r, tidy = FALSE}
world.inp <- map_data("world")

# first plot showing the retained records in green and the removed records in red
ggplot() +
  geom_map(data = world.inp,
           map = world.inp,
           aes(x = long, y = lat, map_id = region), fill = "grey80") +
  xlim(min(dat$decimalLongitude, na.rm = T),
       max(dat$decimalLongitude, na.rm = T)) +
  ylim(min(dat$decimalLatitude, na.rm = T),
       max(dat$decimalLatitude, na.rm = T)) +
  geom_point(data = dat,
             aes(x = decimalLongitude, y = decimalLatitude),
             colour = "darkred", size = 1) +
  geom_point(data = dat_cl,
             aes(x = decimalLongitude, y = decimalLatitude),
             colour = "darkgreen", size = 1) +
  coord_fixed() +
  theme_bw() +
  theme(axis.title = element_blank())

# second plot showing only the retained records
ggplot() +
  geom_map(data = world.inp, map = world.inp,
           aes(x = long, y = lat, map_id = region),
           fill = "grey80") +
  xlim(min(dat$decimalLongitude, na.rm = T),
       max(dat$decimalLongitude, na.rm = T)) +
  ylim(min(dat$decimalLatitude, na.rm = T),
       max(dat$decimalLatitude, na.rm = T)) +
  geom_point(data = dat_cl,
             aes(x = decimalLongitude, y = decimalLatitude),
             size = 1) +
  coord_fixed() +
  theme_bw() +
  theme(axis.title = element_blank())
```
## 6. Write to disk
```{r}
write_csv(dat_cl, "cleaned_occurrences_malvaceae.csv")
```