This repository contains code templates illustrating the workflow presented in the article:
Programs and initiatives aiming to protect biodiversity and ecosystems have increased over the last decades in response to their decline. Most of these are based on monitoring data to quantitatively describe trends in biodiversity and ecosystems. The estimation of such trends, at large scales, requires the integration of numerous data from multiple monitoring sites. However, due to the high heterogeneity of data formats and the resulting lack of interoperability, the data integration remains sparsely used and synthetic analyses are often limited to a restricted part of the data available. Here we propose a workflow, comprising four main steps, from data gathering to quality control, to better integrate ecological monitoring data and to create a synthetic dataset that will make it possible to analyse larger sets of monitoring data, including unpublished data. The workflow was designed and applied in the production of the Status of Coral Reefs of the World: 2020 report, where more than two hundred individual datasets were integrated to assess the status and trends of hard coral cover at the global scale. The workflow was applied to two case studies and associated R codes, based on the experience acquired during the production of this report. The proposed workflow allows for the integration of datasets with different levels of taxonomic and spatial precision, with a high degree of reproducibility. It provides a conceptual and technical framework for the integration of ecological monitoring data, allowing for the estimation of temporal trends in biodiversity and ecosystems or to test ecological hypotheses at larger scales.
The first case study corresponds to path 3A of the workflow and illustrates the integration of data from the monitoring of benthic communities (sessile organisms) in coral reefs. Because taxonomic identification is difficult, most monitoring programs use broad categories (e.g. algae, hard living coral). Thus, during data integration, a taxonomic re-categorization must be done to ensure that common categories are used across the integrated datasets. Note that the different datasets were created to illustrate the workflow and do not correspond to any existing real datasets.
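
The taxonomic re-categorization described above can be sketched as a join between the raw records and a lookup table. This is a minimal illustration in base R, not the code of the repository scripts: the substrate codes (`HC`, `TA`, `SD`, `MA`), the data values and the object names are hypothetical, and only the `Category` and `Group` levels come from Table 2.

```r
# Hypothetical raw records from one contributor, using its own substrate codes
raw <- data.frame(
  Site = c("S1", "S1", "S1", "S2"),
  Code = c("HC", "TA", "SD", "MA"),
  Cover = c(35, 20, 45, 100)
)

# Hypothetical lookup table linking contributor codes to the common levels
lookup <- data.frame(
  Code = c("HC", "TA", "MA", "SD"),
  Category = c("Hard living coral", "Algae", "Algae", "Abiotic"),
  Group = c(NA, "Turf algae", "Macroalgae", "Sand")
)

# Left join: every raw record is kept, even if its code is not in the lookup
recat <- merge(raw, lookup, by = "Code", all.x = TRUE, sort = FALSE)

# Codes without a match must be resolved before the datasets are grouped
unmatched <- unique(recat$Code[is.na(recat$Category)])
```

Keeping the lookup table separate from the data makes the mapping explicit and easy to review for each contributor.
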
The folder `path_3a` contains the folders `data` and `R`. The `data` folder gathers all the data files, numbered according to their stage in the workflow. Hence, the folder `01_raw` contains the raw data files as received from data contributors, the folder `02_reformatted` contains the individually reformatted datasets, the file `03_synthetic-dataset` corresponds to the grouped data after taxonomic assignment, and the file `04_final-synthetic-dataset` corresponds to the final synthetic dataset.
The numbering of these data files and folders matches the scripts in the `R` folder, which contains three scripts. The first (`step-2_individual-data-reformatting.Rmd`) handles the individual data reformatting (step 2 of the workflow), with one or more code chunks per data contributor. The second (`step-3_data-grouping-tax-assignment.Rmd`) handles the data grouping and taxonomic assignment (step 3 of the workflow), and the last (`step-4_quality-assurance-quality-control.Rmd`) handles the quality assurance and quality control (step 4 of the workflow).
The .Rmd format (R Markdown) was chosen for the R scripts because it allows better segmentation and annotation of the code and process (necessary for step 2) and the export of code and output to an HTML file, which may include interactive tables, plots and maps (necessary for steps 3 and 4). The HTML files can be opened with a web browser (e.g. Google Chrome); an internet connection is required to display the interactive maps.
The `01_raw` folder includes five folders corresponding to the data shared by five different data contributors. Each of them represents a specific case:

- One .xlsx file containing two sheets, the first with the main data and the second with the substrate codes.
- One .xlsx file containing two sheets, the first with the main data in wide format and the second with the site coordinates.
- One .xlsx file containing three sheets with the same column names, corresponding to three different sites.
- Three .csv files, where the first two contain data for the same site but for two different years, and the third contains the substrate codes.
- Three .xlsx files, where the first two contain data in wide format with different column names, and the third contains the site coordinates.

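
Several of the cases above provide the main data in wide format, which has to be converted to long format before the datasets can be grouped. The sketch below shows one way to do this in base R. It assumes that the wide columns hold substrate categories, which the description above does not specify, and all column names and values are hypothetical.

```r
# Hypothetical wide-format data: one column per substrate category
wide <- data.frame(
  Site = c("S1", "S2"),
  Year = c(2019L, 2019L),
  Hard_coral = c(40, 25),
  Algae = c(35, 50),
  Sand = c(25, 25)
)

# Columns holding cover values (everything except the identifier columns)
cover_cols <- setdiff(names(wide), c("Site", "Year"))

# Build one long table per cover column, then stack them
long <- do.call(rbind, lapply(cover_cols, function(col) {
  data.frame(
    Site = wide$Site,
    Year = wide$Year,
    Code = col,
    Cover = wide[[col]]
  )
}))
```

The result has one row per site, year and substrate code, which is the layout of the synthetic dataset.
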
The first step of the workflow is to select the variables that must be present in the final synthetic dataset. The variables selected for the first case study are described in Table 1.
Table 1. Variables selected for the benthic synthetic dataset. The variable categories (`Cat.`) are description, spatial, temporal, methodological, taxonomic and metric variables. Variable names in parentheses correspond to the Darwin Core (DwC) terms.

| No. | Variable (DwC) | Cat. | Type | Unit | Description |
|---|---|---|---|---|---|
| 1 | DatasetID (datasetID) | Description | Factor | | Dataset ID |
| 2 | Area (higherGeography) | Spatial | Factor | | Biogeographic area |
| 3 | Country (country) | Spatial | Factor | | Country |
| 4 | Archipelago (islandGroup) | Spatial | Factor | | Archipelago |
| 5 | Location (stateProvince) | Spatial | Factor | | Location or island within the country |
| 6 | Site (locality) | Spatial | Factor | | Site within the location |
| 7 | Replicate (parentEventID) | Spatial | Integer | | Replicate ID |
| 8 | Zone (habitat) | Spatial | Factor | | Reef zone |
| 9 | Latitude (decimalLatitude) | Spatial | Numeric | | Latitude of the site (decimal format) |
| 10 | Longitude (decimalLongitude) | Spatial | Numeric | | Longitude of the site (decimal format) |
| 11 | Depth (verbatimDepth) | Spatial | Numeric | m | Mean depth |
| 12 | Year (year) | Temporal | Integer | | Year |
| 13 | Date (eventDate) | Temporal | Date | | Date (YYYY-MM-DD) |
| 14 | Method (samplingProtocol) | Methodological | Factor | | Description of the method used |
| 15 | Observer | Methodological | Factor | | Name of the diver |
| 16 | Category | Taxonomic | Factor | | See Table 2 |
| 17 | Group | Taxonomic | Factor | | See Table 2 |
| 18 | Family (family) | Taxonomic | Factor | | Family |
| 19 | Genus (genus) | Taxonomic | Factor | | Genus |
| 20 | Species (scientificName) | Taxonomic | Factor | | Species |
| 21 | Cover (measurementValue) | Metric | Numeric | % | Cover percentage |

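
Once the variables are selected, each reformatted dataset can be brought to the same set of columns. The following base R sketch adds the variables a contributor did not provide, reorders the columns as in Table 1 and coerces two of the types. The example dataset and the object names are hypothetical; only the variable names come from Table 1.

```r
# Variable names of the benthic synthetic dataset, in the order of Table 1
vars <- c(
  "DatasetID", "Area", "Country", "Archipelago", "Location", "Site",
  "Replicate", "Zone", "Latitude", "Longitude", "Depth", "Year", "Date",
  "Method", "Observer", "Category", "Group", "Family", "Genus", "Species",
  "Cover"
)

# Hypothetical reformatted dataset that provides only some of the variables
reformatted <- data.frame(
  Cover = c(40, 60),
  Category = c("Hard living coral", "Algae"),
  Year = c("2019", "2019"),
  Site = c("S1", "S1"),
  DatasetID = c("0001", "0001")
)

# Add the variables that the contributor did not provide, filled with NA
missing_vars <- setdiff(vars, names(reformatted))
reformatted[missing_vars] <- NA

# Reorder the columns and coerce types to those listed in Table 1
reformatted <- reformatted[vars]
reformatted$Year <- as.integer(reformatted$Year)
reformatted$Cover <- as.numeric(reformatted$Cover)
```

With identical column names and order, the individual datasets can then be stacked without further adjustment.
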
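Table 2. Factor levels of the variables `Category` and `Group` used for the re-categorization.

| Category | Group |
|---|---|
| Abiotic | Rock |
| Abiotic | Rubble |
| Abiotic | Sand |
| Abiotic | Silt |
| Algae | Coralline algae |
| Algae | Cyanophyceae |
| Algae | Macroalgae |
| Algae | Turf algae |
| Hard bleached coral | |
| Hard dead coral | |
| Hard living coral | |
| Other fauna | Actiniaria |
| Other fauna | Alcyonacea |
| Other fauna | Antipatharia |
| Other fauna | Asteroidea |
| Other fauna | Bivalvia |
| Other fauna | Bryozoa |
| Other fauna | Corallimorpharia |
| Other fauna | Crinoidea |
| Other fauna | Echinoidea |
| Other fauna | Gastropoda |
| Other fauna | Holothuroidea |
| Other fauna | Hydrozoa |
| Other fauna | Ophiuroidea |
| Other fauna | Polychaeta |
| Other fauna | Porifera |
| Other fauna | Tunicata |
| Other fauna | Zoantharia |
| Seagrass | |
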
The second case study corresponds to path 3B of the workflow and illustrates the integration of data from the monitoring of fish communities (vagile organisms) in coral reefs. Because fish monitoring is based on true taxonomic levels (e.g. species, genus) instead of broad categories, a taxonomic verification must be carried out during data integration to avoid misspelled names and to incorporate recent taxonomic updates. Note that the different datasets were created to illustrate the workflow and do not correspond to any existing real datasets.
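
The spelling part of this taxonomic verification can be illustrated with approximate string matching. The sketch below is an assumption-laden example in base R, not the method used in the repository scripts: a small hypothetical reference list stands in for an authoritative taxonomic database, and the closest reference name is proposed for each unmatched name using the edit distance from `utils::adist`.

```r
# Hypothetical reference list standing in for an authoritative taxonomic source
reference <- c("Chaetodon lunula", "Acanthurus triostegus", "Scarus psittacus")

# Hypothetical names as received from contributors (two are misspelled)
observed <- c(
  "Chaetodon lunula", "Acanthurus triostegus",
  "Acanthurus triostegis", "Scarus psitacus"
)

# Names without an exact match in the reference need verification
to_check <- setdiff(observed, reference)

# Edit distance between each unmatched name and each reference name
distances <- adist(to_check, reference)

# Propose the closest reference name; a person should confirm each proposal
corrections <- data.frame(
  Observed = to_check,
  Suggested = reference[apply(distances, 1, which.min)],
  Distance = apply(distances, 1, min)
)
```

Proposals with a large distance are more likely to be different taxa than misspellings, so they deserve a manual check rather than an automatic replacement. Synonyms and recent taxonomic revisions cannot be detected this way and require an up-to-date reference source.
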
The folder `path_3b` contains the folders `data` and `R`. The `data` folder gathers all the data files, numbered according to their stage in the workflow. Hence, the folder `01_raw` contains the raw data files as received from data contributors, the folder `02_reformatted` contains the individually reformatted datasets, the file `03_synthetic-dataset` corresponds to the grouped data after taxonomic assignment, and the file `04_final-synthetic-dataset` corresponds to the final synthetic dataset.
The numbering of these data files and folders matches the scripts in the `R` folder, which contains three scripts. The first (`step-2_individual-data-reformatting.Rmd`) handles the individual data reformatting (step 2 of the workflow), with one or more code chunks per data contributor. The second (`step-3_data-grouping-tax-assignment.Rmd`) handles the data grouping and taxonomic assignment (step 3 of the workflow), and the last (`step-4_quality-assurance-quality-control.Rmd`) handles the quality assurance and quality control (step 4 of the workflow).
The .Rmd format (R Markdown) was chosen for the R scripts because it allows better segmentation and annotation of the code and process (necessary for step 2) and the export of code and output to an HTML file, which may include interactive tables, plots and maps (necessary for steps 3 and 4). The HTML files can be opened with a web browser (e.g. Google Chrome); an internet connection is required to display the interactive maps.
The `01_raw` folder includes five folders corresponding to the data shared by five different data contributors. Each of them represents a specific case:

- One .xlsx file containing two sheets, the first with the main data and the second with the site coordinates.
- Two .csv files, the first with the main data in wide format and the second with the site coordinates.
- One .xlsx file containing three sheets, the first with the main data, the second with the site coordinates and the third with the species codes.
- Two files, one in .xlsx and one in .csv. The .xlsx file contains two sheets with the same column names, corresponding to two different sites. The .csv file contains the site coordinates.
- Four .xlsx files with one sheet each, where the first three contain the main data, with the same column names, for three different sites, and the fourth contains the site coordinates.

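
Several of the cases above split the main data across sheets or files with identical column names and provide the site coordinates separately. A minimal base R sketch of how such pieces can be assembled is given below. The data frames stand in for sheets or files that would first be read from disk, and all names and values are hypothetical.

```r
# Hypothetical fish counts from two sites, stored with identical column names
site_a <- data.frame(
  Site = "A",
  Species = c("Chaetodon lunula", "Scarus psittacus"),
  Density = c(2, 5)
)
site_b <- data.frame(
  Site = "B",
  Species = "Chaetodon lunula",
  Density = 1
)

# Stack the per-site tables into a single table
main <- rbind(site_a, site_b)

# Hypothetical site coordinates provided in a separate sheet or file
coords <- data.frame(
  Site = c("A", "B"),
  Latitude = c(-17.48, -17.53),
  Longitude = c(-149.85, -149.77)
)

# Left join on the site name so that every record is kept
merged <- merge(main, coords, by = "Site", all.x = TRUE)

# Sites left without coordinates point to a mismatch in site names
sites_without_coords <- unique(merged$Site[is.na(merged$Latitude)])
```

Checking `sites_without_coords` right after the join catches spelling differences in site names between the main data and the coordinates table.
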
The first step of the workflow is to select the variables that must be present in the final synthetic dataset. The variables selected for the second case study are described in Table 3.
Table 3. Variables selected for the fish synthetic dataset. The factor levels of the variable `Size_type` are Total length, Fork length and Standard length. The variable categories (`Cat.`) are description, spatial, temporal, methodological, taxonomic and metric variables. Variable names in parentheses correspond to the Darwin Core (DwC) terms.

| No. | Variable (DwC) | Cat. | Type | Unit | Description |
|---|---|---|---|---|---|
| 1 | DatasetID (datasetID) | Description | Factor | | Dataset ID |
| 2 | Area (higherGeography) | Spatial | Factor | | Biogeographic area |
| 3 | Country (country) | Spatial | Factor | | Country |
| 4 | Archipelago (islandGroup) | Spatial | Factor | | Archipelago |
| 5 | Location (stateProvince) | Spatial | Factor | | Location or island within the country |
| 6 | Site (locality) | Spatial | Factor | | Site within the location |
| 7 | Replicate (parentEventID) | Spatial | Integer | | Replicate ID |
| 8 | Zone (habitat) | Spatial | Factor | | Reef zone |
| 9 | Latitude (decimalLatitude) | Spatial | Numeric | | Latitude of the site (decimal format) |
| 10 | Longitude (decimalLongitude) | Spatial | Numeric | | Longitude of the site (decimal format) |
| 11 | Depth (verbatimDepth) | Spatial | Numeric | m | Mean depth |
| 12 | Year (year) | Temporal | Integer | | Year |
| 13 | Date (eventDate) | Temporal | Date | | Date (YYYY-MM-DD) |
| 14 | Method (samplingProtocol) | Methodological | Factor | | Description of the method used |
| 15 | Observer | Methodological | Factor | | Name of the diver |
| 16 | Family (family) | Taxonomic | Factor | | Family |
| 17 | Genus (genus) | Taxonomic | Factor | | Genus |
| 18 | Species (scientificName) | Taxonomic | Factor | | Species |
| 19 | Density | Metric | Numeric | n. ind. 100 m⁻² | Number of individuals |
| 20 | Size | Metric | Numeric | cm | Size of individuals |
| 21 | Size_type (measurementType) | Metric | Factor | | Size type used to measure the size |
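
The variable definitions in Table 3 translate directly into checks that can be run on the synthetic dataset during quality assurance and quality control. The sketch below, in base R, is only an illustration of that idea and not the content of the step 4 script: the example records are hypothetical and deliberately contain errors, and the chosen checks are assumptions. Only the `Size_type` levels come from the caption of Table 3.

```r
# Allowed levels of Size_type, as listed in the caption of Table 3
size_levels <- c("Total length", "Fork length", "Standard length")

# Hypothetical records; rows 2 and 3 contain deliberate errors
fish <- data.frame(
  Site = c("A", "B", "C"),
  Latitude = c(-17.5, 95.2, -21.1),
  Longitude = c(-149.8, 166.4, 55.5),
  Year = c(2018L, 2019L, 2020L),
  Date = as.Date(c("2018-05-12", "2019-07-03", "2021-01-15")),
  Density = c(4, 2, -1),
  Size_type = c("Total length", "Fork length", "Total lenght")
)

# One logical column per check, TRUE when the record passes
checks <- cbind(
  Latitude = fish$Latitude >= -90 & fish$Latitude <= 90,
  Longitude = fish$Longitude >= -180 & fish$Longitude <= 180,
  Year = fish$Year == as.integer(format(fish$Date, "%Y")),
  Density = fish$Density >= 0,
  Size_type = fish$Size_type %in% size_levels
)

# Number of failures per check, and the records failing at least one check
failures <- colSums(!checks)
flagged <- fish[rowSums(!checks) > 0, ]
```

Flagged records are best sent back to the reformatting step or to the data contributor rather than corrected silently in the synthetic dataset.
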
