db.qmd

# Dataset {#sec-dataset}

```{r,include=FALSE,echo=FALSE}
Echo=FALSE
Eval=TRUE
Cache=TRUE
Warng=FALSE
Msg=FALSE
Incl=FALSE
library(tidyverse)
library(cmdstanr)
library(posterior)
library(ggdist)
library(ggpubr)
library(ggrepel)
library(ggmap)
library(DT)
library(gt)
```

```{r,echo=Echo,eval=Eval,cache=Cache,warning=Warng,message=Msg}
data_path <- readLines('_data.path')
treeData <- readRDS(file.path(data_path, 'treeData.RDS'))
spIds <- read_csv(file.path(data_path, 'species_id.csv')) |>
	filter(sp_to_analyze)
```

This chapter outlines the two data sources and the steps to organize and create a tidy dataset for our analysis.
We focus on two open inventory data from eastern North America: the Forest Inventory and Analysis (FIA) dataset in the United States [@OConnell2007] and the Forest Inventory of Québec [@Naturelles2016].
From the FIA dataset, we utilized information collected from 37 states out of the total 50: Alabama, Arkansas, Connecticut, Delaware, Florida, Georgia, Illinois, Indiana, Iowa, Kansas, Kentucky, Louisiana, Maine, Maryland, Massachusetts, Michigan, Minnesota, Mississippi, Missouri, Nebraska, New Hampshire, New Jersey, New York, North Carolina, North Dakota, Ohio, Oklahoma, Pennsylvania, Rhode Island, South Carolina, South Dakota, Tennessee, Texas, Vermont, Virginia, West Virginia, and Wisconsin.

At the plot level, we retained plots sampled at least twice, excluding those that had undergone harvesting to focus solely on natural dynamics.
Specifically, we selected surveys conducted for the FIA dataset using the modern standardized methodology implemented since 1999.
After applying these filters, our final dataset encompassed nearly 26,000 plots spanning a latitude range from 26° to 53° (@fig-plotCoverage).
Each plot within the dataset was measured between 1970 and 2021, with observation frequencies ranging from 2 to 7 times and an average of 3 measurements per plot.
The time intervals between measurements varied from 1 to 40 years, with a median interval of 7 years (@fig-plotCoverage).

```{r,echo=Echo,eval=Eval,cache=Cache,warning=Warng,message=Msg}
#| label: fig-plotCoverage
#| fig-width: 9
#| fig-height: 7
#| fig-cap: "Spatial (top left) and temporal (top right) coverage of the dataset incorporating data from the USA and Quebec. The top right panel shows the distribution of observations per class of latitude for the 31 species used in this study."

treeData |>
	filter(species_id %in% spIds$species_id_old) |>
	group_by(plot_id) |>
	slice_head(n = 1) |>
	mutate(
		db_origin = case_match(
			db_origin,
			'qc' ~ 'Quebec',
			'fia' ~ 'FIA'
		)
	) ->
plot_dt

# get Eas North America map
# brds <- c(left = -109, bottom = 25, right = -53, top = 53)
# my_map <- get_stamenmap(brds, zoom = 3, maptype = 'terrain-background')
# saveRDS(my_map, file.path('data', 'eastNorthAmericaMap.RDS'))
readRDS(file.path('data', 'eastNorthAmericaMap.RDS')) |>
	ggmap() +
	geom_point(
		data = plot_dt,
		aes(
			x = longitude, y = latitude,
			color = db_origin
		),
		size = 0.05,
		alpha = 0.6
	) +
	scale_color_manual(
		values = c(rgb(158, 35, 137, maxColorValue = 255), rgb(86, 43, 143, maxColorValue = 255))
	) +
	labs(x = '', y = '', color = '') +
	theme_minimal() +
	theme(
		legend.position = 'none',
		axis.title.x=element_blank(),
		axis.text.x=element_blank(),
		axis.ticks.x=element_blank(),
		axis.title.y=element_blank(),
		axis.text.y=element_blank(),
		axis.ticks.y=element_blank()
	) +
	guides(color = guide_legend(override.aes = list(size=2, alpha = 1))) ->
p1

treeData |>
	filter(species_id %in% spIds$species_id_old) |>	
	filter(!is.na(latitude)) |>
	mutate(
		latc = cut(
			latitude,
			seq(min(latitude), max(latitude), length.out = 15),
			include.lowest = TRUE
		),
		db_origin = case_match(
			db_origin,
			'qc' ~ 'Quebec',
			'fia' ~ 'FIA'
		)
	) |>
	ggplot() +
	aes(year_measured, latc) +
	aes(fill = db_origin) +
	ggridges::geom_density_ridges2(color = 'transparent', alpha = 0.8) +
	theme_classic() +
	labs(
		x = 'Year of measurement',
		y = 'Latitude classes',
		fill = ''
	) +
	scale_fill_manual(
		values = c(rgb(158, 35, 137, maxColorValue = 255), rgb(86, 43, 143, maxColorValue = 255))
	) +
	theme(legend.position = 'none') ->
p2

treeData |>
	group_by(plot_id) |>
	reframe(nbYear = length(unique(year_measured))) |>
	filter(nbYear > 1) |>
	pull(plot_id) ->
plots_to_keep

treeData |>
	filter(plot_id %in% plots_to_keep) |>
	group_by(plot_id) |>
	reframe(
		db_origin = unique(db_origin),
		nbYear = length(unique(year_measured))
	) |>
	mutate(
		db_origin = case_match(
			db_origin,
			'qc' ~ 'Quebec',
			'fia' ~ 'FIA'
		)
	) |>
	ggplot() +
	aes(nbYear) +
	aes(fill = db_origin) +
	geom_histogram(alpha = 0.8) +
	theme_classic() +
	labs(
		x = 'Number of measurements per plot',
		y = '',
		fill = ''
	) +
	scale_fill_manual(
		values = c(rgb(158, 35, 137, maxColorValue = 255), rgb(86, 43, 143, maxColorValue = 255))
	) ->
p3

readRDS(file.path(data_path, 'fec_dt.RDS')) |>
	group_by(plot_id) |>
	reframe(
		db_origin = unique(db_origin),
		uq = unique(deltaYear_plot)
	) |>
	ggplot() +
	aes(uq) +
	aes(fill = db_origin) +
	geom_histogram(alpha = 0.8) +
	theme_classic() +
	labs(
		x = 'Time interval between measurements (years)',
		y = ''
	) +
	scale_fill_manual(
		values = c(rgb(158, 35, 137, maxColorValue = 255), rgb(86, 43, 143, maxColorValue = 255))
	) +
	theme(legend.position = 'none') ->
p4

ggarrange(
	p1, p2, p3, p4,
	nrow = 2, ncol = 2
)
```

These datasets provide individual-level information on the diameter at breast height (dbh) and the status (dead or alive) of more than 200 species.
From this pool, we selected the 31 most abundant species (@tbl-tabSpecies).
This selection comprises 9 conifer species and 21 hardwood species.
We ensured an even distribution of species across the continuous shade tolerance trait, with 3 species classified as very intolerant, 9 as intolerant, 8 as intermediate, 8 as tolerant, and 5 as very tolerant [@burns1990silvics].
While our analysis focuses on these 31 selected species, we retained the data for the remaining species for potential future exploration and analysis.
We initially included the *Ostrya virginiana* species in our analysis; however, we encountered substantial challenges with several growth, survival, and recruitment parameters, with substantial confidence intervals.
As a result, we decided to remove this species from our analysis, even though the fit files are still accessible for potential future investigations.

```{r,echo=Echo,eval=Eval,cache=Cache,warning=Warng,message=Msg}
#| label: tbl-tabSpecies
#| tbl-cap: "List of species and their frequency across the dataset."

treeData |>
	filter(species_id %in% spIds$species_id_old) |>
	left_join(
		spIds,
		by = c('species_id' = 'species_id_old')
	) |>
	group_by(species_name) |>
	reframe(
		`Species ID` = unique(species_id),
		`Number of\n plots` = length(unique(plot_id)),
		`Number of\n individual` = length(unique(tree_id)),
		`Number of\n observation` = n()
	) |>
	rename(
		`Species` = species_name
	) |>
	mutate(
		Species = paste0('*', Species, '*')
	) |>
	arrange(desc(`Number of\n plots`)) |>
	gt() |>
	fmt_markdown(columns = Species) |>
	tab_style(
		style = cell_text(weight = "bold"),
		locations = cells_column_labels()
	)
```

## Covariates

### Competition index

To assess competition effects, we calculated the basal area (BA) of living individuals for each species-plot-year combination based on the dbh of each $i$.
We first calculated individual BA in $m^2$ as follows:

$$
	BA^{ind}_i = \pi * (\frac{dbh_i}{2 * 1000})^2
$$

where the dbh is measured in millimeters. 
For a given plot $j$, the total plot basal area in square meters per hectare ($m^2 ha^{-1}$) is defined as:

$$
	BA^{plot}_j = \sum_i^{n_j}{BA^{ind}_i} \times \frac{10000}{plot~area_j}
$$

To compute the basal area of larger individuals (BAL), we summed the total basal area of individuals whose dbh exceeded that of the focal individual $i$ ($n_j > BA_i$):

$$
	BAL^{plot}_{ij} = \sum_i^{n_j > BA_i}{BA^{ind}_i} \times \frac{10000}{plot~area_j}
$$

Note that BAL is now specific to the plot ($j$) and the individual ($i$).


### Climate

We obtained the 19 bioclimatic variables with a 10 $km^2$ (300 arcsec) resolution grid, covering the period from 1970 to 2018.
These climate variables were modeled using the ANUSPLIN interpolation method [@mckenney2011].
We used the plot's longitude and latitude coordinates to extract the mean annual temperature (MAT) and mean annual precipitation (MAP).
To incorporate the climate covariates in the dataset, we used each plot's latitude and longitude coordinates to extract the mean annual temperature (MAT) and mean annual precipitation (MAP).
In cases where plots did not fall within a valid pixel of the climate variable grid, we interpolated the climate condition using the eight neighboring cells.
Due to the transitional nature of the dataset, we considered both the average and standard deviation of MAT and MAP over the years within each time interval.
This approach allows us to account for climate's averaged and temporal variability effects on tree demography.
Finally, we normalized MAT and MAP within the 0 and 1 range, facilitating comparisons of optimal climate conditions and climate breadth among species.

## R objects

All the scripts for dataset preparation are available at [https://github.com/willvieira/TreesDemography](https://github.com/willvieira/TreesDemography) within the `R` folder.
These scripts detail the complete process of downloading, cleaning, and merging the data, resulting in the final datasets used in our study.
Metadata associated with each script is described in Table @tbl-tabMetadata.

- `treeData.RDS`: This file contains the complete dataset in a tidy format, with one observation per row.

From the `treeData.RDS` dataset, we derived four additional datasets, each representing the transition between time $t$ and time $t + \Delta t`:

- `growth_dt.RDS`: Contains growth rate information between time $t$ and $t + \Delta t$.
- `status_dt.RDS`: Contains individuals' status (alive or dead) at both time $t$ and time $t + \Delta t$.
- `fec_dt.RDS`: Contains ingrowth rate (number of individuals entering the population at 127 mm) between time $t$ and $t + \Delta t$.
- `sizeIngrowth_dt.RDS`: Contains the size distribution of all individuals that entered the population for each remeasurement.

You can access all the data objects mentioned in the ownCloud folder via this link: [doc.ielab.usherbrooke.ca/s/83YSAiLAGeLm682](https://doc.ielab.usherbrooke.ca/s/83YSAiLAGeLm682).
In addition to the data objects, this folder contains three extra files:

- `climate_scaleRange.RDS` contains the maximum and minimum values required to scale the climate covariates.
- `db_metadata.csv` contains the csv description used in the table below.
- `species_id.csv`  contains species descriptions and additional traits used in the analysis.

```{r,echo=Echo,eval=Eval,cache=Cache,warning=Warng,message=Msg}
#| label: tbl-tabMetadata
#| tbl-cap: "Description of metadata for each variable linked to its corresponding dataset object (R object). Note that replicated variables shared among the R objects have been excluded. Consequently, there is no detailed description for the `sizeIngrowth_dt` dataset."

read_csv(file.path(data_path, 'db_metadata.csv')) |>
		gt() |>
		tab_style(
			style = cell_text(weight = "bold"),
			locations = cells_column_labels()
		)
```