documenting new train_test_split module
earth-chris committed Sep 19, 2022
1 parent f0f8b82 commit 16e6854
Showing 4 changed files with 52 additions and 9 deletions.
53 changes: 44 additions & 9 deletions docs/examples/geo.md
Return a single-column GeoDataFrame with pseudoabsences concatenated to presence records:

```python
presence_points = gpd.read_file('/path/to/occurrence-records.gpkg')
point_stack = ela.stack_geometries(presence_points, pseudoabsence_points)
```

Return 2 columns, with class labels assigned (1 for presences, 0 for pseudoabsences):

```python
point_stack = ela.stack_geometries(
    presence_points,
    pseudoabsence_points,
    add_class_label=True,
)
```

If the geometries are in different CRS, the default is to reproject to the presence CRS. Override this with `target_crs="background"`:

```python
point_stack = ela.stack_geometries(
    presence_points,
    pseudoabsence_points,
    add_class_label=True,
    target_crs="background",
)
```

Annotation refers to reading and storing raster values at the locations of a series of points.
Once you have your species presence and pseudo-absence records, you can annotate these records with the covariate data from each location.

```python
covariates = ela.annotate(
    point_stack,
    list_of_rasters,
    drop_na=True,
)
```

Covariate column labels can also be assigned explicitly:

```python
labels = [
    # ...
    "TMP-mean",
]

covariates = ela.annotate(
    point_stack,
    raster_paths,
    labels=labels,
    drop_na=True,
)
```

One way to add spatial information to a model is to compute geographically-explicit sample weights.
`elapid` does this by calculating sample weights based on the distance to the nearest neighbor. Points near other points receive lower weights; far-away points receive higher weights.

```python
sample_weight = ela.distance_weights(point_stack)
```

The default is to compute weights based on the distance to the nearest point. You can instead compute the average distance to the `n` nearest points, which weights samples by local point density rather than by the single nearest neighbor. This may be useful if you have small clusters of points far away from large, densely populated regions.

```python
sample_weight = ela.distance_weights(point_stack, n_neighbors=10)
```

These weights can be passed to many model fitting routines, typically via `model.fit(x, y, sample_weight=sample_weight)`. This is supported for `ela.MaxentModel()`, as well as many `sklearn` methods.
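
A minimal sketch of doing so, assuming the annotated `covariates` GeoDataFrame carries its class labels in a `"class"` column (the column name and the x/y split below are illustrative, not a fixed API):

```python
# fit a Maxent model with distance-based sample weights; assumes `covariates`
# (from ela.annotate above) holds class labels in a "class" column and that
# covariates and sample_weight are aligned (no rows dropped during annotation)
x = covariates.drop(columns=["class", "geometry"])
y = covariates["class"]

model = ela.MaxentModel()
model.fit(x, y, sample_weight=sample_weight)
```
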
This function uses `ela.nearest_point_distance()`, a handy function for computing the distance from each point to its nearest neighboring point.
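
If you want those distances directly, a hedged one-liner (the single-GeoDataFrame call signature is an assumption, not confirmed by this page):

```python
# assumption: ela.nearest_point_distance() accepts a point GeoDataFrame and
# returns each point's distance to its nearest neighbor
distances = ela.nearest_point_distance(point_stack)
```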

---

## Train/test splits

Uniformly random train/test splits are generally discouraged in spatial modeling because of the strong spatial structure inherent in many datasets. The non-independence of these data is referred to as spatial autocorrelation. Using distance- or density-based sample weights is one way to mitigate these effects. Another is to split the data into geographically distinct train/test regions to prioritize model generalization.

One method is to use a "checkerboard" system for creating train/test splits. Points are intersected with a regular grid, and alternating grid cells are used to assign points to the train and test sets.

```python
train, test = ela.checkerboard_split(point_stack, grid_size=1000)
```

The height and width of each grid cell are controlled by the `grid_size` parameter, which should be specified in the units of the point data's CRS. The call above would split the data along a 1x1 km grid if the CRS units were meters.
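
For example, a sketch of splitting in a projected CRS so `grid_size` is interpreted in meters (the EPSG code here is illustrative; use whatever projection suits your data):

```python
# reproject to a metric CRS before splitting on a 1 km checkerboard grid
point_stack_m = point_stack.to_crs("EPSG:32610")  # example UTM zone
train, test = ela.checkerboard_split(point_stack_m, grid_size=1000)
```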

The black and white structure of the checkerboard means this method can only generate one train/test split.

Alternatively, you can create `k` geographically-clustered folds using the `GeographicKFold` cross-validation strategy:

```python
gfolds = ela.GeographicKFold(n_folds=4)
for train_idx, test_idx in gfolds.split(point_stack):
    train_points = point_stack.iloc[train_idx]
    test_points = point_stack.iloc[test_idx]
    # split x/y data, fit models, evaluate, etc.
```

This method uses KMeans clustering, fit with the x/y locations of the point data, to group points into spatially distinct clusters. This cross-validation strategy is a good way to test how well models generalize outside of their training extents into novel geographic regions.
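
As a rough conceptual sketch of that idea (not `elapid`'s internals), geographic folds can be approximated by clustering point coordinates and holding out one cluster at a time:

```python
import numpy as np
from sklearn.cluster import KMeans

# cluster points by their x/y coordinates into spatially distinct groups
xy = np.column_stack((point_stack.geometry.x, point_stack.geometry.y))
clusters = KMeans(n_clusters=4, n_init=10).fit_predict(xy)

# hold out one cluster at a time as the geographic test fold
for cluster in np.unique(clusters):
    test_idx = np.where(clusters == cluster)[0]
    train_idx = np.where(clusters != cluster)[0]
```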

---

## Zonal statistics

In addition to the tools for working with Point data, `elapid` contains a routine for calculating zonal statistics from Polygon or MultiPolygon geometry types.
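
A hedged sketch of how that might look (the `ela.zonal_stats()` call and its arguments are assumptions for illustration, and `polygons` is a hypothetical GeoDataFrame):

```python
# assumption: a zonal statistics routine that summarizes each raster within
# each polygon; the exact function name and signature are not confirmed here
polygons = gpd.read_file('/path/to/ecoregions.gpkg')
zonal = ela.zonal_stats(polygons, list_of_rasters)
```
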
4 changes: 4 additions & 0 deletions docs/index.md
Transform covariate data into derivative `features` to expand data dimensionality.

Train and apply species distribution models based on annotated point data, configured with sensible defaults (like `elapid.MaxentModel()` and `elapid.NicheEnvelopeModel()`).

:satellite: **Training spatially-aware models**

Compute spatially-explicit sample weights, checkerboard train/test splits, or geographically-clustered cross-validation splits to reduce spatial autocorrelation effects (with `elapid.distance_weights()`, `elapid.checkerboard_split()`, and `elapid.GeographicKFold()`).

:earth_asia: **Applying models to rasters**

Apply any pixel-based model with a `.predict()` method to raster data to easily create prediction probability maps (like training a `RandomForestClassifier()` and applying with `elapid.apply_model_to_rasters()`).
3 changes: 3 additions & 0 deletions docs/module/train_test_split.md
# elapid.train_test_split

::: elapid.train_test_split
1 change: 1 addition & 0 deletions mkdocs.yml
nav:
- elapid.geo: 'module/geo.md'
- elapid.models: 'module/models.md'
- elapid.stats: 'module/stats.md'
- elapid.train_test_split: 'module/train_test_split.md'
- elapid.types: 'module/types.md'
- elapid.utils: 'module/utils.md'
- Contributing to elapid: 'contributing.md'
