documenting new train_test_split module
earth-chris committed Sep 19, 2022
1 parent f0f8b82 commit 16e6854
Showing 4 changed files with 52 additions and 9 deletions.
53 changes: 44 additions & 9 deletions docs/examples/geo.md
Return a single-column GeoDataFrame with pseudoabsences concatenated to presence records:

```python
presence_points = gpd.read_file('/path/to/occurrence-records.gpkg')
point_stack = ela.stack_geometries(presence_points, pseudoabsence_points)
```

Return 2 columns, with class labels assigned (1 for presences, 0 for pseudoabsences):

```python
point_stack = ela.stack_geometries(
    presence_points,
    pseudoabsence_points,
    add_class_label=True,
)
```

If the geometries are in different CRS, the default is to reproject to the presence CRS. Override this with `target_crs="background"`:

```python
point_stack = ela.stack_geometries(
    presence_points,
    pseudoabsence_points,
    add_class_label=True,
    target_crs="background",
)
```

Annotation refers to reading and storing raster values at the locations of a series of points.
Once you have your species presence and pseudo-absence records, you can annotate these records with the covariate data from each location.

```python
covariates = ela.annotate(
    point_stack,
    list_of_rasters,
    drop_na=True,
)
```

Covariate column labels can also be assigned explicitly:

```python
labels = [
    # ...
    "TMP-mean",
]

covariates = ela.annotate(
    point_stack,
    raster_paths,
    labels=labels,
    drop_na=True,
)
```

One way to add spatial information to a model is to compute geographically-explicit sample weights.
`elapid` does this by calculating sample weights based on the distance to the nearest neighbor. Points near other points receive lower weights; far-away points receive higher weights.

```python
sample_weight = ela.distance_weights(point_stack)
```

The default is to compute weights based on the distance to the nearest point. You can instead compute the average distance to the `n` nearest points, which weights samples by local point density rather than by the single nearest neighbor. This may be useful if you have small clusters of points far away from large, densely populated regions.

```python
sample_weight = ela.distance_weights(point_stack, n_neighbors=10)
```

These weights can be passed to many model fitting routines, typically via `model.fit(x, y, sample_weight=sample_weight)`. This is supported for `ela.MaxentModel()`, as well as many `sklearn` methods.
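
A minimal sketch of doing so, assuming the annotated `covariates` GeoDataFrame carries its class labels in a `"class"` column (the column name and the x/y split below are illustrative, not a fixed API):

```python
# fit a Maxent model with distance-based sample weights; assumes `covariates`
# (from ela.annotate above) holds class labels in a "class" column and that
# covariates and sample_weight are aligned (no rows dropped during annotation)
x = covariates.drop(columns=["class", "geometry"])
y = covariates["class"]

model = ela.MaxentModel()
model.fit(x, y, sample_weight=sample_weight)
```
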
This function uses `ela.nearest_point_distance()`, a handy function for computing the distance from each point to its nearest neighboring point.
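
If you want those distances directly, a hedged one-liner (the single-GeoDataFrame call signature is an assumption, not confirmed by this page):

```python
# assumption: ela.nearest_point_distance() accepts a point GeoDataFrame and
# returns each point's distance to its nearest neighbor
distances = ela.nearest_point_distance(point_stack)
```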

---

## Train/test splits

Uniformly random train/test splits are generally discouraged in spatial modeling because of the strong spatial structure inherent in many datasets. The non-independence of these data is referred to as spatial autocorrelation. Using distance- or density-based sample weights is one way to mitigate these effects. Another is to split the data into geographically distinct train/test regions to prioritize model generalization.

One method is to use a "checkerboard" system for creating train/test splits. Points are intersected with a regular grid, and alternating grid cells are used to assign points to the train and test sets.

```python
train, test = ela.checkerboard_split(point_stack, grid_size=1000)
```

The height and width of each grid cell are controlled by the `grid_size` parameter, which should be specified in the units of the point data's CRS. The call above would split the data along a 1x1 km grid if the CRS units were meters.
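
For example, a sketch of splitting in a projected CRS so `grid_size` is interpreted in meters (the EPSG code here is illustrative; use whatever projection suits your data):

```python
# reproject to a metric CRS before splitting on a 1 km checkerboard grid
point_stack_m = point_stack.to_crs("EPSG:32610")  # example UTM zone
train, test = ela.checkerboard_split(point_stack_m, grid_size=1000)
```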

The black and white structure of the checkerboard means this method can only generate one train/test split.

Alternatively, you can create `k` geographically-clustered folds using the `GeographicKFold` cross-validation strategy:

```python
gfolds = ela.GeographicKFold(n_folds=4)
for train_idx, test_idx in gfolds.split(point_stack):
    train_points = point_stack.iloc[train_idx]
    test_points = point_stack.iloc[test_idx]
    # split x/y data, fit models, evaluate, etc.
```

This method uses KMeans clustering, fit with the x/y locations of the point data, to group points into spatially distinct clusters. This cross-validation strategy is a good way to test how well models generalize outside of their training extents into novel geographic regions.
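
As a rough conceptual sketch of that idea (not `elapid`'s internals), geographic folds can be approximated by clustering point coordinates and holding out one cluster at a time:

```python
import numpy as np
from sklearn.cluster import KMeans

# cluster points by their x/y coordinates into spatially distinct groups
xy = np.column_stack((point_stack.geometry.x, point_stack.geometry.y))
clusters = KMeans(n_clusters=4, n_init=10).fit_predict(xy)

# hold out one cluster at a time as the geographic test fold
for cluster in np.unique(clusters):
    test_idx = np.where(clusters == cluster)[0]
    train_idx = np.where(clusters != cluster)[0]
```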

---

## Zonal statistics

In addition to the tools for working with Point data, `elapid` contains a routine for calculating zonal statistics from Polygon or MultiPolygon geometry types.
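
A hedged sketch of how that might look (the `ela.zonal_stats()` call and its arguments are assumptions for illustration, and `polygons` is a hypothetical GeoDataFrame):

```python
# assumption: a zonal statistics routine that summarizes each raster within
# each polygon; the exact function name and signature are not confirmed here
polygons = gpd.read_file('/path/to/ecoregions.gpkg')
zonal = ela.zonal_stats(polygons, list_of_rasters)
```
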
4 changes: 4 additions & 0 deletions docs/index.md
Transform covariate data into derivative `features` to expand data dimensionality.

Train and apply species distribution models based on annotated point data, configured with sensible defaults (like `elapid.MaxentModel()` and `elapid.NicheEnvelopeModel()`).

:satellite: **Training spatially-aware models**

Compute spatially-explicit sample weights, checkerboard train/test splits, or geographically-clustered cross-validation splits to reduce spatial autocorrelation effects (with `elapid.distance_weights()`, `elapid.checkerboard_split()`, and `elapid.GeographicKFold()`).

:earth_asia: **Applying models to rasters**

Apply any pixel-based model with a `.predict()` method to raster data to easily create prediction probability maps (like training a `RandomForestClassifier()` and applying with `elapid.apply_model_to_rasters()`).
3 changes: 3 additions & 0 deletions docs/module/train_test_split.md
# elapid.train_test_split

::: elapid.train_test_split
1 change: 1 addition & 0 deletions mkdocs.yml
nav:
- elapid.geo: 'module/geo.md'
- elapid.models: 'module/models.md'
- elapid.stats: 'module/stats.md'
- elapid.train_test_split: 'module/train_test_split.md'
- elapid.types: 'module/types.md'
- elapid.utils: 'module/utils.md'
- Contributing to elapid: 'contributing.md'
