-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathLab01b_regression.Rmd
210 lines (151 loc) · 7.09 KB
/
Lab01b_regression.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
---
title: "Lab 01B Species Distribution Modeling - Logistic Regression"
author: "Julia Parish"
date: "2022/01/19"
bibliography: bibliography.bib
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(scipen = 999)
```
```{r load pkgs data}
librarian::shelf(
DT, dplyr, dismo, GGally, here, readr, tidyr)
select <- dplyr::select # overwrite raster::select
options(readr.show_col_types = F)
dir_data <- here("data/sdm")
pts_env_csv <- file.path(dir_data, "pts_env.csv")
```
```{r}
redo <- TRUE
pts_env <- read_csv(pts_env_csv)
nrow(pts_env)
datatable(pts_env, rownames = F)
```
# Learning Objectives {-}
- **Exploratory Data Analysis** (cont'd):
- Pairs plot to show correlation between variables and avoid **multicollinearity** (see [8.2 Many predictors in a model](https://openintro-ims.netlify.app/model-mlr.html#many-predictors-in-a-model))
- **Logistic Regression** seen as an evolution of techniques
- **Linear Model** to show simplest multivariate regression, but predictions can be outside the binary values.
- **Generalized Linear Model** uses a logit transformation to constrain the outputs to being within two values.
- **Generalized Additive Model** allows for "wiggle" in predictor terms.
- **Maxent** (Maximum Entropy) is a presence-only modeling technique that allows for a more complex set of shapes between predictor and response.
# Explore (cont'd)
#### Pairs Plot
Show correlations between variables.
```{r ggpairs, fig.cap="Pairs plot with `present` color coded.", fig.width=11, fig.height=11, eval=T, warning=FALSE}
GGally::ggpairs(
select(pts_env, -ID),
aes(color = factor(present), alpha = 0.5))
```
# Logistic Regression
```{r fig.cap="Full model workflow with split, fit, predict and evaluate process steps.", echo=F}
# [model-workflow - Google Drawings](https://docs.google.com/drawings/d/1bnevzcNpkRtopo3jagpwBy0IyNN-uh_mVJXvQg4WRtU/edit)
knitr::include_graphics("https://docs.google.com/drawings/d/e/2PACX-1vSab7DR3sDS1aoRWNDCfb8NUDE5a1P701NIZrb5zb5ss3RAhvfLJxuRLhHpJpKAW8krtGRkwm-jKa1Y/pub?w=932&h=344")
```
## Setup Data
Let's setup a data frame with only the data we want to model by:
- Dropping rows with any NAs. Later we'll learn how to ["impute"](https://bbest.github.io/eds232-ml/glossary.html#imputation) values with guesses so as to not throw away data.
- Removing terms we don't want to model. We can then use a simplified formula $present \sim .$ to predict $present$ based on all other fields in the data frame (i.e. the $X$`s in $y \sim x_1 + x_2 + ... x_n$).
```{r setup d}
# setup model data
d <- pts_env %>%
select(-ID) %>% # remove terms we don't want to model
tidyr::drop_na() # drop the rows with NA values
nrow(d)
```
## Linear Model
Let's start as simply as possible with a linear model `lm()` on multiple predictors `X` to predict presence `y` using a simpler workflow.
```{r fig.cap="Simpler workflow with only fit and predict of all data, i.e. no splitting.", echo=F}
# [model-workflow_fit-all-predict-all - Google Drawings](https://docs.google.com/drawings/d/1JtWF-4dltWR0hSAZBGaqQdLShReLVjLhKZG8iRpVY2A/edit)
knitr::include_graphics("https://docs.google.com/drawings/d/e/2PACX-1vRRvQdrDELT9QRPJxj8ooPvABgjQ-qqjzC0Ri9Hhv6n1O7N31eb4o6c7ZhZd6QtQ7XgZybNhUBESK9L/pub?w=572&h=170")
```
```{r fit lm}
# fit a linear model
mdl <- lm(present ~ ., data = d)
summary(mdl)
y_predict <- predict(mdl, d, type="response")
y_true <- d$present
range(y_predict)
range(y_true)
```
The problem with these predictions is that it ranges outside the possible values of present `1` and absent `0`. (Later we'll deal with converting values within this range to either `1` or `0` by applying a cutoff value; i.e. any values > 0.5 become 1 and below become 0.)
## Generalized Linear Model
To solve this problem of constraining the response term to being between the two possible values, i.e. the **probability** $p$ of being one or the other possible $y$ values, we'll apply the logistic transformation on the response term.
$$
logit(p_i) = \log_{e}\left( \frac{p_i}{1-p_i} \right)
$$
We can expand the expansion of the predicted term, i.e. the probability $p$ of being either $y$, with all possible predictors $X$ whereby each coeefficient $b$ gets multiplied by the value of $x$:
$$
\log_{e}\left( \frac{p_i}{1-p_i} \right) = b_0 + b_1 x_{1,i} + b_2 x_{2,i} + \cdots + b_k x_{k,i}
$$
```{r fit glm}
# fit a generalized linear model with a binomial logit link function
mdl <- glm(present ~ ., family = binomial(link="logit"), data = d)
summary(mdl)
y_predict <- predict(mdl, d, type="response")
range(y_predict)
```
Excellent, our response is now constrained between 0 and 1. Next, let's look at the term plots to see the relationship between predictor and response.
```{r termplots glm}
# show term plots
termplot(mdl, partial.resid = TRUE, se = TRUE, main = F, ylim="free")
```
## Generalized Additive Model
With a generalized additive model we can add "wiggle" to the relationship between predictor and response by introducing smooth `s()` terms.
```{r fit gam}
librarian::shelf(mgcv)
# fit a generalized additive model with smooth predictors
mdl <- mgcv::gam(
formula = present ~ s(WC_alt) + s(WC_bio1) + s(WC_bio12) + s(WC_bio15) +
s(ER_PETseasonality) + s(ER_topoWet) + s(ER_climaticMoistureIndex) + s(lon) + s(lat),
family = binomial, data = d)
summary(mdl)
# show term plots
plot(mdl, scale=0)
```
## Maxent (Maximum Entropy)
Maxent is probably the most commonly used species distribution model ([Elith 2011](http://dx.doi.org/10.1111/j.1472-4642.2010.00725.x)) since it performs well with few input data points, only requires presence points (and samples background for comparison) and is easy to use with a Java graphical user interface (GUI).
```{r fit maxent}
# load extra packages
librarian::shelf(
maptools, sf)
mdl_maxent_rds <- file.path(dir_data, "mdl_maxent.rds")
# show version of maxent
if (!interactive())
maxent()
```
```{r}
# get environmental rasters
# NOTE: the first part of Lab 1. SDM - Explore got updated to write this clipped environmental raster stack
env_stack_grd <- file.path(dir_data, "env_stack.grd")
env_stack <- stack(env_stack_grd)
plot(env_stack, nc=2)
```
```{r}
# get presence-only observation points (maxent extracts raster values for you)
obs_geo <- file.path(dir_data, "obs.geojson")
obs_sp <- read_sf(obs_geo) %>%
sf::as_Spatial() # maxent prefers sp::SpatialPoints over newer sf::sf class
# fit a maximum entropy model
if (!file.exists(mdl_maxent_rds) | redo){
mdl <- maxent(env_stack, obs_sp)
readr::write_rds(mdl, mdl_maxent_rds)
}
mdl <- read_rds(mdl_maxent_rds)
# plot variable contributions per predictor
plot(mdl)
```
```{r}
# plot term plots
response(mdl)
```
```{r}
# predict
y_predict <- predict(env_stack, mdl) #, ext=ext, progress='')
plot(y_predict, main='Maxent, raw prediction')
data(wrld_simpl, package="maptools")
plot(wrld_simpl, add=TRUE, border='dark grey')
```
Notice how the `plot()` function produces different outputs depending on the class of the input object. You can view help for each of these with R Console commands: `?plot.lm`, `?plot.gam` and <code class="r">`plot,DistModel,numeric-method`</code>.