---
title: "Predicting Wildfire Size"
author: "Iris Foxfoot, Seonga Cho, Vanessa Echeverri Figueroa"
date: "15/3/2022"
bibliography: bibliography.bib
csl: apa.csl
output:
html_document:
toc: true # table of content true
toc_depth: 3 # up to three depths of headings (specified by #, ## and ###)
number_sections: true # if you want number sections at each table header
theme: united # many options for theme, this one is my favorite.
highlight: tango # specifies the syntax highlighting style
toc_float: true #makes table of contents float while scrolling
code_folding: hide #enables code folding
---
```{r setup, include=T, message = FALSE}
knitr::opts_chunk$set(
echo = TRUE,
message = FALSE,
warning = FALSE
)
library(tidyverse) #used for data wrangling and viz
library(here) #simplifies file paths
library(rsample) #used to split data
library(janitor) #used to clean data names
library(corrplot) #for correlation viz
library(VIM) #for missing data viz
library(mice) #for imputing missing data
library(ggfortify) #for PCA viz
library(fastDummies) #for dummy coding
library(caret) #for cross validation
library(class) #for knn
library(gbm) #for boosted trees modeling
library(randomForest) #for random forest modeling
library(maptools)
library(RColorBrewer)
library(classInt)
library(ggplot2)
library(ggrepel)
library(mapproj)
library(viridis)
library(pander)
library(knitr)
#turns scientific notation off
options(scipen = 100)
#some of our workflow includes randomness. here we set a seed so our workflow can be reproduced without worrying about random variation
set.seed(123)
```
# Introduction
The purpose of this project is to use machine learning techniques to predict the size of a fire in the event that a wildfire occurs at a given location, considering characteristics of climate, vegetation, season, and more. Our most accurate model will then be used to assess how different physical and climate characteristics may influence a wildfire's predicted size, and could potentially support decisions to prevent damage to structures and injuries to vulnerable populations.
## Background
Wildfires in the United States have been intensifying over recent decades, not only in occurrence but in size. Their causes range from propensity due to physical characteristics [see @keeley20092007] to external factors, such as historical Native American burning practices [@stephens2007prehistoric] and climate change, among other reasons. Wildfires have caused injuries and even deaths, and have destroyed structures and caused damage that is costly to repair [@keeley20092007]. However, wildfires also supply nutrients to forests and are part of their regular life cycle. Understanding why wildfires are intensifying may provide insight into preventing them from escalating to disproportionate sizes.
## Why this model is useful
This model will be used to predict fire size in the event of a wildfire. This information can be valuable to policy makers in preventing death and destruction, protecting vulnerable populations, and creating emergency plans. Predicting wildfire size will also be useful for wildland firefighters, who often must allocate firefighting resources across multiple fires during the summer months. Thus, predicting fire size could help firefighters prioritize resource allocation.
## Data Overview
This dataset is available on [Kaggle](https://www.kaggle.com/capcloudcoder/us-wildfire-data-plus-other-attributes). It is a subset of a larger dataset of US wildfires. The author of this dataset took a subset of the largest wildfires, and used spatial information to extract weather and vegetation features from other data sources.
Important attributes include:
* `fire_size` - the size of the fire in acres
* `fire_mag` - magnitude of fire intensity (scaled version of `fire_size`)
* `fire_size_class` - the size of fires classified into A through G (with G being the largest fires)
* `stat_cause_descr` - the cause of the fire
* `vegetation` - code corresponding to vegetation type
* Variables starting with `temp` - average temperature (Celsius) over a specified period of time (30 days, 15 days, or 7 days, or the day the fire was contained)
* Variables starting with `wind` - average windspeed (meters per second) over a specified period of time (30 days, 15 days, or 7 days, or the day the fire was contained)
* Variables starting with `hum` - average humidity (in percent humidity) over a specified period of time (30 days, 15 days, or 7 days, or the day the fire was contained)
* Variables starting with `prec` - precipitation (mm) over a specified period of time (30 days, 15 days, or 7 days, or the day the fire was contained)
* `remoteness` - the non-dimensional distance to closest city (units not specified by data author)
The data set also contains spatial coordinates in the form of longitude and latitude, as well as information on the date of the fire and the state in which it occurred.
A complete code book is available in the project file in `archive\Wildfire_att_descripition.txt`
```{r}
#read in dataset
us_wildfire <- read_csv(here("archive", "FW_Veg_Rem_Combined.csv"))
#Following National Geographic Education's classification, states were grouped into regions
us_wildfire$region = "NorthEast"
us_wildfire[which(us_wildfire$state %in% c("AK", "WA", "OR", "CA", "NV", "ID", "MT", "WY", "UT", "CO")),"region"] = "West"
us_wildfire[which(us_wildfire$state %in% c("AZ", "NM", "TX", "OK")), "region"] = "SouthWest"
us_wildfire[which(us_wildfire$state %in% c("ND", "SD", "NE", "KS", "MN", "IA", "MO", "WI", "IL", "MI", "IN", "OH")), "region"] = "MidWest"
us_wildfire[which(us_wildfire$state %in% c("AR", "LA", "MS", "AL", "GA", "FL", "SC", "NC", "TN", "KY", "VA", "WV")), "region"] = "SouthEast"
#year was rescaled so that the earliest year (1990) maps to 0
us_wildfire$year_diff = us_wildfire$disc_pre_year - min(us_wildfire$disc_pre_year)
```
# Methods
Our methods involved cleaning the data, conducting exploratory analysis, imputing missing data, and using Principal Components Analysis to reduce the number of features.
## Cleaning Data
To clean the data we first converted all column names to lower snake case using the `clean_names()` function. Then we selected only the columns we were interested in using, removing the rest.
```{r}
us_wildfire_clean <- us_wildfire %>%
dplyr::select(fire_name:year_diff) %>% #select columns we want to use
clean_names() #convert to lowercase snake
```
We are interested in using weather to predict fire size, so we filter out observations where the weather data was not associated with the fire observation.
There are some weather observations for which every weather field is 0. This seems unlikely, especially for the temperature and humidity observations, so we will replace these zeros with NAs. Since precipitation is frequently 0, we will not replace every 0 in the precipitation columns.
Instead, we will follow the pattern of NAs seen in the other weather columns. There is a clear pattern in the missing weather data: when `temp_pre_30` is missing, so are `wind_pre_30` and `hum_pre_30`. We will assume this pattern extends to `prec_pre_30` and replace those zeros with NAs.
```{r}
#filter out observations that do not have a weather file
us_wildfire_clean <- us_wildfire_clean %>%
filter(weather_file != "File Not Found")
#replace 0 with NA
us_wildfire_clean <- us_wildfire_clean %>%
mutate_at(vars(temp_pre_30:hum_cont), ~na_if(., 0.0000000))
#when temp, wind, and humidity are missing, we suspect precipitation is as well.
#we use a case when statement to replace these zeros with NAs
us_wildfire_clean <- us_wildfire_clean %>%
mutate(prec_pre_30 = case_when(is.na(temp_pre_30) & is.na(hum_pre_30) & is.na(wind_pre_30) ~ NA_real_,
TRUE ~ prec_pre_30)) %>%
mutate(prec_pre_15 = case_when(is.na(temp_pre_15) & is.na(hum_pre_15) & is.na(wind_pre_15) ~ NA_real_,
TRUE ~ prec_pre_15)) %>%
mutate(prec_pre_7 = case_when(is.na(temp_pre_7) & is.na(hum_pre_7) & is.na(wind_pre_7) ~ NA_real_,
TRUE ~ prec_pre_7)) %>%
mutate(prec_cont = case_when(is.na(temp_cont) & is.na(hum_cont) & is.na(wind_cont) ~ NA_real_,
TRUE ~ prec_cont))
```
Information about vegetation was stored in a numeric form, where each number represented a class of vegetation. The key for the vegetation codes was in the provided metadata. For ease of interpretation, we replaced the numeric code with a short description of the vegetation type.
According to the metadata for this dataset, the vegetation variable was created by interpolating the most likely vegetation type from latitude and longitude. The most common vegetation type is listed as "Polar Desert/Rock/Ice", which seems very unlikely. Although we keep the vegetation feature in the data, we have doubts about its accuracy.
```{r}
#Here we label vegetation according to the provided code book
us_wildfire_clean <- us_wildfire_clean %>%
mutate(vegetation_classed = case_when(
vegetation == 12 ~ "Open Shrubland",
vegetation == 15 ~ "Polar Desert/Rock/Ice",
vegetation == 16 ~ "Secondary Tropical Evergreen Broadleaf Forest",
vegetation == 4 ~ "Temperate Evergreen Needleleaf Forest TmpENF",
vegetation == 9 ~ "C3 Grassland/Steppe",
vegetation == 14 ~ "Desert"
))
```
Our final data cleaning action was to clean up date columns. Since there are redundant date columns, we chose to keep only month and year as variables. We also reclassified months into seasons, so there would be fewer factor levels or dummy coding involved when it came time to prepare the data for modeling.
```{r}
#keep month and year as variables
us_wildfire_clean <- us_wildfire_clean %>%
dplyr::select(-disc_clean_date,
-disc_date_pre,
-cont_clean_date,
-disc_pre_month,
-disc_date_final,
-cont_date_final,
-putout_time)
#we also reclassify months into seasons, to reduce factor levels
us_wildfire_clean <- us_wildfire_clean %>%
mutate(season = case_when(discovery_month %in% c("Apr", "May", "Jun") ~ "Spring",
discovery_month %in% c("Jul", "Aug", "Sep") ~ "Summer",
discovery_month %in% c("Oct", "Nov", "Dec") ~ "Fall",
discovery_month %in% c("Jan", "Feb", "Mar") ~ "Winter")) %>%
select(-discovery_month)
```
## Exploratory Analysis
Our next step was to explore the data. We created several figures to investigate fire observations, as well as the pattern of missing data and the correlation between variables.
First we investigated fire size through time. This figure shows that the total acreage burned per year has been increasing almost linearly since 1990. This is a clue that `Year` might be a useful feature in our models.
```{r}
#summarise acres per year burned
acres_per_year <- us_wildfire_clean %>%
group_by(disc_pre_year) %>%
summarise(acres_burned = sum(fire_size))
#fire size (finalized graph)
ggplot(data = acres_per_year) +
geom_point(aes(x = disc_pre_year,
y = acres_burned,
size = acres_burned,
color = acres_burned)) +
scale_color_continuous(high = "firebrick", low = "goldenrod1") +
labs(x = "Year", y = "Total Acres Burned",
title = "Total acres burned per year from 1990 to 2015") +
theme_minimal() +
theme(legend.position = "none")
```
We also looked at the most common causes of fires. We found that most fires were started by debris burning, followed by arson. Yikes! We were also surprised that children were frequently listed as the cause of wildfires.
```{r}
#most common causes of fire
fire_causes <- us_wildfire_clean %>%
group_by(stat_cause_descr) %>%
count()
#cause (finalized)
ggplot(data = fire_causes, aes(y = reorder(stat_cause_descr, n), x = n)) +
geom_col(aes(fill = n)) +
scale_fill_gradient(high = "firebrick", low = "goldenrod1") +
labs(x = "Number of Fires",
y = "Cause") +
ggtitle("Number of fires per listed starting cause") +
theme_minimal() +
theme(legend.position = "none")
```
### Map of regions
Below is a map of regions. The data originally recorded the state in which each fire occurred, but following National Geographic Education's classification we reclassified states into regions to reduce the number of factor levels.
Under this classification, the US is divided into five regions. The NorthEast consists mainly of the country's oldest states, such as New York. The MidWest covers Illinois and the central states, which tend toward drier weather. The SouthEast contains many hot, humid areas, while the SouthWest, including Texas and Arizona, is hot and dry. Finally, the West has suffered many wildfires; as climate change dries the region, the West now sees more fires than before. The maps below show the uneven spatial distribution of wildfires across the US.
```{r}
us_wildfire_clean$class_fac = factor(us_wildfire_clean$fire_size_class, levels = c("A", "B", "C", "D", "E", "F", "G"))
us <- map_data("world", 'usa')
state <- map_data("state")
state$region2 = "NorthEast"
state[which(state$region %in% c("alaska", "washington", "oregon", "california", "nevada", "idaho", "montana", "utah", "wyoming", "colorado")), "region2"] = "West"
state[which(state$region %in% c("arizona", "new mexico", "oklahoma", "texas")), "region2"] = "SouthWest"
state[which(state$region %in% c("north dakota", "south dakota", "nebraska", "kansas", "minnesota", "iowa", "missouri", "wisconsin", "illinois", "indiana", "michigan", "ohio")), "region2"] = "MidWest"
state[which(state$region %in% c("arkansas", "louisiana", "mississippi", "alabama", "florida", "georgia", "south carolina", "north carolina", "tennessee", "kentucky", "virginia", "west virginia")), "region2"] = "SouthEast"
#state$region = as.factor(state$region)
ggplot(data=state, aes(x=long, y=lat, group = region)) +
geom_polygon(aes(fill=region2)) +
ggtitle("US Region")+
guides(fill=guide_legend(title="Region"))+
coord_map(projection = "sinusoidal", xlim=c(-120, -75), ylim = c(25, 50))
```
### Map of fires
Below is a map showing the spatial distribution of wildfires. The higher the fire size class, the redder the point. You can see that fires in the western US are more likely to be large fires. However, without considering the temporal scale, it is hard to detect the effect of climate change.
```{r}
ggplot() +
geom_polygon(data=state, aes(x=long, y=lat, group=group), color = "white", fill = "grey") +
geom_point(data = us_wildfire_clean, aes(x=longitude, y = latitude, color = class_fac)) +
scale_color_brewer(palette = "YlOrRd")+
ggtitle("US Wildfire Distribution")+
guides(color=guide_legend(title="Wild Fire Scale"))+
coord_map(projection = "sinusoidal", xlim=c(-120, -75), ylim = c(25, 50))
```
We then divided the wildfire events into three time periods: 1990-2000, 2000-2010, and 2010-2015. Viewed this way, wildfire risk has clearly been growing in the western US. After 2000 in particular, both the severity and the number of wildfires increased significantly, which makes machine learning research on wildfires especially relevant today.
```{r}
ggplot() +
geom_polygon(data=state, aes(x=long, y=lat, group=group), color = "white", fill = "grey") +
geom_point(data = us_wildfire_clean[which(us_wildfire_clean$disc_pre_year < 2000),], aes(x=longitude, y = latitude, color = class_fac)) +
scale_color_brewer(palette = "YlOrRd")+
ggtitle("US Wildfire Distribution before 2000")+
guides(color=guide_legend(title="Wild Fire Scale"))+
coord_map(projection = "sinusoidal", xlim=c(-120, -75), ylim = c(25, 50))
ggplot() +
geom_polygon(data=state, aes(x=long, y=lat, group=group), color = "white", fill = "grey") +
geom_point(data = us_wildfire_clean[which(us_wildfire_clean$disc_pre_year >= 2000 & us_wildfire_clean$disc_pre_year < 2010),], aes(x=longitude, y = latitude, color = class_fac)) +
scale_color_brewer(palette = "YlOrRd")+
ggtitle("US Wildfire Distribution 2000-2010")+
guides(color=guide_legend(title="Wild Fire Scale"))+
coord_map(projection = "sinusoidal", xlim=c(-120, -75), ylim = c(25, 50))
ggplot() +
geom_polygon(data=state, aes(x=long, y=lat, group=group), color = "white", fill = "grey") +
geom_point(data = us_wildfire_clean[which(us_wildfire_clean$disc_pre_year >= 2010),], aes(x=longitude, y = latitude, color = class_fac)) +
scale_color_brewer(palette = "YlOrRd")+
ggtitle("US Wildfire Distribution 2010-2015")+
guides(color=guide_legend(title="Wild Fire Scale"))+
coord_map(projection = "sinusoidal", xlim=c(-120, -75), ylim = c(25, 50))
```
### Density graph
When we plot a density graph of `fire_size` we can see that the dataset is dominated by smaller fires. To create this figure, we included only fires larger than 100 acres, so `fire_size` is even more skewed towards smaller fires than this figure shows.
```{r}
ggplot() +
geom_density(data= us_wildfire_clean[which(us_wildfire_clean$disc_pre_year <= 2000 & us_wildfire_clean$fire_size > 100),], aes(x = fire_size, y=..density..),
alpha=.3,
colour="dodgerblue", fill="dodgerblue") +
geom_density(data= us_wildfire_clean[which(us_wildfire_clean$disc_pre_year >= 2000 & us_wildfire_clean$fire_size > 100 & us_wildfire_clean$disc_pre_year < 2010),], aes(x = fire_size, y=..density..),
alpha=.3,
colour="yellow3", fill="yellow3") +
geom_density(data= us_wildfire_clean[which(us_wildfire_clean$disc_pre_year >= 2010 & us_wildfire_clean$fire_size > 100),], aes(x = fire_size, y=..density..),
alpha=.3,
colour="firebrick3", fill="firebrick3") +
xlim(10000, 100000) +
ggtitle("Wildfire Severity")
```
### Exploring Missing Data and Correlation
As part of our exploratory analysis, we looked at the pattern of missing data. This figure shows missing data in red. Most missing data is concentrated in the weather columns, and the name of the fire is also often missing.
```{r}
#missing data plot
aggr_plot <- aggr(us_wildfire_clean,
col = c('yellow','red'),
numbers = TRUE,
sortVars = TRUE,
labels = names(us_wildfire_clean),
cex.axis = .7,
gap = 2,
ylab = c("Histogram of missing data","Pattern"))
```
We hypothesized that weather variables were correlated. To test this out, we created a correlation matrix.
This figure shows a strong correlation within each set of variables, e.g. the 7-day average temperature is correlated with the 30-day average temperature. This is not surprising!
```{r}
#create a dataframe of weather
weather <- us_wildfire_clean %>%
dplyr::select(temp_pre_30:prec_cont)
#create a correlation matrix (omitting NAs)
cor_matrix <- cor(weather, use = "complete.obs")
#create a visualization
corrplot(cor_matrix, type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45)
```
## Model Preparation
After exploring our data, we were ready to prepare it for modeling. We first split the data into training and test sets, then performed some data manipulations separately on each.
### Splitting Data
Before we split the data, we got rid of even more variables. Some we used in exploratory data analysis but decided not to use in modeling, and some were just not useful. Then we split our data into 80% training and 20% testing data. Because fire size is heavily skewed towards smaller fires, we used stratified sampling.
There are 32903 observations in the training set and 8229 observations in the test set.
```{r}
#first we make a data frame containing only variables we want to use in our models
fire_modeling_df <- us_wildfire_clean %>%
dplyr::select(-fire_name, #remove fire name
-fire_size_class, #remove fire size class
-class_fac, #which is also a size class
-state, #because we have regions
-disc_pre_year, #because we have year_diff to represent year
-wstation_usaf, #remove weather station name
-wstation_wban, #remove this (not described in codebook)
-wstation_byear, #remove station year installed
-wstation_eyear, #remove station year ended data recording
-weather_file, #remove name of weather file
-dstation_m, #remove distance of weather station to fire
-vegetation #remove vegetation because we already have it classed
)
#define split parameters
us_wildfire_split <- fire_modeling_df %>%
initial_split(prop = 0.8, strata = fire_size)
#write split data to data frames
fire_train_split <- training(us_wildfire_split)
fire_test_split <- testing(us_wildfire_split)
#set up folds in training data for cross validation
train_folds <- vfold_cv(fire_train_split, v = 10, repeats = 5)
```
### Imputing Missing Data
Next we impute (replace) the missing data in our weather columns. We use predictive mean matching (PMM), which predicts the value of the missing variable using regression, then randomly draws a similar observed value from a pool of candidate donors found in the data. We use `m = 5` and `maxit = 5` to limit computing time; increasing the number of multiple imputations via the `m` argument or the number of iterations via the `maxit` argument could improve accuracy.
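The mechanics of PMM can be sketched by hand in base R: fit a regression on the complete cases, then for each missing value sample an observed value from the donors whose predicted means are closest. This is a toy illustration with made-up numbers; `mice` handles all of this (plus multiple imputations and iteration) internally.

```r
set.seed(500)
# toy data: a predictor x and an outcome y with two missing values
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(2.1, 4.3, 5.9, 8.2, NA, 12.1, 13.8, NA, 18.2, 20.1)

obs <- !is.na(y)
# regression fit on the complete cases only
fit <- lm(y ~ x, data = data.frame(x = x[obs], y = y[obs]))

# predicted means for every case, observed and missing
pred_all <- predict(fit, newdata = data.frame(x = x))

# for each missing case, build a donor pool of the 3 observed cases whose
# predicted means are closest, then sample an *observed* value from the pool
y_imp <- y
for (i in which(!obs)) {
  donors <- order(abs(pred_all[obs] - pred_all[i]))[1:3]
  y_imp[i] <- sample(y[obs][donors], 1)
}
y_imp
```

Because imputed values are drawn from actually observed values, PMM never produces impossible values (e.g. negative precipitation), which is one reason it is a sensible default here.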
```{r, results='hide'}
#select weather cols
weather_train <- fire_train_split %>%
select(temp_pre_30:prec_cont)
weather_test <- fire_test_split %>%
select(temp_pre_30:prec_cont)
#select cols not in weather
notweather_train <- fire_train_split %>%
select(-colnames(weather_train))
notweather_test <- fire_test_split %>%
select(-colnames(weather_test))
#imputation of weather data
weather_train_imputed <- mice(weather_train, m=5, maxit=5, meth='pmm', seed=500) %>%
complete()
weather_test_imputed <- mice(weather_test, m=5, maxit=5, meth='pmm', seed=500) %>%
complete()
```
### Principal Components
Since we know the weather columns are strongly correlated with each other, we use PCA to reduce them to 4 uncorrelated principal components. We chose 4 components because we suspect there are four underlying factors in the data: temperature, humidity, windspeed, and precipitation. Once the weather columns are converted into 4 PCs, we bind the PCs back to the rest of the data.
Below is a PCA biplot. It shows that the temperature, wind, and humidity variables each cluster closely with the other variables of their type. We can also see that precipitation is closely correlated with humidity, which makes sense!
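Our choice of 4 components rests on a hypothesis about the underlying factors; a quick way to sanity-check such a choice is to inspect the cumulative proportion of variance explained. A minimal sketch on simulated data (in the actual workflow one would call `summary()` on `weather_train_PCA`):

```r
set.seed(123)
# simulate 6 weather-like columns, two of them strongly correlated (toy data)
toy <- matrix(rnorm(100 * 6), ncol = 6)
toy[, 2] <- toy[, 1] + rnorm(100, sd = 0.1)
pca <- prcomp(toy, center = TRUE, scale. = TRUE)
# cumulative variance explained by each successive component
summary(pca)$importance["Cumulative Proportion", ]
```

If the first 4 components capture most of the cumulative variance, dropping the remaining components loses little information.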
```{r}
#Do PCA on both test and train data separately
weather_train_PCA <- weather_train_imputed %>%
scale(center = T, scale = T) %>% #first scale and center the data
prcomp(rank. = 4) #do PCA
#Do PCA on both test and train data separately
weather_test_PCA <- weather_test_imputed %>%
scale(center = T, scale = T) %>% #first scale and center the data
prcomp(rank. = 4)
#Make a PCA bi-plot
autoplot(weather_train_PCA,
data = weather_train,
loadings = TRUE,
loadings.label = TRUE,
loadings.label.hjust = 1.2) +
theme_classic() +
labs(caption = "Principal Component Analysis biplot of weather training data")
#put data back together
#bind imputed weather data to rest of rows
fire_train <- cbind(notweather_train, weather_train_PCA$x) %>%
na.omit()
fire_test <- cbind(notweather_test, weather_test_PCA$x) %>%
na.omit()
#also saving test and training data as "fire_train_complete" and "fire_test_complete" so that models we made earlier still run
fire_train_complete <- fire_train
fire_test_complete <- fire_test
```
## Modeling
### Multiple Regression Analysis
First, we applied multiple linear regression models to the dataset. The performance of multiple linear regression will show why we need to apply machine learning approaches. We started with all of the variables, including the principal components, then removed less relevant variables to find the optimal multiple linear regression model.
```{r}
lm_whole = lm(fire_size ~ PC1 + PC2 + PC3 + PC4 + year_diff + remoteness + vegetation_classed + stat_cause_descr, data = fire_train_complete)
summary(lm_whole) %>% pander
lm_whole2 <- update(lm_whole, . ~ . - stat_cause_descr)
summary(lm_whole2) %>% pander
```
When we compare the whole-region regression results, the updated model is superior to the original.
We also compared each region's multiple regression results after selecting optimal variables. For the SouthWest region, the cause of ignition, the second principal component, and remoteness are not important independent variables.
```{r}
lm_sw = lm(fire_size ~ PC1 + PC2 + PC3 + PC4 + year_diff + remoteness + vegetation_classed + stat_cause_descr, data = fire_train_complete[which(fire_train_complete$region == "SouthWest"),])
#summary(lm_sw) %>% pander
lm_sw2 <- update(lm_sw, . ~ . - stat_cause_descr - PC2 - remoteness)
summary(lm_sw2) %>% pander
```
Similarly, for the SouthEast region, the cause of ignition, the first and fourth principal components, year, and vegetation are not significant independent variables, so we removed those variables.
```{r}
lm_se = lm(fire_size ~ PC1 + PC2 + PC3 + PC4 + year_diff + remoteness + vegetation_classed + stat_cause_descr, data = fire_train_complete[which(fire_train_complete$region == "SouthEast"),])
lm_se2 <- update(lm_se, . ~ . - stat_cause_descr - PC1 - PC4- year_diff - vegetation_classed)
summary(lm_se2) %>% pander
```
For the MidWest region, the cause of ignition, year, and the second and third principal components are irrelevant independent variables; the resulting model is shown below.
```{r}
lm_mw = lm(fire_size ~ PC1 + PC2 + PC3 + PC4 + year_diff + remoteness + vegetation_classed + stat_cause_descr, data = fire_train_complete[which(fire_train_complete$region == "MidWest"),])
lm_mw2 <- update(lm_mw, . ~ . - stat_cause_descr - PC3 - year_diff - PC2)
summary(lm_mw2) %>% pander
```
For the West region, the cause of ignition and vegetation type were not important, and the third and fourth principal components were not significant, so we removed those variables to obtain the final model.
```{r}
lm_w = lm(fire_size ~ PC1 + PC2 + PC3 + PC4 + year_diff + remoteness + vegetation_classed + stat_cause_descr, data = fire_train_complete[which(fire_train_complete$region == "West"),])
lm_w2 <- update(lm_w, . ~ . - stat_cause_descr - PC3 - PC4 - vegetation_classed)
summary(lm_w2) %>% pander
```
Finally, the NorthEast region has fewer significant independent variables than the other regions. The second, third, and fourth principal components, year, ignition cause, and vegetation type are all non-significant, yielding the model below.
```{r}
lm_ne = lm(fire_size ~ PC1 + PC2 + PC3 + PC4 + year_diff + remoteness + vegetation_classed + stat_cause_descr, data = fire_train_complete[which(fire_train_complete$region == "NorthEast"),])
lm_ne2 <- update(lm_ne, . ~ . - stat_cause_descr - vegetation_classed - year_diff - PC4 - PC3 - PC2)
summary(lm_ne2) %>% pander
```
The whole-US model's test RMSE is 9682.568; we use it as the baseline for comparing model performance, checked with RMSE throughout. Note that a region's optimal model sometimes has a larger RMSE than the whole-US model.
```{r}
#make predictions on test data
whole_pred = predict(lm_whole, newdata = fire_test_complete)
#calculate test MSE
whole_test_MSE <- mean((fire_test_complete$fire_size - whole_pred)^2)
#calculate and save test RMSE
whole_test_RMSE = sqrt(whole_test_MSE)
#repeat steps for each region
sw_pred = predict(lm_sw2, newdata = fire_test_complete[which(fire_test_complete$region == "SouthWest"),])
sw_test_MSE <- mean((fire_test_complete$fire_size[which(fire_test_complete$region == "SouthWest")] - sw_pred)^2)
#calculate and save test RMSE
sw_test_RMSE = sqrt(sw_test_MSE)
se_pred = predict(lm_se2, newdata = fire_test_complete[which(fire_test_complete$region == "SouthEast"),])
se_test_MSE <- mean((fire_test_complete$fire_size[which(fire_test_complete$region == "SouthEast")] - se_pred)^2)
#calculate and save test RMSE
se_test_RMSE = sqrt(se_test_MSE)
mw_pred = predict(lm_mw2, newdata = fire_test_complete[which(fire_test_complete$region == "MidWest"),])
mw_test_MSE <- mean((fire_test_complete$fire_size[which(fire_test_complete$region == "MidWest")] - mw_pred)^2)
#calculate and save test RMSE
mw_test_RMSE = sqrt(mw_test_MSE)
w_pred = predict(lm_w2, newdata = fire_test_complete[which(fire_test_complete$region == "West"),])
w_test_MSE <- mean((fire_test_complete$fire_size[which(fire_test_complete$region == "West")] - w_pred)^2)
#calculate and save test RMSE
w_test_RMSE = sqrt(w_test_MSE)
ne_pred = predict(lm_ne2, newdata = fire_test_complete[which(fire_test_complete$region == "NorthEast"),])
ne_test_MSE <- mean((fire_test_complete$fire_size[which(fire_test_complete$region == "NorthEast")] - ne_pred)^2)
#calculate and save test RMSE
ne_test_RMSE = sqrt(ne_test_MSE)
table_lm = data.frame(Region = c("Whole US", "North East", "Mid West", "West", "South West", "South East"),
RMSE = c(whole_test_RMSE, ne_test_RMSE, mw_test_RMSE, w_test_RMSE, sw_test_RMSE, se_test_RMSE),
NumberofObservation = c(length(whole_pred),length(ne_pred),length(mw_pred),length(w_pred),length(sw_pred),length(se_pred)))
kable(table_lm, caption = "Linear Model's RMSE")
```
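As a side note, the repeated predict-then-RMSE pattern above could be collapsed into a small helper. A sketch (the function name `region_rmse` is hypothetical; it assumes the objects created above):

```{r, eval = FALSE}
#hypothetical helper: compute the test RMSE for one region's fitted model
region_rmse <- function(model, region) {
  test_sub <- fire_test_complete[fire_test_complete$region == region, ]
  pred <- predict(model, newdata = test_sub)
  sqrt(mean((test_sub$fire_size - pred)^2))
}
#e.g. ne_test_RMSE <- region_rmse(lm_ne2, "NorthEast")
```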
### K Nearest Neighbor
KNN (short for K Nearest Neighbors) is a non-linear algorithm that works for both classification and regression problems. The basic premise behind the model is that it predicts the dependent variable for an observation based on how similar that observation is to others where the dependent variable is known.
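As a minimal illustration of the idea (a toy sketch, not the implementation used below), a KNN regression prediction at a single new point is just the average outcome of its k nearest training points:

```{r}
#toy sketch of KNN regression: predict y at a new point x0 as the
#mean outcome of its k nearest training points (Euclidean distance)
knn_predict_one <- function(X, y, x0, k = 3) {
  dists <- sqrt(rowSums(sweep(X, 2, x0)^2)) #distance from x0 to each row of X
  mean(y[order(dists)[1:k]])                #average the k closest outcomes
}
X_toy <- matrix(c(0, 0, 1, 1, 5, 5), ncol = 2, byrow = TRUE)
knn_predict_one(X_toy, y = c(10, 12, 50), x0 = c(0.5, 0.5), k = 2) #returns 11
```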
KNN works by calculating the Euclidean distance between observations, so all inputs must be numeric. We therefore need some pre-model processing of our training and test sets: for both, we dummy code the categorical variables, then center and scale the data. As before, we assume that the training and test data are completely separate, so we process the two data sets separately.
```{r}
#first we dummy code the categorical variables
knn_ready_train <- dummy_cols(fire_train, select_columns =
c("stat_cause_descr", "region", "vegetation_classed", "season")) %>%
#then drop the original (non-dummy-coded) versions
select(-stat_cause_descr, - region, - vegetation_classed, -season) %>%
#then make sure the result is a plain data frame
as.data.frame()
#then we center and scale the data, except our outcome
knn_ready_train[,-1] <- scale(knn_ready_train[,-1], center = T, scale = T)
#of course our next step is to do the same to the test data
knn_ready_test <- dummy_cols(fire_test,
select_columns = c("stat_cause_descr", "region", "vegetation_classed", "season")) %>%
select(-stat_cause_descr, - region, - vegetation_classed, -season) %>%
as.data.frame()
knn_ready_test[,-1] <- scale(knn_ready_test[,-1], center = T, scale = T)
```
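One alternative worth noting (not what we do here) is to scale the test set with the training set's means and standard deviations, so both sets are expressed on the same scale. A sketch, assuming the objects created above:

```{r, eval = FALSE}
#sketch: reuse the training set's centering and scaling statistics on the test set
scaled_train <- scale(knn_ready_train[,-1], center = TRUE, scale = TRUE)
knn_ready_test[,-1] <- scale(knn_ready_test[,-1],
                             center = attr(scaled_train, "scaled:center"),
                             scale = attr(scaled_train, "scaled:scale"))
```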
Next we split up the training and test data into a data frame that has only independent variables and a data frame that has only dependent variables (in this case `fire_size`). We do this for both the test and the training data.
```{r}
#YTrain is the true values for fire size on the training set
YTrain = knn_ready_train$fire_size
#XTrain is the design matrix for training data
XTrain = knn_ready_train %>%
select(-fire_size)
#YTest is the true value for fire_size on the test set
YTest = knn_ready_test$fire_size
#Xtest is the design matrix for test data
XTest = knn_ready_test %>%
select(-fire_size)
```
Then we use leave-one-out cross-validation (LOOCV) to determine the best number of neighbors to consider. A low number of neighbors yields a highly flexible model, while a higher number yields a less flexible one. For this process we built a function that takes a starting value of K, an ending value of K, and the sampling interval, and returns a data frame of MSE values for each value of K tested. Because this process is computationally intensive, the function saves its results to a CSV.
```{r, eval = F}
#make a function that saves KNN LOOCV results for different values of K
knn_loocv <- function(startk, endk, interval)
{
#set up possible number of nearest neighbors to be considered
allK = seq(from = startk, to = endk, by = interval)
#create a vector of the same length to save the MSE in later
k_mse = rep(NA, length(allK))
#for each number in allK, use LOOCV to find a validation error
for (i in allK){
#loop through different number of neighbors
#predict on the left-out validation set
pred.Yval = knn.cv(train = XTrain, cl = YTrain, k = i)
#find the mse for each value of k (knn.cv returns a factor, so convert back via character)
k_mse[i] = mean((as.numeric(as.character(pred.Yval)) - YTrain)^2)
}
#save as a data frame and filter out NAs (caused by skipping values of k if interval is larger than 1)
k_mse <- as.data.frame(k_mse) %>%
filter(!is.na(k_mse))
#bind with k value
knn_loocv_results <- cbind(as.data.frame(allK), k_mse)
#save MSE as CSV (because the cross validation process takes a long time)
write_csv(knn_loocv_results, paste0("model_results/knn/knn_mse_k", startk,"_k", endk, "_by", interval, ".csv"))
}
#we tried several different sets of k
knn_loocv(startk = 1, endk = 20, interval = 1)
knn_loocv(startk = 1, endk = 500, interval = 20)
```
Next, we go through our results for all the values of K that we tested. We find that the MSE increases as K increases.
```{r}
#read in all the stored k values
knn_mse_k1_k500_by20 <- read_csv(here("model_results", "knn", "knn_mse_k1_k500_by20.csv"))
knn_mse_k1_k20_by1 <- read_csv(here("model_results", "knn", "knn_mse_k1_k20_by1.csv"))
#plot MSE for values of k
plot(knn_mse_k1_k500_by20$allK, knn_mse_k1_k500_by20$k_mse, type = "l", xlab = "k", ylab = "MSE", main = "MSE for K 1 through 500 at intervals of 20")
```
When we look at K 1-20 we get the lowest MSE at 1 before it starts increasing.
```{r}
plot(knn_mse_k1_k20_by1$allK, knn_mse_k1_k20_by1$k_mse, type = "l", xlab = "k", ylab = "MSE", main = "MSE for K 1 through 20 at intervals of 1")
```
Just to confirm, we will print out the best number of neighbors and the associated MSE and RMSE.
```{r}
#best number of neighbors
numneighbor = max(knn_mse_k1_k20_by1$allK[knn_mse_k1_k20_by1$k_mse == min(knn_mse_k1_k20_by1$k_mse)])
#print training mse
training_mse <- knn_mse_k1_k20_by1 %>%
filter(allK == numneighbor) %>%
mutate(RMSE = sqrt(k_mse)) %>%
rename(K = allK,
MSE = k_mse)
training_mse
```
Now that we've found the best number of nearest neighbors to consider, we try out KNN on our test data! Below are the test MSE and the test RMSE. Interestingly, the test RMSE is lower than the training RMSE. Although we used stratified sampling, random noise may create differences in variance between the test and training sets.
```{r}
#fit KNN with the best neighbor value and predict on the test data
pred.YTest = knn(train = XTrain, test = XTest, cl = YTrain, k = numneighbor)
#calculate test MSE (knn returns a factor, so convert back via character)
test_MSE <- mean((as.numeric(as.character(pred.YTest)) - YTest)^2)
#print test MSE
test_MSE
#print test RMSE
knn_rmse = sqrt(test_MSE)
knn_rmse
```
Overall, this model performs okay.
### Boosted Tree
Gradient boosted decision trees partition the data by asking questions about it (creating single trees) and combine the individual trees using the boosting method. Boosting builds a strong learner from many sequentially connected weak learners, which here are single decision trees. The method iteratively adds trees: each new tree is fit to minimize the errors of the current ensemble. This sequential addition makes boosting learn slowly, but models that learn slowly can perform better in terms of prediction. Gradient boosted trees can be used for both regression and classification; here we perform a regression to predict fire size. Since this is a regression, we use the MSE of the residuals to judge how well the method performs compared to the others. We again use the training data to train the model and the test data to assess how it performs on new observations.
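The iterative idea can be sketched for squared-error loss: each new shallow tree is fit to the residuals of the current ensemble, and its shrunken predictions are added to the running fit. A toy version using `rpart` trees (an assumption for illustration; the model below uses `gbm`):

```{r, eval = FALSE}
library(rpart)
#toy gradient boosting for squared-error loss with a single predictor x
boost_sketch <- function(x, y, n_trees = 100, shrinkage = 0.1) {
  fit <- rep(mean(y), length(y)) #start from the mean
  for (b in seq_len(n_trees)) {
    resid <- y - fit #residuals are the negative gradient of squared error
    tree <- rpart(resid ~ x, data = data.frame(x = x, resid = resid),
                  control = rpart.control(maxdepth = 2))
    fit <- fit + shrinkage * predict(tree) #slow, additive update
  }
  fit
}
```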
```{r,cache = TRUE}
#train boosted tree model
set.seed(123)
fire_train <- as.data.frame(unclass(fire_train), # convert character columns to factors
stringsAsFactors = T)
fire_test <- as.data.frame(unclass(fire_test),
stringsAsFactors = T)
fire_size_boost = gbm(fire_size~.,
data = fire_train,
distribution = "gaussian", #squared-error loss for regression
n.trees = 1000,
shrinkage = 0.01,
interaction.depth = 4
)
```
The 4 most prominent predictors of fire size are the magnitude of the fire, PC2, remoteness (the non-dimensional distance to the closest city), and longitude. The RMSE on the training set is much lower than the RMSE on the test set, indicating overfitting.
```{r plots and summary for BT}
#the model summary creates a pretty visualization
summary(fire_size_boost, cBars = 10,normalize = T)
```
The following plots show the relationship between the variables with the highest relative influence (x axis) and the fitted function $f(x)$ (y axis). There is a positive relationship between fire magnitude and the response, fire size. For the other variables, the relationship is less clear.
```{r plots}
plot(fire_size_boost,i="fire_mag")
plot(fire_size_boost,i="remoteness")
plot(fire_size_boost,i="longitude")
```
Now, let us check the RMSE for the training and test datasets, respectively.
```{r training error}
#calculate training error
#predict values using training data
predictions_train_boost <- predict(fire_size_boost, newdata = fire_train, n.trees = 1000)
#calculate rmse for training data
RMSE(predictions_train_boost, fire_train$fire_size)
#calculate test error
#predict values using test data
predictions_test_boost <- predict(fire_size_boost, newdata = fire_test, n.trees = 1000)
#calculate rmse for test data
bt_rmse = RMSE(predictions_test_boost, fire_test$fire_size)
bt_rmse
```
Next, we calculate the test error as a function of the number of trees, from 100 to 1000 in steps of 10, and build a prediction matrix whose columns hold the predictions for each number of trees.
```{r compare with rf}
###compute test error as a function of number of trees
n.trees = seq(from=100 ,to=1000, by=10) #vector containing different number of trees
```
The dimension of the prediction matrix is
```{r}
#create a prediction matrix with one column per number of trees
predmatrix<-predict(fire_size_boost, newdata = fire_test, n.trees = n.trees)
dim(predmatrix) #dimensions of the prediction matrix
```
Below is a sample of the test error (averaged over observations) for each number of trees considered.
```{r}
#calculate the test MSE for each number of trees
test.error<-with(fire_test,apply( (predmatrix-fire_size)^2,2,mean))
head(test.error) #mean squared test error, averaged over observations, per number of trees
```
To check the performance of boosting on the test set, we plot the test MSE for the different numbers of trees. The figure shows that the test error increases as the number of trees increases, which illustrates the sensitivity of gradient boosting to the number of trees and indicates that this model is not improving with more trees, since the test error should decrease (or plateau) as trees are added.
```{r}
#Plotting the test error vs number of trees
plot(n.trees , test.error , pch=19,col="blue",xlab="Number of Trees",ylab="Test Error", main = "Performance of Boosting on Test Set")
```
### Random Forest
Random forest is also a supervised method for regression and classification tasks. It builds single trees on random bootstrap resamples of the data, a sampling method also called _bagging_. It trains each tree independently and combines them by averaging the trees' predictions. Because every tree is trained separately, the method rarely overfits as the number of trees grows, although larger forests are computationally slower. We fit a random forest to the same data used for the boosted trees.
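The bagging idea behind it can be sketched with `rpart` trees (an assumption for illustration; the model below uses `randomForest`): fit each tree on a bootstrap resample, then average the trees' predictions. A true random forest additionally samples `mtry` candidate variables at each split:

```{r, eval = FALSE}
library(rpart)
#toy bagging: fit each tree on a bootstrap resample and average the predictions
bag_sketch <- function(df, formula, newdata, n_trees = 100) {
  preds <- replicate(n_trees, {
    boot <- df[sample(nrow(df), replace = TRUE), ] #bootstrap resample
    predict(rpart(formula, data = boot), newdata = newdata)
  })
  rowMeans(preds) #average over trees
}
```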
```{r, cache = TRUE}
#train model
set.seed(123)
fire_size_rf = randomForest(fire_size ~ ., #writing the formula
data = fire_train, #specifying training data to be used
mtry = 9, #setting number of variables to randomly sample per each split
ntree= 500, #setting number of trees
#regression is inferred automatically from the numeric response
na.action = na.omit, #specifying what to do with NAs
importance = TRUE #specifying importance of variables should be assessed
)
```
9 variables were considered at each split and 500 trees were used to fit the data.
```{r plot RF}
print(fire_size_rf)
```
The error decreases as the number of trees increases.
```{r}
#plot error
plot(fire_size_rf)
```
We also check the variable importance for fire size: the variables that best explain fire size are latitude, fire magnitude, PC3, PC4, and year. The boosted tree and random forest models agree on one variable that is important for explaining fire size: fire magnitude.
```{r}
#check the importance
fire_size_rf$importance
#plot variable importance
varImpPlot(fire_size_rf,
sort = T,
main = "Variable Importance for fire size random forest model",
n.var = 5)
```
Next, we compute the RMSE on the train and test datasets, respectively.
```{r}
#calculate training error
#predict values using training data
predictions_train_rf <- predict(fire_size_rf, newdata = fire_train)
#calculate rmse for training data
RMSE(predictions_train_rf, fire_train$fire_size)
#calculate test error
#predict values using test data
predictions_test_rf <- predict(fire_size_rf, newdata = fire_test)
#calculate rmse for test data
rf_rmse = RMSE(predictions_test_rf, fire_test$fire_size)
rf_rmse
```
We find that the test RMSE is slightly higher than for the boosted tree model. Lastly, we compare the test error of the boosted tree model at different numbers of trees with the minimum error of the random forest model, and conclude that boosting performs better than the random forest.
```{r plot RF vs BT}
plot(n.trees , test.error , pch=19,col="blue",xlab="Number of Trees",ylab="Test Error", main = "Performance of Boosting on Test Set")
#adding the RandomForests Minimum Error line trained on same data and similar parameters
abline(h = (RMSE(predictions_test_rf, fire_test$fire_size))^2,col="red") # test error of a Random forest fitted on same data
legend("topright",c("Minimum Test error Line for Random Forests"),col="red",lty=1,lwd=1)
```
# Results and Conclusions
According to the analysis, the approaches can be compared by root mean squared error (RMSE). The table shows that the K Nearest Neighbors (KNN) approach is the best of the four. All RMSEs were calculated on the test data set.
KNN's RMSE is lower than that of the multiple regression using principal components, so KNN is the best approach for predicting wildfire size from the other variables. Surprisingly, the multiple linear model performs second best, suggesting the data may have a linear component.
```{r}
table_approach = data.frame(Method = c("Multiple Regression", "K Nearest Neighbor", "Boosted Tree", "Random Forest"),
RMSE = c(whole_test_RMSE, knn_rmse, bt_rmse, rf_rmse))
kable(table_approach, caption = "RMSE Comparison Table")
```
Wildfires have intensified over the last couple of decades in the United States. Wildfires affect society in multiple ways: they damage structures, generate large losses, and harm vulnerable populations, especially when fires escalate out of proportion. Predicting the size of a fire can therefore help prevent catastrophes. With that purpose, we tried different methods to predict the size of a fire when a wildfire occurs: linear regression with dimensionality reduction through PCA, KNN regression, random forest regression, and boosted trees. We find that the RMSE on the test data is lowest for KNN regression, meaning KNN is the method with the best predictive power. Between random forest and boosted trees, boosted trees performed slightly better.
None of the models are especially accurate. It is possible that more accurate vegetation data could be used to improve our ability to predict wildfire size.
# References