---
title: "Fertilizer advice for farmers growning maize in Nigeria"
author: "Ivana ALEKSOVSKA"
date: "2024-06-09"
output: pdf_document
header-includes:
\usepackage{fvextra}
\DefineVerbatimEnvironment{Highlighting}{Verbatim}{breaklines,commandchars=\\\{\}}
---
# Carob
<img align="right" width="350" height="350" src="https://github.com/reagro/carob/raw/master/img/carob.png">
*Carob* creates reproducible workflows that standardize primary agricultural research data from experiments and surveys. Standardization includes the use of a common file format, variable names, units and accepted values according to the [terminag](https://github.com/reagro/terminag) standard. Standardized data sets are aggregated into larger collections that can be used in further research. We do this by writing an *R* script for each individual dataset. See the [website](https://carob-data.org) for more information.
Carob is an open-access *Extract, Transform, and Load* (ETL) framework, supported by [CGIAR](https://www.cgiar.org/initiative/excellence-in-agronomy/), that enables predictive analytics (machine learning, artificial intelligence) and other types of data analysis.
### Get the data
Compiled versions of the dataset can be downloaded from [carob-data.org](http://carob-data.org) and some will eventually be made available on the [carob dataverse](https://dataverse.harvard.edu/dataverse/carob/).
### Purpose
The purpose of this study is to develop tailored and climate-smart fertilizer advice for maize growers in Nigeria based on the available data, and to provide a decision support tool that advises farmers on fertilizer use. To that end, we consider the yield response to N, P and K in the data and the spatial variation in soil and weather in the region to predict site-specific N, P and K rates.
Below we explain how we go about meeting this challenge, step by step.
```{r setup, warning=FALSE, message=FALSE, fig.align="center", tidy=TRUE }
#############################################
# MAIN SCRIPT TO DESIGN AN ANALYTICAL WORKFLOW (descriptive, diagnostic, predictive and prescriptive analysis)
# Ivana Aleksovska
#The four most fundamental types of data analytics are:
# - Descriptive Analytics: describes events or trends over time, such as determining whether heatwave days have increased or decreased during the past 30 years.
# - Diagnostic Analytics: unravels the causes behind specific events, such as investigating the factors leading to heavy rainfall in Auckland, New Zealand, in January 2023. This requires formulating hypotheses and working with a diverse dataset to gain insight into the root causes.
# - Predictive Analytics: aims to anticipate events that are likely to occur in the future, for example how heatwave days will change by 2050 under climate change. In predictive analysis, data analysts leverage insights from previous data. Machine learning, deep learning, and other AI technologies are extensively applied here; some approaches depend heavily on numerical models such as climate change projections.
# - Prescriptive Analytics: identifies the most effective strategy for implementing a decision that has been reached, i.e. it proposes a course of action. For instance, if global climate models project a very high probability of rising sea levels in a coastal area, an adaptation strategy could involve designing a seawall.
# Other analytics types include quantitative and qualitative analysis, exploratory analysis (mentioned under data preparation), statistical analysis, etc. Note that the selection of data analysis methods is primarily determined by the questions that need addressing.
#############################################
# This part contains the packages and the libraries that will be used in the following
# Packages for machine learning and data analysis
# install.packages("caret")
# install.packages("factoextra")
# install.packages("Metrics")
# Packages for Rmarkdown generator
# install.packages('knitr')
# install.packages("/Users/moia/Downloads/knitr_1.47.tar.gz", repos = NULL, type="source")
library(tidyverse)
library(caret)
library(party)
library(factoextra)
library(corrplot)
library(gridExtra)
library(grid)
library(ggplot2)
library(lattice)
library(Metrics)
# ---------------------------------------------------------------------------------------
# 1. DEFINE THE QUESTION
# Develop tailored and climate-smart fertilizer advice for maize growers in Nigeria.
# Consider the yield response to N,P and K in the data and the spatial variation in
# soil and weather in the region to predict site specific N, P and K rates.
# ---------------------------------------------------------------------------------------
# 2. DOWNLOAD AND PREPARE THE DATA FOR ML
# I) First get the data
# Follow the steps explained here to get the data
# https://carob-data.org/compile.html
# 1) online
#install.packages("remotes")
#remotes::install_github("reagro/carobiner", force = TRUE)
#remotes::install_github("reagro/carobiner")
#ff <- carobiner::make_carob("/Users/moia/Desktop/github/carob")
# carobiner::update_terms()
# 2) or zip format
# install.packages("/Users/moia/Downloads/geodata_0.5-9.tar.gz", repos = NULL, type="source")
# install.packages("/Users/moia/Desktop/github/carob/carobiner", repos = NULL, type="source")
# LOAD THE DATA IN THE RSTUDIO WORKING ENVIRONMENT
carob_fertilizer_terms <- readr::read_csv("data/compiled/carob_fertilizer_terms.csv",show_col_types = FALSE)
carob_fertilizer_metadata <- readr::read_csv("data/compiled/carob_fertilizer_metadata.csv",show_col_types = FALSE)
carob_fertilizer <- readr::read_csv("data/compiled/carob_fertilizer.csv",show_col_types = FALSE)
# SELECT THE DATA OF MAIZE THAT CORRESPONDS TO THE REGION OF NIGERIA
colnames(carob_fertilizer) # see all columns available in the data frame
carob_fertilizer<- carob_fertilizer[ which(carob_fertilizer$country=="Nigeria" & carob_fertilizer$crop =="maize") , ]
# ---------------------------------------------------------------------------------------
# 3. DATA ANALYSIS (descriptive statistic and visualisation)
# Descriptive analysis describes events or trends over time.
# a) Perform a comprehensive exploratory data analysis to summarize the main characteristics of the data.
# b) Use appropriate visualizations to illustrate your findings.
# In this context we consider the yield response to N, P and K in the data and the spatial variation in soil and weather in the region to predict site-specific N, P and K rates.
# In short, fertilizers are labeled N, P or K to indicate their nutrient content in terms of nitrogen (N), phosphorus (P), and potassium (K). All three are important for plant growth.
# N_fertilizer: numeric (kg/ha) N applied in inorganic fertilizer
# P_fertilizer: numeric (kg/ha) P applied in inorganic fertilizer (as P, not P2O5)
# K_fertilizer: numeric (kg/ha) K applied in inorganic fertilizer (as K, not K2O)
# yield (kg/ha)
# ------
# Identify the important predictors X that influence the yield
predictors <- c('N_fertilizer', 'P_fertilizer', 'K_fertilizer', 'yield')
# From carob_fertilizer, select only the columns that will be used in the ML algorithms
carob_fertilizer_ML <- carob_fertilizer[predictors]
# First remove rows with NA (to avoid missing values in the statistics), then drop duplicate rows (these usually come from saving errors and would bias the analysis through double-counting)
carob_fertilizer_ML <- na.omit(carob_fertilizer_ML)
carob_fertilizer_ML <- carob_fertilizer_ML[!duplicated(carob_fertilizer_ML), ]
carob_fertilizer_ML <- as.data.frame(carob_fertilizer_ML) # convert to a plain data frame
# ------
# Visual inspection / descriptive statistics
# Let's look at the main statistics of the data: we are interested in the mean, min and max values, as well as some quantiles.
summary(carob_fertilizer_ML)
# Histograms of all variables in our data set: the target, labeled “yield”, and the predictors, the N, P and K fertilizers. These show the univariate distributions, one variable at a time. We can visually check the type of distribution, e.g. whether it is normal or not, and whether the variables need some form of normalization or discretization. All of these variables are continuous.
# See the histograms for every variable (column)
carob_fertilizer_ML %>% gather() %>%
ggplot(aes(x=value)) +
geom_histogram(fill="steelblue", alpha=.7) +
theme_minimal() +
facet_wrap(~key, scales="free")
# Let’s move on to bivariate statistics. We are plotting a correlation matrix, in order to:
# a) check if we have predictors that are highly correlated (which is problematic for some algorithms),
# b) get a first feeling about which predictors are correlated with the target.
# Strong correlation ~1.
corr_matrix <- cor(carob_fertilizer_ML %>% keep(is.numeric))
corr_matrix %>% as.data.frame %>% mutate(var2=rownames(.)) %>%
pivot_longer(!var2, values_to = "value") %>%
ggplot(aes(x=name,y=var2,fill=abs(value),label=round(value,2))) +
geom_tile() + geom_label() + xlab("") + ylab("") +
ggtitle("Correlation matrix of our predictors") +
labs(fill="Correlation\n(absolute):")
#Let's also visualize the pairwise relationships using the pairs function, to see the correlation between each pair of variables, i.e. between the target and each input, but also between the inputs themselves.
pairs(carob_fertilizer_ML)
# Principal Component Analysis (PCA) is a statistical method designed to reduce dimensionality. This is useful if we want to build a model (decision support tool) based on multiple predictors. PCA extracts meaningful information from a table of multivariate data and expresses this information as a set of a few new variables called principal components. The aim of PCA is to identify directions that can be visualized graphically with minimal loss of information. The first principal component can be equivalently defined as the direction that maximizes the variance of the projected data. Principal components are often computed via the eigendecomposition of the covariance matrix of the data or the singular value decomposition (SVD) of the data matrix.
res.pca <- prcomp(carob_fertilizer_ML[,-4], scale = TRUE) # exclude the yield column, PCA is unsupervised machine learning
print(res.pca)
summary(res.pca)
eig.val<-get_eigenvalue(res.pca)
eig.val
# Based on the importance of the components, it is visible that the first two PCs have the highest proportion of variance. This is confirmed by the eigenvalues: they are large for the first PC and small for the subsequent PCs, which means that the first PC corresponds to the direction with the maximum amount of variation in the data set. In the scree plot, the first eigenvalue explains more than 75% of the variation and the second ~20%. Together, the first two eigenvalues therefore explain more than 95% of the variation, which is a good basis for further analysis.
fviz_eig(res.pca, col.var="blue")
# PCA results can also be assessed for the variables. First, extract the variable-level results: get_pca_var() provides a list of matrices containing all the results for the active variables (coordinates, correlation between variables and axes, squared cosine, and contributions).
var <- get_pca_var(res.pca)
var
corrplot(var$cos2, is.corr=FALSE)
fviz_cos2(res.pca, choice = "var", axes = 1:2)
# Additionally, the quality of representation of the variables can be drawn on the factor map, where cos2 values differ by gradient colors: variables with low cos2 values are colored “darkorchid4”, medium cos2 values “gold”, and high cos2 values “darkorange”. Positively correlated variables are grouped together, whereas negatively correlated variables are positioned on opposite sides of the plot origin. The distance between a variable and the origin measures the quality of that variable on the factor map: variables far from the origin are well represented.
fviz_pca_var(res.pca,
col.var = "cos2", # Color by the quality of representation
gradient.cols = c("darkorchid4", "gold", "darkorange"),
repel = TRUE
)
# P_fertilizer, K_fertilizer and N_fertilizer all have a very high cos2, which implies a good representation of these variables on the principal components; in this case the variables are positioned close to the circumference of the correlation circle. N_fertilizer has a slightly lower cos2, indicating that this variable is somewhat less well represented by the PCs. None of them is close to the center of the circle, so all are well represented by the first components.
# Draw a bar plot of the variable contributions to the most significant dimensions, namely PC1 and PC2.
# Contributions of variables to PC1
a<-fviz_contrib(res.pca, choice = "var", axes = 1)
# Contributions of variables to PC2
b<-fviz_contrib(res.pca, choice = "var", axes = 2)
grid.arrange(a,b, ncol=2, top='Contribution of the variables to the first two PCs')
# Based on the PCA, we could reduce the dimensionality and build predictive models on the principal components (keeping the first, most important ones), as sketched below. However, in this case we have only taken 3 predictors, so we keep them all for this exercise. With the full list of predictors available in the data, more advanced machine learning models could be built and PCA would be explored further.
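# A minimal sketch (illustrative only, not used in the analysis below): if we did want to train on the reduced space, the PC scores are stored in res.pca$x, so the first two components could be used as features like this:
pca_features <- as.data.frame(res.pca$x[, 1:2]) # scores on PC1 and PC2
pca_features$yield <- carob_fertilizer_ML$yield # re-attach the target
head(pca_features)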
# ---------------------------------------------------------------------------------------
# 4. MACHINE LEARNING (predictive statistics)
# In predictive analysis, data analysts leverage insights from previous data, aiming to anticipate events that are likely to occur in the future.
# I) Partition the data into training and test sets (usually there is also a validation set, but here for simplicity only two). We use 70% of the data for training and 30% for testing.
set.seed(2024)
split <- sample(1:nrow(carob_fertilizer_ML), as.integer(0.7*nrow(carob_fertilizer_ML)), F)
train <- carob_fertilizer_ML[split,]
test <- carob_fertilizer_ML[-split,]
# Here we don't pre-process the data: these are continuous, non-categorical variables, and since they share the same unit (kg/ha), no normalization is required.
# From the train and test data, select the predictors x (fertilizers), and the target y (yield)
x_predictors <- predictors[predictors != "yield"]
x_train <- train[, x_predictors]
x_test <- test[, x_predictors]
y_train <- train[, "yield"]
y_test <- test[, "yield"]
# ---------------------------------------------------------------------------------------
# II) Visualize exemplary algorithm
#Visualize an exemplary algorithm. This step helps you understand what is going on when you subsequently train a more complex algorithm on your data. We start with a decision tree, because it is the foundation of more complex algorithms such as random forests. From the simple decision tree fitted on our training data we can read the following. Starting at the top, the most important predictor for dividing the data into the two most dissimilar subsets (in terms of expected yields) is “N_fertilizer”, i.e. whether the amount applied is greater or less than 66 kg/ha. If the quantity applied is greater (value > 66), we continue along the right-hand branch of the tree, otherwise along the left-hand branch. For the right branch, the next important predictor is K_fertilizer, with a threshold of 41.5 kg/ha: <= 41.5 leads to node 12 with higher yields, and > 41.5 leads to node 13 with lower expected yields. For the left branch, i.e. “N_fertilizer” <= 66, the next predictor is again K_fertilizer (with thresholds <= 20 and > 20 kg/ha). When the K fertilizer applied is <= 20, we must also consider P fertilizer, with a threshold of 4.367 kg/ha (a lower amount leads to node 4, a higher amount to nodes 6 and 7). The combinations of fertilizers and thresholds that lead to the worst yield prediction end in node 7; in the training data there were 36 such cases. The box plots at the bottom of the chart show the statistics of the recorded yields in the training data for each combination of fertilizers and applied quantities.
# DECISION TREE
set.seed(2024)
tree1 <- party::ctree(y_train ~ ., data=cbind(x_train, y_train),
controls = ctree_control(minsplit=12, mincriterion = .999))
#plot(tree1)
knitr::include_graphics("DecTree.png")
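# To verify the split variables and thresholds described above, the fitted tree can also be inspected in text form:
print(tree1)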
# ---------------------------------------------------------------------------------------
# III) Machine learning models
# a) RANDOM FOREST
#First we run a random forest, which is basically an ensemble of many trees like the one we built in the previous section. The trick is that each tree is grown with only a random subset of all predictors considered at each node, and in the end all trees average their predictions for a specific observation. Taking a subset of the predictors at each split ensures that the trees are less correlated, i.e. not all of them use the same rules as the example tree shown above. If there is a dominant predictor (N_fertilizer), some trees in our forest are grown without it; these trees are better able to predict the cases where the other predictors carry the signal. The single tree from the previous section would struggle with those cases, but in the random forest there are trees that handle them as well. An ensemble of learners such as a random forest therefore most often outperforms a single learner.
# Let's set the hyperparameter via the tuneGrid argument. The “mtry” parameter specifies how many of the predictors to consider at each split. We have 3 predictors in our training data, so if you set mtry == 3 it is not a random forest any more, because all predictors are used and no random selection is applied. If you set mtry == 1, the trees will be very different from each other, but most of them will perform poorly because they are forced to use variables at the top split that may not be useful. The lower mtry, the more decorrelated the trees; the higher mtry, the more predictors each tree can consider and thus the better the performance of a single tree. The optimal value lies somewhere between 1 and 3, and there is no theoretical guidance on which value to take: it depends on the data at hand, how correlated the predictors are, whether there are distinct sub-groups where the causal structure works differently, and so on. Here mtry = 2 would be a natural default, but since we only have 3 predictors we try mtry = c(1, 2, 3), i.e. a random forest for every possible number of candidate predictors per split. A full hyperparameter search is out of scope for this exercise.
# In “trainControl” we specify method = "cv", which stands for cross-validation. This means we create another random split, dividing the training data into training and validation sets in order to determine which settings work best on the training data. With number = 5, the function creates a validation set of 1/5 of x_train and uses the remaining 4/5 for training. Cross-validation then repeats this training process, each time using a different fifth of the data as the validation set. This is done 5 times in total, so that every part of the data serves as a validation set once, and the results are averaged. This lets us use all of the training data while still having train/validation splits to avoid overfitting.
set.seed(2024)
garbage <- capture.output(mod_rf <- caret::train(x_train, y_train, method="rf",
tuneGrid = expand.grid(mtry = seq(1,ncol(x_train),by=1)),
trControl = trainControl(method="cv", number=5, verboseIter = T),
verbose = FALSE))
#There are 3 predictors and 4487 samples. Three values were tried for the hyperparameter “mtry”, and 5-fold cross-validation was performed with each of them. All values of mtry give essentially the same performance. For each mtry value, performance was averaged over the five cross-validation runs and measured via RMSE, R2 and MAE.
mod_rf
#The predictor importance plot tells us which variables were most often used for the important splits at the top of the trees. Unlike in the single decision tree built on all of the training data, where “N_fertilizer” was the most important predictor, across the ensemble of different trees it is actually “P_fertilizer”.
plot(varImp(mod_rf), main="Predictor importance of random forest model on training data",cex=0.5)
# b) NEURAL NETWORK
# Let’s try out a neural network. Such deep learning methods can be useful when the relationship between predictors and target is complex.
#Here we use the pre-processing steps of centering and scaling because neural networks are optimized more easily if the predictors have similar numerical ranges. Near-zero variance (“nzv”) means that we drop predictors for which almost all observations have the same value. Tree-based methods such as random forests are not as sensitive to these issues.
#We have a few more tuning parameters here. “Size” refers to the number of nodes in the hidden layer. Our network has an input layer of 3 nodes (the number of predictors), an output layer with one node (the target yield), and in between a hidden layer where interactions between the predictors and non-linear transformations can be learned. As with other hyperparameters, the optimal size of the hidden layer(s) depends on the data, so we simply try a small set of values, as with the random forest. Decay is a regularization parameter that shrinks the node weights a bit after each round of backpropagation updates (roughly the opposite of what the learning rate does in other implementations of neural networks). Roughly speaking, we don’t want the network to learn too ambitiously with each step of adapting its parameters to the evidence, in order to avoid overfitting. We pass 3 values for “size”, 4 values for “decay”, and two for “bag” (true or false, specifying how to aggregate the predictions of several networks trained with different random seeds, which is what the avNNet model does; bagging = bootstrap aggregating), so there are 3*4*2 = 24 combinations to try out.
set.seed(2024)
garbage <- capture.output(mod_nn <- caret::train(x_train, y_train, method="avNNet",
preProcess = c("center", "scale", "nzv"),
tuneGrid = expand.grid(size = seq(1,3,by=1), decay=c(1e-03, 0.01, 0.1,0),bag=c(T,F)),
trControl = trainControl(method="cv", number=5, verboseIter = T),
importance=T))
mod_nn
#As before, we inspect the predictor importance plot. Here “N_fertilizer” was the most frequently used predictor.
plot(varImp(mod_nn), main="Feature importance of neural network classifier on training data",cex=0.5)
# c) LINEAR REGRESSION
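# As a simple baseline, fit an ordinary least-squares model of yield on the three fertilizer rates, then re-fit it with caret using repeated cross-validation so that cross-validated metrics are available alongside the RF and NN models.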
lm.fit <- lm(yield ~ N_fertilizer + P_fertilizer + K_fertilizer, data = train)
summary(lm.fit)
regressControl <- trainControl(method="repeatedcv",
number = 4,
repeats = 5
)
mod_lm <- train(yield~N_fertilizer+P_fertilizer+K_fertilizer,
data = train,
method = "lm",
trControl = regressControl,
tuneGrid = expand.grid(intercept = FALSE))
# ---------------------------------------------------------------------------------------
# IV) Compare the performance of the three algorithms:
# Prescriptive Analytics: involves identifying the most effective strategy for implementing a decision that has been reached.
#Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals measure how far data points are from the regression line; RMSE measures how spread out these residuals are. In other words, it tells you how concentrated the data are around the line of best fit. An RMSE of zero indicates that the model fits perfectly; the lower the RMSE, the better the model and its predictions, while a higher RMSE indicates larger deviations between predictions and the ground truth. From the bar chart we can see that the RF outperforms NN and LM.
results <- data.frame(Model = c(mod_rf$method,mod_nn$method, mod_lm$method),
RMSE = c(max(mod_rf$results$RMSE), max(mod_nn$results$RMSE), max(mod_lm$results$RMSE)))
results %>% ggplot(aes(x=Model, y=RMSE, label=paste(round(RMSE)))) +
geom_col(fill="steelblue") + theme_minimal() + geom_label() +
ggtitle("RMSE in the training data by algorithm")
#R-Squared (R², the coefficient of determination) is a statistical measure in a regression model of the proportion of variance in the dependent variable that is explained by the independent variables. A good R-squared is close to 1. From the bar chart we can see that the RF outperforms NN and LM.
results <- data.frame(Model = c(mod_rf$method,mod_nn$method, mod_lm$method),
Rsquared = c(max(mod_rf$results$Rsquared), max(na.omit(mod_nn$results$Rsquared)), max(mod_lm$results$Rsquared)))
results %>% ggplot(aes(x=Model, y=Rsquared, label=paste(round(Rsquared,1)))) +
geom_col(fill="steelblue") + theme_minimal() + geom_label() +
ggtitle("R2 in the training data by algorithm")
# MAE stands for mean absolute error, a measure of how close the predictions are to the actual values in a regression problem. It is calculated as the average of the absolute differences between the predicted and the actual values over all observations in the data set. The closer MAE is to 0, the more accurate the model. From the bar chart we can see that the RF outperforms NN and LM.
results <- data.frame(Model = c(mod_rf$method, mod_nn$method, mod_lm$method),
                      # lower is better, so take each model's best (minimum) cross-validated MAE
                      MAE = c(min(mod_rf$results$MAE), min(mod_nn$results$MAE), min(mod_lm$results$MAE)))
results %>% ggplot(aes(x=Model, y=MAE, label=paste(round(MAE,1)))) +
geom_col(fill="steelblue") + theme_minimal() + geom_label() +
ggtitle("MAE in the training data by algorithm")
# REMARK! Even though RF performs better than NN and LM, I would still look for a model with better accuracy on the training data, i.e. R2 closer to 1 and RMSE and MAE closer to 0. This means we should either include more predictors in the machine learning models, or consider another model that is more appropriate for the data.
# V) EVALUATION ON "TEST" DATA
#Model evaluation against the test data: we now compare the model’s predictions against the reserved test dataset. We use the random forest to predict the test data and compare the predictions with the actual outcomes.
predictions <- predict(mod_rf, newdata = x_test,na.action = na.pass)
RMSE=sqrt(mean((y_test - predictions)^2))
RMSE
MAE=mae(y_test, predictions)
MAE
rss <- sum((predictions - y_test) ^ 2) ## residual sum of squares
tss <- sum((y_test - mean(y_test)) ^ 2) ## total sum of squares
R2 <- 1 - rss/tss
R2
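# An additional diagnostic (a sketch, not part of the original workflow): plot predicted against observed yields on the test set; points close to the red 1:1 line indicate accurate predictions.
plot(y_test, predictions, xlab = "Observed yield (kg/ha)", ylab = "Predicted yield (kg/ha)", main = "Random forest predictions on the test set")
abline(0, 1, col = "red")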
#The evaluation metrics RMSE, MAE and R2 show similar behaviour on the test data as on the training data.
# ---------------------------------------------------------------------------------------
# VI) MODEL PREDICTIONS
# Once we have a well-fitted model, we can make predictions for any set of predictor values. Suppose a farmer plans to apply the following amounts of fertilizer and wants to know the expected yield.
new_data <- data.frame(N_fertilizer = 120, P_fertilizer = 35, K_fertilizer = 50)
predictions <- predict(mod_rf, newdata = new_data[1, ], na.action = na.pass)
predictions # estimated yield (kg/ha) when applying 120 kg/ha N, 35 kg/ha P and 50 kg/ha K fertilizer
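# A sketch of how this could feed a decision support tool (illustrative only): scan a range of N rates while holding P and K fixed, and predict the yield response curve, from which a farmer could read off the most attractive N rate.
n_grid <- data.frame(N_fertilizer = seq(0, 150, by = 10), P_fertilizer = 35, K_fertilizer = 50)
n_grid$pred_yield <- predict(mod_rf, newdata = n_grid)
plot(n_grid$N_fertilizer, n_grid$pred_yield, type = "b", xlab = "N applied (kg/ha)", ylab = "Predicted yield (kg/ha)", main = "Predicted yield response to N (P = 35, K = 50 kg/ha)")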
```
### Future work
This analysis covers the first part of the exercise: it considers the yield response to N, P and K in the data, given the N, P and K values. This is advice for farmers to estimate what the maize yield will be. However, another model should be built to advise farmers on how much N, P and K to use at a given location and under given weather conditions. To this end, we would use spatial variables as predictors (latitude, longitude), together with the spatial variation in soil and weather conditions in the region, to predict site-specific N, P and K rates (multiple targets). In addition, the choice of relevant soil and weather columns needs to be discussed with experts. As far as I understand from the data description, the following columns should be used as predictors (in addition to latitude and longitude): `Soil_N`, `Soil_P_total`, `Soil_K`, `rain`, `OM_N`, `OM_P`, `OM_K`, as well as columns that contain information about irrigation, and probably `leaf_N`, `leaf_P`, `leaf_K` and `grain_N`, `grain_P`, `grain_K`. Discussions with experts are needed to better define the problem and to select the most relevant predictors among the columns listed by `colnames(carob_fertilizer)`, since some of the predictors are categorical (such as yes/no rain or yes/no irrigation). This also matters for the choice of machine learning model and for data pre-processing in the case of categorical inputs. A rough sketch of such an extended model is given below.
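The following chunk is a minimal, non-evaluated sketch of that extension. The column names are taken from the data description above and are assumptions about what the compiled data actually contains; the modelling choices (random forest, 5-fold CV) simply mirror the workflow used earlier.
```{r future_model_sketch, eval=FALSE}
# Illustrative sketch only (eval=FALSE): the column names below are assumed from
# the data description and may differ in the compiled data set.
extra_predictors <- c("latitude", "longitude", "Soil_N", "Soil_P_total", "Soil_K",
                      "rain", "OM_N", "OM_P", "OM_K")
dat <- carob_fertilizer[, c(extra_predictors,
                            "N_fertilizer", "P_fertilizer", "K_fertilizer", "yield")]
dat <- na.omit(as.data.frame(dat))
set.seed(2024)
mod_rf_ext <- caret::train(yield ~ ., data = dat, method = "rf",
                           trControl = trainControl(method = "cv", number = 5))
mod_rf_ext # compare RMSE/R2/MAE with the fertilizer-only model above
```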