-
Notifications
You must be signed in to change notification settings - Fork 0
/
Movie_Recommendation_System.Rmd
825 lines (624 loc) · 28 KB
/
Movie_Recommendation_System.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
---
title: "Movie Recommendation System"
author: "Luca Bartolomei"
date: "_`r format(Sys.Date(), '%d %B, %Y')`_"
output:
pdf_document:
df_print: kable
number_sections: yes
toc: yes
toc_depth: 3
fig_caption: yes
extra_dependencies: subfig
highlight: tango
#documentclass: report
classoption: a4paper
fig_height: 5
fig_width: 5
fontsize: 10pt
latex_engine: xelatex
mainfont: Arial
mathfont: LiberationMono
monofont: DejaVu Sans Mono
include-before: '`\newpage{}`{=latex}'
urlcolor: blue
---
\newpage
# **Introduction**
The purpose of the present project is to create a recommendation system for predicting the rating of movies.
A recommendation system is a subclass of information filtering system that seeks to predict the "rating" or "preference" a user would give to an item.
As a data source for training and final evaluation of the system will be used the MovieLens 10M Dataset from https://grouplens.org/datasets/movielens/10m/.
The residual mean squared error (RMSE) will be used as the evaluation system of the recommendation system.
The ultimate goal is to obtain an RMSE value of less than 0.86490.
In a first phase, the data will be imported and the sub-data sets configured for training and final evaluation.
We will start to represent the base data structure using table and charts in order to highlight the
relationship between the rating and the features; once it is clear how each feature effects the outcome,
it is possible to discard the ineffective ones.
A data cleanup will then be performed to focus only on relevant data.
Finally, selected algorithms will be implemented and then respective RMSE values compared
\newpage
# **Methods**
## Model evaluation
As already mentioned in the introduction, the evaluation model used is the residual mean squared error.
The residual Mean Squared Error (RMSE), is the square root of the the average squared error of the predictions and it's the typical metric to evaluate recommendation systems.
The RMSE penalizes large deviations from the mean and is appropriate in cases where small errors are not relevant.
In general terms RMSE can be described by the following formula
$$
\mbox{RMSE} = \sqrt{\frac{1}{N} \sum_{u,i}^{} \left( \hat{y}_{u,i} - y_{u,i} \right)^2 }
$$
Where $y_{u,i}$ is the rating for movie $i$ by user $u$ and denote our prediction with $\hat{y}_{u,i}$.
## Modeling
The models that will be implemented are described below
### Random Prediction
Random prediction is performed using the probability distribution observed during the data exploration.
It is only implemented to indicate the worst case scenario. It will be implemented through the Monte Carlo method
### Linear Regression
Linear regression is a linear approach to modeling the relationship between a dependent variable and one or more independent variables.
Linear regression is based on this formula
$$
Y_i = \beta_0 + \beta_1 x_{i,1} + .... + \beta_n x_{i,n} + \varepsilon_i, i=1,\dots,n
$$
Where $Y_i$ are the dependent variables, $x_{i,1}...x_{i,n}$ are the independents variables, $\beta_0$ is a constant, $\beta_1...\beta_n$ are coefficients and $\varepsilon_{u,i}$ is the error distribution.
So, to solve that equation we need to find the constant $\beta_0$ and $\beta_1...\beta_n$ coefficients
### Regularization
Regularization is the process of adding information in order to prevent overfitting by penalizing models with extreme parameter values.
Overfitting is the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably.
The aim of regularization is to regularize or to shrink the coefficient estimates towards zero.This technique discourages feeding a more complex or flexible model, so as to avoid the risk of overfitting.
If we consider this formula
$$
\frac{1}{N} \sum_{u,i} \left(y_{u,i} - \mu - \beta_i - \beta_u \right)^2 +
\lambda \left(\sum_{i} \beta_i^2 + \sum_{u} \beta_u^2\right)
$$
The first term is just the sum of squares and the second is a penalty that gets larger when many
$\beta_i$ and/or $\beta_u$ are large.
So, the idea is to find a value for $\lambda$ that minimizes the above equation.
An effective method to choose $\lambda$ that minimizes the RMSE is running simulations with several values of $\lambda$.
### Matrix factorization
Matrix factorization algorithms work by decomposing the user-item interaction matrix into the product of two rectangular matrices of lower dimension.
In simpler terms, Factorization is the method of expressing something big as a product of smaller factors.
\newpage
## Data preparation
Let's start by installing the necessary packages.
```{r call library,message=FALSE,warning=FALSE}
if(!require(tidyverse)) install.packages("tidyverse",
repos = "http://cran.us.r-project.org")
if(!require(caret)) install.packages("caret",
repos = "http://cran.us.r-project.org")
if(!require(data.table)) install.packages("data.table",
repos = "http://cran.us.r-project.org")
#Install package ggthemes to manage themes, geoms, and scales for 'ggplot2'
if(!require(ggthemes))
install.packages("ggthemes", repos = "http://cran.us.r-project.org")
#This package simplifies the way to manipulate the HTML or 'LaTeX' codes
#generated by 'kable()' and allows users to construct complex tables
#and customize styles using a readable syntax
if(!require(kableExtra))
install.packages("kableExtra", repos = "http://cran.us.r-project.org")
#This package make Dealing with Dates a Little Easier
if(!require(lubridate))
install.packages("lubridate", repos = "http://cran.us.r-project.org")
library(tidyverse)
library(caret)
library(data.table)
library(ggthemes)
library(scales)
library(kableExtra)
library(lubridate)
```
We download the data necessary for the creation of the recommendation system,
We split the dataset in two parts: the training set "edx" and the evaluation set "validation",
with 90% and 10% of the original dataset respectively.
Then, we split "edx" in two parts: the train set "train_set" and test set "test_set",
with 90% and 10% of edx set respectively.
We are going to use "train_set" to train the models which will be tested with "test_set".
The best model will be trained with "edx" and validated with "validation".
```{r create-train-test-sets,message=FALSE,warning=FALSE}
dl <- tempfile()
download.file("http://files.grouplens.org/datasets/movielens/ml-10m.zip", dl)
ratings <- fread(text = gsub("::", "\t",
readLines(unzip(dl, "ml-10M100K/ratings.dat"))),
col.names = c("userId",
"movieId",
"rating",
"timestamp"))
movies <- str_split_fixed(readLines(unzip(dl, "ml-10M100K/movies.dat")),
"\\::", 3)
colnames(movies) <- c("movieId", "title", "genres")
movies <- as.data.frame(movies) %>% mutate(movieId = as.numeric(movieId),
title = as.character(title),
genres = as.character(genres))
movielens <- left_join(ratings, movies, by = "movieId")
# Validation set will be 10% of MovieLens data
set.seed(1, sample.kind="Rounding")
test_index <- createDataPartition(y = movielens$rating,
times = 1,
p = 0.1,
list = FALSE)
edx <- movielens[-test_index,]
temp <- movielens[test_index,]
# Make sure userId and movieId in validation set are also in edx set
validation <- temp %>%
semi_join(edx, by = "movieId") %>%
semi_join(edx, by = "userId")
# Add rows removed from validation set back into edx set
removed <- anti_join(temp, validation)
edx <- rbind(edx, removed)
rm(dl, ratings, movies, test_index, temp, movielens, removed)
#The edx set is used for training and testing, and the validation set
#is used for final validation to simulate the new data.
#Here, we split the edx set in 2 parts: the training set and the test set.
#The model building is done in the training set, and the test set is
#used to test the model. When the model is complete, we use the validation
#set to calculate the final RMSE. We use the same procedure used
#to create edx and validation sets.
#The training set will be 90% of edx data and the test set
#will be the remaining 10%.
set.seed(1, sample.kind="Rounding")
test_index <- createDataPartition(y = edx$rating,
times = 1,
p = 0.1,
list = FALSE)
train_set <- edx[-test_index,]
temp <- edx[test_index,]
# Make sure userId and movieId in test set are also in train set
test_set <- temp %>%
semi_join(train_set, by = "movieId") %>%
semi_join(train_set, by = "userId")
# Add rows removed from test set back into train set
removed <- anti_join(temp, test_set)
train_set <- rbind(train_set, removed)
rm(test_index, temp, removed)
```
\newpage
## Data exploration
Let's analyze the structure of the data.
```{r Data exploration,message=FALSE,warning=FALSE}
#Structure
str(edx)
#Dimension
dim(edx)
```
edx has six variables:
| | |
|-----------|-----------|
| userId | integer |
| movieId | numeric |
| rating | numeric |
| timestamp | integer |
| title | character |
| genres | character |
The variable "rating" is the desired outcome.
The other variables are the potential predictors.
## Visualization
Now we analyze the distribution of the rating and the distribution of the rating
with respect to users, movies, genres and dates to get a first idea of
the effect of these predictors on the rating
### Rating analysis
Round values receive more ratings than decimals and higher ratings are prevalent.
High ratings are predominant
```{r Rating Distribution,message=FALSE,warning=FALSE}
edx %>% group_by(rating) %>%
summarise(count=n()) %>%
ggplot(aes(x=rating, y=count)) +
geom_line() +
geom_point() +
scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
labels = trans_format("log10", math_format(10^.x))) +
ggtitle("Rating Distribution") +
xlab("Rating") +
ylab("Count") +
theme_economist()
```
### Users analysis
Rating distribution based on Users is right skewed.
```{r Rating distribution based on Users,message=FALSE,warning=FALSE}
edx %>% group_by(userId) %>%
summarise(n=n()) %>%
ggplot(aes(n)) +
geom_histogram(color = "white") +
scale_x_log10() +
ggtitle("Rating distribution based on Users") +
xlab("Number of Ratings") +
ylab("Number of Users") +
scale_y_continuous(labels = comma) +
theme_economist()
```
### Movies analysis
Rating distribution based on Movies is almost symmetric
```{r Rating distribution based on Movies,message=FALSE,warning=FALSE}
edx %>% group_by(movieId) %>%
summarise(n=n()) %>%
ggplot(aes(n)) +
geom_histogram(color = "white") +
scale_x_log10() +
ggtitle("Rating distribution based on Movies") +
xlab("Number of Ratings") +
ylab("Number of Movies") +
theme_economist()
```
### Genres analysis
Some movies fall under several genres
```{r Edx head,message=FALSE,warning=FALSE}
head(edx, 20) %>% kable(caption = "Movies") %>%
kable_styling(font_size = 10, position = "center",
latex_options = c("scale_down","HOLD_position"))
#We can see that different movies belong to multiple genres
tibble(count = str_count(edx$genres, fixed("|")), genres = edx$genres) %>%
group_by(count, genres) %>%
summarise(n = n()) %>%
arrange(-count) %>%
head() %>% kable(caption = "Genres") %>%
kable_styling(font_size = 10, position = "center",
latex_options = c("scale_down","HOLD_position"))
```
The plots show a clear effect of genres on the rating
```{r Rating distribution based on Genres,message=FALSE,warning=FALSE}
temp_genre <- edx %>%
separate_rows(genres,sep = "\\|") %>% mutate(value=1) %>%
group_by(genres) %>%
summarize(n=n()) %>%
ungroup() %>%
mutate(sumN = sum(n), percentage = n/sumN) %>%
arrange(-percentage)
temp_genre %>%
ggplot(aes(reorder(genres, percentage), percentage, fill= percentage)) +
geom_bar(stat = "identity") + coord_flip() +
scale_fill_distiller(palette = "YlOrRd") +
labs(y = "Percentage", x = "Genre") +
ggtitle("Distribution of Genres by Percent Rated")
edx %>% group_by(genres) %>%
summarize(n = n(), avg = mean(rating), se = sd(rating)/sqrt(n())) %>%
filter(n >= 1000) %>%
mutate(genres = reorder(genres, avg)) %>%
ggplot(aes(x = genres, y = avg, ymin = avg - 2*se, ymax = avg + 2*se)) +
geom_point() +
geom_errorbar() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
### Date analysis
The plot shows that time has no strong effect on average ratings
```{r Rating distribution based on Date,message=FALSE,warning=FALSE}
edx <- mutate(edx, date = as_datetime(timestamp))
validation <- mutate(validation, date = as_datetime(timestamp))
test_set <- mutate(test_set, date = as_datetime(timestamp))
train_set <- mutate(train_set, date = as_datetime(timestamp))
edx %>% mutate(date = round_date(date, unit = "week")) %>%
group_by(date) %>%
summarize(rating = mean(rating)) %>%
ggplot(aes(date, rating)) +
geom_point() +
geom_smooth() +
theme_economist()
```
\newpage
## Data cleaning
It is necessary to reduce the number of predictors as otherwise the complexity
of the analysis would be too high and, above all, it would require an amount
of memory generally not available in a laptop.
For example, genres are computationally very expensive as they would require
to separate multiple genres for many films in multiple variables.
For this analysis we will only share movie and user
```{r Data cleaning,message=FALSE,warning=FALSE}
train_set <- train_set %>% select(userId, movieId, date, rating, title)
test_set <- test_set %>% select(userId, movieId, date, rating, title)
```
\newpage
# **Results**
## Model evaluation
For the calculation of the RMSE we will use the following function
```{r RMSE function,message=FALSE,warning=FALSE}
RMSE <- function(original_ratings, predicted_ratings){
sqrt(mean((original_ratings - predicted_ratings)^2))
}
```
## Random Prediction
We randomly predict ratings using the probabilities observed in the training set.
We start by calculating the probability of each assessment in the training set,
then predict the assessment for the test set and compare it
to the actual assessment.
```{r Random prediction,message=FALSE,warning=FALSE}
set.seed(1, sample.kind = "Rounding")
#Create the probability of each rating
random_prob <- function(x, y) mean(x == y)
rating <- seq(0.5,5,0.5)
# Estimate the probability of each rating with Monte Carlo simulation
B <- 10^3
monte_carlo <- replicate(B, {
rec <- sample(train_set$rating, 100, replace = TRUE)
sapply(rating, random_prob, x = rec)
})
monte_carlo_prob <- sapply(1:nrow(monte_carlo),
function(ind)
mean(monte_carlo[ind,]))
#Predict random ratings
random_prediction <- sample(rating, size = nrow(test_set),
replace = TRUE,
prob = monte_carlo_prob)
#Create a table with the error results
monte_carlo_result <- tibble(Method = "Random prediction",
RMSE = RMSE(test_set$rating,
random_prediction))
result <- tibble(Method = "Project target", RMSE = 0.86490)
result <- bind_rows(result,
monte_carlo_result)
result %>% kable(caption = "Result") %>%
kable_styling(font_size = 10, position = "center",
latex_options = c("scale_down","HOLD_position"))
```
## Linear Regression
Linear regression euqation applied to our case is this one
$$
\hat{Y} = \mu + \beta_i + \beta_u + \varepsilon_{u,i}
$$
Where $\mu$ is the mean of rating, $\beta_i$ is the movie bias, $\beta_u$ is the user bias and $\varepsilon_{u,i}$ is the error distribution
We will not use the lm () function as it is not recommended for large databases
```{r Evaluate mean rating,message=FALSE,warning=FALSE}
set.seed(1, sample.kind="Rounding")
#Evaluate mean of the ratings
mean_rating <- mean(train_set$rating)
```
```{r Evaluate movie bias,message=FALSE,warning=FALSE}
#Evaluate movie bias
movie_bias <- train_set %>%
group_by(movieId) %>%
summarize(b_i = mean(rating - mean_rating))
```
Movie bias has an asymmetric distribution
```{r Show movie bias,message=FALSE,warning=FALSE}
movie_bias %>% ggplot(aes(x = b_i)) +
geom_histogram(bins=10, col = I("white")) +
ggtitle("Movie bias distribution") +
xlab("Movie bias") +
ylab("Count") +
scale_y_continuous(labels = comma) +
theme_economist()
```
```{r Evaluate user bias,message=FALSE,warning=FALSE}
#Evaluate user bias
user_bias <- train_set %>%
left_join(movie_bias, by = 'movieId') %>%
group_by(userId) %>%
summarize(b_u = mean(rating - mean_rating - b_i))
```
User bias is almost symmetrical
```{r Show user bias,message=FALSE,warning=FALSE}
user_bias %>% ggplot(aes(x = b_u)) +
geom_histogram(bins=10, col = I("white")) +
ggtitle("User bias distribution") +
xlab("User bias") +
ylab("Count") +
scale_y_continuous(labels = comma) +
theme_economist()
```
```{r Predict rating,message=FALSE,warning=FALSE}
#Predict rating
regression_prediction <- test_set %>%
left_join(movie_bias, by='movieId') %>%
left_join(user_bias, by='userId') %>%
mutate(prediction = mean_rating + b_i + b_u) %>%
.$prediction
regression_rmse <- RMSE(test_set$rating,
regression_prediction)
regression_rmse
regression_prediction_result <- tibble(Method = "Linear regression prediction",
RMSE = RMSE(test_set$rating,
regression_prediction))
result <- bind_rows(result,
regression_prediction_result)
result %>% kable(caption = "Results") %>%
kable_styling(font_size = 10, position = "center",
latex_options = c("scale_down","HOLD_position"))
```
Linear regression method improves the performance in the calculation of the RMSE
compared to Random prediction
## Regularization
We want to regularize movie and user bias adding a penalty factor $\lambda$
and find a value to pick the best value that minimizes the RMSE.
We use cross-validation through the use of the function sapply()
```{r Regularization function,message=FALSE,warning=FALSE}
set.seed(1, sample.kind="Rounding")
#Regularization function
regularization <- function(lambda, training, test){
#Mean
mean_rating <- mean(training$rating)
#Movie bias
movie_bias <- training %>%
group_by(movieId) %>%
summarize(b_i = sum(rating - mean_rating)/(n()+lambda))
#User bias
user_bias <- training %>%
left_join(movie_bias, by="movieId") %>%
filter(!is.na(b_i)) %>%
group_by(userId) %>%
summarize(b_u = sum(rating - mean_rating - b_i)/(n()+lambda))
#Prediction
predicted_ratings <- test %>%
left_join(movie_bias, by = "movieId") %>%
left_join(user_bias, by = "userId") %>%
filter(!is.na(b_i), !is.na(b_u)) %>%
mutate(predicted = mean_rating + b_i + b_u) %>%
pull(predicted)
return(RMSE(test$rating, predicted_ratings))
}
```
```{r Definition of lambdas,message=FALSE,warning=FALSE}
# Definition of lambdas
lambdas <- seq(0, 10, 0.25)
```
```{r Tuning,message=FALSE,warning=FALSE}
#Tuning
rmses <- sapply(lambdas,
regularization,
training = train_set,
test = test_set)
```
```{r Plot lambdas,message=FALSE,warning=FALSE}
# Plot lambdas
tibble(Lambda = lambdas, RMSE = rmses) %>%
ggplot(aes(x = Lambda, y = RMSE)) +
geom_point() +
ggtitle("Regularization") +
theme_economist()
```
```{r Pick the best value,message=FALSE,warning=FALSE}
# Pick the best value
lambda <- lambdas[which.min(rmses)]
lambda
```
The best value is 5
```{r Regularization prediction}
# Regularization prediction
#Mean
mean_rating <- mean(train_set$rating)
#Movie bias
movie_bias <- train_set %>%
group_by(movieId) %>%
summarize(b_i = sum(rating - mean_rating)/(n()+lambda))
#User bias
user_bias <- train_set %>%
left_join(movie_bias, by="movieId") %>%
filter(!is.na(b_i)) %>%
group_by(userId) %>%
summarize(b_u = sum(rating - mean_rating - b_i)/(n()+lambda))
#Prediction
regularization_prediction <- test_set %>%
left_join(movie_bias, by = "movieId") %>%
left_join(user_bias, by = "userId") %>%
filter(!is.na(b_i), !is.na(b_u)) %>%
mutate(predicted = mean_rating + b_i + b_u) %>%
pull(predicted)
regularization_rmse <- RMSE(test_set$rating, regularization_prediction)
regularization_result <- tibble(Method = "Regularization prediction",
RMSE = regularization_rmse)
result <- bind_rows(result,
regularization_result)
result %>% kable(caption = "Results") %>%
kable_styling(font_size = 10, position = "center",
latex_options = c("scale_down","HOLD_position"))
```
Regularization provides a slight improvement in the RMSE estimate
## Matrix factorization
To perform matrix factorization first of all we need to convert the data in a
user-movie matrix and then approximate this matrix as a product of two smaller
matrices.
These operations could be very expensive in terms of memory, so we are going
to use the recosystem package, which provides the complete solution
for a recommendation system using matrix factorization.
Basically recosystem performs the following operations:
* Create a model object by calling Reco();
* Call the tune() method to select best tuning parameters along a set of candidate values;
* Call train() method to train the model;
* Call predict() method to compute predicted values.
This package has several parameters whose values can be calibrated to increase performance,
in particular, the nthread parameter which sets the number of threads for parallel computing.
```{r Install recosystem package,message=FALSE,warning=FALSE}
#Install recosystem package
if(!require(recosystem))
install.packages("recosystem", repos = "http://cran.us.r-project.org")
library(recosystem)
```
```{r Create matrices,message=FALSE,warning=FALSE}
set.seed(1, sample.kind = "Rounding")
#Create matrices
train_data <- with(train_set, data_memory(user_index = userId,
item_index = movieId,
rating = rating))
test_data <- with(test_set, data_memory(user_index = userId,
item_index = movieId,
rating = rating))
```
```{r Create the model object,message=FALSE,warning=FALSE}
#Create the model object
model_object <- Reco()
```
```{r Call the tune() method,message=FALSE,warning=FALSE}
#Select the best tuning parameters
best_tuning <- model_object$tune(train_data,
opts = list(dim = c(10, 20, 30),
lrate = c(0.1, 0.2),
costp_l1 = c(0.01, 0.1),
costq_l1 = c(0.01, 0.1),
nthread = 4,
niter = 10))
```
```{r Call train() method,message=FALSE,warning=FALSE}
#Train model
model_object$train(train_data,
opts = c(best_tuning$min, nthread = 4, niter = 20))
```
```{r Call predict() method,message=FALSE,warning=FALSE}
#Predict
matrix_prediction <- model_object$predict(test_data, out_memory())
matrix_rmse <- RMSE(test_set$rating,
matrix_prediction)
matrix_result <- tibble(Method = "Matrix factorization prediction",
RMSE = matrix_rmse)
result <- bind_rows(result,
matrix_result)
result %>% kable(caption = "Results") %>%
kable_styling(font_size = 10, position = "center",
latex_options = c("scale_down","HOLD_position"))
```
Matrix factorization further improves the performance in calculating the RMSE
\newpage
## Final validation
From the previous results we can see that Matrix factorization gives the best results on RMSE evaluation.
In these final step we train Matrix factorization model with complete edx dataset and then we will evaluate the RMSE value on validation dataset
### Matrix factorization
```{r Final Matrix factorization,message=FALSE,warning=FALSE}
set.seed(1, sample.kind = "Rounding")
#Create matrix
edx_data <- with(edx, data_memory(user_index = userId,
item_index = movieId,
rating = rating))
validation_data <- with(validation, data_memory(user_index = userId,
item_index = movieId,
rating = rating))
```
```{r Final Create the model object,message=FALSE,warning=FALSE}
#Create the model object
final_model_object <- Reco()
```
```{r Final Call the tune() method,message=FALSE,warning=FALSE}
#Select the best tuning parameters
final_best_tuning <- final_model_object$tune(edx_data,
opts = list(dim = c(10, 20, 30),
lrate = c(0.1, 0.2),
costp_l1 = c(0.01, 0.1),
costq_l1 = c(0.01, 0.1),
nthread = 4,
niter = 10))
```
```{r Final Call train() method,message=FALSE,warning=FALSE}
#Train model
final_model_object$train(edx_data,
opts = c(final_best_tuning$min,
nthread = 4,
niter = 20))
```
```{r Final Call predict() method,message=FALSE,warning=FALSE}
#Predict
final_matrix_prediction <- final_model_object$predict(validation_data,
out_memory())
final_matrix_rmse <- RMSE(validation$rating,
final_matrix_prediction)
final_matrix_result <- tibble(Method = "Final Matrix factorization prediction",
RMSE = final_matrix_rmse)
result <- tibble(Method = "Project target", RMSE = 0.86490)
result <- bind_rows(result,
final_matrix_result)
result %>% kable(caption = "Results") %>%
kable_styling(font_size = 10, position = "center",
latex_options = c("scale_down","HOLD_position"))
```
\newpage
# **Conclusion**
The objective of this project was to design a recommendation system for
predicting the rating of movies.
The goal was to have a RMSE equal to or less than 0.86490
We started by analyzing the dataset to understand the structure of the data
We have therefore defined a subset of predictors also considering the limitations related to memory
We identified the methods to be implemented to create the system and then evaluated their quality in minimizing the RMSE
The Matrix factorization method has been identified as providing the best performance in terms of error reduction
A computer not limited by the memory factor could allow to implement a system that also takes into consideration other predictors besides those considered (movies and users), such as genre.
It is reasonable to assume that such a system would lead to a reduction in the value of RMSE