---
title: "Beverage Review Project"
author: "Nick McCulloch"
output:
  html_document:
    toc: yes
    toc_depth: 2
  pdf_document:
    toc: yes
    toc_depth: '2'
  word_document:
    toc: yes
    toc_depth: '2'
---
```{r global_options, include=FALSE}
knitr::opts_chunk$set(echo = T, warning = F, message = F)
```
# Introduction
## Project Description
The purpose of this project is to better understand factors that contribute to consumer ratings for particular beverages, namely, beer. The beer industry is booming in the American market, with consumers switching slowly from drinks with a high alcohol percentage to lighter and more refreshing beverages. Consumer trends can be seen in overall beverage ratings and in ratings for subcategories such as palate, aroma, or Alcohol By Volume. By focusing on these essential categories, key predictors can be identified and used to direct further development.
## Data Description
The initial dataset consists of 1.5 million consumer reviews provided by BeerAdvocate.com, a site where "beer geeks and industry professionals" can submit numeric and written reviews on a wide array of beverages. Because the focus of the project is larger trends in consumer preferences across products, the data have been filtered to exclude products with fewer than 500 reviews, aggregated by style, and filtered again to exclude any style with fewer than 5 products. The resulting dataset consists of 25 styles, 275 individual products, and 270,323 individual consumer reviews.
The variable of interest is the Overall Score of a drink on a continuous scale from 0 to 5, with a score of 5.0 being the highest.
The key predictors are Beer Style Category (BSC), Review Count, Alcohol by Volume (ABV), Aroma Score, Appearance Score, Palate Score, and Taste Score.
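The filtering described above was done before the CSV in this repository was produced; a minimal sketch of that kind of pipeline, assuming a hypothetical review-level data frame `raw_reviews` with `product` and `style` columns (not included here), might look like this.
```{r data_prep_sketch, eval=FALSE}
# Sketch only (not evaluated): illustrates the filtering described above.
# 'raw_reviews' and its column names are hypothetical; the review-level data
# are not shipped with this project.
library(dplyr)
beer_reviews_sketch <- raw_reviews %>%
  group_by(product) %>%
  filter(n() >= 500) %>%                 # keep products with at least 500 reviews
  group_by(style) %>%
  filter(n_distinct(product) >= 5) %>%   # keep styles with at least 5 products
  ungroup()
```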
## Packages
```{r Packages, message=TRUE, warning=FALSE}
library(pacman)
p_load(stats, lmtest, dplyr, klaR, car, glmnet, pls, ggplot2, patchwork, ggthemes, reshape2)
```
## Reading Data
```{r}
# Reading in the data with strings as factors b/c of text in style column
beer_data <- read.table("beer_data.csv", header = T, sep = ",", stringsAsFactors = T)
# Checking the data size, columns, class, etc.
head(beer_data)
names(beer_data)
str(beer_data)
data_summary <- summary(beer_data)
```
The number of reviews per product and ABV have the highest variance among the predictors. The median number of reviews per product is 772, with a range of 2,500 and a standard deviation (SD) of 510.62. ABV has a median of 7.10, a range of 13.50, and an SD of 2.42. All of the other predictors have medians between 3.99 (review taste) and 4.07 (review appearance), with ranges from 2.23 (review overall) to 2.62 (review aroma). The overall review score has a mean of 3.962, a median of 4.02, and a range of 2.253; since the maximum possible score is 5, the observed scores span roughly 45% of the scale. With the exception of review count and ABV, all predictors show strong negative (left) skewness, with values from -1.43 (review palate) to -1.92 (review appearance). All predictors also show positive kurtosis relative to a normal distribution, implying fatter tails and more extreme values; review appearance has the highest kurtosis (6.28), followed by the overall score (5.53).
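The skewness and kurtosis figures quoted above were computed outside the chunk; a minimal sketch of how they could be reproduced, assuming the `e1071` package (an extra dependency not in the Packages chunk), is shown below. Note that `e1071::kurtosis()` returns excess kurtosis by default, which may differ from the convention behind the quoted figures.
```{r skew_kurtosis_sketch}
# Sketch: per-column SD, skewness, and (excess) kurtosis for the numeric predictors.
# e1071 is an assumed extra dependency, not part of the original package list.
p_load(e1071)
num_cols <- beer_data[sapply(beer_data, is.numeric)]
round(rbind(
  sd       = sapply(num_cols, sd),
  skewness = sapply(num_cols, skewness),
  kurtosis = sapply(num_cols, kurtosis)
), 2)
```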
# Preliminary Tests and Transformations
## Test Plot
```{r}
# quick check with base R plot to get sense of data.
plot(beer_data$style, beer_data$overall)
```
## Transformations
```{r}
# transforming 'styles' strings to lower case to improve spacing
beer_data$style <- tolower(beer_data$style)
# replacing american double to shorten string
beer_data$style <- gsub("american double", "us_double", beer_data$style)
beer_data$style <- factor(beer_data$style)
```
## Improved Test Plot
```{r}
# Now an improved boxplot
a_data <- ggplot(beer_data, aes(x = overall, y = reorder(style, overall)))
a_graph <- geom_boxplot()
a_theme <- theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1, margin = margin(t = 1, r = 20, b = 1, l = 0)))
a_labs <- labs(title = "Beer Style by Score", x = NULL, y = NULL)
# a_scale <- scale_y_discrete(guide = guide_axis(check.overlap = T))
a_data + a_graph + a_theme + a_labs #+ a_scale
```
## Further Transformations
```{r}
# creating column of average overall scores by style.
style_avg <- beer_data %>%
  group_by(style) %>%
  summarise(mean_style = mean(overall))
print(style_avg[order(style_avg$mean_style), ])
```
## Basic Linear Model
```{r}
lm_basic <- lm(overall ~ ., data = beer_data)
basic_summary <- summary(lm_basic)
basic_summary
```
## Residual Plots and Breusch-Pagan
```{r}
par(mfrow = c(2, 2))
plot(lm_basic)
```
The residuals vs. fitted plot isn't perfect, with the largest deviation from linearity on the right. The QQ plot shows mildly fat tails, indicating more data at the extremes. The scale-location plot is neither horizontal nor evenly spread, indicating heteroskedasticity. The residuals vs. leverage plot shows some notable outliers, including one roughly 3.5x the mean.
```{r}
bptest(lm_basic, data = beer_data)
```
The BP test strongly suggests heteroskedasticity.
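For context, the Breusch-Pagan test regresses the squared OLS residuals on the predictors and asks whether they explain any of the residual variance. Under the null hypothesis of homoskedasticity the (studentized) statistic is approximately

$$BP = n \, R^2_{\text{aux}} \sim \chi^2_{p},$$

where $R^2_{\text{aux}}$ is the R-squared of that auxiliary regression and $p$ is the number of predictors, so a small p-value rejects constant error variance.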
## Visual Check for Multicollinearity
```{r }
par(mfrow = c(2, 2))
attach(beer_data)
plot(appearance, aroma, main = "appearance vs. aroma")
plot(appearance, taste, main = "appearance vs. taste")
plot(taste, aroma, main = "taste vs. aroma")
plot(taste, palate, main = "taste vs. palate")
```
```{r}
# excluding 'style' from the corr plot, which is the only non-numeric column
corr_data <- beer_data[-1]
# creating correlation matrix
corr_mat <- round(cor(corr_data), 2)
# sorting matrix for easier interpretation
dist <- as.dist((1 - corr_mat) / 2)
# clustering the dist matrix
hclust <- hclust(dist)
corr_mat <- corr_mat[hclust$order, hclust$order]
# reduce the size of correlation matrix
melted_corr_mat <- melt(corr_mat)
# plotting the correlation heatmap
ggplot(data = melted_corr_mat, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  labs(title = "Correlation Heatmap")
```
Somewhat unsurprisingly, similar or related categories in the data like 'taste' and 'aroma' seem to be correlated. The plots above demonstrate the severe multicollinearity found in the data.
```{r}
par(mfrow = c(1, 2))
plot(revcount, overall, main = "review count vs. overall score")
plot(abv, overall, main = "Alcohol By Volume vs. Overall Score")
```
There is not a similar level of correlation observed between overall score and ABV or number of reviews.
# Main Analysis and Model Testing
## Weighted Least Squares (WLS)
The multicollinearity and heteroskedasticity found above suggest testing a WLS model and, because the residuals are approximately Gaussian, a weighted GLM as well.
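WLS replaces the ordinary least-squares criterion with a weighted sum of squares,

$$\hat{\beta}_{\text{WLS}} = \arg\min_{\beta} \sum_{i=1}^{n} w_i \left( y_i - x_i^{\top}\beta \right)^2, \qquad w_i = \frac{1}{\hat{\sigma}_i^2},$$

and the chunk below estimates $\hat{\sigma}_i$ by regressing the absolute OLS residuals on the OLS fitted values, so observations with larger estimated error variance receive less weight.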
```{r}
# First extracting fitted values from the model to create weights
fitted_ols <- fitted(lm_basic)
abs_res <- abs(residuals(lm_basic))
cbind(fitted_ols, abs_res)[1:10, ]
lm_abs_res <- lm(abs_res ~ fitted_ols)
# sanity check
# fitted(lm_abs_res)[1:10]
fit_df <- as.data.frame(cbind(fitted_ols, abs_res))
# plotting the values
ggplot(data = fit_df, aes(x = fitted_ols, y = abs_res)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE) +
  labs(title = "Fitted OLS by Absolute Residuals")
# saving the weights to new variable
wts <- 1 / fitted(lm_abs_res)^2
# fitting WLS and weighted GLM models using the weights
fit_wls <- lm(overall ~ ., data = beer_data, weights = wts)
fit_wglm <- glm(overall ~ ., data = beer_data, weights = wts)
# creating summaries of the new models for comparison
sum_lm_basic <- summary(lm_basic) # R-squared 0.9641
sum_fit_wglm <- summary(fit_wglm)
sum_fit_wls <- summary(fit_wls) # R-squared - 0.9548
paste("lm basic-adjusted r-squared: ", format(sum_lm_basic$adj.r.squared, digits = 4))
paste("fit_wls-adjusted r-squared: ", format(sum_fit_wls$adj.r.squared, digits = 4))
```
## WGLM
```{r}
# first running regression with existing levels
beer_fit <- glm(overall ~ ., data = beer_data)
basic_sum <- summary(lm_basic)
fit_sum <- summary(beer_fit)
# extracting the coefficients, dropping non-style coefs, reordering, and rounding
# could have used coef(summary(x)) instead.
basic_coef <- data.frame(basic_sum$coefficients)
basic_coef <- basic_coef[!(row.names(basic_coef) %in% c("abv", "appearance", "aroma", "palate", "taste", "revcount", "(Intercept)")), ]
basic_coef <- round(basic_coef, 4)
basic_coef$coefficients <- rownames(basic_coef)
basic_coef <- basic_coef[, c(5, 1, 2, 3, 4)]
fit_coef <- data.frame(fit_sum$coefficients)
fit_coef <- fit_coef[!(row.names(fit_coef) %in% c("abv", "appearance", "aroma", "palate", "taste", "revcount", "(Intercept)")), ]
fit_coef <- round(fit_coef, 4)
fit_coef$coefficients <- rownames(fit_coef)
fit_coef <- fit_coef[, c(5, 1, 2, 3, 4)]
# extracting style with highest p-value and lowest estimate
attach(basic_coef)
level_data <- rbind(
  basic_coef[Pr...t.. == max(Pr...t..), ],
  basic_coef[Estimate == min(Estimate), ]
)
detach(basic_coef)
attach(fit_coef)
level_data <- rbind(
  level_data, fit_coef[Pr...t.. == max(Pr...t..), ],
  fit_coef[Estimate == min(Estimate), ]
)
rownames(level_data) <- c("basic coef-highest p-value:", "basic coef-lowest estimate:", "fit_coef-highest p-value:", "fit_coef-lowest estimate:")
detach(fit_coef)
print(level_data[, c(1, 2, 5)])
```
'American Pale Wheat Ale' has the lowest estimate and 'Hefeweizen' turns out to have the highest p-value, making either a good candidate for a releveled reference.
## Releveled WGLM
```{r}
beer_datarlv <- beer_data
# releveling the factors
beer_datarlv$style <- relevel(beer_datarlv$style, ref = "american pale wheat ale")
# sanity check
# levels(beer_datarlv$style)
# fitting the releveled model
beer_rlv <- glm(overall ~ ., data = beer_datarlv)
```
```{r}
# anova table for the new model.
anova(beer_rlv)
```
```{r}
# creating more readable summary for the coefficients
rlv_summ <- summary(beer_rlv)
rlv_coefs <- data.frame(rlv_summ$coefficients)
rlv_coefs <- rlv_coefs[order(rlv_coefs$Estimate), ]
pv <- rlv_coefs[, "Pr...t.."]
signif <- symnum(pv,
  corr = FALSE, na = FALSE,
  cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
  symbols = c("***", "**", "*", ".", " ")
)
legend <- attr(signif, "legend")
signif <- as.vector(signif)
rlv_coefs <- cbind(rlv_coefs, signif)
# [1] 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
rlv_coefs <- rlv_coefs[, 1:5]
rlv_coefs
```
## Cross-validation -- comparing OLS to GLM
## Condition Index
```{r}
# library(klaR)
ci_basic <- cond.index(lm_basic, data = beer_data)
ci_basic[length(ci_basic)]
ci_glm <- cond.index(beer_rlv, data = beer_data)
ci_glm[length(ci_glm)]
```
The high condition indices indicate that both models suffer from severe multicollinearity (consistent with the correlation plots above).
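For reference, the condition index is the square root of the ratio of the largest to smallest eigenvalue of the scaled cross-product matrix of the predictors,

$$\kappa = \sqrt{\frac{\lambda_{\max}}{\lambda_{\min}}},$$

and values above roughly 30 are conventionally taken as a sign of serious multicollinearity.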
```{r}
# fitting an ad-hoc model that excludes several of the correlated variables.
new_lm <- glm(overall ~ abv + appearance + taste + revcount + style, data = beer_data)
ci_new <- cond.index(new_lm, data = beer_data)
ci_new[length(ci_new)]
```
The model above excludes the 'aroma' and 'palate' variables which seemed to be highly correlated with 'taste'. The resulting condition index is significantly improved.
## VIF
```{r}
# library(car)
print("lm_basic_vif")
vif(lm_basic)
print("new_lm_vif")
vif(new_lm)
```
Calculating the VIF for both models offers further support for the conclusion that there is severe multicollinearity. The initial model has VIFs over 5 for 'aroma', 'palate', and 'taste'. By contrast, the reduced model, 'new_lm', which excludes 'aroma' and 'palate', has no variable with a VIF over 5. This suggests that variable selection to reduce the number of predictors would be helpful.
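For reference, the VIF for predictor $j$ is

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2},$$

where $R_j^2$ is the R-squared from regressing predictor $j$ on all of the other predictors; a VIF above 5 therefore means more than 80% of that predictor's variance is explained by the others.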
## Ridge Prep - Splitting and Training the Data
Ridge regression shrinks correlated coefficients and is therefore a suitable response to the multicollinearity found above (though, unlike LASSO, it does not drop variables outright), but it requires some preparation.
```{r}
# creating training and test datasets
RNGkind(sample.kind = "default")
options(scipen = 4)
paste("total rows: ", nrow(beer_data))
tr_size <- 0.8
train <- sample(nrow(beer_data), tr_size * nrow(beer_data))
beer_train <- beer_data[train, ]
paste("training data rows: ", nrow(beer_train))
beer_test <- beer_data[-train, ]
paste("test data rows: ", nrow(beer_test))
paste(c("training data sample:", train[1:10]), collapse = " ")
# fitting model to training data
fit_train <- lm(overall ~ ., data = beer_train)
train_mse <- mean(fit_train$residuals^2)
c("Train MSE" = train_mse, "Train RMSE" = sqrt(train_mse))
```
## Random Split Cross-Validation Test MSE
```{r}
# fitting model to test data
pred_test <- predict(fit_train, beer_test)
test_mse_rs <- mean((beer_test$overall - pred_test)^2)
c("RSCV Test MSE" = test_mse_rs, "RSCV Test RMSE" = sqrt(test_mse_rs))
```
## Ridge
```{r}
x <- model.matrix(overall ~ ., data = beer_data)[, -1]
y <- beer_data$overall
# first creating ridge reg. without specifying lambda
ridge_reg <- glmnet(x, y, alpha = 0)
plot(ridge_reg)
# now recreating with cv.glmnet to find optimal lambda
ridge_reg_cv10 <- cv.glmnet(x, y, alpha = 0)
lambda_vec <- round(cbind("Lambda" = ridge_reg_cv10$lambda, "10FCV" = ridge_reg_cv10$cvm), digits = 3)
plot(ridge_reg_cv10)
# the best lambdas can be seen here
top_5_lambdas <- tail(lambda_vec, 5)
# top_5_lambdas
# finding best lambda and best cv_ridge
ridge_best_lambda <- ridge_reg_cv10$lambda.min
min_cv_ridge <- min(ridge_reg_cv10$cvm)
```
```{r}
round(cbind("Best Lambda" = ridge_best_lambda, "Best 10FCV" = min_cv_ridge), digits = 3)
```
```{r}
best_model <- glmnet(x, y, alpha = 0, lambda = ridge_best_lambda)
# tidying data for easier interpretation (coercing to numeric keeps the sort correct)
best_matrix <- as.matrix(coef(best_model))
best_matrix <- data.frame(
  variable = rownames(best_matrix),
  coef = as.numeric(best_matrix[, 1])
)
best_matrix <- best_matrix[order(best_matrix$coef), ]
rownames(best_matrix) <- 1:nrow(best_matrix)
best_matrix
```
The smaller values of lambda produced smaller cross-validation MSE. The best lambda, 0.027, produced the smallest CV MSE of 0.007; such a small optimal penalty means the shrinkage is mild and the coefficients stay close to those of the OLS model, so the ridge model with the best lambda introduces minimal bias.
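For reference, ridge regression minimizes the penalized criterion

$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2,$$

so as $\lambda \to 0$ the solution approaches the OLS estimate, which is why the small optimal $\lambda$ found here implies little added bias.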
## PCR
Principal components regression (PCR) provides a further check on dimensionality and multicollinearity.
```{r}
# library(pls)
pcr_fit <- pcr(overall ~ ., data = beer_data, scale = T, validation = "CV")
summary(pcr_fit)
validationplot(pcr_fit, val.type = "MSEP")
```
The validation plot and numeric output show that the 30-component model has the lowest cross-validated RMSEP; because that model uses all 30 available components, its fit matches the OLS and GLM models. The 30-component model is also the best choice for interpretability and predictive accuracy.
# Conclusions
**Consumer Preference and Beverages**
Several conclusions can be drawn regarding consumer beverage preferences. The drinks consumers prefer most are the ones they think taste the best: 'taste' is the variable with the greatest positive correlation with the overall score. Consumers seem to prefer hefeweizens and to dislike vegetable or fruit beers and drinks with a high ABV. These preferences should be taken with a grain of salt, however. The severe multicollinearity found for 'taste', 'palate', and 'aroma' indicates that these distinctions might be lost on customers, or alternatively that more fuss is made over these terms than is actually warranted.
**Statistical Analysis and Conclusions**
Overall, the weighted GLM model is the superior model because it addresses the truncated nature of the outcome variable, the overall beer rating. The weighting of observations also reduces the variance and produces a more stable model. Because the residuals follow a Gaussian distribution, the GLM produces the same results as the OLS model, which is better for interpretability and explainability. We found that taste had the largest effect on the overall rating, at 1.286, and that it is the only predictor with a positive effect. The categorical predictor with the largest effect is vegetable-style beer, at -0.017.
**Challenges and Lessons Learned**
The data set had a significant number of factor levels, which proved both challenging and interesting to work with. Many of the models were difficult to interpret because of the sheer number of variables involved. While these variables added considerable length to the outputs, they provided only limited insight. In future studies it would make sense to first run a regression to determine the most popular category, and then run regressions within that category to find the most determinative variables. Another option would be to run preliminary models, such as stepwise selection, ridge, or LASSO (a sketch of the latter follows below), to reduce the number or effect of the categorical variables. The combination of a large number of categorical variables and several continuous variables increased the difficulty of selecting models and creating visually useful plots.
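As a sketch of that last suggestion (not part of the original analysis), a LASSO fit reuses the `x` and `y` objects from the Ridge section with `alpha = 1`; unlike ridge, it shrinks some coefficients exactly to zero, flagging variables that could be dropped.
```{r lasso_sketch, eval=FALSE}
# Sketch only (not evaluated): LASSO variant of the earlier ridge fit,
# reusing the x and y model matrices from the Ridge section.
lasso_cv <- cv.glmnet(x, y, alpha = 1)
lasso_best <- glmnet(x, y, alpha = 1, lambda = lasso_cv$lambda.min)
# variables whose coefficients are shrunk exactly to zero are candidates to drop
lasso_coefs <- as.matrix(coef(lasso_best))
rownames(lasso_coefs)[lasso_coefs[, 1] == 0]
```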
Another issue is that the number of reviews per product (revcount) varies significantly. This variation proved to be statistically significant in the models, which may be a useful insight on its own. However, it may make sense to weight products by their number of reviews, or to extract a random sample to get a similarly sized and normally distributed sample for each product, so that this variation can be accounted for; one possible version of the weighting idea is sketched below.
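The following sketch is only an illustration of the weighting idea, not part of the analysis above. It assumes each row of `beer_data` holds one product's aggregated scores, in which case weighting by `revcount` gives averages built from more reviews proportionally more influence, since they estimate that product's true rating more precisely.
```{r revcount_weight_sketch, eval=FALSE}
# Sketch only (not evaluated): weight each product's aggregated row by its
# review count; this assumes one row per product, which may not match the
# layout of the shipped CSV.
fit_wtd_by_count <- lm(overall ~ ., data = beer_data, weights = revcount)
summary(fit_wtd_by_count)
```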
The models also demonstrated significant multicollinearity and heteroskedasticity, as seen in the correlation matrix and the BP test, so methods that account for both are key.