This project applies machine learning methods to predict the star rating that accompanies a Yelp review.
Truncated SVD? | SGD | XGBoost |
---|---|---|
yes | TBD | TBD |
no | 0.609 | TBD |
-
Fitting 3 folds for each of 96 candidates, totalling 288 fits
- [Parallel(n_jobs=20)]: Done 288 out of 288 | elapsed: 433.7min finished
- done in 27297.536s
-
Best score (Accuracy): 0.609
-
Best parameters set:
- tfidf__norm: 'l2'
- tfidf__use_idf: True
- vect__max_df: 1.0
- vect__max_features: 50000
- vect__ngram_range: (1, 2)
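The candidate and fit counts in the log above follow from the size of the parameter grid: every combination of values is one candidate, and each candidate is fitted once per fold. The search grid itself is not recorded here, so the values below are hypothetical, chosen only so that they contain the reported best parameters and multiply out to 96 candidates. A minimal sketch of the arithmetic:

```python
from itertools import product

# Hypothetical grid: the real search space is not recorded in this document.
# It is chosen to include the reported best values and to give 96 combinations.
param_grid = {
    "vect__max_df": (0.5, 0.75, 1.0),
    "vect__max_features": (None, 5000, 10000, 50000),
    "vect__ngram_range": ((1, 1), (1, 2)),
    "tfidf__use_idf": (True, False),
    "tfidf__norm": ("l1", "l2"),
}
n_folds = 3  # "Fitting 3 folds" in the log

# One candidate per combination of parameter values.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_candidates = len(candidates)

# Each candidate is fitted once per cross-validation fold.
n_fits = n_candidates * n_folds

print(f"Fitting {n_folds} folds for each of {n_candidates} candidates, totalling {n_fits} fits")
```

The same arithmetic applies to the other searches in this document: 48 candidates give 144 fits and 10 candidates give 30 fits, each with 3 folds.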
accuracy = 0.6757
confusion matrix
True rating | Pred 1 | Pred 2 | Pred 3 | Pred 4 | Pred 5 |
---|---|---|---|---|---|
1 | 98898 | 3356 | 2808 | 1147 | 2372 |
2 | 30192 | 14268 | 18716 | 5920 | 2730 |
3 | 7830 | 7128 | 39343 | 39461 | 9492 |
4 | 2253 | 2060 | 11179 | 103563 | 87068 |
5 | 2253 | 567 | 1572 | 31701 | 305195 |
classification report
Rating | precision | recall | f1-score | support |
---|---|---|---|---|
1 | 0.70 | 0.91 | 0.79 | 108581 |
2 | 0.52 | 0.20 | 0.29 | 71826 |
3 | 0.53 | 0.38 | 0.44 | 103254 |
4 | 0.57 | 0.50 | 0.53 | 206123 |
5 | 0.75 | 0.90 | 0.82 | 340846 |
avg/total | 0.65 | 0.68 | 0.65 | 830630 |
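The `avg/total` row appears to be a support-weighted mean of the per-class scores, which is what older scikit-learn releases printed in that row. A quick check using the rounded values from the table; since the inputs are already rounded to two decimals, the results can only be expected to agree to about that precision:

```python
# Per-class scores and support copied from the classification report above.
report = {
    1: {"precision": 0.70, "recall": 0.91, "f1": 0.79, "support": 108581},
    2: {"precision": 0.52, "recall": 0.20, "f1": 0.29, "support": 71826},
    3: {"precision": 0.53, "recall": 0.38, "f1": 0.44, "support": 103254},
    4: {"precision": 0.57, "recall": 0.50, "f1": 0.53, "support": 206123},
    5: {"precision": 0.75, "recall": 0.90, "f1": 0.82, "support": 340846},
}

total_support = sum(row["support"] for row in report.values())


def weighted_mean(metric):
    """Support-weighted mean of one metric across the five classes."""
    weighted_sum = sum(row[metric] * row["support"] for row in report.values())
    return weighted_sum / total_support


averages = {metric: weighted_mean(metric) for metric in ("precision", "recall", "f1")}

for name, value in averages.items():
    print(f"{name}: {value:.2f}")
```

Because 5-star reviews make up about 41% of the support, the averages sit close to the scores for that class and hide the weak recall on 2-star and 3-star reviews.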
-
SGD:
-
Fitting 3 folds for each of 48 candidates, totalling 144 fits
- [Parallel(n_jobs=10)]: Done 30 tasks | elapsed: 55.6min
- [Parallel(n_jobs=10)]: Done 144 out of 144 | elapsed: 223.8min finished
- done in 13895.965s
-
Best score (AUC): 0.994
-
Best parameters set:
- clf__alpha: 1e-06
- clf__n_iter: 10
- clf__penalty: 'l2'
- tfidf__norm: 'l2'
- vect__max_features: None
-
XGBoost:
-
Fitting 3 folds for each of 10 candidates, totalling 30 fits
- [Parallel(n_jobs=5)]: Done 30 out of 30 | elapsed: 727.0min finished
- done in 60602.505s
-
Best score (AUC): 0.979
-
Best parameters set:
- clf__colsample_bytree: 0.93882416583162387
- clf__gamma: 5.2248856032270332
- clf__learning_rate: 0.079260905486493102
- clf__max_depth: 10
- clf__min_child_weight: 15.596298451557294
- clf__n_estimators: 981
- clf__reg_alpha: 6.1698510973535718
- clf__subsample: 0.90673496002706067
-
AUC
- 0.96
-
accuracy
- 0.9726
-
confusion_matrix
-
array([[168309, 11536], [ 8349, 538963]])
-
classification_report
Class | precision | recall | f1-score | support |
---|---|---|---|---|
0 | 0.95 | 0.94 | 0.94 | 179845 |
1 | 0.98 | 0.98 | 0.98 | 547312 |
avg / total | 0.97 | 0.97 | 0.97 | 727157 |
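The accuracy and the per-class precision, recall and F1 above can be recomputed from the confusion matrix. The sketch below assumes scikit-learn's layout, with rows as true classes and columns as predicted classes; that is an assumption, though the row sums do match the `support` column. Plain Python, no libraries needed:

```python
# Confusion matrix copied from above; assumed layout: rows = true, columns = predicted.
cm = [[168309, 11536],
      [8349, 538963]]

total = sum(sum(row) for row in cm)

# Accuracy: correctly classified samples (the diagonal) over all samples.
accuracy = (cm[0][0] + cm[1][1]) / total


def precision(k):
    """Correct predictions of class k over all predictions of class k (column sum)."""
    return cm[k][k] / sum(row[k] for row in cm)


def recall(k):
    """Correct predictions of class k over all true members of class k (row sum)."""
    return cm[k][k] / sum(cm[k])


def f1(k):
    """Harmonic mean of precision and recall for class k."""
    p, r = precision(k), recall(k)
    return 2 * p * r / (p + r)


print(f"accuracy: {accuracy:.4f}")
for k in (0, 1):
    print(f"class {k}: precision={precision(k):.2f} recall={recall(k):.2f} "
          f"f1={f1(k):.2f} support={sum(cm[k])}")
```

Class 1 accounts for roughly three quarters of the samples, so always predicting class 1 would already score about 0.75 accuracy; the per-class figures for class 0 are the more informative ones.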
-
SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1, eta0=0.0, fit_intercept=True, l1_ratio=0.15, learning_rate='optimal', loss='log', n_iter=5, n_jobs=1, penalty='l2', power_t=0.5, random_state=None, shuffle=True, verbose=0, warm_start=False)
-
AUC: 0.73
-
XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1, gamma=0, learning_rate=0.001, max_delta_step=0, max_depth=10, min_child_weight=1, missing=None, n_estimators=500, nthread=-1, objective='binary:logistic', reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=0, silent=True, subsample=1)
-
AUC: 0.60