Skip to content

Latest commit

 

History

History
48 lines (33 loc) · 1.4 KB

NOTES.md

File metadata and controls

48 lines (33 loc) · 1.4 KB

Notes

Feature Selection

  • We can remove words associated with only 1 particular blurb
  • TfIdfVectorizer paramaters, max_df, min_df

Tweakable Hyperparamaters

  • TF-IDF vs TF
    • IDF weights down words that appear in multiple documents
      • We may not want to do that because we are considering frequency across genre

Considerations

  • Does PCA work with non-linear relationships?
  • If logistic classifier performs badly it may indicate a non-linear relationship, in which case try SVM with kernel trick or Neural Network
    • For NN, if running takes long, research if we can use Collab given project structure or Google Cloud or NYU HPC
  • to do cross validation, combine training, dev, and test into one df

Genres

  • Pick more sub-genres for promotion and remove base genres (i.e. d0)
  • Run only on d1s or others

Cloud Computing

https://cloud.google.com/tpu/pricing

Analysis

  • Confusion matrix plot
  • accuracy, mse, f1, recall
    • per class
  • Comparisons between models
    • n_components = 100 vs 200
    • max_df lowered pov when max_df = 1 / num_classes
    • no min_df lowered pov
    • using IDF gave better results
  • due to slow iteration times with svm we decided to use compliment niave bayes during feature selection
    • cmb

TODO

  • keep only min samples over all classes for each class
    • i.e. business is smallest class with 650 items so make all classes 650 sized
  • remove children's books