Information Retrieval and Text Mining project implementation

  1. Text Preprocessing
    • Tokenization
    • Lowercasing
    • Stemming
    • Stopword removal
  2. Construct dictionary & tf-idf vector
    • term dictionary
    • tf-idf unit vector
    • cosine similarity
  3. Naive Bayes classification
    • Multinomial NB classifier
    • feature selection
    • smoothing
  4. HAC clustering
    • hierarchical clustering
    • pairwise document similarity
    • similarity between clusters
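
Step 1 (text preprocessing) could look roughly like the sketch below. It is not the project's own code: the stopword list is a tiny illustrative sample, `simple_stem` is a hypothetical stand-in for a real stemmer such as Porter's, and removing stopwords before stemming is a choice made here for simplicity.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

def simple_stem(token):
    """Crude suffix stripping; a hypothetical stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Strip only when at least three characters of stem remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, lowercase, drop stopwords, then stem."""
    tokens = re.findall(r"[A-Za-z]+", text)             # tokenization
    tokens = [t.lower() for t in tokens]                # lowercasing
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    return [simple_stem(t) for t in tokens]             # stemming
```

For example, `preprocess("The cats are running in the gardens")` yields `['cat', 'runn', 'garden']`; the odd stem `runn` shows why a proper stemming algorithm is preferable to plain suffix stripping.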
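
Step 2 (dictionary, tf-idf unit vectors, cosine similarity) can be sketched as follows. The weighting `tf * log10(N / df)` and the sparse `{index: weight}` representation are assumptions made for illustration; the README does not say which tf-idf variant the project uses.

```python
import math
from collections import Counter

def build_dictionary(docs):
    """Map each term to (index, document frequency). docs are lists of tokens."""
    df = Counter(term for doc in docs for term in set(doc))
    return {term: (i, df[term]) for i, term in enumerate(sorted(df))}

def tfidf_unit_vector(doc, dictionary, n_docs):
    """Sparse tf-idf vector {term index: weight}, scaled to unit length."""
    tf = Counter(t for t in doc if t in dictionary)
    vec = {dictionary[t][0]: count * math.log10(n_docs / dictionary[t][1])
           for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    # An all-zero vector (every term occurs in every document) is returned unscaled.
    return {i: w / norm for i, w in vec.items()} if norm > 0 else vec

def cosine(u, v):
    """For unit vectors, cosine similarity reduces to the dot product."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(i, 0.0) for i, w in u.items())
```

Because the vectors are normalized once when they are built, each pairwise comparison needs only a sparse dot product.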
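
Step 3 (multinomial Naive Bayes) might be sketched like this. Add-one (Laplace) smoothing is assumed, since the README does not name the smoothing method. Feature selection is not implemented here; the optional `vocabulary` argument is a hypothetical hook where a selected term set would be passed in.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, vocabulary=None):
    """docs: lists of tokens; labels: one class per doc.
    Returns (log priors, log conditional probabilities, vocabulary)."""
    vocab = set(vocabulary) if vocabulary is not None else {t for d in docs for t in d}
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(t for t in doc if t in vocab)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_cond = {}
    for c in class_docs:
        total = sum(term_counts[c].values())
        # Add-one smoothing: no term gets zero probability in any class.
        log_cond[c] = {t: math.log((term_counts[c][t] + 1) / (total + len(vocab)))
                       for t in vocab}
    return log_prior, log_cond, vocab

def classify(doc, model):
    """Return the class with the highest log posterior; unknown terms are skipped."""
    log_prior, log_cond, vocab = model
    scores = {c: log_prior[c] + sum(log_cond[c][t] for t in doc if t in vocab)
              for c in log_prior}
    return max(scores, key=scores.get)
```

Working in log space avoids floating-point underflow when many small probabilities are multiplied for long documents.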
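
Step 4 (HAC clustering) can be illustrated with a naive sketch that takes a precomputed pairwise document similarity matrix. The `linkage` parameter is an assumption introduced here to show two common cluster-similarity definitions (single-link and complete-link); the README does not state which one the project uses.

```python
def hac(sim, k, linkage=max):
    """Merge the most similar pair of clusters until k clusters remain.

    sim: symmetric n x n similarity matrix (list of lists).
    linkage=max gives single-link, linkage=min gives complete-link.
    """
    clusters = [[i] for i in range(len(sim))]  # start with one cluster per document
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Cluster similarity is derived from document-level similarities.
                s = linkage(sim[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]
```

This version rescans all cluster pairs on every merge, which is fine for small collections; larger ones typically use a priority queue or incremental similarity updates.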