- Text Preprocessing
  - Tokenization
  - Lowercasing
  - Stemming
  - Stopword removal
- Construct dictionary & tf-idf vector
  - term dictionary
  - tf-idf unit vector
  - cosine similarity
- Naive Bayes classification
  - Multinomial NB classifier
  - feature selection
  - smoothing
- HAC clustering
  - hierarchical clustering
  - pair-wise document similarity
  - similarity between clusters
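
The first two stages of the outline (preprocessing, then dictionary, tf-idf unit vectors, and cosine similarity) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the stopword list, the regex tokenizer, the `log10(N / df)` idf variant, and all function names are hypothetical, and stemming is left out (a Porter stemmer would normally run right after lowercasing).

```python
import math
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list, not the one used by the project.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "and", "on", "in", "to"}


def preprocess(text):
    """Tokenize on runs of letters, lowercase, and drop stopwords (no stemming)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def build_dictionary(docs_tokens):
    """Map each term to (index, document frequency), terms in alphabetical order."""
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))  # count each term once per document
    return {term: (i, df[term]) for i, term in enumerate(sorted(df))}


def tfidf_unit_vector(tokens, dictionary, n_docs):
    """Sparse tf-idf vector {term index: weight}, scaled to unit length.

    Assumption: weight = tf * log10(N / df); other tf-idf variants exist.
    """
    tf = Counter(t for t in tokens if t in dictionary)
    vec = {}
    for term, count in tf.items():
        idx, df = dictionary[term]
        vec[idx] = count * math.log10(n_docs / df)
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {i: w / norm for i, w in vec.items()} if norm > 0 else {}


def cosine(u, v):
    """For unit vectors, cosine similarity reduces to the dot product."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter sparse vector
    return sum(w * v.get(i, 0.0) for i, w in u.items())


docs = [
    "The cat sat on the mat.",
    "A cat and a dog are friends.",
    "Stock markets fell sharply today.",
]
tokens = [preprocess(d) for d in docs]
dictionary = build_dictionary(tokens)
vectors = [tfidf_unit_vector(t, dictionary, len(docs)) for t in tokens]

print(cosine(vectors[0], vectors[1]))  # small positive value: both mention "cat"
print(cosine(vectors[0], vectors[2]))  # zero: no shared terms
```

Storing each document as a unit-length sparse vector means the pair-wise document similarities needed for HAC clustering are plain dot products, so the same `cosine` helper could be reused when comparing clusters.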