University of New Brunswick Fall-2018 CS6765: Natural Language Processing
This repository contains the Python code for the Fall term assignments.
The code was developed with Python 2.7 using only built-in modules; numpy and nltk are not used anywhere.
sklearn is used only in Assignment 3, for Logistic Regression.
No | Python-file | Usage |
---|---|---|
1 | tokenize.py<br>count.py | python tokenize.py FILE > FILE.tokens<br>python count.py FILE.tokens > FILE.freqs |
2 | lm.py<br>perplexity.py | python lm.py MODEL TRAIN_FILE TEST_FILE > OUTPUT<br>python perplexity.py OUTPUT |
3 | classify.py<br>score.py | python classify.py METHOD TRAIN_DOCS TRAIN_CLASSES TEST_DOCS > PREDICTED_CLASSES<br>python score.py PREDICTED_CLASSES TRUE_CLASSES |
4 | tag.py<br>accuracy.py | python tag.py TRAIN_FILE TEST_FILE METHOD > SYSTEM_OUTPUT<br>python accuracy.py TRUE_TAGS SYSTEM_OUTPUT |
5 | chatbot.py | python chatbot.py METHOD |
No | Arguments | File-Location (in Individual Assignment folder) |
---|---|---|
1 | FILE | Data/tweets-en.txt.gz |
2 | MODEL<br>TRAIN_FILE<br>TEST_FILE | 1 or 2 or interp<br>Data/reuters-train.txt<br>Data/reuters-dev.txt |
3 | METHOD<br>TRAIN_DOCS<br>TRAIN_CLASSES<br>TEST_DOCS<br>TRUE_CLASSES | baseline or lr or lexicon or nb or nbbin<br>Data/train.docs.txt<br>Data/train.classes.txt<br>Data/dev.docs.txt<br>Data/dev.classes.txt |
4 | TRAIN_FILE<br>TEST_FILE<br>METHOD<br>TRUE_TAGS | Data/train.en.txt<br>Data/dev.en.words.txt<br>baseline or hmm<br>Data/dev.en.tags.txt |
5 | METHOD | overlap or w2v or both |
Assignment 2: - MODEL
- 1 represents Unigram (with Add-1 smoothing)
- 2 represents Bigram (with Add-k smoothing)
- interp represents Interpolated (both Unigram and Bigram)
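The three MODEL options can be illustrated with a minimal sketch that uses only built-in modules. This is not the code in lm.py: the function names, the `<s>`/`</s>` sentence markers, the interpolation weight `lam`, and the default values of `k` are all assumptions chosen for illustration.

```python
from __future__ import division  # keeps the sketch valid on Python 2.7
from collections import Counter


def train_counts(sentences):
    """Count unigrams and bigrams over tokenized sentences (hypothetical helper)."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + list(tokens) + ["</s>"]  # assumed boundary markers
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def unigram_prob(word, unigrams, vocab_size, k=1.0):
    """Add-k unigram estimate; k=1.0 gives add-1 smoothing."""
    total = sum(unigrams.values())
    return (unigrams[word] + k) / (total + k * vocab_size)


def bigram_prob(prev, word, unigrams, bigrams, vocab_size, k=0.01):
    """Add-k bigram estimate of P(word | prev)."""
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * vocab_size)


def interp_prob(prev, word, unigrams, bigrams, vocab_size, lam=0.5, k=0.01):
    """Linear interpolation of the bigram and unigram estimates."""
    p_bigram = bigram_prob(prev, word, unigrams, bigrams, vocab_size, k)
    p_unigram = unigram_prob(word, unigrams, vocab_size)
    return lam * p_bigram + (1 - lam) * p_unigram


sentences = [["the", "cat"], ["the", "dog"]]
unigrams, bigrams = train_counts(sentences)
vocab_size = len(unigrams)
print(interp_prob("<s>", "the", unigrams, bigrams, vocab_size))
```

In practice `lam` and `k` would be tuned on held-out data such as the dev file.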
Assignment 3: - METHOD
- baseline represents Most-Frequent-Class-Baseline
- lr represents Logistic Regression (used from sklearn)
- lexicon represents Sentiment Lexicon containing + and - words
- nb represents Naive Bayes Model (with add-k smoothing)
- nbbin represents Binarized Naive Bayes
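As a rough illustration of the nb and nbbin methods, the sketch below trains a multinomial Naive Bayes model with add-k smoothing and optionally binarizes each document (each word counted at most once per document). It is not the code in classify.py; the function names, the model tuple layout, and the choice to skip words unseen in training are assumptions.

```python
from __future__ import division  # keeps the sketch valid on Python 2.7
from collections import Counter, defaultdict
import math


def train_nb(docs, labels, k=1.0, binarize=False):
    """Collect per-class word counts (hypothetical helper, not from classify.py)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        if binarize:
            tokens = set(tokens)  # count each word once per document
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab, k, binarize


def predict_nb(tokens, model):
    """Return the class with the highest smoothed log-probability."""
    class_counts, word_counts, vocab, k, binarize = model
    if binarize:
        tokens = set(tokens)
    n_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(class_counts):
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / n_docs)  # log prior
        for word in tokens:
            if word not in vocab:
                continue  # assumption: ignore words never seen in training
            score += math.log(
                (word_counts[label][word] + k) / (total + k * len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


docs = [["good", "good", "film"], ["bad", "film"]]
labels = ["positive", "negative"]
model = train_nb(docs, labels, k=1.0, binarize=True)
print(predict_nb(["good", "film"], model))
```

Working in log space avoids underflow when documents are long.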
Assignment 4: - METHOD
- baseline represents Most-Frequent-Tag-Baseline
- hmm represents Hidden Markov Model (Bigram with add-k smoothing) and Viterbi Algorithm
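The hmm method can be sketched as a bigram HMM with add-k smoothed transition and emission estimates, decoded with the Viterbi algorithm. The code below is an illustration under stated assumptions rather than the contents of tag.py: the function names, the `<s>` start state, the value of `k`, and the single extra vocabulary slot reserved for unknown words are all hypothetical, and no end-of-sentence transition is modeled.

```python
from __future__ import division  # keeps the sketch valid on Python 2.7
from collections import Counter
import math


def train_hmm(tagged_sentences):
    """Count tag transitions and word emissions (hypothetical helper)."""
    trans, emit, tag_counts = Counter(), Counter(), Counter()
    vocab = set()
    for sentence in tagged_sentences:
        prev = "<s>"  # assumed start state
        tag_counts[prev] += 1
        for word, tag in sentence:
            trans[(prev, tag)] += 1
            emit[(tag, word)] += 1
            tag_counts[tag] += 1
            vocab.add(word)
            prev = tag
    tags = sorted(t for t in tag_counts if t != "<s>")
    return trans, emit, tag_counts, tags, vocab


def viterbi(words, model, k=0.1):
    """Return the most probable tag sequence under the smoothed bigram HMM."""
    trans, emit, tag_counts, tags, vocab = model
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra slot for unknown words (assumption)

    def log_trans(prev, tag):
        return math.log((trans[(prev, tag)] + k) / (tag_counts[prev] + k * n_tags))

    def log_emit(tag, word):
        return math.log((emit[(tag, word)] + k) / (tag_counts[tag] + k * n_words))

    if not words:
        return []
    # scores[i][t]: best log-probability of any tag path ending in t at position i
    scores = [dict((t, log_trans("<s>", t) + log_emit(t, words[0])) for t in tags)]
    back = [{}]
    for i in range(1, len(words)):
        scores.append({})
        back.append({})
        for t in tags:
            best_prev = max(tags, key=lambda p: scores[i - 1][p] + log_trans(p, t))
            scores[i][t] = (scores[i - 1][best_prev] + log_trans(best_prev, t)
                            + log_emit(t, words[i]))
            back[i][t] = best_prev
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: scores[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path


train = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VB")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VB")],
]
print(viterbi(["the", "cat", "barks"], train_hmm(train)))
```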
Assignment 5: - METHOD
- overlap represents Chatbot responses based on the word overlap
- w2v represents the response with the highest Cosine value (computed from pre-trained fastText vectors)
- both represents both responses from overlap and w2v with their Cosine values
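The two scoring ideas behind overlap and w2v can be sketched as follows. This is not the code in chatbot.py: the function names are hypothetical, the tiny `vectors` dictionary stands in for the pre-trained fastText vectors, and representing a sentence by the average of its word vectors is an assumption, since the README does not say how sentence vectors are built.

```python
from __future__ import division  # keeps the sketch valid on Python 2.7
import math


def overlap_score(query_tokens, candidate_tokens):
    """Number of distinct words shared by the query and a candidate."""
    return len(set(query_tokens) & set(candidate_tokens))


def average_vector(tokens, vectors):
    """Average the vectors of known words; None if no word is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_response(query_tokens, candidates, vectors):
    """Return (candidate, cosine) for the candidate closest to the query."""
    query_vec = average_vector(query_tokens, vectors)
    best, best_score = None, float("-inf")
    if query_vec is None:
        return best, 0.0
    for candidate in candidates:
        cand_vec = average_vector(candidate, vectors)
        if cand_vec is None:
            continue
        score = cosine(query_vec, cand_vec)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


# Toy 2-dimensional vectors standing in for fastText embeddings.
vectors = {"hello": [1.0, 0.0], "hi": [0.9, 0.1], "bye": [0.0, 1.0]}
print(best_response(["hello"], [["bye"], ["hi"]], vectors))
```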