Project_Veritas

This is project Veritas! It serves as an academic project for CS5100: Foundation for Artificial Intelligence in Northeastern Univ.

Contributors

Shubhi, Emily Dutile and Linghan Xing are the first contributors.

The project is an automated approach to identify authenticate news from fake ones.

Use naive bayes classifier to tell Fake News vs Real News.

anaconda cloud (for jupyter notebooks) scikit-learn pandas

Data representation:

Target: take our dataset and represent them in our datastructure.

Steps:

Extract the words:
- convert words into lower case, extract words
- Apply stemming: reduce words to their root form: i.e. subscribed -> subscrib, subscriber -> subscrib; in this case we could use NLTK toolkit.
Build a dictionary of vecabulary, only retain unique keywards
Vectorise document, loop over the dictionary and mark the frequency of each word.
- term frequency (tf): boolean tf or raw count, or TF adjusted for length of d, or logarithmically scaled TF
- inverse document frequency(IDF): IDF measures how rare the term is across all documents in the corpus
- normalization after the tf-idf: L2 norm

- /project/naiveBayes.py
- /project/models_and_evals.ipynb
- /project/tfidf_implementation.ipynb
- /project/topicmodel.py