A spam classifier that predicts whether an email is spam or ham (not spam), built with Machine Learning, Natural Language Processing (NLP), and Python.
Workflow of the project
- Data cleaning (using RegEx)
- Tokenization (using Word Tokenization)
- Removing Stop words
- Lemmatization (using WordNet)
- Vectorization (using TF-IDF)
- Label Encoding
- Naïve Bayes
- Random Forest
- Support Vector Machine
- k-Nearest Neighbors
- Cross Validation Scores
- Accuracy on Training and Testing datasets
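
The workflow above can be sketched end to end with the Python standard library alone. This is a minimal illustration, not the project's actual code: the stop-word set, the four sample emails, and every function name (`preprocess`, `tfidf_vectors`, `encode_labels`, `train_naive_bayes`, `predict`) are assumptions made for the example. Lemmatization is left as a comment because a WordNet lemmatizer needs an external corpus download. Only Naïve Bayes is shown; Random Forest, Support Vector Machine, k-Nearest Neighbors, and cross-validation would normally come from a machine learning library and are not reproduced here.

```python
import math
import re
from collections import Counter

# Assumption: a tiny hand-written stop-word set stands in for a full list.
STOP_WORDS = {"a", "an", "the", "to", "is", "you", "your", "for", "are", "we", "now"}


def preprocess(text):
    """Clean with a regex, tokenize on whitespace, drop stop words."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())  # keep letters only
    tokens = text.split()  # simple word tokenization
    # A WordNet-based lemmatizer would normalize tokens here (omitted).
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # Term frequency times a smoothed IDF: ln((1 + n) / (1 + df)) + 1.
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def encode_labels(labels):
    """Map class names to integers in sorted order, e.g. ham -> 0, spam -> 1."""
    mapping = {c: i for i, c in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping


def train_naive_bayes(vectors, y):
    """Multinomial Naive Bayes using TF-IDF weights as fractional counts."""
    vocab = {t for v in vectors for t in v}
    model = {}
    for c in set(y):
        rows = [v for v, label in zip(vectors, y) if label == c]
        totals = Counter()
        for v in rows:
            totals.update(v)  # accumulate per-term weights for this class
        denom = sum(totals.values()) + len(vocab)  # Laplace (add-one) smoothing
        model[c] = {
            "log_prior": math.log(len(rows) / len(y)),
            "log_like": {t: math.log((totals[t] + 1) / denom) for t in vocab},
        }
    return model


def predict(model, tokens):
    """Return the class with the highest log-probability score."""
    scores = {
        # Tokens outside the training vocabulary contribute nothing.
        c: p["log_prior"] + sum(p["log_like"].get(t, 0.0) for t in tokens)
        for c, p in model.items()
    }
    return max(scores, key=scores.get)


# Hypothetical sample data, for illustration only.
emails = [
    ("Win a FREE prize now!!! Click here", "spam"),
    ("Claim your free cash prize today", "spam"),
    ("Are we still meeting for lunch tomorrow?", "ham"),
    ("Please review the attached project report", "ham"),
]
docs = [preprocess(text) for text, _ in emails]
y, label_map = encode_labels([label for _, label in emails])
X = tfidf_vectors(docs)
model = train_naive_bayes(X, y)
prediction = predict(model, preprocess("Free prize, click now"))
```

With real data, the same steps would run on a train/test split, and accuracy would be the fraction of held-out emails whose predicted label matches the encoded true label.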
References
- Machine Learning Mastery - https://machinelearningmastery.com/natural-language-processing/
- TF-IDF - https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089
- Stemming vs Lemmatization - https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
- Stop Words - https://kavita-ganesan.com/what-are-stop-words/#.YWFEfNpBxPY
- RegEx basics - https://docs.python.org/3/howto/regex.html