This project aims to build a binary classifier that detects spam and ham (not spam) emails.

shashwatjha798/Spam-Classifier

Spam-Classifier

A spam classifier that predicts whether an email is spam or ham (not spam), built with Machine Learning, Natural Language Processing (NLP), and Python.

Workflow of the project

1. Loading Dataset

2. Data Visualization

3. Data Preprocessing

  • Data cleaning (using RegEx)
  • Tokenization (using Word Tokenization)
  • Removing Stop words
  • Lemmatization (using WordNet)
  • Vectorization (using TF-IDF)
  • Label Encoding
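The preprocessing steps above can be illustrated with a minimal, standard-library-only sketch. It is not the repository's code: the sample emails, the tiny `STOP_WORDS` set, and all function names are assumptions made for illustration. Lemmatization is left out because WordNet needs an external corpus download; a project like this one would more likely rely on libraries such as NLTK and scikit-learn for these steps.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "you", "for", "of", "and", "at"}


def clean(text):
    """Lowercase the text and replace every run of non-letters with a space."""
    return re.sub(r"[^a-z]+", " ", text.lower()).strip()


def tokenize(text):
    """Split the cleaned text on whitespace and drop stop words."""
    return [tok for tok in clean(text).split() if tok not in STOP_WORDS]


def tfidf(docs):
    """Return (vocabulary, one TF-IDF row per tokenized document).

    Uses the plain, unsmoothed form tf * log(N / df).
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        rows.append([counts[tok] / total * idf[tok] for tok in vocab])
    return vocab, rows


def encode_labels(labels):
    """Map label strings to integers in sorted order (here ham -> 0, spam -> 1)."""
    mapping = {label: i for i, label in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping


# Made-up sample data, not taken from the project's dataset.
emails = ["WIN a FREE prize!!! Click now", "Are you free for lunch at 1pm?"]
labels = ["spam", "ham"]

tokens = [tokenize(email) for email in emails]
vocab, matrix = tfidf(tokens)
y, mapping = encode_labels(labels)
```

Note that a term appearing in every document (here `free`) gets an IDF of zero under this unsmoothed formula, so it contributes nothing to the feature vector. Library implementations often apply smoothing, so their values will differ.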

4. Splitting Dataset into Training and Testing set
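As a rough illustration of this step, the sketch below shuffles sample indices with a fixed seed and holds out a fraction for testing. The function name `split_dataset`, the 20% test fraction, and the seed are assumptions, not values taken from the project; scikit-learn's `train_test_split` is the usual library choice for the same job.

```python
import random


def split_dataset(features, labels, test_size=0.2, seed=42):
    """Shuffle the samples and split them into training and testing parts."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    indices = list(range(len(features)))
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(indices) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [features[i] for i in train_idx]
    x_test = [features[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Placeholder data: ten samples identified by number, alternating labels.
features = list(range(10))
labels = ["spam" if i % 2 else "ham" for i in range(10)]
x_train, x_test, y_train, y_test = split_dataset(features, labels)
```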

5. Model Training

  • Naïve Bayes
  • Random Forest
  • Support Vector Machine
  • k-Nearest Neighbors
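To give a feel for the first model in the list, here is a minimal multinomial Naïve Bayes sketch with Laplace (add-one) smoothing, written with the standard library only. It works on raw token counts rather than TF-IDF features, and the class name `NaiveBayes` and the toy training data are assumptions for illustration. It is not the repository's implementation, which more plausibly uses a library classifier.

```python
import math
from collections import Counter


class NaiveBayes:
    """Multinomial Naive Bayes over token lists, with add-one smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {tok for doc in docs for tok in doc}
        # Per-class token counts.
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc)
        class_counts = Counter(labels)
        self.log_prior = {
            c: math.log(class_counts[c] / len(labels)) for c in self.classes
        }
        self.totals = {c: sum(self.word_counts[c].values()) for c in self.classes}
        return self

    def predict_one(self, doc):
        scores = {}
        for c in self.classes:
            denom = self.totals[c] + len(self.vocab)
            score = self.log_prior[c]
            for tok in doc:
                if tok in self.vocab:  # tokens never seen in training are ignored
                    score += math.log((self.word_counts[c][tok] + 1) / denom)
            scores[c] = score
        # Pick the class with the highest log-probability score.
        return max(scores, key=scores.get)

    def predict(self, docs):
        return [self.predict_one(doc) for doc in docs]


# Toy training data (made up for this example).
train_docs = [
    ["win", "free", "prize"],
    ["free", "cash", "now"],
    ["lunch", "tomorrow"],
    ["meeting", "notes", "tomorrow"],
]
train_labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayes().fit(train_docs, train_labels)
predictions = model.predict([["free", "prize", "now"], ["lunch", "meeting"]])
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the add-one smoothing keeps a word that is absent from one class from driving that class's probability to zero.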

6. Model Evaluation

  • Cross Validation Scores
  • Accuracy on Training and Testing dataset
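Cross-validation scores and accuracy can be computed along the lines of the sketch below. It is a standard-library illustration under stated assumptions: the folds are contiguous and unshuffled, `majority_baseline` is a stand-in for a real model, and the ten labels are invented. scikit-learn's `cross_val_score` and `accuracy_score` cover the same ground in practice.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_val_scores(fit_predict, X, y, k=5):
    """Return one accuracy score per fold.

    fit_predict(X_train, y_train, X_test) must return predictions for X_test.
    """
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        X_test = [X[i] for i in test_idx]
        y_test = [y[i] for i in test_idx]
        scores.append(accuracy(y_test, fit_predict(X_train, y_train, X_test)))
    return scores


def majority_baseline(X_train, y_train, X_test):
    """Stand-in model: always predict the most common training label."""
    most_common = Counter(y_train).most_common(1)[0][0]
    return [most_common] * len(X_test)


# Invented data: seven ham samples followed by three spam samples.
X = list(range(10))
y = ["ham"] * 7 + ["spam"] * 3
scores = cross_val_scores(majority_baseline, X, y, k=5)
mean_score = sum(scores) / len(scores)
```

Because the folds here are not shuffled, the last fold contains only spam and the baseline scores zero on it. This shows why shuffled or stratified folds are usually preferred for imbalanced spam data.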

Accuracy reported for the algorithms used:

[image: accuracy reported for each algorithm]

References

  1. Machine Learning Mastery - https://machinelearningmastery.com/natural-language-processing/
  2. TF-IDF - https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089
  3. Stemming vs Lemmatization - https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
  4. Stop Words - https://kavita-ganesan.com/what-are-stop-words/#.YWFEfNpBxPY
  5. RegEx basics - https://docs.python.org/3/howto/regex.html