A simple machine learning approach to detect the sender of an email from its mail body, using the famous Enron dataset
- python
- anaconda
- scikit learn
- other dependencies
- clone this repository:
git clone https://github.com/nowshad-sust/enron-sender-detection.git
- download the Enron dataset from https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tgz
- extract the dataset (maildir) into the cloned project folder
- create a folder named remail in the project directory
- open a terminal or cmd in the project directory
- run the copy_sent_mails.py script with the command:
python copy_sent_mails.py
This should make a directory named remail in the project folder and copy all the sent mails from the original dataset directory into it.
- now run naive_bayes_pipeline.py with the command:
python naive_bayes_pipeline.py
This should print a number, which is the validation success rate.
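
For orientation, the copy step above could look roughly like the sketch below. This is not the repository's copy_sent_mails.py. The function name copy_sent_mails, the folder names in SENT_FOLDERS, and the per-user layout under remail are all assumptions made for illustration.

```python
import shutil
from pathlib import Path

# Assumption: names of the folders that hold sent mail inside each user's
# directory in the extracted maildir. Adjust to match the real dataset.
SENT_FOLDERS = ("sent", "sent_items", "_sent_mail")


def copy_sent_mails(src_root, dst_root, sent_folders=SENT_FOLDERS):
    """Copy every user's sent-mail files to dst_root/<user>/ and return the count."""
    src_root, dst_root = Path(src_root), Path(dst_root)
    copied = 0
    for user_dir in sorted(p for p in src_root.iterdir() if p.is_dir()):
        for folder in sent_folders:
            sent_dir = user_dir / folder
            if not sent_dir.is_dir():
                continue
            target = dst_root / user_dir.name
            target.mkdir(parents=True, exist_ok=True)
            for mail in sorted(p for p in sent_dir.iterdir() if p.is_file()):
                # Prefix with the folder name so files that share a name
                # in different sent folders do not overwrite each other.
                shutil.copy(mail, target / f"{folder}_{mail.name}")
                copied += 1
    return copied
```

Under these assumptions, calling copy_sent_mails("maildir", "remail") from the project directory would fill remail with one sub-folder per sender, which gives the later classification step its labels.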
- Naive Bayes classifier ~ 0.46
- SVM ~ 0.79
- SVM with grid search ~ 0.85
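
The three setups listed above can be sketched with scikit-learn as follows. This is a minimal illustration, not the code in naive_bayes_pipeline.py. The choice of TfidfVectorizer and LinearSVC, the parameter grid, the split size, and the helper names build_models and validation_scores are assumptions, so the scores it produces will not necessarily match the numbers above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def build_models():
    """Return the three text classifiers, each fed by TF-IDF features (assumed)."""
    naive_bayes = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])
    svm = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])
    # Hypothetical grid: the parameters actually searched are not known.
    svm_grid = GridSearchCV(
        Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())]),
        param_grid={
            "tfidf__ngram_range": [(1, 1), (1, 2)],
            "clf__C": [0.1, 1.0, 10.0],
        },
        cv=3,
    )
    return {
        "Naive Bayes": naive_bayes,
        "SVM": svm,
        "SVM with grid search": svm_grid,
    }


def validation_scores(bodies, senders, test_size=0.25, seed=0):
    """Fit each model on a training split and return its accuracy on the held-out split."""
    x_train, x_val, y_train, y_val = train_test_split(
        bodies, senders, test_size=test_size, random_state=seed, stratify=senders
    )
    scores = {}
    for name, model in build_models().items():
        model.fit(x_train, y_train)
        scores[name] = model.score(x_val, y_val)
    return scores
```

Here bodies is a list of mail body strings and senders is the matching list of sender labels, for example the sub-folder names under remail. Grid search refits the pipeline for every parameter combination and fold, so it is the slowest of the three.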