This work explores deep learning techniques for detecting text-based and image-based spam.
- Project Description
- Dataset Description
- Project Structure
- Getting Started
- Results
- Contributions
- Contact
The rapid growth of the internet has opened loopholes in cyberspace that are under constant threat of exploitation by hackers. As email has become more important in daily life, spam (undesired email) has become one of the most common of these threats. Hackers and spammers use novel techniques to deceive both novice and experienced internet users, and spam consumes vital computation power and resources, so filtering it out is important. Spam is now transmitted through different media, such as images and SMS, and both text and image spam messages are crafted to slip past traditional spam filters and firewall technologies. In this work, we develop frameworks for classifying both text and image content as spam or ham. We follow a novel implementation strategy for the text and image spam classifier models that attempts to cover the loopholes in existing techniques. We use a spam score as the metric for judging whether a given message or image is spam or ham. Compared with several classical models, our model performs significantly better.
Email Spam Dataset
The email spam dataset is taken from Kaggle. The dataset has 5572 rows and 2 features.
These two features are:
- Message – It consists of the email message
- Category – It classifies whether the email message is spam/ham.
Of the 5572 email messages, 4825 are in the category ‘ham’ (approximately 86.6% of all messages) and 747 are in the category ‘spam’ (approximately 13.4% of all messages).
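The class proportions quoted above can be reproduced with a few lines of Python. This is a minimal sketch, not the notebook's code: the `class_distribution` helper is hypothetical, and the label list is rebuilt from the counts stated above rather than read from the Kaggle CSV.

```python
from collections import Counter


def class_distribution(labels):
    """Return {label: (count, percent_of_total)} for an iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        label: (count, round(100 * count / total, 1))
        for label, count in counts.items()
    }


# Labels rebuilt from the counts reported for the email dataset
# (assumption: in practice these would come from the Category column).
labels = ["ham"] * 4825 + ["spam"] * 747

for label, (count, percent) in class_distribution(labels).items():
    print(f"{label}: {count} messages ({percent}%)")
```

With roughly one spam message for every six or seven ham messages, the dataset is clearly imbalanced, which is worth keeping in mind when reading accuracy figures.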
ISH Dataset
This dataset contains both spam and ham images in JPEG format, collected from original emails. It is a publicly available dataset that can be found on the Northwestern University website. There are 810 ham and 929 spam images in total. The number of unique spam and ham images is 879 and 810 respectively.
Dredze Dataset
This dataset contains 3 sets of images. Personal Ham (PHam) has 2,021 images, of which 1,517 are unique. Personal Spam (PSpam) has 3,298 images, of which 1,274 are unique. Finally, the Spam Archive (SpamArch) has 16,028 files in various formats such as JPEG, PNG, and GIF, of which 3,039 are unique images.
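The image datasets distinguish total images from unique images. This README does not say how uniqueness was determined, so the sketch below rests on an assumption: it treats two files as duplicates only when their bytes are identical, which is the simplest possible check and may differ from the criterion the dataset authors used (for example, near-duplicate detection). The `count_unique_files` helper is hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def count_unique_files(directory):
    """Return (total_files, unique_files), comparing files by the SHA-256 of their bytes."""
    digests = []
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file():
            digests.append(hashlib.sha256(path.read_bytes()).hexdigest())
    return len(digests), len(set(digests))


# Self-contained demo on throwaway files; a real run would point at an image folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.jpg").write_bytes(b"image-one")
    (Path(tmp) / "b.jpg").write_bytes(b"image-one")  # byte-identical duplicate
    (Path(tmp) / "c.jpg").write_bytes(b"image-two")
    print(count_unique_files(tmp))  # (3, 2)
```

Removing duplicates before splitting the data helps keep the same image from appearing in both the training and the test set.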
spamIO
.
│
├── Image_Feature_spam.ipynb
├── Text_Based_Spam.ipynb
├── .ipynb_checkpoints/
Follow these instructions to set up the project.
The project was created using:
- Jupyter Notebook
- Python version: 3.9.0
- Tensorflow version: 2.8.0
- Keras version: 2.8.0
- Numpy version: 1.22.3
- Pandas version: 1.2.4
- Matplotlib version: 3.4.1
- MissingNo: 0.5.0
- Seaborn: 0.11.2
- Sklearn
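To confirm that an environment matches the versions listed above, a short check along the following lines may help. It is a sketch under assumptions: the `check_versions` helper is hypothetical, and the distribution names (in particular `scikit-learn` for "Sklearn") are assumed PyPI names rather than something this README states.

```python
from importlib import metadata

# Versions listed above. The distribution names are assumptions
# (for example, "Sklearn" is taken to mean the scikit-learn package).
EXPECTED = {
    "tensorflow": "2.8.0",
    "keras": "2.8.0",
    "numpy": "1.22.3",
    "pandas": "1.2.4",
    "matplotlib": "3.4.1",
    "missingno": "0.5.0",
    "seaborn": "0.11.2",
    "scikit-learn": None,  # no version is given for Sklearn
}


def check_versions(expected):
    """Map each package name to (installed_version_or_None, status)."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = (None, "missing")
            continue
        if wanted is None or installed == wanted:
            report[name] = (installed, "ok")
        else:
            report[name] = (installed, "differs")
    return report


for name, (installed, status) in check_versions(EXPECTED).items():
    print(f"{name:<14} installed={installed} status={status}")
```

A status of "differs" does not necessarily mean the notebooks will fail; it only flags a version other than the one the project was created with.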
- Create a virtual environment in conda prompt using the following commands:
  - Make a virtual environment
    $ conda create -n [ENV_NAME] python=[PYTHON_VERSION]
    where ENV_NAME is the name of the virtual environment and PYTHON_VERSION is the version of python.
  - Activate the virtual environment
    $ conda activate [ENV_NAME]
- Add the virtual environment in the jupyter notebook using the following commands:
  - Install the ipykernel
    $ pip install --user ipykernel
  - Manually add the kernel
    $ python -m ipykernel install --user --name=[ENV_NAME]
    where ENV_NAME is the name of the virtual environment.
- Clone the project repo into the virtual environment
  $ git clone https://github.com/ReubenJoe/spamIO.git
- Download the required datasets from the links provided in the Dataset Description.
- Place the dataset .csv file at the same level as the .ipynb file (as shown in the project structure).
- Execute the notebooks using the following commands:
  $ ipython --TerminalIPythonApp.file_to_run='Image_Feature_spam.ipynb'
  $ ipython --TerminalIPythonApp.file_to_run='Text_Based_Spam.ipynb'
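Before executing the notebooks, it can be useful to confirm that the dataset file sits next to them as the setup steps require. The following is a minimal sketch: the `missing_files` helper is hypothetical, and `spam.csv` is a placeholder name, since this README does not state the dataset's file name.

```python
import tempfile
from pathlib import Path

# Notebook names as listed in the project structure.
NOTEBOOKS = ["Image_Feature_spam.ipynb", "Text_Based_Spam.ipynb"]


def missing_files(project_dir, dataset_csv):
    """List the expected files that are absent from project_dir."""
    expected = NOTEBOOKS + [dataset_csv]
    return [name for name in expected if not (Path(project_dir) / name).is_file()]


# Demo in a throwaway folder; "spam.csv" is a hypothetical dataset file name.
with tempfile.TemporaryDirectory() as tmp:
    for name in NOTEBOOKS:
        (Path(tmp) / name).touch()
    print(missing_files(tmp, "spam.csv"))  # the CSV has not been placed yet
```

In a real checkout, the call would be `missing_files(".", "<your dataset file>.csv")` from inside the cloned `spamIO` folder; an empty list means the layout matches the project structure.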
Upon implementing various approaches for text classification, such as Naïve Bayes, SVM, CNN, BPNN, and LSTM, we observed that SVM with a linear kernel had the highest accuracy, 99.15%, after fixing the class imbalance. The baseline SVM, which does not address the class imbalance, follows at 99.10%, then CNN (98.35%), Naïve Bayes and BPNN (97.27% each), and LSTM (94.33%). The results are shown in the table below.
Model | Accuracy (%) |
---|---|
SVM (fixing class imbalance) | 99.15 |
SVM (without fixing class imbalance) | 99.10 |
CNN | 98.35 |
Naïve Bayes | 97.27 |
LSTM | 94.33 |
BPNN | 97.27 |
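The best text result is reported after fixing the class imbalance, but this README does not say which technique was used. As an illustration only, the sketch below shows random oversampling, one common option among several (class weighting and undersampling are others). The `oversample_minority` helper and the toy messages are hypothetical and are not taken from the notebook.

```python
import random
from collections import Counter


def oversample_minority(messages, labels, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest one."""
    rng = random.Random(seed)
    by_label = {}
    for message, label in zip(messages, labels):
        by_label.setdefault(label, []).append(message)
    if not by_label:
        return [], []

    target = max(len(items) for items in by_label.values())
    balanced = []
    for label, items in by_label.items():
        # Draw extra samples with replacement from the smaller classes.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        balanced.extend((message, label) for message in items + extra)

    rng.shuffle(balanced)
    return [m for m, _ in balanced], [l for _, l in balanced]


# Toy data with the same kind of skew as the email dataset (more ham than spam).
messages = ["see you at 5", "lunch?", "call me later", "ok thanks", "WIN a free prize now"]
labels = ["ham", "ham", "ham", "ham", "spam"]

balanced_messages, balanced_labels = oversample_minority(messages, labels)
print(Counter(balanced_labels))
```

If a resampling step like this is used, it should be applied to the training split only, so that duplicated messages do not leak into the test set and inflate the measured accuracy.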
Upon implementing CNN and DNN approaches for image classification, we observed that CNN outperformed DNN, with an accuracy of 97.54% compared to 95.34%. The results are shown in the following table.
Model | Accuracy (%) |
---|---|
CNN | 97.54 |
DNN | 95.34 |
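Accuracy, the metric used in the tables, is the percentage of samples whose predicted label matches the true label. The sketch below shows that computation on made-up labels; the `accuracy_percent` helper is hypothetical and the toy data is not taken from the project's experiments.

```python
def accuracy_percent(y_true, y_pred):
    """Percentage of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return round(100 * correct / len(y_true), 2)


# Toy labels (made up for illustration, not taken from the project's results).
y_true = ["spam", "ham", "spam", "spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "spam", "ham", "ham", "ham", "spam", "spam"]
print(accuracy_percent(y_true, y_pred))  # 6 of 8 correct -> 75.0
```

On imbalanced data, accuracy is best read against a trivial baseline: for the email dataset, a classifier that always predicts ham would already score about 86.6%, since 4825 of the 5572 messages are ham.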
Contributions are what make open source such a fantastic environment to learn, inspire, and create. Any contribution to this work is much appreciated. If you have a suggestion for improvement, please fork the repository and create a pull request. You can also open an issue for queries. And don't forget to give the project a star!