This repository contains Python code for detecting spam emails with Logistic Regression, together with real-time email collection. When an email arrives from a specified address, the system collects the body of the message, so the model can be applied to, and trained on, up-to-date data. The model is trained on a dataset of email messages labeled as either spam or ham (non-spam). Gathering new email data in real time is intended to keep the model effective at distinguishing legitimate messages from unsolicited ones.
- numpy
- pandas
- scikit-learn
You can install the required dependencies using pip:
pip install numpy pandas scikit-learn
- Clone the repository:
git clone https://github.com/KaavinB/Realtime_Spam_Detection.git
- Ensure you have a CSV file containing email data. By default, the code assumes the file is named `spam.csv` and contains columns named `Message` and `Category` (spam or ham).
- Run the Jupyter Notebook script:
The script will perform the following steps:
- Load the data from the CSV file into a pandas DataFrame.
- Replace null values with an empty string.
- Label encode the categories: spam as 0, ham as 1.
- Split the data into training and test sets.
- Convert text data into feature vectors using TF-IDF vectorization.
- Train a Logistic Regression model using the training data.
- Evaluate the accuracy of the model on both training and test data.
- Connect to a Gmail account using IMAP.
- Fetch emails from the inbox.
- Predict whether each email is spam or not using the trained model.
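The training and prediction steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function names `train_spam_model` and `predict_label`, the `test_size` and `random_state` values, the `TfidfVectorizer` options, and the tiny in-memory dataset (used in place of `spam.csv` so the example runs on its own) are all assumptions.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def train_spam_model(df, test_size=0.2, random_state=3):
    """Train TF-IDF + Logistic Regression on a DataFrame with Message/Category columns.

    Hypothetical helper; split size and seed are illustrative assumptions.
    """
    # Replace null values with an empty string.
    df = df.fillna("")
    # Label encode the categories: spam as 0, ham as 1.
    y = df["Category"].map({"spam": 0, "ham": 1}).astype(int)
    X = df["Message"]
    # Split into training and test sets, keeping the class balance in both.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    # Fit TF-IDF on the training text only, then reuse it for the test text.
    vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)
    model = LogisticRegression()
    model.fit(X_train_vec, y_train)
    # Accuracy on both training and test data.
    train_acc = accuracy_score(y_train, model.predict(X_train_vec))
    test_acc = accuracy_score(y_test, model.predict(X_test_vec))
    return model, vectorizer, train_acc, test_acc


def predict_label(model, vectorizer, text):
    """Classify one email body; 0 maps back to 'spam', 1 to 'ham'."""
    pred = model.predict(vectorizer.transform([text]))[0]
    return "spam" if pred == 0 else "ham"


# Toy stand-in for spam.csv (assumption: real data would come from pd.read_csv).
toy = pd.DataFrame({
    "Message": [
        "Win a free prize now, click here",
        "Free entry to win cash, claim today",
        "Urgent: claim your free reward now",
        "Cheap loans, free approval, click now",
        "You won a free holiday, call now",
        "Are we still meeting for lunch tomorrow?",
        "Please review the attached report",
        "Can you send me the meeting notes?",
        "See you at the office on Monday",
        "Thanks for the update on the project",
    ],
    "Category": ["spam"] * 5 + ["ham"] * 5,
})

model, vectorizer, train_acc, test_acc = train_spam_model(toy)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
print(predict_label(model, vectorizer, "Claim your free prize now"))
```

With a real dataset, `toy` would be replaced by `pd.read_csv("spam.csv")`. Fitting the vectorizer on the training split only keeps test-set vocabulary from leaking into the features.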
- Update the file path in the code to point to your CSV file if it's not named `spam.csv`.
- Adjust the IMAP settings (`user`, `password`, `imap_url`) according to your email provider.
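As a rough illustration of how the IMAP settings are used, the sketch below fetches message bodies from a given sender with Python's standard `imaplib` and `email` modules. The helper names `extract_body` and `fetch_bodies` and the `sender` parameter are hypothetical, and the notebook's own code may differ. For Gmail, `imap_url` would typically be `imap.gmail.com`, and the account usually needs an app password rather than the regular login password.

```python
import email
import imaplib
from email import policy
from email.message import EmailMessage


def extract_body(raw_bytes):
    """Return the plain-text body of a raw email message, or '' if there is none."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    # Prefer the text/plain part; multipart messages may also carry HTML.
    part = msg.get_body(preferencelist=("plain",))
    return part.get_content().strip() if part is not None else ""


def fetch_bodies(user, password, imap_url, sender):
    """Collect the bodies of all inbox messages sent from `sender` (hypothetical helper)."""
    conn = imaplib.IMAP4_SSL(imap_url)
    try:
        conn.login(user, password)
        # Open read-only so fetching does not mark messages as seen.
        conn.select("INBOX", readonly=True)
        _, data = conn.search(None, "FROM", f'"{sender}"')
        bodies = []
        for num in data[0].split():
            _, msg_data = conn.fetch(num, "(RFC822)")
            bodies.append(extract_body(msg_data[0][1]))
        return bodies
    finally:
        conn.logout()


# Offline check of the body extraction, using a locally built message.
demo = EmailMessage()
demo["From"] = "sender@example.com"
demo["Subject"] = "Demo"
demo.set_content("Win a free prize now")
print(extract_body(demo.as_bytes()))
```

Each body returned by `fetch_bodies` can then be vectorized with the fitted TF-IDF vectorizer and passed to the trained model for prediction. Credentials are best read from environment variables or a local config file rather than written into the notebook.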
This project is licensed under the MIT License - see the LICENSE file for details.