Note
This repository is not yet complete; more experiments and code will be added in the coming days.
This repository contains code for URL classification: assigning URLs to predefined categories using machine learning techniques. The project uses a linear classifier trained with Stochastic Gradient Descent (SGD), alongside a TfidfVectorizer for feature extraction.
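As a rough illustration of the approach described above, the following minimal sketch combines scikit-learn's TfidfVectorizer with an SGDClassifier in a single pipeline. The toy URLs, the labels, and the character n-gram settings are assumptions made for this example; they are not taken from the repository's training script or dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Toy data invented for illustration; not the repository's dataset.
urls = [
    "https://news.example.com/politics/election-results",
    "https://news.example.com/world/breaking-story",
    "https://shop.example.com/cart/checkout",
    "https://shop.example.com/products/shoes",
]
labels = ["news", "news", "shopping", "shopping"]

# Assumption: character n-grams are used because URLs contain no whitespace.
# The actual vectorizer settings in the repository may differ.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    SGDClassifier(loss="hinge", random_state=0),  # linear model fitted with SGD
)
model.fit(urls, labels)

# Classify a URL the model has not seen during training.
prediction = model.predict(["https://shop.example.com/products/hats"])
print(prediction[0])
```

Wrapping both steps in one pipeline means the same vectorization is applied at training and prediction time, which avoids a common source of mismatched features.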
- data/: Contains the dataset used for training and evaluating the model.
- model_results/: Stores the evaluation metrics and results from model training.
- models/: Contains the serialized form of the trained model.
- src/: Source scripts including model training and URL prediction functionalities.
- Clone the repository:
git clone https://github.com/padas-lab-de/url-classification.git
cd url-classification
- Set up a virtual environment:
python -m venv venv
- Activate the virtual environment:
- Windows:
.\venv\Scripts\activate
- macOS/Linux:
source venv/bin/activate
- Install the required dependencies:
pip install -r requirements.txt
To train the model, navigate to the src/ directory and run the model_training.py script. You will be prompted to enter the path to the dataset file:
cd src
python model_training.py
Follow the prompts to enter the dataset name (e.g., OWS_URL_DS.csv). The script will train the model and save it along with the evaluation metrics.
To classify new URLs, use the predict_urls.py script. You will need to provide a path to a file containing URLs in either .csv or .txt format:
python predict_urls.py
The predictions will be saved to predictions.csv in the root directory.
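The internals of predict_urls.py are not shown here, so the following is only a sketch of how the two input formats and the output file might be handled. The helper names `read_urls` and `write_predictions`, the one-URL-per-line layout for .txt files, the headerless first-column layout for .csv files, and the `url`/`label` output columns are all assumptions for illustration.

```python
import csv
from pathlib import Path


def read_urls(path):
    """Return URLs from a .txt file (one per line) or a .csv file (first column).

    Hypothetical helper: assumes the .csv file has no header row.
    """
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".txt":
        with path.open(encoding="utf-8") as handle:
            # Skip blank lines and strip surrounding whitespace.
            return [line.strip() for line in handle if line.strip()]
    if suffix == ".csv":
        with path.open(newline="", encoding="utf-8") as handle:
            return [row[0].strip() for row in csv.reader(handle)
                    if row and row[0].strip()]
    raise ValueError(f"Unsupported input format: {suffix!r}")


def write_predictions(urls, labels, out_path="predictions.csv"):
    """Write one (url, label) row per prediction, preceded by a header row."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["url", "label"])
        writer.writerows(zip(urls, labels))
```

With a trained model in hand, usage would look roughly like `write_predictions(urls, model.predict(urls))`, where `urls = read_urls("my_urls.txt")`.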
- Use Different ML Models and Compare Them: Implement and evaluate other machine learning models and compare their performance against the current SGD classifier. Candidates include:
- SVC
- Random Forest
- Logistic Regression
- Neural Networks
- Measure the Prediction Latency: Measure the time it takes to predict labels for new URLs.
- Include URL Augmentation for the Training Phase: Investigate and integrate URL augmentation techniques to enhance the diversity and volume of the training data, which could improve model robustness and accuracy.
- Utilize Class Weight: Explore using the class_weight parameter in the model training process to handle class imbalance.
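Several of the items above (model comparison, prediction latency, and class weighting) could be explored together. The sketch below is one possible starting point, not the project's planned implementation: the toy imbalanced data, the vectorizer settings, and the hyperparameters are assumptions, and neural networks are left out to keep the example short.

```python
import time

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy, deliberately imbalanced data invented for illustration.
train_urls = [
    "https://news.example.com/politics/election",
    "https://news.example.com/world/story",
    "https://news.example.com/sports/match-report",
    "https://news.example.com/tech/new-phone",
    "https://news.example.com/business/markets",
    "https://news.example.com/culture/film-review",
    "https://shop.example.com/cart/checkout",
    "https://shop.example.com/products/shoes",
]
train_labels = ["news"] * 6 + ["shopping"] * 2
test_urls = [
    "https://news.example.com/science/space",
    "https://shop.example.com/products/hats",
]
test_labels = ["news", "shopping"]

# class_weight="balanced" reweights classes inversely to their frequency.
candidates = {
    "sgd": SGDClassifier(class_weight="balanced", random_state=0),
    "svc": SVC(kernel="linear", class_weight="balanced"),
    "logistic_regression": LogisticRegression(class_weight="balanced", max_iter=1000),
    "random_forest": RandomForestClassifier(
        n_estimators=50, class_weight="balanced", random_state=0
    ),
}

results = {}
for name, classifier in candidates.items():
    pipeline = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)), classifier
    )
    pipeline.fit(train_urls, train_labels)

    # Time only the prediction step, including vectorization of the new URLs.
    start = time.perf_counter()
    predicted = pipeline.predict(test_urls)
    elapsed = time.perf_counter() - start

    correct = sum(p == t for p, t in zip(predicted, test_labels))
    results[name] = {
        "accuracy": correct / len(test_labels),
        "ms_per_url": 1000 * elapsed / len(test_urls),
    }

for name, metrics in results.items():
    print(f"{name}: accuracy={metrics['accuracy']:.2f}, "
          f"latency={metrics['ms_per_url']:.3f} ms/URL")
```

On a real dataset, a single timing run is noisy; repeating the prediction several times and reporting the median would give a more reliable latency figure.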
Contributions to this project are welcome! Please feel free to fork the repository, make your changes, and submit a pull request.