Skip to content

padas-lab-de/url-classification

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

18 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Webpage URL Classification

Note

This repo is not complete, we will add more experiments and code in the next days

This repository contains code for URL Classification , designed to classify URLs into predefined categories using machine learning techniques. The project utilizes a leanier classifier with Stochastic Gradient Descent (SGD) optimizer, alongside a TfidfVectorizer for feature extraction.

Key Components

  • data/: Contains the dataset used for training and evaluating the model.
  • model_results/: Stores the evaluation metrics and results from model training.
  • models/: Contains the serialized form of the trained model.
  • src/: Source scripts including model training and URL prediction functionalities.

Setup

Installation

  1. Clone the repository:

    git clone https://github.com/padas-lab-de/url-classification.git
    cd url-classification
  2. Set up a virtual environment:

    python -m venv venv
  3. Activate the virtual environment:

    • Windows:
      .\venv\Scripts\activate
    • macOS/Linux:
      source venv/bin/activate
  4. Install the required dependencies:

    pip install -r requirements.txt

Usage

Training the Model

To train the model, navigate to the src/ directory and run the model_training.py script. You will be prompted to enter the path to the dataset file:

cd src
python model_training.py

Follow the prompts to enter the dataset name (e.g., OWS_URL_DS.csv). The script will train the model and save it along with the evaluation metrics.

Predicting URL Labels

To classify new URLs, use the predict_urls.py script. You will need to provide a path to a file containing URLs in either .csv or .txt format:

python predict_urls.py

The predictions will be saved to predictions.csv in the root directory.

To Do

  • Use Different ML Models and Compare Them: Implement and evaluate other machine learning models to compare their performance against the current SGD Classifier such as:

    • SVC
    • Random Forest
    • Logistic Regression
    • Neural Networks
  • Measure the Prediction Latency: Measure the time it takes to predict labels for new URLs.

  • Include URL Augmentation for the Training Phase: Investigate and integrate URL augmentation techniques to enhance the diversity and volume of the training data, which could improve model robustness and accuracy.

Done ✓

  • Utilize Class Weight: Explore using the class_weight parameter in the model training process to handle class imbalance

Contributing

Contributions to this project are welcome! Please feel free to fork the repository, make your changes, and submit a pull request.

About

Classify webpages using their URLs

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages