The project structure is split into four main parts, described below:
All the code responsible for the initialization of the project is stored in the `/src/datasets`
folder. The structure is as follows:
/src/datasets
│
├── build_vocab.sh
├── cooc.py
├── cut_vocab.sh
├── glove_solution.py
├── pickle_vocab.py
├── tweet_to_vector.py
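These scripts build the vocabulary and the word embeddings. The sketch below shows a plausible order in which they operate; the mapping from script to output file is an assumption inferred from the file names and the generated files under `/data/init`, and `run.py` normally drives these steps automatically:

```sh
# Assumed pipeline order -- run.py automates this, so exact arguments may differ.
bash build_vocab.sh         # count word occurrences     -> vocab_full.txt
bash cut_vocab.sh           # drop rare words            -> vocab_cut.txt
python pickle_vocab.py      # map each word to an index  -> vocab.pkl
python cooc.py              # co-occurrence matrix       -> cooc.pkl
python glove_solution.py    # train GloVe embeddings     -> SGD_embeddings.npy
python tweet_to_vector.py   # convert tweets to embedding vectors
```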
All the code responsible for the different models is stored in the `/src/models`
folder. The structure is as follows:
/src/models
│
├── averaged_embeddings_models
│ ├── GradientBoosting.py
│ ├── LogisticRegression.py
│ ├── NeuralNetwork.py
│ ├── SupportVectorMachine.py
│
├── sequenced_embeddings_models
│ ├── RecurrentNeuralNetwork.py
All the code that manages file storage and loading is stored in the `/src/utils`
folder. The structure is as follows:
/src/utils
│
├── dataloader.py
├── initialization.py
├── submission.py
All the data files are stored within the `/data`
folder. The structure is as follows:
/data
│
├── init // Generated by run.py
│ ├── cooc.pkl // Generated by run.py
│ ├── SGD_embeddings.npy // Generated by run.py
│ ├── vocab_full.txt // Generated by run.py
│ ├── vocab_cut.txt // Generated by run.py
│ ├── vocab.pkl // Generated by run.py
│
├── submission // Generated by run.py
│ ├── <Model Name>_<Dataset Type>.csv // Generated by run.py
│
├── twitter-datasets // Unzipped twitter-datasets.zip
│ ├── sample_submission.csv
│ ├── test_data.txt
│ ├── train_neg_embedding.txt // Generated by run.py
│ ├── train_neg_full_embedding.txt // Generated by run.py
│ ├── train_neg_full.txt
│ ├── train_neg.txt
│ ├── train_pos_embedding.txt // Generated by run.py
│ ├── train_pos_full_embedding.txt // Generated by run.py
│ ├── train_pos_full.txt
│ ├── train_pos.txt
│
├── twitter-datasets.zip
As is, the `src/run.py`
file generates the submission file that achieved the best score on aicrowd.com, but its parameters can easily be changed to train another model. The possible modifications are listed below (a configuration sketch follows the list):
- `model_type` - default value is `RecurrentNeuralNetwork`; can be changed to `GradientBoosting`, `LogisticRegression`, `SupportVectorMachine` or `NeuralNetwork`
- `full_dataset` - defaults to `True`; can be changed to `False` (recommended for a faster execution time)
- `force_generation` - defaults to `False`; can be changed to `True` (not recommended)
- Model hyperparameters - every initialized model (e.g. `model = GradientBoosting()`) has default hyperparameters that can be changed easily.
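For example, a faster test configuration could look like the following sketch. The variable names, import path and hyperparameter names are illustrative assumptions, not a verbatim excerpt of `run.py`:

```python
# Hypothetical excerpt of src/run.py -- names and import path are
# assumptions, not verbatim project code.
from models.averaged_embeddings_models.GradientBoosting import GradientBoosting

model_type = "GradientBoosting"  # default: "RecurrentNeuralNetwork"
full_dataset = False             # use the small dataset -> much faster run
force_generation = False         # reuse files already present in /data/init

# Every model ships with default hyperparameters that can be overridden
# at construction time (parameter names depend on the implementation):
model = GradientBoosting(n_estimators=100, learning_rate=0.1)
```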
In order to run the code and access the `.csv`
submission files, execute the following steps:
- Install the following Python libraries with `pip install <library name>` (a combined one-line command is shown after this list):
- numpy
- pandas
- xgboost
- scikit-learn
- tensorflow
- tqdm
- In your terminal, navigate to the `/src` directory and enter the following: `python run.py`
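Put together, the whole sequence from a fresh setup looks like this:

```sh
pip install numpy pandas xgboost scikit-learn tensorflow tqdm
cd src
python run.py
```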
Important Note 1: The first time this script is executed, all required files will be generated and stored in the `data`
folder. This may take a while.
Important Note 2: The default training run is very long (>10 hours) and was previously executed on the EPFL SCITAS server. It is strongly recommended to modify the `run.py` parameters (e.g. `full_dataset = False`) and test the model on the small dataset.
- Once the script has finished, it will have stored the `.csv` file inside the `data/submission` folder. The naming varies depending on the model and dataset used; submission files are named as follows: `<Model Name>_<Dataset Type>.csv`.
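For instance, the default configuration would produce a file named along the lines of `RecurrentNeuralNetwork_Full.csv` (the exact dataset label is whatever `run.py` writes).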