The project structure is split into four main parts, described below:
All the code responsible for the initialization of the project is stored in the `/src/datasets`
folder. The structure is as follows:
/src/datasets
│
├── build_vocab.sh
├── cooc.py
├── cut_vocab.sh
├── glove_solution.py
├── pickle_vocab.py
├── tweet_to_vector.py
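These scripts build the vocabulary and the word embeddings. The sketch below shows a plausible order in which they operate; the mapping from script to output file is an assumption inferred from the file names and the generated files under `/data/init`, and `run.py` normally drives these steps automatically:

```sh
# Assumed pipeline order -- run.py automates this, so exact arguments may differ.
bash build_vocab.sh         # count word occurrences     -> vocab_full.txt
bash cut_vocab.sh           # drop rare words            -> vocab_cut.txt
python pickle_vocab.py      # map each word to an index  -> vocab.pkl
python cooc.py              # co-occurrence matrix       -> cooc.pkl
python glove_solution.py    # train GloVe embeddings     -> SGD_embeddings.npy
python tweet_to_vector.py   # convert tweets to embedding vectors
```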
All the code responsible for the different models is stored in the `/src/models`
folder. The structure is as follows:
/src/models
│
├── averaged_embeddings_models
│ ├── GradientBoosting.py
│ ├── LogisticRegression.py
│ ├── NeuralNetwork.py
│ ├── SupportVectorMachine.py
│
├── sequenced_embeddings_models
│ ├── RecurrentNeuralNetwork.py
All the code that manages file storage and loading is stored in the `/src/utils`
folder. The structure is as follows:
/src/utils
│
├── dataloader.py
├── initialization.py
├── submission.py
All the data files are stored within the `/data`
folder. The structure is as follows:
/data
│
├── init // Generated by run.py
│ ├── cooc.pkl // Generated by run.py
│ ├── SGD_embeddings.npy // Generated by run.py
│ ├── vocab_full.txt // Generated by run.py
│ ├── vocab_cut.txt // Generated by run.py
│ ├── vocab.pkl // Generated by run.py
│
├── submission // Generated by run.py
│ ├── <Model Name>_<Dataset Type>.csv // Generated by run.py
│
├── twitter-datasets // Unzipped twitter-datasets.zip
│ ├── sample_submission.csv
│ ├── test_data.txt
│ ├── train_neg_embedding.txt // Generated by run.py
│ ├── train_neg_full_embedding.txt // Generated by run.py
│ ├── train_neg_full.txt
│ ├── train_neg.txt
│ ├── train_pos_embedding.txt // Generated by run.py
│ ├── train_pos_full_embedding.txt // Generated by run.py
│ ├── train_pos_full.txt
│ ├── train_pos.txt
│
├── twitter-datasets.zip
As is, the `src/run.py`
file generates the submission file that achieved the best score on aicrowd.com, but its parameters can easily be changed to train another model. The possible modifications are listed below (a configuration sketch follows the list):
- `model_type` - default value is `RecurrentNeuralNetwork`; can be changed to `GradientBoosting`, `LogisticRegression`, `SupportVectorMachine` or `NeuralNetwork`
- `full_dataset` - defaults to `True`; can be changed to `False` (recommended for a faster execution time)
- `force_generation` - defaults to `False`; can be changed to `True` (not recommended)
- Model hyperparameters - every initialized model (e.g. `model = GradientBoosting()`) has default hyperparameters that can be changed easily.
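For example, a faster test configuration could look like the following sketch. The variable names, import path and hyperparameter names are illustrative assumptions, not a verbatim excerpt of `run.py`:

```python
# Hypothetical excerpt of src/run.py -- names and import path are
# assumptions, not verbatim project code.
from models.averaged_embeddings_models.GradientBoosting import GradientBoosting

model_type = "GradientBoosting"  # default: "RecurrentNeuralNetwork"
full_dataset = False             # use the small dataset -> much faster run
force_generation = False         # reuse files already present in /data/init

# Every model ships with default hyperparameters that can be overridden
# at construction time (parameter names depend on the implementation):
model = GradientBoosting(n_estimators=100, learning_rate=0.1)
```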
In order to run the code and access the `.csv`
submission files, execute the following steps:
- Install the following Python libraries with `pip install <library name>` (a combined one-line command is shown after this list):
- numpy
- pandas
- xgboost
- scikit-learn
- tensorflow
- tqdm
- In your terminal, navigate to the `/src` directory and enter the following: `python run.py`
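Put together, the whole sequence from a fresh setup looks like this:

```sh
pip install numpy pandas xgboost scikit-learn tensorflow tqdm
cd src
python run.py
```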
Important Note 1: The first time this script is executed, all required files will be generated and stored in the `data`
folder. This may take a while.
Important Note 2: The default training run is very long (>10 hours) and was previously executed on the EPFL SCITAS server. It is strongly recommended to modify the `run.py` parameters (e.g. `full_dataset = False`) and test the model on the small dataset.
- Once the script has finished, it will have stored the `.csv` file inside the `data/submission` folder. The naming varies depending on the model and dataset used; submission files are named as follows: `<Model Name>_<Dataset Type>.csv`.
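For instance, the default configuration would produce a file named along the lines of `RecurrentNeuralNetwork_Full.csv` (the exact dataset label is whatever `run.py` writes).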