- Recipe Generation
  - RecipeNLG : https://recipenlg.cs.put.poznan.pl/
  - RecipeBox : https://eightportions.com/datasets/Recipes/
- Text Style Transfer
  - William Shakespeare :
    - Translations of Shakespeare plays to Modern English
    - https://www.kaggle.com/datasets/garnavaurha/shakespearify
  - Taylor Swift :
    - Taylor Swift Song Lyrics
    - https://www.kaggle.com/datasets/PromptCloudHQ/taylor-swift-song-lyrics-from-all-the-albums
  - Donald Trump :
    - Donald Trump tweets through June 2020
    - https://www.kaggle.com/datasets/austinreese/trump-tweets
  - Michael Scott :
    - Complete script of The Office
    - https://www.kaggle.com/datasets/nasirkhalid24/the-office-us-complete-dialoguetranscript
- Numpy : Mathematical operations used while preprocessing the datasets
  - `pip install numpy`
- Pandas : Loading, processing and storing the different datasets
  - `pip install pandas`
- Itertools : Easy iteration over large lists
  - Part of the Python standard library; no installation needed
- Scikit-learn (sklearn) : Cosine Similarity and TF-IDF
  - `pip install scikit-learn`
- Transformers : DistilGPT2, T5-small, MarianMT (both models and tokenizers)
  - `pip install transformers`
- SentencePiece : Used by MarianMT's tokenizer (Back Translation)
  - `pip install sentencepiece`
- Evaluate : BLEU Score evaluation
  - `pip install evaluate`
- Matplotlib : Plotting of the training curves
  - `pip install matplotlib`
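Before running the notebooks, it can help to confirm that the libraries above are importable. The snippet below is a minimal sketch, not part of the project code: the `REQUIRED` mapping and the `missing_packages` helper are hypothetical names introduced here. It relies on two facts: `itertools` ships with Python, and the `scikit-learn` package is imported as `sklearn`.

```python
from importlib.util import find_spec

# pip package name -> import name (hypothetical mapping for this sketch).
# scikit-learn is imported as `sklearn`; itertools is in the standard
# library, so it is not listed here.
REQUIRED = {
    "numpy": "numpy",
    "pandas": "pandas",
    "scikit-learn": "sklearn",
    "transformers": "transformers",
    "sentencepiece": "sentencepiece",
    "evaluate": "evaluate",
    "matplotlib": "matplotlib",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import name cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Install with: pip install " + " ".join(missing))
    else:
        print("All required packages are importable.")
```

In Colab, several of these packages are typically preinstalled, so the printed command usually lists only a few names.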
- RecipeDataset.ipynb :
  - Loading of both Recipes datasets
  - Preprocessing datasets to get into a common format
  - Performing statistical analysis on the data
  - Storing the final concatenated dataset
- Statistics.ipynb :
  - Statistical analysis on the preprocessed datasets and the final concatenated dataset
- Recipe_Generation_DistilGPT.ipynb :
  - Loading of the final recipe dataset
  - Data Preparation of the final dataset
  - Training of DistilGPT2 Model
  - Testing of the Finetuned (FT) model and baseline model
  - Evaluation of the models - BLEU Score and Perplexity
  - Generation of Recipe dataset for Style Transfer
  - Error Analysis on Adversarial inputs
- Preprocess_TST_dataset.ipynb :
  - Loading the non-parallel data - Taylor and Trump
  - Preprocess the datasets
  - Extract statistical info about the dataset
- Shakespeare_and_Scripts_Preprocessing.ipynb :
  - Loading the non-parallel data - Michael
  - Load the parallel data - Shakespeare
  - Preprocess the datasets
  - Extract some statistical info about the dataset
- BackTranslation.ipynb :
  - Load the MarianMT models for Fr-En and En-Fr
  - Perform back translation to generate synthetic parallel data - Michael, Taylor and Trump
  - Store the parallel dataset
- TST_Architecture.ipynb :
  - Load all the parallel datasets
  - Finetune a different T5-small model on each dataset
  - Generate styled recipes - Sentence-wise and Entire Recipe
  - Test the performance (Human Evaluation) on the styled recipes (Sentence-wise)
  - Check for style infusion on random sentences
- Supplementary/Adversarial Inputs.xlsx
  - Adversarial examples for the model: 120 inputs for which the model's output differs from the expected behavior and is of low quality
- Supplementary/Sentence_Styled_Recipes.xlsx
  - Human evaluations of the styled recipes generated by the finetuned T5 models
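The back-translation step in BackTranslation.ipynb can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's actual code: the helper names are hypothetical, the checkpoint names `Helsinki-NLP/opus-mt-en-fr` and `Helsinki-NLP/opus-mt-fr-en` are assumed (the project only states that MarianMT En-Fr and Fr-En models are used), and pairing the round-tripped sentence as the source with the original as the styled target is likewise an assumption.

```python
from typing import Callable, List, Tuple

# A translator maps a batch of sentences to a batch of translations.
Translator = Callable[[List[str]], List[str]]


def load_marian_translator(model_name: str) -> Translator:
    """Build a batch translator from a MarianMT checkpoint (downloads weights)."""
    # Imported lazily so the rest of the sketch runs without transformers.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    def translate(sentences: List[str]) -> List[str]:
        batch = tokenizer(
            sentences, return_tensors="pt", padding=True, truncation=True
        )
        generated = model.generate(**batch)
        return tokenizer.batch_decode(generated, skip_special_tokens=True)

    return translate


def back_translate(
    sentences: List[str], to_pivot: Translator, from_pivot: Translator
) -> List[Tuple[str, str]]:
    """Return (round_tripped, original) pairs forming a synthetic parallel set."""
    pivot = to_pivot(sentences)        # e.g. English -> French
    round_trip = from_pivot(pivot)     # e.g. French -> English
    return list(zip(round_trip, sentences))


# Usage (assumed checkpoint names; requires transformers, sentencepiece, torch):
# en_fr = load_marian_translator("Helsinki-NLP/opus-mt-en-fr")
# fr_en = load_marian_translator("Helsinki-NLP/opus-mt-fr-en")
# pairs = back_translate(["I declare bankruptcy!"], en_fr, fr_en)
```

Passing the translators in as plain callables keeps `back_translate` independent of the model library, so it can be tried with stub functions before any weights are downloaded.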
All of the code was implemented in Google Colab, except the training of the LLMs, which could not be run there due to computational limitations. Follow the steps below to reproduce the project.
- Download all the .ipynb files and upload them to a new folder on Google Drive named 'Project 685'
- Download all the Recipe Datasets and add them to the top-level folder 'Project 685'
- Run RecipeDataset.ipynb to get the 'Final_dataset' file, which contains the preprocessed, concatenated dataset
- Run Statistics.ipynb to display some statistics about the datasets [OPTIONAL]
- Run Recipe_Generation_DistilGPT.ipynb to get the finetuned recipe generation model and the generated recipes
- Download the Text Style Transfer datasets. Create a new sub-folder {persona}_TST (e.g. Taylor_TST)
- Upload the .zip datasets for Taylor and Trump to their respective sub-folders. For Shakespeare and Michael, add the unzipped .csv files to the top-level folder
- Run Preprocess_TST_dataset.ipynb and Shakespeare_and_Scripts_Preprocessing.ipynb to get appropriately formatted datasets for back translation
- Run BackTranslation.ipynb to get a parallel dataset for Taylor, Trump and Michael
- Run TST_Architecture.ipynb to get the finetuned TST models and generate the final outputs
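Before running the notebooks, the folder layout described in the steps above can be checked with a short script. This is a minimal sketch: the `missing_items` helper is hypothetical, `Trump_TST` is inferred from the `{persona}_TST` naming pattern, and the Drive path in the usage comment assumes Colab's default mount point.

```python
from pathlib import Path

# Notebooks expected in the top-level project folder.
NOTEBOOKS = [
    "RecipeDataset.ipynb",
    "Statistics.ipynb",
    "Recipe_Generation_DistilGPT.ipynb",
    "Preprocess_TST_dataset.ipynb",
    "Shakespeare_and_Scripts_Preprocessing.ipynb",
    "BackTranslation.ipynb",
    "TST_Architecture.ipynb",
]

# Sub-folders for the zipped datasets; "Trump_TST" is an assumed name
# following the {persona}_TST pattern.
PERSONA_FOLDERS = ["Taylor_TST", "Trump_TST"]


def missing_items(project_dir):
    """List the notebooks and persona sub-folders not found in project_dir."""
    root = Path(project_dir)
    missing = [name for name in NOTEBOOKS if not (root / name).is_file()]
    missing += [
        folder + "/" for folder in PERSONA_FOLDERS if not (root / folder).is_dir()
    ]
    return missing


# Usage in Colab (assumes the default Drive mount point):
# from google.colab import drive
# drive.mount("/content/drive")
# print(missing_items("/content/drive/MyDrive/Project 685"))
```

An empty list means every notebook and persona sub-folder is in place. The dataset files themselves are not checked, since their exact filenames depend on the downloads.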