Code accompanying the paper presented at the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD 2024), 25 May 2024, co-located with LREC-COLING 2024 in Torino, Italy.
Originally prepared as part of a Master's dissertation project on COM4520 at The University of Sheffield, titled "Is Less More? Idiom Detection with Generalised Natural Language Models".
Paper Authors: Agne Knietaite, Adam Allsebrook, Anton Minkov, Adam Tomaszewski, Norbert Slinko, Richard Johnson, Thomas Pickard, Aline Villavicencio, Dylan Phelps.
Link to the paper: https://doi.org/10.48550/arXiv.2405.08497
Link to the NCSSB dataset: https://doi.org/10.15131/shef.data.25259722.v1
The code is divided into two sections, Model-related and Dataset-related, following the structure of the accompanying report. The subdirectories are largely self-contained and have their own README files for clarity.
Model related:
- `src`: base model code for the fine-tuning and pre-training scenarios, equipped with model performance evaluation and visualisation techniques (see the sketch after this list);
- `Jupiter Notebooks`: Jupyter notebooks that allow the models to be run on Google Colab;
- `HPC Fine Tunning`: information on how to schedule model training as jobs on other machines;
- `Paragraph External Context`: gathering of external context for the external-knowledge-enhanced model described in the paper.
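For orientation, the sketch below shows one plausible way to fine-tune an encoder model for binary idiomatic/literal sentence classification, roughly the kind of workflow housed in `src`. It is a minimal illustration rather than the repository's actual training code: the model name, CSV file names, column names and label scheme are all assumptions.

```python
# Minimal fine-tuning sketch (illustrative only, not the project's pipeline).
# Assumed schema: train.csv with 'sentence' and 'label' columns,
# where 0 = literal and 1 = idiomatic usage.
import pandas as pd
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "bert-base-uncased"  # assumption: any encoder model could be used

class IdiomDataset(Dataset):
    """Wraps a dataframe of sentences and binary labels for the Trainer."""
    def __init__(self, df, tokenizer):
        self.enc = tokenizer(list(df["sentence"]), truncation=True,
                             padding=True, max_length=128)
        self.labels = list(df["label"])

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

train_ds = IdiomDataset(pd.read_csv("train.csv"), tokenizer)  # hypothetical file

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_ds,
)
trainer.train()
```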
Dataset related:
- `Data Augmentation`: everything related to the data augmentation used in the report;
- `Data Generation Parser`: the main parser, used for scraping, adapting and constructing the datasets;
- `Datasets`: base datasets used throughout the project;
- `Script For Scraping Text File` & `Web Scrapper & Crawler`: the initial and later improved versions of the data scraping and dataset-building tools (a minimal scraping sketch follows this list).
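The snippet below is a minimal, hypothetical example of collecting sentences that contain a target expression from a web page, in the spirit of the scraping tools above. The URL, target phrase and naive sentence splitting are placeholders and do not reflect the actual scraper or crawler implementation.

```python
# Minimal scraping sketch (illustrative only, not the repository's scraper).
import requests
from bs4 import BeautifulSoup

def sentences_with_phrase(url: str, phrase: str) -> list[str]:
    """Fetch a page and return the sentences that mention the target expression."""
    html = requests.get(url, timeout=10).text
    text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
    # Naive full-stop split; the real tooling may use a proper sentence tokeniser.
    sentences = [s.strip() for s in text.split(".")]
    return [s for s in sentences if phrase.lower() in s.lower()]

if __name__ == "__main__":
    # Placeholder URL and expression for demonstration purposes.
    for s in sentences_with_phrase("https://example.com/article", "silver bullet"):
        print(s)
```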