BERTRAM on MAGPIE dataset

Use BERTRAM [2] to get single-token embeddings for idioms on the MAGPIE [3] dataset.
High level objectives:

Get single-token embeddings for idioms in the MAGPIE dataset using pretrained BERTRAM model
- Use sentences from CC News Corpus as contexts to obtain the BERTRAM embeddings.
- Insert the new tokens of the form <BERTRAM:IDsomeidiomtokenID> into bert-base-uncased model with these embeddings.
Use these embeddings to train a Sequence Classifier task on MAGPIE dataset to clasify idiom/literal meanings. Code for this downstream task is already available in idiom-principle-on-magpie-corpus repository. Specifically, we only need ./exp_helpers/run_glue_f1_macro.py file, which has been copied from here.
- Note that, while inserting the single-tokens into the MAGPIE dataset, make sure to use the format: <BERTRAM:ID...ID>

Dataset used

CC News Corpus with MAAGPIE idioms

This data is already prepared here for a different task(training a BERT model with single-token representations).
Below files are taken from above repository:

File	Taken from
all_replace_data.txt	exp3B_1

MAGPIE dataset

The dataset MAGPIE_filtered_split_{*}.jsonl is used for Idiom Detection using SequenceClassification task.

Setup

The BERTRAM model used directly from the original author's work. The submodule bertram contains the code for the same.
The BERTRAM model available from the original authors has been trained on bert-base-uncased model, and is available here for download. Hence the same will be used the Sequence Classification task on MAGPIE dataset.
Very IMP: The orignal BERTRAM implementation (./bertram) uses the old version of transformers library. Hence, a dedicated virtual environment is recommended for these experiments. The older version of transformers library is necessary only when dealing with the BERTRAM model. The downstream task on MAGPIE dataset can be done using the latest version of transformers library.

Single Token Representation

The setup is similar as described here.
Both Option-1 and Option-2 are to be experimented with in this repository.

Experiments

Steps Followed:

Get the preprocessed CC News corpus containing the idiom single-tokens. Prepare the context data from this (Mostly, you only need to prepare this once). Use the script prepare_data.sh to do this.
Use this context data to obtain BERTRAM embddings, insert the BERTRAM tokens into another BERT model and save it. Use the script train_and_update_bertram_embeddings_hpc.sh to do this. Note, the model produced with added bertram tokens will have a wrong vocab size in its config. This needs to be corrected.
The model saved by step 2 needs to be converted to a SequenceClassification model before proceeding to the next step. Use the notebook exp_helpers/MLM_to_SeqClass_model_converter.ipynb to do this.
Get the MAGPIE dataset containing idiom single-tokens for sequence classification. IMP: Rename the single tokens (<ID...ID>) as BERTRAM tokens (<BERTRAM:ID...ID>) in train, dev & test datasets. Use the script replace_with_bertram_tokens.sh to do this. (Mostly, you only need to prepare this once).
Train a Sequence Classification task on MAGPIE dataset & record the results. Use hpc.sh to do this.

Experiment	Code	Context Dataset	Single Token Type	Source MAGPIE Dataset	Base Model	No of Examples	BERTRAM Status	Idiom Detection Status	Updated model
bt0	bt0	all_replace_data.txt	`Option-1`	exp3B_1/tmp/	bert-base-uncased	20	DONE (Took 00:01:39)	Done (4 GPUs)	bert-base-uncased_option1_with_bertram_bt0_SC
bt1	bt1	same as bt0	`Option-1`	same as bt0	bert-base-uncased	50	DONE (Took 00:07:38)	Done (4 GPUs)	bert-base-uncased_option1_with_bertram_bt1_SC
bt2	bt2	same as bt0	`Option-1`	same as bt0	bert-base-uncased	200	DONE (Took 00:18:23)	Done (RTX5000 x 1)	bert-base-uncased_option1_with_bertram_bt2_SC

Results

The results on Sequence Classifier task on MAGPIE dataset task using the BERTRAM embeddings in a BERT model.

Experiment	Dev Accuracy	Dev F1	Test Accuracy	Test F1
bt_0	85.47	80.59	83.86	80.47
bt_1	86.67	82.06	83.8	79.6
bt_2	84.73	80.59	84.73	80.59

Error Analysis & Study

NOTE: Refer to Error Analysis & Study. All the implementations are copied from there.

The precomputed grouping of PIEs has been copied from idiom-principle-on-magpie-corpus repository and is available at data/PIE_segregation.

The classification reports has been copied from idiom-principle-on-magpie-corpus repository and is available at exp_helpers/produce_test_results.py.

The classification reports (both overall and segreated) is generated for the best experiment bt_2 (folder named bt2).

Statistical Significance Test

NOTE: Refer to Statistical Significance Test Report

References

[1] D. Phelps, “drsphelps at SemEval-2022 Task 2: Learning idiom representations using BERTRAM.” arXiv, May 25, 2022. Accessed: Jun. 19, 2022. [Online]. Available: http://arxiv.org/abs/2204.02821

[2] T. Schick and H. Schütze, “BERTRAM: Improved Word Embeddings Have Big Impact on Contextualized Model Performance.” arXiv, Apr. 29, 2020. Accessed: Jun. 19, 2022. [Online]. Available: http://arxiv.org/abs/1910.07181

[3] Hessel Haagsma, Johan Bos, and Malvina Nissim. 2020. MAGPIE: A Large Corpus of Potentially Idiomatic Expressions. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 279–287, Marseille, France. European Language Resources Association.

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
bertram @ 9e0d872		bertram @ 9e0d872
data		data
exp_helpers		exp_helpers
experiments		experiments
local_models		local_models
.gitattributes		.gitattributes
.gitignore		.gitignore
.gitmodules		.gitmodules
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BERTRAM on MAGPIE dataset

Dataset used

CC News Corpus with MAAGPIE idioms

MAGPIE dataset

Setup

Single Token Representation

Experiments

Results

Error Analysis & Study

Statistical Significance Test

References

About

Releases

Packages

Languages

License

DarshanAdiga/bertram-on-magpie

Folders and files

Latest commit

History

Repository files navigation

BERTRAM on MAGPIE dataset

Dataset used

CC News Corpus with MAAGPIE idioms

MAGPIE dataset

Setup

Single Token Representation

Experiments

Results

Error Analysis & Study

Statistical Significance Test

References

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages