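This is the code and the dataset for the paper titled "The Devil is in the Details: Evaluating Limitations of Transformer-based Methods for Granular Tasks",
accepted at The 28th International Conference on Computational Linguistics (COLING’20).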
If you end up using this code or the data, please cite our paper:
@inproceedings{joshi-etal-2020-devil,
title = "The Devil is in the Details: Evaluating Limitations of Transformer-based Methods for Granular Tasks",
author = "Joshi, Brihi and
Shah, Neil and
Barbieri, Francesco and
Neves, Leonardo",
booktitle = "Proceedings of the 28th International Conference on Computational Linguistics",
month = dec,
year = "2020",
address = "Barcelona, Spain (Online)",
publisher = "International Committee on Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.coling-main.326",
pages = "3652--3659",
abstract = "Contextual embeddings derived from transformer-based neural language models have shown state-of-the-art performance for various tasks such as question answering, sentiment analysis, and textual similarity in recent years. Extensive work shows how accurately such models can represent abstract, semantic information present in text. In this expository work, we explore a tangent direction and analyze such models{'} performance on tasks that require a more granular level of representation. We focus on the problem of textual similarity from two perspectives: matching documents on a granular level (requiring embeddings to capture fine-grained attributes in the text), and an abstract level (requiring embeddings to capture overall textual semantics). We empirically demonstrate, across two datasets from different domains, that despite high performance in abstract document matching as expected, contextual embeddings are consistently (and at times, vastly) outperformed by simple baselines like TF-IDF for more granular tasks. We then propose a simple but effective method to incorporate TF-IDF into models that use contextual embeddings, achieving relative improvements of up to 36{\%} on granular tasks.",
}
Figure: An example pair of articles from the News Dedup dataset: Both report the same news event, and are thus similar on a granular level; the colored text indicates fine-grained details associated with this determination. Both articles are also of the "sports" topic, and are thus similar on an abstract level.
Contextual embeddings derived from transformer-based neural language models have shown state-of-the-art performance for various tasks such as question answering, sentiment analysis, and textual similarity in recent years. Extensive work shows how accurately such models can represent abstract, semantic information present in text. In this expository work, we explore a tangent direction and analyze such models' performance on tasks that require a more granular level of representation. We focus on the problem of textual similarity from two perspectives: matching documents on a granular level (requiring embeddings to capture fine-grained attributes in the text), and an abstract level (requiring embeddings to capture overall textual semantics). We empirically demonstrate, across two datasets from different domains, that despite high performance in abstract document matching as expected, contextual embeddings are consistently (and at times, vastly) outperformed by simple baselines like TF-IDF for more granular tasks. We then propose a simple but effective method to incorporate TF-IDF into models that use contextual embeddings, achieving relative improvements of up to 36% on granular tasks.
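As a rough illustration of the kind of TF-IDF baseline the abstract refers to, the following minimal sketch scores document pairs by cosine similarity over TF-IDF vectors. It is not the code used in the paper: the whitespace tokenizer, the smoothed IDF formula, the helper names (`tokenize`, `tfidf_vectors`, `cosine`), and the three toy documents are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; a real pipeline would likely do more.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumption for this sketch); rarer terms get larger weights.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1.0 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({term: count * idf[term] for term, count in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy documents: the first two describe the same event,
# the third shares only the broad "sports" topic.
docs = [
    "city marathon won by local runner in record time",
    "local runner sets record time at city marathon",
    "football club signs new striker before season opener",
]
vecs = tfidf_vectors(docs)
sim_same_event = cosine(vecs[0], vecs[1])
sim_same_topic = cosine(vecs[0], vecs[2])
print(sim_same_event, sim_same_topic)
```

In this toy setting the pair describing the same event shares specific terms and scores higher than the pair that shares only a topic, which mirrors the granular-versus-abstract distinction described above. The experiments in this repository live in the notebooks, not in this sketch.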
Copyright (c) Snap Inc. 2020. This sample code is made available by Snap Inc. for informational purposes only. It is provided as-is, without warranty of any kind, express or implied, including any warranties of merchantability, fitness for a particular purpose, or non-infringement. In no event will Snap Inc. be liable for any damages arising from the sample code or your use thereof.
- Python 3.5.x
To install the dependencies used in the code, use the requirements.txt file as follows:
pip install -r requirements.txt
- To use the SIF baseline, follow its installation steps and add it to the code/SIF location.
- To access the Bugrepo dataset, download it from the LogPAI Bugrepo repository.
The code is organised as follows.
├── code
│ ├── utils/ # This folder contains all the necessary pre-processing and skeleton code for the models.
│ ├── SIF/ # This folder contains the SIF baseline requirements, installed as per the above instructions.
│ ├── news_dedup_experiments/ # This folder contains the experiments done with the News Dedup dataset
│ └── bug_data_experiments/ # This folder contains the experiments done with the Bugrepo dataset
└── README.md
To run the code for a specific experiment, open the corresponding Jupyter notebook and run its cells to train the models.
For example, to run the TF-IDF experiments for the Bugrepo dataset, run the following:
cd bug_data_experiments/
jupyter notebook
Then open the tf_idf_classification_bugrepo.ipynb notebook.
If you face any problems running this code, contact us at brihi16142[at]iiitd[dot]ac[dot]in or open an issue in this repository.
For license information, see LICENSE.