Stress testing is a methodology in which systems are tested to confirm that they meet their intended specifications and to identify weaknesses.
For Natural Language Inference, our stress tests are large-scale, automatically constructed suites of datasets that evaluate systems on a phenomenon-by-phenomenon basis. Each evaluation set focuses on a single phenomenon so as not to introduce confounding factors, thereby providing a testbed for fine-grained evaluation and analysis.
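For reference, the released evaluation sets are JSONL files in the MultiNLI format. The snippet below is a minimal sketch of how to inspect an individual example; the file name is a placeholder, and the sentence1/sentence2/gold_label field names are the standard MultiNLI ones, assumed here to carry over to the stress tests.

```python
import json

# STRESS_TEST_FILE.jsonl is a placeholder path; sentence1/sentence2/gold_label
# are the standard MultiNLI field names, assumed to carry over to the stress tests.
with open("STRESS_TEST_FILE.jsonl") as f:
    example = json.loads(next(f))

print("Premise:   ", example["sentence1"])
print("Hypothesis:", example["sentence2"])
print("Label:     ", example["gold_label"])
```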
Stress tests for word overlap, negation, length mismatch, antonym, noise and numerical reasoning, as described in the paper [1], can be downloaded directly here. You can also find other resources related to this work on our website.
This repository contains the code used to automatically generate stress tests for word overlap, negation, length mismatch, antonym, spelling error and numerical reasoning, intended to help generate stress tests for new data. To evaluate your models, please use the generated stress tests.
- gen_num_test.py, quant_ner.py: These files perform the preprocessing steps (splitting word problems into sentences, removing sentences with long rationales, and removing sentences that do not contain named entities) and create a set of useful premise sentences for the quantitative reasoning stress test.
- quant_example_gen.py: This file uses the set of useful premise sentences produced by preprocessing to create entailed, contradictory and neutral hypotheses for the quantitative reasoning stress test (see the illustrative sketch after the numerical reasoning run commands below).
- make_antonym_adv_samples.py: This file contains the code to sample sentences from the MultiNLI development set as possible premises and generate contradicting hypotheses.
Numerical Reasoning:
- Run python gen_num_test.py INPUT_FILE OUTPUT_FILE
- Run python quant_ner.py
- Run python quant_example_gen.py
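To give a flavour of what quant_example_gen.py produces, here is a minimal, illustrative sketch of number-based hypothesis generation. It is not the released code: the actual script uses richer heuristics, including the generation of neutral hypotheses, as described in [1].

```python
import random
import re

def number_hypotheses(premise):
    """Illustrative sketch only (not the released quant_example_gen.py):
    build entailed and contradictory hypotheses by rewriting the first
    number found in a word-problem premise sentence."""
    match = re.search(r"\b\d+\b", premise)
    if match is None:
        return None  # the real pipeline filters such sentences out during preprocessing
    n = int(match.group())
    start, end = match.span()
    # Entailment: weaken the exact figure to a strict lower bound it satisfies.
    entailed = premise[:start] + "more than " + str(n - 1) + premise[end:]
    # Contradiction: replace the figure with a different number.
    contradictory = premise[:start] + str(n + random.randint(1, 5)) + premise[end:]
    return {"entailment": entailed, "contradiction": contradictory}

print(number_hypotheses("Tom bought 3 apples at the market ."))
```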
Antonyms:
- Run python make_antonym_adv_samples.py --base_dir MULTINLI_DIRECTORY
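As an illustration of the antonym construction, the sketch below swaps a word for a WordNet antonym to obtain a contradicting hypothesis. It is a simplified stand-in for make_antonym_adv_samples.py, which works over the MultiNLI development set.

```python
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

def antonym_hypothesis(premise):
    """Illustrative sketch only: copy the premise and swap the first word
    that has a WordNet antonym, yielding a contradicting hypothesis."""
    tokens = premise.split()
    for i, token in enumerate(tokens):
        for synset in wn.synsets(token):
            for lemma in synset.lemmas():
                if lemma.antonyms():
                    swapped = tokens[:i] + [lemma.antonyms()[0].name()] + tokens[i + 1:]
                    return " ".join(swapped)
    return None  # no antonym found; such premises would be skipped

print(antonym_hypothesis("I love the hot weather in summer ."))
```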
- make_distraction_adv_samples_jsonl.py: This file generates the word overlap, negation and length mismatch tests.
How to run the code:
- Run python make_distraction_adv_samples_jsonl.py TAUTOLOGY_STRING INPUT_FILE OUTPUT_FILE (this file needs the data_preprocessing.py file provided by the MultiNLI creators to run)
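The sketch below illustrates the distraction construction, assuming the tautology strings reported in [1] ("and true is true" for word overlap and length mismatch, "and false is not true" for negation). It is not the released script, which also carries over the remaining MultiNLI JSONL fields.

```python
import json

# Illustrative sketch of the distraction construction; the tautology strings
# and where they attach follow the description in [1].
TAUTOLOGIES = {
    "word_overlap": " and true is true",
    "negation": " and false is not true",
    "length_mismatch": " and true is true" * 5,
}

def add_distraction(example, test_name):
    perturbed = dict(example)
    if test_name == "length_mismatch":
        # Length mismatch pads the premise; the gold label is unchanged.
        perturbed["sentence1"] = example["sentence1"] + TAUTOLOGIES[test_name]
    else:
        # Word overlap and negation append the tautology to the hypothesis.
        perturbed["sentence2"] = example["sentence2"] + TAUTOLOGIES[test_name]
    return perturbed

example = {"sentence1": "A man is playing a guitar .",
           "sentence2": "A man is playing an instrument .",
           "gold_label": "entailment"}
print(json.dumps(add_distraction(example, "negation")))
```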
- make_grammar_adv_samples_jsonl.py: This file generates the noise test, producing premise-hypothesis pairs in which a word is perturbed by a keyboard-swap spelling error.
- Run python make_grammar_adv_samples_jsonl.py --base_dir MULTINLI_DIRECTORY
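A minimal sketch of a keyboard-swap perturbation is shown below; the neighbour map is a small illustrative subset, not the table used by the released script.

```python
import random

# Minimal sketch of a keyboard-swap typo; the neighbour map is a small
# illustrative subset of a QWERTY layout.
KEYBOARD_NEIGHBOURS = {
    "a": "qwsz", "s": "awedxz", "d": "serfcx", "e": "wsdr",
    "i": "ujko", "n": "bhjm", "o": "iklp", "t": "rfgy",
}

def keyboard_swap(word):
    """Replace one character of the word with an adjacent key on the keyboard."""
    candidates = [i for i, ch in enumerate(word) if ch in KEYBOARD_NEIGHBOURS]
    if not candidates:
        return word
    i = random.choice(candidates)
    return word[:i] + random.choice(KEYBOARD_NEIGHBOURS[word[i]]) + word[i + 1:]

print(keyboard_swap("instrument"))
```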
The generated stress tests are also available at: https://abhilasharavichander.github.io/NLI_StressTest/
You can also directly evaluate your system on all stress tests at once. Usage is as follows:
- You will need to report your predictions on the test file found here
- Write out model predictions as a "prediction" field for each sample in the evaluation set. (Sample submission files are available as sample_submission.jsonl and sample_submission.txt.)
- Run the evaluation script with the command python eval.py --eval_file SUBMISSION_FILE > REPORT_FILE.txt
Alternatively, you may write your own evaluation function: our script simply computes classification accuracy for each stress test.
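If you do roll your own evaluation, the sketch below shows the basic per-test accuracy computation. The gold-label field name and the per-test identifier used here are assumptions about the evaluation file, so treat eval.py as the reference implementation.

```python
import json
from collections import defaultdict

def accuracy_by_test(submission_file):
    """Minimal sketch of per-stress-test accuracy. The "gold_label" field and
    the "test_name" identifier are assumptions about the evaluation file;
    eval.py remains the reference implementation."""
    correct, total = defaultdict(int), defaultdict(int)
    with open(submission_file) as f:
        for line in f:
            example = json.loads(line)
            test = example.get("test_name", "all")
            total[test] += 1
            correct[test] += int(example["prediction"] == example["gold_label"])
    return {test: correct[test] / total[test] for test in total}

print(accuracy_by_test("sample_submission.jsonl"))
```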
Please consider citing [1] if you use these stress tests to evaluate Natural Language Inference models.
[1] A. Naik, A. Ravichander, N. Sadeh, C. Rose, and G. Neubig, Stress Test Evaluation for Natural Language Inference. In Proceedings of the 27th International Conference on Computational Linguistics (COLING), 2018.
@inproceedings{naik18coling,
  title = {Stress Test Evaluation for Natural Language Inference},
  author = {Aakanksha Naik and Abhilasha Ravichander and Norman Sadeh and Carolyn Rose and Graham Neubig},
  booktitle = {The 27th International Conference on Computational Linguistics (COLING)},
  address = {Santa Fe, New Mexico, USA},
  month = {August},
  year = {2018}
}