
NLI_StressTest

Stress testing is a methodology in which systems are tested to confirm that they meet their intended specifications and to identify weaknesses.

For Natural Language Inference, our stress tests are large-scale, automatically constructed evaluation suites that test systems on a phenomenon-by-phenomenon basis. Each evaluation set focuses on a single phenomenon so as not to introduce confounding factors, thereby providing a testbed for fine-grained evaluation and analysis.

Stress tests for word overlap, negation, length mismatch, antonyms, noise, and numerical reasoning, as described in the paper [1], can be downloaded directly here. You can also find other resources related to this work on our website.

This repository contains the code used to automatically generate the word overlap, negation, length mismatch, antonym, spelling error, and numerical reasoning stress tests, and is intended to help generate stress tests for new data. To evaluate your models, please use the generated stress tests.

Competence Tests

  1. gen_num_test.py, quant_ner.py: These files perform the preprocessing steps (splitting word problems into sentences, removing sentences with long rationales, and removing sentences that do not contain named entities) and create a set of useful premise sentences for the quantitative reasoning stress test.
  2. quant_example_gen.py: This file uses the set of useful premise sentences generated during preprocessing to create entailed, contradictory, and neutral hypotheses for the quantitative reasoning stress test (a sketch of this step follows the run commands below).
  3. make_antonym_adv_samples.py: This file samples sentences from the MultiNLI development set as possible premises and generates contradictory hypotheses (see the sketch after the antonym run command below).

How to Run

Numerical Reasoning:

  1. Run python gen_num_test.py INPUT_FILE OUTPUT_FILE
  2. Run python quant_ner.py
  3. Run python quant_example_gen.py
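
As an illustration of what quant_example_gen.py does, here is a minimal sketch of producing entailed, contradictory, and neutral hypotheses from a premise that mentions a count. The function name, the perturbation offsets, and the "unverifiable detail" trick are illustrative assumptions, not the repository's actual implementation.

    import random
    import re

    def make_quant_hypotheses(premise):
        """Illustrative sketch: derive entailed / contradictory / neutral
        hypotheses from a premise containing a count."""
        match = re.search(r"\b[1-9]\d*\b", premise)
        if match is None:
            return None
        value = int(match.group())
        start, end = match.span()

        def with_number(text):
            return premise[:start] + text + premise[end:]

        return {
            # "more than value - 1" is implied by the premise -> entailment
            "entailment": with_number("more than " + str(value - 1)),
            # a different exact count conflicts with the premise -> contradiction
            "contradiction": with_number(str(value + random.randint(1, 5))),
            # adding an unverifiable detail leaves the label undetermined -> neutral
            "neutral": premise.rstrip(".") + " in the morning.",
        }

    if __name__ == "__main__":
        print(make_quant_hypotheses("John bought 7 apples at the market."))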

Antonyms

  1. Run python make_antonym_adv_samples.py --base_dir MULTINLI_DIRECTORY
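
make_antonym_adv_samples.py builds contradictory hypotheses by antonym substitution. The sketch below shows the general idea using NLTK's WordNet interface; the helper functions are illustrative and are not the repository's code.

    from nltk.corpus import wordnet as wn  # requires a one-time nltk.download('wordnet')

    def first_antonym(word):
        """Return a WordNet antonym of `word`, if one exists (illustrative helper)."""
        for synset in wn.synsets(word):
            for lemma in synset.lemmas():
                if lemma.antonyms():
                    return lemma.antonyms()[0].name().replace("_", " ")
        return None

    def make_antonym_pair(premise):
        """Replace the first token that has an antonym; the original sentence and
        the perturbed one then form a 'contradiction' pair."""
        tokens = premise.split()
        for i, token in enumerate(tokens):
            antonym = first_antonym(token.lower())
            if antonym is not None:
                hypothesis = " ".join(tokens[:i] + [antonym] + tokens[i + 1:])
                return {"sentence1": premise,
                        "sentence2": hypothesis,
                        "gold_label": "contradiction"}
        return None

    if __name__ == "__main__":
        print(make_antonym_pair("The museum was open on Sunday"))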

Distraction Tests

  1. make_distraction_adv_samples_jsonl.py: This file generates the word overlap, negation, and length mismatch tests.

How to Run

  1. Run python make_distraction_adv_samples_jsonl.py TAUTOLOGY_STRING INPUT_FILE OUTPUT_FILE (this script needs the data_preprocessing.py file provided by the MultiNLI creators to run).
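
The distraction tests work by appending a tautological clause (the TAUTOLOGY_STRING argument) to one side of each pair, which should not change the gold label. Below is a rough sketch of the idea; the exact tautology strings and the way the script reads and writes jsonl may differ from this illustration.

    import json

    # Illustrative tautology strings; the idea is to append a vacuously true
    # clause (e.g. with high word overlap, or containing a negation), and a
    # longer repetition of it for the length mismatch test.
    WORD_OVERLAP = "and true is true"
    NEGATION = "and false is not true"
    LENGTH_MISMATCH = " ".join(["and true is true"] * 5)

    def append_tautology(example, tautology, to_premise=False):
        """Append the tautology to the hypothesis (or to the premise, e.g. for
        length mismatch); the gold label is left unchanged."""
        perturbed = dict(example)
        field = "sentence1" if to_premise else "sentence2"
        perturbed[field] = perturbed[field].rstrip(" .") + " " + tautology + " ."
        return perturbed

    if __name__ == "__main__":
        example = {"sentence1": "A man is playing a guitar.",
                   "sentence2": "A man is performing music.",
                   "gold_label": "entailment"}
        print(json.dumps(append_tautology(example, NEGATION)))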

Noise Tests

  1. make_grammar_adv_samples_jsonl.py: This file generates the noise test, producing premise-hypothesis pairs in which a word is perturbed by a keyboard-swap spelling error (a sketch of the perturbation follows the run command below).

How to Run

  1. Run python make_grammar_adv_samples_jsonl.py --base_dir MULTINLI_DIRECTORY
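
The perturbation itself is a simple character-level edit. Here is a minimal sketch of a keyboard-swap spelling error with an illustrative (partial) adjacency map; it is not the repository's implementation.

    import random

    # Partial QWERTY adjacency map, for illustration only.
    ADJACENT_KEYS = {
        "a": "qwsz", "e": "wrsd", "i": "uojk", "o": "ipkl",
        "n": "bmhj", "r": "etdf", "s": "adwexz", "t": "ryfg",
    }

    def keyboard_swap(word, rng=random):
        """Replace one character of `word` with a neighbouring key on the keyboard."""
        positions = [i for i, ch in enumerate(word.lower()) if ch in ADJACENT_KEYS]
        if not positions:
            return word
        i = rng.choice(positions)
        neighbour = rng.choice(ADJACENT_KEYS[word[i].lower()])
        return word[:i] + neighbour + word[i + 1:]

    if __name__ == "__main__":
        print(keyboard_swap("performing"))  # e.g. "perfirming"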

The generated stress tests are also available at: https://abhilasharavichander.github.io/NLI_StressTest/

Evaluation Script

If you want to evaluate your system on all stress tests at once, you can use the provided evaluation script. Usage is as follows:

  1. You will need to report your predictions on the test file found here.
  2. Write out model predictions as a "prediction" field for each sample in the evaluation set. (Sample submission files are available as sample_submission.jsonl and sample_submission.txt.)
  3. Run the evaluation script with the command python eval.py --eval_file SUBMISSION_FILE > REPORT_FILE.txt

Alternatively, you may write your own evaluation function: our script simply computes classification accuracy on each stress test.
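
If you do write your own evaluation, the core computation is per-test accuracy over the "prediction" field. A rough sketch is given below; the "stress_test" field used to group examples is an assumption for illustration, not necessarily a field in the released files.

    import json
    from collections import defaultdict

    def accuracy_per_stress_test(submission_path):
        """Compute classification accuracy per stress test from a jsonl submission
        whose lines contain 'gold_label', 'prediction', and (assumed here) a
        'stress_test' field identifying the evaluation set."""
        correct = defaultdict(int)
        total = defaultdict(int)
        with open(submission_path) as f:
            for line in f:
                example = json.loads(line)
                test = example.get("stress_test", "all")  # assumed field name
                total[test] += 1
                if example["prediction"] == example["gold_label"]:
                    correct[test] += 1
        return {test: correct[test] / total[test] for test in sorted(total)}

    if __name__ == "__main__":
        for test, acc in accuracy_per_stress_test("sample_submission.jsonl").items():
            print(test, round(acc, 4))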

References

Please consider citing [1] if you use these stress tests to evaluate Natural Language Inference models.

Stress Test Evaluation for Natural Language Inference (COLING 2018)

[1] A. Naik, A. Ravichander, N. Sadeh, C. Rose, G. Neubig, Stress Test Evaluation for Natural Language Inference, COLING 2018.

@inproceedings{naik18coling,
  title = {Stress Test Evaluation for Natural Language Inference},
  author = {Aakanksha Naik and Abhilasha Ravichander and Norman Sadeh and Carolyn Rose and Graham Neubig},
  booktitle = {The 27th International Conference on Computational Linguistics (COLING)},
  address = {Santa Fe, New Mexico, USA},
  month = {August},
  year = {2018}
}