v2.0.0
This is a major release that introduces statistical significance testing for BLEU, chrF and TER. Note that as of v2.0.0, the default output format of the CLI utility is `json` rather than the old single-line output. Tools that parse standard output should adapt to this change.
- Build: Add Windows and OS X testing to the GitHub workflow.
- Improve documentation and type annotations.
- Drop Python < 3.6 support and migrate to f-strings.
- Drop input type manipulation through `isinstance` checks. If the user does not follow the expected annotations, exceptions will be raised. Robustness attempts led to confusion and obfuscated score errors in the past (fixes #121).
- Use colored strings in tabular outputs (multi-system evaluation mode) with the help of the `colorama` package.
- tokenizers: Add caching to tokenizers, which speeds things up a bit.
- `intl` tokenizer: Use the `regex` module. Speed goes from ~4 seconds to ~0.6 seconds for a particular test set evaluation (fixes #46).
- Signature: Formatting changed, mostly to remove the '+' separator, as it was interfering with chrF++. The field separator is now '|' and key-value pairs are separated with ':' rather than '.'.
- Metrics: Scale all metrics into the [0, 100] range (fixes #140).
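The reworked signature layout described above can be parsed with plain string splitting. A minimal sketch, assuming an illustrative signature string (the field names and values below are examples, not output from a real run):

```python
# Illustrative v2.0-style signature: fields separated by '|',
# key and value within each field separated by ':'.
sig = "nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0"

# Split on the '|' field separator, then once on ':' per field.
fields = dict(part.split(":", 1) for part in sig.split("|"))

print(fields["tok"])      # -> 13a
print(fields["version"])  # -> 2.0.0
```

Since ':' now only separates keys from values, a single `split(":", 1)` per field is enough, which the old '.'-separated format did not allow as cleanly.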
- BLEU: In case of no n-gram matches at all, skip smoothing and return 0.0 BLEU (fixes #141).
- BLEU: Allow modifying `max_ngram_order` (fixes #156).
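To illustrate what varying the maximum n-gram order means, here is a small sketch of n-gram counting with a configurable order. This is illustrative only, not sacreBLEU's implementation; the function name is hypothetical:

```python
from collections import Counter

def ngram_counts(tokens, max_ngram_order=4):
    """Count all n-grams of the token list up to max_ngram_order.

    A sketch of the concept behind the max_ngram_order parameter;
    not sacreBLEU's actual code.
    """
    counts = Counter()
    for n in range(1, max_ngram_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# With max_ngram_order=2, only unigrams and bigrams are collected.
print(ngram_counts("the cat sat".split(), max_ngram_order=2))
```

Lowering the order can be useful for very short segments, where higher-order n-gram matches are rare by construction.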
- CHRF: Added multi-reference support, verified the scores against chrF++.py, added test case.
- CHRF: Added chrF+ support through the `word_order` argument, with test cases against chrF++.py. Exposed it through the CLI (`--chrf-word-order`) (fixes #124).
- CHRF: Added the possibility to disable effective order smoothing (pass `--chrf-eps-smoothing`). This way, the scores obtained are exactly the same as the chrF++, Moses and NLTK implementations. We keep effective ordering as the default for compatibility, since this only affects sentence-level scoring with very short sentences (fixes #144).
- CLI: Allow modifying TER arguments through the CLI. We still keep the TERCOM defaults.
- CLI: Prefix metric-specific arguments with `--chrf` and `--ter`. To maintain compatibility, BLEU argument names are kept the same.
- CLI: Added the `--format/-f` flag. The single-system output mode is now `json` by default. If you want to keep the old text format persistently, you can export `SACREBLEU_FORMAT=text` in your shell.
- CLI: sacreBLEU now supports evaluating multiple systems for a given test set in an efficient way. With the help of the `tabulate` package, the results are nicely rendered into a plain text table, LaTeX, HTML or RST (cf. the `--format/-f` argument). The systems can be given either as a list of plain text files to `-i/--input` or as a tab-separated single stream redirected into `STDIN`. In the former case, the basenames of the files are automatically used as system names.
- Statistical tests: sacreBLEU now supports confidence interval estimation through bootstrap resampling for single-system evaluation (`--confidence` flag) as well as paired bootstrap resampling (`--paired-bs`) and paired approximate randomization tests (`--paired-ar`) when evaluating multiple systems (fixes #40 and #78).
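The idea behind paired bootstrap resampling can be sketched in a few lines. This is an illustrative toy version operating on hypothetical per-sentence scores, not sacreBLEU's implementation (which resamples and rescores at the corpus level):

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=12345):
    """Toy paired bootstrap resampling sketch (not sacreBLEU's code).

    scores_a / scores_b: per-sentence scores of two systems on the
    same test set. Returns the fraction of resamples in which system
    B outscores system A in total.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_samples):
        # Resample sentence indices with replacement; the pairing
        # (same indices for both systems) is what makes the test paired.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_b[i] for i in idx) > sum(scores_a[i] for i in idx):
            wins += 1
    return wins / n_samples
```

If B wins in nearly all resamples, the difference is unlikely to be an artifact of the particular test set sample; paired approximate randomization instead swaps the two systems' outputs sentence-by-sentence at random and checks how often the score difference survives.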