- Build: Add Windows and OS X testing to the GitHub workflow.
- Improve documentation and type annotations.
- Drop `Python < 3.6` support and migrate to f-strings.
- Drop input type manipulation through `isinstance` checks. If the user does not follow the expected annotations, exceptions will be raised. Robustness attempts led to confusion and obfuscated score errors in the past (fixes #121).
- Use colored strings in tabular outputs (multi-system evaluation mode) with the help of the `colorama` package.
- Tokenizers: Add caching to tokenizers, which seems to speed things up a bit.
- `intl` tokenizer: Use the `regex` module. Speed goes from ~4 seconds to ~0.6 seconds for a particular test set evaluation (fixes #46).
- Signature: Formatting changed (mostly to remove the '+' separator, as it was interfering with chrF++). The field separator is now '|' and key values are separated with ':' rather than '.'.
- Metrics: Scale all metrics into the [0, 100] range (fixes #140).
- BLEU: In case of no n-gram matches at all, skip smoothing and return 0.0 BLEU (fixes #141).
- BLEU: Allow modifying `max_ngram_order` (fixes #156).
- CHRF: Added multi-reference support, verified the scores against chrF++.py, added a test case.
- CHRF: Added chrF+ support through the `word_order` argument. Added test cases against chrF++.py. Exposed it through the CLI (`--chrf-word-order`) (fixes #124). See the metric API sketch after this list.
- CHRF: Added the possibility to disable effective-order smoothing (pass `--chrf-eps-smoothing`). This way, the scores obtained are exactly the same as the chrF++, Moses and NLTK implementations. We keep the effective ordering as the default for compatibility, since this only affects sentence-level scoring with very short sentences (fixes #144).
- CLI: Allow modifying TER arguments through the CLI. We still keep the TERCOM defaults.
- CLI: Prefix metric-specific arguments with `--chrf` and `--ter`. To maintain compatibility, BLEU argument names are kept the same.
- CLI: Added the `--format/-f` flag. The single-system output mode is now `json` by default. If you want to keep the old text format persistently, you can export `SACREBLEU_FORMAT=text` into your shell.
- CLI: sacreBLEU now supports evaluating multiple systems for a given test set in an efficient way. Through the use of the `tabulate` package, the results are nicely rendered into a plain-text table, LaTeX, HTML or RST (cf. the `--format/-f` argument). The systems can be given either as a list of plain-text files to `-i/--input` or as a tab-separated single stream redirected into `STDIN`. In the former case, the basenames of the files will automatically be used as system names.
- Statistical tests: sacreBLEU now supports confidence interval estimation through bootstrap resampling for single-system evaluation (the `--confidence` flag) as well as paired bootstrap resampling (`--paired-bs`) and paired approximate randomization tests (`--paired-ar`) when evaluating multiple systems (fixes #40 and #78). See the paired bootstrap sketch after this list.
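To illustrate the metric changes listed above (configurable `max_ngram_order`, the `word_order` argument, [0, 100] scaling, and the new signature format), here is a minimal sketch using the Python metric classes. The import path, constructor arguments and `corpus_score`/`get_signature` calls reflect the 2.0-style API as I understand it; treat the exact signatures as assumptions rather than a reference.

```python
# Minimal sketch of the metric API changes described above
# (assumed import path and call signatures; not an authoritative reference).
from sacrebleu.metrics import BLEU, CHRF

hyps = ["the cat sat on the mat", "hello there general kenobi"]
refs = [  # one list per reference stream (multi-reference support)
    ["the cat is on the mat", "hello there general kenobi"],
    ["there is a cat on the mat", "hello there general kenobi !"],
]

bleu = BLEU(max_ngram_order=4)   # `max_ngram_order` is now configurable
chrf = CHRF(word_order=2)        # word_order=2 corresponds to the chrF++ setting

bleu_score = bleu.corpus_score(hyps, refs)
chrf_score = chrf.corpus_score(hyps, refs)

# All metrics are scaled into the [0, 100] range.
print(bleu_score.score, chrf_score.score)

# Signatures now use '|' as the field separator and ':' between keys and values.
print(bleu.get_signature())
```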
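The paired bootstrap resampling mentioned in the last item can be sketched in a few lines. The snippet below illustrates the technique itself (Koehn-style resampling of sentence indices with replacement), not sacreBLEU's internal implementation; the helper name `paired_bootstrap` and the reuse of the `corpus_score`/`.score` calls from the sketch above are assumptions for the example.

```python
import random

from sacrebleu.metrics import BLEU  # assumed import path, as above


def paired_bootstrap(hyps_a, hyps_b, refs, n_samples=1000, seed=12345):
    """Illustrative paired bootstrap resampling; not sacreBLEU's internal code.

    hyps_a, hyps_b: outputs of two systems (lists of str).
    refs: list of reference streams (each a list of str), aligned with the hypotheses.
    Returns a one-sided estimate of p(A is not better than B).
    """
    rng = random.Random(seed)
    metric = BLEU()
    n = len(hyps_a)
    wins_a = 0
    for _ in range(n_samples):
        # Resample sentence indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        sample_a = [hyps_a[i] for i in idx]
        sample_b = [hyps_b[i] for i in idx]
        sample_refs = [[stream[i] for i in idx] for stream in refs]
        # Recompute the corpus-level metric on the resampled test set for both systems.
        score_a = metric.corpus_score(sample_a, sample_refs).score
        score_b = metric.corpus_score(sample_b, sample_refs).score
        if score_a > score_b:
            wins_a += 1
    return 1.0 - wins_a / n_samples
```

In practice the `--paired-bs` and `--paired-ar` flags listed above do this work for you across all systems; the sketch is only meant to show what "paired bootstrap resampling" refers to.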
1 parent 90a4b8a, commit 078c440
Showing 41 changed files with 4,487 additions and 1,891 deletions.
| Dataset | Description |
| ------- | ----------- |
| mtedx/valid | mTEDx evaluation data, valid: [URL](http://openslr.org/100) |
| mtedx/test | mTEDx evaluation data, test: [URL](http://openslr.org/100) |
| wmt20/robust/set1 | WMT20 robustness task, set 1 |
| wmt20/robust/set2 | WMT20 robustness task, set 2 |
| wmt20/robust/set3 | WMT20 robustness task, set 3 |
| wmt20/tworefs | WMT20 news test sets with two references |
| wmt20 | Official evaluation data for WMT20 |
| mtnt2019 | Test set for the WMT 19 robustness shared task |
| mtnt1.1/test | Test data for the Machine Translation of Noisy Text task: [URL](http://www.cs.cmu.edu/~pmichel1/mtnt/) |
| mtnt1.1/valid | Validation data for the Machine Translation of Noisy Text task: [URL](http://www.cs.cmu.edu/~pmichel1/mtnt/) |
| mtnt1.1/train | Training data for the Machine Translation of Noisy Text task: [URL](http://www.cs.cmu.edu/~pmichel1/mtnt/) |
| wmt20/dev | Development data for tasks new to 2020. |
| wmt19 | Official evaluation data. |
| wmt19/dev | Development data for tasks new to 2019. |
| wmt19/google/ar | Additional high-quality reference for WMT19/en-de. |
| wmt19/google/arp | Additional paraphrase of wmt19/google/ar. |
| wmt19/google/wmtp | Additional paraphrase of the official WMT19 reference. |
| wmt19/google/hqr | Best human-selected reference between wmt19 and wmt19/google/ar. |
| wmt19/google/hqp | Best human-selected reference between wmt19/google/arp and wmt19/google/wmtp. |
| wmt19/google/hqall | Best human-selected reference among the original official reference and the Google reference and paraphrases. |
| wmt18 | Official evaluation data. |
| wmt18/test-ts | Official evaluation sources with extra test sets interleaved. |
| wmt18/dev | Development data (Estonian<>English). |
| wmt17 | Official evaluation data. |
| wmt17/B | Additional reference for EN-FI and FI-EN. |
| wmt17/tworefs | Systems with two references. |
| wmt17/improved | Improved zh-en and en-zh translations. |
| wmt17/dev | Development sets released for new languages in 2017. |
| wmt17/ms | Additional Chinese-English references from Microsoft Research. |
| wmt16 | Official evaluation data. |
| wmt16/B | Additional reference for EN-FI. |
| wmt16/tworefs | EN-FI with two references. |
| wmt16/dev | Development sets released for new languages in 2016. |
| wmt15 | Official evaluation data. |
| wmt14 | Official evaluation data. |
| wmt14/full | Evaluation data released after official evaluation for further research. |
| wmt13 | Official evaluation data. |
| wmt12 | Official evaluation data. |
| wmt11 | Official evaluation data. |
| wmt10 | Official evaluation data. |
| wmt09 | Official evaluation data. |
| wmt08 | Official evaluation data. |
| wmt08/nc | Official evaluation data (news commentary). |
| wmt08/europarl | Official evaluation data (Europarl). |
| iwslt17 | Official evaluation data for IWSLT. |
| iwslt17/tst2016 | Development data for IWSLT 2017. |
| iwslt17/tst2015 | Development data for IWSLT 2017. |
| iwslt17/tst2014 | Development data for IWSLT 2017. |
| iwslt17/tst2013 | Development data for IWSLT 2017. |
| iwslt17/tst2012 | Development data for IWSLT 2017. |
| iwslt17/tst2011 | Development data for IWSLT 2017. |
| iwslt17/tst2010 | Development data for IWSLT 2017. |
| iwslt17/dev2010 | Development data for IWSLT 2017. |
| multi30k/2016 | 2016 flickr test set of Multi30k dataset |
| multi30k/2017 | 2017 flickr test set of Multi30k dataset |
| multi30k/2018 | 2018 flickr test set of Multi30k dataset. See [URL](https://competitions.codalab.org/competitions/19917) for evaluation. |