Merge changes for 2.0.0 (#152)
  - Build: Add Windows and OS X testing to the GitHub workflow
  - Improve documentation and type annotations.
  - Drop `Python < 3.6` support and migrate to f-strings.
  - Drop input type manipulation through `isinstance` checks. If the user does not comply
    with the expected annotations, exceptions will be raised. Robustness attempts led to
    confusion and obfuscated score errors in the past. (fixes #121)
  - Use colored strings in tabular outputs (multi-system evaluation mode) with
    the help of the `colorama` package.
  - tokenizers: Add caching to tokenizers, which seems to speed things up a bit.
  - `intl` tokenizer: Use `regex` module. Speed goes from ~4 seconds to ~0.6 seconds
    for a particular test set evaluation. (fixes #46)
  - Signature: Formatting changed (mostly to remove the '+' separator, as it was
    interfering with chrF++). The field separator is now '|' and keys and values
    are separated with ':' rather than '.'.
  - Metrics: Scale all metrics into the [0, 100] range (fixes #140)
  - BLEU: In case of no n-gram matches at all, skip smoothing and return 0.0 BLEU (fixes #141).
  - BLEU: Allow modifying `max_ngram_order` (fixes #156)
  - CHRF: Added multi-reference support, verified the scores against chrF++.py, added test case.
  - CHRF: Added chrF+ support through the `word_order` argument. Added test cases against chrF++.py.
    Exposed it through the CLI (--chrf-word-order) (fixes #124; see the API sketch after this list)
  - CHRF: Add possibility to disable effective order smoothing (pass --chrf-eps-smoothing).
    This way, the scores obtained exactly match the chrF++, Moses and NLTK implementations.
    We keep the effective ordering as the default for compatibility, since this only
    affects sentence-level scoring with very short sentences. (fixes #144)
  - CLI: Allow modifying TER arguments through the CLI. We still keep the TERCOM defaults.
  - CLI: Prefix metric-specific arguments with --chrf and --ter. To maintain compatibility, BLEU argument names are kept the same.
  - CLI: Added the `--format/-f` flag. The single-system output mode is now `json` by default.
    If you want to keep the old text format persistently, you can export `SACREBLEU_FORMAT=text`
    in your shell.
  - CLI: sacreBLEU now supports evaluating multiple systems for a given test set
    in an efficient way. Through the use of the `tabulate` package, the results are
    nicely rendered into a plain text table, LaTeX, HTML or RST (cf. the --format/-f argument).
    The systems can be given either as a list of plain text files to `-i/--input` or
    as a single tab-separated stream redirected into `STDIN`. In the former case,
    the basenames of the files will automatically be used as system names.
  - Statistical tests: sacreBLEU now supports confidence interval estimation
    through bootstrap resampling for single-system evaluation (`--confidence` flag)
    as well as paired bootstrap resampling (`--paired-bs`) and paired approximate
    randomization tests (`--paired-ar`) when evaluating multiple systems (fixes #40 and fixes #78).
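
A minimal sketch of the 2.0.0 Python usage outlined above (not part of this commit; it assumes the metric classes `BLEU`, `CHRF` and `TER` expose a `corpus_score()` method and that the constructor arguments `max_ngram_order` and `word_order` mirror the options listed here):

```python
from sacrebleu.metrics import BLEU, CHRF, TER

# Multi-reference layout: one list per reference stream, aligned with the hypotheses.
refs = [
    ["The dog bit the man.", "It was not unexpected.", "The man had just bitten him."],
    ["The dog had bit the man.", "No one was surprised.", "The man had bitten the dog."],
]
hyps = ["The dog bit the man.", "It wasn't surprising.", "The man had just bitten him."]

bleu = BLEU(max_ngram_order=4)        # scores are reported on the [0, 100] scale
print(bleu.corpus_score(hyps, refs))
print(bleu.get_signature())           # '|'-separated fields, ':' between key and value

chrf = CHRF(word_order=2)             # word_order=2 corresponds to chrF++, 0 to plain chrF
print(chrf.corpus_score(hyps, refs))

ter = TER()
print(ter.corpus_score(hyps, refs))
```
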
ozancaglayan authored Jul 18, 2021
1 parent 90a4b8a commit 078c440
Showing 41 changed files with 4,487 additions and 1,891 deletions.
48 changes: 34 additions & 14 deletions .github/workflows/check-build.yml
@@ -3,20 +3,40 @@ name: check-build
 on:
   pull_request
 
+env:
+  PYTHONUTF8: "1"
+
 jobs:
   check-build:
-    runs-on: ubuntu-20.04
+    runs-on: ${{ matrix.os }}
+    strategy:
+      matrix:
+        os: [ubuntu-latest, macos-latest, windows-latest]
+        python-version: [3.6, 3.7, 3.8, 3.9]
+        exclude:
+          - os: windows-latest
+            python-version: '3.6' # test fails due to UTF8 stuff
     steps:
-      - name: update
-        run: sudo apt-get -y update
-      - name: install pytest-cov
-        run: pip install pytest-cov
-      - uses: actions/checkout@v1
-      - name: install
-        run: sudo python3 setup.py install
-      - name: install-ja
-        run: sudo pip install .[ja]
-      - name: pytest
-        run: python3 -m pytest
-      - name: test
-        run: ./test.sh
+      # - name: update
+      #   run: sudo apt-get -y update
+      - uses: actions/checkout@v2
+      - name: Setup Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v2
+        with:
+          python-version: ${{ matrix.python-version }}
+      - if: matrix.os == 'macos-latest'
+        name: Install Mac OS requirements
+        run: brew install bash
+      - if: matrix.os == 'windows-latest'
+        name: Install Windows requirements
+        run: choco install wget unzip
+      - name: Install python dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install pytest-cov
+          pip install .[ja]
+      - name: Python pytest test suite
+        run: python3 -m pytest
+      - name: CLI bash test suite
+        shell: bash
+        run: ./test.sh
64 changes: 63 additions & 1 deletion CHANGELOG.md
@@ -1,4 +1,66 @@
-# VERSION HISTORY
+# Release Notes

- 2.0.0 (2021-07-XX)
  - Build: Add Windows and OS X testing to the GitHub workflow.
- Improve documentation and type annotations.
- Drop `Python < 3.6` support and migrate to f-strings.
- Relax `portalocker` version pinning, add `regex, tabulate, numpy` dependencies.
  - Drop input type manipulation through `isinstance` checks. If the user does not comply
    with the expected annotations, exceptions will be raised. Robustness attempts led to
    confusion and obfuscated score errors in the past (#121)
  - A variable number of references per segment is now supported for all metrics by default. This is
    still only available through the API.
  - Use colored strings in tabular outputs (multi-system evaluation mode) with
    the help of the `colorama` package.
  - tokenizers: Add caching to tokenizers, which seems to speed things up a bit.
- `intl` tokenizer: Use `regex` module. Speed goes from ~4 seconds to ~0.6 seconds
for a particular test set evaluation. (#46)
  - Signature: Formatting changed (mostly to remove the '+' separator, as it was
    interfering with chrF++). The field separator is now '|' and keys and values
    are separated with ':' rather than '.'.
- Signature: Boolean true / false values are shortened to yes / no.
- Signature: Number of references is `var` if variable number of references is used.
- Signature: Add effective order (yes/no) to BLEU and chrF signatures.
- Metrics: Scale all metrics into the [0, 100] range (#140)
- Metrics API: Use explicit argument names and defaults for the metrics instead of
passing obscure `argparse.Namespace` objects.
- Metrics API: A base abstract `Metric` class is introduced to guide further
metric development. This class defines the methods that should be implemented
in the derived classes and offers boilerplate methods for the common functionality.
A new metric implemented this way will automatically support significance testing.
  - Metrics API: All metrics now receive an optional `references` argument at
    initialization time to process and cache the references. Further evaluations
    of different systems against the same references become faster this way,
    for example when using significance testing (see the sketch below).
- BLEU: In case of no n-gram matches at all, skip smoothing and return 0.0 BLEU (#141).
- CHRF: Added multi-reference support, verified the scores against chrF++.py, added test case.
  - CHRF: Added chrF+ support through the `word_order` argument. Added test cases against chrF++.py.
    Exposed it through the CLI (--chrf-word-order) (#124)
  - CHRF: Add possibility to disable effective order smoothing (pass --chrf-eps-smoothing).
    This way, the scores obtained exactly match the chrF++, Moses and NLTK implementations.
    We keep the effective ordering as the default for compatibility, since this only
    affects sentence-level scoring with very short sentences. (#144)
  - CLI: `--input/-i` can now ingest multiple systems. For this reason, the positional
    `references` should always precede the `-i` flag.
  - CLI: Allow modifying TER arguments through the CLI. We still keep the TERCOM defaults.
  - CLI: Prefix metric-specific arguments with --chrf and --ter. To maintain compatibility,
    BLEU argument names are kept the same.
- CLI: Separate metric-specific arguments for clarity when `--help` is printed.
  - CLI: Added the `--format/-f` flag. The single-system output mode is now `json` by default.
    If you want to keep the old text format persistently, you can export `SACREBLEU_FORMAT=text`
    in your shell.
- CLI: For multi-system mode, `json` falls back to plain text. `latex` output can only
be generated for multi-system mode.
- CLI: sacreBLEU now supports evaluating multiple systems for a given test set
    in an efficient way. Through the use of the `tabulate` package, the results are
    nicely rendered into a plain text table, LaTeX, HTML or RST (cf. the --format/-f argument).
    The systems can be given either as a list of plain text files to `-i/--input` or
    as a single tab-separated stream redirected into `STDIN`. In the former case,
    the basenames of the files will automatically be used as system names.
- Statistical tests: sacreBLEU now supports confidence interval estimation
through bootstrap resampling for single-system evaluation (`--confidence` flag)
as well as paired bootstrap resampling (`--paired-bs`) and paired approximate
randomization tests (`--paired-ar`) when evaluating multiple systems (#40 and #78).

- 1.5.1 (2021-03-05)
- Fix extraction error for WMT18 extra test sets (test-ts) (#142)
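
A rough sketch of the reference-caching pattern described in the 2.0.0 notes above (an illustration only; the `references` constructor argument comes from the changelog entry, while passing `None` to `corpus_score()` once references are cached is an assumption):

```python
from sacrebleu.metrics import BLEU, CHRF

# One list per reference stream, aligned with the hypothesis segments.
refs = [
    ["The dog bit the man.", "It was not unexpected."],
    ["The dog had bit the man.", "No one was surprised."],
]

# Hypothetical outputs from two systems being compared.
systems = {
    "systemA": ["The dog bit the man.", "It wasn't surprising."],
    "systemB": ["A dog bit a man.", "Nobody was surprised."],
}

# References are processed and cached once at construction time ...
bleu, chrf = BLEU(references=refs), CHRF(references=refs)

# ... so scoring several systems against the same references avoids re-processing them.
for name, outputs in systems.items():
    print(name, bleu.corpus_score(outputs, None), chrf.corpus_score(outputs, None))
```
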
58 changes: 58 additions & 0 deletions DATASETS.md
@@ -0,0 +1,58 @@
| Dataset | Description |
| ------------------------------ | ------------------------------------------------------------------------------------------------------------------- |
| mtedx/valid | mTEDx evaluation data, valid: [URL](http://openslr.org/100) |
| mtedx/test | mTEDx evaluation data, test: [URL](http://openslr.org/100) |
| wmt20/robust/set1 | WMT20 robustness task, set 1 |
| wmt20/robust/set2 | WMT20 robustness task, set 2 |
| wmt20/robust/set3 | WMT20 robustness task, set 3 |
| wmt20/tworefs | WMT20 news test sets with two references |
| wmt20 | Official evaluation data for WMT20 |
| mtnt2019 | Test set for the WMT 19 robustness shared task |
| mtnt1.1/test | Test data for the Machine Translation of Noisy Text task: [URL](http://www.cs.cmu.edu/~pmichel1/mtnt/) |
| mtnt1.1/valid | Validation data for the Machine Translation of Noisy Text task: [URL](http://www.cs.cmu.edu/~pmichel1/mtnt/) |
| mtnt1.1/train | Training data for the Machine Translation of Noisy Text task: [URL](http://www.cs.cmu.edu/~pmichel1/mtnt/) |
| wmt20/dev | Development data for tasks new to 2020. |
| wmt19 | Official evaluation data. |
| wmt19/dev | Development data for tasks new to 2019. |
| wmt19/google/ar | Additional high-quality reference for WMT19/en-de. |
| wmt19/google/arp | Additional paraphrase of wmt19/google/ar. |
| wmt19/google/wmtp | Additional paraphrase of the official WMT19 reference. |
| wmt19/google/hqr               | Best human-selected reference between wmt19 and wmt19/google/ar.                                                      |
| wmt19/google/hqp | Best human-selected reference between wmt19/google/arp and wmt19/google/wmtp. |
| wmt19/google/hqall | Best human-selected reference among original official reference and the Google reference and paraphrases. |
| wmt18 | Official evaluation data. |
| wmt18/test-ts | Official evaluation sources with extra test sets interleaved. |
| wmt18/dev | Development data (Estonian<>English). |
| wmt17 | Official evaluation data. |
| wmt17/B | Additional reference for EN-FI and FI-EN. |
| wmt17/tworefs | Systems with two references. |
| wmt17/improved | Improved zh-en and en-zh translations. |
| wmt17/dev | Development sets released for new languages in 2017. |
| wmt17/ms | Additional Chinese-English references from Microsoft Research. |
| wmt16 | Official evaluation data. |
| wmt16/B | Additional reference for EN-FI. |
| wmt16/tworefs | EN-FI with two references. |
| wmt16/dev | Development sets released for new languages in 2016. |
| wmt15 | Official evaluation data. |
| wmt14 | Official evaluation data. |
| wmt14/full | Evaluation data released after official evaluation for further research. |
| wmt13 | Official evaluation data. |
| wmt12 | Official evaluation data. |
| wmt11 | Official evaluation data. |
| wmt10 | Official evaluation data. |
| wmt09 | Official evaluation data. |
| wmt08 | Official evaluation data. |
| wmt08/nc | Official evaluation data (news commentary). |
| wmt08/europarl | Official evaluation data (Europarl). |
| iwslt17 | Official evaluation data for IWSLT. |
| iwslt17/tst2016 | Development data for IWSLT 2017. |
| iwslt17/tst2015 | Development data for IWSLT 2017. |
| iwslt17/tst2014 | Development data for IWSLT 2017. |
| iwslt17/tst2013 | Development data for IWSLT 2017. |
| iwslt17/tst2012 | Development data for IWSLT 2017. |
| iwslt17/tst2011 | Development data for IWSLT 2017. |
| iwslt17/tst2010 | Development data for IWSLT 2017. |
| iwslt17/dev2010 | Development data for IWSLT 2017. |
| multi30k/2016 | 2016 flickr test set of Multi30k dataset |
| multi30k/2017 | 2017 flickr test set of Multi30k dataset |
| multi30k/2018 | 2018 flickr test set of Multi30k dataset. See [URL](https://competitions.codalab.org/competitions/19917) for evaluation. |
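
The dataset names in the first column are the identifiers passed to sacreBLEU's `-t/--test-set` option. A small sketch of resolving them programmatically (assuming the helper functions in `sacrebleu.utils` keep these names; files are downloaded on first use):

```python
from sacrebleu.utils import get_available_testsets, get_source_file, get_reference_files

print(get_available_testsets())                     # includes entries such as 'wmt20' and 'mtnt2019'
src_path = get_source_file("wmt20", "en-de")        # local path to the source side of the test set
ref_paths = get_reference_files("wmt20", "en-de")   # list of local reference file paths
print(src_path, ref_paths)
```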