Currently, master is a pre-release of our work for the WMT 2022 Metrics and QE shared tasks! Use version 1.1.3 if you are looking for a stable version!
We are planning a new release with better models and new features for January.
COMET requires Python 3.8 or above!
Simple installation from PyPI
pip install --upgrade pip # ensures that pip is current
pip install unbabel-comet
To develop locally, run the following commands:
git clone https://github.com/Unbabel/COMET
cd COMET
pip install poetry
poetry install
For development, you can run the CLI tools directly, e.g.,
PYTHONPATH=. ./comet/cli/score.py
Test examples:
echo -e "Dem Feuer konnte Einhalt geboten werden\nSchulen und Kindergärten wurden eröffnet." >> src.de
echo -e "The fire could be stopped\nSchools and kindergartens were open" >> hyp1.en
echo -e "The fire could have been stopped\nSchools and pre-school were open" >> hyp2.en
echo -e "They were able to control the fire.\nSchools and kindergartens opened" >> ref.en
Basic scoring command:
comet-score -s src.de -t hyp1.en -r ref.en
You can set --gpus 0 to test on CPU.
Scoring multiple systems:
comet-score -s src.de -t hyp1.en hyp2.en -r ref.en
WMT test sets via SacreBLEU:
comet-score -d wmt20:en-de -t PATH/TO/TRANSLATIONS
The default setting of comet-score prints the score for each segment individually. If you are only interested in a system-level score, you can use the --quiet flag.
comet-score -s src.de -t hyp1.en -r ref.en --quiet
For reference-free evaluation (no reference needed), you can use a QE model such as wmt22-cometkiwi-da:
comet-score -s src.de -t hyp1.en --model wmt22-cometkiwi-da
When comparing multiple MT systems, we encourage you to run the comet-compare command to obtain statistical significance with paired t-tests and bootstrap resampling (Koehn, 2004).
comet-compare -s src.de -t hyp1.en hyp2.en hyp3.en -r ref.en
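For intuition, here is a toy sketch of the paired bootstrap resampling idea over segment-level scores. The function below is a hypothetical illustration, not the comet-compare implementation:

import random

def paired_bootstrap(scores_a, scores_b, n_resamples=1000):
    # scores_a / scores_b: segment-level COMET scores of two systems on the same test set
    assert len(scores_a) == len(scores_b)
    n = len(scores_a)
    wins_a = 0
    for _ in range(n_resamples):
        idx = [random.randrange(n) for _ in range(n)]  # resample segments with replacement
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins_a += 1
    return wins_a / n_resamples  # fraction of resamples in which system A wins

A value close to 1.0 (or 0.0) suggests a consistent difference between the two systems.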
The MBR command allows you to rank MT hypotheses and select the best one according to COMET. For more details you can read our paper on Quality-Aware Decoding for Neural Machine Translation.
Our implementation is inspired by Amrhein et al. (2022), where sentences are cached during inference to avoid quadratic computation when creating the sentence embeddings.
comet-mbr -s [SOURCE].txt -t [MT_SAMPLES].txt --num_sample [X] -o [OUTPUT_FILE].txt
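As a rough illustration of the MBR idea (not the comet-mbr implementation), each hypothesis can be scored against the other samples used as pseudo-references, and the hypothesis with the highest average COMET score is selected. The sketch below reuses the Python scoring API described later in this README:

from comet import download_model, load_from_checkpoint

model = load_from_checkpoint(download_model("wmt20-comet-da"))

src = "Dem Feuer konnte Einhalt geboten werden"
samples = [
    "The fire could be stopped",
    "The fire could have been stopped",
    "They were able to control the fire.",
]

# Score every hypothesis against every other sample used as a pseudo-reference
data, owner = [], []
for hyp in samples:
    for pseudo_ref in samples:
        if pseudo_ref is not hyp:
            data.append({"src": src, "mt": hyp, "ref": pseudo_ref})
            owner.append(hyp)

scores = model.predict(data, batch_size=8, gpus=0).scores

# Average the utility of each hypothesis over its pseudo-references and pick the best
totals = {hyp: [0.0, 0] for hyp in samples}
for hyp, score in zip(owner, scores):
    totals[hyp][0] += score
    totals[hyp][1] += 1
best = max(samples, key=lambda hyp: totals[hyp][0] / totals[hyp][1])
print(best)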
COMET is optimized to be used on a single GPU by taking advantage of length batching and embedding caching. When using multiple GPUs, the data is spread across GPUs, so we typically get fewer cache hits, and the length-batching sampler is replaced by a DistributedSampler. Because of this, according to our experiments, using 1 GPU is faster than using 2 GPUs, especially when scoring multiple systems for the same source and references.
Nonetheless, if your data does not have repetitions and you have more than 1 GPU available, you can run multi-GPU inference with the following command:
comet-score -s src.de -t hyp1.en -r ref.en --gpus 2 --quiet
Warning: Segment-level scores obtained with multiple GPUs will be out of order. This is only useful for system-level scoring.
Scoring within Python:
from comet import download_model, load_from_checkpoint

# Download the model checkpoint and load it
model_path = download_model("wmt22-comet-da")
model = load_from_checkpoint(model_path)

# Each sample is a dict with the source, the MT hypothesis, and the reference
data = [
    {
        "src": "Dem Feuer konnte Einhalt geboten werden",
        "mt": "The fire could be stopped",
        "ref": "They were able to control the fire."
    },
    {
        "src": "Schulen und Kindergärten wurden eröffnet.",
        "mt": "Schools and kindergartens were open",
        "ref": "Schools and kindergartens opened"
    }
]

model_output = model.predict(data, batch_size=8, gpus=1)
seg_scores, system_score = model_output.scores, model_output.system_score
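The same predict API works for reference-free (QE) models such as wmt22-cometkiwi-da. A minimal sketch, assuming only the src and mt fields are required, as in the reference-free CLI example above:

from comet import download_model, load_from_checkpoint

qe_model = load_from_checkpoint(download_model("wmt22-cometkiwi-da"))
qe_data = [
    {"src": "Dem Feuer konnte Einhalt geboten werden", "mt": "The fire could be stopped"},
    {"src": "Schulen und Kindergärten wurden eröffnet.", "mt": "Schools and kindergartens were open"},
]
qe_output = qe_model.predict(qe_data, batch_size=8, gpus=1)
print(qe_output.system_score)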
All the above-mentioned models are built on top of XLM-R, which covers the following languages:
Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Bengali Romanized, Bosnian, Breton, Bulgarian, Burmese, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hausa, Hebrew, Hindi, Hindi Romanized, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian, Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tamil, Tamil Romanized, Telugu, Telugu Romanized, Thai, Turkish, Ukrainian, Urdu, Urdu Romanized, Uyghur, Uzbek, Vietnamese, Welsh, Western Frisian, Xhosa, Yiddish.
Thus, results for language pairs containing uncovered languages are unreliable!
We recommend the following models to evaluate your translations:
wmt20-comet-da: DEFAULT reference-based regression model built on top of XLM-R (large) and trained on Direct Assessments from WMT17 to WMT19. Same as wmt-large-da-estimator-1719 from previous versions.
wmt21-comet-qe-mqm: Reference-FREE regression model built on top of XLM-R (large), trained on Direct Assessments and fine-tuned on MQM.
eamt22-cometinho-da: Lightweight reference-based regression model distilled from an ensemble of COMET models similar to wmt20-comet-da.
Instead of using pretrained models, you can train your own model with the following command:
comet-train --cfg configs/models/{your_model_config}.yaml
You can then use your own metric to score:
comet-score -s src.de -t hyp1.en -r ref.en --model PATH/TO/CHECKPOINT
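Your own checkpoint can also be loaded through the Python API. A minimal sketch, with PATH/TO/CHECKPOINT as a placeholder for a .ckpt file produced by comet-train:

from comet import load_from_checkpoint

# Load a locally trained checkpoint instead of a downloaded model
model = load_from_checkpoint("PATH/TO/CHECKPOINT")
output = model.predict(
    [{"src": "Dem Feuer konnte Einhalt geboten werden",
      "mt": "The fire could be stopped",
      "ref": "They were able to control the fire."}],
    batch_size=8,
    gpus=0,
)
print(output.system_score)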
Note: Please contact ricardo.rei@unbabel.com if you wish to have your own metric hosted among the available COMET metrics!
In order to run the toolkit tests, run the following commands:
coverage run --source=comet -m unittest discover
coverage report -m
Note: Testing on CPU takes a long time.
If you use COMET, please cite our work and don't forget to say which model you used to evaluate your systems.
- CometKiwi: IST-Unbabel 2022 Submission for the Quality Estimation Shared Task -- Winning submission
- COMET-22: Unbabel-IST 2022 Submission for the Metrics Shared Task
- Searching for Cometinho: The Little Metric That Could -- EAMT22 Best paper award
- Are References Really Needed? Unbabel-IST 2021 Submission for the Metrics Shared Task
- COMET - Deploying a New State-of-the-art MT Evaluation Metric in Production