This is the 1st place solution code for the educational competition on text detoxification held as part of the Ozon Masters NLP Course.
This competition is inspired by the original one, RUSSE 2022 Russian Text Detoxification Based on Parallel Corpora, but differs from it in its test data and evaluation procedure. Namely, submissions here are evaluated with nltk.translate.bleu_score.corpus_bleu, and the development dataset of the original competition serves as the test dataset in this competition.
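For reference, here is a minimal sketch of such an evaluation call; the toy data and whitespace tokenization are illustrative, not the competition's exact preprocessing:

```python
from nltk.translate.bleu_score import corpus_bleu

# Toy data: each hypothesis is a token list, and each entry of `references`
# is a list of acceptable reference token lists for the same position.
references = [
    [["это", "очень", "плохая", "идея", "."]],
    [["я", "с", "вами", "не", "согласен", "."]],
]
hypotheses = [
    ["это", "очень", "плохая", "идея", "."],
    ["я", "с", "вами", "не", "согласен", "."],
]
print(corpus_bleu(references, hypotheses))  # corpus-level BLEU in [0, 1]
```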
Detoxification is a kind of text style transfer that aims to paraphrase a toxic text into a non-toxic one while preserving the meaning of the original content and maintaining natural language fluency.
The solution consists of the following stages:
- Candidate Generation
- Scoring and Candidate Selection
In the candidate generation stage, we fine-tune sberbank-ai/ruT5-base and cointegrated/rut5-base-paraphraser on the entire training dataset, periodically saving model snapshots. Each snapshot then generates candidate detoxified paraphrases for every toxic comment in both the training and the test data.
The best model at this stage reached a BLEU score of 0.805.
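A minimal sketch of generating candidates with one snapshot is shown below; the sampling parameters and the number of candidates per comment are illustrative assumptions, not the tuned values:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load one snapshot; here the public checkpoint stands in for a fine-tuned one.
name = "cointegrated/rut5-base-paraphraser"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)
model.eval()

def generate_candidates(toxic_text, n=5):
    # Sample several detoxified paraphrases for a single toxic comment.
    inputs = tokenizer(toxic_text, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, do_sample=True, top_p=0.9,
                             num_return_sequences=n, max_length=128)
    return [tokenizer.decode(ids, skip_special_tokens=True) for ids in out]
```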
Next, in the scoring stage, given a pair of a toxic comment and one of its non-toxic candidates, the goal is to predict the BLEU score between that candidate and the ground-truth non-toxic comment. For this purpose, we label such pairs on the training dataset and fine-tune cointegrated/rubert-tiny-toxicity on them.
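A sketch of the labeling and the ranker setup, assuming the toxicity model's head is replaced by a single regression output (num_labels=1 and the regression problem type are assumptions about the training setup, not confirmed details; the actual fine-tuning loop is omitted):

```python
from nltk.translate.bleu_score import sentence_bleu
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Labeling: the regression target for a (candidate, reference) pair is the
# BLEU score between the candidate and the ground-truth non-toxic comment.
def bleu_label(candidate, reference):
    return sentence_bleu([reference.split()], candidate.split())

# Ranker setup: rubert-tiny-toxicity with its classification head
# re-initialized for one regression output (an assumed configuration).
name = "cointegrated/rubert-tiny-toxicity"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(
    name,
    num_labels=1,
    problem_type="regression",
    ignore_mismatched_sizes=True,
)
# Each training example encodes the (toxic comment, candidate) pair jointly.
inputs = tokenizer("токсичный комментарий", "вежливый кандидат",
                   return_tensors="pt")
```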
Finally, to produce a submission file, this model scores the candidates generated for the test data; for each toxic comment in the test dataset, the highest-scoring candidate is selected.
This reranking increased the BLEU score to 0.814.
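A minimal sketch of the selection step, assuming the trained scorer has been saved to a local directory ("ranker/" is a hypothetical path):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# "ranker/" is a hypothetical path to the fine-tuned scoring model above.
tokenizer = AutoTokenizer.from_pretrained("ranker/")
model = AutoModelForSequenceClassification.from_pretrained("ranker/")
model.eval()

def select_best(toxic_text, candidates):
    # Score every (toxic comment, candidate) pair and keep the top candidate.
    scores = []
    for candidate in candidates:
        inputs = tokenizer(toxic_text, candidate,
                           return_tensors="pt", truncation=True)
        with torch.no_grad():
            scores.append(model(**inputs).logits.squeeze().item())
    return candidates[scores.index(max(scores))]
```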