This is the 1st place solution code for the educational competition on text detoxification held as part of the Ozon Masters NLP Course.
This competition is inspired by the original one, RUSSE 2022 Russian Text Detoxification Based on Parallel Corpora, but differs from it in its test data and evaluation procedure. Namely, submissions here are evaluated with nltk.translate.bleu_score.corpus_bleu, and the development dataset of the original competition serves as the test dataset in this competition.
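For reference, here is a minimal sketch of such an evaluation call; the toy data and whitespace tokenization are illustrative, not the competition's exact preprocessing:

```python
from nltk.translate.bleu_score import corpus_bleu

# Toy data: each hypothesis is a token list, and each entry of `references`
# is a list of acceptable reference token lists for the same position.
references = [
    [["это", "очень", "плохая", "идея", "."]],
    [["я", "с", "вами", "не", "согласен", "."]],
]
hypotheses = [
    ["это", "очень", "плохая", "идея", "."],
    ["я", "с", "вами", "не", "согласен", "."],
]
print(corpus_bleu(references, hypotheses))  # corpus-level BLEU in [0, 1]
```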
Detoxification is a kind of text style transfer that aims to paraphrase a toxic text into a non-toxic one while preserving the meaning of the original content and maintaining natural language fluency.
The solution consists of the following stages:
- Candidate Generation
- Scoring and Candidate Selection
In the candidate generation stage, we fine-tune sberbank-ai/ruT5-base and cointegrated/rut5-base-paraphraser on the entire training dataset, periodically saving model snapshots. Each snapshot then generates candidate detoxified paraphrases for every toxic comment in both the training and the test data.
The best model at this stage reached a BLEU score of 0.805.
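A minimal sketch of generating candidates with one snapshot is shown below; the sampling parameters and the number of candidates per comment are illustrative assumptions, not the tuned values:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load one snapshot; here the public checkpoint stands in for a fine-tuned one.
name = "cointegrated/rut5-base-paraphraser"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)
model.eval()

def generate_candidates(toxic_text, n=5):
    # Sample several detoxified paraphrases for a single toxic comment.
    inputs = tokenizer(toxic_text, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, do_sample=True, top_p=0.9,
                             num_return_sequences=n, max_length=128)
    return [tokenizer.decode(ids, skip_special_tokens=True) for ids in out]
```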
Next, in the scoring stage, given a pair of a toxic comment and one of its non-toxic candidates, the goal is to predict the BLEU score between that candidate and the ground-truth non-toxic comment. For this purpose, we label such pairs on the training dataset and fine-tune cointegrated/rubert-tiny-toxicity on them.
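A sketch of the labeling and the ranker setup, assuming the toxicity model's head is replaced by a single regression output (num_labels=1 and the regression problem type are assumptions about the training setup, not confirmed details; the actual fine-tuning loop is omitted):

```python
from nltk.translate.bleu_score import sentence_bleu
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Labeling: the regression target for a (candidate, reference) pair is the
# BLEU score between the candidate and the ground-truth non-toxic comment.
def bleu_label(candidate, reference):
    return sentence_bleu([reference.split()], candidate.split())

# Ranker setup: rubert-tiny-toxicity with its classification head
# re-initialized for one regression output (an assumed configuration).
name = "cointegrated/rubert-tiny-toxicity"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(
    name,
    num_labels=1,
    problem_type="regression",
    ignore_mismatched_sizes=True,
)
# Each training example encodes the (toxic comment, candidate) pair jointly.
inputs = tokenizer("токсичный комментарий", "вежливый кандидат",
                   return_tensors="pt")
```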
Finally, to produce a submission file, this model scores the candidates generated for the test data; for each toxic comment in the test dataset, the highest-scoring candidate is selected.
This reranking increased the BLEU score to 0.814.
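A minimal sketch of the selection step, assuming the trained scorer has been saved to a local directory ("ranker/" is a hypothetical path):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# "ranker/" is a hypothetical path to the fine-tuned scoring model above.
tokenizer = AutoTokenizer.from_pretrained("ranker/")
model = AutoModelForSequenceClassification.from_pretrained("ranker/")
model.eval()

def select_best(toxic_text, candidates):
    # Score every (toxic comment, candidate) pair and keep the top candidate.
    scores = []
    for candidate in candidates:
        inputs = tokenizer(toxic_text, candidate,
                           return_tensors="pt", truncation=True)
        with torch.no_grad():
            scores.append(model(**inputs).logits.squeeze().item())
    return candidates[scores.index(max(scores))]
```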