fix: ICU plural-case handling during Machine-Translation #2445

balk-sp · 2024-08-29T14:34:52Z

ICU plural element mishandling explicit case values during translation

When ICU text is submitted for machine translation, the plural cases 'one' and 'other' are handled correctly, but explicit-value cases such as '=1' and '=0' are not.

Input A: {productCount,plural,one{You have one product.}other{You have # products.}}
Input B: {productCount,plural,=1{You have one product.}other{You have # products.}}

Output A: {productCount,plural,one{You have one product.}other{You have # products.}}
Output B: {productCount,plural,one{You have 1 products.}other{You have # products.}}

Output B emits case 'one' for input case '=1'.
Output B also uses the text of case 'other' instead of the text of case '=1', and therefore yields '1 products', which is grammatically incorrect.

Machine translation

The ICU text is stored correctly in the database and shown correctly in the web interface, but it is corrupted during machine translation, so there must be a text transformation just before the call to the translation provider.

During machine-translation, the following queries are sent to the provider.

Input A: {productCount,plural,one{You have one product.}other{You have # products.}}
  Translate case 'one':   You have one product.
  Translate case 'other': You have <x id="tolgee-number">10</x> products.

Input B: {productCount,plural,=1{You have one product.}other{You have # products.}}
  Translate case '=1':    You have <x id="tolgee-number">1</x> products.
  Translate case 'other': You have <x id="tolgee-number">10</x> products.
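
The queries above can be approximated with a small sketch. It assumes the only transformation is replacing each '#' with the tolgee-number tag seen in the queries; the helper name withNumberExample is hypothetical, and ICU quoting and nesting rules are ignored.

```kotlin
// Hypothetical helper, not the Tolgee implementation.
// Wraps an example number in the tag that appears in the provider queries.
// Assumption: every '#' in the form text is a number placeholder.
fun withNumberExample(formText: String, example: Int): String =
  formText.replace("#", "<x id=\"tolgee-number\">$example</x>")
```

Called with the 'other' text of Input B and the example 1, this reproduces the '=1' query shown above, which is where the wrong sentence enters the translation request.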

Bug

This happens because Tolgee rewrites plural case '=1' to 'one' (similarly, it maps '=0' to 'zero') and then looks up 'one' in the original plural element. That lookup fails, because the plural element contains '=1' rather than 'one', so the code falls back to the 'other' case.

Solution

Besides looking for the 'one' case, we should also look for the '=1' plural-case when doing the lookup.
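
A minimal sketch of the difference between the two lookups, assuming the plural forms are held in a map from selector to text. All names here are hypothetical and only illustrate the idea; this is not the code from the PR.

```kotlin
// Hypothetical names throughout; this is not the Tolgee implementation.
// Explicit selectors and the keywords they are rewritten to, per the bug description.
val explicitToKeyword = mapOf("=0" to "zero", "=1" to "one")

// Lookup as described in the bug: only the keyword is tried, then 'other'.
fun lookupKeywordOnly(forms: Map<String, String>, keyword: String): String? =
  forms[keyword] ?: forms["other"]

// Lookup with the proposed fallback: also try the explicit selector for that keyword.
fun lookupWithExplicitFallback(forms: Map<String, String>, keyword: String): String? {
  val explicitKey = explicitToKeyword.entries.firstOrNull { it.value == keyword }?.key
  return forms[keyword] ?: explicitKey?.let { forms[it] } ?: forms["other"]
}

// The forms of Input B.
val inputB = mapOf("=1" to "You have one product.", "other" to "You have # products.")
```

With inputB, lookupKeywordOnly(inputB, "one") returns the 'other' text, while lookupWithExplicitFallback(inputB, "one") returns the '=1' text.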

JanCizmar · 2024-08-30T12:33:01Z

Thanks a lot for the PR! 🎉 Looks like it breaks the tests. This feature will also need some unit tests.

JanCizmar · 2024-09-02T07:47:14Z

Hey! Sorry, it actually doesn't break the tests. It's only the report task, which always fails for PRs. But ktlint fails. You have to run ./gradlew ktlintFormat to fix this.

Anyway, unit tests covering this new functionality need to be added before merging. Are you willing to do this, or should we do it ourselves?

balk-sp · 2024-09-02T09:59:23Z

It seems to me that my PR merely patches a side effect of the bug.
To me it feels odd that during translation the plural-cases are rewritten.

Issue A
If there is a need to transform the cases (like '=0', '=1', '=2', '=3' to resp. 'zero', 'one', 'two', 'few') then it should be transformed in all layers. It doesn't make sense to transform case-value '=1' to 'one', and then look for 'one' in the original plural-statement, and then fall back to '=1' (as is the workaround in my patch). If you go the route of transformation, you also have to transform all the plural-cases (in the AST representation), such that the lookup for 'one' actually works, as opposed to now, where it could only possibly work if somebody were to write both '=1' and 'one' in their plural-cases.

Issue B
As a contrived example, let's say I start off with the following ICU-text:

{var,
 plural,
 =0{No products}
 =1{A product}
 =2{A pair of products}
 =3{A typical yet odd amount of products}
 =4{A typical yet even amount of products}
 other{# products}
}

The case-values would be transformed to:

 =0 -> zero
 =1 -> one
 =2 -> two
 =3 -> few
 =4 -> few
 other -> other

Note the two 'few' case-values. I realize this is a contrived example, but I hope it illustrates that the transformation is not as innocent as it seems: it has the side effect that different cases in the original ICU statement are mapped to the same case value.
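
The collision can be made explicit with a short sketch. The mapping is copied as data from the list above (it is the contrived mapping from this comment, not something computed from plural rules), and the function name collisions is hypothetical.

```kotlin
// The rewritten case-values listed above, as (original selector, rewritten keyword) pairs.
val rewritten = listOf(
  "=0" to "zero",
  "=1" to "one",
  "=2" to "two",
  "=3" to "few",
  "=4" to "few",
  "other" to "other",
)

// Groups the original selectors by rewritten keyword and keeps only keywords hit more than once.
fun collisions(mapping: List<Pair<String, String>>): Map<String, List<String>> =
  mapping.groupBy({ it.second }, { it.first }).filterValues { it.size > 1 }
```

For the mapping above, collisions(rewritten) contains a single entry, 'few', with the selectors '=3' and '=4'.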

Issue C
Upon actual testing, it turns out that the above input (with cases =0...=4) yields the following output:

{var,
 plural,
 one{A product}
 other{# products}
}

So cases =0,=2,=3,=4 completely vanish during machine-translation.
Note that I added logging to the lines of my PR, and I observe that only '=1/one' and 'other' reach my PR code; the other cases are lost earlier in the process.

This means that 'Issue B' at the moment is merely a theoretical issue, because it never reaches the state of producing identical transformed case-values for different inputs.

balk-sp · 2024-09-02T14:56:13Z

The root-cause seems to be that the following code:

# PluralTranslationUtil.kt

  private val targetExamples by lazy {
    val targetLanguageTag = context.getLanguage(item.targetLanguageId).tag
    val targetULocale = getULocaleFromTag(targetLanguageTag)
    val targetRules = PluralRules.forLocale(targetULocale)
    getPluralFormExamples(targetRules)
  }

... for target-locale "en-US" returns:

{one=1, other=10}

Translations are then produced only for keys that appear both in targetExamples.keys and in plural.forms.keys. All other plural cases (like '=0', '=4', etc.) are dropped. If it weren't for the fallback in this PR, '=1' would be dropped too, as only 'other' is in both sets.
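
A minimal sketch of that key intersection, assuming both sides are plain sets of selector strings. The names survivingForms and issueCFormKeys are hypothetical.

```kotlin
// Keys of the en-US example map quoted above.
val targetExampleKeys = setOf("one", "other")

// Selectors of the contrived ICU text from 'Issue B'.
val issueCFormKeys = setOf("=0", "=1", "=2", "=3", "=4", "other")

// Hypothetical helper: the forms that remain if only keys present on both sides are translated.
fun survivingForms(formKeys: Set<String>, exampleKeys: Set<String>): Set<String> =
  formKeys intersect exampleKeys
```

Here survivingForms(issueCFormKeys, targetExampleKeys) contains only 'other', which matches the observation that every explicit case is lost without the '=1' fallback.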

balk-sp · 2024-09-02T16:01:24Z

Would it be a good idea to simply leave explicit values as they are?
Additionally, instead of using targetExamples as the source (which is exceptionally restrictive, containing only [one/other]), we could iterate over the actual plural cases in forms:

# PluralTranslationUtil.kt

  private val preparedFormSourceStrings: Sequence<Pair<String, String>> by lazy {
    return@lazy forms.forms.asSequence().map {
      if (it.key.startsWith("=") && it.key.substring(1).toDoubleOrNull() != null) {
        it.key to it.value.replaceReplaceNumberPlaceholderWithExample(it.key.substring(1).toDouble())
      } else {
        val numValue = targetExamples[it.key]?.toDouble() ?: 10.0
        val formValue = forms.forms[it.key] ?: forms.forms[sourceRules?.select(numValue)] ?: forms.forms["=" + it.value]
        ?: forms.forms[PluralRules.KEYWORD_OTHER] ?: ""

        it.key to formValue.replaceReplaceNumberPlaceholderWithExample(numValue)
      }
    }
  }

Passing the following (contrived) ICU-string to the translation-service now produces exactly the same output:

{var,
 plural,
 zero{No products whatsoever}
 one{A single product}
 =0{No products}
 =1{A product}
 =2{A pair of products}
 =3{A typical yet odd amount of products}
 =4{A typical yet even amount of products}
 other{# products}
}

JanCizmar · 2024-09-03T05:58:49Z

Yeah, this looks fine!

JanCizmar · 2024-09-03T14:37:56Z

Hey! Thinking about it now, maybe no transformation would indeed be a better solution. The reason I added the complex transformation was that I wanted to give the machine translators a number example. But I forgot that the form itself is also a pretty good example value for the machine translators. So now I believe that maybe we can just use the variant name (zero, one, ...) as the example. However, we still need to find an example for 'other' and handle the situation where there is a collision with some exact form.

JanCizmar · 2024-09-03T14:48:35Z

Oh! Sorry. Checking the code, I can see that the transformation actually is required. I will try to fix the issue and add some tests.

JanCizmar · 2024-09-03T15:21:04Z

In many cases the source forms don't match the target forms. For example, English requires only 'one' and 'other', but the same string requires 4 forms in Czech. That's why this transformation is necessary. I probably did the same thing as you by adding the exact forms to the data separately, so they are provided as well. This should handle all the situations. #2454

I just need to test how the UI will handle this situation.
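
A rough sketch of the idea described here, under stated assumptions: the keyword lists are hard-coded for illustration (a real implementation would derive them from locale data such as ICU's PluralRules), and formsToTranslate is a hypothetical name, not the code in #2454.

```kotlin
// Assumption: plural keywords per language, hard-coded for illustration only.
val englishKeywords = listOf("one", "other")
val czechKeywords = listOf("one", "few", "many", "other")

// Hypothetical sketch: build the forms to translate for a target language.
// Each target keyword takes the matching source form, or falls back to 'other'.
// Explicit '=N' forms from the source are carried over unchanged.
fun formsToTranslate(
  source: Map<String, String>,
  targetKeywords: List<String>,
): Map<String, String> {
  val other = source["other"] ?: ""
  val keywordForms = targetKeywords.associateWith { source[it] ?: other }
  val exactForms = source.filterKeys { it.startsWith("=") }
  return exactForms + keywordForms
}
```

For a source with only '=1' and 'other', calling formsToTranslate with czechKeywords yields '=1' plus the four Czech keywords, with 'one', 'few', 'many' and 'other' all taking the source 'other' text.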

balk-sp · 2024-09-04T12:29:58Z

Thanks, I actually made a change already in the PR-code, as I noticed the same requirement, based on your unit-tests. You can see how I merge the examples/form cases.

Having said that, your implementation is obviously cleaner, as this is my first attempt at Kotlin.
N.B.: the latest PR-code doesn't quite work, it was a WIP and can be discarded.

JanCizmar · 2024-09-04T12:42:51Z

Oki doke, so I am closing this. Thanks for the cooperation. :)

balk-sp changed the title ~~Fix ICU plural-case handling during Machine-Translation~~ fix: ICU plural-case handling during Machine-Translation Aug 30, 2024

Add support for non-trivial plural-case expansion.

c1c45c1

balk-sp force-pushed the main branch from bb4fa83 to c1c45c1 Compare September 3, 2024 11:49

JanCizmar closed this Sep 4, 2024
