Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: ICU plural-case handling during Machine-Translation #2445

Closed
wants to merge 1 commit into from

Conversation

balk-sp
Copy link

@balk-sp balk-sp commented Aug 29, 2024

ICU plural element mishandling explicit case values during translation

When submitting ICU to the machine-translation, it will correctly handle plural-cases 'one' and 'other', but not explicit value cases like '=1' and '=0'

Input A: {productCount,plural,one{You have one product.}other{You have # products.}}
Input B: {productCount,plural,=1{You have one product.}other{You have # products.}}

Output A: {productCount,plural,one{You have one product.}other{You have # products.}}
Output B: {productCount,plural,one{You have 1 products.}other{You have # products.}}
  • Output B outputs case 'one' for input case '=1'
  • Output B uses the value of case 'other' instead of the value of case '=1', and therefore yields '1 products', which is grammatically incorrect.

Machine translation

As the ICU text is correctly stored in the database, and correctly shown in the web-interface, but is corrupted during the machine-translation, there must be a text-transformation just prior to making the call to the translation-provider.

During machine-translation, the following queries are sent to the provider.

Input A: {productCount,plural,one{You have one product.}other{You have # products.}}
  Translate case 'one':   You have one product.
  Translate case 'other': You have <x id="tolgee-number">10</x> products.

Input B: {productCount,plural,=1{You have one product.}other{You have # products.}}
  Translate case '=1':    You have <x id="tolgee-number">1</x> products.
  Translate case 'other': You have <x id="tolgee-number">10</x> products.

Bug

The reason this is happening is that Tolgee rewrites plural-case '=1' to 'one' (similarly, it maps '=0' to 'zero'), and then tries to find 'one' in the original plural-element, which does not exist, as the plural-element contains the '=1' case. This causes it to fallback to the 'other' case.

Solution

Besides looking for the 'one' case, we should also look for the '=1' plural-case when doing the lookup.

@balk-sp balk-sp changed the title Fix ICU plural-case handling during Machine-Translation fix: ICU plural-case handling during Machine-Translation Aug 30, 2024
@JanCizmar
Copy link
Contributor

Thanks a lot for the PR! 🎉 Looks like it breaks the tests. Also this feature will require some unit tests as well.

@JanCizmar
Copy link
Contributor

Hey! Sorry It actually doesn't break the tests. It's only the report task which always fail for PRs. But ktlint fails. You have to run ./gradlew ktlintFormat to fix this.

Anyway, adding unit tests testing this new functionality need to be done before merging this. Are you willing to do this or should we do this ourselves?

@balk-sp
Copy link
Author

balk-sp commented Sep 2, 2024

It seems to me that my PR merely patches a side effect of the bug.
To me it feels odd that during translation the plural-cases are rewritten.

Issue A
If there is a need to transform the cases (like '=0', '=1', '=2', '=3' to resp. 'zero', 'one', 'two', 'few') then it should be transformed in all layers. It doesn't make sense to transform case-value '=1' to 'one', and then look for 'one' in the original plural-statement, and then fall back to '=1' (as is the workaround in my patch). If you go the route of transformation, you also have to transform all the plural-cases (in the AST representation), such that the lookup for 'one' actually works, as opposed to now, where it could only possibly work if somebody were to write both '=1' and 'one' in their plural-cases.

Issue B
As a contrived example, let's say I start off with the following ICU-text:

{var,
 plural,
 =0{No products}
 =1{A product}
 =2{A pair of products}
 =3{A typical yet odd amount of products}
 =4{A typical yet even amount of products}
 other{# products}
}

The case-values would be transformed to:

 =0 -> zero
 =1 -> one
 =2 -> two
 =3 -> few
 =4 -> few
 other -> other

Note the 2 'few' case-values. Now I realize this is a contrived example, but I hope it illustrates that the transformation is not as innocent as it seems, it has side-effects in the sense that different cases in the original icu-statement will be mapped to the same case value.

Issue C
Upon actual testing, it turns out that the above input (with cases =0...=4) yields the following output:

{var,
 plural,
 one{A product}
 other{# products}
}

So cases =0,=2,=3,=4 completely vanish during machine-translation.
Note that I added logging to the lines of my PR, and I observe that only '=1/one' and 'other' reach my PR code, the other cases are lost earlier in the process.

This means that 'Issue B' at the moment is merely a theoretical issue, because it never reaches the state of producing identical transformed case-values for different inputs.

@balk-sp
Copy link
Author

balk-sp commented Sep 2, 2024

The root-cause seems to be that the following code:

# PluralTranslationUtil.kt

  private val targetExamples by lazy {
    val targetLanguageTag = context.getLanguage(item.targetLanguageId).tag
    val targetULocale = getULocaleFromTag(targetLanguageTag)
    val targetRules = PluralRules.forLocale(targetULocale)
    getPluralFormExamples(targetRules)
  }

... for target-locale "en-US" returns:

{one=1, other=10}

which leads to it producing translations for these, where these keys intersect with the plural-cases, causing it to drop all other plural-cases (like =0, =4, etc) and if it weren't for my fallback in this PR, it would drop '=1' too, as only 'other' is in both the targetExamples.keys and in the plural.forms.keys.

@balk-sp
Copy link
Author

balk-sp commented Sep 2, 2024

Would it be a good idea to simply leave explicit-values as they are?
Additionally, not using targetExamples as the source (which is exceptionally restrictive, containing only [one/other]), but instead iterate over the actual plural-cases, in forms:

# PluralTranslationUtil.kt

  private val preparedFormSourceStrings: Sequence<Pair<String, String>> by lazy {
    return@lazy forms.forms.asSequence().map {
      if (it.key.startsWith("=") && it.key.substring(1).toDoubleOrNull() != null) {
        it.key to it.value.replaceReplaceNumberPlaceholderWithExample(it.key.substring(1).toDouble())
      } else {
        val numValue = targetExamples[it.key]?.toDouble() ?: 10.0
        val formValue = forms.forms[it.key] ?: forms.forms[sourceRules?.select(numValue)] ?: forms.forms["=" + it.value]
        ?: forms.forms[PluralRules.KEYWORD_OTHER] ?: ""

        it.key to formValue.replaceReplaceNumberPlaceholderWithExample(numValue)
      }
    }
  }

Passing the following (contrived) ICU-string to the translation-service now produces exactly the same output:

{var,
 plural,
 zero{No products whatsoever}
 one{A single product}
 =0{No products}
 =1{A product}
 =2{A pair of products}
 =3{A typical yet odd amount of products}
 =4{A typical yet even amount of products}
 other{# products}
}

@JanCizmar
Copy link
Contributor

Yeah, this looks fine!

@JanCizmar
Copy link
Contributor

Hey! As I am thinking about it now, maybe no transformation would indeed be better solution. The reason why I added the complex transformation was that I wanted to give the machine translators number example. But I forgot that the form is actually also pretty good example value for the machine translators. So now I believe, that maybe we can just use the variant name (zero, one ...) As the example. However, we still need to find example for other and handle the situation when there is collision with some exact form.

@JanCizmar
Copy link
Contributor

Oh! Sorry. As I am checking the code, I can see that the transformation actually is required. I will try fix the issue and add some tests.

@JanCizmar
Copy link
Contributor

For many cases the source forms doesn't match the target forms. For example in English there is only one and other required, but the same string requires 4 forms to be provided in Czech language. That's why this transformation is necessary. I probably did the same thing as you by adding the exact forms to the data separately, so we got them provided as well. This should handle all the situations. #2454

I just need to test, how the UI will handle this situation.

@balk-sp
Copy link
Author

balk-sp commented Sep 4, 2024

Thanks, I actually made a change already in the PR-code, as I noticed the same requirement, based on your unit-tests. You can see how I merge the examples/form cases.

Having said that, your implementation is obviously cleaner, as this is my first attempt at kotlin.
N.B.: the latest PR-code doesn't quite work, it was a WIP and can be discarded.

@JanCizmar
Copy link
Contributor

OKI doke, so I am closing this. Thanks for cooperation. :)

@JanCizmar JanCizmar closed this Sep 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants