Read message bundles in UTF-8 and fall back to 8859-1 in org.eclipse.osgi.util.NLS #709

Merged: 1 commit merged on Dec 10, 2024

ShadelessFox
Contributor

@ShadelessFox ShadelessFox commented Dec 2, 2024

All message bundles are currently loaded in the ISO 8859-1 character encoding. This forces any message bundle that uses non-ASCII characters to write them as escape sequences, which makes such files notoriously hard to analyze or process without specialized tools such as IDEs or the native2ascii command-line tool.

This PR makes UTF-8 the default encoding for org.eclipse.osgi.util.NLS#initializeMessages by using PropertyResourceBundle. Although the stream is interpreted as UTF-8 by default, ISO 8859-1 content that is not valid UTF-8 is still handled correctly: if the stream cannot be decoded as UTF-8, it is re-read from the start as ISO 8859-1.
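
A minimal sketch of the "try UTF-8, then fall back to ISO 8859-1" idea described above. The class and method names are hypothetical and this is not the code in the PR; it only illustrates strict decoding with a fallback on an in-memory byte array.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Equinox.
final class Utf8WithLatin1Fallback {

    // Decodes the bytes as UTF-8; if they are not valid UTF-8, decodes them as ISO 8859-1.
    static String decode(byte[] bytes) {
        try {
            // REPORT makes the decoder throw instead of substituting a replacement character.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Every byte value maps to a character in ISO 8859-1, so this step cannot fail.
            return new String(bytes, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        String text = "key=caf\u00e9 au lait";
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        // Both encodings of the same text decode to the same string.
        System.out.println(decode(utf8).equals(decode(latin1)));
    }
}
```

The sketch decodes the whole input twice in the fallback case, which mirrors the "re-read from the start" wording but says nothing about how the JDK implements it internally.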

github-actions bot commented Dec 2, 2024

Test Results

  663 files  ±0    663 suites  ±0   1h 15m 48s ⏱️ - 1m 35s
2 211 tests +2  2 164 ✅ +3   47 💤 ±0  0 ❌  - 1 
6 777 runs  +6  6 634 ✅ +7  143 💤 ±0  0 ❌  - 1 

Results for commit 0552f41. ± Comparison against base commit caf78f7.

♻️ This comment has been updated with latest results.

@vogella
Contributor

vogella commented Dec 3, 2024

Please remove merge commit.

@merks
Contributor

merks commented Dec 3, 2024

FYI, we typically expect folks to follow these steps:

eclipse-simrel/.github#34

So typically folks will amend and force push to maintain a single commit in the PR...

@ShadelessFox ShadelessFox force-pushed the nls-use-utf8-by-default branch from 6107816 to 6932c79 Compare December 3, 2024 10:32
@ShadelessFox
Contributor Author

I pulled changes from the upstream and removed the unwanted merge commit.

@ShadelessFox ShadelessFox force-pushed the nls-use-utf8-by-default branch from 6932c79 to 67a7575 Compare December 3, 2024 10:41
@merks
Contributor

merks commented Dec 3, 2024

The build fails with:

06:10:29 [ERROR] Failed to execute goal org.eclipse.tycho.extras:tycho-p2-extras-plugin:4.0.10:compare-version-with-baselines (compare-attached-artifacts-with-release) on project org.eclipse.equinox.supplement: Only qualifier changed for (org.eclipse.equinox.supplement/1.11.100.v20241203-1040). Expected to have bigger x.y.z than what is available in baseline (1.11.100.v20241030-2121) -> [Help 1]

I think this is new API (right @tjwatson ?) so there needs to be an @since and this bundle needs to be incremented to 1.12.0

@laeubi
Member

laeubi commented Dec 3, 2024

New API requires a package version increment of the minor version as well.

@ShadelessFox ShadelessFox force-pushed the nls-use-utf8-by-default branch from 67a7575 to fd8cb34 Compare December 3, 2024 13:10
@ShadelessFox
Contributor Author

Since there's no package-info in the enclosing package, can @laeubi's suggestion be safely omitted?

     * @param clazz the class where the constants will exist
     */
    public static void initializeMessages(final String baseName, final Class<?> clazz) {
        initializeMessages(baseName, clazz, StandardCharsets.UTF_8);
Member

I'm not sure UTF-8 is a good default here, as it will most likely break a lot of existing behavior.

Contributor Author

This PR's whole purpose is to make UTF-8 the default. I tested it locally against various languages; the behavior is the same against both ASCII-only messages and messages containing escape sequences (this includes anything outside the ASCII range). Please see my original comment on this PR.

Member

@laeubi laeubi Dec 3, 2024

Please test with a file that contains ISO_8859_1 special characters. ASCII is not the default and escaping is not mandatory; see the java.util.Properties Javadoc:

The load(InputStream) / store(OutputStream, String) methods work the same way as the load(Reader)/store(Writer, String) pair, except the input/output stream is encoded in ISO 8859-1 character encoding. Characters that cannot be directly represented in this encoding can be written using Unicode escapes as defined in section 3.3 of The Java™ Language Specification;

This means that it is totally valid to have any ISO 8859-1 characters in such a file, and reading it as UTF-8 will break such existing files.
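
To illustrate the point, here is a small sketch (hypothetical class name, made-up sample text) showing that `Properties.load(InputStream)` accepts a raw ISO 8859-1 byte just as well as the equivalent Unicode escape, and that naively decoding the same bytes as UTF-8 damages the value.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical demo class; the sample key and value are invented for illustration.
final class Latin1PropertiesDemo {

    // Loads the bytes with Properties.load(InputStream), which reads ISO 8859-1
    // and resolves Unicode escapes, then returns the value for the given key.
    static String loadValue(byte[] fileBytes, String key) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(fileBytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // Raw 0xE9 byte, no escaping: legal in an ISO 8859-1 properties file.
        byte[] raw = "greeting=caf\u00e9 ouvert\n".getBytes(StandardCharsets.ISO_8859_1);
        // Same text written with a Unicode escape, using ASCII bytes only.
        byte[] escaped = "greeting=caf\\u00e9 ouvert\n".getBytes(StandardCharsets.US_ASCII);

        // Both files yield the same value.
        System.out.println(loadValue(raw, "greeting").equals(loadValue(escaped, "greeting")));

        // Decoding the raw file as UTF-8 replaces the 0xE9 byte with U+FFFD.
        String misread = new String(raw, StandardCharsets.UTF_8);
        System.out.println(misread.contains("\uFFFD"));
    }
}
```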

Contributor

Yes, one must assume that ISO 8859-1 is commonly used. It seems to me that Java was making some changes in this area, and I found this:

https://docs.oracle.com/javase/9/intl/internationalization-enhancements-jdk-9.htm#JSINT-GUID-5ED91AA9-B2E3-4E05-8E99-6A009D2B36AF

It's definitely a bad idea to assume UTF-8.

@laeubi
Member

laeubi commented Dec 3, 2024

Since there's no package-info in the enclosing package, can @laeubi's suggestion be safely omitted?

This is completely independent from package-info ...

@laeubi laeubi requested a review from tjwatson December 3, 2024 13:15
@ShadelessFox
Contributor Author

ShadelessFox commented Dec 3, 2024

This is completely independent from package-info ...

I'm a bit lost. Can you please hint where that package version you're talking about is located?

@laeubi
Member

laeubi commented Dec 3, 2024

This is completely independent from package-info ...

I'm a bit lost. Can you please hint where that package version you're talking about is located?

org.eclipse.osgi.util;version="1.1"

@ShadelessFox ShadelessFox force-pushed the nls-use-utf8-by-default branch from fd8cb34 to 390820d Compare December 3, 2024 14:37
@ShadelessFox
Contributor Author

Updated the PR while hopefully addressing all issues. Please let me know if I missed something.

@laeubi
Member

laeubi commented Dec 3, 2024

By the way, I think the best way forward would be to add support for a UTF-8 BOM together with a buffered stream of 4 bytes. This would allow detecting whether the file is "standard" or UTF encoded.

Member

@laeubi laeubi left a comment

I don't think it works in the current form; ISO_8859_1 must be the default. Besides that, one should add at least some test cases to ensure the code works as expected. As this is a very central class used everywhere in Eclipse, there is a high risk of breaking things (now or in the future) without at least minimal test coverage.

@ShadelessFox
Contributor Author

Thanks for the feedback. Hopefully, I'll get my hands on this in the future. :^)

@ShadelessFox
Contributor Author

By the way, I think the best way forward would be to add support for a UTF-8 BOM together with a buffered stream of 4 bytes. This would allow detecting whether the file is "standard" or UTF encoded.

Unfortunately, the BOM's byte values are valid ISO 8859-1 characters unless we assume no one is ever going to use ï»¿ as the first three characters of a message bundle.

@laeubi
Member

laeubi commented Dec 3, 2024

Unfortunately, the BOM's byte values are valid ISO 8859-1 characters unless we assume no one is ever going to use ï»¿ as the first three characters of a message bundle.

I think we can safely assert that this is very unlikely, as messages are usually mapped to Java constants and this is not a valid Java identifier.
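
A rough sketch of the BOM-sniffing idea under discussion, assuming a `PushbackInputStream` with a three-byte pushback buffer. All names are hypothetical and this approach was only proposed in the thread; the sketch also checks the claim that the three BOM bytes, read as ISO 8859-1, do not form a valid Java identifier.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Equinox. Requires Java 9+ (readNBytes, readAllBytes).
final class BomSniffer {

    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Consumes a leading UTF-8 BOM if present; otherwise pushes the bytes back unchanged.
    static boolean skipUtf8Bom(PushbackInputStream in) throws IOException {
        byte[] head = new byte[UTF8_BOM.length];
        int n = in.readNBytes(head, 0, head.length);
        boolean hasBom = n == head.length
                && head[0] == UTF8_BOM[0] && head[1] == UTF8_BOM[1] && head[2] == UTF8_BOM[2];
        if (!hasBom && n > 0) {
            in.unread(head, 0, n);
        }
        return hasBom;
    }

    // Decodes as UTF-8 when a BOM is present, otherwise as ISO 8859-1.
    static String decode(byte[] fileBytes) {
        try (PushbackInputStream in =
                new PushbackInputStream(new ByteArrayInputStream(fileBytes), UTF8_BOM.length)) {
            Charset charset = skipUtf8Bom(in) ? StandardCharsets.UTF_8 : StandardCharsets.ISO_8859_1;
            return new String(in.readAllBytes(), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // True if the string could be used as a Java identifier (keywords are not checked).
    static boolean isJavaIdentifier(String s) {
        if (s.isEmpty() || !Character.isJavaIdentifierStart(s.charAt(0))) {
            return false;
        }
        for (int i = 1; i < s.length(); i++) {
            if (!Character.isJavaIdentifierPart(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String bomAsLatin1 = new String(UTF8_BOM, StandardCharsets.ISO_8859_1);
        System.out.println(isJavaIdentifier(bomAsLatin1));
    }
}
```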

@merks
Contributor

merks commented Dec 3, 2024

As I mentioned above, see this comment:

Most existing properties files should not be affected: UTF-8 and ISO-8859-1 have the same encoding for ASCII characters, and human-readable non-ASCII ISO-8859-1 encoding is not valid UTF-8. If an invalid UTF-8 byte sequence is detected, the Java runtime automatically rereads the file in ISO-8859-1.

That sounds to me like a much better approach, especially if that's what Java itself does.

@merks
Contributor

merks commented Dec 3, 2024

Here too:

https://docs.oracle.com/javase/9/docs/api/java/util/PropertyResourceBundle.html

So it seems to me that we don't need new APIs. We could simply implement the same behavior.

@merks merks self-requested a review December 3, 2024 15:49
Contributor

@merks merks left a comment

Please reconsider implementing the behavior described here instead:

https://docs.oracle.com/javase/9/docs/api/java/util/PropertyResourceBundle.html

@laeubi
Member

laeubi commented Dec 3, 2024

Here too:

https://docs.oracle.com/javase/9/docs/api/java/util/PropertyResourceBundle.html

So it seems to me that we don't need new APIs. We could simply implement the same behavior.

I never understood why the platform has its own mechanism instead of reusing the Java one, but there must be a reason :-)

@merks
Contributor

merks commented Dec 3, 2024

It is reusing Java, using it to set the fields reflectively:

private static class MessagesProperties extends Properties {
    private static final int MOD_EXPECTED = Modifier.PUBLIC | Modifier.STATIC;
    private static final int MOD_MASK = MOD_EXPECTED | Modifier.FINAL;
    private static final long serialVersionUID = 1L;
    private final String bundleName;
    private final Map<Object, Object> fields;
    private final boolean isAccessible;

    public MessagesProperties(Map<Object, Object> fieldMap, String bundleName, boolean isAccessible) {
        super();
        this.fields = fieldMap;
        this.bundleName = bundleName;
        this.isAccessible = isAccessible;
    }

    /* (non-Javadoc)
     * @see java.util.Hashtable#put(java.lang.Object, java.lang.Object)
     */
    @Override
    public synchronized Object put(Object key, Object value) {
        Object fieldObject = fields.put(key, ASSIGNED);
        // if already assigned, there is nothing to do
        if (fieldObject == ASSIGNED)
            return null;
        if (fieldObject == null) {
            final String msg = "NLS unused message: " + key + " in: " + bundleName;//$NON-NLS-1$ //$NON-NLS-2$
            if (SupplementDebug.STATIC_DEBUG_MESSAGE_BUNDLES)
                System.out.println(msg);
            // keys with '.' are ignored by design (bug 433424)
            if (key instanceof String && ((String) key).indexOf('.') < 0) {
                log(SEVERITY_WARNING, msg, null);
            }
            return null;
        }
        final Field field = (Field) fieldObject;
        //can only set value of public static non-final fields
        if ((field.getModifiers() & MOD_MASK) != MOD_EXPECTED)
            return null;
        try {
            // Check to see if we are allowed to modify the field. If we aren't (for instance
            // if the class is not public) then change the accessible attribute of the field
            // before trying to set the value.
            if (!isAccessible)
                field.setAccessible(true);
            // Set the value into the field. We should never get an exception here because
            // we know we have a public static non-final field. If we do get an exception, silently
            // log it and continue. This means that the field will (most likely) be un-initialized and
            // will fail later in the code and if so then we will see both the NPE and this error.
            // Extra care is taken to be sure we create a String with its own backing char[] (bug 287183)
            // This is to ensure we do not keep the key chars in memory.
            field.set(null, new String(((String) value).toCharArray()));
        } catch (Exception e) {
            log(SEVERITY_ERROR, "Exception setting field value.", e); //$NON-NLS-1$
        }
        return null;
    }
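
For readers unfamiliar with the pattern, the following is a simplified, self-contained sketch of the same idea: message values are assigned reflectively to public static non-final String fields. The `Messages` class, the `assign` helper, and the sample keys are hypothetical; the real NLS class does considerably more (logging, locale variants, accessibility handling).

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Map;

// Hypothetical demo, loosely modeled on the excerpt above.
final class ReflectiveMessagesDemo {

    // A message class in the style NLS expects: public static non-final String fields.
    public static class Messages {
        public static String greeting;
        public static String farewell;
        public static final String CONSTANT = "unchanged";
    }

    private static final int MOD_EXPECTED = Modifier.PUBLIC | Modifier.STATIC;
    private static final int MOD_MASK = MOD_EXPECTED | Modifier.FINAL;

    // Assigns each value to the field with the same name; returns how many fields were set.
    static int assign(Class<?> clazz, Map<String, String> values) {
        int assigned = 0;
        for (Field field : clazz.getDeclaredFields()) {
            // Only public static non-final String fields are eligible.
            if ((field.getModifiers() & MOD_MASK) != MOD_EXPECTED || field.getType() != String.class) {
                continue;
            }
            String value = values.get(field.getName());
            if (value == null) {
                continue;
            }
            try {
                field.set(null, value);
                assigned++;
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
        return assigned;
    }

    public static void main(String[] args) {
        assign(Messages.class, Map.of("greeting", "Hallo"));
        System.out.println(Messages.greeting);
    }
}
```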

     * @param clazz the class where the constants will exist
     */
    public static void initializeMessages(final String baseName, final Class<?> clazz) {
        initializeMessages(baseName, clazz, StandardCharsets.UTF_8);
Contributor

This API has been around for 20+ years; it is risky to change the default at this point. I'm fine adding a new initializeMessages method that takes a Charset, but the original method should still use ISO_8859_1. This way bundles that are willing to use UTF-8 can make the change. But changing it to UTF-8 for everyone under the covers is risky. If you want, you could allow a null charset and have null default to UTF-8.
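
The shape of that suggestion can be sketched as a pair of overloads: one keeping ISO 8859-1 and one taking an explicit Charset, with null treated as UTF-8. The class and method names below are hypothetical stand-ins that operate on byte arrays; they are not the NLS API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical loader illustrating the overload idea; not part of Equinox.
final class CharsetAwareLoader {

    // Existing-style entry point: keeps the historical ISO 8859-1 behavior.
    static Properties load(byte[] fileBytes) {
        return load(fileBytes, StandardCharsets.ISO_8859_1);
    }

    // Opt-in overload: the caller picks the charset; null is treated as UTF-8.
    static Properties load(byte[] fileBytes, Charset charset) {
        Charset effective = charset != null ? charset : StandardCharsets.UTF_8;
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(fileBytes), effective)) {
            // Properties.load(Reader) takes already-decoded characters.
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        byte[] utf8 = "title=\u00dcbersicht\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(load(utf8, StandardCharsets.UTF_8).getProperty("title"));
    }
}
```

With this design existing callers see no change, but each bundle has to opt in and re-encode its own files, which is the migration cost raised later in the thread.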

@tjwatson
Contributor

tjwatson commented Dec 6, 2024

Does that mean the proposed solution, which relies on BOM detection, is not desired and should be removed from the PR, leaving us with an overloaded method that takes Charset? I just want to be extra sure I understand your intentions.

Yes, that is my suggestion. Leave the existing method unchanged. Add a new method that takes the charset. That seems the most simple way to add the function without any impact to existing users.

@ShadelessFox
Contributor Author

ShadelessFox commented Dec 9, 2024

Adding an overloaded method would bring no benefit, only pain when migrating message bundles to UTF-8.

Since not all ISO 8859-1 byte sequences are valid UTF-8, changing the encoding of a single message bundle will require re-encoding all its properties files in UTF-8. This could break existing fragment bundles that provide localization for external bundles, and it creates the opposite problem as well: UTF-8 encoded properties files supplied for an external bundle that hasn't been migrated to UTF-8 yet.

I did some research and noticed that the OSGi bundle's localization files (OSGI-INF/l10n/bundle) are actually loaded using PropertyResourceBundle:

private class LocalizationResourceBundle extends PropertyResourceBundle implements BundleResourceBundle {
    public LocalizationResourceBundle(InputStream in) throws IOException {

Its Javadoc says:

Constructing a PropertyResourceBundle instance from an InputStream requires that the input stream be encoded in UTF-8. By default, if a MalformedInputException or an UnmappableCharacterException occurs on reading the input stream, then the PropertyResourceBundle instance resets to the state before the exception, re-reads the input stream in ISO-8859-1, and continues reading.

Therefore, I propose switching to using PropertyResourceBundle for NLS as well.
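
A small sketch of what that could look like in isolation, assuming a Java 9 or newer runtime with the default `java.util.PropertyResourceBundle.encoding` setting, so that the fallback quoted from the Javadoc above applies. The class name and sample data are hypothetical, and this is not the code that was merged.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.PropertyResourceBundle;
import java.util.TreeMap;

// Hypothetical demo class; relies on the Java 9+ behavior of PropertyResourceBundle(InputStream).
final class ResourceBundleLoaderDemo {

    // Reads all key/value pairs through PropertyResourceBundle and copies them into a map.
    static Map<String, String> load(byte[] fileBytes) {
        try {
            PropertyResourceBundle bundle =
                    new PropertyResourceBundle(new ByteArrayInputStream(fileBytes));
            Map<String, String> result = new TreeMap<>();
            for (String key : bundle.keySet()) {
                result.put(key, bundle.getString(key));
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String content = "greeting=caf\u00e9 au lait\n";
        Map<String, String> fromUtf8 = load(content.getBytes(StandardCharsets.UTF_8));
        Map<String, String> fromLatin1 = load(content.getBytes(StandardCharsets.ISO_8859_1));
        // On a runtime with the documented fallback, both encodings give the same result.
        System.out.println(fromUtf8.equals(fromLatin1));
    }
}
```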

@merks
Contributor

merks commented Dec 9, 2024

Reusing something from the JDK that already is geared to address the same problem sounds promising...

@tjwatson
Contributor

tjwatson commented Dec 9, 2024

Therefore, I propose switching to using PropertyResourceBundle for NLS as well.

It is worth a try. My minor concern is that it will have a noticeable impact on initial startup on very large installs of Eclipse (with many bundles), given that we will now need to construct many PropertyResourceBundle instances to load the messages, only to pull the values out and then leave the objects for GC.

Like I mentioned before, this was a measurable performance issue, but that was long ago on much older Java versions.

@tjwatson
Contributor

tjwatson commented Dec 9, 2024

I don't see the advantage of using the PropertyResourceBundle over directly using java.util.Properties.load(Reader).

@merks
Contributor

merks commented Dec 9, 2024

I've not done a deep dive, but if I understand correctly, the PropertyResourceBundle handles these documented cases with the different encodings:

image

But Properties.load does not...

I don't recall all the history, but I thought the primary reason for the NLS design was to reduce memory footprint, not so much about load performance. I'm not sure about that though...

@tjwatson
Contributor

tjwatson commented Dec 9, 2024

Curiosity got the best of me and now I am confused. We already call Properties.load here:
https://github.com/eclipse-equinox/equinox/blob/master/bundles/org.eclipse.osgi/supplement/src/org/eclipse/osgi/util/NLS.java#L350

I was thinking we could simply change that to use a Reader. But that isn't going to work. It turns out that by Java 11 the PropertyResourceBundle(InputStream) constructor had changed to do the UTF-8 stuff:

https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/PropertyResourceBundle.html#%3Cinit%3E(java.io.InputStream)

In Java 8 that was not the case:

https://docs.oracle.com/javase/8/docs/api/java/util/PropertyResourceBundle.html#PropertyResourceBundle-java.io.InputStream-

So we can certainly change to using PropertyResourceBundle, but that only helps on a Java version that has this behavior (Java 9 or higher, per the documentation linked earlier in this thread). Not really an issue for the vast majority of users. Equinox framework does support Java 8 still, but I am fine with bundles wanting to run on Java 8 not having this ability to use UTF-8.

@ShadelessFox
Contributor Author

Not really an issue for the vast majority of users. Equinox framework does support Java 8 still, but I am fine with bundles wanting to run on Java 8 not having this ability to use UTF-8.

Should the Javadoc for NLS#initializeMessages document new behavior regardless of differences between Java versions?

@ShadelessFox ShadelessFox force-pushed the nls-use-utf8-by-default branch from 1da5382 to 5c30041 Compare December 9, 2024 16:56
@ShadelessFox
Contributor Author

For now, I have updated the PR once again. I followed the most straightforward way that doesn't involve complex refactoring.

@tjwatson
Contributor

tjwatson commented Dec 9, 2024

Should the Javadoc for NLS#initializeMessages document new behavior regardless of differences between Java versions?

That is probably a good idea, it allows us to claim a behavior enhancement to the API contract and have a minor version bump that bundles can depend on if they want.

@ShadelessFox ShadelessFox force-pushed the nls-use-utf8-by-default branch from 5c30041 to ae7cc10 Compare December 9, 2024 17:18
@ShadelessFox ShadelessFox changed the title Use UTF-8 by default in org.eclipse.osgi.util.NLS Read message bundles in UTF-8 and fall back to 8859-1 in org.eclipse.osgi.util.NLS Dec 9, 2024
Contributor

@tjwatson tjwatson left a comment

LGTM

Thanks for your patience on this review!

@ShadelessFox ShadelessFox requested review from merks and laeubi December 10, 2024 10:38
Contributor

@merks merks left a comment

It generally looks very good, with a tiny remark about \uXXXX versus \XXXX.

@ShadelessFox ShadelessFox force-pushed the nls-use-utf8-by-default branch from ae7cc10 to 0552f41 Compare December 10, 2024 11:26
@merks merks self-requested a review December 10, 2024 11:28
Contributor

@merks merks left a comment

Thanks for being responsive to feedback, for writing tests, and for producing such a nice improvement that aligns better with the modern Java support! 🥇

@ShadelessFox
Contributor Author

I truly appreciate all of you for sharing your feedback and guiding me along the way :^)

@merks merks merged commit a9877c4 into eclipse-equinox:master Dec 10, 2024
26 of 27 checks passed
@jukzi
Contributor

jukzi commented Dec 11, 2024

Please note that this PR is blamed for failing the I-Build; see eclipse-platform/eclipse.platform.releng.aggregator#2648 @ShadelessFox @merks

@merks
Contributor

merks commented Dec 11, 2024

Yes, it will be better if (and when) the PR build is as strict about Javadoc errors as the I-build.

@ShadelessFox
Contributor Author

I'm sorry for causing havoc. I believe I had no way to guarantee it builds everywhere; I relied solely on the PR's checks.

@merks
Contributor

merks commented Dec 11, 2024

No need to apologize. The PR builds should have checked this given that the I-build is so strict about it later on. Everything is back on track and we have a successful I-Build now.

@akurtakov
Member

I can only say that, from my POV, work on enhancing our releng process is the most important thing right now.
