Read message bundles in UTF-8 and fall back to 8859-1 in `org.eclipse.osgi.util.NLS` #709

ShadelessFox · 2024-12-02T18:11:43Z

All message bundles are currently loaded using the ISO 8859-1 character encoding. This forces any message bundle that uses non-ASCII characters to write them as Unicode escape sequences, which makes such files notoriously hard to analyze or process without specialized tools such as IDEs or the native2ascii command-line tool.
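
To make the escaping burden concrete, here is a minimal sketch (the class name `EscapedBundleDemo` and the `greeting` key are made up for illustration and are not taken from the PR). It loads the same text twice through `java.util.Properties.load(InputStream)`, which is documented to read ISO 8859-1: once written with Unicode escapes, and once saved as raw UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

final class EscapedBundleDemo {

    /** Loads properties-file bytes the way Properties.load(InputStream) does: as ISO 8859-1. */
    static Properties loadLatin1(byte[] fileBytes) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(fileBytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // File content as it has to be written today: pure ASCII with Unicode escapes.
        byte[] escaped = "greeting=\\u041f\\u0440\\u0438\\u0432\\u0435\\u0442"
                .getBytes(StandardCharsets.US_ASCII);
        // The same Cyrillic text saved as raw UTF-8, as a plain text editor would store it.
        byte[] rawUtf8 = "greeting=\u041f\u0440\u0438\u0432\u0435\u0442"
                .getBytes(StandardCharsets.UTF_8);

        System.out.println("escaped:   " + loadLatin1(escaped).getProperty("greeting"));
        System.out.println("raw UTF-8: " + loadLatin1(rawUtf8).getProperty("greeting"));
    }
}
```

Under these assumptions only the escaped variant round-trips; the raw UTF-8 variant comes back garbled, which is why such files currently depend on escaping tools.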

This PR makes UTF-8 the default encoding for org.eclipse.osgi.util.NLS#initializeMessages by reading message bundles through PropertyResourceBundle. Although the stream is interpreted as UTF-8 by default, ISO 8859-1 content that is not valid UTF-8 is still handled correctly: if the stream cannot be decoded as UTF-8, it is re-read from the start as ISO 8859-1.
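
A minimal sketch of the behavior this relies on, assuming Java 9 or later and that the `java.util.PropertyResourceBundle.encoding` system property is left unset. The class name `Utf8FallbackDemo` and the sample key are hypothetical; this is not the code from the PR.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.PropertyResourceBundle;

final class Utf8FallbackDemo {

    /** Reads one value from properties-file bytes via PropertyResourceBundle. */
    static String read(byte[] fileBytes, String key) {
        try {
            return new PropertyResourceBundle(new ByteArrayInputStream(fileBytes)).getString(key);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String line = "greeting=Gr\u00fc\u00dfe";
        // Same text, stored once as UTF-8 and once as ISO 8859-1.
        byte[] utf8 = line.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = line.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(read(utf8, "greeting"));
        // The ISO 8859-1 bytes are not valid UTF-8, so the JDK falls back to ISO 8859-1.
        System.out.println(read(latin1, "greeting"));
    }
}
```

Both calls are expected to yield the same string, so existing ISO 8859-1 files with non-ASCII characters keep working while new files can be plain UTF-8.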

github-actions · 2024-12-02T18:56:57Z

Test Results

663 files ±0 663 suites ±0 1h 15m 48s ⏱️ - 1m 35s
2 211 tests +2 2 164 ✅ +3 47 💤 ±0 0 ❌ - 1
6 777 runs +6 6 634 ✅ +7 143 💤 ±0 0 ❌ - 1

Results for commit 0552f41. ± Comparison against base commit caf78f7.

♻️ This comment has been updated with latest results.

vogella · 2024-12-03T09:30:21Z

Please remove merge commit.

merks · 2024-12-03T09:50:23Z

FYI, we typically expect folks to follow these steps:

eclipse-simrel/.github#34

So typically folks will amend and force push to maintain a single commit in the PR...

ShadelessFox · 2024-12-03T10:37:21Z

I pulled changes from the upstream and removed the unwanted merge commit.

merks · 2024-12-03T12:29:51Z

The build fails with:

06:10:29 [ERROR] Failed to execute goal org.eclipse.tycho.extras:tycho-p2-extras-plugin:4.0.10:compare-version-with-baselines (compare-attached-artifacts-with-release) on project org.eclipse.equinox.supplement: Only qualifier changed for (org.eclipse.equinox.supplement/1.11.100.v20241203-1040). Expected to have bigger x.y.z than what is available in baseline (1.11.100.v20241030-2121) -> [Help 1]

I think this is new API (right, @tjwatson?), so there needs to be an @since and this bundle's version needs to be incremented to 1.12.0.

laeubi · 2024-12-03T12:35:40Z

New API requires a package version increment of the minor version as well.

ShadelessFox · 2024-12-03T13:11:14Z

Since there's no package-info in the enclosing package, can @laeubi's suggestion be safely omitted?

laeubi · 2024-12-03T13:12:38Z

bundles/org.eclipse.osgi/supplement/src/org/eclipse/osgi/util/NLS.java

+	 * @param clazz the class where the constants will exist
+	 */
+	public static void initializeMessages(final String baseName, final Class<?> clazz) {
+		initializeMessages(baseName, clazz, StandardCharsets.UTF_8);


I'm not sure UTF-8 is a good default here, as it will most likely break a lot of existing behavior.

ShadelessFox · 2024-12-03

This PR's whole purpose is to make UTF-8 the default. I tested it locally against various languages; the behavior is the same against both ASCII-only messages and messages containing escape sequences (this includes anything outside the ASCII range). Please see my original comment on this PR.

laeubi · 2024-12-03

Please test with a file that contains ISO 8859-1 special characters. ASCII is not the default and escaping is not mandatory; see the java.util.Properties Javadoc:

The load(InputStream) / store(OutputStream, String) methods work the same way as the load(Reader)/store(Writer, String) pair, except the input/output stream is encoded in ISO 8859-1 character encoding. Characters that cannot be directly represented in this encoding can be written using Unicode escapes as defined in section 3.3 of The Java™ Language Specification;

This means that it is totally valid to have any ISO 8859-1 characters in such a file, and reading it as UTF-8 will break these existing files.

merks · 2024-12-03

Yes, one must assume that ISO 8859-1 is commonly used. It seems to me that Java was making some changes in this area, and I found this:

https://docs.oracle.com/javase/9/intl/internationalization-enhancements-jdk-9.htm#JSINT-GUID-5ED91AA9-B2E3-4E05-8E99-6A009D2B36AF

It's definitely a bad idea to assume UTF-8.
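
The concern in this thread can be reproduced with a small sketch. It assumes a hypothetical one-line bundle stored as ISO 8859-1 with unescaped characters and reads it through `Properties.load(Reader)` with two different charsets; `MisdecodeDemo` is an illustrative name, not part of NLS.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

final class MisdecodeDemo {

    /** Reads one value from properties-file bytes using a fixed charset, with no fallback. */
    static String readValue(byte[] fileBytes, Charset charset, String key) {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(fileBytes), charset)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // A legal, unescaped ISO 8859-1 properties file.
        byte[] latin1 = "greeting=Gr\u00fc\u00dfe".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(readValue(latin1, StandardCharsets.ISO_8859_1, "greeting"));
        // Blindly decoding as UTF-8 substitutes U+FFFD for the invalid bytes.
        System.out.println(readValue(latin1, StandardCharsets.UTF_8, "greeting"));
    }
}
```

With a fixed UTF-8 reader the non-ASCII characters are silently replaced, which is the breakage described above for existing files.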

laeubi · 2024-12-03T13:13:28Z

Since there's no package-info in the enclosing package, can @laeubi's suggestion be safely omitted?

This is completely independent from package-info ...

ShadelessFox · 2024-12-03T14:09:57Z

This is completely independent from package-info ...

I'm a bit lost. Can you please hint where that package version you're talking about is located?

laeubi · 2024-12-03T14:17:40Z

This is completely independent from package-info ...

I'm a bit lost. Can you please hint where that package version you're talking about is located?

equinox/bundles/org.eclipse.osgi/supplement/META-INF/MANIFEST.MF

Line 20 in fd8cb34

org.eclipse.osgi.util;version="1.1"

bundles/org.eclipse.osgi/supplement/src/org/eclipse/osgi/util/NLS.java

ShadelessFox · 2024-12-03T14:39:01Z

Updated the PR while hopefully addressing all issues. Please let me know if I missed something.

laeubi · 2024-12-03T14:42:23Z

By the way, I think the best way forward would be to add support for a UTF-8 BOM together with a buffered stream of 4 bytes. This would make it possible to detect whether the file is "standard" or UTF encoded.

laeubi

I don't think it works in its current form: ISO_8859_1 must remain the default. Besides that, at least some test cases should be added to ensure the code works as expected. This is a very central class used everywhere in Eclipse, so there is a high risk of breaking things (now or in the future) without at least minimal test coverage.

ShadelessFox · 2024-12-03T14:55:20Z

Thanks for the feedback. Hopefully, I'll get my hands on this in the future. :^)

ShadelessFox · 2024-12-03T15:06:42Z

By the way, I think the best way forward would be to add support for a UTF-8 BOM together with a buffered stream of 4 bytes. This would make it possible to detect whether the file is "standard" or UTF encoded.

Unfortunately, the BOM's byte values are valid ISO 8859-1 characters unless we assume no one is ever going to use ï»¿ as the first three characters of a message bundle.

laeubi · 2024-12-03T15:12:49Z

Unfortunately, the BOM's byte values are valid ISO 8859-1 characters unless we assume no one is ever going to use ï»¿ as the first three characters of a message bundle.

I think we can safely assert that this is very unlikely, as messages are usually mapped to Java constants and this is not a valid Java identifier.
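
For reference, the BOM idea discussed here could look roughly like the following. This is only a sketch of the suggestion, not code from the PR; `BomSniffer` and `detect` are hypothetical names, and it checks just the three-byte UTF-8 BOM (EF BB BF) on a stream that supports mark/reset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomSniffer {

    private static final int[] UTF8_BOM = {0xEF, 0xBB, 0xBF};

    /**
     * Returns UTF-8 if the stream starts with the UTF-8 BOM (the BOM is consumed),
     * otherwise rewinds the stream and returns ISO 8859-1.
     */
    static Charset detect(InputStream in) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        try {
            in.mark(UTF8_BOM.length);
            for (int expected : UTF8_BOM) {
                if (in.read() != expected) {
                    in.reset(); // not a BOM: hand the bytes back untouched
                    return StandardCharsets.ISO_8859_1;
                }
            }
            return StandardCharsets.UTF_8;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a', '=', 'b'};
        System.out.println(detect(new ByteArrayInputStream(withBom)));
        System.out.println(detect(new ByteArrayInputStream("a=b".getBytes(StandardCharsets.ISO_8859_1))));
    }
}
```

The trade-off raised in the thread still applies: the same three bytes are valid ISO 8859-1 text, so this is a heuristic that also requires every UTF-8 file to actually carry a BOM.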

merks · 2024-12-03T15:44:23Z

As I mentioned above, see this comment:

Most existing properties files should not be affected: UTF-8 and ISO-8859-1 have the same encoding for ASCII characters, and human-readable non-ASCII ISO-8859-1 encoding is not valid UTF-8. If an invalid UTF-8 byte sequence is detected, the Java runtime automatically rereads the file in ISO-8859-1.

That sounds to me like a much better approach, especially if that's what Java itself does.
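
A hand-rolled version of that "try UTF-8, then re-read as ISO-8859-1" behavior might look like the sketch below. It is an assumption-laden illustration: `StrictThenFallback` is a made-up name, and it decides per whole file in memory rather than mirroring the JDK's internal streaming decoder.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

final class StrictThenFallback {

    /** Decodes strictly as UTF-8; on any invalid sequence, decodes the same bytes as ISO 8859-1. */
    static Properties load(byte[] fileBytes) {
        String text;
        try {
            text = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(fileBytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: treat the file as a legacy ISO 8859-1 bundle.
            text = new String(fileBytes, StandardCharsets.ISO_8859_1);
        }
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }
}
```

Calling `StrictThenFallback.load(bytes).getProperty("greeting")` should then give the same result for a UTF-8 file and for its ISO 8859-1 counterpart, without any new public API.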

merks · 2024-12-03T15:48:40Z

Here too:

https://docs.oracle.com/javase/9/docs/api/java/util/PropertyResourceBundle.html

So it seems to me that we don't need new APIs. We could simply implement the same behavior.

merks

Please reconsider implementing the behavior described here instead:

https://docs.oracle.com/javase/9/docs/api/java/util/PropertyResourceBundle.html

laeubi · 2024-12-03T15:50:22Z

Here too:

https://docs.oracle.com/javase/9/docs/api/java/util/PropertyResourceBundle.html

So it seems to me that we don't need new APIs. We could simply implement the same behavior.

I never understood why platform has its own mechanism instead of reusing the Java one, but there must be a reason :-)

merks · 2024-12-03T15:59:00Z

It is reusing Java, using it to set the fields reflectively:

equinox/bundles/org.eclipse.osgi/supplement/src/org/eclipse/osgi/util/NLS.java

Lines 409 to 467 in 793dcc5

    
    private static class MessagesProperties extends Properties {

        private static final int MOD_EXPECTED = Modifier.PUBLIC | Modifier.STATIC;
        private static final int MOD_MASK = MOD_EXPECTED | Modifier.FINAL;
        private static final long serialVersionUID = 1L;

        private final String bundleName;
        private final Map<Object, Object> fields;
        private final boolean isAccessible;

        public MessagesProperties(Map<Object, Object> fieldMap, String bundleName, boolean isAccessible) {
            super();
            this.fields = fieldMap;
            this.bundleName = bundleName;
            this.isAccessible = isAccessible;
        }

        /* (non-Javadoc)
         * @see java.util.Hashtable#put(java.lang.Object, java.lang.Object)
         */
        @Override
        public synchronized Object put(Object key, Object value) {
            Object fieldObject = fields.put(key, ASSIGNED);
            // if already assigned, there is nothing to do
            if (fieldObject == ASSIGNED)
                return null;
            if (fieldObject == null) {
                final String msg = "NLS unused message: " + key + " in: " + bundleName;//$NON-NLS-1$ //$NON-NLS-2$
                if (SupplementDebug.STATIC_DEBUG_MESSAGE_BUNDLES)
                    System.out.println(msg);
                // keys with '.' are ignored by design (bug 433424)
                if (key instanceof String && ((String) key).indexOf('.') < 0) {
                    log(SEVERITY_WARNING, msg, null);
                }
                return null;
            }
            final Field field = (Field) fieldObject;
            //can only set value of public static non-final fields
            if ((field.getModifiers() & MOD_MASK) != MOD_EXPECTED)
                return null;
            try {
                // Check to see if we are allowed to modify the field. If we aren't (for instance
                // if the class is not public) then change the accessible attribute of the field
                // before trying to set the value.
                if (!isAccessible)
                    field.setAccessible(true);
                // Set the value into the field. We should never get an exception here because
                // we know we have a public static non-final field. If we do get an exception, silently
                // log it and continue. This means that the field will (most likely) be un-initialized and
                // will fail later in the code and if so then we will see both the NPE and this error.
                // Extra care is taken to be sure we create a String with its own backing char[] (bug 287183)
                // This is to ensure we do not keep the key chars in memory.
                field.set(null, new String(((String) value).toCharArray()));
            } catch (Exception e) {
                log(SEVERITY_ERROR, "Exception setting field value.", e); //$NON-NLS-1$
            }
            return null;
        }
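
The mechanism in the excerpt above can be boiled down to a stand-alone sketch: a `Properties` subclass whose `put` assigns each loaded value to a matching static field instead of storing it. Everything here (`FieldAssigningDemo`, `Messages`, `initialize`) is hypothetical and heavily simplified; the logging, the `ASSIGNED` bookkeeping, and the accessibility handling of the real class are left out.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Properties;

final class FieldAssigningDemo {

    /** Hypothetical message class; stands in for a real NLS subclass. */
    public static class Messages {
        public static String greeting;
        public static String farewell;
    }

    /** Properties subclass whose put() writes each value into a same-named static field. */
    static class FieldAssigningProperties extends Properties {
        private static final long serialVersionUID = 1L;
        private final Class<?> target;

        FieldAssigningProperties(Class<?> target) {
            this.target = target;
        }

        @Override
        public synchronized Object put(Object key, Object value) {
            try {
                Field field = target.getDeclaredField((String) key);
                int mods = field.getModifiers();
                // Only public static non-final fields are assigned, as in the excerpt.
                if (Modifier.isPublic(mods) && Modifier.isStatic(mods) && !Modifier.isFinal(mods)) {
                    field.set(null, value);
                }
            } catch (NoSuchFieldException e) {
                // No constant for this key; this sketch simply ignores it.
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
            return null; // nothing is kept in the table itself
        }
    }

    /** Parses properties text and pushes the values into the target class's static fields. */
    static void initialize(String fileText, Class<?> target) {
        try {
            new FieldAssigningProperties(target).load(new StringReader(fileText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

After `FieldAssigningDemo.initialize("greeting=Hello\nfarewell=Bye\n", FieldAssigningDemo.Messages.class)`, the two static fields hold the loaded strings and the `Properties` object itself can be discarded.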

tjwatson · 2024-12-03T15:56:21Z

bundles/org.eclipse.osgi/supplement/src/org/eclipse/osgi/util/NLS.java

+	 * @param clazz the class where the constants will exist
+	 */
+	public static void initializeMessages(final String baseName, final Class<?> clazz) {
+		initializeMessages(baseName, clazz, StandardCharsets.UTF_8);


This API has been around for 20+ years; it is risky to change the default at this point. I'm fine adding a new initializeMessages method that takes a Charset, but the original method should still use ISO_8859_1. This way bundles that are willing to use UTF-8 can make the change. But changing it to UTF-8 for everyone under the covers is risky. If you want, you could allow a null charset and have null default to UTF-8.

bundles/org.eclipse.osgi/supplement/META-INF/MANIFEST.MF

tjwatson · 2024-12-06T15:27:10Z

Does that mean the proposed solution, which relies on BOM detection, is not desired and should be removed from the PR, leaving us with an overloaded method that takes Charset? I just want to be extra sure I understand your intentions.

Yes, that is my suggestion. Leave the existing method unchanged. Add a new method that takes the charset. That seems the simplest way to add the function without any impact on existing users.
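
As a rough illustration of this suggestion (which is not what was eventually merged), an overload pair could be shaped like the sketch below. `CharsetOverloadSketch` and its `load` methods are hypothetical stand-ins operating on raw bytes, not the real `initializeMessages` signatures.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

final class CharsetOverloadSketch {

    /** Existing behavior stays untouched: ISO 8859-1. */
    static Properties load(byte[] fileBytes) {
        return load(fileBytes, StandardCharsets.ISO_8859_1);
    }

    /** New opt-in overload; a null charset is treated as UTF-8, as floated in the review. */
    static Properties load(byte[] fileBytes, Charset charset) {
        Charset effective = charset != null ? charset : StandardCharsets.UTF_8;
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(fileBytes), effective)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }
}
```

With this shape every caller has to know the encoding of its own files up front, which is the migration cost that the following comment objects to.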

ShadelessFox · 2024-12-09T11:36:41Z

Adding an overloaded method brings no benefit, only pain, when migrating message bundles to UTF-8.

Since non-ASCII ISO 8859-1 content is not valid UTF-8, changing the encoding of a single message bundle will require re-encoding all of its properties files in UTF-8. This will potentially break existing fragment bundles that provide localization for external bundles, and it creates the opposite issue as well: UTF-8 encoded properties files being used with an external bundle that hasn't been migrated to UTF-8 yet.

I did some research and noticed that an OSGi bundle's localization files (OSGI-INF/l10n/bundle) are actually loaded using PropertyResourceBundle:

equinox/bundles/org.eclipse.osgi/container/src/org/eclipse/osgi/storage/ManifestLocalization.java

Lines 256 to 258 in caf78f7

    
    private class LocalizationResourceBundle extends PropertyResourceBundle implements BundleResourceBundle {
        public LocalizationResourceBundle(InputStream in) throws IOException {

Its Javadoc says:

Constructing a PropertyResourceBundle instance from an InputStream requires that the input stream be encoded in UTF-8. By default, if a MalformedInputException or an UnmappableCharacterException occurs on reading the input stream, then the PropertyResourceBundle instance resets to the state before the exception, re-reads the input stream in ISO-8859-1, and continues reading.

Therefore, I propose switching to using PropertyResourceBundle for NLS as well.
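
One way such a switch could be wired up, sketched under the assumption that the decoded entries are simply forwarded to an existing `put`-based consumer such as the `MessagesProperties` class quoted earlier. `BundleBridgeSketch` and `copyInto` are hypothetical names; the merged change may be structured differently.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.PropertyResourceBundle;

final class BundleBridgeSketch {

    /**
     * Lets PropertyResourceBundle do the decoding (UTF-8 with ISO 8859-1 fallback on Java 9+)
     * and forwards every entry to the given sink, for example a Properties subclass.
     */
    static void copyInto(byte[] fileBytes, Map<Object, Object> sink) {
        PropertyResourceBundle bundle;
        try {
            bundle = new PropertyResourceBundle(new ByteArrayInputStream(fileBytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        for (String key : bundle.keySet()) {
            sink.put(key, bundle.getString(key));
        }
    }
}
```

The bundle object is only a temporary carrier here, which is where the startup and garbage concerns raised in the replies below come from.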

merks · 2024-12-09T11:53:58Z

Reusing something from the JDK that already is geared to address the same problem sounds promising...

tjwatson · 2024-12-09T15:42:13Z

Therefore, I propose switching to using PropertyResourceBundle for NLS as well.

It is worth a try. My minor concern is that it will have a noticeable impact on initial startup for very large installs of Eclipse (with many bundles), given that we will now need to construct many PropertyResourceBundle instances just to load the messages, get the values out of them, and then throw the objects away for GC.

Like I mentioned before, this was a measurable performance issue, but that was long ago on much older Java versions.

tjwatson · 2024-12-09T16:07:48Z

I don't see the advantage of using the PropertyResourceBundle over directly using java.util.Properties.load(Reader).

merks · 2024-12-09T16:17:18Z

I've not done a deep dive, but if I understand correctly, the PropertyResourceBundle handles these documented cases with the different encodings:

But Properties.load does not...

I don't recall all the history, but I thought the primary reason for the NLS design was to reduce memory footprint, not so much about load performance. I'm not sure about that though...

tjwatson · 2024-12-09T16:35:13Z

Curiosity got the best of me and now I am confused. We already call Properties.load here:
https://github.com/eclipse-equinox/equinox/blob/master/bundles/org.eclipse.osgi/supplement/src/org/eclipse/osgi/util/NLS.java#L350

I was thinking we could simply change that to use a Reader, but that isn't going to work. It turns out that in Java 11 the PropertyResourceBundle(InputStream) constructor was changed to do the UTF-8 stuff:

https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/PropertyResourceBundle.html#%3Cinit%3E(java.io.InputStream)

In Java 8 that was not the case:

https://docs.oracle.com/javase/8/docs/api/java/util/PropertyResourceBundle.html#PropertyResourceBundle-java.io.InputStream-

So we can certainly change to using PropertyResourceBundle, but that only helps on Java 11 or higher. Not really an issue for the vast majority of users. Equinox framework does support Java 8 still, but I am fine with bundles wanting to run on Java 8 not having this ability to use UTF-8.

ShadelessFox · 2024-12-09T16:55:55Z

Not really an issue for the vast majority of users. Equinox framework does support Java 8 still, but I am fine with bundles wanting to run on Java 8 not having this ability to use UTF-8.

Should the Javadoc for NLS#initializeMessages document new behavior regardless of differences between Java versions?

ShadelessFox · 2024-12-09T16:57:52Z

For now, I have updated the PR once again. I followed the most straightforward way that doesn't involve complex refactoring.

tjwatson · 2024-12-09T16:58:27Z

Should the Javadoc for NLS#initializeMessages document new behavior regardless of differences between Java versions?

That is probably a good idea, it allows us to claim a behavior enhancement to the API contract and have a minor version bump that bundles can depend on if they want.

tjwatson

LGTM

Thanks for your patience on this review!

merks

It generally looks very good, with a tiny remark about the \uXXXX versus \XXXX.

bundles/org.eclipse.osgi/supplement/src/org/eclipse/osgi/util/NLS.java

merks

Thanks for being responsive to feedback, for writing tests, and for producing such a nice improvement that aligns better with the modern Java support! 🥇

ShadelessFox · 2024-12-10T11:44:31Z

I truly appreciate all of you for sharing your feedback and guiding me along the way :^)

jukzi · 2024-12-11T07:28:02Z

Please note that this PR is blamed for the I-Build failure; see eclipse-platform/eclipse.platform.releng.aggregator#2648 @ShadelessFox @merks

merks · 2024-12-11T09:40:14Z

Yes, it will be better if (and when) the PR build is as strict about Javadoc errors as the I-build.

ShadelessFox · 2024-12-11T10:56:07Z

I'm sorry for causing havoc. I believe I had no way to guarantee it builds everywhere; I relied solely on the PR's checks.

merks · 2024-12-11T11:04:29Z

No need to apologize. The PR builds should have checked this given that the I-build is so strict about it later on. Everything is back on track and we have a successful I-Build now.

akurtakov · 2024-12-11T11:08:03Z

I can only say that from my POV work on enhancing our releng process is the most important now.

ShadelessFox force-pushed the nls-use-utf8-by-default branch from 6107816 to 6932c79 Compare December 3, 2024 10:32

ShadelessFox force-pushed the nls-use-utf8-by-default branch from 6932c79 to 67a7575 Compare December 3, 2024 10:41

ShadelessFox force-pushed the nls-use-utf8-by-default branch from 67a7575 to fd8cb34 Compare December 3, 2024 13:10

laeubi reviewed Dec 3, 2024

laeubi requested a review from tjwatson December 3, 2024 13:15

laeubi reviewed Dec 3, 2024

ShadelessFox force-pushed the nls-use-utf8-by-default branch from fd8cb34 to 390820d Compare December 3, 2024 14:37

laeubi requested changes Dec 3, 2024

merks self-requested a review December 3, 2024 15:49

merks requested changes Dec 3, 2024

tjwatson requested changes Dec 3, 2024

ShadelessFox force-pushed the nls-use-utf8-by-default branch from 1da5382 to 5c30041 Compare December 9, 2024 16:56

ShadelessFox force-pushed the nls-use-utf8-by-default branch from 5c30041 to ae7cc10 Compare December 9, 2024 17:18

ShadelessFox changed the title ~~Use UTF-8 by default in org.eclipse.osgi.util.NLS~~ Read message bundles in UTF-8 and fall back to 8859-1 in org.eclipse.osgi.util.NLS Dec 9, 2024

tjwatson approved these changes Dec 9, 2024

ShadelessFox requested review from merks and laeubi December 10, 2024 10:38

laeubi approved these changes Dec 10, 2024

merks reviewed Dec 10, 2024

Read message bundles in UTF-8 and fall back to 8859-1 in `org.eclipse…

0552f41

….osgi.util.NLS`

ShadelessFox force-pushed the nls-use-utf8-by-default branch from ae7cc10 to 0552f41 Compare December 10, 2024 11:26

merks self-requested a review December 10, 2024 11:28

merks approved these changes Dec 10, 2024

merks merged commit a9877c4 into eclipse-equinox:master Dec 10, 2024
26 of 27 checks passed

SougandhS mentioned this pull request Dec 11, 2024

4.35 I-Build: I20241210-1800 - BUILD FAILED eclipse-platform/eclipse.platform.releng.aggregator#2648

Closed

1 task
