-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Read message bundles in UTF-8 and fall back to 8859-1 in org.eclipse.osgi.util.NLS
#709
Conversation
Please remove merge commit. |
FYI, we typically expect folks to follow these steps: So typically folks will amend and force push to maintain a single commit in the PR... |
6107816
to
6932c79
Compare
I pulled changes from the upstream and removed the unwanted merge commit. |
6932c79
to
67a7575
Compare
The build fails with:
I think this is new API (right @tjwatson ?) so there needs to be an |
New API requires a package version increment of the minor version as well. |
67a7575
to
fd8cb34
Compare
Since there's no |
* @param clazz the class where the constants will exist | ||
*/ | ||
public static void initializeMessages(final String baseName, final Class<?> clazz) { | ||
initializeMessages(baseName, clazz, StandardCharsets.UTF_8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure UTF-8 is a good default here, as it will most likely break a lot of exiting behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR's whole purpose is to make UTF-8 the default. I tested it locally against various languages; the behavior is the same against both ASCII-only messages and messages containing escape sequences (this includes anything outside the ASCII range). Please see my original comment on this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please test with a file encoded with ISO_8859_1
special chars, ASCII is not the default and escaping is not mandatory see here
The load(InputStream) / store(OutputStream, String) methods work the same way as the load(Reader)/store(Writer, String) pair, except the input/output stream is encoded in ISO 8859-1 character encoding. Characters that cannot be directly represented in this encoding can be written using Unicode escapes as defined in section 3.3 of The Java™ Language Specification;
This means that it is totally valid to have any ISO 8859-1 characters in such a file and reading it as UTF-8 will break these existing file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, one must assume that ISO 8859-1 is commonly used. it seems to me that Java was making some changes in this area and I found this:
It's definitely a bad idea to assume UTF-8.
This is completely independent from |
I'm a bit lost. Can you please hint where that package version you're talking about is located? |
|
bundles/org.eclipse.osgi/supplement/src/org/eclipse/osgi/util/NLS.java
Outdated
Show resolved
Hide resolved
fd8cb34
to
390820d
Compare
Updated the PR while hopefully addressing all issues. Please let me know if I missed something. |
By the way I think the best way forward would be to add support for UTF-8 BOM together with a buffered stream of 4 bytes this would allow to detect if the file is "standard" or UTF encoded. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it works in the current form an ISO_8859_1
must be the default, beside that one should add at least some test-cases to ensure the code works as expected as this is a very central class used everywhere in Eclipse there is a high risk of breaking things (now or in the future) without at least minimal test coverage..
Thanks for the feedback. Hopefully, I'll get my hands on this in the future. :^) |
Unfortunately, the BOM's byte values are valid ISO 8859-1 characters unless we assume no one is ever going to use |
I think we can safely assert that this is very unlikely as messages are usually mapped to Java constants and this is not a valid java identifier. |
As I mentioned above, see this comment:
That sounds to me like a much better approach, especially if that's what Java itself does. |
Here too: https://docs.oracle.com/javase/9/docs/api/java/util/PropertyResourceBundle.html So it seem to me that we don't need new APIs. We could simply implement the same behavior. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please reconsider implementing the behavior described here instead:
https://docs.oracle.com/javase/9/docs/api/java/util/PropertyResourceBundle.html
I never understood why platform has its own mechanism instead of reusing the the java one, but there must be a reason :-) |
It is reusing Java, using it to set the fields reflectively: equinox/bundles/org.eclipse.osgi/supplement/src/org/eclipse/osgi/util/NLS.java Lines 409 to 467 in 793dcc5
|
* @param clazz the class where the constants will exist | ||
*/ | ||
public static void initializeMessages(final String baseName, final Class<?> clazz) { | ||
initializeMessages(baseName, clazz, StandardCharsets.UTF_8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This API has been around for 20+ years, it is risky to change the default at this point. I'm fine adding a new initializeMessages
method that takes a Charset, but the original method should still use ISO_8859_1
. This way bundles that are willing to use UTF-8 can make the change. But changing it to UTF-8 for all, under the covers is risky. If you want, you could allow null
charset and have null
default to UTF-8.
Yes, that is my suggestion. Leave the existing method unchanged. Add a new method that takes the charset. That seems the most simple way to add the function without any impact to existing users. |
Adding an overload method will not have any benefits but pain when migrating to using UTF-8 in message bundles. Since not all characters in 8859-1 are valid UTF-8, changing the encoding of a single message bundle will require re-encoding all its properties files in UTF-8. This will potentially break existing fragment bundles that provide localization for external bundles, and the opposite issue of using UTF-8 encoded properties files in an external bundle that hasn't been migrated to using UTF-8 yet. I did a small research and noticed that the OSGI bundle's localization files ( equinox/bundles/org.eclipse.osgi/container/src/org/eclipse/osgi/storage/ManifestLocalization.java Lines 256 to 258 in caf78f7
Its Javadoc says:
Therefore, I propose switching to using |
Reusing something from the JDK that already is geared to address the same problem sounds promising... |
It is worth a try. My minor concern is that it will have a noticeable impact on initial startup on very large installs of Eclipse (with many bundles) given that we will now need to construct many PropertyResourceBundle to load the messages only to get the values out of and then throw the objects away for GC. Like I mentioned before, this was a measurable performance issue, but that was long ago on much older Java versions. |
I don't see the advantage of using the |
I've not done a deep dive, but if I understand correctly, the But I don't recall all the history, but I thought the primary reason for the NLS design what to reduce memory footprint, not so much about load performance. I'm not sure about that though... |
Curiosity got the best of me and now I am confused. We already call I was thinking we could simply change that to use a Reader. But that isn't going to work. Turns out in Java 11 that changed In Java 8 that was not the case: So we can certainly change to using |
Should the Javadoc for |
1da5382
to
5c30041
Compare
For now, I have updated the PR once again. I followed the most straightforward way that doesn't involve complex refactoring. |
That is probably a good idea, it allows us to claim a behavior enhancement to the API contract and have a minor version bump that bundles can depend on if they want. |
5c30041
to
ae7cc10
Compare
org.eclipse.osgi.util.NLS
org.eclipse.osgi.util.NLS
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks for you patience on this review!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It generally looks very good with tiny remark about the \uXXXX
versus \XXXX
.
bundles/org.eclipse.osgi/supplement/src/org/eclipse/osgi/util/NLS.java
Outdated
Show resolved
Hide resolved
ae7cc10
to
0552f41
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for being responsive to feedback, for writing tests, and for producing such a nice improvement that aligns better with the modern Java support! 🥇
I truly appreciate all of you for sharing your feedback and guiding me along the way :^) |
Please note that this PR is blamed to fail I-Build, see eclipse-platform/eclipse.platform.releng.aggregator#2648 @ShadelessFox @merks |
Yes, it will be better if (and when) the PR build is as strict about Javadoc errors as the I-build. |
I'm sorry for causing havoc. I believe I had no way to guarantee it builds everywhere; I relied solely on the PR's checks. |
No need to apologize. The PR builds should have checked this given that the I-build is so strict about it later on. Everything is back on track and we have a successful I-Build now. |
I can only say that from my POV work on enhancing our releng process is the most important now. |
All message bundles are currently loaded in the ISO 8859-1 character encoding. This enforces any message bundle that uses non-ASCII characters to escape such characters using escape sequences. This makes it notoriously hard to analyze or process such files without using specialized tools such as IDEs or relying on the
native2ascii
command-line tool.This PR makes UTF-8 the default encoding when using
org.eclipse.osgi.util.NLS#initializeMessages
by leveraging the usage of PropertyResourceBundle. Despite interpreting the stream in UTF-8 by default, it still correctly handles 8859-1 that is not compatible with UTF-8 by re-reading the stream from the start if it cannot be properly decoded in UTF-8 otherwise.