Import of html coded with unicode UTF-8 fails #8
Comments
@ihemsen, sorry I didn't notice this issue earlier. Can you post a sample file, so I can give the best advice? In the meantime, if you click Ignore, it will look like nothing is imported; however, as long as you don't make any changes to the source HTML, you can select different input-encoding settings to see if there's one that works. Balthisar Tidy only uses macOS' guess, so for the original document even macOS is guessing wrong. Edit: oops, I see the attachment. I'll be back.
If I bring your sample document into BBEdit, for example, it shows me that it's Western (ISO Latin 1) with Unix line endings. It looks good, and all of the diacriticals look good, or at least not garbled (I don't really read Norwegian). In Balthisar Tidy, using the guessed Mac OS Roman screws things up severely. If I manually choose Western (ISO Latin 1) as the input encoding, the document then displays correctly. @ihemsen, I think knowing this will solve your issue. As far as fixing the bug, as indicated above, it's macOS doing the guessing, so I probably won't be able to fix the issue. I imagine that the reason BBEdit works is that they have their own character encoding libraries dating back to the System 7 era! In any case, let me know if this works for you, and again, sorry for the delay.
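To illustrate why the wrong guess garbles the text, here is a minimal Python sketch. It is not Balthisar Tidy or HTML Tidy code; the sample string and the `try_decode` helper are made up for illustration. It encodes a Norwegian phrase as ISO Latin 1 and then decodes the same bytes three ways.

```python
# Hypothetical sample text; any string with Norwegian diacritics would do.
sample = "Blåbærsyltetøy på brød"

# Bytes as an old ISO Latin 1 (ISO-8859-1) file would store them.
raw = sample.encode("iso-8859-1")


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


as_utf8 = try_decode(raw, "utf-8")            # fails: the bytes are not valid UTF-8
as_mac_roman = try_decode(raw, "mac_roman")   # "succeeds", but maps bytes to the wrong characters
as_latin1 = try_decode(raw, "iso-8859-1")     # recovers the original text

print("UTF-8:      ", as_utf8)
print("Mac OS Roman:", as_mac_roman)
print("ISO Latin 1: ", as_latin1)
```

Mac OS Roman and ISO Latin 1 are both single-byte encodings, so decoding with the wrong one does not raise an error; it silently produces different characters. That is likely why a guess can look plausible to the system and still be wrong.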
Thanks @balthisar, Other SW I have worked with has had the encoding change in the Edit menu. As a new Balthisar Tidy user I did look in the "obvious" places in the menu system. So when importing this html source the following happens:
Balthisar detects a mismatch between 1 and 2 and ignores 3. Then it displays the following dialogue box: "Balthisar Tidy opened your document "index.html" successfully, but it appears that the Tidy input-encoding is not properly set. Currently "Unicode (UTF-8)" is specified. Balthisar Tidy will automatically set input encoding to "Western (Mac OS Roman)" for you (unless you choose to ignore). This guess may not be correct, so you should have a look at the Source HTML afterwards and choose the correct input-encoding for this document before making any other changes. Hint: you can choose a default input-encoding in Preferences if you open this type of file often." First of all: Would it be possible for Balthisar Tidy to use the encoding declared by the source code as a third option? [Allow Change] [Use "iso-8859-1"] [Ignore] The empty source html window is also a bit confusing. Perhaps Balthisar Tidy could display greyed text that can be scrolled but not edited until the input-encoding is changed? If you click in the window, a dialogue box could tell you to set input-encoding in the left pane. The text in the dialogue box cited above is understandable when you are familiar with Balthisar Tidy, but importing old html is perhaps the first thing a new user does. I suggest changing the text to: Balthisar Tidy has detected a mismatch between the character set reported by Mac OS, "Mac OS Roman" and the input-encoding setting "Unicode (UTF-8)".
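The "use the encoding declared by the source" suggestion could be sketched roughly as follows in Python. This is only an assumption about how such a check might work, not how Balthisar Tidy or HTML Tidy actually behaves. The function names, the 1024-byte scan limit, and the UTF-8 fallback are all hypothetical choices, and the regular expression is far simpler than a real HTML encoding sniffer.

```python
import codecs
import re

# Simplified pattern: finds charset=... inside a <meta> tag, covering both
# <meta charset="..."> and <meta http-equiv=... content="...; charset=...">.
META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def declared_charset(raw, scan_limit=1024):
    """Return the charset named in a <meta> tag near the top of the file, or None."""
    match = META_CHARSET.search(raw[:scan_limit])
    return match.group(1).decode("ascii").lower() if match else None


def decode_with_declaration(raw, fallback="utf-8"):
    """Decode using the declared charset if it is known and valid, else the fallback.

    Returns a (text, encoding_used) tuple.
    """
    name = declared_charset(raw)
    if name is not None:
        try:
            codecs.lookup(name)  # raises LookupError for unknown encoding names
            return raw.decode(name), name
        except (LookupError, UnicodeDecodeError):
            pass  # declaration was unusable; fall through to the fallback
    return raw.decode(fallback, errors="replace"), fallback


# Hypothetical document resembling an old ISO Latin 1 page.
html = (
    '<html><head><meta http-equiv="Content-Type" '
    'content="text/html; charset=iso-8859-1"><title>Blåbær</title></head></html>'
).encode("iso-8859-1")

text, used = decode_with_declaration(html)
print(used)
print(text)
```

A declaration can itself be wrong, so offering it as a third button next to the system guess, as proposed above, seems safer than applying it silently.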
Thanks for feeding back. I'll see what I can do. I might have to refer this to the upstream project (HTML Tidy proper), as Balthisar Tidy doesn't touch encoding (other than trying to set the input-encoding if it detects no document upon loading). I might consider looking at the charset definition, but I generally try to let HTML Tidy do the heavy lifting. I'm proud of the human interface, though, but it's apparent that this isn't working. I will definitely do something to improve the experience in an upcoming release. I love hearing this type of feedback and being able to improve things. Thanks.
Brilliant!
This is a minor update with lots of small but important changes.
- Native Apple Silicon Support
- New, modern toolbar icons.
- New, modern Preferences icons.
- New, modern UI icons.
- Balthisar Tidy web version gets Balthisar Tidy for Work features.
- Balthisar Tidy is now Balthisar Tidy Classic.
- Balthisar Tidy for Work is now Balthisar Tidy.
- Validator Preference panel now simply uses localhost; using the hostname didn't work with secure transport.
- Sample AppleScripts are no longer included in the application bundle, but can be downloaded from the Help menu.
- Sample AppleScripts have been updated, and the disk image is now notarized.
- Help Book has been updated with modern images and updated content.
- Updated to HTML Tidy 5.7.47.
- Updated to Nu Validator 21.5.16.
- Fix hovering behavior on macOS prior to 10.14
- Alternating Row Colors is no longer the default.
- No longer possible to use system tidy dylibs. Security for macOS is getting tighter these days, so we'll just build Tidy directly into Balthisar Tidy.
- Improve the input encoding helper due to issue #8. Thanks @ihemsen.
- Building my own JDK for each of the targets. This allows me to use the hardened runtime without code-signing issues. This also allows building a mostly-fat JRE for Apple Silicon support.
- Most Cocoa dependencies are now via Carthage.
- Spaces won in the spaces vs tab war. Balthisar Tidy is now completely detabbed.
- Updating documentation comments so that they don't look like ass in the new, default Xcode appearance.
- Feature support has been refactored. In order to migrate to a single "Balthisar Tidy" with future in-app purchases to enable pro features, the current hard-coded features had to be refactored so that features can be enabled in software rather than compiled in.
- All menus now use "Balthisar Tidy," and we will start to remove "for Work" branding.
- Preferences controller demoted to an object that has a MASPreferencesController, so that we can re-build the controller at will to add/remove panels.
- AppleScript improvements, including finally fixing a warning, and updating to the current, modern SDEF format.
- Add the new EdDSA public key required by Sparkle.
- libtidy now used statically and removed last remnants of dylib support.
- Rename to TidyTableViewController, to disambiguate its purpose.
- Renamed preferences classes and nibs to better reflect current names.
- Moved RTF to NSAttributedString functionality to a category, because I keep using it more and more.
- Do better sniffing of possible string encoding mismatches. I thought I'd done this back in 1.01, so I'm not sure how this got lost. In any case, MUCH better.
@ihemsen, it's been a while, and I hope you're still using Balthisar Tidy. If so, and you've not updated to the newest version, I'd love to know your opinion on the behavior of the newest version when there are encoding mismatches!
I just started using Balthisar Tidy after having used PageSpinner for many years.
When opening the old html files, Balthisar displays a message about an incorrect input-encoding and offers to convert to Mac OS Roman. Accepting this garbles the document. Apparently Balthisar has guessed wrong. Choosing "Ignore" does not import any text into the document.
A sample html file and screenshot are attached.
nett.html.zip
I use macOS Catalina 10.15.3 (19D76)
Balthisar Tidy version 4.2.0