Import of html coded with unicode UTF-8 fails #8

Open
ihemsen opened this issue Mar 22, 2020 · 6 comments

@ihemsen

ihemsen commented Mar 22, 2020

I just started using Balthisar Tidy after having used PageSpinner for many years.
When I open my old HTML files, Balthisar Tidy displays a message about an incorrect input-encoding and offers to convert to Mac OS Roman. Accepting this garbles the document, so Balthisar Tidy has apparently guessed wrong. Choosing "Ignore" does not import any text into the document.
A sample HTML file and a screenshot are attached.
Skjermbilde 2020-03-22 kl  17 12 53
nett.html.zip

I use macOS Catalina 10.15.3 (19D76)
Balthisar Tidy version 4.2.0

@balthisar
Owner

balthisar commented Apr 14, 2020

@ihemsen, sorry I didn't notice this issue earlier. Can you post a sample file, so I can give best advice?

In the meantime, if you click ignore, it will look like nothing is imported; however, as long as you don't make any changes to the source HTML, then you can select different input-encoding settings to see if there's one that works.

Balthisar Tidy only uses macOS' guess, so for the original document even macOS is guessing wrong.

Edit: oops, I see the attachment. I'll be back.

@balthisar
Owner

If I bring your sample document into BBEdit, for example, it shows me that it's Western (ISO Latin 1) with Unix line endings. It looks good, and all of the diacriticals look good, or at least not garbled (I don't really read Norwegian).

In Balthisar Tidy, using the guessed MacOS Roman screws things up severely.

If I manually choose Western (ISO Latin 1) as the input encoding, the document then displays correctly.
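
As a rough illustration of why the guess matters, here is a small Python sketch. It is not Balthisar Tidy's code, and the sample word is made up rather than taken from the attached file. It decodes the same ISO Latin 1 bytes three ways: with the right encoding, with Mac OS Roman, and with strict UTF-8.

```python
# Illustrative only: the sample word is an assumption, not text from nett.html.
original = "blåbær"                      # Norwegian text containing å and æ
raw = original.encode("iso-8859-1")      # the bytes an ISO Latin 1 file would hold

as_latin1 = raw.decode("iso-8859-1")     # correct encoding: round-trips cleanly
as_mac_roman = raw.decode("mac_roman")   # wrong guess: same bytes, different letters

try:
    as_utf8 = raw.decode("utf-8")        # strict UTF-8 rejects these byte sequences
except UnicodeDecodeError:
    as_utf8 = None                       # nothing usable comes back

print(repr(as_latin1), repr(as_mac_roman), repr(as_utf8))
```

The Mac OS Roman result has the right length but the wrong accented letters, which matches the garbling described above. The failed UTF-8 decode is one plausible reason the source pane looks empty after choosing "Ignore", although that is a guess about the app's behavior.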

@ihemsen, I think knowing this will solve your issue. As far as fixing the bug, as indicated above, it's macOS doing the guessing, so I probably won't be able to fix the issue. I imagine that the reason BBEdit works is because they have their own character encoding libraries dating back to the System 7 era!

In any case, let me know if this works for you, and again, sorry for the delay.

@ihemsen
Author

ihemsen commented Apr 15, 2020

Thanks @balthisar,
This works, but the user interface was not obvious even with your guidance above. To find the input encoding option I first looked in the Edit menu, then in the File menu and finally found it via the Preferences and Tidy-tab. Then the document loaded correctly the next time. Having done that I now see the input encoding document setting also in the left pane (Tidy options).

Other software I have worked with has had the encoding change in the Edit menu. As a new Balthisar Tidy user I did look in the "obvious" places in the menu system.

So when importing this html source the following happens

  1. Mac OS reports the character set as "Mac OS Roman"
  2. Balthisar Tidy has "Unicode (UTF-8)" as input encoding
  3. The source code itself states content="text/html; charset=iso-8859-1"

Balthisar Tidy detects a mismatch between 1 and 2 and ignores 3. Then it displays the following dialogue box:

"Balthisar Tidy opened your document "index.html" successfully, but it appears that the Tidy input-encoding is not properly set. Currently "Unicode (UTF-8)" is specified.

Balthisar Tidy will automatically set input encoding to "Western (Mac OS Roman)" for you (unless you choose to ignore). This guess may not be correct, so you should have a look at the Source HTML afterwards and choose the correct input-encoding for this document before making any other changes.
[Allow Change] [Ignore]

Hint: you can choose a default input-encoding in Preferences if you open this type of file often."

First of all: Would it be possible for Balthisar Tidy to use the encoding declared by the source code as a third option? [Allow Change] [Use "iso-8859-1"] [Ignore]
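
A minimal sketch of what such a third option could rest on, written in Python purely for illustration. The function names, the regular expression, and the 4096-byte scan limit are all assumptions, and this is not how Balthisar Tidy or HTML Tidy handles encodings. It looks for a charset declaration in a meta tag near the top of the file and only trusts it if the name is a codec Python knows.

```python
import codecs
import re
from typing import Optional, Tuple

# Hypothetical pattern: matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
_META_CHARSET = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)', re.IGNORECASE)


def sniff_declared_charset(raw: bytes, scan_limit: int = 4096) -> Optional[str]:
    """Return the charset named in a meta tag, or None if absent or unknown."""
    match = _META_CHARSET.search(raw[:scan_limit])
    if match is None:
        return None
    name = match.group(1).decode("ascii").lower()
    try:
        codecs.lookup(name)              # reject names Python has no codec for
    except LookupError:
        return None
    return name


def decode_html(raw: bytes, fallback: str = "utf-8") -> Tuple[str, str]:
    """Decode with the declared charset if there is one, else with the fallback."""
    encoding = sniff_declared_charset(raw) or fallback
    return raw.decode(encoding, errors="replace"), encoding


# Made-up sample document using the declaration quoted in item 3 above.
sample = (b'<html><head><meta http-equiv="Content-Type" '
          b'content="text/html; charset=iso-8859-1"></head>'
          b'<body>bl\xe5b\xe6r</body></html>')
text, used = decode_html(sample)
print(used, text)
```

A declaration can itself be wrong, so in practice it would probably be offered as a choice next to the operating system's guess rather than applied silently.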

The empty source HTML window is also a bit confusing. Perhaps Balthisar Tidy could display greyed-out text that can be scrolled but not edited until the input-encoding is changed? If you click in the window, a dialogue box could tell you to set the input-encoding in the left pane.

The text in the dialogue box cited above is understandable when you are familiar with Balthisar Tidy, but importing old HTML is perhaps the first thing a new user does. I suggest changing the text to:

Balthisar Tidy has detected a mismatch between the character set reported by Mac OS, "Mac OS Roman" and the input-encoding setting "Unicode (UTF-8)".
The input-encoding setting can be found in the left pane "TIDY OPTIONS". Try different encodings and check the text to see if it is displayed correctly.
Balthisar Tidy will import the text as "Mac OS Roman".
[OK]
Hint: you can choose a default input-encoding in Preferences if you open this type of file often.

@balthisar
Owner

Thanks for the feedback. I'll see what I can do. I might have to refer this to the upstream project (HTML Tidy proper), as Balthisar Tidy doesn't touch encoding (other than trying to set the input-encoding if it detects no document upon loading). I might consider looking at the charset definition, but I generally try to let HTML Tidy do the heavy lifting.

I'm proud of the human interface, but it's apparent that it isn't working here. I will definitely do something to improve the experience in an upcoming release. I love hearing this type of feedback and being able to improve things.

Thanks.

@ihemsen
Author

ihemsen commented Apr 15, 2020

Brilliant!
I am looking forward to tidying my website with Balthisar.

balthisar added a commit that referenced this issue Apr 28, 2021
…e I keep using it

  more and more.
- Do better sniffing of possible string encoding mismatches. I thought I'd done this
  back in 1.01, so I'm not sure how this got lost. In any case, MUCH better.
- Improve the input encoding helper due to issue #8. Thanks @ihemsen.
balthisar pushed a commit that referenced this issue May 16, 2021
This is a minor update with lots of small but important changes.

- Native Apple Silicon Support
- New, modern toolbar icons.
- New, modern Preferences icons.
- New, modern UI icons.
- Balthisar Tidy web version gets Balthisar Tidy for Work features.
- Balthisar Tidy is now Balthisar Tidy Classic.
- Balthisar Tidy for Work is now Balthisar Tidy.
- Validator Preference panel now simply uses localhost; using the hostname didn't work
  with secure transport.
- Sample AppleScripts are no longer included in the application bundle, but can be
  downloaded from the Help menu.
- Sample AppleScripts have been updated, and the disk image is now notarized.
- Help Book has been updated with modern images and updated content.
- Updated to HTML Tidy 5.7.47.
- Updated to Nu Validator 21.5.16.
- Fix hovering behavior on macOS prior to 10.14
- Alternating Row Colors is no longer the default.
- No longer possible to use system tidy dylibs. Security for macOS is getting tighter
  these days, so we'll just build Tidy directly into Balthisar Tidy.
- Improve the input encoding helper due to issue #8. Thanks @ihemsen.

- Building my own JDK for each of the targets. This allows me to use the hardened
  runtime without code-signing issues. This also allows building a mostly-fat JRE
  for Apple Silicon support.
- Most Cocoa dependencies are now via Carthage.
- Spaces won in the spaces vs tab war. Balthisar Tidy is now completely detabbed.
- Updating documentation comments so that they don't look like ass in the
  new, default Xcode appearance.
- Feature support has been refactored. In order to migrate to a single "Balthisar Tidy"
  with future in-app purchases to enable pro features, the current hard-coded features
  had to be refactored so that features can be enabled in software rather than compiled
  in.
- All menus now use "Balthisar Tidy," and we will start to remove "for Work" branding.
- Preferences controller demoted to an object that has a MASPreferencesController, so
  that we can re-build the controller at will to add/remove panels.
- AppleScript improvements, including finally fixing a warning, and updating to the
  current, modern SDEF format.
- Add the new EdDSA public key required by Sparkle.
- libtidy now used statically and removed last remnants of dylib support.
- Rename to TidyTableViewController, to disambiguate its purpose.
- Renamed preferences classes and nibs to better reflect current names.
- Moved RTF to NSAttributedString functionality to a category, because I keep using it
  more and more.
- Do better sniffing of possible string encoding mismatches. I thought I'd done this
  back in 1.01, so I'm not sure how this got lost. In any case, MUCH better.
@balthisar
Owner

@ihemsen, it's been a while, and I hope you're still using Balthisar Tidy. If so, and you've not updated to the newest version, I'd love to know your opinion on the behavior of the newest version when there are encoding mismatches!
