No Unicode support? :-( #5

Closed

dertuxmalwieder opened this issue Oct 13, 2018 · 47 comments

@dertuxmalwieder

One would think that SciTECO could insert German umlauts just fine, but iÄÖÜ$$ leads to visual garbage.

The manpage states:

Currently however, SciTECO will only handle ASCII files.

That does not imply that direct input of non-ASCII characters into an empty document is forbidden, so I thought it might work...?

@rhaberkorn
Owner

rhaberkorn commented Oct 16, 2018

Yes, that's true. SciTECO currently does not handle Unicode. Inserting Unicode characters would be relatively easy to implement, but full Unicode support is unfortunately not trivial. The problem is that SciTECO as a language views documents as a sequence of characters/glyphs: C should move across an entire character, instead of a single byte of a UTF-8 sequence. UTF-8 documents are not randomly accessible, though, so some clever heuristics need to be implemented. Also, I'm still somewhat unsure which rules to impose for interactions between buffers and Q-Registers - all of which could have different encodings. Not to mention that the parser needs to be Unicode-aware as well. This feature will also generally complicate the Curses UI, as we will have to support more Curses variants. Most of these problems are specific to SciTECO; the fact that Scintilla already supports Unicode does not help much.
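A minimal illustration of that byte-vs-glyph problem, assuming glib (which SciTECO already builds upon); the buffer contents and helper usage are purely demonstrative:

    /* Why byte-wise movement breaks on UTF-8: each umlaut occupies two bytes,
     * so stepping one byte from 'i' lands in the middle of the 'Ä' sequence.
     * Build: cc demo.c $(pkg-config --cflags --libs glib-2.0) */
    #include <string.h>
    #include <stdio.h>
    #include <glib.h>

    int main(void)
    {
        const gchar *buf = "iÄÖÜ";  /* 1 ASCII byte + three 2-byte umlauts */

        printf("bytes: %zu, glyphs: %ld\n",
               strlen(buf), g_utf8_strlen(buf, -1));        /* 7 vs. 4 */

        const gchar *byte_move  = buf + 2;                   /* mid-'Ä': invalid UTF-8 */
        const gchar *glyph_move = g_utf8_next_char(buf + 1); /* start of 'Ö' */

        printf("byte-wise: %s\nglyph-wise: %s\n", byte_move, glyph_move);
        return 0;
    }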

This feature, however, has long been on my TODO list. I had already begun implementing it (but then lost interest), and it will eventually be tackled.

Currently I'm working on a conversion of the entire codebase from C++ to plain C and won't start work on this until I've finished that.

@dertuxmalwieder
Author

I never assumed that this would be an easy task... :-) Unicode in plain C hurts a lot though, at least if you want to target the Microsoft compiler. Good luck, I guess. (But I'm seriously looking forward to it. In fact, it is the only real "blocker" for me. :-))

@dertuxmalwieder
Author

I noticed that you're back from hibernation (well, conversion). Good! :-)
And thank you for the ongoing work.

@rhaberkorn
Owner

I want to do a release first before tackling Unicode. It's overdue. Somehow I got distracted again. Ah yes, I couldn't sort out all the problems with the Windows port because I did not have any Windows machine around and things evidently change and break every few years...

@dertuxmalwieder
Author

I have upgraded to Windows 11 now and I still have Visual Studio here. I guess I could - at least - test whatever is there...

@rhaberkorn
Owner

I now have a Windows 2008 Server installation in a VM, which is the absolute minimum nowadays. (Not my choice. If it were up to me, I'd even support MSDOS.) I've consequently resolved all of the remaining Windows issues, and working Windows 32-bit Curses and Gtk+ binaries are now built automatically as part of the nightly builds. Feel free to test these.

I'm personally not using Windows, so the Windows builds will usually be tested poorly compared to the Linux/Ubuntu builds. I "dogfood" everything to myself, so bugs usually get resolved very quickly. It's just that I run Windows only to do SciTECO stuff.

I'm currently resolving some remaining issues with the Gtk+ interface, which should be fully supported in the v2.0 release. Afterwards, I want to (re)test Haiku OS, since I hope it runs much more stably now. I also haven't tested FreeBSD in ages. I will probably refrain from getting packages upstreamed for Haiku and FreeBSD for the time being, but I'd at least like SciTECO to compile and not crash on them before I release v2.0. Some kind of generic Linux AppImage packaging was also on the agenda, but I'll postpone that as well; otherwise I will never get any release out.

You don't happen to have a Mac with an actual Mac OS? The Mac situation is even worse, unfortunately.

@dertuxmalwieder
Author

I actually do (a 2019 one, running macOS 11). Considering that my own development projects don't get much love these days (my day only has 24 hours), is there anything I can do for you on it?

@rhaberkorn
Owner

Of course, you can try to build SciTECO on Mac OS and see whether it runs and works as expected. But don't bother doing so until the CI runs successfully for Mac OS on the master branch. It doesn't right now, but the issue has already been fixed and will be merged soon.

Of course, I'd like to get Mac OS packages as well. But I'm not experienced with Mac OS packaging at all. How do you think a program like SciTECO is best packaged? Perhaps simply adding a Homebrew port? But that won't work for nightly builds. How would you usually distribute a Mac OS CLI program that can be downloaded and installed manually?

@dertuxmalwieder
Author

I, personally, am a strong advocate of pkgsrc on macOS as it doesn't clutter my file system as much as Homebrew does. That would have the additional advantage of you only having to maintain one single package for all platforms (except Windows which is not supported by it).

Most other tech-savvy macOS users I know either use Homebrew or MacPorts, although MacPorts seems to be disappearing these days.

@rhaberkorn
Owner

I, personally, am a strong advocate of pkgsrc on macOS as it doesn't clutter my file system as much as Homebrew does. That would have the additional advantage of you only having to maintain one single package for all platforms (except Windows which is not supported by it).

Most other tech-savvy macOS users I know either use Homebrew or MacPorts, although MacPorts seems to be disappearing these days.

All of these download packages from official repos as far as I understand. That's nice and I'd definitely like to be there, but the version in the repos will always be outdated. Do you never install CLI apps using some kind of package format (cf. how you can download and install *.deb packages)?

@dertuxmalwieder
Author

No, never, at least not on macOS. I don't use Linux at all.

@rhaberkorn
Owner

Of course, you can try to build SciTECO on Mac OS and see whether it runs and works as expected. But don't bother doing so until the CI runs successfully for Mac OS on the master branch. It doesn't right now, but the issue has already been fixed and will be merged soon.

Mac OS builds now. Test suite runs. Whether it displays anything, I have no idea.
https://github.com/rhaberkorn/sciteco/runs/3883169018?check_suite_focus=true

@rhaberkorn
Owner

No, never, at least not on macOS. I don't use Linux at all.

So you actually used a Mac OS build of SciTECO at some point in the past?

It's a pity I cannot just link everything in statically and provide a tar.gz of the UNIX tree for experienced users.

@dertuxmalwieder
Author

It's a pity I cannot just link everything in statically and provide a tar.gz of the UNIX tree for experienced users.

Why not? :D

@rhaberkorn
Owner

It's a pity I cannot just link everything in statically and provide a tar.gz of the UNIX tree for experienced users.

Why not? :D

Well, true. Why not! If Homebrew provides static glib libraries, it should be possible. I should even be able to test this using "Darling". It might actually also be an alternative to AppImage as a generic distribution method for Linux - I don't think that AppImage would work very well.
(I cannot currently statically link on Windows because some of the glib tools I have to ship are actually dynamically linked. But that's not a big deal on Windows anyway.)

@dertuxmalwieder
Author

It shouldn't be a big deal on other systems either. Shared libraries exist for a reason. But indeed, Windows users are (usually) less annoyed by that.

@rhaberkorn
Owner

It shouldn't be a big deal on other systems either. Shared libraries exist for a reason. But indeed, Windows users are (usually) less annoyed by that.

Well, it's not that easy. You don't want to install a bunch of third-party libraries into your standard library directory, as they could conflict with your system libraries. It is also not guaranteed that these libraries will work, because of their transitive dependencies. Installing stuff into your root is generally discouraged, as it is not managed by the native package manager. Apps that ship their own copies of third-party libs on Linux and are not packaged natively usually use some kind of ld.so hack so they don't have to install them. That's actually why technologies like AppImage and Snap exist: they package an application along with all of its transitive dependencies in a way that does not affect the rest of your system. Nowadays people also (ab)use Docker for that purpose, but I refuse to distribute a system utility as a Docker image!

@dertuxmalwieder
Author

dertuxmalwieder commented Oct 13, 2021

I also don't use "containers" because I prefer to know what's actually there - containers are harder to check for security issues. (The common way to install Docker stuff is piping it into bash. Ugh.) Luckily, none of my server systems even have Docker... :-)

Side remark: I find it funny that Linux developers turn Linux into Windows (KDE, Snap/Flatpak/..., systemd, ...) while still claiming that Linux is so much better. Heh.

@rhaberkorn
Owner

Side remark: I find it funny that Linux developers turn Linux into Windows (KDE, Snap/Flatpak/..., systemd, ...) while still claiming that Linux is so much better. Heh.

I partially agree, and that's what turns me away from Linux. The Snap/AppImage/Flatpak thing is a weird convergence, though. On Linux it is a result of binary incompatibilities between distributions, and even between two releases of the same distribution. That in itself is not a new problem in UNIX, but back in the day, the source tarball was the cross-platform package format. Nowadays, only a minority of Linux users is able to build from source. Windows did things wrong from the very beginning; Linux "grows" with its users, so to speak.
I really liked how FreeBSD solved things without resorting to systemd-like madness. Time to set it up again...

@dertuxmalwieder
Author

I really liked how FreeBSD solved things without resorting to systemd-like madness.

I use FreeBSD, OmniOS and OpenBSD side-by-side. All of them are weird. I like that.

@rhaberkorn
Owner

Regarding Haiku: I've installed the new Haiku R1/beta3 release and cannot even clone my repository. It crashes the system. Last time, I did get it to build but it had strange crashes and Curses quirks that I could not debug and suspected to be OS bugs. It's simply not mature enough and probably not worth the effort to support. Although it would be nice to be one of the few available editors in their repos.

@rhaberkorn
Owner

I, personally, am a strong advocate of pkgsrc on macOS as it doesn't clutter my file system as much as Homebrew does.

I found that pkgsrc will mercilessly install libraries that shadow system libraries (ncurses for instance). In that sense it does "clutter" your system. Homebrew has these libraries in "kegs" and you have to explicitly enable linking against Homebrew-installed versions of libraries that are also present in the base system.

That would have the additional advantage of you only having to maintain one single package for all platforms

It would be nice to have a package in pkgsrc and Nix and whatnot. But this can never be the primary or only means of distribution (even for regular releases), as few people will install a new package manager just to try out your software.

@dertuxmalwieder
Author

I found that pkgsrc will mercilessly install libraries that shadow system libraries (ncurses for instance).

It's debatable whether there are any advantages to preferring system libraries over (newer) pkgsrc libraries. Either way, if you assume that ncurses already exists on the target system, don't add it as a runtime dependency, so it won't be installed. (I guess.)

few people will install a new package manager to try out your software.

That's true; but, at least on macOS, there is no standard package manager anyway, and pkgsrc is the de-facto standard package manager of illumos (ex-OpenSolaris) and NetBSD. :-)

@rhaberkorn
Owner

Either way, if you assume that ncurses already exists on the target system, don't add it as a runtime dependency, so it won't be installed. (I guess.)

It's enough that some other package pulls in the dependency. But it won't conflict with already-installed applications, since they have hardcoded dependencies on the system ncurses. Everything new will link against the pkgsrc ncurses by default, though. On the other hand, you can at least redistribute the pkgsrc ncurses, as it expects the same paths as the system ncurses and can reuse terminfo.
I'm still linking against the system ncurses for now since you reported that the new Homebrew-curses did not make any difference and would require further tweaks.

@rhaberkorn rhaberkorn self-assigned this Nov 22, 2022
@rhaberkorn
Owner

rhaberkorn commented Aug 22, 2024

I want to tackle Unicode once and for all this summer. I have the following ideas:

  • The parser itself will stay purely ASCII, but source files can contain codepaged and UTF-8 bytes. To the parser, they are just bytes. UTF-32 TECO source files simply won't be supported; you can always convert them to UTF-8 anyway. That's at least for the time being, as a UTF-8-aware parser could have advantages: for instance, you could use umlauts in positions where single characters are expected, as in ^^ä. The question is whether latin1 TECO source code makes any sense without the ability to actually edit TECO macros in latin1 or insert raw 128-255 bytes. So going all-UTF-8 for the parser might be the way to go. We will still support editing raw non-UTF-8 binary files at the very least.
  • In the UIs, we will always exclusively insert UTF8 into the command line macro. At least for the time being. Is it even worth considering code pages? I doubt it.
  • Position integers will be glyph-addresses, not byte addresses. So commands will have to be UTF-aware if the underlying buffer is UTF.
  • Every buffer will store the current position (.) and the end of the buffer (Z) in glyphs. The byte offsets are known from Scintilla. This gives us three known glyph-to-byte mappings (buffer start, dot and buffer end). When navigating in UTF-8 documents, the nearest of these anchor points decides whether we search for the byte offset from the left or from the right. Expressions like ., Z, C, R are guaranteed to be fast; everything else has worst-case linear complexity (see the sketch after this list).
  • Inserting UTF-8 into raw buffers will simply insert the bytes (possibly multiple bytes per glyph). There is no sensible conversion, of course; the ASCII subset of UTF-8 remains ASCII after insertion.
  • Inserting UTF-8 into codepaged buffers will, for the time being, just insert the raw bytes, i.e. fail for everything outside of ASCII. In the future, we might try to automatically iconv to the target code page, though this is not guaranteed to succeed.
  • Inserting UTF8 into UTF16/32 buffers and from UTF16/32 registers can autoconvert the code points.
  • Inserting raw byte registers into UTF buffers will just paste the bytes, so it might not result in valid Unicode.
  • Inserting codepaged registers into UTF buffers will do the same for the time being. See above: iconv could be used, but it's not a priority.
  • When creating registers, the new register will inherit the source document's encoding. So X[foo] from a raw/ASCII buffer will create an ASCII register foo as well. ^U[foo]...$ however will always be UTF-8, as source files are assumed to be UTF-8. This is analogous to how registers "inherit" the EOL style.
  • Opening buffers need some kind of heuristics to determine the initial encoding, defaulting to UTF8. This should be configurable via ED flags to facilitate editing raw files.
  • New files will always be UTF8 by default unless we are in "raw" mode.
  • There needs to be a new command for getting and setting the buffer's/register's encoding: raw/ASCII, codepage, UTFx. Since you don't want to EQ into a register just to set its encoding, there will probably have to be a register-taking variant as well.
  • The EOL normalization feature is left as it is. I want to keep glyph-to-byte offset conversions constant-time wherever possible, so it still makes sense to convert CRLF to LF in ASCII and UTF-32 documents.
  • I will try to keep support for non-widechar Curses libraries. If it turns out we absolutely need the widechar libraries, I will not try to keep support for both; non-widechar support will simply be dropped. Say goodbye to the DOS port!
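Below is a minimal sketch of the anchor-point heuristic from the list above, assuming glib and a valid-UTF-8 buffer. All names are hypothetical, not SciTECO's actual implementation:

    /* anchors[] holds the three known (glyph, byte) pairs: buffer start,
     * dot and buffer end. We walk glyph by glyph from the nearest one. */
    #include <glib.h>

    typedef struct {
        gsize bytes;   /* byte offset into the buffer */
        glong glyphs;  /* corresponding glyph offset */
    } anchor_t;        /* hypothetical name */

    static gsize
    glyphs2bytes(const gchar *buf, const anchor_t anchors[3], glong target)
    {
        /* start from the anchor with the smallest glyph distance */
        const anchor_t *best = &anchors[0];
        for (gint i = 1; i < 3; i++)
            if (ABS(anchors[i].glyphs - target) < ABS(best->glyphs - target))
                best = &anchors[i];

        const gchar *p = buf + best->bytes;
        glong g = best->glyphs;

        while (g < target) {    /* walk right, glyph by glyph */
            p = g_utf8_next_char(p);
            g++;
        }
        while (g > target) {    /* or walk left */
            p = g_utf8_prev_char(p);
            g--;
        }
        return p - buf;
    }

Movements relative to dot (C, R) stay cheap because dot itself is an anchor; a jump like nJ degrades to a linear scan from the nearest anchor in the worst case.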

@rhaberkorn
Owner

rhaberkorn commented Aug 28, 2024

Regarding the curses variants, I found out:

  • You need to link against the widechar version of ncurses (ncursesw) for Unicode handling. ncurses only differentiates between the two library names if the platform allows installing both; Ubuntu, for instance, has libncurses and libncursesw. On FreeBSD, libncurses is always the widechar version, but it ships both ncurses and ncursesw pkg-config files, pointing to the same library. SciTECO therefore prefers ncursesw and falls back to ncurses.
  • You don't necessarily have to use the widechar APIs to display Unicode correctly, even when linking against the widechar variant of ncurses. This is what I am now trying: avoiding widechar-specific APIs while still supporting Unicode. Scinterm appears to do the same - it does not appear to use any of the widechar API. This definitely works on ncurses, but it remains to be seen what PDCurses and netbsd-curses do (see the snippet below the list).
  • Keeping non-widechar support is important only for certain minimalist systems, mostly embedded systems, where ncurses is actually built without widechar support. SciTECO does target those systems, although it's certainly not the smallest and leanest editor around.
  • The thing with legacy systems (DOS, OS/2 etc.), where there really is no curses with widechar support, is that we don't support them anyway thanks to glib, which is a portability killer. (Many years ago, I did get OS/2 versions to build, but I think that would be practically impossible now.)
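To illustrate the second point, a minimal standalone sketch (not SciTECO code), assuming a widechar build of ncurses and a UTF-8 locale; even the narrow addstr() API then displays multi-byte text correctly:

    /* Build: cc demo.c -lncursesw */
    #include <locale.h>
    #include <curses.h>

    int main(void)
    {
        setlocale(LC_ALL, "");      /* let curses pick up the UTF-8 locale */
        initscr();
        addstr("iÄÖÜ: Unicode through the narrow API"); /* raw UTF-8 bytes */
        refresh();
        getch();                    /* wait for a key press */
        endwin();
        return 0;
    }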

I can now at least input and display Unicode on all platforms, but I am not going to push it to master yet.

@rhaberkorn
Owner

rhaberkorn commented Aug 28, 2024

Actually, I already have command line editing and navigation in Unicode documents working, based on Scintilla's SCI_POSITIONRELATIVE and SCI_COUNTCHARACTERS. But it is way too slow. Apparently it really counts characters on every call. I had hoped it would already be optimized by SCI_ALLOCATELINECHARACTERINDEX; this is not the case.

SCI_POSITIONRELATIVE is also special in that it cannot detect whether movements to the left stay within the document's bounds, as it reuses the 0 return value (which is itself a valid position).

For instance in order to get the current glyph position (dot) using the character index, I will apparently have to:

pos = SCI_GETCURRENTPOS()
line = SCI_LINEFROMPOSITION(pos)
glyphs = SCI_INDEXPOSITIONFROMLINE(line) +
         SCI_COUNTCHARACTERS(SCI_POSITIONFROMLINE(line), pos)

That is a ridiculous 5 Scintilla messages, instead of 1 in the single-byte case, or 2 when using SCI_COUNTCHARACTERS(0, SCI_GETCURRENTPOS()).

@rhaberkorn
Owner

The inverse operation - getting the byte offset from a character/glyph index - would be:

line = SCI_LINEFROMINDEXPOSITION(pos)
off = SCI_POSITIONRELATIVE(SCI_POSITIONFROMLINE(line), pos-SCI_INDEXPOSITIONFROMLINE(line))

I.e. 4 calls instead of the naive SCI_POSITIONRELATIVE(0, pos).

Perhaps I will just do that for each and every conversion, along with an optimization for left/right movements <= X, with X being experimentally determined. E.g. 12C would still simply call SCI_POSITIONRELATIVE(SCI_GETCURRENTPOS(), 12). Consequently, C would be significantly faster than .+1J (2 vs. 9 messages). In order to perform bounds checking, 12R will have to consult the index first, followed by one additional SCI_POSITIONRELATIVE() (6 messages).
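A hedged sketch of that fast path; ssm() stands in for a hypothetical "send Scintilla message" helper, not SciTECO's actual wrapper:

    /* Rightward movement like 12C: two Scintilla messages, no character
     * index needed. */
    #include <stdbool.h>
    #include "Scintilla.h"  /* SCI_* constants, sptr_t/uptr_t */

    extern sptr_t ssm(unsigned int msg, uptr_t wparam, sptr_t lparam);

    static bool
    move_glyphs_right(sptr_t n) /* n >= 0 */
    {
        sptr_t pos = ssm(SCI_GETCURRENTPOS, 0, 0);
        sptr_t new_pos = ssm(SCI_POSITIONRELATIVE, pos, n);

        /* 0 signals "out of bounds" -- unambiguous here, since moving
         * right by n > 0 glyphs can never legitimately land at 0. */
        if (new_pos == 0 && n > 0)
            return false;
        ssm(SCI_GOTOPOS, new_pos, 0);
        return true;
    }

Leftward movements lose this shortcut, since 0 is also a valid position there - which is why 12R has to consult the character index first.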

That is not counting the SCI_GETCODEPAGE() message, necessary to determine whether we can actually consult the index.
Once we call that, it makes sense to optimize the single-byte encodings again.

You see, it's all very tricky to get done efficiently.

@rhaberkorn
Owner

rhaberkorn commented Aug 29, 2024

I have basically implemented the new glyph-based model and I am already dogfooding the Unicode version myself.

I noticed another problem: Scintilla messages (ES) almost always work with byte offsets, no matter the buffer's encoding. In order to interface with them from macros - not all of them are mapped to proper SciTECO commands - there needs to be a wrapper around the glyph-to-byte conversion routines (these algorithms could theoretically be rewritten as SciTECO macros, but that would result in tons of redundancy). Sometimes you can also work around the problem by using ESGETCURRENTPOS$$ instead of ..
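For illustration, a sketch of such a wrapper. It reuses the teco_interface_glyphs2bytes()/teco_interface_bytes2glyphs() names from the commits referenced below, but the prototypes and the -1 error convention shown here are assumptions:

    #include <glib.h>
    #include "Scintilla.h"  /* SCI_GOTOPOS etc. */

    /* assumed prototypes; -1 signalling "out of bounds" is a guess */
    extern gssize teco_interface_glyphs2bytes(glong glyphs);
    extern glong teco_interface_bytes2glyphs(gsize bytes);
    extern sptr_t teco_interface_ssm(unsigned int msg, uptr_t wparam, sptr_t lparam);

    /* Jump to glyph offset n, converting to the byte offset Scintilla expects: */
    static gboolean
    jump_to_glyph(glong n)
    {
        gssize bytes = teco_interface_glyphs2bytes(n);
        if (bytes < 0)
            return FALSE;   /* out of bounds */
        teco_interface_ssm(SCI_GOTOPOS, (uptr_t)bytes, 0);
        return TRUE;
    }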

rhaberkorn added a commit that referenced this issue Sep 9, 2024
* Eg. when typing with a Russian layout, CTRL+I will always insert ^I.
* Works with all of the start-state command Ex, Fx, ^x commands and
  string building constructs.
  This is exactly where process_edit_cmd_cb() case folds case-insensitive
  characters.
  The corresponding state therefore sets an is_case_insensitive flag now.
* Does not yet work with anything embedded into Q-Register specifications.
  This could only be realized with a new state callback (is_case_insensitive()?)
  that chains to the Q-Register and string building states recursively.
* Also it doesn't work with Ё on my Russian phonetic layout,
  probably because the ANSI caret on that same key is considered dead
  and not returned by gdk_keyval_to_unicode().
  Perhaps we should directly check the keyval values?
* Whenever a non-ANSI key is pressed in an allowed state,
  we try to check all other keyvals that could be produced by the same
  hardware keycode, ie. we check all groups (keyboard layouts).
rhaberkorn added a commit that referenced this issue Sep 9, 2024
rhaberkorn added a commit that referenced this issue Sep 9, 2024
…#5)

* ^Uq however always sets a UTF-8 register, as the source
  is supposed to be a SciTECO macro, which is always UTF-8.
* :^Uq preserves the register's encoding
* teco_doc_set_string() now also sets the encoding
* instead of trying to restore the encoding in teco_doc_undo_set_string(),
  we now swap out the document in a teco_doc_t and pass it to an undo token.
* The get_codepage() Q-Reg method has been removed as the same
  can now be done with teco_doc_get_string() and the get_string() method.
rhaberkorn added a commit that referenced this issue Sep 9, 2024
* When enabled with bit 2 in the ED flags (0,4ED),
  all registers and buffers will get the raw ANSI encoding (as if 0EE had been
  called on them).
  You can still manually change the encoding, eg. by calling 65001EE afterwards.
* Also the ANSI mode sets up character representations for all bytes >= 0x80.
  This is currently done only depending on the ED flag, not when setting 0EE.
* Since setting 16,4ED for 8-bit clean editing in a macro can be tricky -
  the default unnamed buffer will still be at UTF-8 and at least a bunch
  of environment registers as well - we added the command line option
  `--8bit` (short `-8`) which configures the ED flags very early on.
  As another advantage you can mung the profile in 8-bit mode as well
  when using SciTECO as a sort of interactive hex editor.
* Disable UTF-8 checks in 8-bit clean mode (sample.teco_ini).
rhaberkorn added a commit that referenced this issue Sep 9, 2024
…_glyphs2bytes() and teco_interface_bytes2glyphs() (refs #5)

* for consistency with all the other teco_view wrappers in interface.h
rhaberkorn added a commit that referenced this issue Sep 9, 2024
* significantly speeds up build time
* Scintilla and Lexilla headers and symbols are all-ASCII anyway.
* We should probably have a look at the quicksort implementation
  in string.tes, as it can probably be optimized in UTF-8 documents as well.
@rhaberkorn
Owner

rhaberkorn commented Sep 9, 2024

I have pushed the first usable Unicode version. There are also a bunch of other new minor features, and 8-bit cleanliness has been improved significantly. Non-ANSI single-byte encodings are still only partially supported: you can use EE to configure a code page, and it should be displayed correctly in Gtk, but you cannot actually insert non-latin text in these code pages. I have not decided which way to go - iconv'ing from UTF-8 to the target codepage whenever interacting with registers/buffers of different encodings, or iconv'ing during read-in/read-out (internally handling all non-ANSI codepages as Unicode). Both solutions somewhat suck.

The current state however is more than enough to edit some German text.

The next thing I will do is make the SciTECO parser Unicode-aware, so you can use Unicode glyphs wherever the parser expects a single character. I will then probably rework the "function key macros" into more generic "key macros", so you can make more efficient use of your international non-latin keyboard keys.

@rhaberkorn
Owner

There have been a number of fixes in sample.teco_ini, which you might want to merge into your ~/.teco_ini.

rhaberkorn added a commit that referenced this issue Sep 9, 2024
hopefully fixes the Unicode test cases on Mac OS
rhaberkorn added a commit that referenced this issue Sep 9, 2024
* Should prevent data loss due to system locale conversions
  when parsing command line arguments.
* Should also fix passing Unicode arguments to munged macros and
  therefore opening files via ~/.teco_ini.
* The entire option parsing is based on GStrv (null-terminated string lists)
  now, also on UNIX.
@dertuxmalwieder
Author

The current state however is more than enough to edit some German text.

Which is more than enough for me. :-)

Thank you a lot for your effort! I'll leave this open until you decide that you're happy with the feature...

rhaberkorn added a commit that referenced this issue Sep 9, 2024
* The default ANSI versions of the Win32 API calls will not work
  if the filename contains non-ANSI UTF-8 characters.
* There is g_win32_locale_filename_from_utf8(), but it's not guaranteed
  to derive a unique filename.
* Therefore we define UNICODE and convert between UTF-8 and UTF-16.
rhaberkorn added a commit that referenced this issue Sep 10, 2024
* Should prevent data loss due to system locale conversions
  when parsing command line arguments.
* Should also fix passing Unicode arguments to munged macros and
  therefore opening files via ~/.teco_ini.
* The entire option parsing is based on GStrv (null-terminated string lists)
  now, also on UNIX.
rhaberkorn added a commit that referenced this issue Sep 10, 2024
* The default ANSI versions of the Win32 API calls worked only as
  long as we used the ANSI subset of UTF-8 in filenames.
* There is g_win32_locale_filename_from_utf8(), but it's not guaranteed
  to derive a unique filename.
* Therefore we define UNICODE and convert between UTF-8 and UTF-16
  (Windows' native Unicode encoding).
rhaberkorn added a commit that referenced this issue Sep 10, 2024
* The libtool wrapper binaries do not pass down UTF-8 strings correctly,
  so the Unicode tests failed under some circumstances.
* As we aren't actually linking against any locally-built shared libraries,
  we are passing --disable-shared to libtool, which inhibits wrapper generation
  on win32 and fixes the test suite.
* Also use up to date autotools. This didn't fix anything, though.
* test suite: try writing a Unicode filename as well
  * There have been problems doing that on Win32 where UTF-8 was not
    correctly passed down from the command line and some Windows API
    calls were only working with ANSI filenames etc.
@rhaberkorn
Owner

rhaberkorn commented Sep 10, 2024

@dertuxmalwieder Have another go at the Mac OS nightly builds. Since the build now prefers the widechar variant of ncurses, I suspect it could behave differently. Perhaps the visual glitches are now fixed:
https://github.com/rhaberkorn/sciteco/releases/download/nightly/sciteco-curses_nightly_macos_x86_64.pkg

But better comment in #12 if you find anything new.

@dertuxmalwieder
Author

I'll try the macOS nightlies tonight - sorry, I was somewhat busy yesterday.
I noticed a new problem on Windows with the Gtk3 UI though:

[screenshot]

At least I guess that's Gtk3-specific.

rhaberkorn added a commit that referenced this issue Sep 11, 2024
The following rules apply:
 * All SciTECO macros __must__ be in valid UTF-8, regardless of the
   register's configured encoding.
   This is checked against before execution, so we can use glib's non-validating
   UTF-8 API afterwards.
 * Things will inevitably get slower as we have to validate all macros first
   and convert to gunichar for each and every character passed into the parser.
   As an optimization, it may make sense to have our own inlineable version of
   g_utf8_get_char() (TODO).
   Also, Unicode glyphs in syntactically significant positions may be case-folded -
   just like ASCII chars were. This is of course slower than case folding
   ASCII. The impact of this should be measured and perhaps we should restrict
   case folding to a-z via teco_ascii_toupper().
 * The language itself does not use any non-ANSI characters, so you don't have to
   use UTF-8 characters.
 * Wherever the parser expects a single character, it will now accept an arbitrary
   Unicode/UTF-8 glyph as well.
   In other words, you can call macros like M§ instead of having to write M[§].
   You can also get the codepoint of any Unicode character with ^^x.
   Pressing a Unicode character in the start state or in Ex and Fx will now
   give a sane error message.
 * When pressing a key which produces a multi-byte UTF-8 sequence, the character
   gets translated back and forth multiple times:
   1. It's converted to a UTF-8 string, either buffered or by IME methods (Gtk).
      On Curses we could directly get a wide char using wget_wch(), but it's
      not currently used, so we don't depend on widechar curses.
   2. Parsed into gunichar for passing into the edit command callbacks.
      This also validates the codepoint - everything later on can assume valid
      codepoints and valid UTF-8 strings.
   3. Once the edit command handling decides to insert the key into the command line,
      it is serialized back into a UTF-8 string, as the command line macro has
      to be in UTF-8 (like all other macros).
   4. The parser reads back gunichars without validation for passing into
      the parser callbacks.
 * Flickering in the Curses UI and Pango warnings in Gtk, due to incompletely
   inserted and displayed UTF-8 sequences, are now fixed.
@rhaberkorn
Owner

rhaberkorn commented Sep 11, 2024

I'll try the macOS nightlies tonight - sorry, I was somewhat busy yesterday. I noticed a new problem on Windows with the Gtk3 UI though:

That was a known limitation of the old single-byte parser. Since it was fed a UTF-8 sequence byte by byte, it could sometimes ask Gtk to draw these invalid partial sequences. The Curses UI showed analogous flickering.

With 6857807, this is also fixed. The entire language is Unicoded now. Feel free to M⍝ to your heart's content.

Here are the updated binaries: https://github.com/rhaberkorn/sciteco/releases/download/nightly/sciteco-gtk3_nightly_win32.zip

@dertuxmalwieder
Author

Perhaps the visual glitches are now fixed:
https://github.com/rhaberkorn/sciteco/releases/download/nightly/sciteco-curses_nightly_macos_x86_64.pkg

Sadly, no.

rhaberkorn added a commit that referenced this issue Sep 12, 2024
* This is used for error messages (TECO macro stackframes),
  so it's important to display columns in characters.
* Program counters are in bytes and are therefore gsize everywhere.
  This is by glib convention.
rhaberkorn added a commit that referenced this issue Sep 12, 2024
…feature

* ALL keypresses (the UTF-8 sequences resulting from key presses) can now be remapped.
* This is especially useful with Unicode support, as you might want to alias
  international characters to their corresponding latin form in the start state,
  so you don't have to change keyboard layouts so often.
  This is done automatically in Gtk, where we have hardware key press information,
  but has to be done with key macros in Curses.
  There is a new key mask 4 (bit 3) for that purpose now.
* Also, you might want to define non-ANSI letters to perform special functions in
  the start state where it won't be accepted by the parser anyway.
  Suppose you have a macro M→, you could define
  @^U[^K→]{m→} 1^_U[^K→]
  This effectively "extends" the parser and allows you to call macro "→" by a single
  key press. See also #5.
* The register prefix has been changed from ^F (for function) to ^K (for key).
  This is the only thing you have to change in order to migrate existing
  function key macros.
* Key macros are enabled by default. There is no longer any way to disable
  function key handling in curses, as I never found any reason or need to disable it.
  Theoretically, the default ESCDELAY could turn out to be too small and function
  keys don't get through. I doubt that's possible unless on extremely slow serial lines.
  Even then, you'd have to increase ESCDELAY and instead of disabling function keys
  simply define an escape surrogate.
* The ED flag has been removed and its place is reserved for a future mouse support flag
  (which does make sense to disable in curses sometimes).
  fnkeys.tes is consequently also enabled by default in sample.teco_ini.
* Key macros are handled as a unit. If one character results in an error,
  the entire string is rubbed out.
  This fixes the "CLOSE" key on Gtk.
  It also makes sure that the original error message is preserved and not overwritten
  by some subsequent syntax error.
  It was never useful that we kept inserting characters after the first error.
rhaberkorn added a commit that referenced this issue Sep 16, 2024
* Practically requires one of the "Nerd Font" fonts,
  so it's disabled by default.
  Add 0,512ED to the profile to enable them.
* The new ED flag could be used to control Gtk icons as well,
  but they are left always-enabled for the time being.
  Is there any reason anybody would like to disable icons in Gtk?
* The list of icons has been adapted and extended from exa:
  https://github.com/ogham/exa/blob/master/src/output/icons.rs
* The icons are hardcoded as presorted lists,
  so we can binary search them.
  This could change in the future. If there is any demand,
  they could be made configurable via Q-Registers as well.
@rhaberkorn
Owner

Now that we have Unicode support, I couldn't resist implementing icons in the Curses version. This requires one of the Nerd Fonts, and you need to add 0,512ED to ~/.teco_ini.
[screenshot]

@rhaberkorn
Owner

Support for other single-byte code pages is still left open, but this is only indirectly linked to Unicode.
