No Unicode support? :-( #5
Yes, that's true. SciTECO currently does not handle Unicode. Inserting Unicode characters would be relatively easy to implement, but adding full Unicode support is unfortunately not trivial. The problem is that SciTECO as a language views documents as a sequence of characters/glyphs. This feature, however, has long been on my list. Currently I'm working on a conversion of the entire codebase from C++ to plain C and won't start work on this until I've finished that.
I never assumed that this would be an easy task... :-) Unicode in plain C hurts a lot though, at least if you want to target the Microsoft compiler. Good luck, I guess. (But I'm seriously looking forward to it. In fact, it is the only real "blocker" for me. :-))
I noticed that you're back from hibernation (well, conversion). Good! :-)
I want to do a release first before tackling Unicode. It's overdue. Somehow I got distracted again. Ah yes, I couldn't sort out all the problems with the Windows port because I did not have any Windows machine around, and things evidently change and break every few years...
I have upgraded to Windows 11 now and I still have Visual Studio here. I guess I could - at least - test whatever is there...
I now have a Windows 2008 Server installation in a VM, which is the absolute minimum nowadays. (Not my choice. If it were up to me, I'd support even MSDOS.) I've consequently resolved all of the remaining Windows issues, and working Windows 32-bit Curses and Gtk+ binaries are built automatically every day as part of the nightly builds. Feel free to test these. I'm personally not using Windows, so the Windows builds will usually be tested poorly compared to the Linux/Ubuntu builds. I "dogfood" everything to myself, so bugs usually get resolved very quickly. It's just that I run Windows only to do SciTECO stuff. I'm currently resolving some remaining issues with the Gtk+ interface, which should be fully supported in the v2.0 release. Afterwards, I wanted to (re)test Haiku OS, since I hope it runs much more stably now. And I haven't tested FreeBSD in ages. I will probably refrain from getting packages upstreamed for Haiku and FreeBSD for the time being, but I'd at least like it to compile and not crash before I release v2.0. Some kind of generic Linux AppImage packaging was also on the agenda, but I'll postpone that as well - otherwise, I will never get any release out. You don't happen to have a Mac with an actual Mac OS? The Mac situation is even worse, unfortunately.
I actually do (a 2019 one, running macOS 11). Considering that my own development projects don't get much love these days (my day only has 24 hours), is there anything I can do for you on it?
Of course, you can try to build SciTECO on Mac OS and see whether it runs and works as expected. But don't bother to do so until the CI runs successfully on the master branch for Mac OS - it doesn't right now, but the issue has already been fixed and will be merged soon. Of course, I'd like to get Mac OS packages as well, but I'm not experienced with Mac OS packaging at all. How do you think a program like SciTECO should best be packaged? Perhaps simply adding a Homebrew port? But this won't work for nightly builds. How would you usually distribute a Mac OS CLI program that can be downloaded and installed manually?
I, personally, am a strong advocate of pkgsrc on macOS, as it doesn't clutter my file system as much as Homebrew does. That would have the additional advantage of you only having to maintain a single package for all platforms (except Windows, which is not supported by it). Most other tech-savvy macOS users I know either use Homebrew or MacPorts, although MacPorts seems to be disappearing these days.
All of these download packages from official repos, as far as I understand. That's nice and I'd definitely like to be there, but the version in the repos will always be outdated. Do you never install CLI apps using some kind of package format (cf. how you can download and install …)?
No, never, at least not on macOS. I don't use Linux at all.
Mac OS builds now. Test suite runs. Whether it displays anything, I have no idea.
So you actually used a Mac OS build of SciTECO at some point in the past? It's a pity I cannot just link in everything statically and provide a tar.gz of the UNIX tree for experienced users.
Why not? :D
Well, true. Why not! If Homebrew provides static glib libraries, it should be possible. I should even be able to test this using "Darling". It might actually also be an alternative to AppImage as a generic distribution method for Linux - I don't think that AppImage would work very well.
It shouldn't be a big deal on other systems either. Shared libraries exist for a reason. But indeed, Windows users are (usually) less annoyed by that.
Well, it's not that easy. You don't want to install a bunch of third-party libraries into your standard library directory, as they could conflict with your system libraries. It is also not guaranteed at all that these libraries will work, because of their transitive dependencies. Installing stuff into your root is generally discouraged, as it is not managed by the native package manager. Apps that need to ship their own copies of third-party libs on Linux and are not packaged natively usually use some kind of ld.so hack so they don't have to install them. That's actually why technologies like AppImage and Snap exist - they package an application along with all of its transitive dependencies in a way that does not affect the rest of your system. People nowadays also (ab)use Docker for that purpose, but I refuse to distribute a system utility as a Docker image!
I also don't use "containers" because I prefer to know what's actually there - containers are harder to check for security issues. (The common way to install Docker stuff is piping it into a shell.) Side remark: I find it funny that Linux developers turn Linux into Windows (KDE, Snap/Flatpak/..., systemd, ...) while still claiming that Linux is so much better. Heh.
I partially agree and that's what turns me away from Linux. Although the Snap/AppImage/Flatpak thing is a weird convergence. On Linux it is a result of binary incompatibilities between distributions and even between two releases of the same distribution. That in itself is not a new problem in UNIX, but back in the day, the source tarball was the cross-platform package format. Nowadays, only a minority of Linux users is able to build from source. Windows did things wrong from the very beginning. Linux "grows" with its users, so to speak.
I use FreeBSD, OmniOS and OpenBSD side-by-side. All of them are weird. I like that.
Regarding Haiku: I've installed the new Haiku R1/beta3 release and cannot even clone my repository. It crashes the system. Last time, I did get it to build but it had strange crashes and Curses quirks that I could not debug and suspected to be OS bugs. It's simply not mature enough and probably not worth the effort to support. Although it would be nice to be one of the few available editors in their repos.
I found that pkgsrc will mercilessly install libraries that shadow system libraries (ncurses for instance). In that sense it does "clutter" your system. Homebrew has these libraries in "kegs" and you have to explicitly enable linking against Homebrew-installed versions of libraries that are also present in the base system.
It would be nice to have a package in pkgsrc and Nix and whatnot. But this can never be the primary or only means of distribution (even for regular releases), as few people will install a new package manager to try out your software.
It's debatable whether there are any advantages of preferring system libraries over (newer) third-party ones.
That's true; but, at least on macOS, there is no standard package manager anyway, and …
It's enough that some other package pulls in the dependency. But it won't conflict with existing installed applications, since they have hardcoded dependencies on the system ncurses. Everything will link against the pkgsrc ncurses by default, though. On the other hand, you can at least distribute the pkgsrc ncurses, as it expects the same paths as the system ncurses and can reuse terminfo.
I want to tackle Unicode once and for all this summer. I have the following ideas:
…
Regarding the curses variants, I found out:
…
I can now at least input and display Unicode on all platforms, but I am not going to push it to master yet.
Actually, I have command line editing and navigation in Unicode documents already working, based on Scintilla's SCI_POSITIONRELATIVE and SCI_COUNTCHARACTERS. But it is way too slow. Apparently it really counts characters with every call. I hoped it would already be optimized by SCI_ALLOCATELINECHARACTERINDEX; this is not the case. SCI_POSITIONRELATIVE is also special in not being able to detect whether movements to the left are within the document's bounds, as it reuses the 0 return value. For instance, in order to get the current glyph position (dot) using the character index, I will apparently have to:
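Roughly the following, sketched here in C (ssm() stands in for a Scintilla message-sending helper, the function name is made up, and a UTF-32 line character index is assumed to have been allocated with SCI_ALLOCATELINECHARACTERINDEX):

```c
#include <Scintilla.h>

/* assumed helper: send a Scintilla message to the current view */
sptr_t ssm(unsigned int msg, uptr_t wparam, sptr_t lparam);

/* Get the current glyph position (dot) via the line character index. */
sptr_t
teco_get_dot_glyphs(void)
{
	/* 1. dot as a byte position */
	sptr_t pos = ssm(SCI_GETCURRENTPOS, 0, 0);
	/* 2. the line containing dot */
	sptr_t line = ssm(SCI_LINEFROMPOSITION, pos, 0);
	/* 3. byte offset of the line's start */
	sptr_t line_bytes = ssm(SCI_POSITIONFROMLINE, line, 0);
	/* 4. glyph index of the line's start (cached by the character index) */
	sptr_t line_glyphs = ssm(SCI_INDEXPOSITIONFROMLINE, line,
	                         SC_LINECHARACTERINDEX_UTF32);
	/* 5. count only the remaining characters within the line */
	return line_glyphs + ssm(SCI_COUNTCHARACTERS, line_bytes, pos);
}
```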
Which is a ridiculous 5 Scintilla messages instead of 1 for the single-byte case, or 2 when using …
The inverse operation - getting the byte offset from a character/glyph index - would be:
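Again as a C sketch under the same assumptions (the helper and function names are made up; the actual implementation may differ):

```c
#include <Scintilla.h>

sptr_t ssm(unsigned int msg, uptr_t wparam, sptr_t lparam); /* as above */

/* Convert a glyph index back into a byte offset. */
sptr_t
teco_glyphs2bytes(sptr_t glyphs)
{
	/* 1. the line containing the glyph index */
	sptr_t line = ssm(SCI_LINEFROMINDEXPOSITION, glyphs,
	                  SC_LINECHARACTERINDEX_UTF32);
	/* 2. glyph index of the line's start */
	sptr_t line_glyphs = ssm(SCI_INDEXPOSITIONFROMLINE, line,
	                         SC_LINECHARACTERINDEX_UTF32);
	/* 3. byte offset of the line's start */
	sptr_t line_bytes = ssm(SCI_POSITIONFROMLINE, line, 0);
	/* 4. walk the remaining glyphs within the line */
	return ssm(SCI_POSITIONRELATIVE, line_bytes, glyphs - line_glyphs);
}
```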
I.e. 4 calls instead of the naive …. Perhaps I will just do that for each and every conversion, along with an optimization for …. That is not counting the …. You see, it's all very tricky to get done efficiently.
I basically implemented the new glyph-based model and I am already dogfooding the Unicode version to myself. I noticed another problem: Scintilla messages (…)
* E.g. when typing with a Russian layout, CTRL+I will always insert ^I.
* Works with all of the start-state command Ex, Fx, ^x commands and string building constructs. This is exactly where process_edit_cmd_cb() case folds case-insensitive characters. The corresponding state therefore sets an is_case_insensitive flag now.
* Does not yet work with anything embedded into Q-Register specifications. This could only be realized with a new state callback (is_case_insensitive()?) that chains to the Q-Register and string building states recursively.
* Also it doesn't work with Ё on my Russian phonetic layout, probably because the ANSI caret on that same key is considered dead and not returned by gdk_keyval_to_unicode(). Perhaps we should directly check the keyval values?
* Whenever a non-ANSI key is pressed in an allowed state, we try to check all other keyvals that could be produced by the same hardware keycode, i.e. we check all groups (keyboard layouts); see the sketch below.
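The group check from the last point could look roughly like this with GDK (a sketch only; the helper name is hypothetical):

```c
#include <gdk/gdk.h>

/* Hypothetical helper: given a hardware keycode, look for an ASCII
 * keyval produced by the same physical key in any group (layout). */
static gunichar
teco_keycode_to_ascii(GdkKeymap *keymap, guint hardware_keycode)
{
	GdkKeymapKey *keys;
	guint *keyvals;
	gint n_entries;
	gunichar ret = 0;

	if (!gdk_keymap_get_entries_for_keycode(keymap, hardware_keycode,
	                                        &keys, &keyvals, &n_entries))
		return 0;

	for (gint i = 0; i < n_entries; i++) {
		/* check only the unshifted level of every group */
		if (keys[i].level != 0)
			continue;
		gunichar chr = gdk_keyval_to_unicode(keyvals[i]);
		if (chr > 0 && chr < 0x80) {
			ret = chr;
			break;
		}
	}

	g_free(keys);
	g_free(keyvals);
	return ret;
}
```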
…#5)
* ^Uq however always sets a UTF-8 register, as the source is supposed to be a SciTECO macro, which is always UTF-8.
* :^Uq preserves the register's encoding.
* teco_doc_set_string() now also sets the encoding.
* Instead of trying to restore the encoding in teco_doc_undo_set_string(), we now swap out the document in a teco_doc_t and pass it to an undo token.
* The get_codepage() Q-Reg method has been removed, as the same can now be done with teco_doc_get_string() and the get_string() method.
* When enabled with bit 2 in the ED flags (0,4ED), all registers and buffers will get the raw ANSI encoding (as if 0EE had been called on them). You can still manually change the encoding, e.g. by calling 65001EE afterwards.
* Also, the ANSI mode sets up character representations for all bytes >= 0x80. This is currently done only depending on the ED flag, not when setting 0EE.
* Since setting 16,4ED for 8-bit clean editing in a macro can be tricky - the default unnamed buffer will still be at UTF-8 and at least a bunch of environment registers as well - we added the command line option `--8bit` (short `-8`), which configures the ED flags very early on. As another advantage, you can mung the profile in 8-bit mode as well when using SciTECO as a sort of interactive hex editor.
* Disable UTF-8 checks in 8-bit clean mode (sample.teco_ini).
…_glyphs2bytes() and teco_interface_bytes2glyphs() (refs #5)
* for consistency with all the other teco_view wrappers in interface.h
* significantly speeds up build time
* Scintilla and Lexilla headers and symbols are all-ASCII anyway
* We should probably have a look at the quicksort implementation in string.tes, as it can probably be optimized in UTF-8 documents as well.
I have pushed the first usable Unicode version. There are also a bunch of other new minor features, and 8-bit cleanliness has been improved significantly. Non-ANSI single byte encodings are still only partially supported. You can change encodings manually, though. The current state however is more than enough to edit some German text. The next thing I will do is to make the SciTECO parser Unicode-aware, so you can use Unicode glyphs wherever the parser expects a single character. I will then probably rework the "function key macros" into more generic "key macros", so you can make more efficient use of your international non-Latin keyboard keys.
There have been a number of fixes in sample.teco_ini, which you might want to merge into your ~/.teco_ini.
hopefully fixes the Unicode test cases on Mac OS
* Should prevent data loss due to system locale conversions when parsing command line arguments.
* Should also fix passing Unicode arguments to munged macros and therefore opening files via ~/.teco_ini.
* The entire option parsing is based on GStrv (null-terminated string lists) now, also on UNIX.
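For illustration, a GStrv-based entry point with GLib could look like this (a sketch, not the actual SciTECO code):

```c
#include <glib.h>

int
main(int argc, char **argv)
{
	GError *error = NULL;
	GOptionContext *context = g_option_context_new(NULL);

#ifdef G_OS_WIN32
	/* bypasses the locale-encoded argv: the command line is
	 * retrieved from the system and converted to UTF-8 */
	gchar **args = g_win32_get_command_line();
#else
	gchar **args = g_strdupv(argv);
#endif

	/* parses and modifies the GStrv in place */
	if (!g_option_context_parse_strv(context, &args, &error)) {
		g_printerr("%s\n", error->message);
		return 1;
	}

	g_strfreev(args);
	g_option_context_free(context);
	return 0;
}
```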
Which is more than enough for me. :-) Thank you a lot for your effort! I'll leave this open until you decide that you're happy with the feature...
* The default ANSI versions of the Win32 API calls worked only as long as we used the ANSI subset of UTF-8 in filenames.
* There is g_win32_locale_filename_from_utf8(), but it's not guaranteed to derive a unique filename.
* Therefore we define UNICODE and convert between UTF-8 and UTF-16 (Windows' native Unicode encoding).
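In other words, something along these lines (a hypothetical example; the real code presumably wraps far more than CreateFile):

```c
#include <windows.h>
#include <glib.h>

/* Open a file whose name is UTF-8 encoded,
 * going through the wide-char ("W") Win32 API. */
static HANDLE
open_file_utf8(const gchar *filename_utf8)
{
	/* UTF-8 -> UTF-16, Windows' native Unicode encoding */
	gunichar2 *filename_utf16 =
		g_utf8_to_utf16(filename_utf8, -1, NULL, NULL, NULL);
	if (!filename_utf16)
		return INVALID_HANDLE_VALUE;

	/* gunichar2 is a 16-bit type, just like wchar_t on Windows */
	HANDLE handle = CreateFileW((LPCWSTR)filename_utf16,
	                            GENERIC_READ, FILE_SHARE_READ, NULL,
	                            OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL,
	                            NULL);
	g_free(filename_utf16);
	return handle;
}
```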
* The libtool wrapper binaries do not pass down UTF-8 strings correctly, so the Unicode tests failed under some circumstances.
* As we aren't actually linking against any locally-built shared libraries, we are passing --disable-shared to libtool, which inhibits wrapper generation on win32 and fixes the test suite.
* Also use up-to-date autotools. This didn't fix anything, though.
* Test suite: try writing a Unicode filename as well. There have been problems doing that on Win32, where UTF-8 was not correctly passed down from the command line and some Windows API calls were only working with ANSI filenames etc.
@dertuxmalwieder Have another try at the Mac OS nightly builds. Since it's now preferring widechar builds of ncurses, I suspect that it could behave differently. Perhaps the visual glitches are now fixed. But better comment in #12 if you find anything new.
The following rules apply:
* All SciTECO macros __must__ be in valid UTF-8, regardless of the register's configured encoding. This is checked before execution, so we can use glib's non-validating UTF-8 API afterwards.
* Things will inevitably get slower, as we have to validate all macros first and convert to gunichar for each and every character passed into the parser. As an optimization, it may make sense to have our own inlineable version of g_utf8_get_char() (TODO; see the sketch below). Also, Unicode glyphs in syntactically significant positions may be case-folded - just like ASCII chars were. This is of course slower than case folding ASCII. The impact of this should be measured and perhaps we should restrict case folding to a-z via teco_ascii_toupper().
* The language itself does not use any non-ANSI characters, so you don't have to use UTF-8 characters.
* Wherever the parser expects a single character, it will now accept an arbitrary Unicode/UTF-8 glyph as well. In other words, you can call macros like M§ instead of having to write M[§]. You can also get the codepoint of any Unicode character with ^^x. Pressing a Unicode character in the start state or in Ex and Fx will now give a sane error message.
* When pressing a key which produces a multi-byte UTF-8 sequence, the character gets translated back and forth multiple times:
  1. It's converted to a UTF-8 string, either buffered or by IME methods (Gtk). On Curses we could directly get a wide char using wget_wch(), but it's not currently used, so we don't depend on widechar curses.
  2. Parsed into gunichar for passing into the edit command callbacks. This also validates the codepoint - everything later on can assume valid codepoints and valid UTF-8 strings.
  3. Once the edit command handling decides to insert the key into the command line, it is serialized back into a UTF-8 string, as the command line macro has to be in UTF-8 (like all other macros).
  4. The parser reads back gunichars without validation for passing into the parser callbacks.
* Flickering in the Curses UI and Pango warnings in Gtk, due to incompletely inserted and displayed UTF-8 sequences, are now fixed.
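Such an inlineable decoder might look like this (a sketch; the function name is made up, and no validation is performed since macros are validated up front):

```c
#include <glib.h>

/* Hypothetical inline replacement for g_utf8_get_char():
 * str must point to the start of a valid UTF-8 sequence. */
static inline gunichar
teco_utf8_get_char(const gchar *str)
{
	guchar c = *str;
	if (c < 0x80)	/* single byte (ASCII) */
		return c;
	if (c < 0xE0)	/* 2-byte sequence */
		return ((c & 0x1F) << 6) | (str[1] & 0x3F);
	if (c < 0xF0)	/* 3-byte sequence */
		return ((c & 0x0F) << 12) |
		       ((str[1] & 0x3F) << 6) | (str[2] & 0x3F);
	/* 4-byte sequence */
	return ((gunichar)(c & 0x07) << 18) | ((str[1] & 0x3F) << 12) |
	       ((str[2] & 0x3F) << 6) | (str[3] & 0x3F);
}
```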
That was a known limitation of the single-byte-based parser. Since it got fed single characters of a UTF-8 sequence, it could sometimes ask Gtk to draw these invalid sequences. The Curses UI had an analogous flickering instead. With 6857807, this is also fixed. The entire language is Unicoded now. Feel free to … Here are the updated binaries: https://github.com/rhaberkorn/sciteco/releases/download/nightly/sciteco-gtk3_nightly_win32.zip
Sadly, no.
* This is used for error messages (TECO macro stackframes), so it's important to display columns in characters.
* Program counters are in bytes and are therefore everywhere gsize. This is by glib convention.
…feature
* ALL keypresses (the UTF-8 sequences resulting from key presses) can now be remapped.
* This is especially useful with Unicode support, as you might want to alias international characters to their corresponding latin form in the start state, so you don't have to change keyboard layouts so often. This is done automatically in Gtk, where we have hardware key press information, but has to be done with key macros in Curses. There is a new key mask 4 (bit 3) for that purpose now.
* Also, you might want to define non-ANSI letters to perform special functions in the start state, where they won't be accepted by the parser anyway. Suppose you have a macro M→, you could define @^U[^K→]{m→} 1^_U[^K→]. This effectively "extends" the parser and allows you to call macro "→" with a single key press. See also #5.
* The register prefix has been changed from ^F (for function) to ^K (for key). This is the only thing you have to change in order to migrate existing function key macros.
* Key macros are enabled by default. There is no longer any way to disable function key handling in curses, as I never found any reason or need to disable it. Theoretically, the default ESCDELAY could turn out to be too small and function keys don't get through. I doubt that's possible unless on extremely slow serial lines. Even then, you'd have to increase ESCDELAY and, instead of disabling function keys, simply define an escape surrogate.
* The ED flag has been removed and its place is reserved for a future mouse support flag (which does make sense to disable in curses sometimes). fnkeys.tes is consequently also enabled by default in sample.teco_ini.
* Key macros are handled as a unit. If one character results in an error, the entire string is rubbed out. This fixes the "CLOSE" key on Gtk. It also makes sure that the original error message is preserved and not overwritten by some subsequent syntax error. It was never useful that we kept inserting characters after the first error.
* Practically requires one of the "Nerd Font" fonts, so it's disabled by default. Add 0,512ED to the profile to enable them.
* The new ED flag could be used to control Gtk icons as well, but they are left always-enabled for the time being. Is there any reason anybody would like to disable icons in Gtk?
* The list of icons has been adapted and extended from exa: https://github.com/ogham/exa/blob/master/src/output/icons.rs
* The icons are hardcoded as presorted lists, so we can binary search them (see the sketch below). This could change in the future. If there is any demand, they could be made configurable via Q-Registers as well.
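The presorted-table lookup amounts to something like this (a sketch with made-up entries; the real tables map many more extensions):

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
	const char *extension;	/* sorted ascending by strcmp() */
	const char *icon;	/* UTF-8 encoded Nerd Font glyph */
} teco_icon_t;

/* placeholder glyphs only */
static const teco_icon_t teco_icons[] = {
	{"c",   "<c icon>"},
	{"md",  "<md icon>"},
	{"tes", "<tes icon>"},
};

static int
teco_icon_cmp(const void *key, const void *entry)
{
	return strcmp(key, ((const teco_icon_t *)entry)->extension);
}

static const char *
teco_icon_for_extension(const char *extension)
{
	const teco_icon_t *found =
		bsearch(extension, teco_icons,
		        sizeof(teco_icons) / sizeof(teco_icons[0]),
		        sizeof(teco_icons[0]), teco_icon_cmp);
	return found ? found->icon : NULL;
}
```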
Now that we have Unicode support, I couldn't resist implementing icons in the Curses version. It requires Nerd Fonts and you need to add 0,512ED to your profile.
Support for other single-byte code pages is still left open, but this is only indirectly linked to Unicode.
One would think that SciTECO could insert German umlauts just fine, but
iÄÖÜ$$
leads to visual garbage. The manpage states: …
That does not imply that direct input of non-ASCII characters into an empty document was forbidden, so I thought it could work...?