the SciTECO parser is Unicode-based now (refs #5)
The following rules apply:
 * All SciTECO macros __must__ be in valid UTF-8, regardless of the
   register's configured encoding.
   This is checked before execution, so we can use glib's non-validating
   UTF-8 API afterwards.
 * Things will inevitably get slower as we have to validate all macros first
   and convert to gunichar for each and every character passed into the parser.
   As an optimization, it may make sense to have our own inlineable version of
   g_utf8_get_char() (TODO).
   Also, Unicode glyphs in syntactically significant positions may be case-folded -
   just like ASCII chars were. This is of course slower than case folding
   ASCII. The impact of this should be measured and perhaps we should restrict
   case folding to a-z via teco_ascii_toupper().
 * The language itself does not use any non-ANSI characters, so you don't have to
   use UTF-8 characters.
 * Wherever the parser expects a single character, it will now accept an arbitrary
   Unicode/UTF-8 glyph as well.
   In other words, you can call macros like M§ instead of having to write M[§].
   You can also get the codepoint of any Unicode character with ^^x.
   Pressing a Unicode character in the start state or in Ex and Fx will now
   give a sane error message.
 * When pressing a key which produces a multi-byte UTF-8 sequence, the character
   gets translated back and forth multiple times:
   1. It's converted to a UTF-8 string, either buffered or by IME methods (Gtk).
      On Curses we could directly get a wide char using wget_wch(), but it's
      not currently used, so we don't depend on widechar curses.
   2. Parsed into gunichar for passing into the edit command callbacks.
      This also validates the codepoint - everything later on can assume valid
      codepoints and valid UTF-8 strings.
   3. Once the edit command handling decides to insert the key into the command line,
      it is serialized back into a UTF-8 string as the command line macro has
      to be in UTF-8 (like all other macros).
   4. The parser reads back gunichars without validation for passing into
      the parser callbacks.
 * Flickering in the Curses UI and Pango warnings in Gtk, due to incompletely
   inserted and displayed UTF-8 sequences, are now fixed.
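
A minimal sketch of steps 2 and 3 of the key press round trip above, written in
plain C without glib so that it is self-contained. In the commit itself this
work is done by g_utf8_get_char_validated() and g_unichar_to_utf8(); the names
sketch_utf8_decode() and sketch_utf8_encode() are hypothetical and exist only
for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of SciTECO or glib):
 * validate and decode one codepoint from the first bytes of s.
 * Returns the number of bytes consumed or 0 if the sequence is
 * truncated, overlong, a surrogate or otherwise invalid (step 2).
 */
static size_t
sketch_utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
	/* smallest codepoint that may legally use n bytes */
	static const uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
	size_t n;
	uint32_t c;

	if (len == 0)
		return 0;
	if (s[0] < 0x80) {
		n = 1; c = s[0];
	} else if ((s[0] & 0xE0) == 0xC0) {
		n = 2; c = s[0] & 0x1F;
	} else if ((s[0] & 0xF0) == 0xE0) {
		n = 3; c = s[0] & 0x0F;
	} else if ((s[0] & 0xF8) == 0xF0) {
		n = 4; c = s[0] & 0x07;
	} else {
		return 0; /* stray continuation byte or invalid lead byte */
	}
	if (n > len)
		return 0; /* incomplete sequence */

	for (size_t i = 1; i < n; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return 0;
		c = (c << 6) | (s[i] & 0x3F);
	}
	if (c < min_cp[n] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
		return 0;

	*cp = c;
	return n;
}

/*
 * Hypothetical helper: serialize an already validated codepoint
 * back into UTF-8 (step 3). buf must hold at least 4 bytes.
 * Returns the number of bytes written.
 */
static size_t
sketch_utf8_encode(uint32_t cp, unsigned char *buf)
{
	assert(cp <= 0x10FFFF && (cp < 0xD800 || cp > 0xDFFF));

	if (cp < 0x80) {
		buf[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		buf[0] = (unsigned char)(0xC0 | (cp >> 6));
		buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000) {
		buf[0] = (unsigned char)(0xE0 | (cp >> 12));
		buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	}
	buf[0] = (unsigned char)(0xF0 | (cp >> 18));
	buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
	buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
	buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
	return 4;
}
```

Because the decoder rejects every malformed sequence up front, the encoder and
any later non-validating reader (step 4) can assume well-formed input, which is
the same division of labour the commit message describes.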
rhaberkorn committed Sep 11, 2024
1 parent adc067b commit 6857807
Showing 29 changed files with 325 additions and 202 deletions.
20 changes: 14 additions & 6 deletions doc/sciteco.7.template
@@ -87,10 +87,6 @@ regular commands for command-line editing.
.
When the user presses a key or key-combination it is first translated
to an UTF-8 string.
All immediate editing commands and regular \*(ST commands however operate on
a language based solely on
.B ASCII
codes, which is a subset of Unicode.
The rules for translating keys are as follows:
.RS
.IP 1. 4
@@ -138,6 +134,18 @@ This feature is called function key macros and explained in the
next subsection.
.RE
.
.LP
All immediate editing commands and regular \*(ST commands however operate on
a language based solely on
.B ASCII
codes, which is a subset of Unicode.
\# This is because we cannot assume the presence of any particular non-ANSI
\# symbol on a user's keyboard.
Since the \*(ST parser is Unicode-aware, this does not exclude
using Unicode glyphs wherever a single character is expected,
ie. \fB^^\fIx\fR and \fBU\fIq\fR works with arbitrary Unicode glyphs.
All \*(ST macros must be in valid UTF-8.
.
.SS Function Key Macros
.
.SCITECO_TOPIC "function key"
@@ -1082,8 +1090,8 @@ Consequently when querying the code at a character position
or inserting characters by code, the code may be an Unicode
codepoint instead of byte-sized integer.
.LP
Currently, \*(ST supports UTF-8 and single-byte ANSI encodings,
that can also be used for editing raw binary files.
Currently, \*(ST supports buffers in UTF-8 and single-byte
ANSI encodings, that can also be used for editing raw binary files.
\# You can configure other single-byte code pages with EE,
\# but there isn't yet any way to insert characters.
UTF-8 is the default codepage for new buffers and Q-Registers
79 changes: 55 additions & 24 deletions src/cmdline.c
@@ -194,7 +194,7 @@ teco_cmdline_rubin(GError **error)
}

gboolean
teco_cmdline_keypress_c(gchar key, GError **error)
teco_cmdline_keypress_wc(gunichar key, GError **error)
{
teco_machine_t *machine = &teco_cmdline.machine.parent;
g_autoptr(GError) tmp_error = NULL;
@@ -283,6 +283,30 @@ teco_cmdline_keypress_c(gchar key, GError **error)
return TRUE;
}

/*
* FIXME: If one character causes an error, we should rub out the
* entire string.
* Usually it will be called only with single keys (strings containing
* single codepoints), but especially teco_cmdline_fnmacro() can emulate
* many key presses at once.
*/
gboolean
teco_cmdline_keypress(const gchar *str, gsize len, GError **error)
{
for (guint i = 0; i < len; i += g_utf8_next_char(str+i) - (str+i)) {
gunichar chr = g_utf8_get_char_validated(str+i, len-i);
if ((gint32)chr < 0) {
g_set_error_literal(error, TECO_ERROR, TECO_ERROR_CODEPOINT,
"Invalid UTF-8 sequence");
return FALSE;
}
if (!teco_cmdline_keypress_wc(chr, error))
return FALSE;
}

return TRUE;
}

gboolean
teco_cmdline_fnmacro(const gchar *name, GError **error)
{
@@ -361,7 +385,7 @@ teco_cmdline_cleanup(void)
*/

gboolean
teco_state_process_edit_cmd(teco_machine_t *ctx, teco_machine_t *parent_ctx, gchar key, GError **error)
teco_state_process_edit_cmd(teco_machine_t *ctx, teco_machine_t *parent_ctx, gunichar key, GError **error)
{
switch (key) {
case '\n': /* insert EOL sequence */
@@ -431,23 +455,30 @@ teco_state_process_edit_cmd(teco_machine_t *ctx, teco_machine_t *parent_ctx, gch
}

teco_interface_popup_clear();
return teco_cmdline_insert(&key, sizeof(key), error);

gchar buf[6];
gsize len = g_unichar_to_utf8(key, buf);
return teco_cmdline_insert(buf, len, error);
}

gboolean
teco_state_caseinsensitive_process_edit_cmd(teco_machine_t *ctx, teco_machine_t *parent_ctx, gchar key, GError **error)
teco_state_caseinsensitive_process_edit_cmd(teco_machine_t *ctx, teco_machine_t *parent_ctx, gunichar key, GError **error)
{
/*
* Auto case folding is for syntactic characters,
* so this could be done by working only with a-z and A-Z.
* However, it's also not speed critical.
*/
if (teco_ed & TECO_ED_AUTOCASEFOLD)
/* will not modify non-letter keys */
key = g_ascii_islower(key) ? g_ascii_toupper(key)
: g_ascii_tolower(key);
key = g_unichar_islower(key) ? g_unichar_toupper(key)
: g_unichar_tolower(key);

return teco_state_process_edit_cmd(ctx, parent_ctx, key, error);
}

gboolean
teco_state_stringbuilding_start_process_edit_cmd(teco_machine_stringbuilding_t *ctx, teco_machine_t *parent_ctx,
gchar key, GError **error)
gunichar key, GError **error)
{
teco_state_t *current = ctx->parent.current;

@@ -597,7 +628,7 @@ teco_state_stringbuilding_start_process_edit_cmd(teco_machine_stringbuilding_t *

gboolean
teco_state_stringbuilding_qreg_process_edit_cmd(teco_machine_stringbuilding_t *ctx, teco_machine_t *parent_ctx,
gchar chr, GError **error)
gunichar chr, GError **error)
{
g_assert(ctx->machine_qregspec != NULL);
/* We downcast since teco_machine_qregspec_t is private in qreg.c */
@@ -606,15 +637,15 @@ teco_state_stringbuilding_qreg_process_edit_cmd(teco_machine_stringbuilding_t *c
}

gboolean
teco_state_expectstring_process_edit_cmd(teco_machine_main_t *ctx, teco_machine_t *parent_ctx, gchar key, GError **error)
teco_state_expectstring_process_edit_cmd(teco_machine_main_t *ctx, teco_machine_t *parent_ctx, gunichar key, GError **error)
{
teco_machine_stringbuilding_t *stringbuilding_ctx = &ctx->expectstring.machine;
teco_state_t *stringbuilding_current = stringbuilding_ctx->parent.current;
return stringbuilding_current->process_edit_cmd_cb(&stringbuilding_ctx->parent, &ctx->parent, key, error);
}

gboolean
teco_state_insert_process_edit_cmd(teco_machine_main_t *ctx, teco_machine_t *parent_ctx, gchar key, GError **error)
teco_state_insert_process_edit_cmd(teco_machine_main_t *ctx, teco_machine_t *parent_ctx, gunichar key, GError **error)
{
teco_machine_stringbuilding_t *stringbuilding_ctx = &ctx->expectstring.machine;
teco_state_t *stringbuilding_current = stringbuilding_ctx->parent.current;
@@ -650,7 +681,7 @@ teco_state_insert_process_edit_cmd(teco_machine_main_t *ctx, teco_machine_t *par
}

gboolean
teco_state_expectfile_process_edit_cmd(teco_machine_main_t *ctx, teco_machine_t *parent_ctx, gchar key, GError **error)
teco_state_expectfile_process_edit_cmd(teco_machine_main_t *ctx, teco_machine_t *parent_ctx, gunichar key, GError **error)
{
teco_machine_stringbuilding_t *stringbuilding_ctx = &ctx->expectstring.machine;
teco_state_t *stringbuilding_current = stringbuilding_ctx->parent.current;
@@ -720,8 +751,8 @@ teco_state_expectfile_process_edit_cmd(teco_machine_main_t *ctx, teco_machine_t
gboolean unambiguous = teco_file_auto_complete(ctx->expectstring.string.data, G_FILE_TEST_EXISTS, &new_chars);
teco_machine_stringbuilding_escape(stringbuilding_ctx, new_chars.data, new_chars.len, &new_chars_escaped);
if (unambiguous && ctx->expectstring.nesting == 1)
teco_string_append_c(&new_chars_escaped,
ctx->expectstring.machine.escape_char == '{' ? '}' : ctx->expectstring.machine.escape_char);
teco_string_append_wc(&new_chars_escaped,
ctx->expectstring.machine.escape_char == '{' ? '}' : ctx->expectstring.machine.escape_char);

return teco_cmdline_insert(new_chars_escaped.data, new_chars_escaped.len, error);
}
@@ -731,7 +762,7 @@ teco_state_expectfile_process_edit_cmd(teco_machine_main_t *ctx, teco_machine_t
}

gboolean
teco_state_expectdir_process_edit_cmd(teco_machine_main_t *ctx, teco_machine_t *parent_ctx, gchar key, GError **error)
teco_state_expectdir_process_edit_cmd(teco_machine_main_t *ctx, teco_machine_t *parent_ctx, gunichar key, GError **error)
{
teco_machine_stringbuilding_t *stringbuilding_ctx = &ctx->expectstring.machine;
teco_state_t *stringbuilding_current = stringbuilding_ctx->parent.current;
@@ -773,7 +804,7 @@ teco_state_expectdir_process_edit_cmd(teco_machine_main_t *ctx, teco_machine_t *
}

gboolean
teco_state_expectqreg_process_edit_cmd(teco_machine_main_t *ctx, teco_machine_t *parent_ctx, gchar key, GError **error)
teco_state_expectqreg_process_edit_cmd(teco_machine_main_t *ctx, teco_machine_t *parent_ctx, gunichar key, GError **error)
{
g_assert(ctx->expectqreg != NULL);
/*
@@ -785,7 +816,7 @@ teco_state_expectqreg_process_edit_cmd(teco_machine_main_t *ctx, teco_machine_t
}

gboolean
teco_state_qregspec_process_edit_cmd(teco_machine_qregspec_t *ctx, teco_machine_t *parent_ctx, gchar key, GError **error)
teco_state_qregspec_process_edit_cmd(teco_machine_qregspec_t *ctx, teco_machine_t *parent_ctx, gunichar key, GError **error)
{
switch (key) {
case '\t': { /* autocomplete Q-Register name */
@@ -820,7 +851,7 @@ teco_state_qregspec_process_edit_cmd(teco_machine_qregspec_t *ctx, teco_machine_
}

gboolean
teco_state_qregspec_string_process_edit_cmd(teco_machine_qregspec_t *ctx, teco_machine_t *parent_ctx, gchar key, GError **error)
teco_state_qregspec_string_process_edit_cmd(teco_machine_qregspec_t *ctx, teco_machine_t *parent_ctx, gunichar key, GError **error)
{
teco_machine_stringbuilding_t *stringbuilding_ctx = teco_machine_qregspec_get_stringbuilding(ctx);
teco_state_t *stringbuilding_current = stringbuilding_ctx->parent.current;
@@ -860,7 +891,7 @@ teco_state_qregspec_string_process_edit_cmd(teco_machine_qregspec_t *ctx, teco_m
}

gboolean
teco_state_execute_process_edit_cmd(teco_machine_main_t *ctx, teco_machine_t *parent_ctx, gchar key, GError **error)
teco_state_execute_process_edit_cmd(teco_machine_main_t *ctx, teco_machine_t *parent_ctx, gunichar key, GError **error)
{
teco_machine_stringbuilding_t *stringbuilding_ctx = &ctx->expectstring.machine;
teco_state_t *stringbuilding_current = stringbuilding_ctx->parent.current;
@@ -905,7 +936,7 @@ teco_state_execute_process_edit_cmd(teco_machine_main_t *ctx, teco_machine_t *pa
}

gboolean
teco_state_scintilla_symbols_process_edit_cmd(teco_machine_main_t *ctx, teco_machine_t *parent_ctx, gchar key, GError **error)
teco_state_scintilla_symbols_process_edit_cmd(teco_machine_main_t *ctx, teco_machine_t *parent_ctx, gunichar key, GError **error)
{
teco_machine_stringbuilding_t *stringbuilding_ctx = &ctx->expectstring.machine;
teco_state_t *stringbuilding_current = stringbuilding_ctx->parent.current;
@@ -950,7 +981,7 @@ teco_state_scintilla_symbols_process_edit_cmd(teco_machine_main_t *ctx, teco_mac
}

gboolean
teco_state_goto_process_edit_cmd(teco_machine_main_t *ctx, teco_machine_t *parent_ctx, gchar key, GError **error)
teco_state_goto_process_edit_cmd(teco_machine_main_t *ctx, teco_machine_t *parent_ctx, gunichar key, GError **error)
{
teco_machine_stringbuilding_t *stringbuilding_ctx = &ctx->expectstring.machine;
teco_state_t *stringbuilding_current = stringbuilding_ctx->parent.current;
@@ -997,7 +1028,7 @@ teco_state_goto_process_edit_cmd(teco_machine_main_t *ctx, teco_machine_t *paren
}

gboolean
teco_state_help_process_edit_cmd(teco_machine_main_t *ctx, teco_machine_t *parent_ctx, gchar key, GError **error)
teco_state_help_process_edit_cmd(teco_machine_main_t *ctx, teco_machine_t *parent_ctx, gunichar key, GError **error)
{
teco_machine_stringbuilding_t *stringbuilding_ctx = &ctx->expectstring.machine;
teco_state_t *stringbuilding_current = stringbuilding_ctx->parent.current;
@@ -1028,8 +1059,8 @@ teco_state_help_process_edit_cmd(teco_machine_main_t *ctx, teco_machine_t *paren
gboolean unambiguous = teco_help_auto_complete(ctx->expectstring.string.data, &new_chars);
teco_machine_stringbuilding_escape(stringbuilding_ctx, new_chars.data, new_chars.len, &new_chars_escaped);
if (unambiguous && ctx->expectstring.nesting == 1)
teco_string_append_c(&new_chars_escaped,
ctx->expectstring.machine.escape_char == '{' ? '}' : ctx->expectstring.machine.escape_char);
teco_string_append_wc(&new_chars_escaped,
ctx->expectstring.machine.escape_char == '{' ? '}' : ctx->expectstring.machine.escape_char);

return new_chars_escaped.len ? teco_cmdline_insert(new_chars_escaped.data, new_chars_escaped.len, error) : TRUE;
}
12 changes: 2 additions & 10 deletions src/cmdline.h
@@ -64,16 +64,8 @@ gboolean teco_cmdline_insert(const gchar *data, gsize len, GError **error);

gboolean teco_cmdline_rubin(GError **error);

gboolean teco_cmdline_keypress_c(gchar key, GError **error);

static inline gboolean
teco_cmdline_keypress(const gchar *str, gsize len, GError **error)
{
for (guint i = 0; i < len; i++)
if (!teco_cmdline_keypress_c(str[i], error))
return FALSE;
return TRUE;
}
gboolean teco_cmdline_keypress_wc(gunichar key, GError **error);
gboolean teco_cmdline_keypress(const gchar *str, gsize len, GError **error);

gboolean teco_cmdline_fnmacro(const gchar *name, GError **error);

25 changes: 12 additions & 13 deletions src/core-commands.c
@@ -45,7 +45,7 @@
#include "goto-commands.h"
#include "core-commands.h"

static teco_state_t *teco_state_control_input(teco_machine_main_t *ctx, gchar chr, GError **error);
static teco_state_t *teco_state_control_input(teco_machine_main_t *ctx, gunichar chr, GError **error);

/*
* NOTE: This needs some extra code in teco_state_start_input().
@@ -1049,7 +1049,7 @@ teco_state_start_get(teco_machine_main_t *ctx, GError **error)
}

static teco_state_t *
teco_state_start_input(teco_machine_main_t *ctx, gchar chr, GError **error)
teco_state_start_input(teco_machine_main_t *ctx, gunichar chr, GError **error)
{
static teco_machine_main_transition_t transitions[] = {
/*
@@ -1388,7 +1388,7 @@ teco_state_fcommand_cond_else(teco_machine_main_t *ctx, GError **error)
}

static teco_state_t *
teco_state_fcommand_input(teco_machine_main_t *ctx, gchar chr, GError **error)
teco_state_fcommand_input(teco_machine_main_t *ctx, gunichar chr, GError **error)
{
static teco_machine_main_transition_t transitions[] = {
/*
@@ -1512,7 +1512,7 @@ teco_state_changedir_done(teco_machine_main_t *ctx, const teco_string_t *str, GE
TECO_DEFINE_STATE_EXPECTDIR(teco_state_changedir);

static teco_state_t *
teco_state_condcommand_input(teco_machine_main_t *ctx, gchar chr, GError **error)
teco_state_condcommand_input(teco_machine_main_t *ctx, gunichar chr, GError **error)
{
teco_int_t value = 0;
gboolean result = TRUE;
@@ -1800,7 +1800,7 @@ teco_state_control_glyphs2bytes(teco_machine_main_t *ctx, GError **error)
}

static teco_state_t *
teco_state_control_input(teco_machine_main_t *ctx, gchar chr, GError **error)
teco_state_control_input(teco_machine_main_t *ctx, gunichar chr, GError **error)
{
static teco_machine_main_transition_t transitions[] = {
/*
@@ -1841,10 +1841,10 @@ teco_state_control_input(teco_machine_main_t *ctx, gchar chr, GError **error)
TECO_DEFINE_STATE_CASEINSENSITIVE(teco_state_control);

static teco_state_t *
teco_state_ascii_input(teco_machine_main_t *ctx, gchar chr, GError **error)
teco_state_ascii_input(teco_machine_main_t *ctx, gunichar chr, GError **error)
{
if (ctx->mode == TECO_MODE_NORMAL)
teco_expressions_push((guchar)chr);
teco_expressions_push(chr);

return &teco_state_start;
}
@@ -1877,7 +1877,7 @@ TECO_DEFINE_STATE(teco_state_ascii);
* only be seen when executing the following command.
*/
static teco_state_t *
teco_state_escape_input(teco_machine_main_t *ctx, gchar chr, GError **error)
teco_state_escape_input(teco_machine_main_t *ctx, gunichar chr, GError **error)
{
/*$ ^[^[ ^[$ $$ terminate return
* [a1,a2,...]$$ -- Terminate command line or return from macro
@@ -2700,7 +2700,7 @@ teco_state_ecommand_exit(teco_machine_main_t *ctx, GError **error)
}

static teco_state_t *
teco_state_ecommand_input(teco_machine_main_t *ctx, gchar chr, GError **error)
teco_state_ecommand_input(teco_machine_main_t *ctx, gunichar chr, GError **error)
{
static teco_machine_main_transition_t transitions[] = {
/*
@@ -2874,10 +2874,9 @@ teco_state_insert_indent_initial(teco_machine_main_t *ctx, GError **error)
len -= teco_interface_ssm(SCI_GETCOLUMN,
teco_interface_ssm(SCI_GETCURRENTPOS, 0, 0), 0) % len;

gchar spaces[len];

memset(spaces, ' ', sizeof(spaces));
teco_interface_ssm(SCI_ADDTEXT, sizeof(spaces), (sptr_t)spaces);
gchar space = ' ';
while (len-- > 0)
teco_interface_ssm(SCI_ADDTEXT, 1, (sptr_t)&space);
}
teco_interface_ssm(SCI_ENDUNDOACTION, 0, 0);
teco_ring_dirtify();
2 changes: 1 addition & 1 deletion src/core-commands.h
@@ -43,7 +43,7 @@ gboolean teco_state_insert_process(teco_machine_main_t *ctx, const teco_string_t
gsize new_chars, GError **error);

/* in cmdline.c */
gboolean teco_state_insert_process_edit_cmd(teco_machine_main_t *ctx, teco_machine_t *parent_ctx, gchar chr, GError **error);
gboolean teco_state_insert_process_edit_cmd(teco_machine_main_t *ctx, teco_machine_t *parent_ctx, gunichar chr, GError **error);

/**
* @class TECO_DEFINE_STATE_INSERT
4 changes: 2 additions & 2 deletions src/error.h
@@ -61,10 +61,10 @@ typedef enum {
} teco_error_t;

static inline void
teco_error_syntax_set(GError **error, gchar chr)
teco_error_syntax_set(GError **error, gunichar chr)
{
g_set_error(error, TECO_ERROR, TECO_ERROR_SYNTAX,
"Syntax error \"%c\" (%d)", chr, chr);
"Syntax error \"%C\" (U+%04" G_GINT32_MODIFIER "X)", chr, chr);
}

static inline void