Commit

docs: tweaks
favonia committed Sep 23, 2023
1 parent 6d389b7 commit 77f59c8
Showing 1 changed file with 8 additions and 8 deletions.
16 changes: 8 additions & 8 deletions docs/design.mld
@@ -13,13 +13,13 @@ In addition to the main message, the API should allow implementers to easily spe

There is a long history of using ASCII printable characters and ANSI escape sequences, and recently also non-ASCII Unicode characters, to draw pictures on terminals. To display compiler diagnostics, this technique has been used to assemble line numbers, code from end users, code highlighting, and other pieces of information in a visually pleasing way. Non-ASCII Unicode characters (from implementers or from end users) greatly expand the vocabulary of ASCII art, and we will call the new art form {i Unicode art} to signify the use of non-ASCII characters. However, these Unicode characters also impose new challenges as their visual widths are unpredictable without knowing the exact terminal (emulator), the exact font, etc. Unicode emoji sequences might be one of the most challenging cases: a pirate flag (🏴‍☠️) may be shown as a single flag on supported platforms but as a sequence with a black flag (🏴) and a skull (☠️) on other platforms. This means the visual width of the pirate flag is unpredictable. (See {{: https://unicode.org/reports/tr51/#Display}UTS #51 Section 2.2}.) The rainbow flag (🏳️‍🌈), skin tones, and many other emoji sequences have the same issue. Other less chaotic but still challenging cases include {{: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=[:ea=A:]}characters whose East Asian width is Ambiguous.}
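To make the emoji-sequence problem concrete, here is a minimal Python sketch (not part of the project itself) that lists the code points behind the pirate flag and looks up East Asian Width classes with the standard `unicodedata` module. The helper names `describe` and `east_asian_width_classes` are made up for this illustration.

```python
import unicodedata

# The pirate flag is a ZWJ sequence: black flag, zero width joiner,
# skull and crossbones, and variation selector-16.
PIRATE_FLAG = "\U0001F3F4\u200D\u2620\uFE0F"


def describe(text):
    """Return one 'U+XXXX NAME' entry per code point in text."""
    return [
        "U+{:04X} {}".format(ord(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


def east_asian_width_classes(text):
    """Return the East Asian Width class of each code point in text."""
    return [unicodedata.east_asian_width(ch) for ch in text]


if __name__ == "__main__":
    # One user-perceived character, several code points, even more bytes.
    for entry in describe(PIRATE_FLAG):
        print(entry)
    print(len(PIRATE_FLAG), "code points,",
          len(PIRATE_FLAG.encode("utf-8")), "UTF-8 bytes")
    # Latin letter, CJK ideograph, inverted exclamation mark.
    print(east_asian_width_classes("a\u4E2D\u00A1"))
```

Whether a terminal draws this sequence as one glyph or two is decided by the terminal and font, so nothing in the string itself tells a program how wide it will appear.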

It is thus wise for implementers to think twice before using emoji sequences and other tricky characters in Unicode art. To quantify the degree to which a Unicode art can remain visually pleasing on different platforms, we specify the following four levels of stability. Note that if implementers decide to integrate content from end users into their Unicode art, the end users should have the freedom to include arbitrary emoji sequences and tricky characters in their content, and the final Unicode art must remain visually pleasing as defined by the stability level.
It is thus wise for implementers to think twice before using emoji sequences and other tricky characters in Unicode art. To quantify the degree to which a Unicode art can remain visually pleasing on different platforms, we specify the following four levels of stability. Note that if implementers decide to integrate content from end users into their Unicode art, the end users should have the freedom to include arbitrary emoji sequences and tricky characters in their content, and the final Unicode art must remain visually pleasing as defined by the stability levels.

- {b Level 0 (the least stable):} Stability under the assumption that every character occupies exactly the same width. Thanks to the popularity of Unicode, programs of this level are mostly considered outdated.
- {b Level 0 (the least stable):} Stability under the assumption that every character occupies exactly the same visual width. Thanks to the popularity of Unicode, programs of this level are mostly considered outdated.

- {b Level 1:} Stability under the assumption each Unicode string visually occupies a multiple of some fixed width, where the multiplier is determined by heuristics (such as various implementations of [wcwidth] and [wcswidth]). These heuristics are created to help programmers handle more characters, in particular CJK characters, without dramatically changing the code. They however do not solve the core problem (that is, visual width is fundamentally ill-defined) and they often could not handle tricky cases such as emoji sequences at all. Many compilers are at this level.
- {b Level 1:} Stability under the assumption each Unicode string visually occupies a multiple of some fixed width, where the multiplier is determined by heuristics (such as various implementations of [wcwidth] and [wcswidth]). These heuristics are created to help programmers handle more characters, in particular CJK characters, without dramatically changing the code. They however do not solve the core problem (that is, visual width is fundamentally ill-defined) and they often could not handle tricky cases such as emoji sequences. Many compilers are at this level.

- {b Level 2a:} Stability under very limited assumptions on which characters should have the same widths. For example, if a Unicode art only assumes Unicode box-drawing characters are of the same width (which is the case in all conceivable situations), then its stability is at this level. However, the phrase "very limited" is somewhat subjective, and thus we present a more precise version below.
- {b Level 2a:} Stability under very limited assumptions on which characters should have the same widths. For example, if a Unicode art only assumes Unicode box-drawing characters are of the same visual width (which is the case in all conceivable situations), then its stability is at this level. However, the phrase "very limited" is somewhat subjective, and thus we present a more precise version below.

- {b Level 2b:} Stability under only these assumptions:
{ul
@@ -29,18 +29,18 @@ It is thus wise for implementers to think twice before using emoji sequences and
}
This is making explicit what Level 2a means; however, we might update the details of Level 2b later to better match our understanding of Level 2a. Collectively, Levels 2a and 2b are called "Level 2".

- {b Level 3 (the most stable):} Stability under only one assumption that the same graphic cluster will have the same width regardless of the context. This means that the Unicode art will remain visually pleasing in almost all situations. It can even be rendered with a variable-width font.
- {b Level 3 (the most stable):} Stability under only one assumption that the same grapheme clusters will have the same width regardless of the context. This means that the Unicode art will remain visually pleasing in almost all situations. It can even be rendered with a variable-width font.

Unlike most implementations, which are at Level 1, our {{!module:Asai.Tty}terminal backend} strives to achieve Level 2. That means we must not make any assumption about the visual width of the end user's code and must abandon the idea of {i column numbers.} As a result, our terminal backend {i never} shows column numbers and we consider that as a significant improvement. We believe Level 3 is too restricted for terminals because we cannot show line numbers along with the end user's code. (We cannot assume numbers "10" and "99" will have the same visual width at Level 3.)
Unlike most implementations, which are at Level 1, our {{!module:Asai.Tty}terminal backend} strives to achieve Level 2. That means we must not make any assumption about the visual width of end users' code and must abandon the idea of {i column numbers.} As a result, our terminal backend {i never} shows column numbers and we consider that as a significant improvement. We believe Level 3 is too restricted for compiler diagnostics because we cannot show line numbers along with the end users' code. (We cannot assume the numbers "10" and "99" will have the same visual width at Level 3.)

Note: a fixed-with Unicode font is often technically duospaced, not monospaced, because many CJK characters would occupy a double character width. Thus, we do not use the terminology "monospaced".
Note: a fixed-width Unicode font is often technically duospaced, not monospaced, because many CJK characters would occupy a double character width. Thus, we do not use the terminology "monospaced".

{1 Raw Bytes as Positions}

All positions are {b byte-oriented.} Here are some popular alternatives which we think are worse:

+ {b Unicode characters} (which may not match user-perceived characters).
+ {b Unicode grapheme clusters,} or user-perceived characters. See the {{: https://erratique.ch/software/uuseg}uuseg} library.
+ {b Unicode grapheme clusters} or user-perceived characters. See the {{: https://erratique.ch/software/uuseg}uuseg} library.
+ {b Column numbers,} the visual width of a string in display.

It takes at least linear time to count Unicode characters (except when UTF-32 is in use) or Unicode grapheme clusters from raw bytes. Column numbers are even worse because they are not well-defined, as elaborated in the previous section. The only well-defined unit that also admits an efficient implementation is {i raw byte}.
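The cost difference can be sketched in Python (an illustration only, not the library's implementation). `count_code_points` and `slice_by_bytes` are hypothetical helpers, and the sample string is invented for the example. Counting code points has to visit every byte, whereas a byte range can be used to index the buffer directly.

```python
def count_code_points(data):
    """Count code points in well-formed UTF-8 by scanning every byte."""
    # Continuation bytes look like 0b10xxxxxx; any other byte starts a
    # new code point, so the scan is linear in the length of the input.
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def slice_by_bytes(data, start, stop):
    """Extract a range given as byte offsets; no scan is needed to find it."""
    return data[start:stop]


# Sample source text: let s = "中文"
source = 'let s = "\u4E2D\u6587"'.encode("utf-8")

if __name__ == "__main__":
    print(len(source), "bytes,", count_code_points(source), "code points")
    # The two CJK characters occupy bytes 9 to 15 (end exclusive).
    print(slice_by_bytes(source, 9, 15).decode("utf-8"))
```

Counting grapheme clusters would additionally need the Unicode segmentation rules on top of this scan, and a column number would need terminal and font knowledge that the program does not have.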