Emphasis in link URL and reference #247

faelys · 2023-09-30T09:14:24Z

faelys
Sep 30, 2023

Hello, I'm here to question the parsing of *[link](url*), and not only because my WIP parser has some issue parsing it.

First let's consider a regular emphasis instead of a strong one, because these are much more common in URLs. Then let's consider a slightly more contrived example:

_``foo_bar``{foo_bar=baz} [link](foo_bar) [link][foo_bar] end._

Can you spot where emphases start and end?

In more abstract (but slightly biased) terms, the rationale mentions three types of containers: block-level, inline-level, and raw text. In my (young) mental model, a _ in raw text is just a _ and has nothing to do with emphasis, so I don't expect ``foo_bar`` to open or close an emphasis. I can easily classify attributes in the same raw text type, because of the "low-level" and "ast-leaf" feel of attributes.

And since references and direct links also don't contain anything other than a string and are not "real" text, I would be tempted to classify them as raw text as well.

It turns out that currently, they are not raw text, but they are (obviously) not inline-level either. They are in a weird fourth type, where emphasis can be closed but not opened (see #88). This fourth type is a significant burden on my mental model, and I think it would be a good thing to see it gone.

I found and understand the "infinite look-ahead" issue of ](...), and yet AFAICT we already have the same issue with attributes, looking all the way to } or the end of the current block before deciding whether foo_bar=baz contains an emphasis delimiter.

So at this point, as a user wanting a lightweight cognitive overload from her lightweight markup language, my backwards-incompatible proposition is to treat ](, ][, and attribute-opening { the same way we treat inline code spans: they start a URL/reference/attribute span, without any emphasis or any other inline-element delimiter, all the way to their corresponding closing marker or to the end of the block. Maybe ]( should be implicitly closed by the next ASCII space or tab instead of the end of the block.
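
A minimal Python sketch of how the proposed rule for ]( might look, under stated assumptions: the function name `scan_url_span` and its return shape are hypothetical, the implicit close on an ASCII space or tab follows the "maybe" at the end of the paragraph above, and no escape mechanism for a literal ) is modelled. Nothing inside the span is looked at as an emphasis delimiter.

```python
def scan_url_span(block, start):
    """Scan a link destination that begins right after "](".

    Hypothetical helper, not taken from any djot parser.
    Returns (url, index_after_span, explicitly_closed).
    """
    i = start
    while i < len(block):
        ch = block[i]
        if ch == ")":
            # Explicit closing marker: consume it.
            return block[start:i], i + 1, True
        if ch in " \t":
            # Implicit close before the blank; the blank itself is
            # left for the inline parser.
            return block[start:i], i, False
        i += 1
    # End of block reached: the span is implicitly closed there.
    return block[start:], len(block), False


block = "[link](foo_bar) end"
print(scan_url_span(block, block.index("](") + 2))
```

With this rule the _ in foo_bar is never a candidate delimiter, whether or not the closing ) is present.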

I think there could be a case to expand this implicitly-closing scheme to all inline elements, so that there is no spooky interaction at a distance which makes _foo open an emphasis or not depending on whether a match can be found before the end of the block (but if looking all the way to the end of the block is too much cognitive overload, your block is too long, so it's more a parser-writer matter than a user matter). I guess it would be a matter of the trade-off between consistency and false-positive rate. And we can't avoid spooky interaction in foo_bar anyway.

matklad · 2023-09-30T11:35:31Z

matklad
Sep 30, 2023

I personally like this very much! As a user, I was also surprised that djot parses markup in links, but then does nothing with it (see the third bullet from https://github.com/jgm/djot/issues/232). I wasn’t able to pin down exactly what’s wrong, but I think you nailed it perfectly: indeed, the problem seems to be that the link text should be treated as raw, verbatim text, rather than as something which can have markup inside. This makes sense because, like raw inlines, links are leaf elements and can’t further be nested.

0 replies

faelys · 2023-09-30T13:05:17Z

faelys
Sep 30, 2023
Author

#232 shows a blindspot in my proposal, in that some kind of escape of closing parenthesis would be needed:

using the backslash is the obvious choice, but it makes this raw text subtly different from other raw texts, which I think is extra-bad;
the percent-escape is the second obvious choice, but it's a bit of a burden on the user (I frequently cut-and-paste wikipedia links in my markdown);
going for a deeper syntax change, the parentheses could be replaced by something invalid in URL, like an ASCII blank, but I'm afraid it wouldn't be enough visual separation from the surrounding text;
so far I can't come up with anything better than [text](url) for URLs without any ), and [text](<url>) for other URLs (at least Firefox automatically cut-and-pastes > as %3E); maybe [text]<url> could be allowed too.

There is no such issue with references, because references already appear only within brackets, so the syntax naturally prevents any embedded ], we just have to ensure automatically-generated references don't embed any.

There is a somewhat related issue with attributes: if we don't backtrack, what to do when encountering a parse error? I would be tempted to extend attribute syntax to make any text parsable, but including all punctuation feels messy. Maybe using for key and bareval the same set as name, allowing valueless keys (maybe defaulting to an empty string value?), and seeing an implicit end-of-attribute-span mark before other ASCII punctuation.

0 replies

jgm · 2023-10-09T16:23:03Z

jgm
Oct 9, 2023
Maintainer

my backwards-incompatible proposition is to treat ](, ][, and attribute-opening { the same way we treat inline code spans: they start a URL/reference/attribute span, without any emphasis or any other inline-element delimiter, all the way to their corresponding closing marker or to the end of the block

There's an important disanalogy: anything can count as verbatim text (inline code), but there are restrictions on what can be a link destination, reference, or attribute. Suppose someone writes

[hello]{.this is *not* a valid attribute!

How, exactly, do we treat this as an attribute?

0 replies

faelys · 2023-10-09T17:40:16Z

faelys
Oct 9, 2023
Author

How, exactly, do we treat this as an attribute?

That is indeed what is left to debate to turn this proposition into a specification.

I wouldn't mind treating it as Unspecified Behavior, since the construct is not valid, and let whatever result be implementation-defined. However I agree there is value (for users) in consistency across implementations, I just think it places an upper bound on the amount of effort to specify what happens in anomalous cases.

What I had in mind was to parse URL/reference/attribute all the way to their corresponding closing marker, or to the end of the block, or to where it can't be parsed anymore.

In that case it depends on how much attribute syntax is extended. With current attribute restrictions, this would backtrack (only) from the space after is to the space before, giving the following result:

<span class="this">hello</span> is <strong>not</strong> a valid attribute!

If we extended attribute syntax to allow null values, that would make the following (which I think is still valid) HTML:

<span class="this" is>hello</span> <strong>not</strong> a valid attribute!

If we extended attribute syntax to allow the same character set for keys and barevalues as we already do for names, it would be a span with 5 null-valued attributes, but I don't know how an HTML renderer could make something useful with them.

Similarly, for URLs an implicit closing would be added before the first whitespace, while for references AFAICT there is no way to include something invalid.

I agree that none of these ideas are really useful outcomes for the user, at least directly. However, having an invisible part of the construct eat everything until the end of the block makes debugging much easier by pointing exactly where the syntax error is. As a user, this is something I prefer by far compared to trying to second-guess what I could have meant and getting it almost-but-not-quite right (to the point that I have removed all brackets from my website so I can use a text-search to find markdown links I mistyped).

0 replies

jgm · 2023-10-09T17:50:18Z

jgm
Oct 9, 2023
Maintainer

What I had in mind was to parse URL/reference/attribute all the way to their corresponding closing marker, or to the end of the block, or to where it can't be parsed anymore.

This seems reasonable.

What would be really nice (in terms of avoiding implementation complexity) would be to eliminate the need for any backtracking in the attribute parser. For example we could treat an unclosed quoted value as implicitly closed.

Another alternative would be to throw an error in any of these conditions. That's not something that markdown parsers have ever traditionally done -- every document is valid markdown, it's just a question of what it means. And it can be nice in many contexts not to have to worry about the possibility of an error. But arguably this would be more helpful to the user.

0 replies

faelys · 2023-10-09T18:08:59Z

faelys
Oct 9, 2023
Author

What would be really nice (in terms of avoiding implementation complexity) would be to eliminate the need for any backtracking in the attribute parser. For example we could treat an unclosed quoted value as implicitly closed.

Assuming we allow null-valued attributes (promoting them to empty-string-valued attributes if needed), we can end attribute parsing on the first unrecognized character: if that happens in a value, that ends the key-value pair; if that happens in a key, that ends a null-valued key; and the unrecognized character can be handed over to the inline parser. Depending on the parser architecture, that might count as a one-character backtrack or a natural data flow.
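
The rule above can be sketched in Python as follows. This is only an illustration under assumptions: the function name `parse_attributes`, the `NAME_CHARS` set (ASCII letters, digits, `-`, `_`, `:`) and the `(key, value)` output shape are hypothetical, the grammar is deliberately simplified (no comments, no backslash escapes in quoted values), and an unclosed quoted value is treated as implicitly closed at the end of the block, as suggested earlier in the thread.

```python
import string

# Assumed character set for names, keys and bare values (simplified).
NAME_CHARS = set(string.ascii_letters + string.digits + "-_:")


def parse_attributes(block, start):
    """Parse an attribute span starting right after "{", never backtracking.

    Returns (attrs, next_index): attrs is a list of (key, value) pairs,
    with value None for a null-valued key; next_index is where the
    inline parser resumes.
    """
    attrs, i, n = [], start, len(block)
    while i < n:
        ch = block[i]
        if ch == "}":
            return attrs, i + 1          # explicit close
        if ch in " \t\n":
            i += 1                       # separator between attributes
        elif ch in ".#":
            j = i + 1
            while j < n and block[j] in NAME_CHARS:
                j += 1
            if j == i + 1:
                return attrs, i          # lone "." or "#": hand it over
            attrs.append(("class" if ch == "." else "id", block[i + 1:j]))
            i = j
        elif ch in NAME_CHARS:
            j = i
            while j < n and block[j] in NAME_CHARS:
                j += 1
            key = block[i:j]
            if j < n and block[j] == "=":
                k = j + 1
                if k < n and block[k] == '"':
                    m = k + 1
                    while m < n and block[m] != '"':
                        m += 1           # unclosed quote ends at end of block
                    attrs.append((key, block[k + 1:m]))
                    i = min(m + 1, n)
                else:
                    m = k
                    while m < n and block[m] in NAME_CHARS:
                        m += 1
                    attrs.append((key, block[k:m]))
                    i = m
            else:
                attrs.append((key, None))  # null-valued key
                i = j
        else:
            return attrs, i              # unrecognized: implicit close
    return attrs, n                      # end of block: implicit close


src = "[hello]{.this is *not* a valid attribute!"
attrs, rest = parse_attributes(src, src.index("{") + 1)
print(attrs, repr(src[rest:]))
```

On the example from the thread, the sketch yields a class `this`, a null-valued key `is`, and hands the text starting at `*not*` back to the inline parser.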

That is the simplest I can imagine, which would lead to the slightly-worse rendering for your example:

<span class="this" is>hello</span><strong>not</strong> a valid attribute!

Being a -Werror kind of person, I would welcome wholeheartedly a parse-error mechanism, but I'm used to being in the minority there, that's why I haven't mentioned it earlier.

0 replies

david-christiansen · 2023-10-13T04:22:59Z

david-christiansen
Oct 13, 2023

We're considering Djot syntax for a documentation tool, and at least having a version of the spec in which these cases are simply parse errors would really increase its appeal. I hear feedback from users today, who are mostly writing in Markdown and ReST using other tools, that they have to spend a bunch of time double-checking output because typos in their markup silently lead to unintended HTML rather than to red squiggle underlines.

0 replies

faelys · 2023-10-30T06:47:58Z

faelys
Oct 30, 2023
Author

Thinking about it some more, I would like to add two more points.

First, inline code spans are not the only precedent where Djot has opening markers which open unconditionally, and a lack of closing marker automatically closes the element when its parent is closed. This is also the case at block level with fenced div and fenced code blocks:

> :::
> this is a one-line fenced div in a blockquote
- this is a top-level list
> ```
> this is a one-line code block
- this ends both the code block and its enclosing blockquote

So I'm tempted to argue that the logic can be extended not only to attribute spans and inline URLs, but also to all span elements. Let *x always start a strong element (up to the end of the enclosing element if there is no explicit closing match) and x* always close a strong element (and make the * disappear altogether when there is no current strong element, just like orphan attribute spans). In x*x it could still be a closing mark if it has a match and turn into an opening mark if not.

I have no idea what impact it would have on parsers (mine or any other); I came up with this idea considering only the parser in my brain when I look at some Djot source (to be fair, that brain usually relies on syntax coloration, so that would mean Djot source in a terminal or a textarea).

The second point is an answer to the comment raised by @david-christiansen. I think the official parsers as well as mine already have some kind of warning mechanism (at least for unmatched link references), which could be leveraged by having the parser in the documentation tool have a kind of -Werror option (this is a gcc option which turns warnings into compilation-failing errors), or having that behavior always enforced.

What I mean is that the specific parser could help with that situation, while still allowing other parsers to be more lax and output something they guess the user might have meant for any malformed input.

So at Djot specification level, the question becomes whether to standardize only correct input, and let parsers do whatever they want in other cases, or go further and specify a set of errors and/or warnings and fallbacks for some or all incorrect input.

I don't have a strong opinion either way, I'd rather leave that to the vision of those here before me. A tighter specification is more work for better interoperability, at the risk of mandating a certain parser architecture (I saw that in CommonMark), while a looser specification allows more parser diversity at the risk of surprising users when going from one parser to another (which is one of the main issues with Markdown).

2 replies

jgm Oct 30, 2023
Maintainer

You're right to point out what one might think of as an inconsistency: verbatim openers are closed implicitly, while openers that contain formatted text are not.

There is a practical reason for this: in the case of openers containing formatted text, we're already parsing the formatted text, so if we don't encounter a closer there's an easy fallback that doesn't require backtracking and reparsing. With verbatim openers, by contrast, we haven't been parsing the content as formatted text, so if we don't encounter a closer we'd need to backtrack and reparse. This kind of thing can lead to quadratic (or worse?) performance bugs, and that's why I just made the practical decision to close implicitly. There isn't the same motivation for doing that with openers that contain formatted text.
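
As an illustration of the implicit close for verbatim openers, here is a small Python sketch. It is not taken from any djot implementation: the name `scan_verbatim` and its return shape are hypothetical, and details such as trimming spaces around the content are left out. Each character of the block is visited at most a constant number of times, and nothing is ever reparsed when the closer is missing.

```python
def scan_verbatim(block, start):
    """Scan a verbatim span whose opening backtick run begins at `start`.

    Returns (content, next_index, explicitly_closed).
    """
    n = len(block)
    i = start
    while i < n and block[i] == "`":
        i += 1
    width = i - start            # length of the opening backtick run
    content_start = i
    while i < n:
        if block[i] == "`":
            j = i
            while j < n and block[j] == "`":
                j += 1
            if j - i == width:
                # A run of the same length closes the span.
                return block[content_start:i], j, True
            i = j                # a run of another length is plain content
        else:
            i += 1
    # No closer before the end of the block: close implicitly, so the
    # content never has to be reparsed as formatted text.
    return block[content_start:], n, False


print(scan_verbatim("``foo_bar`` rest", 0))
print(scan_verbatim("`a_b and more", 0))
```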

faelys Oct 30, 2023
Author

I think I understand the practical reason, that's why my original proposition generalized this only to attributes and URLs, which also have a special parsing that has to be backtracked and reparsed on error.

My point in that comment was to also consider the parser embedded inside the reader's brain, especially unassisted by syntax coloration, which probably benefits more from lifting the ambiguity between matched markers and mismatched markers, than a computer parser. The consistency argument is similar, in that having the same rules for everything reduces the cognitive load of running the parser-in-the-brain.

However my point only mentioned a benefit from such a change (because the parser-in-the-brain point of view was new to me and I thought maybe you didn't consider it either), without any consideration of the costs or the compromise of whether to go with it or not.

It's just that if you considered widening the implicit-closing rule to attributes and URLs (which still seems to me like a pretty big “if” at this point), you might also consider the pros and cons of widening it even further while there.

itraveller1 · 2023-12-10T20:12:52Z

itraveller1
Dec 10, 2023

Currently, any non-closing marker can open an emphasis/strong, and any non-opening marker can close it. Perhaps it would be more logical if only an opening marker could open an inline block, and only a closing one could close it.

And this will look more natural, since in modern texts the underscore, for example, is standard practice for connecting words.

The only “inconvenient” option in this case will look like:

_aaa bbb [ccc_](ddd) eee_

But for such a rare situation, escape comes in handy.

6 replies

itraveller1 Dec 11, 2023

Yes you are right. All I'm suggesting is that you stop treating a simple underscore or asterisk (without a curly brace) inside a word as any kind of marker. This greatly simplifies the visual interpretation of the text, since words that include an underscore are very common (for example, hashtags or compound names), while those starting or ending with an underscore are very rare.

It’s very tempting to completely abandon single (without curly brace) underscores as markers, but I’m afraid many users will throw tomatoes at us :).

The peculiarity you found in the interpretation of the escaped underscore in the link text is interesting.

As for the possibility of using curly braces to better identify links and spans, I think this is a great idea. Moreover, it follows the same logic with other inline block markers.

And again, here it is appropriate to require that single characters embedded inside a word not be used as markers. This will immediately remove various kinds of collisions with complex indexes in a text, for example:

aaa{[bbb](ccc)}

I really like the underlying idea of unifying all inline blocks in Djot, where they are not ranked by priority. For its sake, you can forgive a lot, but you should just avoid using common constructions of ordinary text as syntax elements, such as built-in underscores, asterisks and parentheses.

jgm Dec 12, 2023
Maintainer

All I'm suggesting is that you stop treating a simple underscore or asterisk (without a curly brace) inside a word as any kind of marker.

What is "inside a word," though? We have to consider things like

“*word*”
(*word*)
pause---*word* after pause
pause—*word* after pause
sub-*word*

If we wanted to implement this in a way that allows the above cases but forbids

ab*cd*ef

then we'd run up against desideratum 6 from the README:

Parsers should not be forced to recognize unicode character classes,
HTML tags, or entities, or perform unicode case folding.
That adds a lot of complexity.

you should just avoid using common constructions of ordinary text as syntax elements, such as built-in underscores, asterisks and parentheses.

Arguably these don't occur in ordinary text. Internal underscores are common in identifiers used in programming languages, but when you're talking about these you should always use backticks: `my_underscore_word`. Asterisks are very rare in ordinary text; they were once used mainly for footnote references, but we don't use them for that.

itraveller1 Dec 12, 2023

By the term “inside a word” I meant being surrounded by non-whitespace characters. Sorry for the bad terminology.

Using backticks certainly eliminates such problems. You should probably just make it a rule that all such words should always be enclosed in backticks.

bpj Dec 12, 2023

It practically already is a rule, albeit an unenforced stylistic rule rather than an enforced syntactic one, that identifiers should be shown as “typewriter text” in programming documentation. Part of my daytime job is to proofread/edit/translate documentation written by programmers¹ and I correct this all the time, so it falls under people should learn to use it. On the (very) rare occasions where it is appropriate not to mark identifiers as “code” one should simply use a backslash escape like\_this. The only situation I can think of is where the style guide calls for option names/arguments to be bold/italic and either the name uses underscores or the value is a (pseudo)identifier, but in that case it is better to mark both as code as well *`like`*_`=this`_. It’s a little more work to type but enhances readability in the rendered text considerably.²

¹ It’s a bit complicated: I get plaintext/Markdown written in what can best be described as a mixture of Swedish/English or Danish/English. I add/correct markup and English, convert to docx with Pandoc and send it off to an Irish guy (not a tech guy!) who corrects any faulty English which I let slip through or added!
² FWIW I have written a Pandoc filter, using an lpeg/re parser which allows using “djot-style” delimiters like {*...*} (including {{/}} for literal braces) inside backticks so that the filter “transforms” something like
```
`{*-f, --foo*} key={_val_}`
```
into (shown as Pandoc Markdown)
```
[**`-f, --foo`**` key=`*`val`*]{.code-style}
```
which I find enhances readability of the Markdown a lot!

Maybe I should port this to a djot filter but JavaScript and me is a bad fit! (I should work on that of course… :-)

jgm Dec 12, 2023
Maintainer

By the term “inside a word” I meant being surrounded by non-whitespace characters.

But then the examples I gave, e.g.

“*word*”
(*word*)
pause---*word* after pause
pause—*word* after pause
sub-*word*

would all require the curly braces. I think that's suboptimal, because from the human point of view, the asterisks in “*word*” are not "inside a word."
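
To make the objection concrete, here is a small Python sketch of the "surrounded by non-whitespace characters" definition quoted above. The helper names `is_inside_word` and `markers_inside_word` are hypothetical; the check deliberately needs no Unicode character classes, which is why it cannot tell a quotation mark or a dash from a letter.

```python
def is_inside_word(text, i):
    """True if text[i] has a non-whitespace character on both sides."""
    if i <= 0 or i >= len(text) - 1:
        return False  # start or end of the text counts as a boundary
    return not text[i - 1].isspace() and not text[i + 1].isspace()


def markers_inside_word(text, marker="*"):
    """List (index, inside_word) for every occurrence of `marker`."""
    return [(i, is_inside_word(text, i))
            for i, ch in enumerate(text) if ch == marker]


examples = [
    "\u201c*word*\u201d",             # curly quotes around *word*
    "(*word*)",
    "pause---*word* after pause",
    "pause\u2014*word* after pause",  # em dash before *word*
    "sub-*word*",
    "ab*cd*ef",
]
for e in examples:
    print(e, markers_inside_word(e))
```

Under this definition the opening asterisk in every one of these examples counts as inside a word, so the rule cannot separate the first five cases from ab*cd*ef.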
