Source position information supported? #36

Closed
digitalmoksha opened this issue Dec 16, 2022 · 18 comments

@digitalmoksha

Are source maps supported, and if so to what degree? Blocks only, or full (including embedded HTML)?

@wooorm
Owner

wooorm commented Dec 17, 2022

I haven’t seen anyone ever generating source maps for a compile-to-HTML language, and I don’t believe browsers support source maps on HTML.

Your question is very short. Perhaps you can spend some time framing it in more detail? https://github.com/wooorm/markdown-rs/blob/main/.github/support.md

@digitalmoksha
Author

digitalmoksha commented Dec 17, 2022

Sorry, my fault, I really wasn't very clear. I really mean source position information, not source maps.

I'm looking for source position information, similar to what is provided in https://github.com/commonmark/commonmark.js, which outputs a data-sourcepos for block level elements.
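To make the request concrete, here is a minimal sketch of what block-level position output could look like. It assumes the `startline:startcol-endline:endcol` format that `data-sourcepos` uses, handles only blank-line-separated paragraphs, and does no HTML escaping. The helper name `paragraphs_with_sourcepos` is hypothetical and is not part of markdown-rs or any other library.

```rust
/// Hypothetical helper: splits the input into blank-line-separated paragraphs
/// and renders each one as `<p data-sourcepos="...">`.
/// Lines and columns are 1-based; columns are counted in characters.
fn paragraphs_with_sourcepos(input: &str) -> String {
    let mut out = String::new();
    // Start line and collected lines of the paragraph being built, if any.
    let mut current: Option<(usize, Vec<&str>)> = None;

    // The extra empty line at the end flushes the last paragraph.
    for (index, line) in input.lines().chain(std::iter::once("")).enumerate() {
        if line.trim().is_empty() {
            if let Some((start, lines)) = current.take() {
                let end_line = start + lines.len() - 1;
                let end_col = lines[lines.len() - 1].chars().count();
                out.push_str(&format!(
                    "<p data-sourcepos=\"{}:1-{}:{}\">{}</p>\n",
                    start,
                    end_line,
                    end_col,
                    lines.join("\n")
                ));
            }
        } else {
            match current.as_mut() {
                Some((_, lines)) => lines.push(line),
                None => current = Some((index + 1, vec![line])),
            }
        }
    }
    out
}
```

Calling `paragraphs_with_sourcepos("Hello\nworld\n\nSecond para")` yields one `<p>` spanning `1:1-2:5` and a second spanning `4:1-4:11`. Inline-level positions would need the same bookkeeping per inline node.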

I believe your remark package (or one of the combinations) provides this support - sorry it's been a while since I've directly worked with remark.

Been using cmark-gfm for quite a while, and rely on the data-sourcepos for a variety of things. I'm currently looking at Rust implementations to possibly replace it. Source position information, hopefully down to the inline element (and even non-markdown elements if possible) is a requirement.

I'm also hoping to find some level of extensibility, hopefully at the markdown grammar level as well as the AST level.

Edit: Another thing driving this is having a relatively common parser in Rust for backend and one in JS for the frontend.

@digitalmoksha digitalmoksha changed the title Source maps supported? Source position information supported? Dec 17, 2022
@wooorm
Owner

wooorm commented Dec 18, 2022

Yes, an AST is supported here too, and the AST contains positional info. See the docs for more info! https://docs.rs/markdown/1.0.0-alpha.5/markdown/fn.to_mdast.html

hopefully at the markdown grammar level

No Rust parser provides this to my knowledge. I don’t think it can be achieved in Rust.

the AST level

See the other open issues/PRs, and the issue tracker on https://github.com/wooorm/mdxjs-rs, for more on this!

@wooorm wooorm closed this as completed Dec 18, 2022
@digitalmoksha
Author

Yes, an AST is supported here too, and the AST contains positional info. See the docs for more info! https://docs.rs/markdown/1.0.0-alpha.5/markdown/fn.to_mdast.html

Ah ok, I see it in the AST now. But I still would need HTML output. It looks like to_html and to_html_with_options only take strings as inputs, not an AST. So what would be the next step? Am I missing something obvious?

No Rust parser provides this to my knowledge. I don’t think it can be achieved in Rust.

Have you taken a look at https://github.com/rlidwka/markdown-it.rs? It seems to have been able to achieve it, or at least it seems like it. Or are we talking about two different things?

@wooorm
Owner

wooorm commented Dec 19, 2022

So what would be the next step

at least it seems like it

Maybe, I’d like to see more examples, e.g., math, frontmatter, directives.

The extensions they show as examples, are each better done on the AST I believe.

It will still likely be impossible to support here: this is based on enums to switch between states. Enums cannot be extended from outside of a project in Rust AFAIK.
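The point about enums can be illustrated with a toy state machine. This is only a sketch: `State`, `step`, and `count_emphasis_openings` are hypothetical names and do not reflect the actual internals of this crate. The `match` has to account for every variant of a closed enum, so a downstream crate has no place to register a new state.

```rust
/// Hypothetical tokenizer states. An enum is a closed set:
/// code outside the defining crate cannot add a variant.
#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Text,
    Emphasis,
    Done,
}

/// One step of the toy state machine. `None` stands for end of input.
/// Supporting a new construct means editing this `match` in this crate.
fn step(state: State, byte: Option<u8>) -> State {
    match (state, byte) {
        (State::Done, _) | (_, None) => State::Done,
        (State::Text, Some(b'*')) => State::Emphasis,
        (State::Emphasis, Some(b'*')) => State::Text,
        (current, Some(_)) => current,
    }
}

/// Runs the machine over the input and counts how often emphasis is opened.
fn count_emphasis_openings(input: &str) -> usize {
    let mut state = State::Text;
    let mut openings = 0;
    for byte in input.bytes() {
        let next = step(state, Some(byte));
        if state == State::Text && next == State::Emphasis {
            openings += 1;
        }
        state = next;
    }
    openings
}
```

A design that dispatches through trait objects instead of an enum would be open to outside implementations, but that is a different architecture from the one described here.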

@digitalmoksha
Author

Maybe, I’d like to see more examples, e.g., math, frontmatter, directives.

One example of something that would be difficult to implement on the AST would be supporting multiline blockquotes. They should behave similarly to a code block in terms of syntax, but instead of wrapping as a code block, the content would be wrapped in a <blockquote>.

I don't see a way that this could be implemented properly on an existing AST. We currently implement it as pre-processing step on the raw markdown, but it's not at all ideal.
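As a rough illustration of such a pre-processing step, the sketch below rewrites fenced blocks into ordinary `> ` lines before the real parser runs. It assumes a line containing only `>>>` opens and closes the block; the function name `expand_multiline_blockquotes` is hypothetical and this is not the production implementation.

```rust
/// Hypothetical pre-processor: rewrites blocks fenced by `>>>` lines into
/// ordinary `> `-prefixed blockquote lines.
fn expand_multiline_blockquotes(input: &str) -> String {
    let mut out: Vec<String> = Vec::new();
    let mut in_quote = false;
    for line in input.lines() {
        if line.trim_end() == ">>>" {
            // A fence line toggles the quote and is dropped from the output.
            in_quote = !in_quote;
        } else if in_quote {
            if line.is_empty() {
                // Keep blank lines inside the quote so it is not split in two.
                out.push(">".to_string());
            } else {
                out.push(format!("> {}", line));
            }
        } else {
            out.push(line.to_string());
        }
    }
    out.join("\n")
}
```

The weakness shows up quickly: the function has no notion of context, so a `>>>` line inside a fenced code block is rewritten as well. A parser-level extension would know it is inside a code block and leave the line alone.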

@wooorm
Owner

wooorm commented Mar 20, 2023

True, that’s not really possible on an AST.

I am personally pretty strongly against adding more syntax extensions to markdown. I think it makes markdown less portable. I wrote a bunch about it here: https://github.com/micromark/micromark#extending-markdown.
I prefer directives or MDX, which solve the need for all future syntax extensions.
I haven’t had the time to implement directives here yet tho.

@digitalmoksha
Author

digitalmoksha commented Mar 20, 2023

In general I agree with you. I prefer not adding to the markdown syntax unless it can be portable. However multiline blockquotes in GitLab have been around awhile (and are actually very useful). We pre-process the raw markdown, which is prone to errors. Being able to have the parser handle it in the proper context is useful.

But I do think a couple extensions should be added to markdown, such as definition lists, math (which you already have implemented), emojis, and an attribute syntax so that image sizes can be specified. I just wish CommonMark would move forward on some common extension syntax. 😢

I like some of the work that @jgm has done in https://github.com/jgm/commonmark-hs/tree/master/commonmark-extensions/src/Commonmark/Extensions. As the author of CommonMark, I feel a little more comfortable following those extensions. For example our image sizing follows along with his attribute syntax

I don't know a lot about MDX, but it seems to be geared toward javascript, and basically looks like XML/HTML. In that case, it seems like just using HTML would be better - I'm not sure how MDX would make HTML definition lists any better/less complicated. And I admit I don't know anything about directives yet.

Edit: I think directives would certainly solve some markdown extension problems. There is still an issue regarding portability, since similar behaving directives could be named differently. But it would provide well-behaved hooks for custom rendering. But I'm not sure it would help with say definition lists.

@wooorm
Owner

wooorm commented Mar 21, 2023

I prefer not adding to the markdown syntax unless it can be portable

That’s why I try to push people towards directives. They are one syntax that solves all other syntax extensions.
If a heavyweight like GH jumps on it, we’re done.

And MDX is an alternative for that, useful for programmers.

attribute syntax

That too could be done with directives.

I just wish CommonMark would move forward on some common extension syntax.

You probably guessed, I think the main thing is to add directives, then there might be some small improvements, but otherwise I think the stretch is out of markdown, and it isn’t going to see new syntax extensions ever.

Like, can you imagine HTML being extended with a new syntax? Suddenly having a + somewhere turns it into this new thing! That’s impossible, too much existing HTML would break.
What can change in HTML, is new elements being added, a <slideshow> element or so.
When we have directives, we have this power in markdown too.

As the author of CommonMark, I feel a little more comfortable following those extensions

Aside, but that’s also a problem. I don’t believe a single person should impact a language used by zillions like that. I’d prefer CommonMark being more of a committee, with more formalized governance. And not having a (very smart) single person typing up some things that aren’t specced and having it become “de facto”.

I don't know a lot about MDX, but it seems to be geared toward javascript, and basically looks like XML/HTML.

Yes

In that case, it seems like just using HTML would be better - I'm not sure how MDX would make HTML definition lists any better/less complicated.

It does for literate programming cases.
Because HTML in markdown is a black box.
Markdown sees something that looks like maybe XML? And it gives up: it passes that string to a browser.
No tool can normally access the information in there: it’s just a string.
MDX completely parses those tags.
Tools can access the information in there. It’s the same as directives. Arbitrary extension mechanism with tag/component names, attributes, and children.

And MDX adds JavaScript that can be evaluated.

And I admit I don't know anything about directives yet.

They solve all your needs! 😅

Edit: I think directives would certainly solve some markdown extension problems. There is still an issue regarding portability, since similar behaving directives could be named differently. But it would provide well-behaved hooks for custom rendering. But I'm not sure it would help with say definition lists.

True!
I think that’s a bridge to cross when we get there.
Similarly, HTML has “custom elements” for users (tag names with a dash). That gives them the freedom to add <slideshow>.
We could do something similarly!
Or, we could use uppercase-first for users?
Good to have something like that baked in from the start!

@digitalmoksha
Author

That’s why I try to push people towards directives. They are one syntax that solves all other syntax extensions

I took a look at https://github.com/commonmark/commonmark-spec/wiki/Generic-Directive-Extension-List, and I kept wanting each proposal to have an actual example, to make it concrete.

I think it opens up the possibility of adding certain custom rendering, but it's not effective for portability unless there is a standard agreement for how a specific directive works. Take the YouTube example from remark-directive, ::youtube[Video of a cat in a box]{#01ab2cd3efg}

Without consensus, it's entirely possible and likely that someone will code their implementation to use {src=#01ab2cd3efg} instead of {#01ab2cd3efg} because using src fits better. Now it's no longer portable. Whereas if it was an extension of the image syntax, for example, how the src is encoded is well specced out, and the fallback rendering is obvious.
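A small sketch makes the portability gap visible. It assumes a simplified leaf-directive shape, `::name[label]{attributes}`, with label and attributes both required, which is narrower than the syntax remark-directive accepts. `LeafDirective` and `parse_leaf_directive` are hypothetical names. The attribute text is deliberately left uninterpreted, because its meaning is exactly what two implementations could disagree on.

```rust
/// Parsed form of a leaf directive such as `::name[label]{attributes}`.
#[derive(Debug, PartialEq)]
struct LeafDirective {
    name: String,
    label: String,
    /// Raw text between the braces; its meaning is up to each implementation.
    attributes: String,
}

/// Hypothetical parser for the simplified shape described above.
fn parse_leaf_directive(line: &str) -> Option<LeafDirective> {
    let rest = line.strip_prefix("::")?;
    // A third colon would start a container directive, which is not handled.
    if rest.starts_with(':') {
        return None;
    }
    let label_start = rest.find('[')?;
    let name = &rest[..label_start];
    let valid_name = !name.is_empty()
        && name
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_');
    if !valid_name {
        return None;
    }
    let after_name = &rest[label_start + 1..];
    let label_end = after_name.find(']')?;
    let label = &after_name[..label_end];
    let attributes = after_name[label_end + 1..]
        .strip_prefix('{')?
        .strip_suffix('}')?;
    Some(LeafDirective {
        name: name.to_string(),
        label: label.to_string(),
        attributes: attributes.to_string(),
    })
}
```

Both `{#01ab2cd3efg}` and `{src=#01ab2cd3efg}` parse without complaint, yielding `#01ab2cd3efg` and `src=#01ab2cd3efg` respectively. The syntax is shared; what a renderer does with each attribute string is not.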

I also don't see how it would address a definition list syntax. I suppose you could have a block that starts with :::definition_list or something. You would still need to decide, and everyone agree, on how that's specified within the block.

On the flip side, there is already a format for definition lists that have been widely used. I use them all the time, and I think it's a relatively elegant solution to that particular problem.

I do agree that it's not feasible to keep adding new syntax ad infinitum. But HTML does in fact add new syntax. For example <figure> was not part of the HTML4 spec, but it is for HTML5. It's documented how to use it, has a specific well defined syntax and behavior. And eventually browsers supported it. And I would venture that it's only because it was part of the spec, not just a Mozilla thing.

Having something well defined in the CommonMark spec, or a CommonMark extension spec, would actually bring wider adoption and portability. GFM is a great example.

But I'm not sure it would help with say definition lists.

I think that’s a bridge to cross when we get there

The problem is I'm already there. I have quite a few requests for definition list support. Same way I had a huge number of requests to provide a way to size markdown specified images.

The community hasn't made much progress in the last decade on agreeing to a syntax for these extensions. I've been waiting, hoping. At some point, one has to decide to move forward anyway, picking the best, most commonly adopted syntax. That's why I tend to look to https://github.com/jgm/commonmark-hs/tree/master/commonmark-extensions. I know that most things he puts together have thought and portability in mind. Whether it ends up being "the final" spec for something, who knows.

I’d prefer CommonMark being more of a committee, with more formalized governance

I would agree with that, as long as they had a mandate to actually make decisions. Take everyone's input on proposed extensions, but finally take a decision.

So while I think directives are interesting and useful, and in general I support them, I don't think they are a panacea. I think having the ability to add extensions to the parser when other options don't make sense, is important. And when you have a large legacy of markdown data, as we do, it's important to be able to continue to support it. Which is what extensions would enable us to do.

@wooorm
Owner

wooorm commented Mar 23, 2023

Hi again! :)

I kept wanting each proposal to have an actual example, to make it concrete.

You may have seen this but an example of how to use them is in https://github.com/micromark/micromark-extension-directive. A bit down there’s a description of the syntax.

it's not effective for portability unless there is a standard agreement for how a specific directive works

That’s why I want this in a spec or embraced by GH!

Whereas if it was an extension of the image syntax, for example, how the src is encoded is well specced out, and the fallback rendering is obvious.

I fail to see how some new syntax wouldn’t have the same problems as directives that you describe?

You would still need to decide, and everyone agree, on how that's specified within the block.

Agreed. You need consensus. With any syntax

there is already a format for definition lists that have been widely used.

It has some problems:
a) there’s no consensus: because it’s not in CM or GFM. Directives have the same problem
b) it’s ambiguous: there’s existing markdown out there that will break. Directives have a funky enough grammar that it’s less likely existing markdown will break
c) new syntax extensions are not very scalable: sure we can have a couple extensions like this, but not too many. Directives solve that: one syntax for multiple extensions

But HTML does in fact add new syntax. For example <figure> was not part of the HTML4 spec, but it is for HTML5

Strong disagree: that isn’t a new syntax, it’s new semantics.
In HTML 4, <figure> could be used:
a) it was understood how it should be handled by a parser, namely as an unknown element, basically as a div
b) there was tooling to warn that you shouldn’t use it

Importantly, <figure> being added in HTML 5 was new semantics. It didn’t affect the syntax of HTML. For that, you need a syntax and a definition of how to handle unknown semantics.
We need that in markdown. That’s why directives are a panacea!

Perhaps of interest: some similar discussion is here: https://github.com/orgs/community/discussions/16925#discussioncomment-2791869 :)

@digitalmoksha
Author

Hey

Strong disagree: that isn’t a new syntax, it’s new semantics.
In HTML 4, <figure> could be used:
a) it was understood how it should be handled by a parser, namely as an unknown element, basically as a div
b) there was tooling to warn that you shouldn’t use it

Importantly, <figure> being added in HTML 5 was new semantics. It didn’t affect the syntax of HTML. For that, you need a syntax and a definition of how to handle unknown semantics.
We need that in markdown. That’s why directives are a panacea!

Ok, the difference between syntax and semantics. So we agree that you need consensus on the syntax - the actual syntax of how a directive is written.

But the HTML spec is also built on a consensus of semantics, such as <figure>. Without that agreement, then one person can implement a tag called <figure>, another uses <photo>. The portability is broken.

Same with directives. Even with an agreed upon syntax, you would need some consensus on the semantics - is it :::spoiler or a :::reveal? A :::figure or a :::photo? Otherwise it's not portable. And that's ok for some things. And maybe it just "works itself out", like GFM. Though I would like to see more intentionality.

That’s why I want this in a spec or embraced by GH!

I can't make any promises, but it's something I would consider looking at adding to GitLab.

there is already a format for definition lists that have been widely used.

It has some problems:
a) there’s no consensus: because it’s not in CM or GFM. Directives have the same problem
b) it’s ambiguous: there’s existing markdown out there that will break. Directives have a funky enough grammar that it’s less likely existing markdown will break
c) new syntax extensions are not very scalable: sure we can have a couple extensions like this, but not too many. Directives solve that: one syntax for multiple extensions

Well, it's an extension for a reason. I'm not advocating adding it to CommonMark core. And there wasn't a consensus on markdown until CommonMark arrived. But there are many people using a specific syntax, and by settling on that, you can serve a lot of people, and drive wider adoption. Most implementations I've seen for definition lists use that syntax. For example wataru-chocola/remark-definition-list

I would just like the option to be there, for an organization that wants to use remark and/or markdown-rs, to be able to allow extensions such as wataru-chocola/remark-definition-list.

And I have yet to see a better syntax for it. I have no clue how directives would even approach this, without driving the document author crazy.

@wooorm
Owner

wooorm commented Mar 24, 2023

But the HTML spec is also built on a consensus of semantics, such as <figure>. Without that agreement, then one person can implement a tag called <figure>, another uses <photo>. The portability is broken.

Yep.
This may indeed be something to solve.
But there’s two ways of thinking about this:
a) XML style, where there are no semantics, users could write whatever
b) JSX or modern HTML style: there’s a difference between “standard” and “custom”. In HTML that’s whether there’s a dash in there. In JSX it’s complex but in short capital-first is custom and lowercase-first is standard. <search> vs <my-search> in HTML, and div vs MyDiv in JSX.

Might be useful to bake that in from the start!
Assuming there’s a syntax, then “we” can have names for what’s standard and what’s custom, and a registry of currently known “standard” things.

Without it, we at least have the syntax. And everything is “custom”. Better than before in my opinion, but not super portable.
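The two naming conventions mentioned above are simple enough to write down. This is a sketch of the short version only: the function names are hypothetical, and the JSX rule is the "in short" form, not the full rule.

```rust
/// Whether a name is reserved for the standard or free for users.
#[derive(Debug, PartialEq)]
enum NameKind {
    Standard,
    Custom,
}

/// HTML convention: a name with a dash in it is a custom element.
fn classify_html_name(name: &str) -> NameKind {
    if name.contains('-') {
        NameKind::Custom
    } else {
        NameKind::Standard
    }
}

/// JSX convention, simplified: capital-first is custom, lowercase-first is standard.
fn classify_jsx_name(name: &str) -> NameKind {
    match name.chars().next() {
        Some(first) if first.is_uppercase() => NameKind::Custom,
        _ => NameKind::Standard,
    }
}
```

With this, `search` and `div` classify as standard, `my-search` and `MyDiv` as custom. A directive syntax could adopt either rule for directive names.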

I can't make any promises, but it's something I would consider looking at adding to GitLab.

I didn’t know your “we” was GitLab. Interesting! Yes, please do! :)

And I have yet to see a better syntax for it. I have no clue how directives would even approach this, without driving the document author crazy.

A particular problem exists around definition lists: it’s basically an alternative for writing HTML tags. Which are, as we’ve earlier discussed, viable in markdown. GitHub allows them!
They need titles and definitions. Just like the corresponding HTML tags.
But directives particularly solve the thing where that isn’t viable. Like a :youtube or so component.
A :youtube component really abstracts complex handling away.
But :dl, :dt, and :dd components are basically the same as HTML, just a slightly different syntax.

by settling on that, you can serve a lot of people

The complexity here for me, while I understand it’s useful to you, is how to best serve the markdown world? Less extensions is in my head better. Some extensions (e.g., math) are okay. Tough to weigh!

@digitalmoksha
Author

The complexity here for me, while I understand it’s useful to you, is how to best serve the markdown world? Less extensions is in my head better. Some extensions (e.g., math) are okay. Tough to weigh!

I understand what you're saying, and I don't think there is anything wrong with having the crate be opinionated. I would add your directive functionality as a part of this crate, controlled via a switch as you do the math support. Assuming you feel the spec of it is complete enough.

It would be best if the syntax could be accepted as a core of CommonMark, so that there would be a defined fallback if a parser didn't support a particular directive. Even showing as a code block would be sufficient. As it stands it would just be run-on text. But you'll have to win that battle on the CommonMark forum.

But in my own opinion, serving the CommonMark community is also supporting the ability for devs to extend via their own extensions. If someone is writing something green, brand new, maybe they have the luxury of not needing any extensions. But if they need to support any legacy data, that may use extensions (maybe coming from the remark/micromark ecosystem), then I don't think cutting those off is the best.

At GitLab we've been very limited in the extensibility of our current parser. This hampers us in being able to support not only features that customers want, but in performance and correctness.

Here's an example. We need to be able to know when a character has been escaped. This allows us to short circuit certain handling, such as user mentions. This is very difficult to do without access to the parser, requiring a pre-processing step, and a post-processing step. And even then I think it's missing a couple corner cases. This type of work is much better suited for the parser/renderer.
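The behaviour being asked for can be sketched on plain text. The sketch makes several loud assumptions: any backslash escapes the following character (CommonMark only allows escaping ASCII punctuation), a user name is ASCII letters, digits, `_`, or `-`, and word boundaries are ignored, so `a@b` would count as a mention of `b`. `find_mentions` is a hypothetical name.

```rust
/// Returns the user names mentioned in `text`, skipping any `@` that is
/// preceded by a backslash escape.
fn find_mentions(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut mentions = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        match chars[i] {
            // A backslash escapes the next character: skip both.
            '\\' => i += 2,
            '@' => {
                let start = i + 1;
                let mut end = start;
                while end < chars.len()
                    && (chars[end].is_ascii_alphanumeric()
                        || chars[end] == '_'
                        || chars[end] == '-')
                {
                    end += 1;
                }
                if end > start {
                    mentions.push(chars[start..end].iter().collect::<String>());
                }
                // `end` is at least `i + 1`, so the loop always advances.
                i = end;
            }
            _ => i += 1,
        }
    }
    mentions
}
```

For the input `hi @wooorm and \@digitalmoksha` only `wooorm` is reported. The hard part in practice is that this knowledge lives in the parser: once the text has been rendered, the backslash is gone and a later pass cannot tell the two cases apart.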

Heck, ideally, we'd build an extension specifically for user mentions (and our other special syntax, not unlike GH's # for issues/PRs) and deal with escaping right at that level.

Anyway, at least for my case, an extension system is important. And I would venture that it would be important for a lot in the community as well.

@wooorm
Owner

wooorm commented Mar 28, 2023

This hampers us in being able to support not only features that customers want, but in performance and correctness.

A bit snarky, but the user is often wrong (and also often right, at the same time).

We need to be able to know when a character has been escaped

GH implements references (to users, to issues, to commits, to CVEs) on an HTML AST. It doesn’t know about escapes either: \@wooorm -> @wooorm.
It’s not perfect. But maybe it doesn’t have to be?

Anyway, at least for my case, an extension system is important. And I would venture that it would be important for a lot in the community as well.

I understand this! I don’t think you’re wrong. I think there are trade-offs. I think it’s better for markdown to not add a lot of syntax extensions. I think it’s better for vendors to not add custom syntax extensions that don’t work in other places.

@digitalmoksha
Author

digitalmoksha commented Mar 28, 2023

A bit snarky, but the user is often wrong (and also often right, at the same time).

Yeah but when they're right, they are right. And it's then incumbent on me, as a provider, to make things work as best they can to solve their problems.

For example the escaping issue. They are absolutely right - when you write \@user you expect, based on the rigorous CommonMark rules, that it's going to show a @ and not some other special link. I think that's a very fair assumption. And there are bug reports for it:

I also posted on the cmark forum a couple years ago about it: commonmark/cmark#366

I'm incredibly disappointed that it was impractical to fix this any other way than we did. It's a real hack. But in most cases, it solves a customer problem and annoyance. One less of a thousand cuts.

If the library supported extensions, after failing to get the library authors to add the capability, I could have added it myself.

Anyway, at least for my case, an extension system is important. And I would venture that it would be important for a lot in the community as well.

I understand this! I don’t think you’re wrong. I think there are trade-offs. I think it’s better for markdown to not add a lot of syntax extensions. I think it’s better for vendors to not add custom syntax extensions that don’t work in other places.

I'm very reluctant to add any new syntax or AST transformations, which tend to be just as unportable. I spend a lot of time looking at alternatives, most commonly accepted solutions, as well as pushing back. And there are times, and customer requirements, that require a solution. Plain and simple.

Adding something via the AST is useful in many cases, and a hack in others. If something needs to get added, it can many times be made more CommonMark compliant by having the option of adding it at the proper place in the parsing chain. And many AST transformations are indeed adding syntax which is not portable. And remember, some syntax is not meant to be portable. Some features, such as @ user mentioning or issue referencing, make no sense in other contexts.

In any case, I'm not sure I've moved the needle at all in this discussion, which is fine. I do think it's a bummer that you provide the ability to have a rich ecosystem for remark/micromark, and that it won't carry to the Rust version.

I know for us, based on our requirements and experience thus far, using a system that doesn't provide us with that capability is a tough sell.

I think it’s better for markdown to not add a lot of syntax extensions.

I think it's better for markdown to have the CommonMark community/writers push forward on finally solving some of the many discussions around extensions, various proposed syntaxes, etc. I will work hard to have our implementation fall in line with any real consensus. Until then, features will continue to be added by implementors, such as the proposed note syntax, that don't really line up well with CommonMark. 🤷

@wooorm
Owner

wooorm commented Mar 29, 2023

Yeah but when they're right, they are right. And it's then incumbent on me, as a provider, to make things work as best they can to solve their problems.

I argue they are typically right about the problem. Not right about what they propose as a solution.
A common convention I know for not mentioning people, is to use @\wooorm. I personally think that this is an acceptable solution.

If the library supported extensions, after failing to get the library authors to add the capability, I could have added it myself.

There are significant benefits to traversing syntax trees for several features as opposed to plugging into the parser. (Not always: the math extension supported by GitHub is terrible!). Especially in tools that support a subset of HTML. It makes, for example <div>@wooorm</div> work. Syntax extensions to markdown can’t see this. Programs that traverse trees can.
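A toy tree walk shows why this works. The `Node` type below is a hypothetical stand-in for an HTML syntax tree, not the type of any real library, and skipping `code` elements is an assumption added for illustration. Because the walk visits text nodes wherever they sit, a mention inside `<div>` is found like any other.

```rust
/// Hypothetical, minimal HTML-like tree.
enum Node {
    Element { tag: String, children: Vec<Node> },
    Text(String),
}

/// Collects `@name` mentions from every text node in the tree.
/// Text inside `code` elements is skipped (an assumption of this sketch).
fn collect_mentions(node: &Node, found: &mut Vec<String>) {
    match node {
        Node::Text(value) => {
            for word in value.split_whitespace() {
                if let Some(name) = word.strip_prefix('@') {
                    if !name.is_empty() {
                        found.push(name.to_string());
                    }
                }
            }
        }
        Node::Element { tag, children } => {
            if tag == "code" {
                return;
            }
            for child in children {
                collect_mentions(child, found);
            }
        }
    }
}
```

A markdown syntax extension never sees the inside of `<div>@wooorm</div>`, since raw HTML is passed through as a string. The walk above only needs the tree.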

Note, we already have character escapes. What you might want in this case, is a CST. We expose all this info (it’s not obvious and pretty yet):

I’ve kept this somewhat hidden until people need it. With those needs, we can design good APIs.
Without good needs, we’d get bad APIs.

I do think it's a bummer that you provide the ability to have a rich ecosystem for remark/micromark, and that it won't carry to the Rust version.

I argue that the rich ecosystem is due to syntax trees, which we have some of already, and plugins, which I want to add here too.

using a system that doesn't provide us with that capability is a tough sell.

I think forks might be quite fine for the needs of GitLab. That’s what GitHub does too with cmark-gfm.
Syntax extensions aren’t fine for most folks to manage. They’re very hard to get right. They’re likely to be buggy. They need active work because their tight integration with the internals of the host project will break often.

No parser that I am aware of outside of markdown supports syntax extensions. Babel doesn’t support this. Nobody extends HTML with new syntax.
There’s only really JSX as far as I am aware, which requires lots of FAANG money and a giant userbase to get done.

In any case, I'm not sure I've moved the needle at all in this discussion, which is fine.

I’m happy to discuss this. I discuss it with many people. For years. I don’t always hold the same opinion as other times. So yes, the needle moves. But not too much haha!

@digitalmoksha
Author

I argue they are typically right about the problem. Not right about what they propose as a solution. A common convention I know for not mentioning people, is to use @\wooorm. I personally think that this is an acceptable solution.

Wow, no, totally disagree. Yeah when that's your only option, sure that's what someone has to do. But making someone write "Firehouse #52" or "Firehouse #\52" when I feel the CommonMark escape rules kinda cover it, nah. It's fixable, provides a much better and consistent user experience. I side with the user on this one.
