Support anchor link targets #31
Comments
Cool! :-) You are right: no file-system caching of downloaded files is necessary. Scanning for references once after downloading and caching that in memory makes more sense. Everything you write here sounds good. PS: The links in your issue do not work. It looks like you used relative links, and instead of pointing to the sources, they are resolved relative to this URL. Here are the (absolute) working links:
Thanks a lot! great issue.. I hope I get to look into it these days (I think I will :-) ). |
A question ... |
Thanks! Don't know what I thought when I created those links. Broken links in a link checker tool issue... 🤦♂️ |
Good question. Now that I look at those markdown files, I wonder why I did not put them in their corresponding dirs in the first place. It would be great if you added tests for anchor links in the markdown and html dirs. I will clean up here. One more thing: there is already a warn message in https://github.com/becheran/mlc/tree/master/src/link_validator/file_system.rs line 65 where I print a message that everything after |
A question ... You have code scanning Markdown files for links (src/markdown_link_extractor.rs) and code scanning HTML files for links (src/link_extractors/html_link_extractor.rs). Now we need code scanning for anchor targets in Markdown and HTML too. Another question: should we make the anchor checking optional? (I would say yes, because.. why not?) |
Wow, I just had a look at my parser code again, and I have to admit it is a bit hard to read with all the conditions... But in the end, both extractors push a new link to the result vectors at one or several points. For example line 159 in The actual anchor check and file load/download should happen somewhere in the link_validator, I would say (without having a 100% clear strategy yet).
Yes. Though, I would prefer it to be enabled as default and can be disabled via a command line arg such as |
Speaking of hard to read... I am trying to get rid of the a bit strange looking and not so performant throttle code in |
ok.. I do not understand that, but.. good luck! :-) |
You can work on that part right now. My changes should be far away in lib.rs and if we need to merge I will do that. No worries |
As you write, this is the case for GFM, but I just found out, that Pandoc Markdown does it differently. :/ |
That's the problem of not having a md standard. Every implementation can do it differently :-( I guess there is no way around having a command line option. But what would be your most relevant usecase? Personally I use Just re-visited issue #19 where someone states:
Never saw this |
@becheran It was me writing about the |
My main use cases are GitHub, GitLab and Pandoc too. I also use the
Without that syntax, the anchors would just be enumerated ( |
.. I am not dead! |
@hoijui take all the time you want. There are no deadlines. We do this for fun. At least this is how I treat coding in my spare time. It should be at least as enjoyable as watching a Netflix series 😛 If you need/want any help, plz let me know. Rust rocks! 🦀 |
:-) And yes.. rust can be fun I learned now, cool! :-)
} else if line_chars.get(column) == Some(&'<')
    && line_chars.get(column + 1) == Some(&'/')
    && line_chars.get(column + 2) == Some(&'s')
    && line_chars.get(column + 3) == Some(&'c')
    && line_chars.get(column + 4) == Some(&'r')
    && line_chars.get(column + 5) == Some(&'i')
    && line_chars.get(column + 6) == Some(&'p')
    && line_chars.get(column + 7) == Some(&'t')
    && line_chars.get(column + 8) == Some(&'>')
{
obviously, I am way too newb to decide, it just looks too verbose to me. That is a minor thing though, I am not much concerned about it, but I am about this:
The second issue there makes it practically unusable. :/ ... Yet another approach would be to use a general parsing library (lexer, PEG, LR, LL, ... I don't know what all that means ;-) ). I did not check for HTML, but with it being much more standardized and more widespread, I bet there are one or two good rust HTML parsers that could be used, though even editing your code for that looks easy enough for me to do it. But yeah... I am a bit frustrated with the Markdown issue. Of course it all comes down, in the end, to markdown not being one standard (plus Pandoc's internal issues). ... breathing again |
The links-in-title thing is something I came about purely by accident, and I know I would have missed much more if I had written the parser; I just mentioned it because ... unelegantly put, it made me feel my fear of this being complex is valid. |
I had some bad days. am better now... ignore the above! ;-) |
@hoijui sorry for late reply. You are right. The parser part is nothing to be proud of. Was kind of the first rust code that I wrote and it looks pretty ugly. I am sorry. My thoughts when I started this was this:
Another issue with this code is that I did not know the very well documented CommonMark and GitHub Flavored Markdown specs. I should have read them before. Anyways... too late, and it does work, except for edge cases (aka. bugs) when it doesn't :-P. |
It is a bug? I wonder what I thought when I wrote the |
OK. And for the future of this issue... Using pandoc unfortunately does not seem to be an option due to the issue that you pointed out. Without the source file metadata the parser is not really practical here. :-( So I assume there is no real way around writing our own parser? I just did a quick search and found this lib, which looks as if it does a good job of extracting tokens. I still believe that extracting links and headlines from markdown flavors should be a solvable problem. What do you think, @hoijui? Do you think it makes sense if I rewrite the parser using the logos lexer? |
Hi all, I recently wrote a very simple quick and dirty intra-links checking for my markdown documents. I'm not sure if it's of any utility for you, but just maybe. Here's the code.
use std::collections::HashSet;
use itertools::Itertools;

/// Checks that each `[Foo](#foo)` link has a corresponding `# Foo header {#foo}` header
fn check_markdown_anchor_links(document: &str) -> Result<(), String> {
    fn words<'a>(s: &'a str, starting_with: &str, ending_with: char) -> HashSet<&'a str> {
        s.split(starting_with)
            .skip(1)
            .filter_map(|s| s.split_once(ending_with).map(|s| s.0))
            .collect()
    }
    let anchors = words(document, "{#", '}');
    let links = words(document, "(#", ')');
    let links_without_anchors = links.difference(&anchors).copied().collect_vec();
    if links_without_anchors.is_empty() {
        Ok(())
    } else {
        Err(format!("The following [](#) intra-document links don't have a corresponding {{#}} header anchor: {}", links_without_anchors.join(",")))
    }
}

/// Suppose that the file contains [Internal link to headers][] as per https://pandoc.org/MANUAL.html#extension-implicit_header_references.
/// Check that all such internal links refer to existing headers. See the test.
fn check_markdown_implicit_header_references(document: &str) -> Result<(), String> {
    let anchors: HashSet<&str> = {
        document
            .lines()
            .filter_map(|l| {
                if l.starts_with('#') {
                    Some(l.trim_start_matches('#').trim())
                } else {
                    None
                }
            })
            .collect()
    };
    let links: HashSet<&str> = {
        let ending = "][]";
        document
            .split_inclusive(ending)
            .filter_map(|s| {
                if s.ends_with(ending) {
                    s.trim_end_matches(ending).rsplit_once('[').map(|s| s.1)
                } else {
                    None
                }
            })
            .collect()
    };
    let links_without_anchors = links.difference(&anchors).copied().collect_vec();
    if links_without_anchors.is_empty() {
        Ok(())
    } else {
        Err(format!("The following [...][] intra-document links don't have a corresponding # header: {}", links_without_anchors.join(",")))
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_check_markdown_implicit_header_references() {
        let document = "\
# Header
This is a link to [Header][]
This is [Ignored, no trailing empty square brackets]
This is a [link to a non-existing anchor][]
Another [Ignored, this time at the end of the document]
";
        assert_eq!(
            &check_markdown_implicit_header_references(document).unwrap_err(),
            "The following [...][] intra-document links don't have a corresponding # header: link to a non-existing anchor"
        )
    }

    #[test]
    fn test_check_markdown_anchor_links() {
        let document = "\
# Header {#header}
This is a [link to the header](#header)
This is a [link to a non-existing anchor](#non-existing)
This is [link to a non-existing anchor](#non-existing)
";
        assert_eq!(
            &check_markdown_anchor_links(document).unwrap_err(),
            "The following [](#) intra-document links don't have a corresponding {#} header anchor: non-existing"
        )
    }
}
|
:D :D |
Sounds good! I guess I'll try to wrap-up what I have so far into commits, even though it is of course not functioning without the parsing part, and.. later retro-fit it to your changes... ? |
@hoijui No magic, I've been watching the issue because I'm also interested in that :-) You're right about the code just looking for opening and closing sequences regardless of context. Adding support for code blocks shouldn't be too hard, though, I can imagine parsing the document line by line in a similar fashion my code does, and adding a simple state indicator inside_block that would be triggered on and off when triple backtick is encountered (and probably ignoring the line if it starts with a four-spaces indent, that's another way of expressing a code block). I'm not saying that it's the best solution, but it is a simple one. Tests on real-world documents would probably reveal if it is viable. |
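To make the line-by-line idea above concrete, here is a minimal sketch using only the standard library. The function name is made up for illustration, and treating every line with a four-space indent as code is exactly the simplification described above, so it will skip headings that are legitimately indented inside lists.

```rust
/// Collects the text of `#` headings, skipping fenced and indented code blocks.
/// Hypothetical helper for illustration, not part of the existing code base.
fn headings_outside_code_blocks(document: &str) -> Vec<&str> {
    // The fence marker is built at runtime so this example contains no literal fence.
    let fence = "`".repeat(3);
    // Simple state indicator: true while we are between two fence lines.
    let mut inside_block = false;
    let mut headings = Vec::new();
    for line in document.lines() {
        let trimmed = line.trim_start();
        if trimmed.starts_with(fence.as_str()) {
            inside_block = !inside_block;
            continue;
        }
        // Simplification: any line indented by four spaces is treated as code.
        if inside_block || line.starts_with("    ") {
            continue;
        }
        if trimmed.starts_with('#') {
            headings.push(trimmed.trim_start_matches('#').trim());
        }
    }
    headings
}
```

Running this on real-world documents would show quickly whether the four-space rule hides too many headings.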
@dvtomas lists also use indenting, and code within lists and lists in lists may use more indents ... My use case is pretty much the opposite: |
Sure, I'm all for a robust solution and looking forward it. It's just that you could devise a simple solution in like a day, and have an already useful utility covering an overwhelming majority of use-cases. If some broken link inside a deeply nested list isn't reported, that's of course bad, but not a tragedy. You could work towards the perfect solution meanwhile. Btw, have you considered adapting (perhaps improving) some existing parser, e.g. https://crates.io/crates/markdown ? Sorry if this was already discussed, I haven't checked the whole discussion. |
.. .ahhhh good idea! |
Nice find. comrak README mentions https://github.com/raphlinus/pulldown-cmark , that might also be worth looking at. |
So I just tried the comrak lib and committed on a branch. It does the job of extracting the links. One thing that puzzles me is the start_line metadata info of parsed Nodes, which seems to be zero for all inline elements. I would need to save the line info of the last container node, which should be doable. Another issue is that the column info is missing completely. I will ask the comrak author whether this info is easy to add and whether the authors would like to add it to their lib. |
@dvtomas I also tried pulldown_cmark, and to me this looks even more promising. I like the lib interface a bit more. More importantly, it is possible to retrieve both the line and column info from the parser result. I created another branch which already contains the replacement. Two things are still missing though: the reference links seem to be handled differently, and inline html links also need to be parsed separately after they were detected
Wooow! :O
Would you like to work further on any of this? |
regarding Raw-HTML ... With Pandoc, one has to do the same as with pulldown-cmark: parse inline HTML separately. One issue I found there, which we might also see here, is that a normal HTML link within markdown:
Some text <a href="some_url">link text</a> more text.
gets parsed into an AST roughly like this:
MarkdownText("Some text")
RawHTML("<a href="some_url">")
MarkdownText("link text")
RawHTML("</a>")
MarkdownText("more text.")
So you can't just parse individual RawHTML tokens with the HTML parser and look for complete link definitions like Or maybe we can use a hack, like |
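One possible way around the split tokens, sketched under heavy assumptions: the `Token` enum below is a made-up stand-in for whatever the Markdown parser really emits, and the function simply glues everything between an opening `<a ...>` and the next `</a>` back into one string, which could then be handed to the existing HTML link extractor.

```rust
/// Hypothetical, simplified token type; a real parser's event type will differ.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Text(String),
    RawHtml(String),
}

/// Re-joins the tokens between an opening `<a ...>` and its closing `</a>`
/// into one HTML fragment per link.
fn join_html_links(tokens: &[Token]) -> Vec<String> {
    let mut fragments = Vec::new();
    // Holds the fragment being assembled while we are inside an open link.
    let mut current: Option<String> = None;
    for token in tokens {
        match token {
            Token::RawHtml(html)
                if current.is_none() && (html.starts_with("<a ") || html.starts_with("<a>")) =>
            {
                current = Some(html.clone());
            }
            Token::RawHtml(html) => {
                if let Some(mut fragment) = current.take() {
                    fragment.push_str(html);
                    if html.starts_with("</a") {
                        fragments.push(fragment); // link is complete
                    } else {
                        current = Some(fragment); // some other tag inside the link
                    }
                }
            }
            Token::Text(text) => {
                if let Some(fragment) = current.as_mut() {
                    fragment.push_str(text);
                }
            }
        }
    }
    fragments
}
```

Line and column info would still have to be carried along separately; this sketch ignores positions entirely.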
@hoijui About the open points:
Still todo.
We could wait or parse a potential text part in the headline to get this? Should be doable even without the pulldown-cmark pull-request being merged. Still TODO
I already did that. I re-used my own link extractor code. Replacing this can also be done if we need it.
Maybe not even required? Do you see issues with the html parser? A search for rust html parsers did not reveal any which keep track of metadata (line/col) in the parsed structure... Or I did not search long enough... |
I did add your code example as a unit test and my current solution with pulldown_cmark and my own html parser resolves this as expected. See this commit. |
NIICE! :D It seems to be possible to have a cargo dependency on a specific repo and commit, so I think I will do that for my branch.
|
As an alternative to that PR, we could also use this filter: I will stick with the PR for now, as I have that set up already. |
I am still on it! |
I am kind of stuck with.. rust related issue(s). ;-) I created the
do I have to put Also, I realized, that surely you will have to refactor a lot, after I am done. I have no problem with that; I see this as a learning experience. If it were not actually something useful coming out of this, I would probably give up on it. ;-) |
oh, thanks! :-) |
I think one of the problematic things is the type signature of
I may be very wrong on these points, but they are something that seems funny to me. Maybe there's a legitimate reason for all of them I don't see from a distant glance. |
ahh thanks @dvtomas ! :-)
Can't think of anything right now.. I think you are right. ;-)
Yes, I want to cache errors: say, a URL is unavailable, I want to store that, and not retry fetching it all the time.
I think I just wanted to get .. something to compile, and then would have added this later.
Indeed... ;-)
got it, thanks!
All good you did.. thank you! :-) |
slightly changed:
/// If a URL is not stored in the map (the URL does not appear as a key),
/// it means that URL has not yet been checked.
/// If the Result is Err, it means the URL has been checked,
/// but was not available, or anchor parsing has failed.
/// If the Option is None, it means the URL was checked and evaluated as available,
/// but no parsing of anchors was tried.
/// If the Vec is empty, it means that the document was parsed, but no anchors were found.
type AnchorsCache = HashMap<reqwest::Url, reqwest::Result<Option<Vec<MarkupAnchorTarget>>>>;
|
.. ok, getting the grasp of it now.. thanks! :-) |
Most of it makes sense. Maybe you'll want to have a dedicated error like
enum Error {
    RetrieveError(reqwest::Error),
    ParseError(...)
}
but you'll yet see if that's really necessary. What really looks odd to me is the If Option is None, it means the URL was checked and evaluated as available, Anyway, perhaps even replacing the whole compound value type with something like
enum CacheEntry {
    Parsed {
        errors: Vec<String>,
        anchors: Vec<MarkupAnchorTarget>,
    },
    RetrieveError(reqwest::Error),
    RetrievedButNotParsed,
}
would be a more fitting and expressive description of the relevant states? It even allows remembering errors during parsing, while also storing at least some of the anchors if the parser is lenient enough. There are no Results of Options or whatever, everything is flat and readable. What do you think? |
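For illustration, a dependency-free sketch of how such a flat enum could be queried. Everything beyond the `CacheEntry` variants is an assumption made for this example: `String` stands in for `reqwest::Url` and `reqwest::Error`, `MarkupAnchorTarget` is reduced to a single `name` field, and `AnchorStatus` and `lookup_anchor` are made-up names.

```rust
use std::collections::HashMap;

/// Stand-in for the real anchor type (assumption: only the name matters here).
#[derive(Debug, Clone, PartialEq)]
struct MarkupAnchorTarget {
    name: String,
}

/// Flat cache states as proposed above; `String` replaces `reqwest::Error`.
#[derive(Debug, Clone, PartialEq)]
enum CacheEntry {
    Parsed {
        errors: Vec<String>,
        anchors: Vec<MarkupAnchorTarget>,
    },
    RetrieveError(String),
    RetrievedButNotParsed,
}

/// Hypothetical outcome of checking one anchor against the cache.
#[derive(Debug, PartialEq)]
enum AnchorStatus {
    Found,
    Missing,
    DocumentUnavailable,
    NotYetKnown,
}

/// Looks up `anchor` in the cached entry for `url` (URLs are plain strings here).
fn lookup_anchor(cache: &HashMap<String, CacheEntry>, url: &str, anchor: &str) -> AnchorStatus {
    match cache.get(url) {
        // Not checked at all, or fetched without parsing: the caller must do more work.
        None | Some(CacheEntry::RetrievedButNotParsed) => AnchorStatus::NotYetKnown,
        // A cached failure means we do not retry the download.
        Some(CacheEntry::RetrieveError(_)) => AnchorStatus::DocumentUnavailable,
        Some(CacheEntry::Parsed { anchors, .. }) => {
            if anchors.iter().any(|a| a.name == anchor) {
                AnchorStatus::Found
            } else {
                AnchorStatus::Missing
            }
        }
    }
}
```

Each state gets exactly one match arm, so nothing needs the nested Result/Option reasoning from the doc comment.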
I do like the proposal of @dvtomas to use an enum return value. I should use the enum type much more frequently in rust. It is so much more powerful than in other languages. Using enums makes the state clean even without documenting it. I am also not 100% sure why you would need the |
ok, will do too! :-)
If the cache is going to be serialized and reloaded, for example. |
Ah got it. Did not have this usecase in mind. Though, I wonder for how long those caches will be valid? How can you tell that in the next run the file did not change? |
I can not tell that, but the user should have the choice, through an optional flag. |
@hoijui are you still working on this issue? |
hey! |
@hoijui hey, nice to hear. Congrats for the new job! Sounds great! |
NOTE (mostly) to self: NOTE twoo: |
..looks like I am working on this again now... ... oh NOTEs, great! |
I just rebased my stuff on top of your changes until now; was some work. :-) one question: Why do you squash commits, e.g. from pull-requests or your own feature branches? (with messy I mean stuff like: "Try to fix 1", "Try to fix again", "Fix typo in try to fix 1", ... ) |
Hey, great to hear. 😌 I don't have too many feelings about my history. For feature and bugfix branches I simply squash most of the time because I think a single commit in the history is enough to explain what I did. But I am also fine to not squash yours. |
Anchor links
The part after the # called anchor link is currently not checked. A markdown link including an anchor link target is for example [anchor](#go-to-anchor-on-same-page), or [external](http://external.com#headline-1).
How do anchor links work
HTML defines anchor targets via the anchor name tag (e.g. <a id="generator"></a>). An anchor target can also be any html tag with an id attribute (e.g. <div id="fol-bar"></div>).
The official markdown spec does not define anchor targets. But most interpreters and renderers support the generation of default anchor targets in markdown documents. For example, the GitHub markdown flavor supports auto-generated link targets for all headlines (h1 to h6) with the following rules:
Implementation hints
A good first step would be to add example markdown files with valid anchor links to the benches dir, which will be used for the [end-to-end unit tests](./tests/end_to_end.rs).
The library run method is the most important method; it uses all submodules and does the actual execution.
In the link extractor module, the part after the # needs to be extracted and saved in the MarkupLink struct.
The link validator module is responsible for the actual resolving and for checking whether a resource exists (either on disk or as URL). This code needs to be enhanced so that, if an anchor link was extracted, it not only checks for existence but also parses the target file and extracts all anchor targets. The same must be done for web links. Here a HEAD request is sent right now, and only if that fails is a GET sent. If an anchor link needs to be followed, a GET request is needed and the resulting page needs to be parsed for all anchors.
Besides the already existing grouping of identical links, which are only checked once for a performance boost, it would also make sense to parse a document that is the target of an anchor link only once and reuse the parse result for other references to the same doc. Also for performance reasons, it would be great to only download and parse documents which actually have an anchor link to them, and not all docs for all links.
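The extraction and grouping steps described above could look roughly like the following sketch. Both function names are invented for illustration and link targets are plain strings instead of the real MarkupLink struct. The point is only that each document ends up with one set of requested anchors, so it needs to be loaded and parsed once, and documents without any anchor link never enter the map.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Splits a link target at the first `#` into (document, anchor).
/// An empty document part means "same page"; an empty anchor counts as no anchor.
fn split_anchor(target: &str) -> (&str, Option<&str>) {
    match target.split_once('#') {
        Some((doc, anchor)) if !anchor.is_empty() => (doc, Some(anchor)),
        Some((doc, _)) => (doc, None),
        None => (target, None),
    }
}

/// Groups the requested anchors by target document.
/// Targets without an anchor are left out, since they need no parsing.
fn anchors_by_document<'a>(targets: &[&'a str]) -> BTreeMap<&'a str, BTreeSet<&'a str>> {
    let mut grouped = BTreeMap::new();
    for &target in targets {
        if let (doc, Some(anchor)) = split_anchor(target) {
            grouped
                .entry(doc)
                .or_insert_with(BTreeSet::new)
                .insert(anchor);
        }
    }
    grouped
}
```

A validator could then walk the map, fetch each key once (a GET for web links), extract the anchor targets, and compare them with the requested set.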