fix srcset attribute parsing according to the specifications #399

Merged
snshn merged 2 commits into Y2Z:master from srcset-fix
Aug 15, 2024

Conversation

rakhnin
Contributor

@rakhnin rakhnin commented Aug 15, 2024

According to the srcset attribute specification

value must consist of one or more image candidate strings, separated from the next by a U+002C COMMA character (,)

The current implementation doesn't process the srcset attribute properly when URLs are separated ONLY by a comma (,).

This problem was also mentioned in #339.
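
To make the failure mode concrete, here is a minimal sketch of a splitting loop in the spirit of the specification's parsing steps. It is an illustration only, not the code from this PR: the `Candidate` struct and `parse_srcset` function are hypothetical names, and the sketch assumes descriptors never contain parentheses or commas.

```rust
/// One parsed image candidate: a URL plus an optional descriptor such as "2x".
/// (Hypothetical type for illustration; not the struct used in src/html.rs.)
#[derive(Debug, PartialEq)]
struct Candidate<'a> {
    url: &'a str,
    descriptor: &'a str,
}

// ASCII whitespace as listed in the WHATWG Infra standard
const WHITESPACES: &[char] = &['\t', '\n', '\x0c', '\r', ' '];

/// Splits a srcset value into candidates (simplified sketch).
fn parse_srcset(input: &str) -> Vec<Candidate<'_>> {
    let mut items = Vec::new();
    let mut rest = input;
    loop {
        // Zero or more whitespaces + skip leading commas
        rest = rest.trim_start_matches(|c: char| WHITESPACES.contains(&c) || c == ',');
        if rest.is_empty() {
            break;
        }
        // The URL runs up to the next whitespace character (or the end of input)
        let url_len = rest.find(WHITESPACES).unwrap_or(rest.len());
        let (raw_url, tail) = rest.split_at(url_len);
        let url = raw_url.trim_end_matches(',');
        if url.len() < raw_url.len() {
            // The URL was terminated by a comma, so there is no descriptor
            items.push(Candidate { url, descriptor: "" });
            rest = tail;
        } else {
            // Otherwise everything up to the next comma is treated as the descriptor
            let desc_len = tail.find(',').unwrap_or(tail.len());
            let (raw_desc, after) = tail.split_at(desc_len);
            items.push(Candidate {
                url,
                descriptor: raw_desc.trim_matches(WHITESPACES),
            });
            rest = after;
        }
    }
    items
}
```

With this sketch, a value such as `small.png, large.png 2x` yields two candidates, the first with an empty descriptor, and a trailing comma after the last candidate is simply skipped.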

Member

@snshn snshn left a comment

Hey Andriy,

thank you for this! I asked for a couple of cosmetic changes, will merge after that for the next release.

src/html.rs Outdated
@@ -28,6 +28,8 @@ struct SrcSetItem<'a> {
descriptor: &'a str,
}

const WHITESPACES: &'static [char] = &['\t', '\n', '\x0c', '\r', ' '];
Member

Do we need \x0b here by any chance as well? Python's string.whitespace includes \x0b and https://doc.rust-lang.org/reference/whitespace.html is even wider, but the reference spec https://infra.spec.whatwg.org/#ascii-whitespace doesn't include it, so we're probably fine.

I also think it needs to be renamed to something like SRCSET_WHITESPACES since it's only used for srcset, and the file is html.rs (all things related to HTML).
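
As a quick sanity check of the `\x0b` question, the sketch below compares the hand-written list against Rust's standard library. `char::is_ascii_whitespace` is documented as following the WHATWG Infra definition, while `char::is_whitespace` follows the broader Unicode `White_Space` property. The helper names `is_listed_whitespace` and `list_matches_std` are made up for this illustration.

```rust
// ASCII whitespace per the WHATWG Infra standard: TAB, LF, FF, CR, SPACE
const WHITESPACES: &[char] = &['\t', '\n', '\x0c', '\r', ' '];

/// Returns true if `c` is in the hand-written WHITESPACES list.
fn is_listed_whitespace(c: char) -> bool {
    WHITESPACES.contains(&c)
}

/// Returns true if the list agrees with `char::is_ascii_whitespace`
/// for every ASCII character (0x00 through 0x7F).
fn list_matches_std() -> bool {
    (0u8..=127).all(|b| {
        let c = b as char;
        is_listed_whitespace(c) == c.is_ascii_whitespace()
    })
}
```

Under these definitions `'\x0b'` (vertical tab) is rejected by both the list and `is_ascii_whitespace`, but accepted by `is_whitespace`, which matches the observation that the Rust reference's notion of whitespace is wider than the Infra one.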

Contributor Author

Hi @snshn,

so we're probably fine.

It seems so; \x0b is outside the scope of the HTML5 spec.

I also think it needs to be renamed to something like SRCSET_WHITESPACES since it's only used for srcset, and the file is html.rs (all things related to HTML).

On the other hand, while it's currently used ONLY for srcset, the WHITESPACES list makes sense for the whole HTML5 spec. I would have named it HTML5_WHITESPACES, but since, as you mentioned, it's already defined inside html.rs, I dropped the prefix. If you still insist on renaming it, let me know and I'll do it.

Member

@snshn snshn Aug 15, 2024

If it's part of the HTML5 spec, I guess it can be just WHITESPACES in this file.

Member

Let's move it below ICON_VALUES though, to keep it alphabetical?

src/html.rs Outdated
url_end -= 1;
}
offset = url_end;
// If the URL wasn't terminated by a U+002C COMMA character (,) there may also be a descriptor.
Member

I think saying just "comma" should be enough here

nit: and let's remove the period at the end, since other comments don't have it

}

#[test]
fn the_latest_without_descriptor() {
Member

I think last_without_descriptor() might be a better name

@@ -145,7 +195,7 @@ mod failing {

 assert_eq!(
     embedded_css,
-    format!("{} 1x, {} 2x,", EMPTY_IMAGE_DATA_URL, EMPTY_IMAGE_DATA_URL),
+    format!("{} 1x, {} 2x", EMPTY_IMAGE_DATA_URL, EMPTY_IMAGE_DATA_URL),
Member

nice, now it automatically fixes the format 👏🏻
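
One plausible reason the trailing comma disappears is that the output is rebuilt from parsed candidates rather than patched in place. The sketch below assumes that approach; `serialize_srcset` and its tuple-based input are hypothetical and may differ from what src/html.rs actually does.

```rust
/// Joins (url, descriptor) pairs back into a srcset value.
/// Hypothetical helper for illustration only.
fn serialize_srcset(items: &[(&str, &str)]) -> String {
    items
        .iter()
        .map(|(url, descriptor)| {
            if descriptor.is_empty() {
                // No descriptor: emit the URL on its own
                url.to_string()
            } else {
                format!("{} {}", url, descriptor)
            }
        })
        .collect::<Vec<String>>()
        // Separators go only between items, so no trailing comma is produced
        .join(", ")
}
```

Because `join` places the separator only between elements, an input that originally ended in `2x,` is written back ending in `2x`.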

src/html.rs Outdated
if url_start >= size {
break;
}
// A valid non-empty URL that does not start or end with a U+002C COMMA character (,)
Member

saying just "comma" should be enough

src/html.rs Outdated

while offset < size {
let mut has_descriptor = true;
// Zero or more ASCII whitespace + skip leading U+002C COMMA character (,)
Member

Just "comma" should be enough. And "ASCII whitespace" is too specific, we have those whitespaces defined in the array, we have that part described in the file already, just "whitespace" would be easier to read.

Something like this:

Zero or more whitespaces + skip leading comma

@snshn snshn merged commit 64e84e4 into Y2Z:master Aug 15, 2024
9 checks passed
@rakhnin rakhnin deleted the srcset-fix branch August 15, 2024 20:48