Non-space whitespace characters are removed from anchor URL #266

ranvis · 2018-12-04T10:11:44Z

Leading and trailing whitespace characters that are not HTML space characters are removed from the link value along with the space characters, so extracting or following the link fails.

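use feature 'say';
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
# U+000B LINE TABULATION: whitespace, but not an HTML5 space character
$mech->update_html(qq'<a href="\x0b">link</a>');
say length $mech->links->[0]->URI->as_string; # 0
# U+3000 IDEOGRAPHIC SPACE: likewise not an HTML5 space character
$mech->update_html(qq'<a href="\x{3000}">link</a>');
say length $mech->links->[0]->URI->as_string; # 0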

According to the HTML5 spec, the space characters are /[\x09\x0a\x0c\x0d\x20]/:

https://www.w3.org/TR/html52/infrastructure.html#infrastructure-urls
A string is a valid URL potentially surrounded by spaces if, after stripping leading and trailing white space from it, it is a valid URL.
A string is a valid non-empty URL potentially surrounded by spaces if, after stripping leading and trailing white space from it, it is a valid non-empty URL.

Re: stripping leading and trailing white space
https://www.w3.org/TR/html52/infrastructure.html#strip-leading-and-trailing-white-space
When a user agent is to strip leading and trailing white space from a string, the user agent must remove all space characters that are at the start or end of the string.

Re: space characters
https://www.w3.org/TR/html52/infrastructure.html#space-characters
The space characters, for the purposes of this specification, are U+0020 SPACE, U+0009 CHARACTER TABULATION (tab), U+000A LINE FEED (LF), U+000C FORM FEED (FF), and U+000D CARRIAGE RETURN (CR).

URI->new() is causing this. As its documentation says, it removes leading and trailing white space characters (\s), and what \s matches depends on the version of the Unicode spec that each version of Perl conforms to.
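
To make the difference concrete, here is a minimal sketch in core Perl. It is not URI's actual code; the helper names strip_perl_ws and strip_html5_space are hypothetical and exist only for this illustration. It compares trimming with \s against trimming with the HTML5 space character class quoted above.

```perl
use strict;
use warnings;

# Hypothetical helper: trim with Perl's \s, which on a character string
# also matches non-ASCII whitespace such as U+3000 IDEOGRAPHIC SPACE.
sub strip_perl_ws {
    my ($s) = @_;
    $s =~ s/^\s+//;
    $s =~ s/\s+$//;
    return $s;
}

# Hypothetical helper: trim only the five HTML5 space characters
# (tab, LF, FF, CR, space).
sub strip_html5_space {
    my ($s) = @_;
    $s =~ s/^[\x09\x0a\x0c\x0d\x20]+//;
    $s =~ s/[\x09\x0a\x0c\x0d\x20]+$//;
    return $s;
}

# Print the remaining length under each rule for two sample href values.
for my $href ("\x{3000}", " \x{3000}\t") {
    printf "%d %d\n",
        length(strip_perl_ws($href)),
        length(strip_html5_space($href));
}
```

Under these assumptions, both sample values end up empty with \s but keep the one U+3000 character with the HTML5 class, which is the behavior the spec text above describes.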


oalders · 2018-12-04T13:40:00Z

So, is the behaviour of URI incorrect here or do we need an option to define what URI considers to be whitespace at https://metacpan.org/source/ETHER/URI-1.74/lib/URI.pm#L43-44?

ranvis · 2018-12-05T13:59:49Z

The stripping code was committed in 1996:
https://metacpan.org/source/GAAS/libwww-perl-5.00/lib/URI/URL.pm#L90-93
(aside from libwww-perl 0.20~0.30)
It was added because an appendix of the old RFC 1738 says that URLs in email and similar contexts may be surrounded by extra characters that are not themselves part of the URL.
Now in 2018, I think the behavior could still be called consistent if URI trimmed spaces the way the location bar of a web browser does (since it no longer mentions the RFC). But as a module, isn't it being overly accommodating in the era of Unicode regexes?

The following crafted example does not work either. I think URI is now used more widely than it was originally designed for, and that the current stripping is somewhat obsolete.

$mech->update_html(qq'<a href="&lt;URL:&gt;">link</a>');
say length $mech->links->[0]->URI->as_string; # 0
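
For readers unfamiliar with the `<URL:...>` convention from the RFC 1738 appendix, the following is a rough sketch of why that href ends up empty. It is an approximation written for illustration, not a copy of URI's source; the helper name unwrap_url and the exact regexes are assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: remove a <URL:...>, <...>, or "..." wrapper
# around a URL, in the spirit of the RFC 1738 appendix.
sub unwrap_url {
    my ($s) = @_;
    $s =~ s/^<(?:URL:)?(.*)>$/$1/;   # strip <URL:...> or <...>
    $s =~ s/^"(.*)"$/$1/;            # strip surrounding double quotes
    return $s;
}

# After HTML entity decoding, the href in the example above is "<URL:>".
print length(unwrap_url('<URL:>')), "\n";               # prints 0
print unwrap_url('<URL:http://example.com/>'), "\n";    # prints http://example.com/
```

With unwrapping like this, an href whose entire value is the wrapper itself is reduced to an empty string, which matches the length of 0 reported above.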
