Leading and trailing whitespace characters that are not HTML5 space characters are also stripped from the link value when space characters are removed, so extracting or following the link fails.
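use feature 'say';
use WWW::Mechanize;
my $mech = WWW::Mechanize->new();
# U+000B (vertical tab) is not an HTML5 space character
$mech->update_html(qq'<a href="\x0b">link</a>');
say length $mech->links->[0]->URI->as_string; # 0
# U+3000 (ideographic space) is not an HTML5 space character either
$mech->update_html(qq'<a href="\x{3000}">link</a>');
say length $mech->links->[0]->URI->as_string; # 0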
According to the HTML5 spec, the space characters are /[\x09\x0a\x0c\x0d\x20]/:
https://www.w3.org/TR/html52/infrastructure.html#infrastructure-urls
A string is a valid URL potentially surrounded by spaces if, after stripping leading and trailing white space from it, it is a valid URL.
A string is a valid non-empty URL potentially surrounded by spaces if, after stripping leading and trailing white space from it, it is a valid non-empty URL.
Re: space characters https://www.w3.org/TR/html52/infrastructure.html#space-characters
The space characters, for the purposes of this specification, are U+0020 SPACE, U+0009 CHARACTER TABULATION (tab), U+000A LINE FEED (LF), U+000C FORM FEED (FF), and U+000D CARRIAGE RETURN (CR).
URI->new() is causing this. As its documentation says, it removes white space characters (\s), and what \s matches depends on the version of the Unicode standard that each version of Perl conforms to.
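To make the difference concrete, here is a minimal sketch in plain Perl comparing a trim limited to the five HTML5 space characters with a \s-based trim. Both helper names (strip_html5_spaces, strip_perl_whitespace) are hypothetical and are not part of WWW::Mechanize or URI; the second only approximates the kind of stripping described above.

```perl
use strict;
use warnings;

# Hypothetical helper: strip only the five HTML5 space characters
# (tab, LF, FF, CR, space) from both ends of an attribute value.
sub strip_html5_spaces {
    my ($value) = @_;
    $value =~ s/\A[\x09\x0a\x0c\x0d\x20]+//;
    $value =~ s/[\x09\x0a\x0c\x0d\x20]+\z//;
    return $value;
}

# Hypothetical helper: strip anything Perl's \s matches, which for
# Unicode strings includes characters such as U+3000.
sub strip_perl_whitespace {
    my ($value) = @_;
    $value =~ s/\A\s+//;
    $value =~ s/\s+\z//;
    return $value;
}

# Print the remaining length under each trimming rule.
for my $href ("\x{3000}", " \x{3000}\t", " /path ") {
    printf "html5=%d perl=%d\n",
        length strip_html5_spaces($href),
        length strip_perl_whitespace($href);
}
```

With the HTML5-only trim, a href consisting of U+3000 keeps its one character, while the \s-based trim reduces it to an empty string; an ordinary value such as " /path " comes out the same under both.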
The stripping code was committed in 1996 (aside from libwww-perl 0.20~0.30): https://metacpan.org/source/GAAS/libwww-perl-5.00/lib/URI/URL.pm#L90-93
It was added because the appendix of the old RFC 1738 says that URLs embedded in email and similar text may be wrapped in extra characters that are not themselves part of the URL.
Now, in 2018, I think the behavior could still be called consistent if URI trimmed spaces the way a web browser's location bar does (since it no longer mentions the RFC). But for a module, isn't it being overly accommodating in the era of Unicode-aware regexes?
The following crafted example does not work either. I think URI is now used more widely than it was originally designed for, and that the current stripping is somewhat obsolete.
$mech->update_html(qq'<a href="<URL:>">link</a>');
say length $mech->links->[0]->URI->as_string; # 0