Attributes that have no value get their name as their value #17

simbabque opened this issue Oct 3, 2020 · 6 comments

@simbabque

When investigating libwww-perl/WWW-Mechanize#125 I noticed that the following HTML parses weirdly.

<input type="hidden" name="foo" value>

According to the HTML spec, an attribute that is written without an equals sign (=) and a value is implicitly the empty string, so on this input element we should be parsing value to an empty string.

Empty attribute syntax
Just the attribute name. The value is implicitly the empty string.

Instead of making it empty, we set it to "value".

I've looked into it and got as far as finding that get_tag returns a data structure that contains the wrong value:

\ [
    [0] "input",
    [1] {
        /       "/",
        name    "foo",
        type    "hidden",
        value   "value"
    },
    [2] [
        [0] "type",
        [1] "name",
        [2] "value",
        [3] "/"
    ],
    [3] "<input type="hidden" name="foo" value />"
]

Unfortunately I am out of my depth with the actual C code for the parser, but I think we should be returning an empty string for the value attribute, as well as for all other empty attributes.

I wrote the following test to demonstrate the problem.

use strict;
use warnings;

use HTML::TokeParser ();
use Test::More;
use Data::Dumper;

ok(
    !get_tag(q{})->{value},
    'No value when there was no value'
);    # key does not exist

{
    # this fails because value is 'value'
    my $t = get_tag(q{value});
    ok(
        !$t->{value},
        'No value when value attr has no value'
    ) or diag Dumper $t;    
}

ok(
    !get_tag(q{value=""})->{value},
    'No value when value attr is an empty string'
);    # key is an empty string

is(
    get_tag(q{value="bar"})->{value}, 
    'bar', 
    'Value is bar'
);    # this obviously works

sub get_tag {
    my $attr = shift;
    return HTML::TokeParser->new(\qq{<input type="hidden" name="foo" $attr />})->get_tag->[1];
}

done_testing;

simbabque changed the title from "input tag with value attribute that has no value should not parse as value="value"" to "Attributes that have no value get their name as their value" on Oct 3, 2020

@oalders
Member

oalders commented Oct 4, 2020

Thanks for the test case @simbabque!

@oalders
Member

oalders commented Oct 4, 2020

It might be helpful to add this as a TODO test as well.
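
A TODO test for this could look roughly like the sketch below. It reuses the get_tag helper from the test script in the issue description and wraps only the failing assertion in a Test::More TODO block, so the suite keeps passing while the behaviour is unresolved. The TODO message and the assertion description are my own wording, not something taken from the repository.

```perl
use strict;
use warnings;

use HTML::TokeParser ();
use Test::More;

# Helper from the original report: returns the attribute hash of the tag.
sub get_tag {
    my $attr = shift;
    return HTML::TokeParser->new(
        \qq{<input type="hidden" name="foo" $attr />}
    )->get_tag->[1];
}

TODO: {
    # Assumption: the message text is illustrative only.
    local $TODO = 'Attributes without a value get their name as their value (#17)';

    # Currently fails because the parser reports "value" here.
    is(
        get_tag(q{value})->{value},
        q{},
        'attribute without a value parses as the empty string'
    );
}

done_testing;
```

Inside a TODO block a failing assertion is reported as an expected failure, and an unexpected pass is flagged without breaking the run.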

@andyjack

I've also run into this issue a couple of times - digging deeper into HTML::Parser reveals that setting the value to the name is intentional! There is an option to control what value is "parsed" when an attribute has no value.

; perl -MHTML::TokeParser -MDDP -lE 'my $p = HTML::TokeParser->new( doc => \qq{<input type="text" name="abc123" value>}, boolean_attribute_value=>"no value!")->get_tag->[1]; p $p'
{
    name    "abc123",
    type    "text",
    value   "no value!"
}

The current design of "return the name" doesn't seem sensible to me - having the default setting for the option be undef or q{} would align the module with the HTML spec, but that would probably affect downstream code.

Here's where setting the value to the name happens in the C code.
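
For callers who want the spec-like result today, here is a minimal sketch that passes the same boolean_attribute_value option shown in the one-liner above, but with an empty string. The helper name attrs_with_empty_default is made up for this example.

```perl
use strict;
use warnings;

use HTML::TokeParser ();

# Hypothetical helper: return the attribute hash of the first tag, with
# valueless attributes reported as the empty string instead of their name.
sub attrs_with_empty_default {
    my $html = shift;
    my $p = HTML::TokeParser->new(
        doc                     => \$html,
        boolean_attribute_value => q{},
    );
    return $p->get_tag->[1];
}

my $attr = attrs_with_empty_default(q{<input type="text" name="abc123" value>});

if ( defined $attr->{value} && $attr->{value} eq q{} ) {
    print "value is the empty string\n";
}
else {
    print "value is something else\n";
}
```

This is opt-in per parser object, so it does not change what existing downstream code sees.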

@wolfsage
Copy link


This part of the parser specifically mentions 'boolean' - I believe it's referring to this:

https://html.spec.whatwg.org/multipage/common-microsyntaxes.html#boolean-attributes

If the attribute is present, its value must either be the empty string or a value that is an ASCII case-insensitive match for the attribute's canonical name, with no leading or trailing whitespace.

and

Example:

Here is an example of a checkbox that is checked and disabled. The checked and disabled attributes are the boolean attributes.

<label><input type=checkbox checked name=cheese disabled> Cheese</label>

This could be equivalently written as this:

<label><input type=checkbox checked=checked name=cheese disabled=disabled> Cheese</label>

I think what this means is that HTML::Parser needs to be aware of the types of the attributes it's parsing, which makes it seem like the fix won't be so easy?

@oalders
Member

oalders commented Jun 23, 2021

This blog post states that there are 25 attributes which are boolean. https://meiert.com/en/blog/boolean-attributes-of-html/

If that's correct, they could be special-cased, but from my quick digging I didn't find a definitive list elsewhere, so I'm not confident in this yet.
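
To make the special-casing idea concrete, here is a rough sketch done outside the parser rather than in its C code. It sets boolean_attribute_value to a marker string, then rewrites marked attributes: known boolean attributes get their name, everything else gets the empty string. The %IS_BOOLEAN set is a small illustrative subset, not the list of 25, and both the marker and the normalized_attrs helper are hypothetical names invented for this example.

```perl
use strict;
use warnings;

use HTML::TokeParser ();

# Illustrative subset only; NOT a definitive list of boolean attributes.
my %IS_BOOLEAN = map { $_ => 1 }
    qw(checked disabled readonly required selected multiple);

# Hypothetical marker meaning "the attribute had no value in the source".
# Assumes no real attribute value in the document equals this string.
my $NO_VALUE = "\x{0}no-value\x{0}";

sub normalized_attrs {
    my $html = shift;
    my $p = HTML::TokeParser->new(
        doc                     => \$html,
        boolean_attribute_value => $NO_VALUE,
    );
    my $attr = $p->get_tag->[1];

    for my $name ( keys %{$attr} ) {
        next unless $attr->{$name} eq $NO_VALUE;

        # Boolean attributes keep the "name as value" form,
        # all other valueless attributes become the empty string.
        $attr->{$name} = $IS_BOOLEAN{$name} ? $name : q{};
    }
    return $attr;
}

my $attr = normalized_attrs(q{<input type="checkbox" checked value>});
printf "checked='%s' value='%s'\n", $attr->{checked}, $attr->{value};
```

Whether the real fix should live in the parser itself, and which attribute list it should trust, is still the open question here.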

@andyjack

Thanks for the info about boolean attributes - the "return the name" behavior makes sense now.

Does a user of HTML::Parser care about differentiating between <... checked ...>, <... checked="checked" ...> and <... checked="" ...> being parsed? This might affect the fix for the issue.
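
To see what a caller can currently tell apart, a small sketch like this prints the parsed checked value for each of the three spellings with default parser settings. The checked_value helper is a made-up name for this example.

```perl
use strict;
use warnings;

use HTML::TokeParser ();

# Hypothetical helper: parse one tag and return its "checked" attribute.
sub checked_value {
    my $html = shift;
    return HTML::TokeParser->new( \$html )->get_tag->[1]{checked};
}

for my $html (
    q{<input type="checkbox" checked>},
    q{<input type="checkbox" checked="checked">},
    q{<input type="checkbox" checked="">},
) {
    printf "%-45s => '%s'\n", $html, checked_value($html);
}
```

Going by the behaviour described in this thread, the first two spellings should come back identical, so only the empty-string form is distinguishable from the attribute hash alone.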

andyjack added a commit to andyjack/HTML-Parser that referenced this issue Sep 22, 2021
oalders pushed a commit that referenced this issue Jul 19, 2023
oalders pushed a commit that referenced this issue Jul 28, 2023