Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Edge case with trailing text unreported #23

Open
jackdeguest opened this issue Jan 10, 2022 · 3 comments
Open

Edge case with trailing text unreported #23

jackdeguest opened this issue Jan 10, 2022 · 3 comments

Comments

@jackdeguest
Copy link

jackdeguest commented Jan 10, 2022

With HTML::Parser v3.76.

Consider the following chunk of data:

Hello world! <span class="highlight">Isn't this wonderful</span> really?

Creating an object, such as:

# using curry and assuming this runs within a module that acts as a simple wrapper, nothing fancy.
my $p = HTML::Parser->new(
        api_version => 3, 
        start_h => [ $self->curry::add_start, 'self, tagname, attr, attrseq, text, column, line, offset, offset_end'],
        end_h   => [ $self->curry::add_end,   'self, tagname, attr, attrseq, text, column, line, offset, offset_end' ],
        marked_sections => 1,
        comment_h => [ $self->curry::add_comment, 'self, text, column, line, offset, offset_end'],
        declaration_h => [ $self->curry::add_declaration, 'self, text, column, line, offset, offset_end'],
        default_h => [ $self->curry::add_default, 'self, tagname, attr, attrseq, text, column, line, offset, offset_end'],
        text_h => [ $self->curry::add_text, 'self, text, column, line, offset, offset_end'],
        empty_element_tags => 1,
        end_document_h => [ $self->curry::end_document, 'self, skipped_text'],
);
$p->parse( $html );
sub add_text
{
    my $self = shift( @_ );
    print( "got '", $_[1], "'\n" );
}

And this would yield:

got 'Hello world! '
got 'Isn't this wonderful'

However, ' really?' is not being reported.
One has to explicitly call $p->eof to have the trailing text reported.
If this is an intended feature, then it ought to be made clear in the documentation. However, I think one should not have to call eof to get that last trailing text.

@oalders
Copy link
Member

oalders commented Jan 10, 2022

The docs say:

Calling the $p->eof method outside a handler callback will flush any remaining buffered text (which triggers the text event if there is any remaining text).

I think that seems reasonable in your case, where the fragment is not wholly enclosed by tags. I had to go back and check one of my projects and I see that I'm calling eof here: https://metacpan.org/release/OALDERS/HTML-Restrict-v3.0.0/source/lib/HTML/Restrict.pm#L317 (likely for similar reasons).

eof is in the SYNOPSIS as well, having said that, if you'd like to suggest an improvement to the documentation, that would be welcome. I think it could be clearer.

@jackdeguest
Copy link
Author

The docs say:

Calling the $p->eof method outside a handler callback will flush any remaining buffered text (which triggers the text event if there is any remaining text).

I think that seems reasonable in your case, where the fragment is not wholly enclosed by tags. I had to go back and check one of my projects and I see that I'm calling eof here: https://metacpan.org/release/OALDERS/HTML-Restrict-v3.0.0/source/lib/HTML/Restrict.pm#L317 (likely for similar reasons).

eof is in the SYNOPSIS as well, having said that, if you'd like to suggest an improvement to the documentation, that would be welcome. I think it could be clearer.

I think the doc makes it clear for sure, but expectations might be different. If one calls parse on a string, and the parser reaches the end of that string, just like if it reaches the end of a file, wouldn't it be reasonable to assume the buffer would need to be flushed? Maybe there are good reason not to do so?

@oalders
Copy link
Member

oalders commented Jan 10, 2022

Maybe there are good reason not to do so?

Maybe some digging through the original commits would make this clear? I wrote none of this code, so I actually just don't know.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants