
Tokenizing bug. Some tokens are split into 2 #27

Open
florian-pe opened this issue Jul 14, 2022 · 11 comments
Labels
bug Something isn't working

Comments


florian-pe commented Jul 14, 2022

This problem happens on a particular webpage
https://www.radiofrance.fr/franceinter/podcasts

Here is my golfed script, which demonstrates the bug:

#!/usr/bin/perl
package myparser;
use strict;
use warnings;
use v5.10;
use base qw(HTML::Parser);

# Called by HTML::Parser for each run of text found between tags.
sub text {
    my ($self, $text, $is_cdata) = @_;
    say "\"$text\"";
}

package main;
use strict;
use warnings;

my $p = myparser->new;
$p->parse_file(shift // exit);

Unfortunately, I can't post a golfed HTML snippet, because when I try to reduce the size of the webpage, the bug disappears. So I will have to explain the exact steps to reproduce the bug.

In Chromium, go to https://www.radiofrance.fr/franceinter/podcasts.
Then load the entire webpage by scrolling to the bottom and clicking "VOIR PLUS DE PODCASTS" repeatedly until everything is loaded.
Then save the webpage.

After that you just have to execute the script example with the downloaded page as argument.

The script prints all the text that appears outside of any tag, like this: /tag>TEXT HERE<othertag

THE BUG
The bug is that some "text elements" are split in two.
This happens for several podcast names. "Sur les épaules de Darwin" is one of those.

You can see that the script will output

"Sur les épaules de"
" Darwin"

instead of just "Sur les épaules de Darwin"
This also happens to "Sur Les routes de la musique" (just below) and a few others.

Now, I found that when deleting <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">, right at the top of <head></head>, the bug disappears. The same happens when deleting just ; charset=UTF-8

The problem is that the bug also disappears when I leave the charset as is and delete a bunch of the stuff inside <head></head>, or when I delete a lot of the divs corresponding to the other podcast entries of the index.

This is all the information that I have.

@oalders oalders added the bug Something isn't working label Jul 14, 2022

oalders commented Jul 14, 2022

@florian-pe thanks for this. Out of curiosity, do you have the same issue if you change the length of $chunk at https://metacpan.org/release/OALDERS/HTML-Parser-3.78/source/lib/HTML/Parser.pm#L94?
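For context, the parse_file sub linked above boils down to a loop that reads the file in fixed-size chunks and hands each one to the parser. A self-contained sketch of that pattern (parse_chunk here is a hypothetical stand-in for the real parser, not HTML::Parser's actual code), showing why a large page arrives as many 512-byte pieces:

```perl
#!/usr/bin/perl
# Rough sketch of the read-in-chunks loop used by parse_file;
# parse_chunk is a hypothetical stand-in that just records each chunk.
use strict;
use warnings;

my @chunks;
sub parse_chunk { push @chunks, $_[0] }

# Write a small test file so the example is self-contained.
my $file = "chunk_demo.txt";
open my $out, ">", $file or die $!;
print $out "x" x 1300;    # larger than two 512-byte chunks
close $out;

open my $fh, "<", $file or die $!;
my $chunk;
# Read up to 512 bytes at a time and hand each chunk to the parser.
# A token that straddles a 512-byte boundary arrives in two calls.
while (read($fh, $chunk, 512)) {
    parse_chunk($chunk);
}
close $fh;
unlink $file;

printf "%d chunks: %s bytes\n", scalar @chunks, join("+", map length, @chunks);
# prints "3 chunks: 512+512+276 bytes"
```

With this structure, whether a token survives intact depends entirely on whether the parser buffers partial tokens across chunk boundaries.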

@florian-pe

@oalders You are right, it does fix the problem.
If I set the chunk size to 1024 bytes instead of 512, I still get the bug. But if I set it to 1_000_000, which is larger than the size of the webpage (about 873 KB), then there is no more splitting. At least not on the two particular elements I cited above.


oalders commented Jul 18, 2022

@florian-pe what happens if you set unbroken_text(1) on your parser object?

my $p = myparser->new;
$p->unbroken_text(1);
$p->parse_file(shift // exit);

@florian-pe

@oalders Yes it fixes the bug.
It also produces the exact same output as

my $p = myparser->new;
$p->parse(do { local $/; <> });

I have read the man page entry for $p->unbroken_text, but I don't understand at all what it does.
But regardless, are you suggesting that I was misusing the library and that this is in fact "not a bug but a feature"? Which is entirely possible.


oalders commented Jul 21, 2022

Yes @florian-pe, I think for your use case you want this option enabled in the parser. So, it does not appear to me to be a bug.

@florian-pe

@oalders I don't understand how it's not a bug if the problem originates in the subroutine parse_file() not correctly handling buffered input. That is, it does not deal with the fact that eventually the boundary between two consecutively read chunks of bytes will fall in the middle of a token, and that case is apparently not handled correctly, because the end result is that some tokens are split.
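The handling described above is the classic carry-over buffer: keep any trailing partial token and prepend it to the next chunk, so chunk boundaries never split tokens. A toy sketch of that idea (a deliberately crude regex tokenizer, not HTML::Parser's actual implementation), fed deliberately tiny chunks so a boundary falls mid-token:

```perl
#!/usr/bin/perl
# Toy carry-over buffer: complete tokens are emitted, an incomplete
# trailing token is kept in $buf until the next chunk arrives.
use strict;
use warnings;
use v5.10;

my @tokens;
my $buf = "";

sub feed_chunk {
    my ($chunk) = @_;
    $buf .= $chunk;
    # Emit complete tokens: a full <...> tag, or text followed by a '<'.
    while ($buf =~ s/^(<[^>]*>)// or $buf =~ s/^([^<]+)(?=<)//) {
        push @tokens, $1;
    }
    # Whatever remains in $buf may be an incomplete token; keep it.
}

sub feed_eof {
    push @tokens, $buf if length $buf;  # flush leftover text at EOF
    $buf = "";
}

my $html = "<span>splitted token</span>";

# Feed the document in 5-byte chunks, so boundaries fall mid-token.
feed_chunk($_) for unpack "(a5)*", $html;
feed_eof();

say for @tokens;
# prints:
# <span>
# splitted token
# </span>
```

Even though "splitted token" is spread over four separate chunks, it comes out as a single token because the incomplete tail is always carried forward.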


oalders commented Jul 21, 2022

I didn't write the code, but just for some history: this sub entered the codebase in 1996 with a chunk size of 2048:

aeb6d0ba14e680e6#diff-abe42eabebfc8528859aa468da65d562ea1c37c368905ddc25d8b10ad1f801b0R298

Not sure how relevant that is, but it's a fun fact!

I had a closer look at the docs for unbroken_text and, as advertised, it does seem that this code should not be splitting tokens even with that option disabled. If you could distill this down to a small test case that demonstrates where tokens are being split, that would be the most helpful way to look at this, I think.


florian-pe commented Jul 23, 2022

Alright, here's a simple example.
It's the same golfed script I used to demonstrate the bug, but slightly modified.

#!/usr/bin/perl
package myparser;
use strict;
use warnings;
use v5.10;
use base qw(HTML::Parser);

sub text {
    my ($self, $text, $is_cdata) = @_;
    say "\"$text\"";
}

package main;
use strict;
use warnings;

my $begin = <<'END';
<!DOCTYPE html>
<html>
<head>
</head>
<body>
END

my $end = <<'END';
<span>splitted token</span>
</body>
</html>	
END

my $num = shift // exit;

open my $fh, ">", "page_test.html" or die $!;
print $fh $begin;
print $fh "<span>", ("a" x  $num) ,"</span>";
print $fh $end;
close $fh;

my $p = myparser->new;
$p->parse_file("page_test.html");

We can use this one-liner to find the number of "a" characters needed so that the token "splitted token" gets split:

$ perl -E 'for $num (0 .. 2000) { my @out = map { chomp; $_ } qx{ ./golfed.pl $num }; if (!grep {/splitted token/} @out) { say $num; last } }'
434

Then if we run $ ./golfed.pl 434,
we will see that the token "splitted token" is indeed split in two.

If we count the number of characters with this

$ perl -E '$file = do { local $/; <> }; $count=1; for (split "", $file) { say "$count\t$_"; $count++ }' page_test.html | less

we see that the 512th character happens to be the last "n" of "splitted token".

I redid the same little experiment, but with <!DOCTYPE html> removed from the generated HTML page, and the bug happens for $num == 450.
And again, the 512th character is the same as in the previous test, i.e. the last "n", the last character of the string "splitted token".

508     t
509     o
510     k
511     e
512     n
513     <
514     /
515     s
516     p
517     a
518     n

I hope that helps, and that it convinces you that it is indeed a bug.


oalders commented Jul 28, 2022

Thanks for this @florian-pe. That does look like a bug to me. Are you motivated to fix this?


florian-pe commented Aug 4, 2022

That's a nice challenge. I tried to add a print statement and it compiles OK. The problem is that when I try to use the hand-compiled module in a script, even with use lib "/home/user/perl/scripts/parsing_html/New Folder/HTML-Parser_BUILD/blib/lib/";, the script uses the normal CPAN module installed with cpanm instead.

So I cannot even begin to poke around in the C code.

@florian-pe

Ok, never mind.
I think this was the problem:
https://perldoc.perl.org/XSLoader#LIMITATIONS
I have uninstalled my cpanm version, did sudo make install, and now I can add print statements.
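For anyone hitting the same wall: an alternative to uninstalling and reinstalling is to run the script directly against the build tree with the blib pragma. It adds both blib/lib and blib/arch to @INC, whereas a plain use lib of blib/lib misses the arch directory containing the newly built shared object, which is the XSLoader limitation linked above. Paths here are hypothetical:

```shell
# Build the checkout into blib/ (paths are examples; adjust to your tree)
cd HTML-Parser_BUILD
perl Makefile.PL && make

# -Mblib prepends blib/lib AND blib/arch to @INC, so XSLoader picks up
# the freshly built .so instead of the cpanm-installed copy.
perl -Mblib /path/to/your_script.pl page.html
```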
