Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential speedups for HTML::Entities::encode_entities #30

Open
petdance opened this issue Aug 13, 2022 · 2 comments
Open

Potential speedups for HTML::Entities::encode_entities #30

petdance opened this issue Aug 13, 2022 · 2 comments

Comments

@petdance
Copy link
Contributor

I think we could get some speedups in encode_entities by caching some common operations. The examples below would most benefit users encoding lots of data that is heavy on non-named, numeric entities. We may have similar benefits elsewhere.

The first is to cache the results of the sprintf in num_entity. This could be done with no effect on behavior, in exchange for some hash entries. Here are my experiments.

use Benchmark ':all';
use HTML::Entities;

timethese( 300_000,
    {
        original => sub {
            map { HTML::Entities::num_entity($_) } 1..100
        },
        cached => sub {
            map { cached_num_entity($_) } 1..100
        },
    }
);


my %cache;
sub cached_num_entity {
    return $cache{$_[0]} ||= sprintf("&#x%X;", ord($_[0]));
}

gives these results:

    cached: 10 wallclock secs (10.52 usr +  0.00 sys = 10.52 CPU) @ 28517.11/s (n=300000)
  original: 15 wallclock secs (14.28 usr +  0.00 sys = 14.28 CPU) @ 21008.40/s (n=300000)

Bottom line: Hash lookup is faster than the sprintf, so let's cache it.


The other tweak would be to cache the call to num_entity inside the main regex in encode_entities. Swap this:

$$ref =~ s/([^\n\r\t !\#\$%\(-;=?-~])/$char2entity{$1} || num_entity($1)/ge;

for this

$$ref =~ s/([^\n\r\t !\#\$%\(-;=?-~])/$char2entity{$1} ||= num_entity($1)/ge;

This would have the side effect of modifying the %char2entity hash, which is visible to the outside world. If that wasn't OK, we could have a private copy of the hash specifically so it would be modifiable. The potential downside (or upside?) of that would be that if someone outside the module modified %char2entity, it would have no effect on encode_entities.

For benchmarking encode_entities, I used this:

my $ent = chr(8855);
my $num = chr(8854);
my $unencoded = "$ent$num" x 10;

my $text = <<"HTML";
text in "$unencoded"
HTML

timethese( 1_000_000,
    {
        encode => sub { my $x = encode_entities($text) },
    }
);

Results:

42,281/s for the original unmodified encode_entities.

52,746/s if the encode_entities used the caching num_entity first mentioned, but the main regex is unchanged.

64,769/s if the main conversion regex caches the results of calls to num_entity in %char2entity. Changing this to call the caching num_entity gave no noticeable improvement.


I hope these give some ideas. encode_entities is an absolute workhorse at my job (we generate everything with Template Toolkit), and I'm sure for many many others. Any speedup would have wide-ranging benefits.

@oalders
Copy link
Member

oalders commented Aug 21, 2022

How do you envision the cache? An in-memory LRU or is this a hash that can grow quite large? Have you tried seeing what kind of speed gains can be had via Memoize?

@petdance
Copy link
Contributor Author

Just a hash that could theoretically but not likely grow quite large, as I showed in the code above.

I didn't bother with Memoize because I figure we wouldn't want to add another dependency, but now I see that Memoize has been core since 5.7.3, so not an issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants