Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add HTML5/UTF-8 spec-compliant text decoder. #14927

Draft
wants to merge 34 commits into
base: master
Choose a base branch
from

Conversation

dmsnell
Copy link
Contributor

@dmsnell dmsnell commented Jul 11, 2024

This is a first draft implementing a spec-compliant HTML5 text decoder. Unlike the other HTML/XML decoding functions, this is non-configurable.

  • Calling code must first convert input if it's not already in UTF-8.
  • All outputs are in UTF-8.
  • Calling code must indicate if the input was found inside an HTML attribute or a text node, as this determines how certain character references are decoded.

It is assumed that the input represents the full attribute value or text node, and any invocations with a substring of the full value may decode improperly at the end of the string.

The upsides are

  • This decodes text the same way an HTML5-compliant browser would.
  • This step method can be used to identify where character references are and if they are valid. This makes it possible to analyze a string without allocating a full copy.

This PR only currently demonstrates the step function which decodes a character reference at the given string cursor. A full decoder should only involve a loop, code to find the next &, and a string builder.

'Cats & Dogs ¬ <tags>' === decode_html(HTML_TEXT, '<img alt="Cats &amp; Dogs &not; &lt;tags&gt">', 10, 33);

'🅰' === decode_html_ref(HTML_TEXT, '&#x1F170;', 0, $match_length);
9 === $match_length;

'🅰' === decode_html_ref(HTML_TEXT, '&#x1F170', 0, $match_length);
8 === $match_length;

'🅰' === decode_html_ref(HTML_ATTRIBUTE, '&#x1F170;', 0, $match_length);
9 === $match_length;

null === decode_html_ref(HTML_ATTRIBUTE, '&#x1F170');

null === decode_html_ref(HTML_ATTRIBUTE, '&hellip,');
'' === decode_html_ref(HTML_TEXT, '&hellip,');
'' === decode_html_ref(HTML_TEXT, '&hellip;');
null === decode_html_ref(HTML_TEXT, '&hellip=');

Performance

I've tested different methods of decoding raw content as well as normative HTML document.
Testing was performed on an EPYC 2.0 GHz server with 64 GB ram.

Decoder MB/s Numeric MB/s Named Accurate?
html_entity_decode( …, ENT_HTML5 | ENT_QUOTES | ENT_SUBSTITUTE ) 103 78.5 🚫
decode_html() 96 29.2
\DOM\HTMLDocument 43 17.1
WordPress' WP_HTML_Decoder 6.1 1.28

Questions

  • When providing offset, must length always be present? the character reference decoder should still know where the span of text comprising the attribute or text node ends, otherwise it might assume that the end is at the end of the full string, not just the local region (e.g. within an attribute)

This is a first draft implementing a spec-compliant HTML5 text decoder.
Unlike the other HTML/XML decoding functions, this is non-configurable.
@alecpl
Copy link

alecpl commented Jul 14, 2024

My 2cents. The function name is bad and too long. It has some commonalities with html_entity_decode(), but is very specific. I think it's too specific. Also, since we have HTML5 parser in the dom extension now, I wonder whether something like this should be part of this extension. Maybe with some additional functions similar to what the standard extension has, but better.

Anyway, I suggest to discuss this on the internal mailing list. In any case this will require an RFC.

@dmsnell
Copy link
Contributor Author

dmsnell commented Jul 14, 2024

thank you @alecpl - for now this is intended to support an RFC once I can get one written up.

it's very specific right now and I assume the name will change, but the specifics are there to clarify what it does and doesn't do.

in the RFC, I will draw a distinction in usability over the DOM solution, which would require additional setup that is easy to overlook, such as duplicating the string inside additional HTML inside the right tag or attribute, only then to parse the structure again as a DOM, then to extract the original string through DOM methods.

for the same reason that html_entity_decode() is chosen over the existing DOMDocument, this addresses the need to conveniently decode text, but is designed to do so in a spec-compliant way. (once porting from the user-space PHP version is complete).

@nielsdos
Copy link
Member

nielsdos commented Jul 14, 2024

You can reuse the internals of the tokenizer and parser in ext-dom, otherwise we duplicate part of parsing logic which doesn't seem optimal.

```
In file included from /home/runner/work/php-src/php-src/Zend/zend.h:27,
                 from /home/runner/work/php-src/php-src/main/php.h:31,
                 from /home/runner/work/php-src/php-src/ext/standard/html.c:37:
/home/runner/work/php-src/php-src/ext/standard/html.c: In function ‘zif_decode_html’:
/home/runner/work/php-src/php-src/Zend/zend_types.h:1066:36: error: ‘matched_byte_length’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
 1066 |                 Z_TYPE_INFO_P(__z) = IS_LONG;   \
      |                                    ^
/home/runner/work/php-src/php-src/ext/standard/html.c:1648:11: note: ‘matched_byte_length’ was declared here
 1648 |     zval *matched_byte_length;
      |           ^~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
```
@nielsdos
Copy link
Member

nielsdos commented Aug 3, 2024

You should not edit the arginfo files directly because it's a generated file.
You should edit the .stub.php file and then either running make or running ./build/gen_stub.php will update the arginfo file.
Furthermore, you must assign references via ZEND_TRY_ASSIGN_REF_LONG to make sure no violations against typed properties take place.

@dmsnell
Copy link
Contributor Author

dmsnell commented Aug 3, 2024

thank you @nielsdos! I'm reading all sorts of Zend docs trying to figure this out. I will attempt to follow your instructions.

thankfully I found the silly mistakes in my decoder logic. one thing I'm wanting to do with this is compare it against lexbor. I eventually want to do that as a test, as I think all the major browsers plus lexbor are using a fairly sub-optimal lookup for the named character references

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants