Add HTML5/UTF-8 spec-compliant text decoder. #14927

Draft
wants to merge 38 commits into
base: master
170d313
Add HTML5/UTF-8 spec-compliant text decoder.
dmsnell Jul 11, 2024
f4a03da
Replace hex codes with octal codes, fix bug in test.
dmsnell Jul 11, 2024
e5a5549
Surround expression in parens as CI job suggests.
dmsnell Jul 11, 2024
5f45785
Remove space in tests, oops.
dmsnell Jul 11, 2024
9d079be
Split HTML5 character reference table into <2048-byte literals
dmsnell Jul 12, 2024
f2607c8
Directly encode bytes for U+FFFD
dmsnell Jul 12, 2024
b944305
Testing oops
dmsnell Jul 12, 2024
14fc23b
Three bytes not two.
dmsnell Jul 12, 2024
b16fcb1
Rename to decode_html and fix numeric character reference decoder.
dmsnell Aug 2, 2024
7436c99
Update large word matching, add more info to test output
dmsnell Aug 2, 2024
b5e7d5c
Fix missing offset in html5_find_large_reference_name
dmsnell Aug 2, 2024
6aac25a
Always initialize matched_byte_length.
dmsnell Aug 2, 2024
ba6878b
Change things: I don't know why this is warning me.
dmsnell Aug 2, 2024
e0148fa
Try again, another flag.
dmsnell Aug 2, 2024
8c2305b
Avoid misaligned 16-bit read
dmsnell Aug 2, 2024
2c30402
Use a nicer array lookup.
dmsnell Aug 2, 2024
21ae751
Table-based approach.
dmsnell Aug 2, 2024
10d1676
Try that la la la
dmsnell Aug 2, 2024
5ddad49
Replace HTML5_ prefix with HTML_
dmsnell Aug 2, 2024
0da4b59
HTML_TEXT_NODE is taken by libxml2
dmsnell Aug 2, 2024
fec733d
Try another, why not.
dmsnell Aug 3, 2024
1adce7d
Build with `./build/gen_stub.php`
dmsnell Aug 3, 2024
d301202
Update with help.
dmsnell Aug 3, 2024
9116ff7
Merge remote-tracking branch 'upstream/master' into add-html5-decoder
dmsnell Aug 3, 2024
470c812
Rebuild arginfo
dmsnell Aug 3, 2024
4e0ca8c
Set ref to NULL.
dmsnell Aug 3, 2024
17adec4
Separate step function from full decoder.
dmsnell Aug 3, 2024
78437dc
Add tests for new full decoder
dmsnell Aug 3, 2024
9cefb60
Rename step function to `html_decode_ref`
dmsnell Aug 3, 2024
45564a4
Fix missing rename
dmsnell Aug 3, 2024
9b7cdd5
Release the replacement strings.
dmsnell Aug 3, 2024
05decf6
Remove const qualifier
dmsnell Aug 3, 2024
83582c9
Optimize group lookup
dmsnell Aug 4, 2024
1cee0fc
More group lookup optimization.
dmsnell Aug 4, 2024
4879530
Switch to a new HtmlContext enum for the context parameter
dmsnell Aug 19, 2024
ba3b97c
fixup! Switch to a new HtmlContext enum for the context parameter
dmsnell Aug 19, 2024
0a171ce
Merge remote-tracking branch 'upstream/master' into add-html5-decoder
dmsnell Aug 19, 2024
93844af
WIP: Add more contexts, handle CR and NULL bytes.
dmsnell Aug 20, 2024
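
Several commits above deal with how U+FFFD is emitted ("Directly encode bytes for U+FFFD", "Three bytes not two", "Replace hex codes with octal codes"). The following is a minimal sketch, not code from this PR, of why the replacement character takes three bytes in UTF-8 and what an octal literal for it looks like. The names `utf8_encode` and `REPLACEMENT_CHARACTER` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the PR): encode a Unicode scalar value
 * as UTF-8. Returns the number of bytes written (1 to 4), or 0 when cp
 * is a surrogate or lies above U+10FFFF. */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return 0; /* surrogates are not scalar values */
    }
    if (cp < 0x10000) {
        /* U+FFFD falls in this range, hence three bytes. */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* U+FFFD written with octal escapes: bytes 0xEF 0xBF 0xBD. */
static const char REPLACEMENT_CHARACTER[] = "\357\277\275";
```

As a general C note, octal escapes stop after three digits, whereas a `\x` escape absorbs every hex digit that follows it. Whether that motivated the commit is not stated on this page.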
1 change: 1 addition & 0 deletions ext/standard/basic_functions.c
@@ -305,6 +305,7 @@ PHP_MINIT_FUNCTION(basic) /* {{{ */

	assertion_error_ce = register_class_AssertionError(zend_ce_error);

	html_context_ce = register_class_HtmlContext();
	rounding_mode_ce = register_class_RoundingMode();

	BASIC_MINIT_SUBMODULE(var)
13 changes: 13 additions & 0 deletions ext/standard/basic_functions.stub.php
@@ -541,6 +541,14 @@
*/
const ENT_HTML5 = UNKNOWN;

enum HtmlContext {
    case Attribute;
    case BodyText;
    case ForeignText;
    case Script;
    case Style;
}

/* image.c */

/**
@@ -2264,6 +2272,11 @@ function htmlspecialchars(string $string, int $flags = ENT_QUOTES | ENT_SUBSTITU

function htmlspecialchars_decode(string $string, int $flags = ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401): string {}

function decode_html(HtmlContext $context, string $html, int $offset = 0, ?int $length = null): ?string {}

/** @param int $matched_byte_length */
function decode_html_ref(HtmlContext $context, string $html, int $offset = 0, &$matched_byte_length = null): ?string {}

function html_entity_decode(string $string, int $flags = ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401, ?string $encoding = null): string {}

/** @refcount 1 */
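
The `decode_html_ref` stub above decodes one reference at `$offset` and reports how many bytes it spanned through `$matched_byte_length`. The following C sketch illustrates that step-function shape for numeric character references only. It is not the PR's implementation: `decode_numeric_ref` and its parameters are hypothetical, it returns a code point rather than a string, and it omits both named references and the remapping that the HTML specification applies to code points 0x80 through 0x9F.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical step function (not from the PR). If a numeric character
 * reference starts at html[offset], stores its code point and byte span
 * and returns 1. Otherwise returns 0 with *matched_byte_length == 0.
 * Assumption: the 0x80-0x9F remapping table is left out for brevity. */
static int decode_numeric_ref(const char *html, size_t length, size_t offset,
                              uint32_t *code_point, size_t *matched_byte_length)
{
    size_t at = offset;
    size_t digits = 0;
    uint32_t value = 0;
    uint32_t base = 10;

    *matched_byte_length = 0; /* initialized on every path */

    /* The shortest possible reference is "&#" plus one digit. */
    if (offset > length || length - offset < 3 ||
        html[at] != '&' || html[at + 1] != '#') {
        return 0;
    }
    at += 2;

    if (html[at] == 'x' || html[at] == 'X') {
        base = 16;
        at++;
    }

    for (; at < length; at++, digits++) {
        char c = html[at];
        uint32_t digit;

        if (c >= '0' && c <= '9') {
            digit = (uint32_t)(c - '0');
        } else if (base == 16 && c >= 'a' && c <= 'f') {
            digit = (uint32_t)(c - 'a') + 10;
        } else if (base == 16 && c >= 'A' && c <= 'F') {
            digit = (uint32_t)(c - 'A') + 10;
        } else {
            break;
        }

        /* Stop accumulating once out of range so long inputs cannot overflow. */
        if (value <= 0x10FFFF) {
            value = value * base + digit;
        }
    }

    if (digits == 0) {
        return 0; /* "&#" or "&#x" without digits is not a reference */
    }

    /* A missing semicolon is a parse error, but the reference still decodes. */
    if (at < length && html[at] == ';') {
        at++;
    }

    /* NULL, surrogates, and out-of-range values become U+FFFD. */
    if (value == 0 || value > 0x10FFFF || (value >= 0xD800 && value <= 0xDFFF)) {
        value = 0xFFFD;
    }

    *code_point = value;
    *matched_byte_length = at - offset;
    return 1;
}
```

A caller can advance through a buffer by adding the reported span to its offset whenever the function returns 1, and by one byte otherwise.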
41 changes: 40 additions & 1 deletion ext/standard/basic_functions_arginfo.h

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
