Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reminder: revisit yield from tokenization #529

Open
jrfnl opened this issue Jul 1, 2024 · 8 comments
Open

Reminder: revisit yield from tokenization #529

jrfnl opened this issue Jul 1, 2024 · 8 comments

Comments

@jrfnl
Copy link
Member

jrfnl commented Jul 1, 2024

Just a reminder to self to revisit the yield from tokenization and add more extensive tests for this.

Things to have a critical look at whether they are tokenized correctly/how they are tokenized:

  • yield and from with only whitespace between the words, but the words are on different lines
  • yield and from with a comment between the words

Also needs a check on how PHP natively handles this.

@jrfnl
Copy link
Member Author

jrfnl commented Jul 4, 2024

Loosely related to: squizlabs/PHP_CodeSniffer#3808

@jrfnl
Copy link
Member Author

jrfnl commented Jul 4, 2024

Okay, so I've had a look at this and as I suspected there are some issues which should be fixed. I'm currently undecided on how to fix these.

I'm fully aware that the issues found are edge cases and will rarely, if ever, be found in real life codebases. Having said that, that's all the more reason to fix these tokenizer issues as PHPCS kicking in and flagging things correctly is especially important for those edge cases. Issue squizlabs/PHP_CodeSniffer#3808 is a case in point for this.

Tokenizer basic principles

Guiding principles:

  • After the PHPCS tokenization layer has been applied, the token stream should be consistent across all supported PHP versions.
  • The token stream should match the PHP native tokenization (if available) for any new tokens, unless that tokenization doesn't make sense in the context of PHPCS.

Additionally:

  • If a tab width is set, any tabs in the content should be replaced by the appropriate number of spaces (and the original content with tabs will be made available in a orig_content key).
  • Any (PHP native) token which spans multiple lines, will be split into one token per line and these tokens will include the relevant new line character and indentation spacing,
    In practice, this typically applies to multi-line text strings/heredoc/nowdocs and multi-line comments.

Findings

For "plain" yield, everything is fine.
For "plain" yield from everything is fine.
Case-insensitivity of the keywords is also handled correctly.

Things get interesting once we add a comment/ignore annotation between the yield and the from keywords and/or have a tab or a new line character (and indentation whitespace) between the keywords.

Test set up

I've used the following code sample to examine the tokenization:

<?php
function generator()
{
    yield from gen2();

    yield /* comment */ from gen2();

    yield
    from
    gen2();

    yield // comment
	from gen2();

    yield
    /* comment */
    from
    gen2();
}

The above code sample is valid in PHP 8.3, but shows parse errors in PHP < 8.3 for those code samples with a comment between the yield and the from keywords.
https://3v4l.org/2SI2Q#veol

PHP native tokenization

Taking the PHP native tokenization as a basis, there are four (five) different tokenizations to take into account.

  • PHP 5.4 in which yield didn't exist as a keyword.
  • PHP 5.5 in which yield did exist, but yield from did not.
  • PHP 7.0 in which both yield as well as yield from exist.
  • PHP 8.0 in which the comment tokenization was changed (handled separately).
  • PHP 8.3, in which the handling of comments within the yield from keyword was changed (improved/fixed).

Details about the differences in PHP native tokenization

The PHP native tokenization is as follows:

	yield from gen2();

# PHP 5.4:
# Line 4: T_STRING ('yield')
# Line 4: T_WHITESPACE (' ')
# Line 4: T_STRING ('from')
#
# PHP 5.5 - 5.6:
# Line 4: T_YIELD ('yield')
# Line 4: T_WHITESPACE (' ')
# Line 4: T_STRING ('from')
#
# PHP 7.0 - 8.2:
# Line 4: T_YIELD_FROM ('yield from')
#
# PHP 8.3:
# Line 4: T_YIELD_FROM ('yield from')

	yield /* comment */ from gen2();

# PHP 5.4:
# Line 6: T_STRING ('yield')
# Line 6: T_WHITESPACE (' ')
# Line 6: T_COMMENT ('/* comment */')
# Line 6: T_WHITESPACE (' ')
# Line 6: T_STRING ('from')
#
# PHP 5.5 - 5.6:
# Line 6: T_YIELD ('yield')
# Line 6: T_WHITESPACE (' ')
# Line 6: T_COMMENT ('/* comment */')
# Line 6: T_WHITESPACE (' ')
# Line 6: T_STRING ('from')
#
# PHP 7.0 - 8.2:
# Line 6: T_YIELD ('yield')
# Line 6: T_WHITESPACE (' ')
# Line 6: T_COMMENT ('/* comment */')
# Line 6: T_WHITESPACE (' ')
# Line 6: T_STRING ('from')
#
# PHP 8.3:
# Line 6: T_YIELD_FROM ('yield /* comment */ from')

    yield
    from
    gen2();

# PHP 5.4:
# Line 8: T_STRING ('yield')
# Line 8: T_WHITESPACE ('
#     ')
# Line 9: T_STRING ('from')
#
# PHP 5.5 - 5.6:
# Line 8: T_YIELD ('yield')
# Line 8: T_WHITESPACE ('
#     ')
# Line 9: T_STRING ('from')
#
# PHP 7.0 - 8.2:
# Line 8: T_YIELD_FROM ('yield
#     from')
#
# PHP 8.3:
# Line 8: T_YIELD_FROM ('yield
#     from')

    yield // comment
	from gen2();

# PHP 5.4:
# Line 12: T_STRING ('yield')
# Line 12: T_WHITESPACE (' ')
# Line 12: T_COMMENT ('// comment
# ')
# Line 13: T_WHITESPACE ('	')
# Line 13: T_STRING ('from')
#
# PHP 5.5 - 5.6:
# Line 12: T_YIELD ('yield')
# Line 12: T_WHITESPACE (' ')
# Line 12: T_COMMENT ('// comment
# ')
# Line 13: T_WHITESPACE ('	')
# Line 13: T_STRING ('from')
#
# PHP 7.0 - 8.2:
# Line 12: T_YIELD ('yield')
# Line 12: T_WHITESPACE (' ')
# Line 12: T_COMMENT ('// comment
# ')
# Line 13: T_WHITESPACE ('	')
# Line 13: T_STRING ('from')
#
# PHP 8.3:
# Line 12: T_YIELD_FROM ('yield // comment
# 	from')

    yield
    /* comment */
    from
    gen2();

# PHP 5.4:
# Line 15: T_STRING ('yield')
# Line 15: T_WHITESPACE ('
#     ')
# Line 16: T_COMMENT ('/* comment */')
# Line 16: T_WHITESPACE ('
#     ')
# Line 17: T_STRING ('from')
#
# PHP 5.5 - 5.6:
# Line 15: T_YIELD ('yield')
# Line 15: T_WHITESPACE ('
#     ')
# Line 16: T_COMMENT ('/* comment */')
# Line 16: T_WHITESPACE ('
#     ')
# Line 17: T_STRING ('from')
#
# PHP 7.0 - 8.2:
# Line 15: T_YIELD ('yield')
# Line 15: T_WHITESPACE ('
#     ')
# Line 16: T_COMMENT ('/* comment */')
# Line 16: T_WHITESPACE ('
#     ')
# Line 17: T_STRING ('from')
#
# PHP 8.3:
# Line 15: T_YIELD_FROM ('yield
#     /* comment */
#     from')

To summarize: it looks like PHP natively did not support a comment between the yield and from keywords in PHP 7.0 - 8.2 and does since PHP 8.3.

PHPCS tokenization

As things are, the PHPCS tokenization is consistent for PHP 5.4 - 8.2 and in line with a comment between the keywords being a parse error.
For PHP 8.3, the PHPCS tokenization is different due to the PHP 8.3 native tokenization having changed.

The PHPCS tokenization is as follows:

	yield from gen2();

# PHP 5.4 - 8.2:
# 10 | L04 | C  5 | CC 1 | ( 0) | T_YIELD_FROM               | [ 10]: yield from
#
# PHP 8.3:
# 10 | L04 | C  5 | CC 1 | ( 0) | T_YIELD_FROM               | [ 10]: yield from

	yield /* comment */ from gen2();

# PHP 5.4 - 8.2:
# 19 | L06 | C  5 | CC 1 | ( 0) | T_YIELD                    | [  5]: yield
# 20 | L06 | C 10 | CC 1 | ( 0) | T_WHITESPACE               | [  1]:
# 21 | L06 | C 11 | CC 1 | ( 0) | T_COMMENT                  | [ 13]: /* comment */
# 22 | L06 | C 24 | CC 1 | ( 0) | T_WHITESPACE               | [  1]:
# 23 | L06 | C 25 | CC 1 | ( 0) | T_STRING                   | [  4]: from
#
# PHP 8.3:
# 19 | L06 | C  5 | CC 1 | ( 0) | T_YIELD_FROM               | [ 24]: yield /* comment */ from

    yield
    from
    gen2();

# PHP 5.4 - 8.2:
# 32 | L08 | C  5 | CC 1 | ( 0) | T_YIELD_FROM               | [  5]: yield
#
# 33 | L09 | C  1 | CC 1 | ( 0) | T_YIELD_FROM               | [  8]:     from
#
# PHP 8.3:
# 28 | L08 | C  5 | CC 1 | ( 0) | T_YIELD_FROM               | [  5]: yield
#
# 29 | L09 | C  1 | CC 1 | ( 0) | T_YIELD_FROM               | [  8]:     from

    yield // comment
	from gen2();

# PHP 5.4 - 8.2:
# 43 | L12 | C  5 | CC 1 | ( 0) | T_YIELD                    | [  5]: yield
# 44 | L12 | C 10 | CC 1 | ( 0) | T_WHITESPACE               | [  1]:
# 45 | L12 | C 11 | CC 1 | ( 0) | T_COMMENT                  | [ 10]: // comment
#
# 46 | L13 | C  1 | CC 1 | ( 0) | T_WHITESPACE               | [  1]:
# 47 | L13 | C  2 | CC 1 | ( 0) | T_STRING                   | [  4]: from
#
# PHP 8.3:
# 39 | L12 | C  5 | CC 1 | ( 0) | T_YIELD_FROM               | [ 16]: yield // comment
#
# 40 | L13 | C  1 | CC 1 | ( 0) | T_YIELD_FROM               | [  5]: 	from

    yield
    /* comment */
    from
    gen2();

# PHP 5.4 - 8.2:
# 56 | L15 | C  5 | CC 1 | ( 0) | T_YIELD                    | [  5]: yield
# 57 | L15 | C 10 | CC 1 | ( 0) | T_WHITESPACE               | [  0]:
#
# 58 | L16 | C  1 | CC 1 | ( 0) | T_WHITESPACE               | [  4]:
# 59 | L16 | C  5 | CC 1 | ( 0) | T_COMMENT                  | [ 13]: /* comment */
# 60 | L16 | C 18 | CC 1 | ( 0) | T_WHITESPACE               | [  0]:
#
# 61 | L17 | C  1 | CC 1 | ( 0) | T_WHITESPACE               | [  4]:
# 62 | L17 | C  5 | CC 1 | ( 0) | T_STRING                   | [  4]: from
#
# PHP 8.3:
# 49 | L15 | C  5 | CC 1 | ( 0) | T_YIELD_FROM               | [  5]: yield
#
# 50 | L16 | C  1 | CC 1 | ( 0) | T_YIELD_FROM               | [ 17]:     /* comment */
#
# 51 | L17 | C  1 | CC 1 | ( 0) | T_YIELD_FROM               | [  8]:     from

To summarize:

  • PHPCS on PHP 5.4 - 8.2 tokenizes the yield and from in a yield from with a comment between the keywords as T_YIELD and T_STRING respectively, and tokenizes the comment as as T_COMMENT and treats any whitespace as T_WHITESPACE.
  • PHPCS on PHP 8.3 tokenizes a yield from with a comment between the keywords as one T_YIELD_FROM token (possibly split into multiple tokens when multi-line) and includes the comment in the T_YIELD_FROM.

Conclusions

As things currently are, there are the following issues in the PHPCS polyfill/re-tokenization:

  • Tab replacement is not handled for the whitespace between the yield and from keywords.
    This means that sniffs which check for indentation/inline tabs will not flag tabs between the keywords or indentation tabs when the keywords are across multiple lines (though those sniffs may need updating too to listen to the T_YIELD_FROM token).
    Sniffs which will need tests/updates for this: Generic.WhiteSpace.DisallowTabIndent, Generic.WhiteSpace.LanguageConstructSpacing and (PHPCSExtra) Universal.WhiteSpace.DisallowInlineTabs.
  • The PHP 8.3 tokenization change is not handled, which means comments between the yield and from keywords are not being handled consistently and that the keywords are not always tokenized consistently/correctly.

Possible mitigations

How to consistently tokenize yield from with a comment and/or new line between the keywords, knowing this is a parse error in PHP < 8.3, is a little challenging.

There are three options I can see:

  1. Adhere to the PHP 8.3 tokenization - tokenize everything from yield to from as T_YIELD_FROM.
    This will make it more difficult for comment sniffs to act on potential comments in the token. It would also prohibit/disregard PHPCS annotations inside the yield from.
    Potentially the LanguageConstructSpacing sniff could be updated to flag/forbid comments in yield from.
    yield from gen2();
    # 10 | L04 | C  5 | CC 1 | ( 0) | T_YIELD_FROM               | [ 10]: yield from
    
    yield /* comment */ from gen2();
    # 19 | L06 | C  5 | CC 1 | ( 0) | T_YIELD_FROM               | [ 24]: yield /* comment */ from
    
    yield
    /* comment */
    from
    gen2();
    # 49 | L15 | C  5 | CC 1 | ( 0) | T_YIELD_FROM               | [  5]: yield
    #
    # 50 | L16 | C  1 | CC 1 | ( 0) | T_YIELD_FROM               | [ 17]:     /* comment */
    #
    # 51 | L17 | C  1 | CC 1 | ( 0) | T_YIELD_FROM               | [  8]:     from
  2. Tokenize as a single token when there is only non-new line whitespace between the keywords, tokenize as multiple tokens in all other cases, but consistently tokenize the yield and from in those cases as T_YIELD_FROM.
    yield from gen2();
    # 10 | L04 | C  5 | CC 1 | ( 0) | T_YIELD_FROM               | [ 10]: yield from
    
    yield /* comment */ from gen2();
    # 19 | L06 | C  5 | CC 1 | ( 0) | T_YIELD_FROM               | [  5]: yield
    # 20 | L06 | C 10 | CC 1 | ( 0) | T_WHITESPACE               | [  1]:
    # 21 | L06 | C 11 | CC 1 | ( 0) | T_COMMENT                  | [ 13]: /* comment */
    # 22 | L06 | C 24 | CC 1 | ( 0) | T_WHITESPACE               | [  1]:
    # 23 | L06 | C 25 | CC 1 | ( 0) | T_YIELD_FROM               | [  4]: from
    
    yield
    /* comment */
    from
    gen2();
    # 56 | L15 | C  5 | CC 1 | ( 0) | T_YIELD_FROM               | [  5]: yield
    # 57 | L15 | C 10 | CC 1 | ( 0) | T_WHITESPACE               | [  0]:
    #
    # 58 | L16 | C  1 | CC 1 | ( 0) | T_WHITESPACE               | [  4]:
    # 59 | L16 | C  5 | CC 1 | ( 0) | T_COMMENT                  | [ 13]: /* comment */
    # 60 | L16 | C 18 | CC 1 | ( 0) | T_WHITESPACE               | [  0]:
    #
    # 61 | L17 | C  1 | CC 1 | ( 0) | T_WHITESPACE               | [  4]:
    # 62 | L17 | C  5 | CC 1 | ( 0) | T_YIELD_FROM               | [  4]: from
  3. Hybrid - tokenize the yield and from keywords and any directly related whitespace as T_YIELD_FROM, but tokenize comments separately.
    This would be more in line with how other potential multi-line tokens, like T_COMMENT and T_CONSTANT_ENCAPSED_STRING are handled, but makes the whitespace handling around (single-line) yield from with a comment a bit weird/inconsistent as those would now potentially include whitespace.
    yield from gen2();
    # 10 | L04 | C  5 | CC 1 | ( 0) | T_YIELD_FROM               | [ 10]: yield from
    
    yield /* comment */ from gen2();
    # 19 | L06 | C  5 | CC 1 | ( 0) | T_YIELD_FROM               | [  6]: yield
    # 20 | L06 | C 11 | CC 1 | ( 0) | T_COMMENT                  | [ 13]: /* comment */
    # 21 | L06 | C 25 | CC 1 | ( 0) | T_YIELD_FROM               | [  5]:  from
    
    yield
    /* comment */
    from
    gen2();
    # 49 | L15 | C  5 | CC 1 | ( 0) | T_YIELD_FROM               | [  5]: yield
    #
    # 50 | L16 | C  1 | CC 1 | ( 0) | T_WHITESPACE               | [  4]:
    # 51 | L16 | C  5 | CC 1 | ( 0) | T_COMMENT                  | [ 13]: /* comment */
    #
    # 52 | L17 | C  1 | CC 1 | ( 0) | T_YIELD_FROM               | [  8]:     from

Independently of the chosen solution, tab replacement should be enabled for the T_YIELD_FROM token, though whether this will only act on yield[tab]from or also on potential indentation whitespace will depend on the chosen solution.

Aside from the above, it will probably be a good idea to add a sniff to PHPCompatibility to check for comments between the yield and from keywords and flag that as unsupported prior to PHP 8.3. I've opened issue PHPCompatibility/PHPCompatibility#1720 for this.

Opinions

I'd love to hear some opinions on the above and preferences for the solution directions (or even potential alternative
solutions which I haven't thought of yet...)

I'm personally leaning towards solution 2️⃣ , but am still not 100% happy with this.

/cc @asispts @dingo-d @fredden @GaryJones @greg0ire @kukulich @michalbundyra @Ocramius @sirbrillig @stronk7 @weierophinney @wimg

@fredden
Copy link
Member

fredden commented Jul 5, 2024

Another option not listed:
4. Consistently tokenise yield and from as separate tokens (both T_YIELD_FROM), and tokenise anything in between "normally." This is in line with how a function declaration is tokenised: the function keyword (T_FUNCTION) doesn't include any white-space, the actual white space is tokenised as T_WHITESPACE and the name of the function (T_STRING) also doesn't include any white-space.

Ptr | Ln | Col  | Cond | ( #) | Token Type                 | [len]: Content
-------------------------------------------------------------------------
  0 | L1 | C  1 | CC 0 | ( 0) | T_OPEN_TAG                 | [  5]: <?php

  1 | L2 | C  1 | CC 0 | ( 0) | T_WHITESPACE               | [  0]:

  2 | L3 | C  1 | CC 0 | ( 0) | T_FUNCTION                 | [  8]: function
  3 | L3 | C  9 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
  4 | L3 | C 10 | CC 0 | ( 0) | T_STRING                   | [  5]: thing
  5 | L3 | C 15 | CC 0 | ( 0) | T_OPEN_PARENTHESIS         | [  1]: (
  6 | L3 | C 16 | CC 0 | ( 0) | T_CLOSE_PARENTHESIS        | [  1]: )
  7 | L3 | C 17 | CC 0 | ( 0) | T_SEMICOLON                | [  1]: ;
  8 | L3 | C 18 | CC 0 | ( 0) | T_WHITESPACE               | [  0]:

We can link the two T_YIELD_FROM tokens together with a property (similar to how we do scope/parenthesis open/close currently).

Proposed:

T_YIELD_FROM         yield
T_WHITESPACE         ⸱      
T_YIELD_FROM         from  
T_WHITESPACE         ⸱      
T_STRING             thing  
T_OPEN_PARENTHESIS   (      
T_CLOSE_PARENTHESIS  )      
T_SEMICOLON          ;      

@jrfnl
Copy link
Member Author

jrfnl commented Jul 5, 2024

@fredden Good input. Thanks for that. I'll have to think it over. Main reason why I didn't include that option in the original list is because if makes the gap between the PHP native tokenization and the PHPCS tokenization larger/makes it more inconsistent.

As for the comparison with function declarations: that's not a comparable syntax as PHP natively does not tokenize the function keyword and the function name as one token.

The only multi-character tokens which AFAICS come close are cast tokens which can contain arbitrary whitespace on the inside of the parentheses, though those still can't contain new lines or comments.
See:
https://3v4l.org/A6Sgj and https://3v4l.org/nE9H8

@jrfnl
Copy link
Member Author

jrfnl commented Jul 11, 2024

I've opened an issue in PHP Core to ask whether the PHP 8.3 change was intentional or should be regarded as a bug. The answer to that should help inform the solution direction for this issue.

@jrfnl
Copy link
Member Author

jrfnl commented Jul 18, 2024

The issue in PHP Core has now escalated to PHP Internals: https://externals.io/message/124462

@jrfnl
Copy link
Member Author

jrfnl commented Jul 21, 2024

There is a proposal open to change the tokenization (yet) again now in PHP 8.4, without reverting the previous change: php/php-src#15041

This is just complicating things even more as it will cause an even larger BC break (removal of the T_YIELD_FROM token, adding of new T_FROM[_FOR_YIELD] token and always having two tokens).

We'll have to wait and see what happens, but if that gets merged, the problem will be even larger as that will break every sniff which looks for T_YIELD_FROM, so in that case, I think we'll have to "undo" the PHP 8.3 and the 8.4 tokenization for PHPCS 3.x and only move to the new tokenization in PHPCS 4.0 - similar to what was done for the PHP 8.0 namespaced names tokenization change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants