Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PHP 8.3: "yield /*comment*/ from" is no longer a parse error ? #14926

Open
jrfnl opened this issue Jul 11, 2024 · 31 comments · May be fixed by #15041 or #15276
Open

PHP 8.3: "yield /*comment*/ from" is no longer a parse error ? #14926

jrfnl opened this issue Jul 11, 2024 · 31 comments · May be fixed by #15041 or #15276

Comments

@jrfnl
Copy link
Contributor

jrfnl commented Jul 11, 2024

Description

The following code:

<?php
function generator()
{
    yield from gen2();

    yield /* comment */ from gen2();

    yield
    from
    gen2();

    yield // comment
	from gen2();

    yield
    /* comment */
    from
    gen2();
}

https://3v4l.org/2SI2Q#veol

Resulted in this output:

_no output_

But I expected this output instead:

Parse error: syntax error, unexpected identifier "gen2" in /in/2SI2Q on line 7

For a full analysis of the differences in tokenization, see: PHPCSStandards/PHP_CodeSniffer#529 (comment)

While I'm not necessarily challenging this change, I'd like to know if this was a deliberate/intentional change.

I have not been able to find anything in the PHP 8.3 CHANGELOG about this change, nor in the NEWS file. I can't even seem to find the commit which caused this change.

If this change was intentional, this is probably a documentation issue and the change should be annotated in the PHP 8.3 CHANGELOG.

If this change was unintentional, I believe it may be prudent to revert the change (or at least revert the side-effects of the commit which incidentally caused this change).

I'd like to get some clarity about this as it will inform how the PHP_CodeSniffer issue linked above should be fixed.

PHP Version

PHP 8.3.8

Operating System

Not relevant

@damianwadley
Copy link
Member

This was changed by @iluuu1994 during f291d37 in response to #10083...

Tip: for grammar questions, one of the first places to check for answers is zend_language_scanner.

@nielsdos
Copy link
Member

nielsdos commented Jul 11, 2024

Damnit, beat me on commenting literally by seconds 😂
Indeed, this is a side effect of the bugfix and imo not a bug.

@jrfnl
Copy link
Contributor Author

jrfnl commented Jul 11, 2024

I do kind of consider it a bug as it is an undocumented change in behaviour - the changelog entry really doesn't cover the potential/actual impact of the change.

Knowing where this is coming from, I'm surprised I haven't seen more breakage due to this change in PHPCS. Then again, that would require the PHPCS tests to hit the changed tokenization behaviour and as this kind of thing was previously a parse error, I can understand why this is not covered by the existing tests.

If this change stays, in my opinion, the changelog entry needs to be updated to more comprehensively state what has actually changed and what the impact of this. I also think it should be mentioned in the CHANGELOG, not just in NEWS.

Additionally, I'd recommend for tests to be added with the above code sample for yield from to prevent a regression in the future. Preferably tests which would not just check that the above code sample is not regarded as a parse error, but tests which also check the tokenization is consistent.

@jrfnl
Copy link
Contributor Author

jrfnl commented Jul 11, 2024

P.S.: Thanks @damianwadley and @nielsdos for your fast response to my query!

@cmb69
Copy link
Member

cmb69 commented Jul 12, 2024

Since this affects PHP 8.3, I'm not sure whether updating CHANGELOG makes much sense now, but it's certainly worth documenting it in the migration guide of the PHP documentation.

@iluuu1994
Copy link
Member

@jrfnl I indeed didn't think about linters, this should have gotten an UPGRADING entry. Sorry about that.

@mvorisek
Copy link
Contributor

mvorisek commented Jul 14, 2024

IMO it is not a bug, as comments are (should be) allowed between any code token in general.

This is probably already documented or at least it is "generally expected". So f291d37 is a fix.

@iluuu1994
Copy link
Member

@mvorisek What Juliette references is the patch changing the output of PhpToken::tokenize(), with the tokenizer treating yield /* comment */ from as a singular token. This is the only feasible solution without switching to a GLR parser.

@mvorisek
Copy link
Contributor

mvorisek commented Jul 14, 2024

That is indeed not good and such token should be broken into separate tokens further.

repro: https://3v4l.org/7VmX1

@jrfnl
Copy link
Contributor Author

jrfnl commented Jul 14, 2024

Forgive me if I don't use the correct terminology for some things, I'm hoping that even if the terminology is not 100% correct for C, you can still understand what I'm trying to say.

IMO it is not a bug, as comments are (should be) allowed between any code token in general.

... which is exactly the reason why I was testing the Tokenizer behaviour with comments between the words.

Having said that, it's also not true....

In nearly all cases, tokens consist of a "single entity", so no comments can exist within a token.
This includes "compound operators", like ??, where no whitespace or comment is allowed between the first and the second ?.

A typical case where the tokenization changed from multiple tokens to single token is the PHP 8.0 "namespaced names" change.
However, that change also explicitly removed the ability to have whitespace or comments within a namespaced name.

Another one which comes to mind are cast tokens, like (bool). These tokens can contain whitespace, i.e. ( bool ), but they cannot contain comments. See: https://3v4l.org/A6Sgj and https://3v4l.org/nE9H8

So, in practice, yield from is (AFAICS) the only "multi-entity" token which can contain comments... (since PHP 8.3).

Retrospectively, it would have been better if yield from would have been introduced as yield_from (note the underscore). This would be in line with other multi-word keywords, like include_once or as yieldfrom, which would be in line with keywords like endif, but considering it has been in the language for such a long time now, changing this now would be a huge BC-break.

I consider the change in f291d37 a bug as it is an undocumented (breaking) change in behaviour for the Tokenizer for projects consuming the token output.
That bug could be fixed by either fixing/updating the documentation or by reverting the change.

So, what remains then is the discussion about how yield from should be tokenized.

I'm not sure about this and you can also see some thoughts about this in the linked PHPCS ticket (as PHPCS needs to ensure the token stream sniffs receive is the same on all PHP versions, so I need to make a choice on how to get round the changed tokenization).

On the one hand I agree that comments between the keywords should be ignored - as they mostly are elsewhere -, on the other hand, comments cause a parse error in other "multi entity" tokens, so this change introduces an inconsistency.

@iluuu1994
Copy link
Member

In nearly all cases, tokens consist of a "single entity", so no comments can exist within a token. This includes "compound operators", like ??, where no whitespace or comment is allowed between the first and the second ?.

That's true, but what qualifies as a token is usually very intuitive. yield from is one of the few exceptions, and it's primary rationale for being a single token to avoid making from itself a keyword.

So, in practice, yield from is (AFAICS) the only "multi-entity" token which can contain comments... (since PHP 8.3).

I think that's correct. The other cases only use lookahead for comments.

So, what remains then is the discussion about how yield from should be tokenized.

Unfortunately, for yield from to not cause parser ambiguitites, it must be a single token. Hence, when a comment does appear in that spot, it is implicitly part of the token. Since 8.3 has been out for a while, I don't think removing it now is feasible, and if it is removed from 8.4, then that still leaves 8.3 as an issue for you.

It might be possible for you to artificially split yield /* comment */ from into multiple tokens. It's not great to push that responsibility onto users, but I don't see a great alternative at the moment. If you have any other ideas that might help with your situation, please let me know.

@jrfnl
Copy link
Contributor Author

jrfnl commented Jul 14, 2024

Unfortunately, for yield from to not cause parser ambiguitites, it must be a single token. Hence, when a comment does appear in that spot, it is implicitly part of the token.

And that's what I don't necessarily agree with - and nor did PHP 7.0 - 8.2 in which this was a parse error.

Since 8.3 has been out for a while, I don't think removing it now is feasible

And again, I don't agree with this. I only discovered this bug now as the change wasn't documented. That doesn't make it any less of a bug/inconsistency with every single other token, as none of them allow for comments within the token.

If it had been included in the changelog, I would probably have questioned the change/introduction of this inconsistency before the release was out.

Note: I'm only challenging the change to the yield from tokenization. Not the other changes in the same commit.

@cmb69
Copy link
Member

cmb69 commented Jul 14, 2024

Well, we can't go back in time, so the question is what would break more code: reverting the change for yield from, or leaving it as is. I'd hope the former, but I'm afraid the latter might actually be the case (coders are so "creative": "let's just put a comment between yield and from, because we can").

@iluuu1994
Copy link
Member

iluuu1994 commented Jul 14, 2024

I would agree with Christoph. From what I understand, PHP_CodeSniffer would either not correctly format the comment, or mistakenly remove it. While that's not great, it's not breaking. Removing parsing at this point would fatal error on a PHP patch update, and I think that's generally a no-go. Please correct me if my understanding is incorrect.

@jrfnl
Copy link
Contributor Author

jrfnl commented Jul 14, 2024

The other side of that argument is that the chances of people having discovered that they can place a comment between yield and from now are slim as the change was not documented/publicized and has only been out 7 months...

@cmb69
Copy link
Member

cmb69 commented Jul 14, 2024

Maybe we should ask for opinions on the internals mailing list?

@nielsdos
Copy link
Member

But what is the "solution" anyway? Complicating the lexer to deal with "yield from" specially?

@cmb69
Copy link
Member

cmb69 commented Jul 15, 2024

But what is the "solution" anyway? Complicating the lexer to deal with "yield from" specially?

I think this is about reverting

<ST_IN_SCRIPTING>"yield"{WHITESPACE_OR_COMMENTS}"from"[^a-zA-Z0-9_\x80-\xff] {

to the previous

<ST_IN_SCRIPTING>"yield"{WHITESPACE}"from"[^a-zA-Z0-9_\x80-\xff] {

or not.

@bwoebi
Copy link
Member

bwoebi commented Jul 15, 2024

I think the proper way of fixing this would be introducing T_FROM, in a special lexer mode.

I.e.:

<ST_IN_SCRIPTING>"yield"{WHITESPACE_OR_COMMENTS}"from"[^a-zA-Z0-9_\x80-\xff] {
	yyless(5);
	yy_push_state(ST_LOOKING_FOR_FROM);
	RETURN_TOKEN(T_YIELD);
}

<ST_LOOKING_FOR_FROM>"from" {
	yy_pop_state();
	RETURN_TOKEN(T_FROM);
}

And then add ST_LOOKING_FOR_FROM to the whitespace, comment etc. targets.

Sort of related issue is #14961. Basically needs the same handling than that one.

@jrfnl
Copy link
Contributor Author

jrfnl commented Jul 15, 2024

@bwoebi Wouldn't that be a much larger BC-break as it would turn from into a (partially?) reserved keyword ? and would remove the T_YIELD_FROM token ?

@bwoebi
Copy link
Member

bwoebi commented Jul 15, 2024

@jrfnl It would be a keyword only if preceded by yield, i.e. changing nothing on the status quo. But yes, it would remove the T_YIELD_FROM token (and a T_COMMENT could now be between T_YIELD and T_FROM).

@cmb69
Copy link
Member

cmb69 commented Jul 17, 2024

@bwoebi, I guess that would cause even more trouble for @jrfnl and others.

Anyhow, we need a solution soon; otherwise the solution would be to stick with what we have since PHP 8.3.10 is already on it's way (RC1 has been tagged yesterday).

@jrfnl
Copy link
Contributor Author

jrfnl commented Jul 17, 2024

I've posted to Internals now: https://externals.io/message/124462 Let's see if we get a response.

bwoebi pushed a commit to bwoebi/php-src that referenced this issue Jul 20, 2024
@bwoebi bwoebi linked a pull request Jul 20, 2024 that will close this issue
bwoebi pushed a commit to bwoebi/php-src that referenced this issue Jul 20, 2024
@cmb69
Copy link
Member

cmb69 commented Aug 5, 2024

Hmm, the discussion on the internals mailing list ended without consensus, not even agreement that the current behavior is a bug; as such we can at most keep that as feature request, but I guess that would not help @jrfnl in any way, and may cause even more work(-arounds).

So, what to do, @jrfnl?

@jrfnl
Copy link
Contributor Author

jrfnl commented Aug 5, 2024

@cmb69 Ha, you beat me to it.

Interesting to see your take on it. The way I read it, is that there is consensus that there is a bug, just not whether the bug is the change in the tokenizer or the missing documentation ;-)

I also see consensus that the current tokenization is not ideal, but no consensus on how it should be changed and whether there should be consistency rules for whitespace/comments in tokens.

So, what to do, @jrfnl?

As I also shared on the list:

I agree the improved tokenization proposed in the PR makes better sense.

Having said that, it will make the breaking change much much larger as
it now would no longer just be a change which can be handled in the
PHPCS token stream, but one which will impact individual sniffs, what
with the removal and introduction of new tokens.

If that PR gets merged, it will mean that PHPCS will need to undo the
PHP 8.3 and PHP 8.4 tokenization for the time being (in the 3.x major)
and will only "polyfill" the new tokenization as of PHPCS 4.0. In
practice, we would be moving the breaking change to PHPCS 4.0 version to
not break sniffs.

@cmb69
Copy link
Member

cmb69 commented Aug 6, 2024

Interesting to see your take on it. The way I read it, is that there is consensus that there is a bug, just not whether the bug is the change in the tokenizer or the missing documentation ;-)

Oh, right. In this case I was referring to an implementation bug, since documenting the issue wouldn't really help you.

Regarding PR #15041: I feel that this is curing the symptoms, but not the desease, namely that from is not a keyword. Changing this in a minor version is off the table, and even changing this in a major may not be accepted. And anyway, it wouldn't help right now with the issue at hand.

@jrfnl
Copy link
Contributor Author

jrfnl commented Aug 6, 2024

And anyway, it wouldn't help right now with the issue at hand.

Well, until there is clarity where this is going, I can't fix things for PHPCS, so PHPCS will stay broken until a solution is found.

@cmb69
Copy link
Member

cmb69 commented Aug 6, 2024

Well, who is up to decide what we do about this? Maybe the 8.3 release managers have something to say: @adoy, @bukka, @ericmann, thoughts?

@bukka
Copy link
Member

bukka commented Aug 6, 2024

I don't think changing that in 8.3 is acceptable. It's been out for some time and doing such break in bug fixing release is not a good idea IMHO. We allow some breaks in minor version so this is just a break in minor version even though it is a bit unfortunate.

cmb69 added a commit to cmb69/php-src that referenced this issue Aug 7, 2024
cmb69 added a commit to cmb69/php-src that referenced this issue Aug 7, 2024
@cmb69
Copy link
Member

cmb69 commented Aug 7, 2024

Okay, if we don't change the behavior, we should document the issue. Thus, I've submitted PR #15276.

@iluuu1994
Copy link
Member

Thank you! That makes sense to me. Let's not merge #15041 without a further discussion or RFC then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
8 participants