Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document / Clarify naming rules for identifiers #883

Open
dselman opened this issue Jul 23, 2024 · 8 comments
Open

Document / Clarify naming rules for identifiers #883

dselman opened this issue Jul 23, 2024 · 8 comments
Labels
Difficulty: Medium Help Wanted 🆘 Extra attention is needed Type: Bug 🐛 Something isn't working Type: Discussion 🗣 Type: Documentation 📝 Information and guides for clarification

Comments

@dselman
Copy link
Contributor

dselman commented Jul 23, 2024

Bug Report 🐛

.cto (o String \u0032\u005C3D) --parse--> AST ("2\3D") --serialize--> JSON ("2\3D")
-serialize--> Concerto (\u0032\u005C3D)

The regex should apply to the name after parsing (i.e. 2\3D in this example), because that's the true name, and that's what's stored in the Concerto object for the AST. The name/AST can be rendered into JSON (using JSON escaping rules) or rendered into Concerto (using Concerto escaping rules), but the escaping is not part of the value. It's part of the serialization format (JSON, Concerto, whatever).

Expected Behavior

Unclear. We need to tighten / document the expectations for round tripping from CTO to JSON of escaped identifiers and ensure that the CTO parser grammar is in sync with the AST regex, and code that checks for valid identifiers.

Current Behavior

Possible Solution

Steps to Reproduce

Context (Environment)

Desktop

  • OS: [e.g. macOS]
  • Browser: [e.g. Chrome, Safari]
  • Version: [e.g. 0.22.15]

Detailed Description

Possible Implementation

@dselman dselman added Type: Bug 🐛 Something isn't working Help Wanted 🆘 Extra attention is needed Difficulty: Challenging Type: Discussion 🗣 Type: Documentation 📝 Information and guides for clarification labels Jul 23, 2024
@DS-AdamMilazzo
Copy link

DS-AdamMilazzo commented Jul 23, 2024

The regular expression in the metamodel expresses the rules for a valid identifier in the Concerto (.cto) language, including the escaping rules allowed in the Concerto (.cto) language, but it shouldn't do that.

For example, you can define an enum in Concerto like:

enum Dimensions {
  o \u0031D // 1D
  o \u0032D // 2D
  o \u0033D // 3D
}

This is legal, and it gives you enum values that look like this in the AST (when saved as JSON):

[
  {"$class": "concerto.metamodel@1.0.0.EnumProperty", "name": "1D"},
  {"$class": "concerto.metamodel@1.0.0.EnumProperty", "name": "2D"},
  {"$class": "concerto.metamodel@1.0.0.EnumProperty", "name": "3D"}
]

This model works fine, and you can validate objects against it: {"$class":"...", "dims": "2D"}

Where it breaks down is when trying to validate the AST JSON against the metamodel, because the metamodel has this regex validator applied to the "name" field of the EnumProperty concept (as well as the Property and Declaration concepts):

/^(\p{Lu}|\p{Ll}|\p{Lt}|\p{Lm}|\p{Lo}|\p{Nl}|\$|_|\\u[0-9A-Fa-f]{4})(?:\p{Lu}|\p{Ll}|\p{Lt}|\p{Lm}|\p{Lo}|\p{Nl}|\$|_|\\u[0-9A-Fa-f]{4}|\p{Mn}|\p{Mc}|\p{Nd}|\p{Pc}|\u200C|\u200D)*$/u

But this regex is expressing the rules of (unparsed, potentially escaped) identifier strings in the Concerto (.cto) language, which it shouldn't be doing. The strings in the AST are already parsed and are not escaped. Because the value from the AST (e.g. "2D") doesn't match this regex, the model fails to validate against the metamodel, even though the model is valid.

Put graphically with a similar example, we have something like this:

Concerto language (\u0032\u005C3D) --parse--> AST ("2\3D") --serialize--> JSON ("2\\3D")
                                                          \--serialize--> Concerto language (\u0032\u005C3D)

The regex should apply to the name after parsing (i.e. 2\3D in this example), because that's the true name, and that's what's stored in the Concerto object for the AST. The name/AST can be rendered into JSON (using JSON escaping rules) or rendered into Concerto (using Concerto escaping rules), but the escaping is not part of the value. It's part of the serialization format (JSON, Concerto, whatever).

A somewhat analogous bug would be if the regex was trying to account for JSON escaping rules, when it would never receive a JSON-escaped value because the JSON unescaping would already have been done by the JSON parser. In this case, the regex accounts for Concerto escaping rules, but it never receives a Concerto-escaped value because the unescaping has already been done by the Concerto parser.

@DS-AdamMilazzo
Copy link

DS-AdamMilazzo commented Jul 23, 2024

I'm not sure the "Difficulty: Challenging" tag is quite warranted, unless there are backwards-compatibility concerns or you want to change the design. I'm not aware of any backwards-compatibility concerns, though. The parser seems to be parsing the name correctly, and the AST seems to be storing it correctly, and in most or all cases the Concerto tools work fine with the generated model and interpret the names from the AST correctly. It's just that some names which should be legal (according to the Concerto language) are considered to be invalid according to the metamodel. Fixing the regex just makes something work that didn't work before. Also, I think the design for Concerto names is okay and nothing needs to change except the regex.

The solution is simply to express any identifier naming rules that you want in the metamodel against the parsed, non-escaped name rather than the Concerto-escaped name. The current regex is purely syntactic and allows any non-empty name as long as it's appropriately escaped, so with escaping out of the picture "any non-empty name" can be matched by simply replacing the regex with a length validator ([1,]).

However, you have the option of making the regex smarter than that by incorporating some of your higher-level rules that are currently only expressed in Javascript code. For example, you probably have a rule that says "no declaration can be named '$class' or '$identifier' because those are reserved property names", and that rule could potentially be expressed in the regex. But that is optional. The current regex doesn't attempt that kind of validation at all, and attempting to add new validations into the regex is not really in the scope of fixing the existing regex bug.

@dselman
Copy link
Contributor Author

dselman commented Jul 24, 2024

Thank you @DS-AdamMilazzo that is very helpful. Another dimension to consider is all the code generation targets. An identifier like 2D would need to be escaped/mangled to become a valid identifier in C#, Java etc.

@DS-AdamMilazzo
Copy link

Thank you @DS-AdamMilazzo that is very helpful. Another dimension to consider is all the code generation targets. An identifier like 2D would need to be escaped/mangled to become a valid identifier in C#, Java etc.

Yes, but only the JSON property name truly needs to be preserved, and I think all those languages have a way to specify the JSON property name used for serialization. If 2D became _2D in C#/Java, I don't think anyone would mind. Like all escaping schemes, some cleverness is needed to avoid collisions (e.g. we can't blindly escape 2D into _2D because somebody could have also defined _2D in the Concerto model, in theory). But that's pretty easy to deal with, I think.

Copy link
Contributor

github-actions bot commented Dec 6, 2024

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the Stale label Dec 6, 2024
@DS-AdamMilazzo
Copy link

DS-AdamMilazzo commented Dec 7, 2024

Okay, bot, I'm commenting...

@github-actions github-actions bot removed the Stale label Dec 14, 2024
@mttrbrts
Copy link
Member

The behaviour described here is a side-effect of an bug in the identifier grammar in the parser.
The grammar allows escaped unicode characters that are prohibited in their unescaped form, for example \u0031 is permitted as a leading character, but 1 (i.e. the unescaped form) is prohibited.

I see three options to resolve this issue:

  1. We tighten the parsing to test against the identifier regex after parsing and unescaping characters (a breaking change).
  2. We loosen the regex but only for values in the AST (as @DS-AdamMilazzo suggests).
  3. We leave the behaviour as is, but clearly document it.

Option two provides backwards compatibility, however it circumvents the desired behaviour of identifiers. This option essentially would allow leading numbers in identifiers, which forces us to solve the problem of codegen conflicts/escaping (ref @dselman).

@DS-AdamMilazzo
Copy link

DS-AdamMilazzo commented Jan 13, 2025

The grammar allows escaped unicode characters that are prohibited in their unescaped form, for example \u0031 is permitted as a leading character, but 1 (i.e. the unescaped form) is prohibited.

What counts as 'undesired behavior' here? Presumably the whole purpose of the escaping mechanism is to allow the specification of names that are prohibited in their unescaped form. (Otherwise, why have an escaping mechanism at all?) So what subset of the otherwise prohibited names do you think should continue to be allowed via escaping, and what subset do you think should be blocked?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Difficulty: Medium Help Wanted 🆘 Extra attention is needed Type: Bug 🐛 Something isn't working Type: Discussion 🗣 Type: Documentation 📝 Information and guides for clarification
Projects
None yet
Development

No branches or pull requests

3 participants