Introduce remove ambiguous grammar RFC #1

DelSkayn · 2023-10-24T15:40:32Z

This PR introduces the remove ambiguous grammar RFC.

This RFC is closely related to a PR on the main repository: surrealdb/surrealdb#2885

RFC Summary

The current version of SurrealQL grammar, as defined by what the parser currently accepts, contains several ambiguous productions. These productions are parsed differently depending on the context or can seem very similar to each other, but subtle differences can result in completely different semantics.

These ambiguities in the grammar complicate parser design, limit which future extensions to SurrealQL can be added without breaking existing code, and can confuse users of the language.

This RFC proposes several changes to the grammar to limit the present ambiguity:

- Disallow a value from starting with an identifier which could, in its current position in the grammar, start a statement as a keyword.
- Require that raw identifiers don't start with a digit.
- Introduce strand prefixes for the specific types of strands.
- Introduce a syntax error for the block/record-id/object ambiguity.
- Change the KNN operator from `<3>` to `knn<3>`.

text/0001-remove_ambiguous_grammar.md

andar1an · 2023-11-03T15:23:38Z

I was uncertain what you meant by parser ambiguity, and am really glad you included examples and a thoughtful proposal layout because this ambiguity would definitely nip me as a user eventually.

andar1an · 2023-11-03T15:27:45Z

text/0001-remove_ambiguous_grammar.md

+
+An example is `"5:00"`, which was brought up in an issue. The user wanted to store this value as a plain strand, but it happened to match a thing strand.
+
+I propose we introduce specific strand prefixes for specific strand types:


question: will single quotes or double quotes matter? I notice uuid uses single quotes and the other 2 examples use double. Assuming either will be acceptable?

Yes, either is acceptable; the only difference is the `u` or `t` in front of the string.
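To illustrate the prefix idea, here is a minimal Python sketch (not SurrealDB's parser). The prefix letter alone selects the strand type, and the quote style is irrelevant. `classify_strand` and the `PREFIXES` table are hypothetical; `u` for UUID and `r` for record come from examples in this thread, while mapping `t` to datetime is an assumption.

```python
# Hypothetical prefix table. "u" and "r" appear in this thread's examples;
# "t" -> datetime is an assumption made for illustration.
PREFIXES = {"u": "uuid", "t": "datetime", "r": "record"}

def classify_strand(literal: str) -> tuple[str, str]:
    """Return (kind, contents) for a quoted strand literal."""
    prefix = ""
    if literal and literal[0] in PREFIXES:
        prefix, literal = literal[0], literal[1:]
    # Single and double quotes are treated identically.
    if len(literal) < 2 or literal[0] not in "'\"" or literal[-1] != literal[0]:
        raise ValueError("not a quoted strand")
    # Without a prefix the contents are never inspected: it is a plain strand.
    kind = PREFIXES.get(prefix, "plain")
    return kind, literal[1:-1]
```

With this rule, `"5:00"` stays a plain strand no matter what it happens to look like, because only an explicit prefix can change the type.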

What about ULIDs?? Taking a glance at SurrealDB's code, it seems that they are handled quite differently than UUIDs. Are they stored differently as well?

Also, I'm apprehensive about the record strand prefix.
I get it for datetime and UUIDs, but for solving the ambiguity between record IDs and object fields, I think the object syntax should change.

I think the way to go is to make objects have fields that are string literals.
(In the future there could be format strings, which would allow for dynamically setting the field name)

So, for your example:

{ a:b.c }

Is taking the field c of a record a:b.
To express an object with field a with a value of b.c, you would instead write:

{ "a":b.c }

This also mirrors the rules in JSON.

andar1an · 2023-11-03T15:29:54Z

text/0001-remove_ambiguous_grammar.md

+
+## Location specific reserved words.
+
+The first change proposed is to disallow raw identifiers in places where the same identifier could also be parsed as a keyword. For example, the following code would no longer be allowed:


Would it be simpler to just disallow certain keywords, and not allow them to be wrapped either? Would users be very upset by being unable to use a few keywords? In my case, if I saw an error message saying "can't use this keyword", I would just change it to something else. But I am only one person, so I'm not sure.

andar1an · 2023-11-03T15:42:28Z

text/0001-remove_ambiguous_grammar.md

+
+This can be parsed either as an object with a field `a` and value `b.c`, or as a block statement taking the field `c` of the record `a:b`. Removing this ambiguity completely would require making significant changes to how record IDs are joined (changing the `:` to something else) or requiring that either blocks or objects always be enclosed by parentheses. Both of these changes are quite drastic for a relatively minor ambiguity.
+
+Therefore, I propose that if the parser encounters `{` `Identifier` `:`, it raises a syntax error to notify the user of the ambiguity. The user would then need to either use a record ID strand if they intended to create a block, or make the identifier a strand if they intended to create an object.


To visualize, would it be like:

{ r"some_record_id":b.c }

or

{ a: u"some_uuid" }

This ambiguity is unrelated to the ambiguity of strands.

This ambiguity is caused by the fact that raw records are allowed: you can, for example, write `CREATE some_table:some_specific_id;`. The problem is the `:`. If a raw record is the first statement in a block statement, it looks exactly like an object.

In the example `{ a:b.c }`, `a:b` looks like a raw record, but it could also be a field definition with a value.

Raw records will still be allowed under these proposed changes, so we need a way to define the value of `{ a:b.c }`. Currently the parser first tries to match an object and then tries to match a block statement. This results in strange behavior: `{a:b.c; true }` is parsed as a block, but `{a:b.c }` is parsed as an object.

The proposed change here is to define that if the parser sees `{` followed by an identifier followed by `:`, it will always choose to parse an object, and will error if it encounters `{a:b.c; true }`. To make this statement work again you could use a record strand: `{ r"a:b".c; true }`.
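A rough Python sketch of that lookahead rule, as I read it from this comment (it is not the actual parser; `tokenize`, `classify_brace`, and the token pattern are all hypothetical simplifications):

```python
import re

# Hypothetical, simplified token pattern: r"..." record strands, plain
# strands, identifiers, and single-character punctuation.
TOKEN = re.compile(r'\s*(r"[^"]*"|"[^"]*"|[A-Za-z_][A-Za-z0-9_]*|[{}:;.,])')
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def tokenize(src: str) -> list[str]:
    tokens, pos = [], 0
    src = src.rstrip()
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected input at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def classify_brace(src: str) -> str:
    """Decide 'object' or 'block' from the first tokens after `{`."""
    tokens = tokenize(src)
    if not tokens or tokens[0] != "{":
        raise SyntaxError("expected `{`")
    is_key = len(tokens) >= 3 and (
        IDENT.fullmatch(tokens[1]) or tokens[1].startswith('"')
    )
    if is_key and tokens[2] == ":":
        # Committed to an object: a later `;` is an error rather than a
        # reason to re-parse as a block. (Crude check: a real parser would
        # only reject a `;` at the top level of the object.)
        if ";" in tokens:
            raise SyntaxError('ambiguous `{ ident :`; use a record strand like r"a:b"')
        return "object"
    return "block"
```

Under these assumptions, `{ a:b.c }` classifies as an object, `{ r"a:b".c; true }` as a block, and `{a:b.c; true }` raises a syntax error instead of silently switching interpretation.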

Ok, I think I see now. So the last code snippet at the end of your comment was what I was trying to understand.

So the 2 options you mentioned in the rfc would be visualized as { r"a:b".c; true } or { a:r"b.c"; true }?

andar1an · 2023-11-03T15:46:41Z

text/0001-remove_ambiguous_grammar.md

+
+- Some queries now require more type and syntax than they previously did.
+- These changes break existing code.
+- These new changes could be more difficult to learn initially, as they introduce more syntax rules to remember.


If error messaging is concise and explicit as you proposed, I don't think this would be too bad to learn, vs. hitting what seem like common ambiguity problems.

I agree! but it is still kind of a drawback.

Not arguing there, just commenting that in my personal opinion I think the benefit would outweigh the drawback :).

andar1an · 2023-11-03T15:48:07Z

text/0001-remove_ambiguous_grammar.md

+
+## Limited reserved word list
+
+Instead of location specific keywords we could instead specify a limited list of keywords which are disallowed. Keywords which can't start a statement like `EVENT` or `TABLE` would still be allowed, but `USE` would be disallowed everywhere.
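As a small illustration of this alternative, here is a Python sketch of a position-independent reserved word check. `RESERVED` and `check_raw_identifier` are hypothetical names; the set contains only `USE` because that is the only disallowed word the RFC text names, and the case-insensitive comparison is an assumption.

```python
# Only `USE` is named in the RFC text as disallowed everywhere;
# a real list would presumably be longer.
RESERVED = {"USE"}

def check_raw_identifier(name: str) -> str:
    """Return the identifier unchanged, or raise if it is reserved everywhere."""
    # Assumption: keywords are matched case-insensitively.
    if name.upper() in RESERVED:
        raise SyntaxError(f"`{name}` is a reserved word and cannot be used as a raw identifier")
    return name
```

The check needs no knowledge of where in the grammar the identifier appears, which is the main simplification compared with location-specific reserved words.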


❤️ - I personally prefer this to #3. I should have kept reading before leaving that comment! haha

andar1an · 2023-11-03T15:52:40Z

text/0001-remove_ambiguous_grammar.md

+The `knn<3>` operator seems a bit verbose and rather complex-looking for an operator.
+Is this the right syntax?
+
+## How do we differentiate a plain object from a geometry object?


I am very curious about this. I am not too sure what is currently considered a plain object and a geometry object. Are there any examples?

Ah, I see a Geometry example at the bottom of page.

Is there potential that one may want to do topological analysis on plain objects like vectors or matrices, because then maybe with Geometry as a supertype one can share topology methods with plain objects?

Geometry objects are objects which have a very specific layout. The following query returns one kind of geometry object.

RETURN { type: 'LineString', coordinates: [1,2,3,4] }

As you can see, it looks exactly like a normal object, but if you change even a small part, for example:

RETURN { type: 'LinesString', coordinates: [1,2,3,4] }

(Lines instead of Line) it is suddenly a plain object and no longer a geometry.
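To make the layout check concrete, here is a minimal Python sketch of how such a classification could work. It is only an illustration: `looks_like_geometry` is a hypothetical helper, the type names are the GeoJSON geometry names, and whether SurrealDB recognises exactly this set (or validates the coordinates further) is an assumption.

```python
# GeoJSON geometry type names; whether SurrealDB recognises exactly this
# set is an assumption made for illustration.
GEOMETRY_TYPES = {
    "Point", "LineString", "Polygon",
    "MultiPoint", "MultiLineString", "MultiPolygon",
}

def looks_like_geometry(obj: dict) -> bool:
    """True if the object has exactly the geometry layout: type + coordinates."""
    return (
        set(obj) == {"type", "coordinates"}
        and isinstance(obj["type"], str)
        and obj["type"] in GEOMETRY_TYPES
        and isinstance(obj["coordinates"], list)
    )
```

A one-letter typo in `type` is enough to fall out of the geometry set, so the value silently becomes a plain object rather than producing an error.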

I shall add this example to the RFC.

I think considering making the plain object the supertype may also be an option: one may be able to draw some tangents from geometry kernels and languages like apt for example: https://en.wikipedia.org/wiki/APT_(programming_language).

At the end of the day Geometry is comprised of plain objects. And this way plain objects are limited to methods that work for them, and geometry objects can be expanded.

Are there any semantic differences between geometry objects and plain objects (like operations)?
If not, then maybe the best thing to do is to treat geometries like plain objects by default and allow for defining the field as otherwise.
This would mean that an application like Surrealist can get information about the table to know how to display those fields.
We can even copy the record<table> syntax as object<schema>. There is even the potential to allow for user-defined schemas to validate against using a URI (maybe like object<"http://example.org/bibliography.schema.json">).

So for a geometry object this might look like:
DEFINE FIELD position ON TABLE location_history TYPE object<geometry>;

Oh, guess there are operators specialized for geometry objects. Maybe it would be better to move these operators to functions.
For example:

(-0.118092, 51.509865) INSIDE { type: "Polygon", coordinates: [[ [-0.38314819, 51.37692386], [0.1785278, 51.37692386], [0.1785278, 51.61460570], [-0.38314819, 51.61460570], [-0.38314819, 51.37692386] ]] };

Could be:

geometry::inside((-0.118092, 51.509865), { type: "Polygon", coordinates: [[ [-0.38314819, 51.37692386], [0.1785278, 51.37692386], [0.1785278, 51.61460570], [-0.38314819, 51.61460570], [-0.38314819, 51.37692386] ]] });

Then the INSIDE operator could be generalized to plain objects. For example:

{ "foo": "bar" } INSIDE { "foo": "bar", "biz": "fiz" }
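One possible reading of a generalized `INSIDE` on plain objects is a subset test: every field of the left object must appear in the right object with an equal value. That semantics is my assumption; it is not defined anywhere in this thread. A Python sketch with a hypothetical `inside` helper:

```python
def inside(needle: dict, haystack: dict) -> bool:
    """True if every key/value pair of `needle` also appears in `haystack`."""
    # Assumed semantics: shallow comparison of fields, no special geometry handling.
    return all(k in haystack and haystack[k] == v for k, v in needle.items())
```

Under that reading the example above would evaluate to true, while swapping the two operands would not.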

andar1an · 2023-11-03T16:22:35Z

text/0001-remove_ambiguous_grammar.md

+{ a:b.c }
+```
+
+This can be parsed either as an object with a field `a` and value `b.c`, or as a block statement taking the field `c` of the record `a:b`. Removing this ambiguity completely would require making significant changes to how record IDs are joined (changing the `:` to something else) or requiring that either blocks or objects always be enclosed by parentheses. Both of these changes are quite drastic for a relatively minor ambiguity.


Is this a minor ambiguity? It seems like you can get completely different results from the same syntax? Or am I understanding wrong?

I called it minor because I don't think there are a lot of situations in which this ambiguity happens, but you are correct that you can get completely different results for the same syntax.


JonahPlusPlus · 2024-01-02T15:43:17Z

Another ambiguity to think about: is the following a block or an empty object?

{}

The easiest solution (and what SurrealDB does right now) is to treat braces as objects until they can't be. It makes sense and I think it is the current right move, but it is a potential source of confusion (a bit more if block expressions get added to SurQL).

DelSkayn · 2024-01-26T15:54:13Z

> Another ambiguity to think about: is the following a block or an empty object?
>
> {}
>
> The easiest solution (and what SurrealDB does right now) is to treat braces as objects until they can't be. It makes sense and I think it is the current right move, but it is a potential source of confusion (a bit more if block expressions get added to SurQL).

`{}` isn't handled by the proposal specifically, but I think we should just handle it as an empty object, since an empty block statement is not useful. Apart from that, the RFC says that `{` will start an object, and not a block, when the next tokens are an object-key-like token followed by `:`. Otherwise it will be considered a block statement.

This is, I think, the minimal requirement to make objects and statements distinct. It will cause an error when a record id is the first item in a block statement, but you can use a record id string to solve that case.

I think this way of distinguishing between objects and block statements has less potential to lead to confusing situations where an object is suddenly parsed as a block statement because a user made a mistake while writing a query.

introduce remove-ambiguous-grammar rfc (a987efc)

DelSkayn mentioned this pull request Oct 24, 2023: Introduce new experimental parser surrealdb/surrealdb#2885 (Merged, 11 tasks)

Fix some formatting (f43a47c)

andar1an reviewed Nov 3, 2023

Move glossary, add open question about geometry objects (3c9badc)

andar1an reviewed Nov 3, 2023

Improve some explanations (47fa19f)

andar1an reviewed Nov 3, 2023

DelSkayn added 2 commits November 3, 2023 17:09

Add explanation of geometry objects. (1a6a21e)

Remove lines left behind. (b94acbd)

andar1an reviewed Nov 3, 2023

kearfy reviewed Dec 6, 2023

kearfy reviewed Dec 6, 2023

Apply suggestions from code review (5c8d953)

Co-authored-by: Micha de Vries <mt.dev@hotmail.com>
