Introduce remove ambiguous grammar RFC #1
I was uncertain what you meant by parser ambiguity, and am really glad you included examples and a thoughtful proposal layout, because this ambiguity would definitely nip me as a user eventually.
An example is `"5:00"`, which was brought up in an issue. The user wanted to store this value as a plain strand, but it happened to match a thing strand.
I propose we introduce specific strand prefixes for specific strand types:
❤️
question: will single quotes or double quotes matter? I notice uuid uses single quotes and the other 2 examples use double. Assuming either will be acceptable?
Yes, either is acceptable; the only difference is the `u` or `t` in front of the string.
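To make the prefix idea concrete, here is a minimal Python sketch of how a strand literal could be classified by its prefix alone, so that `"5:00"` stays a plain strand. The helper `classify_strand` and the mapping are hypothetical: `u` and `r` appear in this thread for UUIDs and records, while reading `t` as the datetime prefix is an assumption drawn from the discussion, not from the RFC text.

```python
# Hypothetical mapping from strand prefix to strand kind.
# `t` -> datetime is an assumption; `u` (UUID) and `r` (record) appear in this thread.
PREFIX_KINDS = {"u": "uuid", "t": "datetime", "r": "record"}
QUOTES = ("'", '"')


def classify_strand(literal: str) -> str:
    """Classify a strand literal by its prefix only, never by its contents."""
    # Prefixed strand: one prefix letter, then a quoted body with matching quotes.
    if (
        len(literal) >= 3
        and literal[0] in PREFIX_KINDS
        and literal[1] in QUOTES
        and literal.endswith(literal[1])
    ):
        return PREFIX_KINDS[literal[0]]
    # Unprefixed strand: always plain, even if the body looks like a record or datetime.
    if len(literal) >= 2 and literal[0] in QUOTES and literal.endswith(literal[0]):
        return "plain"
    raise ValueError(f"not a strand literal: {literal!r}")
```

Under this sketch single and double quotes behave the same, and the body of the strand is never inspected, which is the property that removes the ambiguity.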
What about ULIDs?? Taking a glance at SurrealDB's code, it seems that they are handled quite differently than UUIDs. Are they stored differently as well?
Also, I'm apprehensive about the record strand prefix.
I get it for datetime and UUIDs, but for solving the ambiguity between record IDs and object fields, I think the object syntax should change.
I think the way to go is to make objects have fields that are string literals.
(In the future there could be format strings, which would allow for dynamically setting the field name)
So, for your example:
{ a:b.c }
Is taking the field `c` of a record `a:b`.
To express an object with field `a` with a value of `b.c`, you would instead write:
{ "a":b.c }
This also mirrors the rules in JSON.
## Location specific reserved words
The first change proposed is to disallow raw identifiers in places where the same identifier could also be parsed as a keyword. For example, the following code would no longer be allowed:
Could it be simpler to just disallow certain keywords, and not allow them to be wrapped? Would users be very upset by being unable to use a few keywords? In my case, if I saw an error message saying "can't use this keyword", I would just change it to something else. But I am only 1 person, so not sure.
This can be parsed either as an object with a field `a` and value `b.c`, or as a block statement taking the field `c` of the record `a:b`. Removing this ambiguity completely would require making significant changes to how record IDs are joined (changing the `:` to something else) or requiring that either blocks or objects always be enclosed by parentheses. Both of these changes are quite drastic for a relatively minor ambiguity.
Therefore, I propose that if the parser encounters `{` `Identifier` `:`, it raises a syntax error to notify the user of the ambiguity. The user would then need to either use a record ID strand if they intended to create a block, or make the identifier a strand if they intended to create an object.
To visualize, would it be like:
{ r"some_record_id":b.c }
or
{ a: u"some_uuid" }
This ambiguity is unrelated to the ambiguity of strands.
This ambiguity is caused by the fact that raw records are allowed; you can, for example, write `CREATE some_table:some_specific_id;`. The problem is the `:`. If a raw record is the first statement in a block statement, it looks exactly like an object.
In the example `{ a:b.c }`, `a:b` looks like a raw record, but it could also be a field definition with a value.
Raw records will still be allowed in these proposed changes, so we need a way to define the value of `{ a:b.c }`. Currently the parser will first try to match an object and then try to match a block statement. This results in strange behavior like `{a:b.c; true }` being parsed as a block but `{a:b.c }` being parsed as an object.
The proposed change here is to define that if the parser sees `{` followed by an identifier followed by `:`, it will always choose to parse an object, and error if it encounters `{a:b.c; true }`. To make this statement work again you could use a record strand: `{ r"a:b".c; true }`.
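The rule described in this comment can be sketched as a small lookahead check. This is an illustrative Python model only, not the SurrealDB parser: the tokenizer, `classify_brace`, and `AmbiguousBraceError` are hypothetical names, the token set is deliberately tiny, and empty braces are not handled.

```python
import re

# Tiny tokenizer for the examples in this thread (hypothetical, far from full SurrealQL).
TOKEN = re.compile(r'\s*(r?"[^"]*"|[A-Za-z_][A-Za-z0-9_]*|[0-9]+|[{}:;.,])')
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


class AmbiguousBraceError(Exception):
    """Raised when `{ identifier :` turns out to be followed by block syntax."""


def tokenize(src: str) -> list[str]:
    src = src.rstrip()
    tokens, pos = [], 0
    while pos < len(src):
        match = TOKEN.match(src, pos)
        if match is None:
            raise ValueError(f"unexpected input at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def has_top_level_semicolon(tokens: list[str]) -> bool:
    # Only a `;` directly inside the outermost braces marks a block statement.
    depth = 0
    for tok in tokens:
        if tok == "{":
            depth += 1
        elif tok == "}":
            depth -= 1
        elif tok == ";" and depth == 1:
            return True
    return False


def classify_brace(src: str) -> str:
    """Decide between "object" and "block" from the first tokens after `{`."""
    tokens = tokenize(src)
    if len(tokens) < 3 or tokens[0] != "{":
        raise ValueError("expected `{` followed by at least two tokens")
    key, colon = tokens[1], tokens[2]
    if colon == ":" and key.startswith('"'):
        return "object"  # plain strand key, e.g. { "a": b.c }
    if colon == ":" and IDENTIFIER.fullmatch(key):
        # Proposed rule: always commit to an object, then reject block syntax.
        if has_top_level_semicolon(tokens):
            raise AmbiguousBraceError(
                'starts like an object; use a record strand such as r"a:b" for a block'
            )
        return "object"
    return "block"  # e.g. { r"a:b".c; true }
```

With this model `{ a:b.c }` is always an object, `{a:b.c; true }` is rejected instead of silently becoming a block, and `{ r"a:b".c; true }` is a block.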
Ok, I think I see now. So the last code snippet at the end of your comment was what I was trying to understand.
So the 2 options you mentioned in the rfc would be visualized as `{ r"a:b".c; true }` or `{ a:r"b.c"; true }`?
- Some queries now require more typing and syntax than they previously did.
- These changes break existing code.
- These new changes could be more difficult to learn initially, as they introduce more syntax rules to remember.
If error messaging is concise and explicit as you proposed, I don't think this would be too bad to learn, vs. hitting what seem like common ambiguity problems.
I agree! but it is still kind of a drawback.
Not arguing there, just commenting that in my personal opinion I think the benefit would outweigh the drawback :).
## Limited reserved word list
Instead of location specific keywords, we could specify a limited list of keywords which are disallowed. Keywords which can't start a statement, like `EVENT` or `TABLE`, would still be allowed, but `USE` would be disallowed everywhere.
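As a rough illustration of this alternative, the check reduces to a lookup against one fixed list. The Python sketch below is hypothetical: the RFC text here names only `USE` as disallowed, so the list contains just that entry, and case-insensitive matching is an assumption.

```python
# Hypothetical reserved list; the RFC text above only names USE explicitly.
RESERVED_WORDS = frozenset({"USE"})


def check_identifier(name: str) -> str:
    """Return the identifier unchanged, or reject it if it is reserved everywhere."""
    # Assumption: keywords are matched case-insensitively.
    if name.upper() in RESERVED_WORDS:
        raise ValueError(f"can't use the keyword {name!r} as an identifier")
    return name
```

Here `check_identifier("EVENT")` and `check_identifier("TABLE")` pass, because those keywords cannot start a statement, while `check_identifier("use")` is rejected regardless of where it appears.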
❤️ - I personally prefer this to #3. I should have kept reading before leaving that comment! haha
The `knn<3>` operator seems a bit verbose and rather complex-looking for an operator.
Is this the right syntax?
## How do we differentiate a plain object from a geometry object?
I am very curious about this. I am not too sure what is currently considered a plain object and a geometry object. Are there any examples?
Ah, I see a Geometry example at the bottom of the page.
Is there potential that one may want to do topological analysis on plain objects like vectors or matrices, because then maybe with Geometry as a supertype one can share topology methods with plain objects?
Geometry objects are objects which have a very specific layout. The following query returns one kind of geometry object.
RETURN {
type: 'LineString',
coordinates: [1,2,3,4]
}
As you can see, it looks exactly like a normal object, but if you change even a small part, for example:
RETURN {
type: 'LinesString',
coordinates: [1,2,3,4]
}
(`Lines` instead of `Line`), it is suddenly a plain object and no longer a geometry.
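The shape-based detection described above can be modelled with a short Python sketch. It is a simplification under stated assumptions: `looks_like_geometry` is a hypothetical helper, the type names are taken from GeoJSON, and only the `type` value and the presence of `coordinates` are checked, whereas a real implementation would also validate the coordinate layout.

```python
# Assumed set of geometry type names, following GeoJSON.
GEOMETRY_TYPES = frozenset(
    {"Point", "LineString", "Polygon", "MultiPoint", "MultiLineString", "MultiPolygon"}
)


def looks_like_geometry(obj: dict) -> bool:
    """Treat an object as a geometry only if its layout matches exactly."""
    kind = obj.get("type")
    # Simplification: the shape of `coordinates` is not validated here.
    return isinstance(kind, str) and kind in GEOMETRY_TYPES and "coordinates" in obj
```

Applied to the two queries above, the `LineString` object is classified as a geometry and the `LinesString` one as a plain object, which shows how a one-character typo silently changes the type of the value.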
I shall add this example to the rfc
I think making the plain object the supertype may also be an option: one may be able to draw some parallels with geometry kernels and languages like APT, for example: https://en.wikipedia.org/wiki/APT_(programming_language).
At the end of the day Geometry is comprised of plain objects. And this way plain objects are limited to methods that work for them, and geometry objects can be expanded.
Are there any semantic differences between geometry objects and plain objects (like operations)?
If not, then maybe the best thing to do is to treat geometries like plain objects by default and allow for defining the field as otherwise.
This would mean that an application like Surrealist can get information about the table to know how to display those fields.
We can even copy the `record<table>` syntax as `object<schema>`. There is even the potential to allow for user-defined schemas to validate against using a URI (maybe like `object<"http://example.org/bibliography.schema.json">`).
So for a geometry object this might look like:
DEFINE FIELD position ON TABLE location_history TYPE object<geometry>;
Oh, guess there are operators specialized for geometry objects. Maybe it would be better to move these operators to functions.
For example:
(-0.118092, 51.509865) INSIDE {
type: "Polygon",
coordinates: [[
[-0.38314819, 51.37692386], [0.1785278, 51.37692386],
[0.1785278, 51.61460570], [-0.38314819, 51.61460570],
[-0.38314819, 51.37692386]
]]
};
Could be:
geometry::inside((-0.118092, 51.509865), {
type: "Polygon",
coordinates: [[
[-0.38314819, 51.37692386], [0.1785278, 51.37692386],
[0.1785278, 51.61460570], [-0.38314819, 51.61460570],
[-0.38314819, 51.37692386]
]]
});
Then the `INSIDE` operator could be generalized to plain objects. For example:
{ "foo": "bar" } INSIDE { "foo": "bar", "biz": "fiz" }
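The comment does not define what `INSIDE` would mean for plain objects. One possible reading, sketched in Python below, is a subset test: every field of the left object must appear in the right object with an equal value. Both the function name `object_inside` and this semantics are assumptions made for illustration.

```python
def object_inside(inner: dict, outer: dict) -> bool:
    """Assumed semantics: every key/value pair of `inner` also appears in `outer`."""
    return all(key in outer and outer[key] == value for key, value in inner.items())
```

Under this reading the example above evaluates to true, while swapping the two operands gives false because `biz` is missing from the smaller object.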
```
{ a:b.c }
```
This can be parsed either as an object with a field `a` and value `b.c`, or as a block statement taking the field `c` of the record `a:b`. Removing this ambiguity completely would require making significant changes to how record IDs are joined (changing the `:` to something else) or requiring that either blocks or objects always be enclosed by parentheses. Both of these changes are quite drastic for a relatively minor ambiguity.
Is this a minor ambiguity? It seems like you can get completely different results from the same syntax? Or am I understanding wrong?
I called it minor because I don't think there are a lot of situations in which this ambiguity happens, but you are correct that you can get completely different results for the same syntax.
Co-authored-by: Micha de Vries <mt.dev@hotmail.com>
Another ambiguity to think about: is the following a block or an empty object?
The easiest solution (and what SurrealDB does right now) is to treat braces as objects until they can't be. It makes sense and I think it is currently the right move, but it is a potential source of confusion (a bit more so if block expressions get added to SurQL).
This is, I think, the most minimal requirement to make objects and statements distinct. This will cause an error when a record ID is the first item in a block statement, but you can use a record ID strand to solve that case. I think this way of distinguishing between objects and block statements has less potential to lead to confusing situations where an object is suddenly parsed as a block statement because a user made a mistake while writing a query.
This PR introduces the remove ambiguous grammar RFC.
This RFC is closely related to a PR on the main repository: surrealdb/surrealdb#2885
RFC Summary
The current version of SurrealQL grammar, as defined by what the parser currently accepts, contains several ambiguous productions. These productions are parsed differently depending on the context or can seem very similar to each other, but subtle differences can result in completely different semantics.
These ambiguities in the grammar complicate parser design, limit the future extensions to SurrealQL that can be made without breaking existing queries, and could be confusing when using the language.
This RFC proposes several changes to the grammar to limit the present ambiguity:
`<3>` to `knn<3>`