Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce remove ambiguous grammar RFC #1

Draft
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

DelSkayn
Copy link
Member

@DelSkayn DelSkayn commented Oct 24, 2023

This PR introduces the remove ambiguous grammar RFC.

Rendered

This RFC is closely related to a PR on the main repository: surrealdb/surrealdb#2885

RFC Summary

The current version of SurrealQL grammar, as defined by what the parser currently accepts, contains several ambiguous productions. These productions are parsed differently depending on the context or can seem very similar to each other, but subtle differences can result in completely different semantics.

These ambiguities in the grammar complicate parser design, limit possible future extensions to SurrealQL that don't break, and could be confusing when using the language.

This RFC proposes several changes to the grammar to limit the present ambiguity:

  • Disallow a value from starting with an identifier which could, in its current position in the grammar, start a statement as a keyword.
  • Require that raw identifiers don't start with a digit.
  • Introduce strand prefixes for the specific types of strands
  • Introduce a syntax error for block record-id object ambiguity
  • Change the KNN operator from <3> to knn<3>

@andar1an
Copy link

andar1an commented Nov 3, 2023

I was uncertain what you meant by parser ambiguity, and am really glad you included examples and a thoughtful proposal layout because this ambiguity would definitely nip me as a user eventually.


And example is `"5:00"` which brought up in an issue. The user wanted to store this value as a plain strand but it happened to match a thing strand.

I propose we introduce specific strand prefixes for specific strand types:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: will single quotes or double quotes matter? I notice uuid uses single quotes and the other 2 examples use double. Assuming either will be acceptable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes either is acceptable, the only difference is the u or t in front of the string.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about ULIDs?? Taking a glance at SurrealDB's code, it seems that they are handled quite differently than UUIDs. Are they stored differently as well?

Also, I'm apprehensive about the record strand prefix.
I get it for datetime and UUIDs, but for solving the ambiguity between record IDs and object fields, I think the object syntax should change.

I think the way to go is to make objects have fields that are string literals.
(In the future there could be format strings, which would allow for dynamically setting the field name)

So, for your example:

{ a:b.c }

Is taking the field c of a record a:b.
To express an object with field a with a value of b.c, you would instead write:

{ "a":b.c }

This also mirrors the rules in JSON.


## Location specific reserved words.

The first change proposed is to disallow raw identifiers in places where the same identifier could also be parsed as a keyword. For example, the following code would no longer be allowed:
Copy link

@andar1an andar1an Nov 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could it be more simple to just disallow certain keywords, and not allow for it to be wrapped? Would users be very upset by being unable to use a few keywords? In my case, if I saw an error message saying "can't use this keyword", I would just change it to something else. But I am only 1 person so not sure.


This can be parsed either as an object with a field `a` and value `b.c`, or as a block statement taking the field `c` of the record `a:b`. Removing this ambiguity completely would require making significant changes to how record IDs are joined (changing the `:` to something else) or requiring that either blocks or objects always be enclosed by parentheses. Both of these changes are quite drastic for a relatively minor ambiguity.

Therefore, I propose that if the parser encounters `{` `Identifier` `:`, it raises a syntax error to notify the user of the ambiguity. The user would then need to either use a record ID strand if they intended to create a block, or make the identifier a strand if they intended to create an object.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To visualize, would it be like:

{ r"some_record_id":b.c  }

or

{ a: u"some_uuid" }

Copy link
Member Author

@DelSkayn DelSkayn Nov 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ambiguity is unrelated to the ambiguity of strands.

This is a ambiguity is caused by the fact that raw records are allowed. you can for example write CREATE some_table:some_specific_id;. The problem is the :. If a raw record is the first statement in an block statement it looks exactly like an object.

In the example { a:b.c }, a:b looks like a raw record but it could also be a field definition with a value.

Raw records will still be allowed in these proposed changes, so we need a way to define the value of { a:b.c }. Currently the parse will first try to match an object and then try to match a block statement. This results in strange behavior like {a:b.c; true } being parsed as an block but {a:b.c } is parsed as an object.

The proposed change here is to define that if the parsers sees { followed by and identifier followed by : it will always choose to parse an object and error if it encounters {a:b.c; true }. To make this statement work again you could use a record strand: { r"a:b".c; true }.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I think I see now. So the last code snippet at the end of your comment was what I was trying to understand.

So the 2 options you mentioned in the rfc would be visualized as { r"a:b".c; true } or { a:r"b.c"; true }?


- Some queries now require more type and syntax than they previously did.
- These changes break existing code.
- These new changes could be more difficuly to learn initially as it introduces more syntax rules to remember.
Copy link

@andar1an andar1an Nov 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If error messaging is concise and explicit as you proposed, I don't think this would be too bad to learn, vs. hitting what seem like common ambiguity problems.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree! but it is still kind of a drawback.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not arguing there, just commenting that in my personal opinion I think the benefit would outweigh the drawback :).


## Limited reserved word list

Instead of location specific keywords we could instead specify a limited list of keywords which are disallowed. Keywords which can't start a statement like `EVENT` or `TABLE` would still be allowed, but `USE` would be disallowed everywhere.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️ - I personally prefer this to #3. I should have kept reading before leaving that comment! haha

The `knn<3>` operator seems a bit verbose and rather complex looking for an operator.
Is this the right syntax.

## How do we differentiate an plain object from a geometry object?
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am very curious about this. I am not too sure what is currently considered a plain object and a geometry object. Are there any examples?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see a Geometry example at the bottom of page.

Is there potential that one may want to do topological analysis on plain objects like vectors or matrices, because then maybe with Geometry as a supertype one can share topology methods with plain objects?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Geometry objects are objects which have a very specific layout. The following query returns one kind of geometry object.

RETURN { 
    type: 'LineString',
    coordinates: [1,2,3,4]
}

As you can see it looks exactly like an normal object but if you change even a small part like for example:

RETURN { 
    type: 'LinesString',
    coordinates: [1,2,3,4]
}

(Lines instead of Line) it is suddenly a plain object and no longer a geometry.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I shall add this example to the rfc

Copy link

@andar1an andar1an Nov 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think considering making the plain object the supertype may also be an option: one may be able to draw some tangents from geometry kernels and languages like apt for example: https://en.wikipedia.org/wiki/APT_(programming_language).

At the end of the day Geometry is comprised of plain objects. And this way plain objects are limited to methods that work for them, and geometry objects can be expanded.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any semantic differences between geometry objects and plain objects (like operations)?
If not, then maybe the best thing to do is to treat geometries like plain objects by default and allow for defining the field as otherwise.
This would mean that an application like Surrealist can get information about the table to know how to display those fields.
We can even copy the record<table> syntax as object<schema>. There is even the potential to allow for user-defined schemas to validate against using a URI (maybe like object<"http://example.org/bibliography.schema.json">).

So for a geometry object this might look like:
DEFINE FIELD position ON TABLE location_history TYPE object<geometry>;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, guess there are operators specialized for geometry objects. Maybe it would be better to move these operators to functions.
For example:

(-0.118092, 51.509865) INSIDE {
    type: "Polygon",
    coordinates: [[
        [-0.38314819, 51.37692386], [0.1785278, 51.37692386],
        [0.1785278, 51.61460570], [-0.38314819, 51.61460570],
        [-0.38314819, 51.37692386]
    ]]
};

Could be:

geometry::inside((-0.118092, 51.509865), {
    type: "Polygon",
    coordinates: [[
        [-0.38314819, 51.37692386], [0.1785278, 51.37692386],
        [0.1785278, 51.61460570], [-0.38314819, 51.61460570],
        [-0.38314819, 51.37692386]
    ]]
});

Then the INSIDE operator could be generalized to plain objects. For example:

{ "foo": "bar" } INSIDE { "foo: "bar", "biz": "fiz" }

{ a:b.c }
```

This can be parsed either as an object with a field `a` and value `b.c`, or as a block statement taking the field `c` of the record `a:b`. Removing this ambiguity completely would require making significant changes to how record IDs are joined (changing the `:` to something else) or requiring that either blocks or objects always be enclosed by parentheses. Both of these changes are quite drastic for a relatively minor ambiguity.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a minor ambiguity? It seems like you can get completely different results from the same syntax? Or am I understanding wrong?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I called it minor, because I don't think there are a lot of situation in which this ambiguity happens, but you are correct that you can get completely different results for the same syntax.

Co-authored-by: Micha de Vries <mt.dev@hotmail.com>
@JonahPlusPlus
Copy link

JonahPlusPlus commented Jan 2, 2024

Another ambiguity to think about: is the following a block or an empty object?

{}

The easiest solution (and what SurrealDB does right now) is to treat braces as objects until they can't be. It makes sense and I think it is the current right move, but it is a potential source of confusion (a bit more if block expressions get added to SurQL).

@DelSkayn
Copy link
Member Author

DelSkayn commented Jan 26, 2024

Another ambiguity to think about: is the following a block or an empty object?

{}

The easiest solution (and what SurrealDB does right now) is to treat braces as objects until they can't be. It makes sense and I think it is the current right move, but it is a potential source of confusion (a bit more if block expressions get added to SurQL).

{} isn't handled by the proposal specifically but I think we should just handle it as an empty object as an empty block statement is not useful. Apart from that, the RFC says that { will start an object, and not a block, when it sees that the next tokens are { any object key like token :. Otherwise it will consider it a block statement.

This is I think the most minimal requirement to make objects and statements distinct. This will cause an error when a record id is the first item in a block statement, but you can use a record id string to solve that case.

I think this way of distinguishing between objects and block statements has less potential to lead to confusing situations where an object suddenly is parsed as a block statement because a user made a mistake while writing an query.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants