Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve oneof support #40

Open
wants to merge 11 commits into
base: master
Choose a base branch
from
Open

Conversation

m-o-e
Copy link
Contributor

@m-o-e m-o-e commented Jul 16, 2021

Given a message:

message Foo {
  oneof timestamp {
    int64 created;
    int64 updated;
  }
}

Previously the generated classes by crystal-protoc-gen
would allow to set multiple values of the oneof:

foo.created = 123
foo.updated = 345

foo.created # => 123
foo.updated # => 345

With this patch the sibling oneof-values
will be cleared when one is set:

foo.created = 123
foo.updated = 345

foo.created # => nil
foo.updated # => 345

Also when deserializing a message that contains
oneofs there was previously no way of knowing
which oneof value is populated.

With this patch we synthesize a new instance
variable by the name of the oneof to contain
the string-name of the member that is populated:

foo.created = 123
foo.updated = 345

foo.created # => nil
foo.updated # => 345
foo.timestamp # => "updated"
foo[foo.timestamp] # => 345

Given a message:

```
message Foo {
  oneof timestamp {
    int64 created;
    int64 updated;
  }
}
```

Previously the generated classes by crystal-protoc-gen
would allow to set multiple values of the oneof:

```
foo.created = 123
foo.updated = 345

foo.created # => 123
foo.updated # => 345
```

With this patch the sibling oneof-values
will be cleared when one is set:

```
foo.created = 123
foo.updated = 345

foo.created # => nil
foo.updated # => 345
```

Also when deserializing a message that contains
oneofs there was previously no way of knowing
which oneof value is populated.

With this patch we synthesize a new instance
variable by the name of the oneof to contain
the string-name of the member that is populated:

```
foo.created = 123
foo.updated = 345

foo.created # => nil
foo.updated # => 345
foo.timestamp # => "updated"
foo[foo.timestamp] # => 345
```
Copy link
Collaborator

@jgaskins jgaskins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's awesome that you're trying to fix this up! I've somehow never had to use oneof before, so I'm a bit less familiar with it. I had a look at the docs and it looks like unsetting sibling values matches the behavior of the Google-owned libraries, so this is great.

Just some notes about details, which might require some changes, but the premise is great. Also, if you could add the .proto files, that'd be awesome.

Comment on lines +1 to +25
#require "protobuf"

module TestMessagesProto2

struct TestOneof1
include Protobuf::Message

contract do
optional :a, :int32, 1

optional :oo0_a, :int32, 2, oneof_index: 0
optional :oo0_b, :int32, 3, oneof_index: 0
optional :oo0_c, :int32, 4, oneof_index: 0

optional :b, :int32, 5

optional :oo1_a, :int32, 6, oneof_index: 1
optional :oo1_b, :int32, 7, oneof_index: 1
optional :oo1_c, :int32, 8, oneof_index: 1

oneof 0, "oo0"
oneof 1, "oo1"
end
end
end
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you may have forgotten to commit the TestOneofMessage.proto here. That'd be helpful to understand what all of the oo0_* properties come from. 🙂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sorry, I wrote that above file by hand, didn't use an input proto.
(the layout of the fixtures wasn't entirely clear to me so I just did what worked 😅)

The corresponding proto file would look like this:

syntax = "proto3";

message TestOneof1 {
  int32 a = 1;
  
  oneof oo0 {
     int32 oo0_a = 2;
     int32 oo0_b = 3;
     int32 oo0_c = 4;
  }
  
  int32 b = 5;

  oneof oo1 {
     int32 oo1_a = 6;
     int32 oo1_b = 7;
     int32 oo1_c = 8;
  }
}

The goal was to have more than one oneof and have it
interleaved with other fields for a meaningful test.

Lemme know if I should add that proto to the repo somewhere
(perhaps as a comment on top of TestOneofMessage.pb.cr? 🤔).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, we'd have it there to verify that the generator generates working Crystal code.

Copy link
Contributor Author

@m-o-e m-o-e Jul 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well spec/protobuf/generator_spec.cr is empty. 😬
I wasn't sure how these existing fixture files were generated, so I just put mine next to them.

I don't think there's currently anything in the specs that actually invokes the generator,
so not sure where I should put that above proto source. Just drop it into spec/fixtures/?

repeated :enum, SomeEnum, 15
# "enum" is a reserved keyword in
# crystal so we use "my_enum" for this one
repeated :my_enum, SomeEnum, 15
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. It's a reserved word, definitely, but it seems to work okay as a method as long as you have an explicit receiver for the call.

struct Test
  property enum : String = ""
end

msg = Test.new
  
msg.enum # => ""
msg.enum = "hi"
msg.enum # => "hi"

This seems to be how the Range type can have begin and end methods despite them being keywords, as well.

Copy link
Contributor Author

@m-o-e m-o-e Jul 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm yes, I believe it may have to do with me calling the setters
from the constructor now (was necessary to get the new oneof-exclusive
behavior when doing e.g. MyMessage.new(a: 1, b: 2, c: 3)).

This produces code like self.enum = @enum which the compiler may not like.
(ah, PS: i think actually it produces code like self.enum = enum - perhaps adding the pretzel there would help)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps adding the pretzel there would help

Agreed. If this code breaks with that change, that's a regression with no workaround. Definitely don't want that! 🙂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, adding the @ indeed fixes the enum issue.
But it also changes the behavior - the constructor then keeps the first value instead of the last.

Before:

MyMsg.new(a: 1, b: 2, c: 3) # => { a: nil, b: nil, c: 3 }

After:

MyMsg.new(a: 1, b: 2, c: 3) # => { a: 1, b: nil, c: nil }

No idea what that is about, gonna take a closer look tomorrow.
(maybe it doesn't even matter anyway as this is not behavior that should be relied on 😬)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(pushed an update with the enum fix and the above changed behavior for now)

src/protobuf/generator.cr Outdated Show resolved Hide resolved
Comment on lines 154 to 156
{% for tag, field in FIELDS %}
self.{{field[:name].id}} = {{field[:name].id}}
{% end %}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this use the ONEOFS hash instead to avoid reassigning every property every time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Good question! 🤔

I guess the setter-behavior is only needed for the oneofs
so it could iterate over ONEOFS instead of FIELDS.

Don't really have a strong opinion there. I guess it would be slightly faster
but could also set a trap for people who subclass to add their own setters
and expect consistent behavior on initialization. 🤔

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't support message inheritance. We implement messages as structs (with the sole exception being recursive messages) to reduce GC pressure and non-abstract structs can't be subclassed in Crystal:

struct Foo followed by struct Bar < Foo throws the compiler error: can't extend non-abstract struct Foo

The main reason I'm wondering about this is not so much performance but more that setting values in two places gives us two sources of truth. Ideally we'd only set them once. Maybe iterating over ONEOFS isn't the right approach in that case, either. I'm not sure.

Copy link
Contributor Author

@m-o-e m-o-e Jul 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good point about the inheritance, yes, then that can indeed be ignored.

Hmm yes, I'm not too fond of that part of the PR either but couldn't think of
a better way to do it. But I can't see where a conflict could occur - when you
do Msg.new the setters are called, msg.foo = bar also uses the setter.

And when deserializing a message it doesn't matter unless your message is
corrupt to begin with (contains multiple values for the same oneof).

Comment on lines 64 to 65
# When multiple oneof values are populated
# by the ctor only the last survives
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question about this verbiage. If I have the following protobuf message:

message MyMessage {
  oneof pick_one {
    int32 foo = 1;
    int32 bar = 2;
  }
}

And I create these 2 messages:

MyMessage.new(foo: 1, bar: 2)
MyMessage.new(bar: 2, foo: 1)

Is the expectation that these produce different results?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, yet another good question!

Well, we call the setters at initialization time in FIELDS-order;
{% for tag, field in FIELDS %}

So I believe both invocations should produce the same result.
Might be worth adding a test for this although it's not behavior
that should be relied on (strictly speaking we should probably raise
an Exception when someone tries to set multiple values of a oneof
during init).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I believe both invocations should produce the same result.

They do, it's why I asked 🙂 The hard part is that this would require someone working with message containing a oneof to know the order of the declarations in the proto source (assuming we output them in the same order in the Crystal code, which I believe we do). This is unfortunately not very ergonomic.

we should probably raise an Exception when someone tries to set multiple values of a oneof during init

This might be a good idea. Ideally, we'd use the type system to make this a compile-time error by defining two separate initialize methods. Unfortunately, that would only work if all of the members of the oneof are different types because Crystal sees def a(b : Int32) and def a(c : Int32) as the same method.

calling a(b: 42) and a(c: 42) results in the following error on the call with the argument b: missing argument: c.

Click for code
def a(b : Int)
  "B"
end

def a(c : Int)
  "C"
end

a(b: 42)
a(c: 42)

Maybe we could still do that, though. A while back, they added the ability to force some arguments to be passed as named arguments. So in the snippet pasted above, forcing them as named arguments allows the methods to be seen differently:

image

Click for code
def a(*, b : Int)
  "B"
end

def a(*, c : Int)
  "C"
end

a(b: 42)
a(c: 42)

This would require generating all permutations of initialize for all oneof declarations. So for example:

message MyMessage {
  oneof one {
    string foo = 1;
    string bar = 2;
  }

  oneof two {
    string baz = 3;
    string quux = 4;
    string omg = 5;
  }
}

The resulting Crystal code (after macro expansion) would need to have 6 initialize methods:

struct MyMessage
  include ::Protobuf::Message

  # ... property declarations ...

  def initialize(*, @foo : String, @baz : String)
  end

  def initialize(*, @foo : String, @quux : String)
  end

  def initialize(*, @foo : String, @omg : String)
  end

  def initialize(*, @bar : String, @baz : String)
  end

  def initialize(*, @bar : String, @quux : String)
  end

  def initialize(*, @bar : String, @omg : String)
  end
end

I think this would be an excellent way to enforce oneof at compile time, but requires significantly more work.

If we wanted to get fancy, we could even have a pseudo-initialize whose only purpose is to show a better error at compile time if you pass both. But that's probably not unnecessary. The compiler error is probably enough for someone using the message to look at the definition and realize "oh, right, I can only specify one".

Copy link
Contributor Author

@m-o-e m-o-e Jul 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Holy ravioli, that looks complicated! 🤯 😅

What exactly is this trying to solve, though?
(update: n/m, took me a second pass, now I get it)

Hmm, yes, having "too many enum values provided" a compile time error
would indeed seem luxurious. Not sure if it's worth the effort, though.

I think erroneously having an explicit Foo.new(a: 1, b:2) in the code
(where only one value should be given) is a fairly exotic case and would
normally be caught in testing.

The risk for exclusivity violations is probably higher when protobuf objects
are populated dynamically (e.g. from an ORM or from_json) - and from that
the compiler can't save us. An Exception could - but that's runtime.

Not sure how other protobuf impls behave in that regard but my concern
would be that it might add a bunch of complexity to the generator for not much gain. 🤔

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Holy ravioli, that looks complicated! 🤯 😅

It does when the macros are expanded, but it would all be abstracted from anyone reading the generated code.

The implementation could be relatively simple:

  • Are there any oneofs in this message?
    • If yes: iterate over them in a nested loop generating a new initialize method for each one
    • If no: generate the one initialize

The downside is that it's a breaking change for anyone using positional arguments to initialize, which I believe we do in the test suite.

What exactly is this trying to solve, though? 🤔

Setting two values in a oneof group. You have to know you can only supply one key from each oneof group no matter what the implementation is, but right now we're just not checking your work on that. Your implementation would check your work, which is awesome, but if you get it wrong we just overwrite all but one of your values. You may have no idea that's happened.

Your idea of raising an exception will let them know, which is arguably a huge improvement over silently dropping values because it lets them know that their code has a bug, but it happens at run time. People will likely discover the bug in production. Generating multiple initialize methods will move that exception to compile time. They won't even be able to deploy the bug because it won't be a valid Crystal program.

Just to be clear: I think incoming messages are not supposed to have multiple values for a oneof.
(the current version of protobuf.cr will indeed create such messages but I don't think they are legal)

Agreed.

From what I gathered from the docs I thought it should be enough to enforce their
exclusivity at object creation time (i.e. in the constructor or when a setter is called).

But could be I missed something 🤔

We don't have a guarantee when we're deserializing anything coming in from an untrusted host, but the risk there is probably relatively low. It should be pretty rare that nobody using of a particular message is using a library that checks that unless they're all using the same Protobuf library to serialize/deserialize — at which point, Protobuf is overkill and they should probably be using something else.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

breaking change for anyone using positional arguments

Hmm yes, as mentioned in my prev. comment (sorry I had edited it later on - it got late that night)
I'm skeptical about the benefits we'd gain from it.

Having an explicit Foo.new(a: 1, b:2) in the code (where only one value should
be provided) seems like a rare case and should usually be easily caught in testing.

I think the other protobuf impls (at least the ones I've looked at)
don't go to that length either.

But in any case, your scrutiny has led me to an actual bug in the current impl! 🐞

Now fixed here.
Tldr: We do need to call the setters after deserialisation or the instance var
that identifies which oneof has a value doesn't get set.

I've also added a test for it and deduplicated them here.

With this fix I believe the current impl now works as intended and now it
also applies the "keep only first value"-behavior to incoming messages.

I'm not opposed to refactoring into multiple ctor's as you propose
but am still not quite convinced it's worth it. I feel like it may also bring
some undesirable side-effects such as potentially confusing compiler errors
as messages with many fields may end up with dozens of ctors. 🤔

Note:
* This also changes the constructor semantics
  when multiple oneof-values are given (now the *first*
  value is kept instead of the last)

* Using reserved keywords as protobuf field-names remains
  a bad idea regardless. For example a field called `delete`
  will break ts-proto / protoc-gen-js and probably others
@m-o-e
Copy link
Contributor Author

m-o-e commented Jul 17, 2021

(pardon the force-push, got mixed up with my branches here)

Previously deserializing a message containing `oneofs`
would not set the oneof instance var. E.g. you would get:

`{ oneof_member1: 1, oneof_member_2: nil, my_oneof: nil }`

Now it works as intended and you get:

`{ oneof_member1: 1, oneof_member_2: nil, my_oneof: "oneof_member1" }`

{% for tag, field in FIELDS %}
self.{{field[:name].id}} = @{{field[:name].id}}
{% end %}
Copy link
Contributor Author

@m-o-e m-o-e Jul 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why a second {% for .. %}-block is needed here.

It tried to append the self.x = @x to the preceding block
but that gave me:

The instance variable '@is_extension' is initialized in other 'initialize' methods,
and by not initializing it here it's not clear if the variable is supposed
to be nilable or if this is a mistake.

But well, this version works, and I guess it's actually a bit clearer to read.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PS: Ah I suppose adding it to the existing {% for %} didn't work because
in the case of a oneof the setter will iterate over FIELDS again - and may thereby attempt to access fields before they have been initialized. So yep, that makes sense.

`myobj.myoneof_value` is more convenient than `myobj[myobj.myoneof]`
esp. when dealing with nested objects.
The return-type of `myobj.myoneof_value` is now restricted
to the types of the possible oneof-values. Previously it would
include the types of *all* message-fields (even those that
are not part of the queried oneof).
Now it should actually work. ;)

Note: Resist the temptation to store ONEOF as Symbol's.
It appears to work at first and saves some String-conversions
but wreaks havoc when JSON::Serializable comes into play.
@m-o-e
Copy link
Contributor Author

m-o-e commented Jul 26, 2021

While hacking on this I realized that this oneof-stuff is actually closely
related to the new proto3 field presence.

So with the last commit I've added experimental support for that.
The generator now signals support for the new feature (proto3-files with optional fields can now be compiled).

And the resulting fields behave as requested by the impl guide - identical to proto2 optionals.
However, I'm afraid this is mostly due to the fact that proto3 support was never completed ;)

All proto3 scalar-fields have always behaved like proto2 optionals. I.e. they would become nilable's
rather than following the proto3 "return default-value when absent" semantics.

I feel like this last bit might actually be easy'ish to fix, perhaps just by assigning
the defaults here (but haven't tried it out yet).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants