
Add explicit cloning to query2 #119

Open
johan-bjareholt opened this issue May 23, 2020 · 6 comments
Labels
enhancement New feature or request

Comments

@johan-bjareholt
Member

johan-bjareholt commented May 23, 2020

Problem

Currently, every time we reference a variable we clone the whole value, to guarantee that a function cannot modify the variable's value without it being re-assigned.
If the language had explicit cloning built in, so cloning is only done when necessary, querying would likely be significantly faster.

The Python implementation seems inconsistent: some transforms consume the data and others don't, and the query2 language doesn't seem to clone variables, so things can get very complicated. This needs further investigation, though.

Suggestions on how it could be solved

1. New default is to assume mutable reference

clone(data) returns cloned data
*data to specify that the variable will be consumed and undefined afterwards. Needed to avoid unnecessary cloning.
Will not work, because here the same value can end up in multiple variables, which breaks ownership rules.

events1 = query_bucket("bucket");
events2 = filter_keyvals(events1, "key", ["val"]);
# OK, BUT this will modify events1, so it might be unclear to the user what is occurring
events3 = filter_keyvals(clone(events1), "key", ["val"]);
# OK, at this point both events1 and events2 are defined
events4 = filter_keyvals(*events1, "key", ["val"]);
# OK, at this point events1 is undefined as it was consumed by *events1

2. New default is to assume immutable reference

clone(data) returns cloned data
*data to specify that the variable will be consumed and undefined afterwards. Needed to avoid unnecessary cloning.
Will not work, because here the same value can end up in multiple variables, which breaks ownership rules.

events1 = query_bucket("bucket");
events2 = filter_keyvals(events1, "key", ["val"]);
# Error: Not possible to pass events1 into filter_keyvals as it needs mutable data
events3 = filter_keyvals(clone(events1), "key", ["val"]);
# OK, at this point both events1 and events2 are defined
events4 = filter_keyvals(*events1, "key", ["val"]);
# OK, at this point events1 is undefined as it was consumed by *events1

3. New default is to assume consume

clone(data) returns cloned data
&data to specify that variable will be used as a reference

events1 = query_bucket("bucket");
events2 = filter_keyvals(events1, "key", ["val"]);
# OK as it is consumed by default, but after this line events1 is consumed, so we need to either first clone it or query the bucket again
events1 = query_bucket("bucket");
events3 = filter_keyvals(clone(events1), "key", ["val"]);
# OK, at this point both events1 and events2 are defined
events4 = filter_keyvals(*events1, "key", ["val"]);
# OK, at this point events1 is undefined as it was consumed by *events1
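In all three variants the interpreter has to distinguish a cloning access (`clone(data)`) from a consuming one (`*data`). A minimal sketch of what the variable environment could look like in Rust; names and the value type are hypothetical, this is not the actual aw-server-rust code:

```rust
use std::collections::HashMap;

// Hypothetical stand-in for query2 values; the real ones are richer (events, numbers, ...).
type Value = Vec<String>;

/// A variable environment supporting both access styles discussed above.
struct Env {
    vars: HashMap<String, Value>,
}

impl Env {
    fn new() -> Self {
        Env { vars: HashMap::new() }
    }

    fn assign(&mut self, name: &str, value: Value) {
        self.vars.insert(name.to_string(), value);
    }

    /// `clone(data)`: the variable stays defined, the caller gets a copy.
    fn get_clone(&self, name: &str) -> Option<Value> {
        self.vars.get(name).cloned()
    }

    /// `*data`: the variable is consumed and undefined afterwards; no copy is made.
    fn consume(&mut self, name: &str) -> Option<Value> {
        self.vars.remove(name)
    }
}

fn main() {
    let mut env = Env::new();
    env.assign("events1", vec!["e1".into(), "e2".into()]);

    // Cloning access leaves the variable defined.
    let copy = env.get_clone("events1").unwrap();
    assert_eq!(copy.len(), 2);
    assert!(env.get_clone("events1").is_some());

    // Consuming access makes the variable undefined, as with `*events1`.
    let moved = env.consume("events1").unwrap();
    assert_eq!(moved.len(), 2);
    assert!(env.get_clone("events1").is_none());
}
```

The key point is that consume is a `HashMap::remove`, which moves the value out without copying it.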

4. Keep as is, but introduce some pre-processing step which does reference counting before executing

Since we don't have loops or support for non-transform functions, this would be possible to do, but still a bit messy to get working, and the result might not be perfect either.
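Since a query is a straight-line list of statements, such a pre-pass could be as simple as recording the last statement in which each variable is read: the interpreter can then move the value on its last use and clone on every earlier one. A sketch under that assumption (the per-statement read lists are a hypothetical representation, not the actual query2 AST):

```rust
use std::collections::{HashMap, HashSet};

/// Given a straight-line program as, per statement, the list of variables it
/// reads, return the (statement index, variable) pairs where that read is the
/// variable's last use, so the value can be moved there instead of cloned.
fn last_uses(stmts: &[Vec<&str>]) -> HashSet<(usize, String)> {
    let mut last: HashMap<&str, usize> = HashMap::new();
    for (i, reads) in stmts.iter().enumerate() {
        for var in reads {
            last.insert(var, i); // later reads overwrite earlier ones
        }
    }
    last.into_iter().map(|(v, i)| (i, v.to_string())).collect()
}

fn main() {
    // events1 = query_bucket(...);            -- reads nothing
    // events2 = filter_keyvals(events1, ...); -- reads events1
    // events3 = filter_keyvals(events1, ...); -- reads events1 again
    // return events3;                         -- reads events3
    let stmts = vec![vec![], vec!["events1"], vec!["events1"], vec!["events3"]];
    let moves = last_uses(&stmts);

    // The first read of events1 must clone; the second may move.
    assert!(!moves.contains(&(1, "events1".to_string())));
    assert!(moves.contains(&(2, "events1".to_string())));
    assert!(moves.contains(&(3, "events3".to_string())));
}
```

With no loops, "last use" is a single linear scan, which is why this option is feasible at all; loops would force every use inside the loop body to clone.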

@johan-bjareholt
Member Author

I updated this issue with ideas of how this could be solved.

My first impression is that 3 might be the best solution. We could even add some logic to emit a friendly error like "The variable you are referencing was defined before but has been consumed, maybe you want to clone it?"

4 is also a tempting solution, but I'm less sure how that would work. I'm pretty sure it's possible though.

@ErikBjare
Member

ErikBjare commented May 29, 2020

I think in the Python version of query2, all functions that modify data do deepcopies of their input arguments in order to avoid mutating them.

querying would probably be insanely faster

Really? Are clones that expensive?

Don't have the time to review the best approach right now, but ping me after the 5th of June and I'll look over it. My first impression is that doing Rust-style ownership checks seems like a very overkill approach to a small DSL like this, but idk.

@johan-bjareholt
Member Author

@ErikBjare

I think in the Python version of query2, all functions that modify data do deepcopies of their input arguments in order to avoid mutating them.

It seems like in some places we use deepcopy on the events, and in other places we iterate over the passed events and create new ones with slight modifications. Some places, such as flood, could be optimized: there we use deepcopy when we in fact wouldn't need to if we had these ownership rules in place (because most of the time we don't use the non-flooded events after calling flood).

Really? Are clones that expensive?

Memory allocation and memory copying are expensive, and currently we do a lot of both in query2.

I used the query benchmark I wrote last week and made the simple change of consuming all variable references instead of cloning them (this likely breaks aw-webui queries, but not the simple query in the bench).

The query in the bench is the following, so I believe there should only be one clone (when the events are assigned to the events variable). There are 3000 generated events in the "testid" bucket.

events = query_bucket("testid");
return events;

These are the results:

# With cloning (not consuming)
test query_benchmarks::bench_many_events ... bench:   7,237,172 ns/iter (+/- 636,117)
# Without cloning (consuming)
test query_benchmarks::bench_many_events ... bench:   5,660,517 ns/iter (+/- 543,986)
                   min      avg      max
with cloning:      6601055  7237172  7873289
without cloning:   5116531  5660517  6204503

max perf increase = 7873289/5116531 = ~53%
average perf increase = 7237172/5660517 = ~27.8%
min perf increase = 6601055/6204503 = ~6.4%

While this query is incredibly simple and doesn't do any transforms, you also have to take into account that the more transforms you have, the more you pass variables around, which means more cloning of data.

Another big advantage is that aw-server-rust will use much less memory, which is how I found this issue in the first place: I was looking for where all the memory was going and realized that there was a lot of copying of events in query2.

I ran the tests again with 30 000 events instead of 3 000; here are the results:

# With cloning (not consuming)
test query_benchmarks::bench_many_events ... bench:  76,125,955 ns/iter (+/- 26,492,754)
# Without cloning (consuming)
test query_benchmarks::bench_many_events ... bench:  62,858,451 ns/iter (+/- 23,683,203)

average perf increase = ~21%

@johan-bjareholt
Member Author

Or maybe just pass by reference and let the transforms decide whether they need to clone or not, like we do in query2 in Python...
Not a perfect solution, as there would still be a few unnecessary clones, but a significant improvement and probably the easiest one to implement.
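Under that approach each transform takes its input by shared reference and clones only what it actually keeps. A sketch of what filter_keyvals could look like, with a simplified Event type rather than the real aw-server-rust signature:

```rust
// Simplified stand-in for the real Event type.
#[derive(Clone, Debug, PartialEq)]
struct Event {
    key: String,
    val: String,
}

/// Pass-by-reference variant: the caller keeps ownership of `events`,
/// and only the events that match are cloned into the result.
fn filter_keyvals(events: &[Event], key: &str, vals: &[&str]) -> Vec<Event> {
    events
        .iter()
        .filter(|e| e.key == key && vals.contains(&e.val.as_str()))
        .cloned()
        .collect()
}

fn main() {
    let events = vec![
        Event { key: "app".into(), val: "firefox".into() },
        Event { key: "app".into(), val: "vim".into() },
    ];
    let filtered = filter_keyvals(&events, "app", &["vim"]);
    assert_eq!(filtered.len(), 1);
    assert_eq!(filtered[0].val, "vim");
    // `events` is still usable: nothing was consumed or modified.
    assert_eq!(events.len(), 2);
}
```

The remaining unnecessary clones are exactly the matching events that the caller never reads again after the call, which is what the ownership schemes above would eliminate.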

@johan-bjareholt
Member Author

I think I'll go with pass-by-reference first, as that would at least make passing by reference possible at all (which the language does not support today).
More advanced machinery to perfect the lifetime of variables can be added later.

@johan-bjareholt
Member Author

I think I have some ideas about how to implement reference counting, will give that a try.
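One way reference counting can look in Rust is to keep shared values behind `Rc` and clone only when a value is actually shared, which is what `Rc::make_mut` from the standard library does. A sketch of the clone-on-write idea, not a proposed patch:

```rust
use std::rc::Rc;

fn main() {
    // A value bound to a single variable: mutation happens in place, no clone.
    let mut only = Rc::new(vec![1, 2, 3]);
    let before = Rc::as_ptr(&only);
    Rc::make_mut(&mut only).push(4);
    assert_eq!(Rc::as_ptr(&only), before); // still the same allocation

    // The same value bound to two variables: make_mut clones for the writer,
    // leaving the other binding untouched.
    let mut a = Rc::new(vec![1, 2, 3]);
    let b = Rc::clone(&a);
    Rc::make_mut(&mut a).push(4);
    assert!(!Rc::ptr_eq(&a, &b)); // the writer got its own copy
    assert_eq!(*b, vec![1, 2, 3]);
    assert_eq!(*a, vec![1, 2, 3, 4]);
}
```

With this scheme the interpreter never has to decide up front whether to clone: assigning a variable to another is a cheap `Rc::clone` (a refcount bump), and a deep copy happens only at the first mutation of a value that is still shared.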
