-
-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add explicit cloning to query2 #119
Comments
I updated this issue with ideas of how this could be solved. My first impression is that 3 might be the best solution. Could even add some logic to tell a friendly error like "The variable you are referencing was defined before but has been consumed, maybe you want to clone it?" 4 is also a tempting solution, but less sure how that would work. I'm pretty sure it's possible though. |
I think in the Python version of query2 all functions that modify data do deepcopy's on input arguments in order to avoid mutating them.
Really? Are clones that expensive? Don't have the time to review the best approach right now, but ping me after the 5th of June and I'll look over it. My first impression is that doing Rust-style ownership checks seems like a very overkill approach to a small DSL like this, but idk. |
It seems like at some places we use deepcopy on the events and in other places we iterate on the passed events and create new ones with the slight modifications. In some places such as flood it would be possible to optimize because there we use deepcopy when in fact we wouldn't need to if we had these ownership things in place (because most of the time we don't use the non-flooded events after we have called flood).
memory allocation and memory copying is expensive and currently we do it a lot of time in query2. I used the query benchmark I wrote last week and made the simple change that consumes all variable references instead of cloning them (likely breaks aw-webui queries, but not the simple query in the bench). The query in the bench is the following, so there should only be one clone I believe (when the events are assigned to the events variable). There are 3000 generated events into the "testid" bucket.
These are the results
max perf increase = 7873289/5116531 = ~53% While this query is incredibly simple and doesn't do any transforms you also have to take into account that the more transforms the more you pass more variables around which means more cloning of data. And another big advantage is the fact that aw-server-rust will use much less memory, which was the reason why I found out this issue in the first place, I was looking for where all the memory was used for and realized that there was a lot of copying of events in query2. I ran the tests again but with 30 000 events instead of 3 000 and here are the results
average perf increase = ~21% |
Or maybe just passing by reference and let the transforms decide if they need to clone or not like we do in query2 in python... |
I think I'll go to pass-by-reference first as that would allow us to pass by reference at all (which the language does not support today). |
I think I have some ideas about how to implement reference counting, will give that a try. |
Problem
Currently every time we reference a variable we are cloning the whole value to guarantee that the value of the variable cannot be modified in some function without re-assigning it.
If we had explicit cloning built-in to the language so cloning is only done when necessary querying would probably be insanely faster.
The python implementation seems to be inconsistent as some transforms consumes the data and others don't and the query2 language doesn't seem to clone variables so things can get very complicated. Will have to investigate further on this though.
Suggestions on how it could be solved
1. New default is to assume mutable referenceclone(data) returns cloned data*data to specify that variable will be consumed and undefined after this. Needed to avoid unnecessary cloningWill not work because here the same value can be in multiple variables, breaks ownership rules.
2. New default is to assume immutable referenceclone(data) returns cloned data*data to specify that variable will be consumed and undefined after this. Needed to avoid unnecessary cloning.Will not work because here the same value can be in multiple variables, breaks ownership rules.
3. New default is to assume consume
clone(data) returns cloned data
&data to specify that variable will be used as a reference
4. Keep as is, but introduce some pre-processing step which does reference counting before executing
Since we don't have any loops or non-transform function support this would be possible to do, but still a bit messy to get working. The result might not be perfect either.
The text was updated successfully, but these errors were encountered: