Shared stack between calls #816
Conversation
You can check performance with cachegrind and Docker like this: #797. Additionally, we should write a proper perf test for this. There is maybe bytecode that we can reuse from eth/tests: https://github.com/ethereum/tests/tree/develop/GeneralStateTests
I agree regarding a proper performance test for this. We need to choose something that goes up and down in call depth to get a proper picture of the gains of this setup. In the meanwhile, I tried with cachegrind as you said (thanks, I'll keep that in mind for the future). Here are the results on my machine. Therefore, yes, there is still some work to do to bring down the regression.
Force-pushed 6c4acdf to 69910ee
Force-pushed 69910ee to 85c628e
/// Stack.
pub stack: Stack,
/// Shared stack.
pub shared_stack: &'a mut SharedStack,
Can we somehow expose the local stack here? With the reference, we have two hops: one to the SharedStack and a second to the buffer to access the stack.
Ok, so what you have in mind is not exposing the whole shared stack struct, but only a smaller version which is simply a wrapper around buffer: *mut Buffer. If I got this right it makes sense; however, the call/create opcodes and the call_inner/create_inner functions would have some problems, because they get the shared stack from the interpreter itself, therefore we cannot hide some methods like new/free_context.
EDIT: in general I could have problems with the Host trait and passing around SharedContext as you suggested above.
Even *mut Buffer would be a pointer to the Buffer, which has a pointer to the stack (Vec). With the new loop call this is resolved, as we fully move the first structure into the Iterator.
Tbh I am not sure how impactful this is; maybe it is insignificant.
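For illustration, here is a minimal sketch of the thin-wrapper idea being discussed: the interpreter holds a view with a raw pointer straight to the active buffer, while context management stays on the shared struct. All names (Buffer, SharedStack, StackView) and method shapes are assumptions for this sketch, not the PR's actual code.

```rust
// Stand-in for the real Vec<U256> buffer; u64 keeps the sketch small.
type Buffer = Vec<u64>;

pub struct SharedStack {
    buffers: Vec<Buffer>,
}

/// Thin view handed to the interpreter: one pointer hop to the data,
/// with no access to context management (new_context/free_context).
pub struct StackView {
    buffer: *mut Buffer,
}

impl SharedStack {
    pub fn new() -> Self {
        Self { buffers: vec![Vec::with_capacity(1024)] }
    }

    /// Expose only the active buffer to the interpreter.
    pub fn view(&mut self) -> StackView {
        let active = self.buffers.last_mut().expect("at least one context");
        StackView { buffer: active as *mut Buffer }
    }
}

impl StackView {
    pub fn push(&mut self, value: u64) {
        // SAFETY (sketch only): valid as long as the SharedStack outlives
        // the view and no context switch invalidates the active buffer.
        unsafe { (*self.buffer).push(value) }
    }

    pub fn pop(&mut self) -> Option<u64> {
        unsafe { (*self.buffer).pop() }
    }
}
```

Note that, as the follow-up comment points out, this still dereferences the Buffer (a Vec) internally, so it does not remove the second hop to the Vec's heap data.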
Hey @rakita, I tried to ask on the
In the meanwhile, I tried to implement this approval transfer bench: even if it does not reach a huge depth, it loops over context changes, so I thought it could be good to see how it performs over those.
Regarding the custom test, is there something in particular you had in mind?
/// heap lookup for basic stack operations.
///
/// Invariant: it is a valid pointer to `self.pages[self.page_idx].buffer`
buffer: *mut Buffer,
Would put the full Page here, not just the Buffer, and rename the pages field to previous_pages to represent the pages that are not active.
Hm, I see the problem here; maybe we should not look at free pages as a stack. Wdyt about the idea of having two fields, taken_pages and free_pages, which are both Vecs?
If the context is freed you put the Page into the free_pages vec, and on new_context you pop a page from free_pages or create a new one if there is none.
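A rough sketch of this two-Vec recycling model, assuming hypothetical Page and SharedStack shapes (not the PR's actual types):

```rust
const STACK_LIMIT: usize = 1024;

// Hypothetical page type: one stack buffer per call context.
struct Page {
    buffer: Vec<u64>, // stand-in for Vec<U256>
}

struct SharedStack {
    taken_pages: Vec<Page>,
    free_pages: Vec<Page>,
}

impl SharedStack {
    fn new() -> Self {
        Self { taken_pages: Vec::new(), free_pages: Vec::new() }
    }

    /// On a new call context: reuse a free page, or allocate one if none.
    fn new_context(&mut self) {
        let page = self.free_pages.pop().unwrap_or_else(|| Page {
            buffer: Vec::with_capacity(STACK_LIMIT),
        });
        self.taken_pages.push(page);
    }

    /// When a context returns: clear the page (keeping its capacity)
    /// and move it back to the free list for reuse.
    fn free_context(&mut self) {
        if let Some(mut page) = self.taken_pages.pop() {
            page.buffer.clear();
            self.free_pages.push(page);
        }
    }
}
```

The appeal of this shape is that there is no page_idx bookkeeping: the active page is simply the last element of taken_pages, and peak capacity is retained across contexts via the free list.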
Another idea (not related to the first part) concerns the Buffer and the context_len: Buffer is a Vec here that has its own len, and additionally we have a context_len that we increment.
We could work with a *mut U256, which would point us directly to the top of the stack, and we could have a context_len that would tell us the bound of the stack. I like this idea, but we should be careful about the usage of the pointer.
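A hedged sketch of that top-of-stack pointer layout, with u64 in place of U256 and all names invented for illustration:

```rust
const STACK_LIMIT: usize = 1024;

struct RawStack {
    // Backing storage; kept alive so `top` stays valid (the heap
    // allocation does not move when the Box itself moves).
    _data: Box<[u64; STACK_LIMIT]>,
    // Points at the next free slot, i.e. one past the current top.
    top: *mut u64,
    // Number of live items in the current context: the bound check.
    context_len: usize,
}

impl RawStack {
    fn new() -> Self {
        let mut data = Box::new([0u64; STACK_LIMIT]);
        let top = data.as_mut_ptr();
        Self { _data: data, top, context_len: 0 }
    }

    fn push(&mut self, value: u64) -> bool {
        if self.context_len == STACK_LIMIT {
            return false; // stack overflow
        }
        // SAFETY: context_len < STACK_LIMIT, so `top` is inside the buffer.
        unsafe {
            *self.top = value;
            self.top = self.top.add(1);
        }
        self.context_len += 1;
        true
    }

    fn pop(&mut self) -> Option<u64> {
        if self.context_len == 0 {
            return None; // underflow for this context
        }
        self.context_len -= 1;
        // SAFETY: at least one live element sits just below `top`.
        unsafe {
            self.top = self.top.sub(1);
            Some(*self.top)
        }
    }
}
```

Every operation is a single pointer dereference plus an increment/decrement, with context_len as the only bound check; the trade-off is that all validity reasoning moves into the SAFETY comments.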
I like the first idea! It should make the code easier to reason about, and we can avoid the page_idx, which can be annoying to work with.
Regarding the second idea, let me know if I got this right: with a raw pointer to the top of the stack, all operations are done not on the buffer itself (which is preallocated with 1024 capacity) but rather on the pointer, by dereferencing and incrementing/decrementing it. If that's the case, do you think it provides a performance improvement over dealing with the Buffer pointer, or is it a matter of ergonomics?
Lastly, and slightly unrelated: wdyt of putting both the shared stack and the shared memory under the EVMContext struct? Then, when creating the interpreter, we could pass a raw pointer to the current context buffer of the stack and memory.
Usage of the pointer should be slightly faster; it would be less ergonomic, as you directly handle the pointer.
It would be good to put it in EvmContext, but I think borrowing is going to kill us there, and it would not matter a lot.
Okay, I'll try to modify the shared stack with these suggestions. It will take some work; I hope the wait is not a problem!
Regarding EVMContext, maybe I can think about it in a separate PR after this one and see if it is reasonable.
Take your time and pace yourself; I am fine with taking over if you feel burdened to finish it, and I am fine with waiting. Both options work for me.
We can maybe separate that pointer idea into another PR so as not to clog this one.
Yeah if you want I'd be very happy to work on the pointer idea on a separate PR!
Left a few ideas that could improve the PR. I like the page system and new context, but I wanted to check how the loop call will look so we can integrate this inside it, so I apologize for the long-overdue review.
Thanks for the feedback, and no problem about the delay!
Hey there, I pushed some changes regarding the free and taken pages model. Also, both stack and memory are now under a SharedContext struct.
pub const EMPTY_SHARED_CONTEXT: SharedContext = SharedContext {
    stack: EMPTY_SHARED_STACK,
    memory: EMPTY_SHARED_MEMORY,
};
This should be a const fn empty() -> Self { ... } or a const EMPTY: Self = ...; on all the related structs. Const item initializers are weird to me, so I'd prefer the former.
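The suggested shape could be sketched like this, with stand-in single-field structs so the snippet compiles (the real structs hold page vectors and buffers):

```rust
// Stand-in types with one field each; illustrative only.
pub struct SharedStack { pub len: usize }
pub struct SharedMemory { pub len: usize }

pub struct SharedContext {
    pub stack: SharedStack,
    pub memory: SharedMemory,
}

impl SharedStack {
    pub const fn empty() -> Self { Self { len: 0 } }
}

impl SharedMemory {
    pub const fn empty() -> Self { Self { len: 0 } }
}

impl SharedContext {
    pub const fn empty() -> Self {
        Self {
            stack: SharedStack::empty(),
            memory: SharedMemory::empty(),
        }
    }
}

// A const fn is still usable anywhere the const item was, e.g.:
const PLACEHOLDER: SharedContext = SharedContext::empty();
```

Because `empty` is a const fn, it can initialize consts and statics exactly like the EMPTY_* items it would replace, while keeping construction logic on the type itself.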
Okay! I kept this for consistency with rakita's work; if he also agrees with this change, it's perfectly fine for me.
@@ -96,14 +97,18 @@ impl Interpreter {
     instruction_result: InstructionResult::Continue,
     is_static,
     return_data_buffer: Bytes::new(),
-    shared_memory: EMPTY_SHARED_MEMORY,
-    stack: Stack::new(),
+    shared_context: EMPTY_SHARED_CONTEXT,
Shouldn't this be ::new() to pre-allocate?
Actually, I don't think so, because we want to manually give the created context to the interpreter when it calls the run method of EVMImpl. See https://github.com/bluealloy/revm/pull/816/files/1de2425fed99f33422e0fa37911f7695a2a39c5b#diff-1d478ba44ccc56e3b1142bd3723bf97f3e254c25dd18323481aedadce0803e91R165-R170. Maybe I am missing something from what you said.
Force-pushed 60640ad to b61d2c1
Force-pushed b61d2c1 to e89b92f
Hey guys, I resolved some comments to clean this up: is there something left to do? More perf tests? Right now perf isn't ideal if you always remain in the same context. Maybe I can try to use the pointer to the top of the stack as @rakita said.
Hey, I am mostly focusing on delivering the EvmBuilder and the refactor around it, so I will look at this after that.
Hey @rakita, I've seen that the EVM Context-Builder PR is merged. Great work! I tried to take a look at the changes and I was wondering if it makes sense to look at this again from scratch, if you feel a shared stack would be beneficial for this evm. Both the pages model can be revisited (imo, it seems that a more complex strategy that allocates less is worse than a simpler one with more allocations) and also the
Thanks @lorenzofero! My view of the shared context is that we want it in some way. I am open to new ideas, but just to say this is very low priority, as there is no impact from doing this.
Closing this, as I think it should be revisited from scratch. I might do that in the future, but I'd prefer to focus on other contributions to revm, time permitting :)
This is an experiment with a shared stack between calls. Let me know what you think about it and whether it is feasible to bring it in!
The model is very similar to the last one developed in #445 (with checkpoints), although there are some differences in the allocation strategy.
Allocations
There are three different strategies here:
- Allocate on each push operation in the stack -> does the minimum amount of memory allocations; however, it adds a small but non-negligible overhead to push operations, which results in a performance regression in a very stressful test like the snailtracer bench.
- Allocate STACK_LIMIT every time you enter a new context -> although this is very similar to what happens normally (i.e., on a new context we allocate space for the stack), it still keeps peak memory usage allocated for the next contexts and allows for faster unsafe operations, reducing the push operations overhead. This is the approach I kept after all the experiments.
Performance
I managed to keep the regression to a minimum; however, the benches don't really exercise the shared stack at all. Even bench_eval on the snailtracer always remains in the same context, so the shared mechanism doesn't really apply. There is of course a minimum of overhead, because this abstraction isn't free, but it seems feasible.
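The second allocation strategy described above — pre-allocating the full STACK_LIMIT on every new context so pushes never reallocate — could be sketched like this (illustrative names, u64 in place of U256):

```rust
const STACK_LIMIT: usize = 1024;

struct ContextStack {
    buffer: Vec<u64>, // stand-in for Vec<U256>
}

impl ContextStack {
    /// Entering a new context allocates the worst case up front, so the
    /// buffer's capacity is never touched again during execution.
    fn new_context() -> Self {
        Self { buffer: Vec::with_capacity(STACK_LIMIT) }
    }

    /// Only the EVM depth limit is checked; since capacity == STACK_LIMIT,
    /// the write below can never require a reallocation.
    fn push(&mut self, value: u64) -> bool {
        let len = self.buffer.len();
        if len == STACK_LIMIT {
            return false; // stack overflow
        }
        // SAFETY: len < STACK_LIMIT == capacity, so the slot exists.
        unsafe {
            std::ptr::write(self.buffer.as_mut_ptr().add(len), value);
            self.buffer.set_len(len + 1);
        }
        true
    }

    fn len(&self) -> usize {
        self.buffer.len()
    }
}
```

The point of the up-front allocation is that the only branch on the push hot path is the depth-limit check, which the EVM has to perform anyway.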