Shared stack between calls #816

Closed · wants to merge 15 commits

Conversation

thedevbirb (Contributor)

This is an experiment for a shared stack between calls. Let me know what you think about it and whether it is feasible to bring it in!
The model is very similar to the last one developed in #445 (with checkpoints), although there are some differences in the allocation strategy.

Allocations

There are three different strategies here:

  • do one single 32MB allocation -> we've seen with the shared memory that this approach is not good and is overkill most of the time
  • check for capacity at every push operation on the stack -> this does the minimum amount of memory allocations, however it adds a small but non-negligible overhead to push operations, which results in a performance regression in a heavy stress test like the snailtracer bench
  • ensure a capacity of STACK_LIMIT every time you enter a new context -> although this is very similar to what happens normally (i.e., on a new context we allocate space for the stack), it keeps peak memory usage allocated for the next contexts and allows for faster unsafe operations, reducing push overhead. This is the approach I kept after all the experiments
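The third strategy can be sketched roughly like this (a simplified, hypothetical model, not the PR's actual API: `u64` stands in for `U256` and all names are illustrative):

```rust
const STACK_LIMIT: usize = 1024;

/// Simplified shared stack: one growable buffer shared by all call contexts.
struct SharedStack {
    buffer: Vec<u64>,           // stand-in for Vec<U256>
    context_starts: Vec<usize>, // start offset of each call context
}

impl SharedStack {
    fn new() -> Self {
        SharedStack { buffer: Vec::new(), context_starts: Vec::new() }
    }

    /// On entering a new context, reserve STACK_LIMIT slots once.
    /// Capacity built up by previous contexts stays allocated and is reused.
    fn new_context(&mut self) {
        self.context_starts.push(self.buffer.len());
        self.buffer.reserve(STACK_LIMIT);
    }

    /// On leaving a context, truncate back; the capacity is kept.
    fn free_context(&mut self) {
        if let Some(start) = self.context_starts.pop() {
            self.buffer.truncate(start);
        }
    }

    /// Plain push here; with the up-front reserve, a real version could
    /// skip the capacity check entirely (an unchecked write).
    fn push(&mut self, value: u64) {
        self.buffer.push(value);
    }

    /// Length of the current context's stack.
    fn len(&self) -> usize {
        self.buffer.len() - self.context_starts.last().copied().unwrap_or(0)
    }
}

fn main() {
    let mut stack = SharedStack::new();
    stack.new_context();
    stack.push(1);
    stack.new_context();
    stack.push(2);
    stack.push(3);
    assert_eq!(stack.len(), 2);
    stack.free_context();
    assert_eq!(stack.len(), 1);
    // the allocation is kept for the next context
    assert!(stack.buffer.capacity() >= STACK_LIMIT);
    println!("ok");
}
```

The key trade-off: `reserve` is paid once per context entry instead of per push, while freeing a context is just a `truncate` that keeps the backing allocation alive.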

Performance

I managed to keep the regression to a minimum, however the benches don't really exercise the shared stack at all. Even bench_eval on the snailtracer always remains in the same context, so the shared mechanism doesn't really apply.
There is of course a minimum of overhead because this abstraction isn't free, but it seems feasible.

analysis/transact/raw   time:   [8.2504 µs 8.3154 µs 8.4299 µs]
                        change: [+2.8469% +3.7488% +4.8213%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
analysis/transact/checked
                        time:   [8.2338 µs 8.3507 µs 8.4382 µs]
                        change: [+1.0517% +1.9726% +2.9546%] (p = 0.00 < 0.05)
                        Change within noise threshold.
analysis/transact/analysed
                        time:   [5.6613 µs 5.6720 µs 5.6838 µs]
                        change: [+0.6835% +2.5449% +4.1668%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

snailtracer/transact/analysed
                        time:   [67.868 ms 68.022 ms 68.191 ms]
                        change: [+12.395% +13.393% +14.409%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
snailtracer/eval        time:   [61.284 ms 61.386 ms 61.590 ms]
                        change: [+0.6965% +5.5300% +9.5800%] (p = 0.03 < 0.05)
                        Change within noise threshold.

transfer/transact/analysed
                        time:   [1.2470 µs 1.2484 µs 1.2505 µs]
                        change: [+6.3084% +6.6428% +6.9314%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) high mild
  6 (6.00%) high severe

@rakita (Member) commented Oct 18, 2023

You can check performance with cachegrind and docker like this: #797

Additionally, we should write a proper perf test for this. Maybe there is bytecode we can reuse from eth/tests: https://github.com/ethereum/tests/tree/develop/GeneralStateTests
There are fillers that are more readable: https://github.com/ethereum/tests/tree/develop/src/GeneralStateTestsFiller/stCallCodes

@thedevbirb (Contributor, Author) commented Oct 18, 2023

You can check performance with cachegrind and docker like this: #797

Additionally, we should write a proper perf test for this. Maybe there is bytecode we can reuse from eth/tests: https://github.com/ethereum/tests/tree/develop/GeneralStateTests There are fillers that are more readable: https://github.com/ethereum/tests/tree/develop/src/GeneralStateTestsFiller/stCallCodes

I agree regarding a proper performance test for this. We need to choose something that goes up and down in call depths to have a proper picture of the gains of this setup.

In the meanwhile, I tried cachegrind as you said (thanks, I'll keep that in mind for the future). Here are the results on my machine:

Therefore, yes, there is still some work to do to bring down the regression when the shared setup is not used.

crates/interpreter/src/host.rs (outdated, resolved)
/// Stack.
pub stack: Stack,
/// Shared stack.
pub shared_stack: &'a mut SharedStack,
Member:

Can we somehow expose the local stack here? With the reference, we have two hops: one to the SharedStack and a second to the buffer, just to access the stack.

@thedevbirb (Contributor, Author), Oct 26, 2023:

Ok, so what you have in mind is not exposing the whole shared stack struct, but only a smaller version which is simply a wrapper around buffer: *mut Buffer.

If I got this right it makes sense; however, the call/create opcodes and the call_inner/create_inner functions would have some problems, because they get the shared stack from the interpreter itself, therefore we cannot hide some methods like new/free_context.
EDIT: in general, I could have problems with the Host trait and passing around SharedContext as you suggested above

Member:

Even *mut Buffer would be a pointer to the Buffer, which itself has a pointer to the stack (Vec). With the new loop call, this is resolved as we fully move the first structure into the Iterator.

tbh, not sure how impactful this is; maybe it is insignificant

crates/interpreter/src/interpreter/shared_stack.rs (outdated, resolved)
@thedevbirb (Contributor, Author) commented Oct 31, 2023

Hey @rakita, I tried to ask on the ethereum/tests repository about a stress test but the suggestion was to write a custom one.

In the meanwhile I tried to implement this approval-transfer bench: even if it does not reach a huge depth, it loops over context changes, therefore I thought it could be good to see how it performs over those.
Here are the results (cargo bench --all against main 0d78d1e):

approval_transfer/transact/analysed
                        time:   [4.7078 µs 4.7261 µs 4.7508 µs]
                        change: [+10.900% +12.297% +13.797%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
approval_transfer/eval  time:   [55.583 ns 55.690 ns 55.817 ns]
                        change: [-84.656% -84.591% -84.545%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

Regarding the custom test, is there something in particular you had in mind?

/// heap lookup for basic stack operations.
///
/// Invariant: it is a valid pointer to `self.pages[self.page_idx].buffer`
buffer: *mut Buffer,
@rakita (Member), Nov 7, 2023:

I would put the full Page here, not just the Buffer, and rename the pages field to previous_pages to represent pages that are not active.

Member:

hm, I see the problem here; maybe we should not look at free pages as a stack. wdyt about the idea of having two fields, taken_pages and free_pages, both of them Vecs?

When a context is freed you put the Page into the free_pages vec, and on new_context you pop a page from free_pages or create a new one if there is none.
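The two-Vec page pool described here might look something like this (a hypothetical sketch with illustrative names, using `u64` in place of `U256`):

```rust
const PAGE_CAPACITY: usize = 1024;

/// One stack page; the allocation survives being returned to the pool.
struct Page {
    buffer: Vec<u64>, // stand-in for the real U256 buffer
}

/// Pool of pages: active ones in `taken_pages`, recyclable ones in `free_pages`.
struct PagePool {
    taken_pages: Vec<Page>,
    free_pages: Vec<Page>,
}

impl PagePool {
    fn new() -> Self {
        PagePool { taken_pages: Vec::new(), free_pages: Vec::new() }
    }

    /// On new_context: reuse a free page, or allocate a fresh one if none.
    fn new_context(&mut self) {
        let page = self.free_pages.pop().unwrap_or_else(|| Page {
            buffer: Vec::with_capacity(PAGE_CAPACITY),
        });
        self.taken_pages.push(page);
    }

    /// On free_context: clear the page and return it to the free pool.
    fn free_context(&mut self) {
        if let Some(mut page) = self.taken_pages.pop() {
            page.buffer.clear(); // keeps the allocation for reuse
            self.free_pages.push(page);
        }
    }
}

fn main() {
    let mut pool = PagePool::new();
    pool.new_context();
    pool.new_context();
    pool.free_context();
    pool.free_context();
    assert_eq!(pool.free_pages.len(), 2);
    pool.new_context(); // reuses a pooled page, no new allocation
    assert_eq!(pool.free_pages.len(), 1);
    assert_eq!(pool.taken_pages.len(), 1);
    println!("ok");
}
```

This avoids tracking a `page_idx` into a single Vec: a page is either taken or free, and the number of allocations is bounded by the peak call depth seen so far.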

Member:

Another idea (not related to the first part) concerns the Buffer and the context_len: Buffer is a Vec here that has its own len, and additionally we have a context_len that we increment.

We could work with *mut U256 which would point us directly to the top of the stack, and we could have context_len that would tell us the bound of the stack. I like this idea, but we should be careful about the usage of the pointer.
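A minimal sketch of this top-of-stack pointer idea (assumed semantics, not the PR's actual code; `u64` stands in for `U256`, and the bounds checks are plain asserts for clarity):

```rust
const STACK_LIMIT: usize = 1024;

/// Stack driven by a raw pointer to one past the top element, with
/// `context_len` providing the under/overflow bound.
struct PtrStack {
    // Backing storage; kept alive so `top` stays valid. Never reallocated.
    buffer: Vec<u64>,
    // Invariant: points into `buffer`'s allocation, one past the top element.
    top: *mut u64,
    context_len: usize,
}

impl PtrStack {
    fn new() -> Self {
        let mut buffer = Vec::with_capacity(STACK_LIMIT);
        let top = buffer.as_mut_ptr();
        PtrStack { buffer, top, context_len: 0 }
    }

    fn push(&mut self, value: u64) {
        assert!(self.context_len < STACK_LIMIT, "stack overflow");
        // SAFETY: context_len < STACK_LIMIT, so `top` is within the reserved
        // capacity, and the Vec is never reallocated after `new`.
        unsafe {
            self.top.write(value);
            self.top = self.top.add(1);
        }
        self.context_len += 1;
    }

    fn pop(&mut self) -> u64 {
        assert!(self.context_len > 0, "stack underflow");
        self.context_len -= 1;
        // SAFETY: context_len > 0 guarantees an element sits below `top`.
        unsafe {
            self.top = self.top.sub(1);
            self.top.read()
        }
    }
}

fn main() {
    let mut stack = PtrStack::new();
    stack.push(10);
    stack.push(32);
    assert_eq!(stack.pop(), 32);
    assert_eq!(stack.pop(), 10);
    println!("ok");
}
```

Push and pop become a single pointer write/read plus an increment, with no indirection through the Vec's own len, which is exactly the "be careful with the pointer" trade-off noted above.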

Contributor (Author):

I like the first idea! It should make it easier to reason about and we can avoid the page_idx which can be annoying to work with.

Regarding the second idea, let me know if I got this right: with a raw pointer to the top of the stack, all operations are done not on the buffer itself (which is preallocated with 1024 capacity) but rather through the pointer, by dereferencing and incrementing/decrementing it. If that's the case, do you think it provides a performance improvement over dealing with the Buffer pointer, or is it a matter of ergonomics?

Lastly, but slightly unrelated: wdyt of putting both the shared stack and shared memory under the EVMContext struct? Then when creating the interpreter we can pass a raw pointer to the current context's buffer for the stack and memory.

Member:

Usage of the pointer should be slightly faster, but less ergonomic, as you directly handle the pointer.

It would be good to put it in EvmContext, but I think borrowing is going to kill us there, and it would not matter a lot.

Contributor (Author):

Okay, I'll try to modify the shared stack with your suggestions. It will take some work; I hope a bit of waiting is not a problem!

Regarding EVMContext maybe I can think of it in a separate PR after this and see if it is reasonable

Member:

Take your time and pace yourself, I am fine to take over if you feel burdened to finish it, and I am fine with waiting. Both things work for me.

We can maybe separate that pointer idea into another PR to not clog this one.

Contributor (Author):

Yeah if you want I'd be very happy to work on the pointer idea on a separate PR!

@rakita (Member) left a comment:

Left a few ideas that could improve the PR. I like the page system and new context, but I wanted to check how the loop call will look so we can integrate this inside it, so I apologize for the long-overdue review.

@thedevbirb (Contributor, Author):

Left a few ideas that could improve the PR. I like the page system and new context, but I wanted to check how the loop call will look so we can integrate this inside it, so I apologize for the long-overdue review.

Thanks for the feedback, and no problem about the delay!

crates/interpreter/src/interpreter/shared_stack.rs (4 outdated review threads, resolved)
@thedevbirb (Contributor, Author):

Hey there, I pushed some changes regarding the free and taken pages model. Also, both the stack and memory are now under a SharedContext struct.
I still need to address some of the comments regarding UB and upstream sync; will do that shortly!

Comment on lines 3 to 6
pub const EMPTY_SHARED_CONTEXT: SharedContext = SharedContext {
stack: EMPTY_SHARED_STACK,
memory: EMPTY_SHARED_MEMORY,
};
Collaborator:

This should be a const fn empty() -> Self { ... } or const EMPTY: Self = ...; on all the related structs. Const item initializers are weird to me, so I'd prefer the former.
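For illustration, the `const fn empty()` variant the reviewer prefers could look like this (field types are placeholders, not the real ones from the PR):

```rust
// Hypothetical simplified field types, standing in for the PR's structs.
struct SharedStack {
    data: Vec<u64>,
}

struct SharedMemory {
    data: Vec<u8>,
}

struct SharedContext {
    stack: SharedStack,
    memory: SharedMemory,
}

impl SharedStack {
    const fn empty() -> Self {
        // Vec::new is a const fn, so this works in const context.
        SharedStack { data: Vec::new() }
    }
}

impl SharedMemory {
    const fn empty() -> Self {
        SharedMemory { data: Vec::new() }
    }
}

impl SharedContext {
    /// Associated const fn instead of a free-standing const item,
    /// composing the empty constructors of the related structs.
    const fn empty() -> Self {
        SharedContext {
            stack: SharedStack::empty(),
            memory: SharedMemory::empty(),
        }
    }
}

fn main() {
    let ctx = SharedContext::empty();
    assert!(ctx.stack.data.is_empty());
    assert!(ctx.memory.data.is_empty());
    println!("ok");
}
```

The behavior is identical to a `const EMPTY_SHARED_CONTEXT` item; the associated function just keeps the constructor discoverable on the type itself.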

Contributor (Author):

Okay! I kept this for consistency with rakita's work; if he also agrees with this change, it's perfectly fine for me.

crates/interpreter/src/interpreter.rs (outdated, resolved)
@@ -96,14 +97,18 @@ impl Interpreter {
instruction_result: InstructionResult::Continue,
is_static,
return_data_buffer: Bytes::new(),
shared_memory: EMPTY_SHARED_MEMORY,
stack: Stack::new(),
shared_context: EMPTY_SHARED_CONTEXT,
Collaborator:

shouldn't this be ::new() to pre-allocate?

Contributor (Author):

Actually, I don't think so, because we want to manually give the created context to the interpreter when it calls the run method of EVMImpl. See https://github.com/bluealloy/revm/pull/816/files/1de2425fed99f33422e0fa37911f7695a2a39c5b#diff-1d478ba44ccc56e3b1142bd3723bf97f3e254c25dd18323481aedadce0803e91R165-R170. Maybe I am missing something from what you said.

@rakita rakita mentioned this pull request Nov 22, 2023
@thedevbirb (Contributor, Author):

Hey guys, I resolved some comments to clean this up: is there something left to do? More perf tests? Right now perf isn't ideal if you always remain in the same context. Maybe I can try to use the pointer to the top of the stack as @rakita said.

@rakita (Member) commented Nov 30, 2023

Hey guys, I resolved some comments to clean this up: is there something left to do? More perf tests? Right now perf isn't ideal if you always remain in the same context. Maybe I can try to use the pointer to the top of the stack as @rakita said.

Hey, I am mostly focusing on delivering the EvmBuilder and the refactor around it, so I will look at this after that.

@thedevbirb (Contributor, Author) commented Jan 14, 2024

Hey @rakita, I've seen that the EVM Context-Builder PR is merged. Great work!

I took a look at the changes and was wondering if it makes sense to revisit this from scratch, if you feel a shared stack would be beneficial for this evm.

The pages model can be revisited (imo, it seems that a more complex strategy that allocates less is worse than a simpler one with more allocations), and the SharedContext doesn't seem to play too nicely with the current setup either. An example is manually passing the SharedContext taken from the interpreter down to sub_create and other functions such as insert_create_output, which is currently used by the inspector logic too.

@rakita (Member) commented Jan 15, 2024

Hey @rakita, I've seen that the EVM Context-Builder PR is merged. Great work!

I took a look at the changes and was wondering if it makes sense to revisit this from scratch, if you feel a shared stack would be beneficial for this evm.

The pages model can be revisited (imo, it seems that a more complex strategy that allocates less is worse than a simpler one with more allocations), and the SharedContext doesn't seem to play too nicely with the current setup either. An example is manually passing the SharedContext taken from the interpreter down to sub_create and other functions such as insert_create_output, which is currently used by the inspector logic too.

Thanks @lorenzofero!

My view is that we want a shared context in some way, and I am open to new ideas, but just to say this is very low priority, as there is not much impact from doing this.

@thedevbirb (Contributor, Author):

Closing this as I think it should be revisited from scratch. I might do that in the future, but I'd prefer to focus on other contributions to revm, time permitting :)

@thedevbirb thedevbirb closed this Feb 3, 2024