core/vm: use uint256.Bytes32 and builtin copy to make MSTORE faster #637

minh-bq · 2024-11-28T07:55:03Z

In commit f791124 ("core/vm: optimize the mstore opcode with loop unrolling"), we optimized the loop that copies each byte by unrolling it manually, since the Go compiler does not seem to do that at this time. This makes the code quite ugly and might increase the number of unique instructions executed, which puts more pressure on the instruction cache.

This commit instead follows go-ethereum commit e0a1fd5 ("core/vm: optimize Memory.Set32") by using uint256.Bytes32 and the builtin copy. uint256.Bytes32 is inlined and compiles into fewer instructions: 4x (load, bswap, store), one group per 64-bit word. The builtin copy can then copy the 32 bytes with just 2 load-store pairs using 128-bit (16-byte) xmm registers.
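The shape of the change can be sketched with the standard library alone. This is not the code from this PR or from go-ethereum: `word`, `bytes32`, `memory`, `set32` and `set32Loop` are hypothetical stand-ins, with `word` assumed to mirror the four little-endian 64-bit limbs of uint256.Int and `bytes32` assumed to mirror what uint256.Bytes32 produces (a big-endian 32-byte array).

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// word is a hypothetical stand-in for uint256.Int:
// four 64-bit limbs, least significant limb first.
type word [4]uint64

// bytes32 returns the big-endian 32-byte encoding of w.
// Each PutUint64 is one 64-bit store of a byte-swapped limb.
func (w *word) bytes32() [32]byte {
	var b [32]byte
	binary.BigEndian.PutUint64(b[0:8], w[3])
	binary.BigEndian.PutUint64(b[8:16], w[2])
	binary.BigEndian.PutUint64(b[16:24], w[1])
	binary.BigEndian.PutUint64(b[24:32], w[0])
	return b
}

// memory is a hypothetical stand-in for the EVM memory type.
type memory struct{ store []byte }

// set32 writes val as 32 big-endian bytes at offset,
// using a fixed-size array plus the builtin copy.
func (m *memory) set32(offset uint64, val *word) {
	if offset+32 > uint64(len(m.store)) {
		panic("invalid memory: store too small")
	}
	b32 := val.bytes32()
	copy(m.store[offset:], b32[:])
}

// set32Loop is a byte-by-byte reference version, kept only for comparison.
func (m *memory) set32Loop(offset uint64, val *word) {
	if offset+32 > uint64(len(m.store)) {
		panic("invalid memory: store too small")
	}
	for i := 0; i < 32; i++ {
		limb := val[3-i/8]           // most significant limb comes first
		shift := uint(56 - 8*(i%8)) // most significant byte comes first
		m.store[offset+uint64(i)] = byte(limb >> shift)
	}
}

func main() {
	w := word{0x1122334455667788, 0, 0, 0xAABBCCDDEEFF0011}
	fast := &memory{store: make([]byte, 64)}
	slow := &memory{store: make([]byte, 64)}
	fast.set32(16, &w)
	slow.set32Loop(16, &w)
	fmt.Println("fast and loop versions agree:", bytes.Equal(fast.store, slow.store))
}
```

Whether the compiler actually emits the bswap and xmm sequences described above depends on the Go version and target architecture, so the generated assembly is worth checking before relying on it.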

goos: linux
goarch: amd64
pkg: github.com/ethereum/go-ethereum/core/vm
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
                          │   old.txt    │               new.txt                │
                          │    sec/op    │    sec/op     vs base                │
EvmInsertionSort-8          108.7m ±  6%   104.6m ± 26%        ~ (p=0.631 n=10)
EvmQuickSort-8              6.848m ± 14%   6.633m ±  3%        ~ (p=0.089 n=10)
EvmSignatureValidation-8    15.96µ ±  3%   15.48µ ±  3%        ~ (p=0.052 n=10)
EvmMulticallErcTransfer-8   6.503m ± 15%   6.562m ±  6%        ~ (p=0.912 n=10)
EvmRedBlackTree-8           302.5m ±  4%   305.0m ±  2%        ~ (p=0.684 n=10)
OpMstore-8                  33.47n ± 10%   30.09n ±  6%  -10.07% (p=0.000 n=10)
geomean                     959.9µ         930.0µ         -3.11%


minh-bq closed this Nov 29, 2024

minh-bq deleted the optimize-mstore branch November 29, 2024 08:48

