Vectorize base:runtime.memory_*
#4063
base: master
Conversation
At first glance it looks like the SIMD version is slower until somewhere between 64 and 256 bytes, so it should probably remain scalar for small sizes.
Do you mean to add in a specific path for smaller sizes? I was wondering if it might be a good idea to have variable-width SIMD scanners for something like this too, or defaulting to a smaller register size, like 128 bits.
Well, it would be great if the SIMD timings were formatted relative to the scalar results, e.g. as a speed-up factor for each size (with made-up factors for illustration).

At first blush it looks like the SIMD paths aren't really any faster until around 64 bytes. Once we can compare the numbers side by side instead of scrolling up two pages, it'll be possible to see where the scalar cut-off should be. At a glance it appears that it would be sensible to stay scalar until the next aligned size that's greater than or equal to 64 bytes and go wide after that. Alternatively, would it be a lot of work to perform the two sets of timings first, and then present them in a combined, side-by-side report? Because as you intimated, it's not a clear 10x win across the board, so we need to know where the sensible cut-off actually is.
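(For illustration only: a minimal sketch of what a scalar cut-off in front of the wide path could look like. The proc name and the threshold constant are hypothetical and would need to be tuned from the benchmarks; this is not the PR's actual code.)

```odin
package example

// Hypothetical tuning constant; the real cut-off would come from benchmarks.
SCALAR_CUTOFF :: 64

// Sketch of a compare routine that stays scalar below the cut-off and only
// enters the wide (SIMD) path for larger inputs.
memory_equal_sketch :: proc "contextless" (x, y: rawptr, n: int) -> bool {
	a, b := ([^]byte)(x), ([^]byte)(y)
	if n < SCALAR_CUTOFF {
		for i in 0..<n {
			if a[i] != b[i] {
				return false
			}
		}
		return true
	}
	// ... wide path for n >= SCALAR_CUTOFF would go here ...
	return true
}
```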
That's to be expected, since it's iterating as a scalar up until the data is aligned for the wider loads.
I'll see what I can do about a more sensible output later. It's just a matter of getting all the data points at once, then composing the output.
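(Purely illustrative: a sketch of the "collect all the data points first, then compose the output" idea. The proc and slice names are made up, and the format is just one possible side-by-side layout.)

```odin
package example

import "core:fmt"

// Assumes both timing sets have already been collected, indexed by test size.
print_relative :: proc(sizes: []int, scalar_ns: []f64, simd_ns: []f64) {
	for size, i in sizes {
		factor := scalar_ns[i] / simd_ns[i] // > 1 means the SIMD path won
		fmt.printf("%9d bytes | scalar %10.1f ns | simd %10.1f ns | %5.2fx\n",
			size, scalar_ns[i], simd_ns[i], factor)
	}
}
```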
This depends on "what do we want to do with AVX2", as that is the only currently supported SIMD instruction set that is not 128 bits wide. ARM SVE/SVE2 allows wider, but we do not support it (and neither does Apple silicon). AVX2 has some... "interesting" properties on pre-Ice Lake Intel hardware.
We currently do not test on targets that are picky about non-SIMD alignment: ARM starting with v8 (both 32- and 64-bit) is fine with it, x86/amd64 has always been fine with it, and WASM is fine with it. Looking forward, off the top of my head, unaligned loads would cause problems on ARMv7 (Raspi <= 3), PPC64/POWER, and probably RISC-V. As a side note, while not applicable to this PR, I'm somewhat unconvinced that forcing alignment before entering the bulk path is worth the added complexity/code size, at least on Intel architectures.
I should clarify that when I said register, I was referring to the chunk. So the question is really how wide a chunk we use.
Interesting.
If it's not applicable to this PR, are you saying this comment applies more to the previous `index_byte` PR? Are you suggesting to instead iterate as a scalar until the length is aligned to the scanner width, then do unaligned loads the whole way through?

Either way, I find this sort of information fascinating, so thank you for your input. I'm aware of Agner Fog's work, though I haven't read into it too deeply yet. Do you have any other resources on this topic to share?
nb: I will be explicitly ignoring i386 for the purpose of this discussion, along with ARM SVE2 (because of lack of Odin support).
So the way this PR (and the previous PR) is written, we are telling LLVM to do 512-bit vector operations. In most situations this causes a disconnect between the chunk size and what the actual hardware is capable of (128-bit SIMD on wasm/ARM/vanilla amd64, 256-bit SIMD on modern amd64, 512-bit SIMD on certain Intel/Ryzen). So, roughly, depending on the target microarch, LLVM legalizes the 512-bit operations down to whatever the hardware supports: several 128-bit operations on the narrow targets, 256-bit operations where AVX2 is available, and native 512-bit operations where AVX512 is available.
This is both good and bad. It is good in the sense that "the best possible thing gets used", and bad in the sense that "we probably should be avoiding AVX512 for now" (anything that predates Ice Lake/Zen 4 downclocks, etc.). This also depends on quite a bit of LLVM magic, and as a matter of preference I tend to favor writing code like this so that it mirrors the underlying hardware.
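(One hedged way to make "mirror the underlying hardware" concrete is to pick the chunk width per target at compile time. The widths and `when` conditions below are assumptions for illustration, not necessarily what the PR or the gist below does.)

```odin
package example

// Purely illustrative: choose the chunk width from the target instead of
// always asking LLVM for 512-bit vectors and relying on legalization.
when ODIN_ARCH == .amd64 {
	CHUNK_BYTES :: 32 // assume AVX2-class hardware: 256-bit chunks
} else {
	CHUNK_BYTES :: 16 // 128-bit chunks for ARM/WASM/baseline targets
}

Chunk :: #simd[CHUNK_BYTES]u8
```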
Referring to the previous PR, and my comment about "writing code that is true to the underlying hardware", something like this is what I would have done: https://gist.github.com/Yawning/2f7826f10a5d7af813d9b9b7121eb117

In each of the SIMD while loops, I just do unaligned loads. On anything recent, intuitively, I doubt the difference is large enough to matter, and simplicity is nice, but this depends on "what is the expected typical input size". With a sufficiently non-suck

This does give up on AVX512, but support for that is spotty and it's a foot+gun on things that aren't Ice Lake/Rocket Lake/Zen 4.
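(A rough sketch of the "unaligned loads in the bulk loop, scalar tail at the end" shape described above, using 128-bit chunks. The proc name is made up, and it assumes `intrinsics.unaligned_load` plus `core:simd`'s lane-wise comparisons; it is an illustration, not the gist's or the PR's code.)

```odin
package example

import "base:intrinsics"
import "core:simd"

memory_equal_unaligned_sketch :: proc(x, y: rawptr, n: int) -> bool {
	a, b := ([^]byte)(x), ([^]byte)(y)
	i := 0

	// Bulk path: unaligned 16-byte loads, compared lane-wise.
	for ; i + 16 <= n; i += 16 {
		va := intrinsics.unaligned_load((^#simd[16]u8)(&a[i]))
		vb := intrinsics.unaligned_load((^#simd[16]u8)(&b[i]))
		// lanes_ne sets differing lanes to all ones; any nonzero lane means a mismatch.
		if simd.reduce_or(simd.lanes_ne(va, vb)) != 0 {
			return false
		}
	}

	// Scalar tail for the remaining 0..15 bytes.
	for ; i < n; i += 1 {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}
```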
Intel's optimization reference manual is good: https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1.html

ps: Hopefully this doesn't come off as too nitpicky.
The speedups here aren't as profound as with `index_byte`, but there are some noticeable improvements with larger data sizes.

I'm going to leave this as a draft to await feedback. I'm mostly interested to hear if this has any impact in real-world programs.
I want to note that the original `memory_compare*` procs had no notion of unaligned loads. Were we just getting by without checking for alignment somehow?

`odin test . -o:aggressive -disable-assert -no-bounds-check -define:ODIN_TEST_THREADS=1 -microarch:alderlake`
Plain results
SIMD results